CtxKG is our method for generating knowledge graphs directly from text. It processes multiple documents, creates a graph for each one, and then connects these graphs so that they function as a single graph.
It contains four stages—triple generation, graph generation, graph reduction and bridge building—and is available in both English and Portuguese.
It is based on AutoKG [1] and uses CoreNLP's OpenIE implementation [2], our own OpenIE implementation for Portuguese [3], BERT [4] and BERTimbau [5].
We recommend that you use Python 3.9 (though 3.10 and 3.11 should also work) and that you have Poetry installed. To install the dependencies, run:
poetry install
The server is the main way to run and manage the knowledge graph generation process.
Start the Flask server by running:
flask --app src.app run
For the debug mode, run:
flask --app src.app run --debug
The application contains four main sections:
- The home page, where you can choose between the English and the Portuguese version.
- The batch page, where you can see all batches that have been created for that language, as well as create a new batch.
- The graph list pages, where you can see the list of graphs (both base and reduced) that have been created for a batch.
- The graph display page, where graphs can be inspected visually, with nodes/entities being represented by circles and edges/relationships being represented by lines.
If you wish, you may run any of the four stages directly from the CLI. This is done by calling any of the four modules using python -m and passing the relevant CLI arguments.
To set up the documents for knowledge graph generation, add all TXT files to the documents directory, under the desired language, inside a folder named after your document group. For example, if you have a set of English documents about history, create a directory named History inside documents/en and copy your documents into it.
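The layout described above can also be prepared with a short script. This is a minimal sketch: the documents/en/History path comes from the example above, while the function name add_document_group and the source folder in the usage note are hypothetical and not part of CtxKG.

```python
import shutil
from pathlib import Path


def add_document_group(source_dir, language, group, root="documents"):
    """Copy every .txt file from source_dir into <root>/<language>/<group>.

    Hypothetical helper for illustration only; CtxKG does not ship it.
    """
    target = Path(root) / language / group
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    # Note: the "*.txt" pattern is case-sensitive on most Linux file systems.
    for txt_file in sorted(Path(source_dir).glob("*.txt")):
        destination = target / txt_file.name
        # copy2 leaves the original file in place and preserves its metadata.
        shutil.copy2(txt_file, destination)
        copied.append(destination)
    return copied
```

For instance, add_document_group("my_history_texts", "en", "History") would copy the TXT files from a folder named my_history_texts into documents/en/History, assuming the script is run from the repository root.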
Once the documents are in the correct directory, you may run python -m src.ctxkg.builders.build_triples to convert each document into a set of relationship triples. The triples are stored in the triples directory. You may include the following CLI arguments:
- -l/--language: the language of the documents you want to process. Can be either en or pt-BR, depending on where the documents are. Not passing this argument will cause all documents in both languages to be processed.
- -n/--name: name of the group (e.g. History in the last example). Must be within the selected language's directory. Not passing this argument will cause all documents within the selected language to be processed.
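If you prefer to launch this stage from another Python script, the command can be assembled programmatically. The sketch below uses only the module path and the two flags documented above; the helper names triples_command and run_triples are hypothetical, and it assumes each flag takes its value as the next argument.

```python
import subprocess
import sys


def triples_command(language=None, name=None):
    """Build the argument list for the triple generation stage."""
    command = [sys.executable, "-m", "src.ctxkg.builders.build_triples"]
    if language is not None:
        command += ["--language", language]  # "en" or "pt-BR"
    if name is not None:
        command += ["--name", name]  # group folder, e.g. "History"
    return command


def run_triples(language=None, name=None):
    # Must be called from the repository root so that `src` is importable.
    # check=True raises CalledProcessError if the stage exits with an error.
    return subprocess.run(triples_command(language, name), check=True)
```

Calling run_triples("en", "History") would then process only the History group, while run_triples() passes no arguments and so processes every document in both languages.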
With the generated triples, building the base knowledge graphs is done by running python -m src.ctxkg.builders.build_graphs. You may include the following CLI arguments:
- --small/--medium/--big: the size of the BERT encoder (English only). Defaults to small.
- -t/--threshold: the minimum cosine similarity for two entities to be considered synonyms. Defaults to 0.8.
- -r/--ratio: the ratio between the base entity encoding and the triple encoding for the generation of the final entity encoding. Defaults to 1.0 (i.e. only base entity encoding).
- -l/--language: the language of the documents you want to process. Can be either en or pt-BR, depending on where the documents are. If not set, a dialog will open and request that you select a group.
- -n/--name: name of the group you want to process. If not set, a dialog will open and request that you select one.
- -b/--batch: controls how many entity encodings are processed at a time by the GPU. Defaults to 300. This will probably not need to be changed.
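To make the threshold and ratio arguments more concrete, here is a minimal sketch in plain Python. It assumes the ratio acts as a linear interpolation weight between the two encodings, which is consistent with 1.0 meaning "only base entity encoding" but is not confirmed by this document. The function names are hypothetical, and the toy vectors stand in for real BERT encodings.

```python
import math


def blend(base, triple, ratio=1.0):
    """Combine base and triple encodings; ratio=1.0 keeps only the base.

    Assumption: linear interpolation. The actual CtxKG formula may differ.
    """
    return [ratio * b + (1.0 - ratio) * t for b, t in zip(base, triple)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def are_synonyms(u, v, threshold=0.8):
    """Two entities count as synonyms when their similarity reaches the threshold."""
    return cosine_similarity(u, v) >= threshold
```

Under this reading, lowering the ratio gives more weight to the context in which an entity appears, and raising the threshold makes the synonym test stricter.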
To run the graph reduction stage, execute python -m src.ctxkg.builders.clean_graphs. In this stage, synonyms are merged into a single entity: the synonym that occurs most often in the graph. You may include the following CLI arguments:
- -l/--language: the language of the graph group you want to reduce. Can be either en or pt-BR, depending on where the documents are. If not set, a dialog will open and request that you select the group directory.
- -n/--name: name of the group. If not set, a dialog will open and request that you select the group directory.
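The merging rule described for this stage can be illustrated with a small sketch. It is not the project's implementation: the function name reduce_graph, the triple representation as (subject, relation, object) tuples, and the alphabetical tie-breaking are all assumptions made for the example.

```python
from collections import Counter


def reduce_graph(triples, synonym_groups):
    """Replace each synonym with the most frequent entity of its group."""
    # Count how often each entity appears as a subject or an object.
    counts = Counter()
    for subject, _, obj in triples:
        counts[subject] += 1
        counts[obj] += 1

    representative = {}
    for group in synonym_groups:
        # Sorting first makes ties resolve alphabetically (an arbitrary choice).
        best = max(sorted(group), key=lambda entity: counts[entity])
        for entity in group:
            representative[entity] = best

    reduced = []
    for subject, relation, obj in triples:
        merged = (
            representative.get(subject, subject),
            relation,
            representative.get(obj, obj),
        )
        if merged not in reduced:  # drop triples that became duplicates
            reduced.append(merged)
    return reduced
```

With a synonym group such as {"USA", "United States", "the US"}, every triple is rewritten to use whichever of the three names appears most often, and triples that become identical after the rewrite are kept only once.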
To generate connections between different graphs, run python -m src.ctxkg.builders.build_bridges. This is the most time-consuming step, as all individual graphs are compared to each other. You may include the following CLI arguments:
- --small/--medium/--big: the size of the BERT encoder (English only). Defaults to small.
- -t/--threshold: the minimum cosine similarity between two entities for a bridge to be established. Defaults to 0.8.
- -r/--ratio: the ratio between the base entity encoding and the triple encoding for the generation of the final entity encoding. Defaults to 1.0 (i.e. only base entity encoding).
- -l/--language: the language of the graph group you want to process. Can be either en or pt-BR, depending on where the documents are. If not set, a dialog will open and request that you select the group directory.
- -n/--name: name of the group. If not set, a dialog will open and request that you select the group directory.
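The pairwise comparison is what makes this stage expensive: with n graphs there are n(n-1)/2 graph pairs, and every entity of one graph is compared with every entity of the other. The sketch below illustrates the idea under stated assumptions. The function name build_bridges, the dictionary layout, and the toy two-dimensional vectors are hypothetical; the real stage works on BERT or BERTimbau encodings.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_bridges(graphs, threshold=0.8):
    """Link entities from different graphs whose encodings are similar enough.

    `graphs` maps a graph name to a dict of {entity: encoding}.
    """
    bridges = []
    # combinations(..., 2) visits every unordered pair of graphs exactly once.
    for (name_a, entities_a), (name_b, entities_b) in combinations(graphs.items(), 2):
        for entity_a, vec_a in entities_a.items():
            for entity_b, vec_b in entities_b.items():
                if cosine_similarity(vec_a, vec_b) >= threshold:
                    bridges.append((name_a, entity_a, name_b, entity_b))
    return bridges
```

Each returned tuple names the two graphs and the two entities joined by a bridge, which is what lets the separate per-document graphs behave as one connected graph.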