KoRe: Compact Knowledge Representations for Large Language Models

Modern Large Language Models (LLMs) have shown impressive performance in user-facing tasks such as question answering, as well as consistent improvements in reasoning capabilities. Still, the way these models encode knowledge seems inherently flawed: by design, LLMs encode world knowledge within their parameters. This way of representing knowledge is opaque, difficult to debug and update, and prone to hallucinations. On the other hand, Knowledge Graphs (KGs) can provide human-readable and easily editable world knowledge representations, and their application in knowledge-intensive tasks has consistently proven beneficial to downstream performance. Nonetheless, current integration techniques require extensive retraining or finetuning. To overcome this issue, we introduce KoRe, a methodology to encode 1-hop sub-graphs into compact discrete knowledge tokens and inject them into an LLM backbone. We test the proposed approach on three established benchmarks, and report competitive performance coupled with a significant reduction (up to 10x) in token usage. Our results show that compact discrete KG representations can efficiently and effectively be used to ground modern LLMs.

Dataset

The model is primarily trained on the Tri-Rex dataset, which provides factual statements with associated knowledge graph contexts. The other datasets used are the test split of WebQSP mapped to Wikidata entities and the SimpleQuestions dataset. GrailQA was not used for the results in the paper because it takes a long time to preprocess, but it can also be used for training and evaluation.

Data preparation

To prepare the datasets, run the create_hf_datasets.py script with the appropriate configuration file. The script searches the base path for the following dataset files:

TriRex_v1.tar (+lite) webpage
TRExStar_v1.tar (+lite) webpage
TrexBite_v1.tar (+lite) webpage
grailqa_v1.0_train.json & grailqa_v1.0_dev.json (the test set lacks fields needed for our preprocessing) from this archive
webqsp.examples.test.wikidata.json (used only for evaluation): available in the input folder of the following zip (link from GitHub)
Train, validation, and test splits of the SimpleQuestions dataset mapped to Wikidata (WD)

These files need to be downloaded and placed in the base path provided in the configuration file.

NOTE: some of the steps require internet access. If your compute nodes do not have it, try running the script on the lite version of the data first; you should then be able to run the full version without internet access.
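
Before running create_hf_datasets.py, it can help to confirm that the downloaded files are where the script expects them. The snippet below is a minimal sketch, not part of the repository: the helper name missing_dataset_files is hypothetical, and the file list only covers the names spelled out above (the lite archives and the SimpleQuestions splits are left out because their exact file names are not given here).

```python
from pathlib import Path

# File names copied from the list above. The "lite" archives and the
# SimpleQuestions splits are omitted: their exact names are not stated.
EXPECTED_FILES = [
    "TriRex_v1.tar",
    "TRExStar_v1.tar",
    "TrexBite_v1.tar",
    "grailqa_v1.0_train.json",
    "grailqa_v1.0_dev.json",
    "webqsp.examples.test.wikidata.json",
]


def missing_dataset_files(base_path, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in base_path.

    Hypothetical helper for illustration; it only checks for existence
    and does not validate the contents of the files.
    """
    base = Path(base_path)
    return [name for name in expected if not (base / name).is_file()]
```

Calling missing_dataset_files with the base path from your configuration file returns an empty list when every listed file is in place, and otherwise the names that still need to be downloaded.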

Model

The model architecture can be divided into the following components:

Graph Encoder: Graph Neural Network to encode the 1-hop sub-graph into a fixed-size vector representation
Residual Vector Quantization: To compress the graph representation into a compact discrete set of tokens
Base Language Model: Qwen3-8B (configurable) as the backbone LLM to inject the knowledge tokens into
LoRA: LoRA fine-tuning with optional full model tuning
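
To make the Residual Vector Quantization component more concrete, here is a toy sketch of the general technique in plain Python. It is not the repository's implementation: the function names, the dimensions, and the randomly initialized codebooks are all assumptions made for illustration (in a trained model the codebooks would be learned, and the vector would come from the graph encoder). Each stage picks the codebook entry closest to what the previous stages left unexplained, so a fixed-size vector becomes a short list of discrete indices.

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec into one index per codebook.

    Every stage quantizes the residual left over by the previous stages.
    """
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Approximate the original vector by summing the selected entries."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy setup (all sizes are arbitrary assumptions, not the paper's values).
rng = random.Random(0)
DIM, STAGES, CODEBOOK_SIZE = 4, 3, 8
codebooks = [
    [[rng.gauss(0, 1.0 / (s + 1)) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for s in range(STAGES)
]
graph_vec = [rng.gauss(0, 1) for _ in range(DIM)]  # stand-in for a graph encoder output

codes = rvq_encode(graph_vec, codebooks)   # one discrete index per stage
approx = rvq_decode(codes, codebooks)      # lossy reconstruction of graph_vec
```

In this sketch the list codes plays the role of the compact discrete knowledge tokens: its length equals the number of stages, regardless of how large the encoded sub-graph was.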

Model Configurations

The model hyperparameters are controlled through YAML configuration files. The ones used for training and evaluation are available in the configs/ directory and can be modified as needed.

Training

To train the model, run the train.py script with the appropriate configuration file. To run the training with Accelerate, see launchers/train.sh for an example of how to launch it on multiple GPUs. The training script handles data loading, model initialization, and optimization according to the specified configuration. Example Accelerate and DeepSpeed configurations can also be found in the configs/ directory and can be modified to fit your hardware setup and requirements.

Evaluation

To evaluate the model, run the evaluate.py script with the appropriate configuration file. The evaluation script will compute the relevant metrics for each dataset and output the results. Similarly to training, you can use accelerate to run the evaluation on multiple GPUs, and the configuration examples can be found in the configs/ directory.

