This repository is the official implementation of LGM: Enhancing Large Language Models with Conceptual Meta-Relations and Iterative Retrieval.
Large language models (LLMs) exhibit strong semantic understanding, yet struggle when user instructions involve ambiguous or conceptually misaligned terms. We propose the Language Graph Model (LGM) to enhance conceptual clarity by extracting meta-relations—inheritance, alias, and composition—from natural language. The model further employs a reflection mechanism to validate these meta-relations. Leveraging a Concept Iterative Retrieval Algorithm, these relations and related descriptions are dynamically supplied to the LLM, improving its ability to interpret concepts and generate accurate responses. Unlike conventional Retrieval-Augmented Generation (RAG) approaches that rely on extended context windows, our method enables large language models to process texts of any length without the need for truncation. Experiments on standard benchmarks demonstrate that the LGM consistently outperforms existing RAG baselines.
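As a rough illustration of the idea (not the repository's implementation), concept iterative retrieval can be sketched as a breadth-first expansion over meta-relation edges, collecting concept descriptions until a budget is reached. All graph contents and names below are hypothetical:

```python
from collections import deque

# Hypothetical concept graph: concept -> list of (meta-relation, neighbor).
# Relations mirror the paper's meta-relations: inheritance, alias, composition.
GRAPH = {
    "LLM": [("inheritance", "language model"), ("alias", "large language model")],
    "language model": [("composition", "tokenizer")],
    "large language model": [],
    "tokenizer": [],
}

DESCRIPTIONS = {
    "LLM": "A neural model trained on large text corpora.",
    "language model": "A model assigning probabilities to token sequences.",
    "large language model": "Alias of LLM.",
    "tokenizer": "Splits text into tokens.",
}

def iterative_retrieve(seed, max_concepts=3):
    """Breadth-first expansion over meta-relations, collecting descriptions.

    Returns a list of (concept, description) pairs to supply to the LLM.
    """
    seen, queue, context = set(), deque([seed]), []
    while queue and len(context) < max_concepts:
        concept = queue.popleft()
        if concept in seen:
            continue
        seen.add(concept)
        context.append((concept, DESCRIPTIONS.get(concept, "")))
        for _relation, neighbor in GRAPH.get(concept, []):
            queue.append(neighbor)
    return context
```

Because the expansion is bounded by `max_concepts` rather than by a context window, the same loop works over a graph built from arbitrarily long source texts.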
- We use neo4j-community-3.5.13 as the database for graph storage. Download the Windows or macOS/Linux version and follow the official manual for installation.
- Then configure the Neo4j URI, username, and password in sources\config.ini.
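For reference, these settings can be read with Python's standard `configparser`. The section and key names below are assumptions; match them to the actual layout of sources\config.ini:

```python
import configparser

def load_neo4j_settings(path="sources/config.ini"):
    # Section/key names here are illustrative; check the shipped config.ini.
    cfg = configparser.ConfigParser()
    cfg.read(path)
    neo4j = cfg["neo4j"]
    return neo4j["uri"], neo4j["user"], neo4j["password"]
```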
- To query the Neo4j database more efficiently, you can create a full-text index with the following statement:
CALL db.index.fulltext.createNodeIndex("root_sentence_lemma_index", ["_ROOT_"], ["sentenceLemma"]);

The Python version is 3.10.x. To install the requirements:
pip install -r requirements.txt
- The HotpotQA dataset can be downloaded from http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_dev_distractor_v1.json
- The Musique dataset can be downloaded from https://huggingface.co/datasets/bdsaglam/musique/blob/main/musique_ans_v1.0_dev.jsonl
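Both files are plain JSON / JSON Lines, so they can be inspected with a short loader. This sketch assumes the public dataset formats (HotpotQA: one JSON array; Musique: one record per line), not any repository-specific wrapper:

```python
import json

def load_hotpot(path):
    # HotpotQA dev set: a single JSON array of question records.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def load_musique(path):
    # Musique dev set: JSON Lines, one record per line.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```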
To test an online model, prepare the corresponding API_KEY and URL, enter them in sources\config.ini, and configure the relevant parameters. For local models, only the corresponding URL is required. The Deepseek model must be configured as the answer-matching model. To test the LLama3 model, you also need to configure it.
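The exact keys depend on the shipped sources\config.ini template; an illustrative fragment (all section and key names hypothetical, values are placeholders) might look like:

```ini
[neo4j]
uri = bolt://localhost:7687
user = neo4j
password = your_password

[deepseek]
api_key = YOUR_API_KEY
url = https://api.deepseek.com

[llama]
url = http://localhost:8000/v1
```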
Please run the learning stage first. Then run the answering stage.
To evaluate HotpotQA, run:
python evaluate/eval.py --dataset hotpot --model deepseek --stage learn --path path/to/hotpot_dev_distractor_v1.json
python evaluate/eval.py --dataset hotpot --model deepseek --stage answer --path path/to/hotpot_dev_distractor_v1.json
python evaluate/eval.py --dataset hotpot --model llama --stage learn --path path/to/hotpot_dev_distractor_v1.json
python evaluate/eval.py --dataset hotpot --model llama --stage answer --path path/to/hotpot_dev_distractor_v1.json

To evaluate Musique, run:
python evaluate/eval.py --dataset musique --model deepseek --stage learn --path path/to/musique_data_v1.0/musique_ans_v1.0_dev.jsonl
python evaluate/eval.py --dataset musique --model deepseek --stage answer --path path/to/musique_data_v1.0/musique_ans_v1.0_dev.jsonl
python evaluate/eval.py --dataset musique --model llama --stage learn --path path/to/musique_data_v1.0/musique_ans_v1.0_dev.jsonl
python evaluate/eval.py --dataset musique --model llama --stage answer --path path/to/musique_data_v1.0/musique_ans_v1.0_dev.jsonl

To evaluate Reflection, run:
python tests/test_reflection.py

Our model achieves the following performance on HotpotQA and Musique:
| Model | HotpotQA (Deepseek v3-0324) | HotpotQA (Llama-3.3-70B-Instruct-AWQ) | HotpotQA (AVG) | Musique (Deepseek v3-0324) | Musique (Llama-3.3-70B-Instruct-AWQ) | Musique (AVG) |
|---|---|---|---|---|---|---|
| Language Graph Model | 89.46% | 87.06% | 88.26% | 68.13% | 63.07% | 65.60% |
| GraphRAG 1 | 88.55% | 82.59% | 85.57% | 64.98% | 63.16% | 64.07% |
| GraphRAG 2 | 86.90% | 69.21% | 78.06% | 48.98% | 48.61% | 48.79% |
| LightRAG 2 | 87.94% | 76.34% | 82.14% | 65.36% | 50.33% | 57.84% |
| FastRAG 3 | 72.66% | 72.26% | 72.46% | 39.91% | 36.51% | 38.21% |
| Dify | 68.53% | 43.64% | 56.09% | 52.32% | 18.27% | 35.29% |
We analyze the contribution of each component via an ablation on HotpotQA (DeepSeek v3-0324), varying the maximum input size from 120,000 down to 30,000 characters. The figure shows that F1 varies only mildly (std 0.009) and Recall remains stable (std 0.0038). The best F1 (89.46%) occurs at 60,000 characters with a Recall of 99.09%, indicating robustness to the context budget.
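F1 and Recall here refer to the standard token-overlap metrics used in HotpotQA-style evaluation. A minimal sketch of those definitions (not necessarily the repository's exact implementation):

```python
import re
import string
from collections import Counter

def normalize(text):
    # Standard answer normalization: lowercase, drop punctuation and
    # articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_recall(prediction, gold):
    """Token-level F1 and recall between a predicted and a gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, recall
```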
The Language Graph Model source code is released under the MIT License.

