
Direct Evaluation of CoT in Multi-hop Reasoning with Knowledge Graphs

Official Implementation of "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs".

Accepted at ACL 2024 Findings.

Aiming to evaluate not only the final answers but also the intermediate steps of LLMs' CoT reasoning in multi-hop question answering, the paper proposes two evaluation modules:

  1. Discriminative: assesses LLMs' knowledge of reasoning.
  2. Generative: assesses the accuracy of the generated CoT by utilizing knowledge graphs (KGs).

In addition, we conduct ablation studies on fine-grained CoT generation, measuring edit-distance and reasoning errors.

Requirements

conda create --name llm-reasoning-cert python=3.8
conda activate llm-reasoning-cert
pip install -r requirements.txt

Datasets

The paper uses two datasets, CWQ and GrailQA, as the initial datasets for the experiments.

Then, subgraphs and ground-truth reasoning paths are extracted via SPARQL.

The final datasets used in the paper are uploaded to Hugging Face: (Note: update later)

  1. CWQ-Subgraph-Eval
  2. GrailQA-Subgraph-Eval

Preprocess for each dataset:

Aim: create subgraphs for querying ground-truth reasoning paths & creating the VectorDB.

Create subgraphs

Code at ./preprocess_data

  1. Create subgraphs from the raw subgraph, following the detailed implementation in the preprocess readme.
  2. Get the ground-truth reasoning paths from the subgraph, the answer entities, and the topic entities (a sketch of this step follows below):
python ./preprocess_data/ground_truth_paths.py
  3. Rearrange questions according to the number of edges in the ground-truth reasoning path:
python ./preprocess_data/splitted_ground_truth_paths.py

We only use questions whose ground-truth reasoning paths have at least 2 hops.
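
A minimal sketch of the path-extraction and hop-filtering step, assuming the subgraph is a list of (head, relation, tail) triplets; the real logic lives in ./preprocess_data/ground_truth_paths.py, and all entity names below are purely illustrative.

from collections import defaultdict, deque

def ground_truth_path(triplets, topic_entity, answer_entity):
    """BFS from the topic entity to the answer entity; returns the
    shortest chain of triplets, i.e. the reasoning path."""
    adj = defaultdict(list)
    for h, r, t in triplets:
        adj[h].append((r, t))
    queue, seen = deque([(topic_entity, [])]), {topic_entity}
    while queue:
        node, path = queue.popleft()
        if node == answer_entity:
            return path  # list of (head, relation, tail) hops
        for r, t in adj[node]:
            if t not in seen:
                seen.add(t)
                queue.append((t, path + [(node, r, t)]))
    return None

subgraph = [("Obama", "born_in", "Honolulu"), ("Honolulu", "located_in", "Hawaii")]
path = ground_truth_path(subgraph, "Obama", "Hawaii")
keep = path is not None and len(path) >= 2  # drop 1-hop questions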

Create VectorDB

FAISS & sentence-transformers/all-mpnet-base-v2 are used to build the VectorDB before retrieval:

DATASET='cwq' # or 'grail_qa'
sbatch scripts/gen-cert/extract_triplet.sh $DATASET

You can set up additional arguments:

  • embed_model_name. Default is sentence-transformers/all-mpnet-base-v2
  • top_k. Default is 10
  • device. Default is cpu

Note: remember to set these arguments again in ./generative-cert.py#L228.
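
A minimal sketch of what the VectorDB construction and retrieval amount to, using the default arguments above; the actual pipeline is driven by scripts/gen-cert/extract_triplet.sh, and the triplet texts here are illustrative.

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device="cpu")
triplet_texts = ["Obama born_in Honolulu", "Honolulu located_in Hawaii"]

# Embed, normalize, and index with inner product (= cosine similarity).
emb = model.encode(triplet_texts, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

# Retrieve the top-k stored triplets closest to a generated reasoning step.
query = model.encode(["Obama was born in Honolulu"], normalize_embeddings=True)
scores, ids = index.search(query, min(10, len(triplet_texts)))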

Framework

Set your OpenAI API key & Hugging Face key (if needed) in .env (see .env.example for the format).
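
For illustration only, .env would contain entries along these lines; the exact variable names are defined by .env.example, so the names below are assumptions:

OPENAI_API_KEY=sk-...   # assumed name; check .env.example
HF_TOKEN=hf_...         # assumed name; only needed for HF models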

Discriminative Mode

    sh scripts/submit_discriminative_cert.sh
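
A hedged sketch of what a single discriminative check looks like conceptually: the LLM is shown a reasoning path and asked to judge its validity. The prompt wording and entities here are illustrative, not the repo's actual template.

question = "Where is Obama's birthplace located?"
reasoning_path = "(Obama, born_in, Honolulu) -> (Honolulu, located_in, Hawaii)"
prompt = (
    f"Question: {question}\n"
    f"Reasoning path: {reasoning_path}\n"
    "Is this reasoning path a valid chain of facts that answers the question? "
    "Answer True or False."
)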

Generative Mode

Stage1: LLM prompting for structured answer

  1. ChatGPT (see the sketch after this list):
sh scripts/gen-cert/llm_prompting.sh
  2. HF models (Llama2 7B/13B/70B chat-hf, Mistral-7B-Instruct-v0.1, Qwen-14B-Chat, Vicuna-33b-v1.3):
sh generative_cert/scripts/fitcluster/script.sh
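
A minimal sketch of Stage 1 prompting with ChatGPT, assuming the openai>=1.0 client; the repo's actual prompts and output parsing live in the scripts above, and the prompt text here is illustrative.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment / .env
question = "Where is Obama's birthplace located?"
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Answer the question step by step, writing each step as a "
                   "(head, relation, tail) triplet.\nQuestion: " + question,
    }],
    temperature=0,
)
structured_answer = resp.choices[0].message.content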

Stage 2 & 3: Retrieval & Evaluation

  1. Main result:
sh scripts/gen-cert/job_eval_llm.sh
  2. Fine-grained generative evaluation with the edit-distance score (a sketch of the metric follows below):
sh scripts/gen-cert/job_eval_llm_finegrained.sh
python finegrained_analysis.py
  3. Analysis of reasoning errors:
python finegrained_analysis.py
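
A minimal sketch of the edit-distance metric, treating each reasoning path as a sequence of triplets; the actual scoring is implemented in finegrained_analysis.py, and the example paths are illustrative.

def edit_distance(pred, gold):
    """Levenshtein distance between two triplet sequences."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gold[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete a step
                           dp[i][j - 1] + 1,          # insert a step
                           dp[i - 1][j - 1] + cost)   # substitute a step
    return dp[m][n]

gold = [("Obama", "born_in", "Honolulu"), ("Honolulu", "located_in", "Hawaii")]
pred = [("Obama", "born_in", "Hawaii")]
print(edit_distance(pred, gold))  # 2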

Results


Citation

If you find this paper or the repo useful for your work, please consider citing the paper:

@misc{nguyen2024direct,
    title={Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs},
    author={Minh-Vuong Nguyen and Linhao Luo and Fatemeh Shiri and Dinh Phung and Yuan-Fang Li and Thuy-Trang Vu and Gholamreza Haffari},
    year={2024},
    eprint={2402.11199},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
