CONCRETE

Source code for the COLING'22 paper: CONCRETE: Improving Cross-lingual Fact Checking with Cross-lingual Retrieval.

We propose a fact checking framework augmented with cross-lingual retrieval for zero-shot cross-lingual fact checking.

Dependencies

All required packages are listed in requirements.txt.

conda create -n concrete python=3.7
conda activate concrete
pip install -r requirements.txt

Data

Please create a data folder under the root directory of this project and download the X-Fact data from their repo into this data directory (an example path of a tsv file is data/x-fact/train.all.tsv). Please also create a directory CORA/mDPR/retrieved_docs and download the retrieved multilingual passages from here into that folder (an example path of a json file is CORA/mDPR/retrieved_docs/zeroshot.xict.json). These passages were retrieved from a multilingual passage collection using the proposed cross-lingual retriever.
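
For reference, the expected layout after both downloads looks like this (the two example files above are the only guaranteed names; sibling files are omitted):

CONCRETE/
├── data/
│   └── x-fact/
│       └── train.all.tsv
└── CORA/
    └── mDPR/
        └── retrieved_docs/
            └── zeroshot.xict.json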

Cross-lingual Retrieval

We develop our cross-lingual retriever based on mDPR. Our cross-lingual retrieval scripts can be found under CORA/mDPR. The BBC passages in the passage collection can be downloaded from here. This folder contains 7 sub-folders, each containing passages in a different language. The articles were downloaded using news-please. If you want to crawl news articles over a custom range, please refer to this example script provided by news-please; our own crawling script, commoncrawl.py, is provided in the root directory of this repo. Below, we illustrate our scripts for running X-ICT and how to perform inference. (This step is not needed if you only want to experiment with our retrieved passages, as described in the previous section.)
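
For one-off downloads rather than a bulk crawl, news-please also exposes a simple Python API. A minimal sketch (the URL is illustrative, and this is separate from our commoncrawl.py bulk-crawling script):

from newsplease import NewsPlease

# Fetch and parse a single article with news-please
# (illustrative URL; the bulk collection was crawled via commoncrawl.py).
article = NewsPlease.from_url("https://www.bbc.com/news/world-60525350")
passage = {
    "title": article.title,        # extracted headline
    "text": article.maintext,      # extracted body text
    "language": article.language,  # language code detected by news-please
    "url": article.url,
}
print(passage["title"], passage["language"])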

Create X-ICT samples

Since machine translation is computationally expensive when performed on the fly, we pre-compute all translations and create X-ICT samples before running X-ICT. To do so, head over to CORA/mDPR and run:

for idx in 0 1 2; do
    python create_ict_samples.py --passage_dir ../../data/bbc_passages/ --out_file ../../data/bbc_passages/all_ict_samples.jsonl --num_shards 4 --shard_id $idx
done
python create_ict_samples.py --passage_dir ../../data/bbc_passages/ --out_file ../../data/bbc_passages/all_ict_samples-trans100.jsonl --num_shards 4 --shard_id 3 --is_eval

This will create X-ICT samples split into 4 shards, stored under bbc_passages/. Note that the last shard is reserved for evaluation, hence the --is_eval flag when computing it.
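
To sanity-check a generated shard, you can print the keys of its first few samples. This sketch assumes only the standard JSON-Lines convention of one JSON object per line; the field names are whatever create_ict_samples.py emits:

import json

# Peek at the first three X-ICT samples in the evaluation shard.
with open("../../data/bbc_passages/all_ict_samples-trans100.jsonl") as f:
    for i, line in enumerate(f):
        if i == 3:
            break
        print(sorted(json.loads(line).keys()))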

Learning X-ICT

To learn X-ICT, simply run the following script:

bash run_xict.sh

This will store the checkpoint files under CORA/mDPR/xict_outputs. Optionally, we also provide trained X-ICT checkpoints.

Inference

To perform retrieval, we first compute the representations for all passages:

bash generate_multilingual_embeddings.sh

This will generate embeddings in .pkl files and store them under CORA/mDPR/embeddings_multilingual.
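
To verify an embedding shard, you can load one of the .pkl files. This sketch assumes the DPR-style format of a pickled list of (passage_id, vector) pairs; adjust if the layout differs:

import glob
import pickle

# Load one embedding shard and report its size
# (assumed format: a pickled list of (passage_id, vector) pairs, as in DPR).
path = sorted(glob.glob("embeddings_multilingual/*.pkl"))[0]
with open(path, "rb") as f:
    shard = pickle.load(f)
passage_id, vector = shard[0]
print(len(shard), "passages; first id:", passage_id, "dim:", len(vector))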

Then we perform retrieval by querying CONCRETE with each claim in X-Fact:

bash dense_retriever.sh
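
The retrieval output can then be inspected directly. This sketch assumes the DPR-style result format, a JSON list with one entry per query holding a ranked "ctxs" list of scored passages:

import json

# Show the top-ranked passage retrieved for the first X-Fact claim
# (assumed DPR-style output: per-query dicts with a ranked "ctxs" list).
with open("retrieved_docs/zeroshot.xict.json") as f:
    results = json.load(f)
top = results[0]["ctxs"][0]
print(results[0]["question"])
print(top["score"], top["text"][:200])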

Retrieval-augmented Fact Checking

Training

To train our cross-lingual fact checker, go to the src directory via cd src. Then, run the following command:

python train.py --output_dir outputs --batch_size 2 --eval_batch_size 2 --max_epoch 12 --accumulate_step 32 --model_name bert-base-multilingual-cased

This will train our fact checking framework augmented with passages retrieved by our proposed retriever, CONCRETE. With the settings above, the effective batch size is 2 × 32 = 64 through gradient accumulation.

We provide the checkpoint of the trained model for reproduction purposes. The weights can be downloaded here.

Evaluation

Following the X-Fact paper, we use Macro F1 as the evaluation metric. To run evaluation on trained models, execute the test.py script as follows:

python test.py --checkpoint_path PATH_TO_CHECKPOINT --test_path ../data/x-fact/zeroshot.tsv --model_name bert-base-multilingual-cased

Here, PATH_TO_CHECKPOINT is the path to the checkpoint you have just trained, or you can use the weights we provided above.
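
For reference, Macro F1 is the unweighted mean of the per-class F1 scores, so infrequent veracity labels count as much as frequent ones. A minimal example with scikit-learn (the labels below are illustrative):

from sklearn.metrics import f1_score

# Macro F1: average the per-class F1 scores with equal weight.
y_true = ["true", "false", "partly true", "false"]  # illustrative gold labels
y_pred = ["true", "false", "false", "false"]        # illustrative predictions
print(f1_score(y_true, y_pred, average="macro"))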

Citation

@inproceedings{huang-etal-2022-concrete,
    title = "CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval",
    author = "Huang, Kung-Hsiang  and
      Zhai, ChengXiang  and
      Ji, Heng",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.86",
    pages = "1024--1035",
}
