MultiDoc-MultiLingual

Setup

Install Rouge Package

Follow the README.md under the multilingual_rouge_scoring directory.

Dataset Augmentation

We use the Google Custom Search Engine to find additional input articles for each event from WCEP, with the following steps:

Extract Keywords

We use KeyBERT to extract keywords by running:

$ python dataset_collection/keywords_extraction_keyBERT.py \
    --file_name "cantonese_crawl.jsonl" \
    --data_dir "./Multi-Doc-Sum/Mtl_data" \
    --output_dir "./Multi-Doc-Sum/keywords_extraction_keyBERT"

The arguments are as follows:

  • data_dir dataset directory of the original crawled data from WCEP.
  • file_name a specific file of a certain language under the dataset directory.
  • output_dir output directory of the extracted keywords.
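KeyBERT ranks candidate terms by the cosine similarity of their embeddings to the document embedding. As a rough, stdlib-only illustration of the extract-and-rank idea (not the repository's actual code: `extract_keywords` and the stopword list here are ours, and raw frequency stands in for embedding similarity):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "on"}

def extract_keywords(text, top_n=5):
    """Rank candidate unigrams by frequency -- a crude stand-in for
    KeyBERT's embedding-similarity ranking."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

article = ("The typhoon made landfall in Hong Kong on Monday. "
           "The typhoon forced Hong Kong schools to close.")
print(extract_keywords(article, top_n=3))  # ['typhoon', 'hong', 'kong']
```

The extracted keywords for each event are then fed to the search step below as queries.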

Google Search

We run the following command to search Google:

$ python dataset_collection/google_search.py \
    --my_api_key $GOOGLE_SEARCH_API_KEY \
    --my_cse_id $CUSTOM_SEARCH_ENGINE_ID \
    --file_name "cantonese_crawl.jsonl" \
    --data_dir "./Multi-Doc-Sum/Mtl_data" \
    --keywords_dir "./Multi-Doc-Sum/keywords_extraction_keyBERT" \
    --data_aug_dir "./Multi-Doc-Sum/Mtl_data_aug" 

The arguments are as follows:

  • my_api_key your Google Search API key.
  • my_cse_id your Custom Search Engine ID.
  • data_dir dataset directory of the original crawled data from WCEP.
  • file_name a specific file of a certain language under the dataset directory.
  • keywords_dir directory containing the keywords extracted in the previous step.
  • data_aug_dir output directory for the augmented data.
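Under the hood, a Custom Search query is an HTTPS GET against the Custom Search JSON API endpoint, with the API key, engine ID, and query passed as URL parameters. A minimal sketch of how such a request URL is assembled (the helper name is ours; no request is sent here):

```python
from urllib.parse import urlencode

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_search_url(api_key, cse_id, query, num=10):
    """Compose a Custom Search JSON API request URL from the key,
    engine ID, query string, and number of results."""
    params = {"key": api_key, "cx": cse_id, "q": query, "num": num}
    return CSE_ENDPOINT + "?" + urlencode(params)

url = build_search_url("API_KEY", "CSE_ID", "typhoon hong kong", num=5)
print(url)
```

The JSON response contains an `items` list whose entries carry the result URLs, which the augmentation step can then crawl.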

Dataset Cleaning

A clean dataset is generated by filtering the source documents using the ORACLE method. The first step is to calculate the ORACLE score of each source document against the summary:

$ python dataset_collection/filter_source_documents/filter_oracle_get_score.py \
    --data-dir "./Multi-Doc-Sum/Mtl_data/doc_extraction" \
    --output-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/scored" \
    --input-file-name "cantonese_extracted.jsonl" \
    --output-file-name "cantonese_scored.jsonl"

The second step is to filter out the source documents with an ORACLE score below a threshold:

$ python dataset_collection/filter_source_documents/filter_oracle.py \
    --threshold 7 \
    --data-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/scored" \
    --output-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/filtered" \
    --input-file-name "cantonese_scored.jsonl" \
    --output-file-name "cantonese_filtered.jsonl"
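The two steps above amount to: score each source document by its lexical overlap with the reference summary, then drop documents under the threshold. A simplified sketch of that idea — the unigram-recall score and function names are our simplification, not the repository's actual ORACLE metric:

```python
def overlap_score(document, summary):
    """Fraction of summary tokens also present in the document
    (a crude ROUGE-1-recall stand-in for the ORACLE score)."""
    doc_tokens = set(document.lower().split())
    sum_tokens = summary.lower().split()
    if not sum_tokens:
        return 0.0
    return sum(t in doc_tokens for t in sum_tokens) / len(sum_tokens)

def filter_documents(documents, summary, threshold=0.5):
    """Keep only documents whose score meets the threshold."""
    return [d for d in documents if overlap_score(d, summary) >= threshold]

summary = "storm floods city"
docs = ["the storm floods the city center",
        "stock markets rally on strong earnings"]
print(filter_documents(docs, summary))  # keeps only the storm article
```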

Split Dataset

Both the noisy and the clean datasets are randomly split into 80%, 10%, and 10% training, validation, and test sets, respectively:

$ python dataset_collection/split.py \
    --lang cantonese \
    --data-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/orig" \
    --output-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/split/orig" \
    --input-file-name "cantonese_scored.jsonl" 

$ python dataset_collection/split.py \
    --lang cantonese \
    --data-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/filtered" \
    --output-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/split/filtered" \
    --input-file-name "cantonese_filtered.jsonl" 
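The split performed by split.py can be sketched as follows (a minimal stand-in; the seed and helper name are illustrative, not the repository's code):

```python
import random

def split_dataset(records, seed=42):
    """Shuffle and split into 80% train, 10% validation, 10% test."""
    rng = random.Random(seed)
    records = list(records)
    rng.shuffle(records)
    n = len(records)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Fixing the seed keeps the split reproducible across the noisy and clean variants of the dataset.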

Baselines

Heuristic Baseline

Heuristic baselines are calculated by running:

$ python baselines/heuristic/get_heuristic_baseline_result.py \
    --input-file-name "cantonese_test.jsonl" \
    --data-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/split/filtered/" \
    --lang cantonese \
    --output-dir "./baseline_results/clean_dataset"
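The README does not say which heuristic the script implements. A common heuristic baseline in multi-document summarization is the lead baseline: take the first sentences of the first document as the summary. A sketch under that assumption (`lead_baseline` is hypothetical, not the repository's function):

```python
def lead_baseline(documents, num_sentences=2):
    """Lead baseline: use the first sentences of the first document
    as the summary (assumption: this approximates the repo's heuristic)."""
    if not documents:
        return ""
    sentences = documents[0].split(". ")
    return ". ".join(sentences[:num_sentences])

docs = ["First fact. Second fact. Third fact.", "Another article."]
print(lead_baseline(docs, num_sentences=2))  # First fact. Second fact
```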

TextRank Baseline

TextRank baselines are calculated analogously.

mT5 Baseline

  • Prepare the dataset for training mT5 models:
$ python baselines/mt5/prepare_dataset.py \
    --input-dir "./Multi-Doc-Sum/Mtl_data_aug_filtered/split/filtered" \
    --output-dir "data/first_sentences" \
    --language "cantonese"
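prepare_dataset.py flattens each split into line-aligned source/target files (the Docker inference scripts below expect test.source and test.target). A sketch of the conversion — the jsonl field names ("sources", "summary") and the document separator are assumptions about the data format, not verified against the script:

```python
import json

def to_source_target(jsonl_lines, sep=" ||| "):
    """Turn jsonl records into parallel (source, target) line pairs.
    Assumed fields: "sources" (list of documents) and "summary"."""
    pairs = []
    for line in jsonl_lines:
        record = json.loads(line)
        source = sep.join(doc.replace("\n", " ") for doc in record["sources"])
        target = record["summary"].replace("\n", " ")
        pairs.append((source, target))
    return pairs

example = json.dumps({"sources": ["doc one", "doc two"],
                      "summary": "short summary"})
print(to_source_target([example]))
```

Each pair is then written as one line of the .source file and the matching line of the .target file.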

Evaluation

Here is an example evaluation.py showing how to use the evaluation metrics BERTScore and T5Score. To run T5Score, download a T5Score model from T5Score-summ to the directory ./model/T5Score/.

Docker image

  • We also provide a Docker image with all dependencies pre-installed, to make the scripts above easier to run. To run the training pipeline inside a Docker container:
docker pull zs12/multidoc_multilingual:v0.3.1

# train single language mt5 model
./dockerfiles/docker_train_mt5.sh prepared_dataset/individual/EN/ output/

# train multilingual mt5 model
./dockerfiles/docker_train_mt5.sh prepared_dataset/multilingual/ output/ multi

# Run inference/prediction using a trained multilingual mt5 model
# data_dir/ should contain files named test.source and test.target
./dockerfiles/docker_predict_with_generate.sh model_dir/ data_dir/ output_dir/

# Run a prediction server that accepts requests from localhost:4123
./dockerfiles/docker_prediction_server.sh model_dir/
# upload a file to summarize (one doc/passage per line)
curl --data-binary @test.source localhost:4123

  • To run the Docker container via ClearML:
# train single language mt5 model
./clearml_scripts/clearml_train_mt5.sh prepared_dataset/individual/EN/ output/

# train multilingual mt5 model
./clearml_scripts/clearml_train_mt5.sh prepared_dataset/multilingual/ output/ multi
