## Prerequisites

Set the Elasticsearch-related environment variables. `ES_PASSWORD` is left blank here; supply your own value.

In [1]:
%env ES_HOSTS=https://localhost:9200
%env ES_CA_CERTS=/etc/elasticsearch/certs/http_ca.crt
%env ES_USERNAME=elastic
%env ES_PASSWORD=

env: ES_HOSTS=https://localhost:9200
env: ES_CA_CERTS=/etc/elasticsearch/certs/http_ca.crt
env: ES_USERNAME=elastic
env: ES_PASSWORD=
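
The notebook does not show how the scripts consume these variables. As a minimal sketch, assuming they are read from the process environment, the hypothetical helper `es_settings_from_env` below (not part of the repository) collects them into a dict and reports which ones are still empty. Treating `ES_HOSTS` as a comma-separated list is also an assumption.

```python
import os


def es_settings_from_env(env=None):
    """Collect the ES_* variables into a dict and list the empty ones.

    Hypothetical helper for illustration; the repository's own scripts
    may read these variables differently.
    """
    env = os.environ if env is None else env
    # Assumption: ES_HOSTS may hold several comma-separated URLs.
    hosts = [h.strip() for h in env.get("ES_HOSTS", "").split(",") if h.strip()]
    settings = {
        "hosts": hosts,
        "ca_certs": env.get("ES_CA_CERTS") or None,
        "username": env.get("ES_USERNAME") or None,
        "password": env.get("ES_PASSWORD") or None,
    }
    # Anything empty or unset is reported so it can be fixed before running.
    missing = [key for key, value in settings.items() if not value]
    return settings, missing


# Same values as the cell above, with the password still blank.
settings, missing = es_settings_from_env({
    "ES_HOSTS": "https://localhost:9200",
    "ES_CA_CERTS": "/etc/elasticsearch/certs/http_ca.crt",
    "ES_USERNAME": "elastic",
    "ES_PASSWORD": "",
})
print(missing)  # ['password']
```

Calling `es_settings_from_env()` with no argument reads the live environment, which makes it a quick check to run right after the `%env` cell.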


Download the dataset and model weights.

In [5]:
from huggingface_hub import snapshot_download

snapshot_download('ShinoharaHare/AICUP-2023-Spring-NLP', local_dir='.')
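
Later cells refer to several paths under `data/raw`. Whether the snapshot actually contains all of them is an assumption, so a quick existence check after the download can save a failed run. The helper `missing_paths` is hypothetical; the paths themselves are copied from the commands in this notebook.

```python
from pathlib import Path

# Input paths referenced by later cells in this notebook.
EXPECTED = [
    "data/raw/public_train_0316.jsonl",
    "data/raw/public_train_0522.jsonl",
    "data/raw/wiki-pages",
    "data/raw/public_private_test_data.jsonl",
]


def missing_paths(root=".", expected=EXPECTED):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


# An empty list means every expected input is in place.
print(missing_paths("."))
```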

Preprocess the training dataset and the Wiki dataset.

In [None]:
!python scripts/utils/create_claim_dataset.py \
    --data_files="['data/raw/public_train_0316.jsonl','data/raw/public_train_0522.jsonl']" \
    --output_dir="data/claim_dataset"

!python scripts/utils/create_wiki_dataset.py \
    --data_dir="data/raw/wiki-pages" \
    --output_dir="data/wiki_dataset"
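
The `--data_files` value above is written as a Python list literal inside a quoted string. How `create_claim_dataset.py` parses it is not shown in this notebook; the sketch below only illustrates one safe way to turn such a string into a list, using `ast.literal_eval` from the standard library. The function name `parse_data_files` is hypothetical.

```python
import ast


def parse_data_files(arg):
    """Parse a string such as "['a.jsonl','b.jsonl']" into a list of paths.

    Illustration only; the actual script may rely on a CLI library instead.
    """
    value = ast.literal_eval(arg)  # evaluates literals only, never arbitrary code
    if isinstance(value, str):
        value = [value]  # a single quoted path is accepted as a one-item list
    if not isinstance(value, (list, tuple)) or not all(isinstance(p, str) for p in value):
        raise ValueError(f"expected a list of path strings, got {value!r}")
    return list(value)


files = parse_data_files(
    "['data/raw/public_train_0316.jsonl','data/raw/public_train_0522.jsonl']"
)
print(files)
```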

## Training the Pairwise Ranking Sentence Retriever

Prepare the training dataset.

In [None]:
!python scripts/sentence_retrieval/prepare_pairwise_ranking_dataset.py \
    --claim_dataset_path="data/claim_dataset" \
    --wiki_dataset_path="data/wiki_dataset" \
    --output_dir="data/sentence_retrieval/pairwise_ranking" \
    --top_k=3 \
    --min_score=10.0 \
    --return_by_noun=True \
    --merge_adjacent=True \
    --return_unmerged=True
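
The notebook does not document `--top_k` and `--min_score`. A plausible reading, stated here as an assumption, is that retrieval hits scoring below `min_score` are discarded and at most `top_k` of the remaining hits are kept. The sketch below illustrates that reading with a hypothetical `select_hits` function and made-up scores; it is not the repository's implementation, and whether the threshold is inclusive is a guess.

```python
def select_hits(hits, top_k=3, min_score=10.0):
    """Keep the `top_k` highest-scoring hits with score >= `min_score`.

    `hits` is a list of (doc_id, score) pairs. Assumed semantics only.
    """
    kept = [hit for hit in hits if hit[1] >= min_score]
    kept.sort(key=lambda hit: hit[1], reverse=True)  # best score first
    return kept[:top_k]


# Made-up scores for illustration.
hits = [("A", 12.5), ("B", 9.8), ("C", 30.1), ("D", 10.0), ("E", 15.2)]
print(select_hits(hits))  # [('C', 30.1), ('E', 15.2), ('A', 12.5)]
```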

Start training.

In [None]:
!python scripts/sentence_retrieval/train_pairwise_ranking.py \
    --dataset_path="data/sentence_retrieval/pairwise_ranking" \
    --claim_dataset_path="data/claim_dataset" \
    --wiki_dataset_path="data/wiki_dataset" \
    --name="pairwise-ranking-sentence-retriever" \
    --max_epochs=3 \
    --val_check_interval=2000
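
The loss used by `train_pairwise_ranking.py` is not shown in this notebook. For readers unfamiliar with pairwise ranking, a common formulation is the margin-based hinge loss, which pushes the score of a relevant sentence above the score of an irrelevant one by at least a margin. The sketch below is a generic illustration in plain Python; the function name and the margin value are assumptions.

```python
def pairwise_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Mean of max(0, margin - (s_pos - s_neg)) over aligned score pairs.

    Generic pairwise ranking loss, not necessarily the one used by the script.
    """
    if len(pos_scores) != len(neg_scores) or not pos_scores:
        raise ValueError("need two non-empty score lists of equal length")
    losses = [max(0.0, margin - (p - n)) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


# First pair is already separated by more than the margin (loss 0);
# second pair is ranked the wrong way round (loss 1.5).
print(pairwise_hinge_loss([2.0, 0.5], [0.5, 1.0]))  # 0.75
```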

## Training the Classifier Claim Verifier

Prepare the training dataset.

In [None]:
!python scripts/claim_verification/prepare_classifier_dataset_prsr.py \
    --claim_dataset_path="data/claim_dataset" \
    --wiki_dataset_path="data/wiki_dataset" \
    --output_dir="data/claim_verification/classifier_prsr" \
    --top_k=3 \
    --min_score=10.0 \
    --return_by_noun=True \
    --merge_adjacent=True \
    --return_unmerged=True \
    --sentence_retriever_path="sentence_retriever/e8hneqtg/e2.weights.ckpt"
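
The checkpoint path above contains what looks like a run-specific ID (`e8hneqtg`), so a fresh training run will probably write to a different directory. The file name `e2.weights.ckpt` after `--max_epochs=3` suggests zero-based epoch numbering, but that is an inference from this notebook, not documented behavior. Under those assumptions, the hypothetical helper below picks the highest-epoch checkpoint from a run directory.

```python
import re
from pathlib import Path

# Assumed naming scheme, inferred from "e2.weights.ckpt" in this notebook.
CKPT_RE = re.compile(r"^e(\d+)\.weights\.ckpt$")


def pick_latest(names):
    """Return the file name with the highest epoch number, or None."""
    best_epoch, best_name = -1, None
    for name in names:
        match = CKPT_RE.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_name = int(match.group(1)), name
    return best_name


def latest_checkpoint(run_dir):
    """Return the Path of the newest checkpoint in `run_dir`, or None."""
    run_dir = Path(run_dir)
    if not run_dir.is_dir():
        return None
    name = pick_latest(p.name for p in run_dir.iterdir())
    return None if name is None else run_dir / name


# Epochs compare numerically, so e10 beats e2.
print(pick_latest(["e0.weights.ckpt", "e2.weights.ckpt", "e10.weights.ckpt", "last.ckpt"]))
# e10.weights.ckpt
```

For example, `latest_checkpoint("sentence_retriever/<run_id>")` would return the path to pass as `--sentence_retriever_path`, where `<run_id>` stands for your own run's directory.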

Start training.

In [None]:
!python scripts/claim_verification/train_classifier.py \
    --dataset_path="data/claim_verification/classifier_prsr" \
    --name="classifier-claim-verifier_megatron-bert-1.3b" \
    --max_epochs=10 \
    --val_check_interval=2500

## Prediction

Run prediction with the Pairwise Ranking Sentence Retriever + Classifier Claim Verifier.

In [None]:
!python scripts/e2e/prsr_ccv_tmp.py \
    --wiki_dataset_path="data/wiki_dataset" \
    --test_data_path="data/raw/public_private_test_data.jsonl" \
    --sr_path="sentence_retriever/e8hneqtg/e2.weights.ckpt" \
    --cv_path="claim_verifier/7xet7u1m/e9.weights.ckpt" \
    --precision=16 \
    --sr_batch_size=64 \
    --cv_batch_size=32 \
    --cv_max_length=512 \
    --top_k=3 \
    --min_score=10.0 \
    --return_by_noun=True \
    --merge_adjacent=True \
    --return_unmerged=True