# **System Requirements Setup**

In [None]:
!git clone https://github.com/305909/charbert.git

In [None]:
!bash charbert/requirements.sh

# **Output Path Setup**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import os

def direct(path):
  """Create the directory `path` (and any missing parents) if it does not exist."""
  os.makedirs(path, exist_ok=True)

In [None]:
PATH = '/content/drive/MyDrive/NLP/output'
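
Each run in this notebook derives its output directory by concatenating `PATH` with a subfolder name, which silently yields a wrong location if the leading slash is forgotten. A minimal sketch of a safer alternative, assuming a hypothetical helper `output_dir` (not part of the charbert repository) that joins the parts with `os.path.join` and creates the directory in one step:

```python
import os
import tempfile

def output_dir(base, name):
    """Join `base` and `name` into one path and create that directory.

    Hypothetical helper for illustration; not part of the charbert repository.
    """
    path = os.path.join(base, name)
    os.makedirs(path, exist_ok=True)
    return path

# Demonstration in a temporary directory, so nothing is written to Drive.
base = tempfile.mkdtemp()
squad_dir = output_dir(base, 'SQuAD')
print(squad_dir)
```

In the notebook itself the call would look like `OUTPUT = output_dir(PATH, 'SQuAD')`.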

# **Baseline Evaluation**

- Fine-tune the pre-trained language model (bert-base-cased) on the English Wikipedia dataset with the Masked Language Modeling (MLM) objective to improve its language modeling in English.

In [None]:
OUTPUT = PATH + '/wikipedia-en'
direct(OUTPUT)

!python3 /content/charbert/CharBERT/LM.py \
    --model_type bert \
    --model_name_or_path bert-base-cased \
    --do_train \
    --do_eval \
    --train_data_file /content/charbert/data/wikipedia/en_train.txt \
    --eval_data_file  /content/charbert/data/wikipedia/en_validation.txt \
    --char_vocab /content/charbert/data/dict/bert_char_vocab \
    --term_vocab /content/charbert/data/dict/term_vocab \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --mlm_probability 0.10 \
    --input_nraws 1000 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 4 \
    --save_steps 10000 \
    --block_size 384 \
    --overwrite_output_dir \
    --mlm \
    --output_dir {OUTPUT}
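
To make `--mlm_probability 0.10` concrete, the sketch below selects about 10% of the token positions for prediction and corrupts them with the standard BERT recipe (80% `[MASK]`, 10% random token, 10% unchanged). It is an illustration only: `mask_tokens` is a hypothetical function, and the masking actually performed inside `LM.py` may differ in detail.

```python
import random

def mask_tokens(tokens, vocab, mlm_probability=0.10, seed=0):
    """Return (inputs, labels) following the standard BERT masking recipe.

    labels[i] holds the original token where a prediction is required and
    None elsewhere. Hypothetical helper, not taken from LM.py.
    """
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mlm_probability:
            continue  # position not selected: no loss is computed here
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = '[MASK]'           # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

tokens = 'the quick brown fox jumps over the lazy dog'.split() * 50
inputs, labels = mask_tokens(tokens, vocab=sorted(set(tokens)))
selected = sum(label is not None for label in labels)
print(f'{selected} of {len(tokens)} positions selected for prediction')
```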

- Fine-tune the model on the SQuAD dataset for question answering to improve its performance on English QA tasks.

In [None]:
OUTPUT = PATH + '/SQuAD'
direct(OUTPUT)

!python /content/charbert/CharBERT/SQuAD.py \
    --model_type bert \
    --model_name_or_path bert-base-cased \
    --do_train \
    --do_eval \
    --train_file /content/charbert/data/SQuAD/SQuAD_train.json \
    --predict_file /content/charbert/data/SQuAD/SQuAD_validation.json \
    --char_vocab /content/charbert/data/dict/bert_char_vocab \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 4 \
    --save_steps 2000 \
    --max_seq_length 384 \
    --overwrite_output_dir \
    --doc_stride 128 \
    --output_dir {OUTPUT}
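
The `--max_seq_length 384` and `--doc_stride 128` flags control how contexts longer than one input are split. The sketch below assumes `SQuAD.py` follows the sliding-window convention of the original BERT SQuAD code, where `doc_stride` is the step between the starts of consecutive windows; `doc_spans` is a hypothetical name used only for illustration.

```python
def doc_spans(n_doc_tokens, max_tokens_for_doc, doc_stride):
    """Split a document of n_doc_tokens into overlapping (start, length) spans."""
    spans, start = [], 0
    while start < n_doc_tokens:
        length = min(n_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == n_doc_tokens:
            break  # the last window reaches the end of the document
        start += min(length, doc_stride)
    return spans

# Toy numbers. In the cell above max_seq_length is 384 and doc_stride is 128;
# the room left for the document is max_seq_length minus the question length
# and the special tokens.
spans = doc_spans(n_doc_tokens=10, max_tokens_for_doc=4, doc_stride=2)
print(spans)  # [(0, 4), (2, 4), (4, 4), (6, 4)]
```

Consecutive windows overlap, so an answer that straddles one window boundary is usually fully contained in a neighbouring window.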

# **Domain Adaptation**

- Fine-tune the pre-trained language model (bert-base-cased) on the PubMed dataset with the Masked Language Modeling (MLM) objective to improve its modeling of medical text.

In [None]:
OUTPUT = PATH + '/PubMED'
direct(OUTPUT)

!python3 /content/charbert/CharBERT/LM.py \
    --model_type bert \
    --model_name_or_path bert-base-cased \
    --do_train \
    --do_eval \
    --train_data_file /content/charbert/data/PubMED/PubMED_train.txt \
    --eval_data_file  /content/charbert/data/PubMED/PubMED_validation.txt \
    --char_vocab /content/charbert/data/dict/bert_char_vocab \
    --term_vocab /content/charbert/data/dict/term_vocab \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --mlm_probability 0.10 \
    --input_nraws 1000 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 4 \
    --save_steps 10000 \
    --block_size 384 \
    --overwrite_output_dir \
    --mlm \
    --output_dir {OUTPUT}
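
A rough way to estimate how many checkpoints `--save_steps 10000` produces is sketched below. It assumes a single GPU, no gradient accumulation, and a checkpoint written every `save_steps` optimizer steps; `training_schedule` is a hypothetical helper and the example count is a made-up figure, not the size of the PubMED data.

```python
import math

def training_schedule(n_examples, batch_size, epochs, save_steps):
    """Return (total optimizer steps, steps at which a checkpoint is saved)."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    return total_steps, checkpoints

# n_examples is a placeholder for the number of training blocks.
total, checkpoints = training_schedule(
    n_examples=50_000, batch_size=4, epochs=2, save_steps=10_000)
print(total, checkpoints)  # 25000 [10000, 20000]
```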

- Fine-tune the model on the BioASQ dataset for question answering to improve its performance on medical QA tasks.

In [None]:
OUTPUT = PATH + '/BioASQ'
direct(OUTPUT)

!python /content/charbert/CharBERT/SQuAD.py \
    --model_type bert \
    --model_name_or_path bert-base-cased \
    --do_train \
    --do_eval \
    --train_file /content/charbert/data/BioASQ/BioASQ_train.json \
    --predict_file /content/charbert/data/BioASQ/BioASQ_validation.json \
    --char_vocab /content/charbert/data/dict/bert_char_vocab \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 4 \
    --save_steps 2000 \
    --max_seq_length 384 \
    --overwrite_output_dir \
    --doc_stride 128 \
    --output_dir {OUTPUT}
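
Runs with `--do_eval` on SQuAD-format data are commonly scored with exact match (EM) and token-level F1. The sketch below is a simplified version of those two metrics, assuming English-style answer normalization; it is not the code used by `SQuAD.py`, whose evaluation routine is not shown in this notebook.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = ''.join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred, gold = normalize(prediction).split(), normalize(truth).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match('The Eiffel Tower.', 'eiffel tower'))  # 1.0
print(f1_score('in Paris France', 'Paris'))              # 0.5
```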

# **Multilingual Context**

- Fine-tune the pre-trained multilingual language model (bert-base-multilingual-cased) on both the English and German Wikipedia datasets with the Masked Language Modeling (MLM) objective to improve its language modeling in English and German.

In [None]:
OUTPUT = PATH + '/wikipedia-en-de'
direct(OUTPUT)

!python3 /content/charbert/CharBERT/LM.py \
    --model_type bert \
    --model_name_or_path bert-base-multilingual-cased \
    --do_train \
    --do_eval \
    --train_data_file /content/charbert/data/wikipedia/en_de_train.txt \
    --eval_data_file  /content/charbert/data/wikipedia/en_de_validation.txt \
    --char_vocab /content/charbert/data/dict/bert_char_vocab \
    --term_vocab /content/charbert/data/dict/term_vocab \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --mlm_probability 0.10 \
    --input_nraws 1000 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 4 \
    --save_steps 10000 \
    --block_size 384 \
    --overwrite_output_dir \
    --mlm \
    --output_dir {OUTPUT}

- Fine-tune the model on the MLQA dataset for question answering to improve its performance on English and German QA tasks.

In [None]:
OUTPUT = PATH + '/MLQA'
direct(OUTPUT)

!python /content/charbert/CharBERT/SQuAD.py \
    --model_type bert \
    --model_name_or_path bert-base-multilingual-cased \
    --do_train \
    --do_eval \
    --train_file /content/charbert/data/MLQA/MLQA_train.json \
    --predict_file /content/charbert/data/MLQA/MLQA_validation.json \
    --char_vocab /content/charbert/data/dict/bert_char_vocab \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 4 \
    --save_steps 2000 \
    --max_seq_length 384 \
    --overwrite_output_dir \
    --doc_stride 128 \
    --output_dir {OUTPUT}
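
The QA cells in this notebook differ only in the model name, the two data files, and the output directory. A sketch of a helper that builds the command from those four values, reusing exactly the flags shown above; `squad_command` is a hypothetical name, not something provided by the repository:

```python
def squad_command(model, train_file, predict_file, output_dir,
                  script='/content/charbert/CharBERT/SQuAD.py',
                  char_vocab='/content/charbert/data/dict/bert_char_vocab'):
    """Build the argument list for one QA run with the shared hyperparameters."""
    return [
        'python', script,
        '--model_type', 'bert',
        '--model_name_or_path', model,
        '--do_train', '--do_eval',
        '--train_file', train_file,
        '--predict_file', predict_file,
        '--char_vocab', char_vocab,
        '--learning_rate', '3e-5',
        '--num_train_epochs', '2',
        '--per_gpu_train_batch_size', '4',
        '--per_gpu_eval_batch_size', '4',
        '--save_steps', '2000',
        '--max_seq_length', '384',
        '--overwrite_output_dir',
        '--doc_stride', '128',
        '--output_dir', output_dir,
    ]

cmd = squad_command(
    model='bert-base-multilingual-cased',
    train_file='/content/charbert/data/MLQA/MLQA_train.json',
    predict_file='/content/charbert/data/MLQA/MLQA_validation.json',
    output_dir='/content/drive/MyDrive/NLP/output/MLQA',
)
print(' '.join(cmd))
# To launch the run: import subprocess; subprocess.run(cmd, check=True)
```

Keeping the shared hyperparameters in one place makes it harder for one run to drift from the others by accident.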

# **Noise Resilience Evaluation**

- Fine-tune the model on the adversarial version of the SQuAD dataset for question answering to improve its robustness on English QA tasks when the data contains morphological variations and typos.

In [None]:
OUTPUT = PATH + '/SQuAD-attack'
direct(OUTPUT)

!python /content/charbert/CharBERT/SQuAD.py \
    --model_type bert \
    --model_name_or_path bert-base-cased \
    --do_train \
    --do_eval \
    --train_file /content/charbert/data/SQuAD/char_SQuAD_train.json \
    --predict_file /content/charbert/data/SQuAD/char_SQuAD_validation.json \
    --char_vocab /content/charbert/data/dict/bert_char_vocab \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 4 \
    --save_steps 2000 \
    --max_seq_length 384 \
    --overwrite_output_dir \
    --doc_stride 128 \
    --output_dir {OUTPUT}
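
For intuition about what character-level noise looks like, the sketch below injects simple typos (dropping, swapping, or repeating a character) into a sentence. It is purely illustrative: `add_typo` and `perturb` are hypothetical functions, and this notebook does not state how the `char_*` files were actually generated.

```python
import random

def add_typo(word, rng):
    """Apply one random character-level edit (drop, swap, or repeat) to `word`."""
    if len(word) < 2:
        return word  # too short to perturb meaningfully
    i = rng.randrange(len(word) - 1)
    kind = rng.choice(['drop', 'swap', 'repeat'])
    if kind == 'drop':
        return word[:i] + word[i + 1:]
    if kind == 'swap':
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word[:i] + word[i] + word[i:]  # repeat the character at position i

def perturb(text, rate=0.2, seed=0):
    """Perturb roughly `rate` of the whitespace-separated words in `text`."""
    rng = random.Random(seed)
    return ' '.join(add_typo(w, rng) if rng.random() < rate else w
                    for w in text.split())

text = 'which protein is encoded by the gene mentioned in the passage'
noisy = perturb(text)
print(noisy)
```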

- Fine-tune the model on the adversarial version of the BioASQ dataset for question answering to improve its robustness on medical QA tasks when the data contains morphological variations and typos.

In [None]:
OUTPUT = PATH + '/BioASQ-attack'
direct(OUTPUT)

!python /content/charbert/CharBERT/SQuAD.py \
    --model_type bert \
    --model_name_or_path bert-base-cased \
    --do_train \
    --do_eval \
    --train_file /content/charbert/data/BioASQ/char_BioASQ_train.json \
    --predict_file /content/charbert/data/BioASQ/char_BioASQ_validation.json \
    --char_vocab /content/charbert/data/dict/bert_char_vocab \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 4 \
    --save_steps 2000 \
    --max_seq_length 384 \
    --overwrite_output_dir \
    --doc_stride 128 \
    --output_dir {OUTPUT}

- Fine-tune the model on the adversarial version of the MLQA dataset for question answering to improve its robustness on English and German QA tasks when the data contains morphological variations and typos.

In [None]:
OUTPUT = PATH + '/MLQA-attack'
direct(OUTPUT)

!python /content/charbert/CharBERT/SQuAD.py \
    --model_type bert \
    --model_name_or_path bert-base-multilingual-cased \
    --do_train \
    --do_eval \
    --train_file /content/charbert/data/MLQA/char_MLQA_train.json \
    --predict_file /content/charbert/data/MLQA/char_MLQA_validation.json \
    --char_vocab /content/charbert/data/dict/bert_char_vocab \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 4 \
    --save_steps 2000 \
    --max_seq_length 384 \
    --overwrite_output_dir \
    --doc_stride 128 \
    --output_dir {OUTPUT}