- Introduction
- Requirements
- Datasets
- Training model
- Testing model
- Testing model with SemEval scripts
- Project structure
- Pretrained RoBERTa
- Credits
## Introduction

The scope of this project is to use the RoBERTa model to determine interpretable semantic textual similarity (iSTS) between two sentences. Given two sentences of text, s1 and s2, an STS system computes how similar s1 and s2 are, returning a similarity score. Although the score is useful for many tasks, it does not indicate which parts of the sentences are equivalent (or very close) in meaning and which are not. The aim of interpretable STS is to explore whether systems can explain WHY they consider the two sentences related / unrelated, adding an explanatory layer to the similarity score. The explanatory layer consists of an alignment of chunks across the two sentences, where each alignment is annotated with a similarity score and a relation label. The task is inspired by the SemEval competition.
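For example, given the classic STS pair s1 = "The bird is bathing in the sink." and s2 = "Birdie is washing itself in the water basin.", the explanatory layer might look roughly like this (the chunking, labels, and scores below are illustrative, not gold annotations):

```
[The bird]    <==> [Birdie]              // EQUI // 5
[is bathing]  <==> [is washing itself]   // SIMI // 4
[in the sink] <==> [in the water basin]  // SIMI // 4
```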
## Requirements

```bash
pip install -r requirements.txt
```
## Datasets

All needed datasets are in the `data/datasets` folder. The data has been downloaded from the 2015 and 2016 SemEval competition sites and preprocessed. If needed, the datasets can be reproduced with the following commands:
```bash
./download_semeval_data.sh
python tools/create_csv.py
```
## Training model

To train the model with default parameters, run the following command:

```bash
python tools/train_net.py
```
To change any parameter, the command can be run with additional arguments. For example, to set the maximum number of epochs to 100 and the learning rate to 0.0002, run the following command:

```bash
python tools/train_net.py SOLVER.MAX_EPOCHS 100 SOLVER.BASE_LR 0.0002
```
All the available parameters are defined in the `net_config/defaults.py` file. Parameters can also be loaded from a file with the `--config_file` argument:

```bash
python tools/train_net.py --config_file configs/roberta_config.yml
```
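For orientation, here is a minimal sketch of how such a config setup typically looks, assuming the yacs-based configuration used by the repo template this project builds on. The `SOLVER.*`, `TEST.*`, and `DATASETS.*` names come from the commands in this README; the default values are placeholders, not the real ones from `net_config/defaults.py`:

```python
# Hypothetical sketch in the style of net_config/defaults.py,
# assuming a yacs-based config; default values are placeholders.
from yacs.config import CfgNode as CN

_C = CN()

_C.SOLVER = CN()
_C.SOLVER.MAX_EPOCHS = 50   # overridable on the CLI, e.g. `SOLVER.MAX_EPOCHS 100`
_C.SOLVER.BASE_LR = 0.001   # overridable on the CLI, e.g. `SOLVER.BASE_LR 0.0002`

_C.TEST = CN()
_C.TEST.WEIGHT = ""         # path to trained model weights, set when testing

_C.DATASETS = CN()
_C.DATASETS.TEST_WA = ""    # path to a gold-standard .wa file

cfg = _C
```

Under this scheme, positional CLI overrides map to `cfg.merge_from_list([...])` and `--config_file` maps to `cfg.merge_from_file(path)`.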
## Testing model

To test the model, a path to the model weights has to be provided:

```bash
python .\tools\test_net.py TEST.WEIGHT "output/29052021142858_model.pt"
```
Changing other parameters works just like in training.
NOTE: There are 4 test datasets:
- images
- headline
- answers-students
- all of the above combined (default)
By default, 4 metrics are calculated (a sketch of how they can be computed follows the list):
- F1 score for similarity values
- F1 score for relation labels
- Pearson correlation for similarity values
- Pearson correlation for relation labels
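The following is a minimal sketch of how these four metrics could be computed with scikit-learn and SciPy, assuming similarity values and relation labels are already encoded as integers; the macro averaging and the encoding are assumptions, and the repo's own evaluation code may differ:

```python
# Illustrative metric computation, assuming integer-encoded predictions:
# similarity values in {0..5} and relation labels (EQUI, SIMI, REL, ...)
# mapped to integer ids. Not the repo's actual evaluation code.
from sklearn.metrics import f1_score
from scipy.stats import pearsonr

def ists_metrics(sim_true, sim_pred, rel_true, rel_pred):
    return {
        "f1_similarity": f1_score(sim_true, sim_pred, average="macro"),
        "f1_relation": f1_score(rel_true, rel_pred, average="macro"),
        "pearson_similarity": pearsonr(sim_true, sim_pred)[0],
        "pearson_relation": pearsonr(rel_true, rel_pred)[0],
    }
```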
For a model trained with default parameters, the results were as follows:

| Metric | Train set score | Test set score |
|---|---|---|
| F1 score for similarity values | 0.773 | 0.724 |
| F1 score for relation labels | 0.933 | 0.742 |
| Pearson correlation for similarity values | 0.887 | 0.811 |
| Pearson correlation for relation labels | 0.889 | 0.725 |
The trained model that achieved these results is available to download under this link.
## Testing model with SemEval scripts

There are 3 Perl scripts created by the competition organizers (downloaded and saved in the `tests` folder):
- evalF1_no_penalty.pl
- evalF1_penalty.pl
- wellformed.pl
To use them, the output files need to be in a specific format. To create prediction files that keep the gold-standard file structure, use the following command:

```bash
python .\tools\create_wa.py TEST.WEIGHT "output/29052021142858_model.pt" DATASETS.TEST_WA "data/datasets/STSint.testinput.answers-students.wa"
```
This will create a .wa file with predictions (keeping the gold-standard .wa file structure) in the `output` directory. To evaluate this file, a Perl interpreter has to be installed on the machine:

```bash
perl .\evalF1_penalty.pl data/datasets/STSint.testinput.answers-students.wa output/STSint.testinput.answers-student_predicted.wa --debug=0
```
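For reference, an alignment entry in a `.wa` file looks roughly like the fragment below (trimmed and illustrative, in the spirit of the SemEval iSTS gold files; consult the files in `data/datasets` for the exact layout):

```
<sentence id="1" status="">
...
<alignment>
1 2 <==> 1 // EQUI // 5 // The bird <==> Birdie
...
</alignment>
</sentence>
```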
## Project structure

```
├── net_config
│   └── defaults.py - the default config file.
│
├── configs
│   └── roberta_config.yml - the specific config file for a given model or dataset.
│
├── data
│   ├── datasets - the datasets folder, responsible for all data handling.
│   └── build.py - the file that builds the dataloader.
│
├── engine
│   ├── trainer.py - contains the train loops.
│   └── inference.py - contains the inference process.
│
├── modeling - contains the RoBERTa model.
│   └── roberta_ists.py
│
├── solver - contains the optimizer.
│   └── build.py
│
├── tools - the train/test model functionality.
│   ├── train_net.py - responsible for the whole training pipeline.
│   ├── test_net.py - responsible for the whole testing pipeline.
│   ├── create_csv.py - creates .csv files.
│   └── create_wa.py - creates .wa files.
│
├── tests - SemEval scripts to test model performance.
│   ├── evalF1_no_penalty.pl
│   ├── evalF1_penalty.pl
│   └── wellformed.pl
│
└── utils
    └── logger.py
```
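To see how these pieces fit together, here is a rough, hypothetical outline of the training entry point, inferred from the structure above; the module paths mirror the tree, but the function and class names (`make_data_loader`, `RobertaISTS`, `make_optimizer`, `do_train`, `cfg`) are assumptions, not the repo's actual API:

```python
# Hypothetical outline of tools/train_net.py, inferred from the
# project structure; real names and signatures may differ.
from net_config.defaults import cfg              # default yacs-style config (assumed name)
from data.build import make_data_loader          # builds the dataloader (assumed name)
from modeling.roberta_ists import RobertaISTS    # RoBERTa-based iSTS model (assumed class name)
from solver.build import make_optimizer          # builds the optimizer (assumed name)
from engine.trainer import do_train              # the training loop (assumed name)

def main():
    # optional: cfg.merge_from_file("configs/roberta_config.yml")
    train_loader = make_data_loader(cfg, is_train=True)
    model = RobertaISTS(cfg)
    optimizer = make_optimizer(cfg, model)
    do_train(cfg, model, train_loader, optimizer)

if __name__ == "__main__":
    main()
```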
## Pretrained RoBERTa

The base RoBERTa model can be replaced with a pretrained one before the proper training. Follow the instructions under this link. To fine-tune the model on the STS-B GLUE task, modify the command from step 3):
```bash
TOTAL_NUM_UPDATES=3598  # 10 epochs through STS-B for bsz 16
WARMUP_UPDATES=214      # 6 percent of the number of updates
LR=2e-05                # Peak LR for polynomial LR scheduler.
NUM_CLASSES=1
MAX_SENTENCES=16        # Batch size.
ROBERTA_PATH=/content/drive/MyDrive/NLP/fairseq/roberta.large/model.pt

CUDA_VISIBLE_DEVICES=0 fairseq-train STS-B-bin/ \
    --restore-file $ROBERTA_PATH \
    --max-positions 512 \
    --batch-size $MAX_SENTENCES \
    --max-tokens 4400 \
    --task sentence_prediction \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --init-token 0 --separator-token 2 \
    --arch roberta_large \
    --criterion sentence_prediction \
    --num-classes $NUM_CLASSES \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
    --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
    --max-epoch 10 \
    --regression-target --best-checkpoint-metric loss \
    --find-unused-parameters;
```
## Credits

Repo template - L1aoXingyu/Deep-Learning-Project-Template