# Paraphrase Classification Model

## Data Loading

Load the training and validation data into dataframes

In [5]:
import pandas as pd

train_df = pd.read_csv('./data/paws/train.tsv', sep='\t')[['sentence1', 'sentence2', 'label']]
val_df = pd.read_csv('./data/paws/dev.tsv', sep='\t')[['sentence1', 'sentence2', 'label']]
train_df.head()


Unnamed: 0,sentence1,sentence2,label
0,"In Paris , in October 1560 , he secretly met t...","In October 1560 , he secretly met with the Eng...",0
1,The NBA season of 1975 -- 76 was the 30th seas...,The 1975 -- 76 season of the National Basketba...,1
2,"There are also specific discussions , public p...","There are also public discussions , profile sp...",0
3,When comparable rates of flow can be maintaine...,The results are high when comparable flow rate...,1
4,It is the seat of Zerendi District in Akmola R...,It is the seat of the district of Zerendi in A...,1


Use the dataframes to create huggingface datasets

In [6]:
from datasets import Dataset

train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

## Tokenize Data

Load the tokenizer of the model that will be fine-tuned

In [7]:
from transformers import AutoTokenizer, DataCollatorWithPadding

checkpoint = 'sentence-transformers/all-distilroberta-v1'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

This tokenize function tokenizes both sentences and concatenates them with a seperator token

In [8]:
def tokenize_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True)

Apply the tokenize function to each dataset

In [9]:
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer)

100%|██████████| 50/50 [00:04<00:00, 12.43ba/s]
100%|██████████| 8/8 [00:00<00:00, 12.52ba/s]


## Load model

Load the huggingface model for sequence classification with 2 labels

In [10]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2).to('cuda')

Some weights of the model checkpoint at sentence-transformers/all-distilroberta-v1 were not used when initializing RobertaForSequenceClassification: ['pooler.dense.bias', 'pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/all-distilroberta-v1 and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias']
You should probably TRAIN t

## Get Trainer Config

Load the accuracy and f1 metrics and create a function that applies them to pass to the trainer

In [12]:
import numpy as np
from datasets import load_metric

metric = load_metric('accuracy', 'f1')

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Create the TrainingArguments. This specifies checkpoint path, batch_size, epochs, when to apply the metrics, etc.

In [14]:
from transformers import TrainingArguments
batch_size = 64
training_args = TrainingArguments('./data/models/paraphrase_distilbert_1', per_device_train_batch_size=batch_size, per_device_eval_batch_size=batch_size, num_train_epochs=5, evaluation_strategy='epoch', report_to="wandb")

# Train the model
Create the Trainer with everything created so far and run the train function.

In [15]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
train_output = trainer.train()

The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sentence1, sentence2.
***** Running training *****
  Num examples = 49401
  Num Epochs = 5
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 3860
 13%|█▎        | 500/3860 [02:17<14:18,  3.91it/s]Saving model checkpoint to ./data/models/paraphrase_distilbert_1\checkpoint-500
Configuration saved in ./data/models/paraphrase_distilbert_1\checkpoint-500\config.json


{'loss': 0.5628, 'learning_rate': 4.352331606217617e-05, 'epoch': 0.65}


Model weights saved in ./data/models/paraphrase_distilbert_1\checkpoint-500\pytorch_model.bin
tokenizer config file saved in ./data/models/paraphrase_distilbert_1\checkpoint-500\tokenizer_config.json
Special tokens file saved in ./data/models/paraphrase_distilbert_1\checkpoint-500\special_tokens_map.json
 20%|██        | 772/3860 [03:31<13:02,  3.95it/s]The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sentence1, sentence2.
***** Running Evaluation *****
  Num examples = 8000
  Batch size = 64

 20%|██        | 772/3860 [03:43<13:02,  3.95it/s]

{'eval_loss': 0.34087324142456055, 'eval_accuracy': 0.8545, 'eval_runtime': 11.7322, 'eval_samples_per_second': 681.884, 'eval_steps_per_second': 10.654, 'epoch': 1.0}


 26%|██▌       | 1000/3860 [04:43<11:47,  4.04it/s]Saving model checkpoint to ./data/models/paraphrase_distilbert_1\checkpoint-1000
Configuration saved in ./data/models/paraphrase_distilbert_1\checkpoint-1000\config.json


{'loss': 0.3139, 'learning_rate': 3.704663212435233e-05, 'epoch': 1.3}


Model weights saved in ./data/models/paraphrase_distilbert_1\checkpoint-1000\pytorch_model.bin
tokenizer config file saved in ./data/models/paraphrase_distilbert_1\checkpoint-1000\tokenizer_config.json
Special tokens file saved in ./data/models/paraphrase_distilbert_1\checkpoint-1000\special_tokens_map.json
 39%|███▉      | 1500/3860 [06:57<10:26,  3.76it/s]Saving model checkpoint to ./data/models/paraphrase_distilbert_1\checkpoint-1500
Configuration saved in ./data/models/paraphrase_distilbert_1\checkpoint-1500\config.json


{'loss': 0.2326, 'learning_rate': 3.05699481865285e-05, 'epoch': 1.94}


Model weights saved in ./data/models/paraphrase_distilbert_1\checkpoint-1500\pytorch_model.bin
tokenizer config file saved in ./data/models/paraphrase_distilbert_1\checkpoint-1500\tokenizer_config.json
Special tokens file saved in ./data/models/paraphrase_distilbert_1\checkpoint-1500\special_tokens_map.json
 40%|████      | 1544/3860 [07:10<09:27,  4.08it/s]The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sentence1, sentence2.
***** Running Evaluation *****
  Num examples = 8000
  Batch size = 64

 40%|████      | 1544/3860 [07:22<09:27,  4.08it/s]

{'eval_loss': 0.2979626953601837, 'eval_accuracy': 0.892875, 'eval_runtime': 11.6615, 'eval_samples_per_second': 686.015, 'eval_steps_per_second': 10.719, 'epoch': 2.0}


 52%|█████▏    | 2000/3860 [09:24<08:46,  3.53it/s]Saving model checkpoint to ./data/models/paraphrase_distilbert_1\checkpoint-2000
Configuration saved in ./data/models/paraphrase_distilbert_1\checkpoint-2000\config.json


{'loss': 0.1639, 'learning_rate': 2.4093264248704665e-05, 'epoch': 2.59}


Model weights saved in ./data/models/paraphrase_distilbert_1\checkpoint-2000\pytorch_model.bin
tokenizer config file saved in ./data/models/paraphrase_distilbert_1\checkpoint-2000\tokenizer_config.json
Special tokens file saved in ./data/models/paraphrase_distilbert_1\checkpoint-2000\special_tokens_map.json
 60%|██████    | 2316/3860 [10:50<06:24,  4.02it/s]The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sentence1, sentence2.
***** Running Evaluation *****
  Num examples = 8000
  Batch size = 64

 60%|██████    | 2316/3860 [11:01<06:24,  4.02it/s]

{'eval_loss': 0.3627687394618988, 'eval_accuracy': 0.8925, 'eval_runtime': 11.5903, 'eval_samples_per_second': 690.235, 'eval_steps_per_second': 10.785, 'epoch': 3.0}


 65%|██████▍   | 2500/3860 [11:50<05:53,  3.84it/s]Saving model checkpoint to ./data/models/paraphrase_distilbert_1\checkpoint-2500
Configuration saved in ./data/models/paraphrase_distilbert_1\checkpoint-2500\config.json


{'loss': 0.1344, 'learning_rate': 1.761658031088083e-05, 'epoch': 3.24}


Model weights saved in ./data/models/paraphrase_distilbert_1\checkpoint-2500\pytorch_model.bin
tokenizer config file saved in ./data/models/paraphrase_distilbert_1\checkpoint-2500\tokenizer_config.json
Special tokens file saved in ./data/models/paraphrase_distilbert_1\checkpoint-2500\special_tokens_map.json
 78%|███████▊  | 3000/3860 [14:04<03:49,  3.74it/s]Saving model checkpoint to ./data/models/paraphrase_distilbert_1\checkpoint-3000
Configuration saved in ./data/models/paraphrase_distilbert_1\checkpoint-3000\config.json


{'loss': 0.1079, 'learning_rate': 1.1139896373056995e-05, 'epoch': 3.89}


Model weights saved in ./data/models/paraphrase_distilbert_1\checkpoint-3000\pytorch_model.bin
tokenizer config file saved in ./data/models/paraphrase_distilbert_1\checkpoint-3000\tokenizer_config.json
Special tokens file saved in ./data/models/paraphrase_distilbert_1\checkpoint-3000\special_tokens_map.json
 80%|████████  | 3088/3860 [14:29<03:19,  3.87it/s]The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sentence1, sentence2.
***** Running Evaluation *****
  Num examples = 8000
  Batch size = 64

 80%|████████  | 3088/3860 [14:40<03:19,  3.87it/s]

{'eval_loss': 0.4107610583305359, 'eval_accuracy': 0.903375, 'eval_runtime': 11.5786, 'eval_samples_per_second': 690.928, 'eval_steps_per_second': 10.796, 'epoch': 4.0}


 91%|█████████ | 3500/3860 [16:28<01:33,  3.84it/s]Saving model checkpoint to ./data/models/paraphrase_distilbert_1\checkpoint-3500
Configuration saved in ./data/models/paraphrase_distilbert_1\checkpoint-3500\config.json


{'loss': 0.0821, 'learning_rate': 4.663212435233161e-06, 'epoch': 4.53}


Model weights saved in ./data/models/paraphrase_distilbert_1\checkpoint-3500\pytorch_model.bin
tokenizer config file saved in ./data/models/paraphrase_distilbert_1\checkpoint-3500\tokenizer_config.json
Special tokens file saved in ./data/models/paraphrase_distilbert_1\checkpoint-3500\special_tokens_map.json
100%|██████████| 3860/3860 [18:04<00:00,  3.97it/s]The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sentence1, sentence2.
***** Running Evaluation *****
  Num examples = 8000
  Batch size = 64

100%|██████████| 3860/3860 [18:15<00:00,  3.97it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)


100%|██████████| 3860/3860 [18:15<00:00,  3.52it/s]

{'eval_loss': 0.42276668548583984, 'eval_accuracy': 0.90225, 'eval_runtime': 11.5793, 'eval_samples_per_second': 690.89, 'eval_steps_per_second': 10.795, 'epoch': 5.0}
{'train_runtime': 1095.7058, 'train_samples_per_second': 225.43, 'train_steps_per_second': 3.523, 'train_loss': 0.21426511626169473, 'epoch': 5.0}





In [16]:
train_output

TrainOutput(global_step=3860, training_loss=0.21426511626169473, metrics={'train_runtime': 1095.7058, 'train_samples_per_second': 225.43, 'train_steps_per_second': 3.523, 'train_loss': 0.21426511626169473, 'epoch': 5.0})