In [None]:
!pip install transformers datasets

## Fetching Dataset
We need the Wiki D/Similar dataset (wiki-d-similar.zip) for training or evaluation.

It should be in the same directory as this notebook.

You can get this dataset from [Sentence Transformers](https://github.com/m3hrdadfi/sentence-transformers).

If you're using Colab, you can use the next cell to upload the file directly, or to mount your Google Drive if you uploaded the file there. Otherwise, you can skip the next cell.


In [None]:
# 1. Upload Directly
#from google.colab import files
#files.upload()
#
# 2. Mounting Google Drive
#from google.colab import drive
#drive.mount('/content/gdrive')
#!cp gdrive/MyDrive/wiki-d-similar.zip .

Extract the dataset from the zip archive.

In [None]:
!mkdir nli
!cp wiki-d-similar.zip nli/
!7z x nli/wiki-d-similar.zip -onli/


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 32603485 bytes (32 MiB)

Extracting archive: nli/wiki-d-similar.zip
--
Path = nli/wiki-d-similar.zip
Type = zip
Physical Size = 32603485

 22% 1 - wiki-d-similar/wiki-d-similar.csv
 38% 1 - wiki-d-similar/wiki-d-similar.csv
 50% 2 - wiki-d-similar/test.csv
 60% 4 - wiki-d-similar/train.csv
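If `7z` is not available in your environment, Python's standard `zipfile` module can do the same extraction. The following is a minimal sketch, assuming the archive sits next to the notebook; `extract_archive` is a hypothetical helper name, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return their names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Usage (paths taken from the cell above):
# extract_archive("wiki-d-similar.zip", "nli")
```

Returning the member names makes it easy to check that `train.csv`, `test.csv` and `dev.csv` ended up where the loading cell expects them.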

The next cell loads the dataset. Since the fields are separated by tabs instead of commas, we have to declare this with ```delimiter="\t"```.

In [None]:
from datasets import load_dataset

data_files = {"train": "train.csv", "test": "test.csv", "dev": "dev.csv"}

dataset = load_dataset("nli/wiki-d-similar", data_files=data_files, delimiter="\t")

Using custom data configuration wiki-d-similar-132f132706a1ec58


Downloading and preparing dataset csv/wiki-d-similar to /root/.cache/huggingface/datasets/csv/wiki-d-similar-132f132706a1ec58/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/wiki-d-similar-132f132706a1ec58/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]
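To see why the delimiter matters, here is a small sketch that parses a tab-separated sample with the standard `csv` module. The column names `Sentence1`, `Sentence2` and `Label` are the ones used later in this notebook; the row content is made up for illustration.

```python
import csv
import io

# Made-up sample in the same layout: tab-separated, header in the first row.
sample = "Sentence1\tSentence2\tLabel\nfirst, with a comma\tsecond\tsimilar\n"

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
print(rows[0]["Sentence1"])  # the comma stays inside the field
print(rows[0]["Label"])
```

With the default comma delimiter, the same row would be split at the comma inside the first sentence, which is why the tab has to be declared explicitly.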

Labels in the wikinli dataset are "similar" and "dissimilar".

We have to map them to the corresponding ids; the function in the next cell does the job:


In [None]:
def label2id(example):
  if example["Label"] == "similar":
    example["label"] = 1
  else:
    example["label"] = 0
  return example
dataset = dataset.map(label2id)

  0%|          | 0/126628 [00:00<?, ?ex/s]

  0%|          | 0/5497 [00:00<?, ?ex/s]

  0%|          | 0/5277 [00:00<?, ?ex/s]
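Note that the function above maps every value other than "similar" to 0, including typos or empty labels. A stricter variant could look like the sketch below; `LABEL_TO_ID` and `label2id_strict` are hypothetical names introduced here for illustration.

```python
LABEL_TO_ID = {"dissimilar": 0, "similar": 1}


def label2id_strict(example):
    """Like label2id above, but raises KeyError on an unexpected label."""
    example["label"] = LABEL_TO_ID[example["Label"]]
    return example


print(label2id_strict({"Label": "similar"}))
```

It can be passed to `dataset.map` in the same way as `label2id`, and fails loudly if the data contains a label that is neither of the two expected strings.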

There are a few things to be aware of when feeding data into our model, such as encoding each sentence pair together and padding every batch to a common length. Fortunately, the tokenizer and data collator from the transformers library take care of them; we just need to initialize them properly.

In [None]:
from transformers import AutoTokenizer, DataCollatorWithPadding

checkpoint = "HooshvareLab/bert-fa-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["Sentence1"], example["Sentence2"], truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True).shuffle(seed=42)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Downloading:   0%|          | 0.00/440 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.14M [00:00<?, ?B/s]

  0%|          | 0/127 [00:00<?, ?ba/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


  0%|          | 0/6 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]
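To make the two ideas concrete, here is a plain-Python sketch of truncation followed by padding to the longest sequence in the batch. It only illustrates the concept and is not the library's implementation; `truncate_and_pad` is a hypothetical helper, the token ids are made up, and a real tokenizer would also keep the final separator token when truncating.

```python
def truncate_and_pad(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [seq[:max_length] for seq in batch]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


ids, mask = truncate_and_pad(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 102]], max_length=5
)
print(ids)
print(mask)
```

The warning in the output above indicates that no maximum length was known, so truncation was not actually applied. Passing `max_length` explicitly in the tokenizer call (for example 512, assuming the usual BERT-base position limit applies to this checkpoint) would be one way to enforce it.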

We need to define a function that computes the metric each time the model is evaluated, i.e. every ```eval_steps``` training steps:

In [None]:
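from datasets import load_metric
import numpy as np

def compute_metrics(eval_preds):
  # The metric is loaded on each call; after the first download it comes from the local cache.
  metric = load_metric('accuracy')
  # eval_preds unpacks into the model logits and the reference labels.
  logits, labels = eval_preds
  # The predicted class is the index of the largest logit.
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)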
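For intuition, the accuracy computation can be written by hand with NumPy. This is a sketch with made-up logits, not a replacement for the metric used above; `accuracy_from_logits` is a hypothetical name.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = np.argmax(logits, axis=-1)
    return float(np.mean(predictions == np.asarray(labels)))


# Four made-up examples with two classes each.
logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 1.7], [0.9, 0.2]])
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))  # 3 of 4 correct -> 0.75
```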

## Load Pretrained Model
We need to load weights from bert-fa-base-uncased. The model was pretrained in a self-supervised fashion with objectives such as masked language modeling and next sentence prediction.

We also need to change the head of our model to make it usable for our classification problem. The Hugging Face library once again handles this well for us: it adds a new classification head with randomly initialized weights. The number of labels should be specified.

In [None]:
from transformers import AutoModelForSequenceClassification
checkpoint = "HooshvareLab/bert-fa-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Downloading:   0%|          | 0.00/624M [00:00<?, ?B/s]

Some weights of the model checkpoint at HooshvareLab/bert-fa-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification w

## Gradual Unfreezing
The pretrained model was trained on a huge dataset and currently stores that knowledge in its weights. However, the new head that we just added on top of it is initialized randomly.

If we treat all of these weights equally, we risk losing that background knowledge. The next cell freezes the pretrained model so that, at this point, only the newly initialized part of the model is trained.

In [None]:
for name, param in model.named_parameters():
    if name.startswith("bert"):
        param.requires_grad = False
        print(name)
    else:
        print("NO", name)

bert.embeddings.word_embeddings.weight
bert.embeddings.position_embeddings.weight
bert.embeddings.token_type_embeddings.weight
bert.embeddings.LayerNorm.weight
bert.embeddings.LayerNorm.bias
bert.encoder.layer.0.attention.self.query.weight
bert.encoder.layer.0.attention.self.query.bias
bert.encoder.layer.0.attention.self.key.weight
bert.encoder.layer.0.attention.self.key.bias
bert.encoder.layer.0.attention.self.value.weight
bert.encoder.layer.0.attention.self.value.bias
bert.encoder.layer.0.attention.output.dense.weight
bert.encoder.layer.0.attention.output.dense.bias
bert.encoder.layer.0.attention.output.LayerNorm.weight
bert.encoder.layer.0.attention.output.LayerNorm.bias
bert.encoder.layer.0.intermediate.dense.weight
bert.encoder.layer.0.intermediate.dense.bias
bert.encoder.layer.0.output.dense.weight
bert.encoder.layer.0.output.dense.bias
bert.encoder.layer.0.output.LayerNorm.weight
bert.encoder.layer.0.output.LayerNorm.bias
bert.encoder.layer.1.attention.self.query.weight
bert.enc
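A quick way to confirm that freezing worked is to count trainable parameters. The sketch below uses a toy model with a `bert` body and a `classifier` head as a stand-in for the real network; `TinyClassifier` and `count_parameters` are hypothetical names, and the layer sizes are arbitrary.

```python
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Toy stand-in: a 'bert' body plus a 'classifier' head (not a real BERT)."""

    def __init__(self):
        super().__init__()
        self.bert = nn.Linear(8, 4)
        self.classifier = nn.Linear(4, 2)


def count_parameters(model):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


toy = TinyClassifier()
# Same freezing rule as in the cell above.
for name, param in toy.named_parameters():
    if name.startswith("bert"):
        param.requires_grad = False

print(count_parameters(toy))
```

Calling `count_parameters(model)` on the real model after the freezing cell should show that only a small fraction of the weights remains trainable.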

## Training Parameters
We initialize the training arguments in the next cell.

**Things to notice**

**save_steps** should be high, because with the default value the saved checkpoints will eat up the free disk space for no reason.

**label_smoothing_factor**: we talked about this parameter in the Medium post. Fortunately, Hugging Face once again has an implementation for this, so we don't need to build a fancy DataLoader. Why 0.1? It's just a rule of thumb; I didn't have the resources to do a hyperparameter search on this.

**fp16** is the most fun part: it halves the floating-point precision and can roughly double the training speed on GPUs that support it, which is very useful in most cases.

**num_train_epochs**: I've chosen 1 because we have maybe 10 times more data than many datasets. Spoiler alert: more than 1 epoch isn't effective; I've tried that.
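To illustrate what label smoothing does to the loss, here is a small sketch of the common uniform-smoothing formulation, where the one-hot target is mixed with a uniform distribution over the classes. It is assumed, not verified here, that the Trainer follows this same formulation; `smoothed_cross_entropy` is a hypothetical name and the logits are made up.

```python
import math


def smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Cross-entropy against (1 - epsilon) * one_hot + epsilon * uniform."""
    max_logit = max(logits)
    # Numerically stable log-sum-exp for the softmax normalizer.
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    log_probs = [x - log_norm for x in logits]
    nll = -log_probs[target]
    uniform = -sum(log_probs) / len(log_probs)
    return (1 - epsilon) * nll + epsilon * uniform


# A very confident and correct prediction.
plain = smoothed_cross_entropy([4.0, -4.0], target=0, epsilon=0.0)
smoothed = smoothed_cross_entropy([4.0, -4.0], target=0, epsilon=0.1)
print(plain, smoothed)
```

Without smoothing, the confident prediction has a loss close to zero; with smoothing, it is still penalized, which discourages the model from becoming overconfident.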

In [None]:
from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments("wikinli-trainer",
                                  evaluation_strategy="steps",
                                  per_device_eval_batch_size=32,
                                  per_device_train_batch_size=32,
                                  eval_steps=300,
                                  save_steps=7915,
                                  learning_rate=5e-5,
                                  label_smoothing_factor=0.1,
                                  fp16=True,
                                  num_train_epochs=1)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["dev"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using amp half precision backend
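The step-related values can be checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation; the number of training examples is the size of the train split shown in the mapping output earlier.

```python
import math

num_examples = 126628  # size of the train split
batch_size = 32        # per_device_train_batch_size, single device assumed
eval_steps = 300
save_steps = 7915

steps_per_epoch = math.ceil(num_examples / batch_size)
evaluations = steps_per_epoch // eval_steps
checkpoints = steps_per_epoch // save_steps

print(steps_per_epoch, evaluations, checkpoints)
```

One epoch therefore takes 3958 optimization steps, which matches the `Total optimization steps` line in the training log. Because `save_steps` is larger than that, no intermediate checkpoint is written during a single epoch.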


This is where the magic happens. Let's run the first part of our training:

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: Article Title, Label, Article Link, Sentence1, Sentence2. If Article Title, Label, Article Link, Sentence1, Sentence2 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 126628
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 3958


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 300 | No log | 0.655496 | 0.631798 |
| 600 | 0.665700 | 0.649131 | 0.635399 |
| 900 | 0.665700 | 0.647186 | 0.639189 |
| 1200 | 0.657100 | 0.646199 | 0.639378 |
| 1500 | 0.656200 | 0.644583 | 0.642031 |
| 1800 | 0.656200 | 0.644306 | 0.647906 |
| 2100 | 0.653100 | 0.643215 | 0.6462 |
| 2400 | 0.653100 | 0.642512 | 0.645253 |
| 2700 | 0.648400 | 0.642251 | 0.648475 |
| 3000 | 0.648300 | 0.6422 | 0.645442 |


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: Article Title, Label, Article Link, Sentence1, Sentence2. If Article Title, Label, Article Link, Sentence1, Sentence2 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5277
  Batch size = 32



TrainOutput(global_step=3958, training_loss=0.654458576528155, metrics={'train_runtime': 618.505, 'train_samples_per_second': 204.732, 'train_steps_per_second': 6.399, 'total_flos': 8448066376589520.0, 'train_loss': 0.654458576528155, 'epoch': 1.0})

OK, not bad. We've got an accuracy of about 0.648 on the dev set without even touching the pretrained weights.

I've saved the model in the following cell, but it's not necessary. We still have training to do.

In [None]:
trainer.save_model('models/finetuned_only_on_classifier648')

Saving model checkpoint to models/finetuned_only_on_classifier648
Configuration saved in models/finetuned_only_on_classifier648/config.json
Model weights saved in models/finetuned_only_on_classifier648/pytorch_model.bin
tokenizer config file saved in models/finetuned_only_on_classifier648/tokenizer_config.json
Special tokens file saved in models/finetuned_only_on_classifier648/special_tokens_map.json


OK, let's unfreeze everything so that training updates every layer.

In [None]:
for name, param in model.named_parameters():
  param.requires_grad = True
  print(name)

bert.embeddings.word_embeddings.weight
bert.embeddings.position_embeddings.weight
bert.embeddings.token_type_embeddings.weight
bert.embeddings.LayerNorm.weight
bert.embeddings.LayerNorm.bias
bert.encoder.layer.0.attention.self.query.weight
bert.encoder.layer.0.attention.self.query.bias
bert.encoder.layer.0.attention.self.key.weight
bert.encoder.layer.0.attention.self.key.bias
bert.encoder.layer.0.attention.self.value.weight
bert.encoder.layer.0.attention.self.value.bias
bert.encoder.layer.0.attention.output.dense.weight
bert.encoder.layer.0.attention.output.dense.bias
bert.encoder.layer.0.attention.output.LayerNorm.weight
bert.encoder.layer.0.attention.output.LayerNorm.bias
bert.encoder.layer.0.intermediate.dense.weight
bert.encoder.layer.0.intermediate.dense.bias
bert.encoder.layer.0.output.dense.weight
bert.encoder.layer.0.output.dense.bias
bert.encoder.layer.0.output.LayerNorm.weight
bert.encoder.layer.0.output.LayerNorm.bias
bert.encoder.layer.1.attention.self.query.weight
bert.enc
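The cell above unfreezes all layers at once. A more gradual schedule, in the sense of this section's title, would unfreeze only the top few encoder layers at a time. The sketch below decides this from the parameter name alone; `is_trainable` and `LAYER_PATTERN` are hypothetical names, `num_layers=12` assumes a BERT-base sized encoder, and `classifier.weight` is assumed to be the name of a head parameter.

```python
import re

LAYER_PATTERN = re.compile(r"^bert\.encoder\.layer\.(\d+)\.")


def is_trainable(name, top_k, num_layers=12):
    """True for the head and for the top_k highest encoder layers."""
    if not name.startswith("bert"):
        return True  # the classification head always trains
    match = LAYER_PATTERN.match(name)
    if match is None:
        return False  # embeddings and pooler stay frozen in this sketch
    return int(match.group(1)) >= num_layers - top_k


names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.11.output.dense.weight",
    "classifier.weight",
]
print([n for n in names if is_trainable(n, top_k=2)])
```

Applied to the real model, this would become `param.requires_grad = is_trainable(name, top_k)` inside the loop over `model.named_parameters()`, with `top_k` increased between training rounds.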

Nothing new here; we just define the training arguments and the trainer one more time, now with a new output directory and `num_train_epochs=5`.


In [None]:
from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments("wikinli-trainer_part2",
                                  evaluation_strategy="steps",
                                  per_device_eval_batch_size=32,
                                  per_device_train_batch_size=32,
                                  eval_steps=300,
                                  save_steps=7915,
                                  learning_rate=5e-5,
                                  label_smoothing_factor=0.1,
                                  fp16=True,
                                  num_train_epochs=5)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["dev"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using amp half precision backend


Let's train one more time.

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: Article Title, Label, Article Link, Sentence1, Sentence2. If Article Title, Label, Article Link, Sentence1, Sentence2 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 126628
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 19790


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 300 | No log | 0.603477 | 0.68808 |
| 600 | 0.621000 | 0.585574 | 0.713474 |
| 900 | 0.621000 | 0.577543 | 0.719727 |
| 1200 | 0.601100 | 0.573106 | 0.719917 |
| 1500 | 0.599300 | 0.575564 | 0.719348 |
| 1800 | 0.599300 | 0.566857 | 0.734698 |
| 2100 | 0.590100 | 0.55824 | 0.739435 |
| 2400 | 0.590100 | 0.558321 | 0.747015 |
| 2700 | 0.578000 | 0.58348 | 0.743604 |
| 3000 | 0.579000 | 0.570843 | 0.722001 |


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: Article Title, Label, Article Link, Sentence1, Sentence2. If Article Title, Label, Article Link, Sentence1, Sentence2 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5277
  Batch size = 32

KeyboardInterrupt: ignored

## Boom!

We've reached a state-of-the-art result, comparable to the previously released model trained on this dataset.

See [evaluation](https://colab.research.google.com/github/DemoVersion/persian-nli-trainer/blob/main/notebooks/evaluation.ipynb) for results.