**INITIALIZATION:**
- I put these three lines at the top of every notebook. The first two enable the `autoreload` extension, which re-imports modified modules before each cell runs, so changes to project code take effect without restarting the kernel. The third line renders Matplotlib plots inline in the notebook.

In [1]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**LIBRARIES AND DEPENDENCIES:**
- All the libraries and dependencies required for the project are imported in a single cell.

In [3]:
#@ INSTALLING DEPENDENCIES: UNCOMMENT BELOW: 
# !pip install datasets transformers[sentencepiece]

In [22]:
#@ IMPORTING LIBRARIES AND DEPENDENCIES:
import numpy as np
import torch
import transformers
import datasets
from datasets import load_dataset
from datasets import load_metric
from transformers import AdamW
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from transformers import TrainingArguments
from transformers import Trainer

#@ IGNORING WARNINGS: 
import warnings
warnings.filterwarnings("ignore")

**PROCESSING THE DATA:**

In [6]:
#@ PROCESSING THE DATA:
checkpoint = "bert-base-uncased"                                        # Initialization. 
tokenizer = AutoTokenizer.from_pretrained(checkpoint)                   # Initializing Tokenizer. 
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)  # Initializing Sequence Model. 
sequences = [
        "I've been waiting for a HuggingFace course my whole life",
        "This course is amazing!"
]                                                                       # Text Sequences. 
batch = tokenizer(sequences, padding=True, truncation=True, 
                  return_tensors="pt")                                  # Getting Batch of Tensors. 
batch["labels"] = torch.tensor([1, 1])                                  # Initializing Labels. 

#@ INITIALIZING MODEL TRAINING PARAMETERS:
optimizer = AdamW(model.parameters())                                   # Initializing Optimizer. 
loss = model(**batch).loss                                              # Forward Pass: Computing Loss. 
loss.backward()                                                         # Back Propagation: Computing Gradients. 
optimizer.step()                                                        # Updating Parameters. 

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
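
- The `.loss` returned in the cell above is, for a sequence classification head with integer labels, a softmax cross-entropy averaged over the batch. The sketch below recomputes that quantity in plain Python to show what the number means. The logits are hypothetical values chosen for illustration and are not real model output.

```python
import math

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of logit rows and integer labels."""
    total = 0.0
    for row, label in zip(logits, labels):
        # Log-sum-exp with the row maximum subtracted for numerical stability.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        # Negative log-probability of the correct class.
        total += log_z - row[label]
    return total / len(labels)

# Hypothetical logits for two sequences and two classes (assumption, not model output).
example_logits = [[0.0, 0.0], [2.0, 0.0]]
example_labels = [1, 1]          # Same labels as in the cell above.
example_loss = cross_entropy(example_logits, example_labels)
print(round(example_loss, 4))
```

- Calling `loss.backward()` then differentiates this scalar with respect to every model parameter, and `optimizer.step()` applies one update using those gradients.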

**GETTING THE DATASET:**
- In this notebook, we will use the MRPC (Microsoft Research Paraphrase Corpus) dataset introduced by William B. Dolan and Chris Brockett. The dataset consists of 5,801 pairs of sentences, each with a label indicating whether the two sentences are paraphrases. It is one of the 10 datasets composing the GLUE benchmark, an academic benchmark used to measure the performance of ML models across 10 different text classification tasks.

In [8]:
#@ GETTING THE DATASET:
raw_datasets = load_dataset("glue", "mrpc")             # Getting MRPC Dataset. 
raw_datasets                                            # Inspecting Dataset. 

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
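
- As a quick sanity check, the three split sizes printed above should add up to the 5,801 sentence pairs mentioned in the dataset description. The sketch below only uses the numbers from the output. The dictionary name `split_sizes` is mine, not part of the `datasets` API.

```python
# Row counts copied from the DatasetDict printed above.
split_sizes = {"train": 3668, "validation": 408, "test": 1725}

total_pairs = sum(split_sizes.values())

# Share of the corpus that falls in each split.
fractions = {name: size / total_pairs for name, size in split_sizes.items()}

print(total_pairs)
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```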

In [9]:
#@ INSPECTING TRAINING DATASET: 
raw_train_dataset = raw_datasets["train"]               # Training Dataset. 
raw_train_dataset[15]                                    # Inspection. 

{'idx': 16,
 'label': 0,
 'sentence1': 'Rudder was most recently senior vice president for the Developer & Platform Evangelism Business .',
 'sentence2': 'Senior Vice President Eric Rudder , formerly head of the Developer and Platform Evangelism unit , will lead the new entity .'}

In [10]:
#@ INSPECTING TYPE OF COLUMNS:
raw_train_dataset.features

{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}
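
- The `ClassLabel` feature above stores labels as integers and keeps the names on the side, so label `0` means `not_equivalent` and label `1` means `equivalent`. Below is a minimal plain Python sketch of that mapping. The helper functions are written by hand for illustration, although the real `ClassLabel` object offers `int2str` and `str2int` methods with the same purpose.

```python
# Names copied from the ClassLabel feature printed above; the list index is the label id.
label_names = ["not_equivalent", "equivalent"]

def int2str(label_id):
    """Map an integer label to its name."""
    return label_names[label_id]

def str2int(name):
    """Map a label name back to its integer id."""
    return label_names.index(name)

# The training example inspected earlier has label 0.
print(int2str(0))
print(str2int("equivalent"))
```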

**PREPROCESSING THE DATASET:**
- To preprocess the dataset, we convert the text into numbers the model can make sense of, using the tokenizer.

In [11]:
#@ INITIALIZING TOKENIZATION:
tokenized_1 = tokenizer(raw_datasets["train"]["sentence1"])                  # Tokenization. 
tokenized_2 = tokenizer(raw_datasets["train"]["sentence2"])                  # Tokenization. 

In [12]:
#@ IMPLEMENTING TOKENIZER:
inputs = tokenizer("This is a first sentence", 
                   "This is a second sentence")                              # Tokenization. 
inputs

{'input_ids': [101, 2023, 2003, 1037, 2034, 6251, 102, 2023, 2003, 1037, 2117, 6251, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
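
- The output above follows the pattern `[CLS] sentence1 [SEP] sentence2 [SEP]`: `token_type_ids` is `0` for the first segment (including `[CLS]` and the first `[SEP]`) and `1` for the second segment. The sketch below rebuilds the same dictionary by hand from already tokenized ids. The function `encode_pair` is a hypothetical helper and not part of the `transformers` API, and the ids `101` and `102` are read off the output above.

```python
CLS_ID, SEP_ID = 101, 102   # Special token ids as they appear in the output above.

def encode_pair(ids_a, ids_b):
    """Assemble a BERT-style sentence-pair encoding from two lists of token ids."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # No padding here, so every position is a real token.
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}

first = [2023, 2003, 1037, 2034, 6251]    # "this is a first sentence"
second = [2023, 2003, 1037, 2117, 6251]   # "this is a second sentence"
encoded = encode_pair(first, second)
print(encoded)
```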

In [14]:
#@ DEFINING TOKENIZATION FUNCTION:
def tokenize_function(example):                                               # Defining Function. 
    return tokenizer(example["sentence1"], example["sentence2"], 
                     truncation=True)                                         # Implementation of Tokenizer. 

#@ IMPLEMENTATION OF FUNCTION:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)        # Initializing Tokenization. 
tokenized_datasets                                                            # Inspection. 


Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-c44a7f7bf29f5b58.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-3e01f4e4c8ccd211.arrow


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

**DYNAMIC PADDING**
- The function responsible for putting samples together into a batch is called a `collate function`. **Dynamic Padding** means padding the samples of each batch only to the length of the longest sample in that batch, rather than padding the whole dataset to one global maximum length.

In [15]:
#@ IMPLEMENTATION OF COLLATOR FUNCTION: 
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)    # Initialization. 

#@ PREPARING SAMPLES FOR THE COLLATOR FUNCTION: 
samples = tokenized_datasets["train"][:8]                       # Taking First 8 Samples. 
samples = {k:v for k,v in samples.items() if 
           k not in ["idx", "sentence1", "sentence2"]}          # Dropping Unused Columns. 
[len(x) for x in samples["input_ids"]]                          # Inspecting Sequence Lengths. 

[50, 59, 47, 67, 59, 50, 62, 32]

In [16]:
#@ IMPLEMENTATION OF COLLATOR FUNCTION: 
batch = data_collator(samples)                  # Implementation. 
{k:v.shape for k, v in batch.items()}           # Inspection. 

{'attention_mask': torch.Size([8, 67]),
 'input_ids': torch.Size([8, 67]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 67])}
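
- The eight samples have lengths between 32 and 67, and the collator pads all of them to 67, the longest in this batch. The sketch below reproduces that behaviour in plain Python on dummy sequences. The function `pad_batch` is a hypothetical stand-in for what `DataCollatorWithPadding` does, and the padding id `0` is an assumption that matches the `[PAD]` token of `bert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Lengths copied from the output above; the token ids themselves are dummies.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
dummy_sequences = [[1] * n for n in lengths]

padded_ids, padded_mask = pad_batch(dummy_sequences)
padding_tokens = sum(row.count(0) for row in padded_mask)
print(len(padded_ids), len(padded_ids[0]), padding_tokens)
```

- Padding per batch keeps the number of wasted positions small. A different batch with shorter sentences would be padded to a shorter length.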

**TRAINING:**

In [17]:
#@ INITIALIZING TRAINING:
training_args = TrainingArguments(checkpoint)                                           # Training Arguments: Output Directory Named After Checkpoint. 
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)    # Initializing Classification Model. 


#@ INITIALIZING TRAINER:
trainer = Trainer(model, training_args, train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["validation"], 
                  data_collator=data_collator, tokenizer=tokenizer)                     # Initializing Trainer.
trainer.train()                                                                         # Training the Model.  


| Step | Training Loss |
|------|---------------|
| 500  | 0.5626        |
| 1000 | 0.366         |


Saving model checkpoint to bert-base-uncased/checkpoint-500
Configuration saved in bert-base-uncased/checkpoint-500/config.json
Model weights saved in bert-base-uncased/checkpoint-500/pytorch_model.bin
tokenizer config file saved in bert-base-uncased/checkpoint-500/tokenizer_config.json
Special tokens file saved in bert-base-uncased/checkpoint-500/special_tokens_map.json
Saving model checkpoint to bert-base-uncased/checkpoint-1000
Configuration saved in bert-base-uncased/checkpoint-1000/config.json
Model weights saved in bert-base-uncased/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in bert-base-uncased/checkpoint-1000/tokenizer_config.json
Special tokens file saved in bert-base-uncased/checkpoint-1000/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1377, training_loss=0.40232000856531125, metrics={'train_runtime': 397.7197, 'train_samples_per_second': 27.668, 'train_steps_per_second': 3.462, 'total_flos': 405470580750720.0, 'train_loss': 0.40232000856531125, 'epoch': 3.0})
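
- The `global_step=1377` above can be derived from the dataset size. The sketch below assumes a batch size of 8 and 3 epochs, which are the values the `Trainer` logs in this notebook report, and a checkpoint interval of 500 steps, which I infer from the `checkpoint-500` and `checkpoint-1000` directories in the output.

```python
import math

num_examples = 3668     # Rows in the training split.
batch_size = 8          # Assumed from the Trainer logs in this notebook.
num_epochs = 3          # Assumed from the Trainer logs in this notebook.
save_steps = 500        # Inferred from the checkpoint directory names.

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which a checkpoint is written during training.
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps, checkpoint_steps)
```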

**EVALUATION:**

In [19]:
#@ INITIALIZING MODEL EVALUATION:
predictions = trainer.predict(tokenized_datasets["validation"])         # Getting Predictions. 
print(predictions.predictions.shape, predictions.label_ids.shape)       # Inspecting Predictions. 

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, idx, sentence1. If sentence2, idx, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 408
  Batch size = 8


(408, 2) (408,)


In [24]:
#@ INSPECTING MODEL PREDICTION:
preds = np.argmax(predictions.predictions, axis=-1)                     # Getting Maximum Index.
metric = load_metric("glue", "mrpc")                                    # Initializing Metrics. 
metric.compute(predictions=preds, references=predictions.label_ids)     # Computing Metrics. 

{'accuracy': 0.8676470588235294, 'f1': 0.9072164948453608}
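
- The two numbers above are accuracy and F1 for the positive class (`equivalent`, label `1`). The sketch below shows how both follow from the logits, written in plain Python so that each step is visible. The logits and references are made-up toy values, and `accuracy_and_f1` is a hypothetical helper, not the implementation behind `load_metric`.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_f1(preds, refs, positive=1):
    """Accuracy over all examples and F1 for the positive class."""
    correct = sum(p == r for p, r in zip(preds, refs))
    tp = sum(p == positive and r == positive for p, r in zip(preds, refs))
    fp = sum(p == positive and r != positive for p, r in zip(preds, refs))
    fn = sum(p != positive and r == positive for p, r in zip(preds, refs))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(refs), "f1": f1}

# Toy logits and references (assumptions for illustration only).
toy_logits = [[0.2, 0.8], [0.9, 0.1], [0.3, 0.7], [0.6, 0.4]]
toy_refs = [1, 0, 0, 1]

toy_preds = [argmax(row) for row in toy_logits]
toy_scores = accuracy_and_f1(toy_preds, toy_refs)
print(toy_preds, toy_scores)
```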

In [31]:
#@ DEFINING FUNCTION FOR COMPUTING METRICS:
def compute_metrics(eval_preds):                                        # Defining Function. 
    metric = load_metric("glue", "mrpc")                                # Initializing Metrics. 
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)                            # Getting Maximum Index. 
    return metric.compute(predictions=predictions, references=labels)    # Getting Metrics. 

In [32]:
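#@ DEFINING NEW TRAINER: 
#@ NOTE: AS WRITTEN, `model` IS REUSED FROM THE PREVIOUS RUN, SO THIS CONTINUES FINE-TUNING IT. RELOAD IT FROM `checkpoint` TO START FRESH. 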
training_args = TrainingArguments("test-trainer", 
                                  evaluation_strategy="epoch")          # Initializing Training Arguments. 
trainer = Trainer(model, training_args, 
                  train_dataset=tokenized_datasets["train"],            # Initializing Training Datasets. 
                  eval_dataset=tokenized_datasets["validation"],        # Initializing Validation Datasets. 
                  data_collator=data_collator, tokenizer=tokenizer, 
                  compute_metrics=compute_metrics)                      # Initializing Trainer. 
trainer.train()                                                         # Training the Model. 

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, idx, sentence1. If sentence2, idx, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1377


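| Epoch | Training Loss | Validation Loss | Accuracy | F1       |
|-------|---------------|-----------------|----------|----------|
| 1     | No log        | 0.838388        | 0.835784 | 0.879713 |
| 2     | 0.156100      | 0.802094        | 0.848039 | 0.892734 |
| 3     | 0.088100      | 0.924342        | 0.835784 | 0.886248 |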


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, idx, sentence1. If sentence2, idx, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8
Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: sentence2, idx, sentence1. If sentence2, idx, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely i

TrainOutput(global_step=1377, training_loss=0.10326220942478553, metrics={'train_runtime': 398.7433, 'train_samples_per_second': 27.597, 'train_steps_per_second': 3.453, 'total_flos': 405470580750720.0, 'train_loss': 0.10326220942478553, 'epoch': 3.0})