**INITIALIZATION:**
- I run these three lines at the top of each of my notebooks. The first two enable the `autoreload` extension, which reloads imported modules before each cell runs, so edits to the project's source files are picked up without restarting the kernel. The third line renders Matplotlib plots inline in the notebook.

In [1]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**LIBRARIES AND DEPENDENCIES:**
- I import all the libraries and dependencies required for the project in a single cell.

In [3]:
#@ INSTALLING DEPENDENCIES: UNCOMMENT BELOW: 
# !pip install datasets transformers[sentencepiece]

In [19]:
#@ IMPORTING LIBRARIES AND DEPENDENCIES:
import torch
import transformers
import datasets
from datasets import load_dataset
from torch.optim import AdamW                                           # transformers.AdamW is Deprecated in Favor of the PyTorch Implementation. 
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from transformers import TrainingArguments
from transformers import Trainer

#@ IGNORING WARNINGS: 
import warnings
warnings.filterwarnings("ignore")

**PROCESSING THE DATA:**

In [6]:
#@ PROCESSING THE DATA:
checkpoint = "bert-base-uncased"                                        # Initialization. 
tokenizer = AutoTokenizer.from_pretrained(checkpoint)                   # Initializing Tokenizer. 
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)  # Initializing Sequence Model. 
sequences = [
        "I've been waiting for a HuggingFace course my whole life",
        "This course is amazing!"
]                                                                       # Text Sequences. 
batch = tokenizer(sequences, padding=True, truncation=True, 
                  return_tensors="pt")                                  # Getting Batch of Tensors. 
batch["labels"] = torch.tensor([1, 1])                                  # Initializing Labels. 

#@ RUNNING A SINGLE TRAINING STEP:
optimizer = AdamW(model.parameters())                                   # Initializing Optimizer. 
loss = model(**batch).loss                                              # Forward Pass: Computing Loss. 
loss.backward()                                                         # Back Propagation: Computing Gradients. 
optimizer.step()                                                        # Updating Parameters. 

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
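
- To make the last line of the cell above more concrete, here is a minimal sketch of what a single `optimizer.step()` does to one scalar parameter. It assumes the standard AdamW update rule with a learning rate of 1e-3 and, for simplicity, no weight decay. The function `adamw_step` and its arguments are hypothetical names used only for illustration; the real optimizer applies the same kind of update to every tensor in `model.parameters()`.

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.0):
    """One AdamW update for a single scalar parameter (illustrative only)."""
    beta1, beta2 = betas
    # Decoupled weight decay: shrink the parameter directly.
    param = param * (1.0 - lr * weight_decay)
    # Update the running averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Step against the gradient, scaled by its running magnitude.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# First step (t=1) for a parameter of 1.0 with gradient 0.5.
new_param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(new_param)  # close to 0.999
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's scale.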

**GETTING THE DATASET:**
- In this notebook, we will use the MRPC (Microsoft Research Paraphrase Corpus) dataset introduced by William B. Dolan and Chris Brockett. The dataset consists of 5,801 pairs of sentences, each with a label indicating whether the two sentences are paraphrases. It is one of the 10 datasets composing the GLUE benchmark, an academic benchmark used to measure the performance of ML models across 10 different text classification tasks.

In [8]:
#@ GETTING THE DATASET:
raw_datasets = load_dataset("glue", "mrpc")             # Getting MRPC Dataset. 
raw_datasets                                            # Inspecting Dataset. 

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
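
- As a quick sanity check, the three split sizes printed above should add up to the 5,801 sentence pairs of MRPC. The sketch below uses a plain dictionary with the row counts copied from the output; `split_sizes` is a hypothetical name, not part of the `datasets` API.

```python
# Row counts copied from the DatasetDict output above.
split_sizes = {"train": 3668, "validation": 408, "test": 1725}

total_pairs = sum(split_sizes.values())
print(total_pairs)  # 5801

# Share of each split as a percentage of all pairs.
for name, rows in split_sizes.items():
    print(f"{name}: {rows / total_pairs:.1%}")
```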

In [9]:
#@ INSPECTING TRAINING DATASET: 
raw_train_dataset = raw_datasets["train"]               # Training Dataset. 
raw_train_dataset[15]                                    # Inspection. 

{'idx': 16,
 'label': 0,
 'sentence1': 'Rudder was most recently senior vice president for the Developer & Platform Evangelism Business .',
 'sentence2': 'Senior Vice President Eric Rudder , formerly head of the Developer and Platform Evangelism unit , will lead the new entity .'}

In [10]:
#@ INSPECTING TYPE OF COLUMNS:
raw_train_dataset.features

{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}
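
- The `label` column stores integers, and the `names` list of the `ClassLabel` feature gives their meaning by position. A minimal sketch of that lookup with a plain list is shown below; `label_to_name` is a hypothetical helper. The `ClassLabel` feature in `datasets` offers an `int2str` method for the same purpose.

```python
# Class names in the order reported by the `label` feature above.
label_names = ["not_equivalent", "equivalent"]

def label_to_name(label_id):
    """Map an integer label to its class name (hypothetical helper)."""
    return label_names[label_id]

# The training example inspected earlier has label 0.
print(label_to_name(0))  # not_equivalent
print(label_to_name(1))  # equivalent
```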

**PREPROCESSING THE DATASET:**
- To preprocess the dataset, we convert the text into numbers the model can make sense of, with the help of the tokenizer.

In [11]:
#@ INITIALIZING TOKENIZATION:
tokenized_1 = tokenizer(raw_datasets["train"]["sentence1"])                  # Tokenization. 
tokenized_2 = tokenizer(raw_datasets["train"]["sentence2"])                  # Tokenization. 

In [12]:
#@ IMPLEMENTING TOKENIZER:
inputs = tokenizer("This is a first sentence", 
                   "This is a second sentence")                              # Tokenization. 
inputs

{'input_ids': [101, 2023, 2003, 1037, 2034, 6251, 102, 2023, 2003, 1037, 2117, 6251, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
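
- The output above shows how a sentence pair is laid out: `[CLS] sentence1 [SEP] sentence2 [SEP]`, with `token_type_ids` marking which segment each token belongs to. The sketch below rebuilds the same three lists by hand from the token ids in that output. `build_pair_inputs` is a hypothetical helper written for illustration; it is not how the tokenizer is implemented.

```python
CLS_ID, SEP_ID = 101, 102  # special token ids seen in the output above

def build_pair_inputs(ids_a, ids_b):
    """Assemble [CLS] A [SEP] B [SEP] with segment ids and attention mask."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # No padding has been applied, so every position is a real token.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

# Token ids copied from the output above, without the special tokens.
first = [2023, 2003, 1037, 2034, 6251]
second = [2023, 2003, 1037, 2117, 6251]

pair_inputs = build_pair_inputs(first, second)
print(pair_inputs["token_type_ids"])
```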

In [15]:
#@ DEFINING TOKENIZATION FUNCTION:
def tokenize_function(example):                                               # Defining Function. 
    return tokenizer(example["sentence1"], example["sentence2"], 
                     truncation=True)                                         # Implementation of Tokenizer. 

#@ IMPLEMENTATION OF FUNCTION:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)        # Initializing Tokenization. 
tokenized_datasets                                                            # Inspection. 

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-0ff16674a2d578cc.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-ec43bc18d628756c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-5e84305c8ce4ef89.arrow


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
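
- With `batched=True`, the function passed to `map` receives a dictionary of lists (many rows at once) instead of a single row, and the columns it returns are added to the dataset. The sketch below imitates that behaviour on plain Python dictionaries. Both `map_batched` and `toy_tokenize` are hypothetical stand-ins: the toy tokenizer only splits on whitespace, and the real `Dataset.map` does considerably more (caching, Arrow storage, multiprocessing).

```python
def toy_tokenize(batch):
    """Stand-in for tokenize_function: whitespace split, no vocabulary."""
    return {
        "tokens": [
            (s1 + " " + s2).split()
            for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
        ]
    }

def map_batched(columns, function, batch_size=1000):
    """Apply `function` to slices of a column dict and append its new columns."""
    num_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict of lists, one list per column.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        # This sketch assumes the function only returns new column names.
        for name, values in function(batch).items():
            result.setdefault(name, []).extend(values)
    return result

toy_columns = {
    "sentence1": ["This is a first sentence", "Another one"],
    "sentence2": ["This is a second sentence", "And its pair"],
}
mapped = map_batched(toy_columns, toy_tokenize, batch_size=1)
print(sorted(mapped))         # ['sentence1', 'sentence2', 'tokens']
print(len(mapped["tokens"]))  # 2
```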

**DYNAMIC PADDING:**
- The function responsible for putting samples together into a batch is called a `collate function`. **Dynamic Padding** means that the samples in each batch are padded only to the length of the longest sample in that batch, rather than to the maximum length in the whole dataset.

In [16]:
#@ IMPLEMENTATION OF COLLATOR FUNCTION: 
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)    # Initialization. 

#@ IMPLEMENTATION OF COLLATOR FUNCTION: INITIALIZATION: 
samples = tokenized_datasets["train"][:8]  
samples = {k:v for k,v in samples.items() if 
           k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

In [17]:
#@ IMPLEMENTATION OF COLLATOR FUNCTION: 
batch = data_collator(samples)                  # Implementation. 
{k:v.shape for k, v in batch.items()}           # Inspection. 

{'attention_mask': torch.Size([8, 67]),
 'input_ids': torch.Size([8, 67]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 67])}
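
- The shape `[8, 67]` above follows directly from the sample lengths: the longest of the eight samples has 67 tokens, so every other sample is padded up to 67. The sketch below reproduces that logic in plain Python. `pad_batch` is a hypothetical helper, the token values are placeholders, and the padding id of 0 is an assumption about the tokenizer's padding token.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask

# Dummy sequences with the lengths printed above; the values are placeholders.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
dummy = [[1] * n for n in lengths]

padded_ids, mask = pad_batch(dummy)
print(len(padded_ids), len(padded_ids[0]))  # 8 67
print(sum(mask[-1]))                        # 32
```

A batch made only of short samples would be padded to a smaller length, which is where the saving over padding to a fixed maximum length comes from.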

**TRAINING:**

In [20]:
#@ INITIALIZING TRAINING:
training_args = TrainingArguments(output_dir=checkpoint)                                # Initializing Training Arguments: Output Directory. 
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)    # Initializing Classification Model. 


#@ INITIALIZING TRAINER:
trainer = Trainer(model, training_args, train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["validation"], 
                  data_collator=data_collator, tokenizer=tokenizer)                     # Initializing Trainer.
trainer.train()                                                                         # Running Training.  

| Step | Training Loss |
|------|---------------|
| 500  | 0.5333        |
| 1000 | 0.3454        |


Saving model checkpoint to bert-base-uncased/checkpoint-500
Configuration saved in bert-base-uncased/checkpoint-500/config.json
Model weights saved in bert-base-uncased/checkpoint-500/pytorch_model.bin
tokenizer config file saved in bert-base-uncased/checkpoint-500/tokenizer_config.json
Special tokens file saved in bert-base-uncased/checkpoint-500/special_tokens_map.json
Saving model checkpoint to bert-base-uncased/checkpoint-1000
Configuration saved in bert-base-uncased/checkpoint-1000/config.json
Model weights saved in bert-base-uncased/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in bert-base-uncased/checkpoint-1000/tokenizer_config.json
Special tokens file saved in bert-base-uncased/checkpoint-1000/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1377, training_loss=0.37288119469159914, metrics={'train_runtime': 416.9757, 'train_samples_per_second': 26.39, 'train_steps_per_second': 3.302, 'total_flos': 405470580750720.0, 'train_loss': 0.37288119469159914, 'epoch': 3.0})
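
- The numbers in `TrainOutput` can be checked by hand. The sketch below assumes the `TrainingArguments` defaults of a per-device batch size of 8 and 3 epochs (the latter matches `'epoch': 3.0`), and training on a single device. All variable names are hypothetical.

```python
import math

num_train_rows = 3668     # from the DatasetDict output
batch_size = 8            # assumed: default per-device train batch size
num_epochs = 3            # assumed: default, matches 'epoch': 3.0
train_runtime = 416.9757  # seconds, from TrainOutput

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 459 1377

# Throughput figures reported in the metrics dictionary.
print(round(num_train_rows * num_epochs / train_runtime, 2))  # 26.39
print(round(total_steps / train_runtime, 3))                  # 3.302
```

With 1,377 steps in total, logging and checkpointing every 500 steps would explain why the loss is reported and a checkpoint is saved at steps 500 and 1000 only.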