<a href="https://colab.research.google.com/github/KevinLolochum/BERT-MODELS/blob/main/DistillBERT_For_Sentiment_Analysis_in_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**DistilBERT**

* As explained earlier, DistilBERT is a smaller, distilled version of BERT.
* It preserves most of BERT's task accuracy while being faster and requiring less compute.
* We will check these claims first-hand in this task.

1. **Downloading model and libraries**

In [None]:
!pip install transformers

In [3]:
# Libraries and transformers models
import torch
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from torch.utils.data import TensorDataset


2. **Instantiate word tokenizer**

* Similar to BERT, we will instantiate the tokenizer from the pretrained base cased checkpoint.

In [None]:
# Instantiate the tokenizer from the pretrained checkpoint
MODEL_NAME = "distilbert-base-cased"
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)

3. **Loading the data and preprocessing**

* We will use the Quora questions dataset.
* This is a binary classification task: determine whether a question is insincere (for example, offensive) or not.

In [5]:
# Loading the Quora questions dataset.

df = pd.read_csv('https://archive.org/download/fine-tune-bert-tensorflow-train.csv/train.csv.zip', compression= 'zip', low_memory= False)

df.shape

(1306122, 3)

In [6]:
# Splitting the data into training and validation sets.
# We stratify to make sure that the classes are proportionately represented.
from sklearn.model_selection import train_test_split

train_df, remaining = train_test_split(df, random_state = 42, train_size = 0.0075, stratify = df.target.values)
valid_df, _ = train_test_split(remaining, random_state =42, train_size = 0.00075, stratify = remaining.target.values)
train_df.shape, valid_df.shape

((9795, 3), (972, 3))
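
To see what `stratify` does in the split above, here is a minimal sketch on a synthetic, imbalanced label vector. The data is made up (6% positives) and is not the Quora set; the names `labels`, `train_y`, `rest_y`, and `positive_rate` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 940 negatives and 60 positives (6% positive rate).
labels = np.array([0] * 940 + [1] * 60)
indices = np.arange(len(labels))

# Stratified 50/50 split on the label vector.
train_idx, rest_idx = train_test_split(
    indices, random_state=42, train_size=0.5, stratify=labels
)
train_y, rest_y = labels[train_idx], labels[rest_idx]


def positive_rate(y):
    """Fraction of examples whose label is 1."""
    return float(np.mean(y))


# Both halves keep the positive rate of the full label vector.
print(positive_rate(labels), positive_rate(train_y), positive_rate(rest_y))
```

Without `stratify`, a small sample from a heavily imbalanced dataset can end up with noticeably more or fewer positives than the full data.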

In [7]:
# exploring the columns in the data

train_df.head(2)

| | qid | question_text | target |
|---|---|---|---|
| 24766 | 04dad2b5f2eb9d7b9584 | Why are unhealthy relationships so desirable? | 0 |
| 1184991 | e8389c8a9fc7db099491 | Which war changed the course of history of the... | 0 |


In [8]:
# Map each class label to an integer index.

possible_labels = train_df.target.unique()
label_dict = {label: index for index, label in enumerate(possible_labels)}

4. **Tokenizing the data**

In [12]:
# Tokenizing the questions in the dataset.
# truncation=True guards against questions longer than max_length,
# which would otherwise break the conversion to fixed-size tensors.

encoded_data_train = tokenizer.batch_encode_plus(
    train_df.question_text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    valid_df.question_text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)


input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(train_df.target.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(valid_df.target.values)
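
The tokenizer call above pads every question to `max_length=256` and returns an attention mask alongside the input ids. As a rough illustration of what those two tensors contain, here is a simplified pure-Python sketch. The helper `pad_and_mask`, the pad id of 0, and the token ids are assumptions for illustration; a real tokenizer also handles special tokens when truncating.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical token ids for a short question (not looked up in a real vocabulary).
ids, mask = pad_and_mask([101, 7, 8, 9, 102], max_length=8)
print(ids)   # [101, 7, 8, 9, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```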


* `TensorDataset` wraps the tensors so that each index returns one example, which streamlines data loading so that it does not become a bottleneck.
* See this [Stanford article](https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel) if you want to create your own data generator.
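
As a small sketch of what `TensorDataset` does, the toy example below wraps three made-up tensors (the values are placeholders, not real token ids) and indexes into the result. The `toy_*` names are illustrative.

```python
import torch
from torch.utils.data import TensorDataset

# Toy tensors standing in for input ids, attention masks, and labels.
toy_ids = torch.tensor([[101, 7, 102, 0], [101, 8, 9, 102]])
toy_masks = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])
toy_labels = torch.tensor([0, 1])

toy_dataset = TensorDataset(toy_ids, toy_masks, toy_labels)

# Indexing returns one (input_ids, attention_mask, label) tuple per example.
first_ids, first_mask, first_label = toy_dataset[0]
print(len(toy_dataset), first_ids.tolist(), first_mask.tolist(), first_label.item())
```

All wrapped tensors must share the same first dimension, since that dimension indexes the examples.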



In [13]:
# Wrapping the input ids, attention masks, and labels in TensorDatasets.

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

5. **Instantiating the model**

In [None]:
# For classification, we don't need the hidden states or output attentions

model = DistilBertForSequenceClassification.from_pretrained(MODEL_NAME,
                                                      num_labels=len(train_df.target.unique()),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

6. **Creating data loader.**

* The data loader batches the dataset and supports both map-style and iterable-style datasets during training.
* The sampler class specifies the sequence of keys/indices used in data loading.
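
To make the role of the sampler concrete, here is a minimal sketch on a toy dataset of six integers (made up for illustration). It compares the visiting order produced by `SequentialSampler` and `RandomSampler`.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset

# Toy dataset holding the integers 0..5.
toy = TensorDataset(torch.arange(6))

seq_loader = DataLoader(toy, sampler=SequentialSampler(toy), batch_size=2)
rand_loader = DataLoader(toy, sampler=RandomSampler(toy), batch_size=2)

# Each batch is a one-element list because the dataset wraps a single tensor.
seq_order = [x.item() for (batch,) in seq_loader for x in batch]
rand_order = [x.item() for (batch,) in rand_loader for x in batch]

print(seq_order)           # indices visited in their original order
print(sorted(rand_order))  # same items as above, but visited in a shuffled order
```

Shuffling is generally used for training, while a fixed order is convenient for validation because predictions line up with the original rows.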



In [15]:
# Creating a data loader and specifying the sampler.
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 32

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val), 
                                   batch_size=batch_size)

7. **Optimizer and learning-rate scheduler**

* We will use the same parameters as suggested by the authors of the DistilBERT paper (same as BERT).

In [16]:

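from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

epochs = 10

# torch.optim.AdamW replaces the deprecated transformers.AdamW;
# weight_decay=0.0 matches the old transformers default.
optimizer = AdamW(model.parameters(),
                  lr=1e-5,
                  eps=1e-8,
                  weight_decay=0.0)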
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
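
With `num_warmup_steps=0`, the linear schedule simply decays the learning rate from its initial value to zero over all training steps. The sketch below re-implements that rule in plain Python for illustration. It is not the library code, and the helper name `linear_schedule_factor` is hypothetical. The step count assumes 9795 training rows and a batch size of 32, as in this notebook.

```python
import math


def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
steps_per_epoch = math.ceil(9795 / 32)   # batches per epoch
total_steps = steps_per_epoch * 10       # 10 epochs

# Learning rate at the start, the midpoint, and the end of training.
for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```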


*   Defining performance metrics

In [17]:
from sklearn.metrics import f1_score

# Weighted F1 score
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

# Accuracy per class
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
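
For intuition about the weighted F1 used above, here is a hand-rolled NumPy sketch of the same idea: compute F1 per class, then average with each class weighted by its number of true examples. The function `weighted_f1` and the toy logits are made up for illustration; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
import numpy as np


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0.0
    for cls in np.unique(y_true):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * np.sum(y_true == cls)
    return float(total / len(y_true))


# Made-up logits for four examples and two classes.
toy_logits = np.array([[2.0, -1.0], [0.3, 0.9], [1.5, 0.2], [-0.5, 0.1]])
toy_labels = np.array([0, 1, 1, 1])

# argmax over the class axis turns logits into predicted labels.
toy_preds = np.argmax(toy_logits, axis=1)
toy_score = weighted_f1(toy_labels, toy_preds)
print(toy_preds.tolist(), toy_score)
```

Because the weights follow class frequency, a dominant majority class can keep the weighted F1 high even when the minority class is predicted poorly.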

8. **Training and Evaluation**

In [18]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [19]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [20]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [21]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
        # The loss is already averaged over the batch; len(batch) would be 3 (the tuple size).
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
         
        
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')


Epoch 1
Training loss: 0.19150088959849812
Validation loss: 0.13098876908301346
F1 Score (Weighted): 0.9496622209657303



Epoch 2
Training loss: 0.11937134310372587
Validation loss: 0.12822644082048246
F1 Score (Weighted): 0.9479490242931102



Epoch 3
Training loss: 0.09077647526320737
Validation loss: 0.12636700769766204
F1 Score (Weighted): 0.9521959421192845



Epoch 4
Training loss: 0.06452248478687524
Validation loss: 0.12184388140937494
F1 Score (Weighted): 0.9527983090324057



Epoch 5
Training loss: 0.03967405256401165
Validation loss: 0.17266173461534745
F1 Score (Weighted): 0.9585499214297147



Epoch 6
Training loss: 0.02801150223448818
Validation loss: 0.1770238999874451
F1 Score (Weighted): 0.9562643725772176



Epoch 7
Training loss: 0.018525522642223624
Validation loss: 0.20767316551439674
F1 Score (Weighted): 0.9597564366531107



Epoch 8
Training loss: 0.012308976224586458
Validation loss: 0.21568118938387582
F1 Score (Weighted): 0.9584725190969332



Epoch 9
Training loss: 0.008746433648370291
Validation loss: 0.21920630228593044
F1 Score (Weighted): 0.9579722261571124



Epoch 10
Training loss: 0.00867491633481883
Validation loss: 0.2227819039225353
F1 Score (Weighted): 0.9566959842922081
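
The log above can be scanned for the best checkpoint rather than defaulting to the last epoch. The sketch below copies the validation losses and F1 scores from the log (rounded to five decimals) and picks the epoch with the lowest loss and the one with the highest F1. The checkpoint filename follows the pattern used in the training loop; the variable names are illustrative.

```python
# Validation metrics per epoch, copied from the training log and rounded.
val_loss = [0.13099, 0.12823, 0.12637, 0.12184, 0.17266,
            0.17702, 0.20767, 0.21568, 0.21921, 0.22278]
val_f1 = [0.94966, 0.94795, 0.95220, 0.95280, 0.95855,
          0.95626, 0.95976, 0.95847, 0.95797, 0.95670]

# Epochs are 1-indexed in the log, so shift the list index by one.
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_f1_epoch = max(range(len(val_f1)), key=val_f1.__getitem__) + 1

print(f'Lowest validation loss at epoch {best_loss_epoch}')
print(f'Highest weighted F1 at epoch {best_f1_epoch}')
print(f'Checkpoint with lowest loss: finetuned_BERT_epoch_{best_loss_epoch}.model')
```

Validation loss bottoms out at epoch 4 and rises afterwards while training loss keeps falling, which is a common sign of overfitting in the later epochs.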



9. **Conclusion.**

* DistilBERT trains about **twice** as fast as BERT (***36 vs 73 minutes***).
* The **accuracy** and the **loss** are almost indistinguishable from BERT's.
* Like BERT, DistilBERT does very well in classification tasks, with a weighted F1 of about 0.95 to 0.96 on the validation set.



In [None]:
_, predictions, true_vals = evaluate(dataloader_validation)

In [None]:
accuracy_per_class(predictions, true_vals)

Class: 0
Accuracy: 898/912

Class: 1
Accuracy: 37/60
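
The per-class counts above are, in effect, per-class recall: correct predictions divided by the number of true examples in that class. A short sketch turning the printed counts into rates, with illustrative variable names:

```python
# Counts copied from the per-class output: correct predictions and class sizes.
correct = {0: 898, 1: 37}
support = {0: 912, 1: 60}

# Recall per class, and overall accuracy across both classes.
recall = {cls: correct[cls] / support[cls] for cls in correct}
overall_accuracy = sum(correct.values()) / sum(support.values())

for cls, value in recall.items():
    print(f'Class {cls}: recall = {value:.4f}')
print(f'Overall accuracy = {overall_accuracy:.4f}')
```

The overall accuracy is dominated by class 0, so it hides the much lower recall on the minority class 1.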

