# Transformers: Revolutionizing Natural Language Processing

Transformer models have become the dominant architecture in natural language processing (NLP). They have significantly advanced what NLP models can do, achieving state-of-the-art performance on tasks such as text classification, machine translation, and question answering.

[Hugging Face](https://huggingface.co/) offers one of the most practical and comprehensive [libraries](https://huggingface.co/docs/transformers/main/en/index) for working with transformers and pre-trained models.

**Hugging Face Transformers**: This library provides a user-friendly toolkit for working with transformers. It gives access to a wide range of pre-trained transformer models, so you can build on recent advances in NLP without training from scratch. With Hugging Face Transformers, you can quickly implement transformer-based solutions for tasks such as text classification, machine translation, and question answering.

In [None]:
!pip install transformers datasets 2>&1
!pip install accelerate -U  2>&1
!pip install transformers[torch]  2>&1

We will use the Hugging Face Transformers library to implement a transformer-based text classifier.  
First, we will load our data and prepare it for use with torch and the transformers library. 

In [None]:
import pandas as pd

train_df = pd.read_csv('train.csv', names=["label", "title", "text"]).sample(40000)
test_df = pd.read_csv('test.csv', names=["label", "title", "text"]).sample(2000)
train_text, train_labels = train_df["text"], train_df['label']-1
test_text, test_labels = test_df["text"], test_df['label']-1

We will use the `distilbert-base-uncased` model, which is a distilled version of the popular [BERT](https://arxiv.org/abs/1810.04805) model. Distilled models are smaller and faster than their full-size counterparts, making them ideal for applications with limited computational resources.  
BERT uses a particular tokenization method called [WordPiece](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt), which is different from the tokenization method we used previously. Therefore, we will need to use the matching tokenizer class, here `DistilBertTokenizerFast`, to tokenize our data.  
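
To give an intuition for WordPiece, here is a minimal sketch of its greedy longest-match-first splitting of a single word. The function `wordpiece_tokenize` and the vocabulary `toy_vocab` are hypothetical names made up for this illustration; the real tokenizer also normalizes the text, handles punctuation, and uses the vocabulary shipped with the model.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPiece tokens by greedy longest match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, NOT the real distilbert-base-uncased vocabulary.
toy_vocab = {"token", "##ization", "##s", "un", "##related"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("tokens", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Words that are not in the vocabulary are broken into smaller known pieces, which is why the tokenized text can contain more tokens than words.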

In [None]:
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

class NlpDataset(Dataset):
    def __init__(self, data, labels, tokenizer):
        self.data = data.tolist()
        self.labels = labels.tolist()
        # Tokenize everything once; padding=True pads to the longest sequence
        self.encodings = tokenizer(self.data, truncation=True, padding=True)

    def __getitem__(self, idx):
        # Build one sample: input_ids, attention_mask and the label as tensors
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = NlpDataset(train_text, train_labels, tokenizer)
test_dataset = NlpDataset(test_text,test_labels, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64)

Look at a sample yielded by your train loader.  
It's a dictionary with the following keys:
- input_ids: the token ids of the text
- attention_mask: a mask that marks which positions are padding
- labels: the class labels

The tokenizer pads the sequences so that they all have the same length. The attention mask contains 1 for real tokens and 0 for padding, which tells the model to ignore the padded positions.  
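
The relation between padding and the attention mask can be sketched in plain Python. The helper `pad_batch` is a hypothetical name, the token ids are made up, and `pad_id=0` is an assumption about the padding id; in the notebook the tokenizer does all of this for you.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # right-pad with pad_id
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for illustration, not produced by a real tokenizer.
batch = pad_batch([[101, 7592, 2088, 102], [101, 7592, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Every row of `input_ids` now has the same length, and the zeros in `attention_mask` line up exactly with the padded positions.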

In [None]:
next(iter(train_loader))

We will now instantiate our model and wrap it in a PyTorch module.
In this practical session we will freeze the pre-trained model and only train the classification head.  
This is not the best practice when using BERT, but we will do it for the sake of simplicity.  
Pay particular attention to the `forward` method.  
It returns a tuple with the logits, the hidden states, and the attentions.  
The logits are the outputs of the classification head.
The hidden states are the embedding output followed by the output of every layer; they contain the embeddings of the text.
The attentions are the attention weights of each layer.

In [None]:
from transformers import  DistilBertForSequenceClassification
import torch.nn as nn

class BertClf(nn.Module):

    def __init__(self, distilbert):
        super().__init__()

        self.distilbert = distilbert
        # Freeze every parameter except those of the classification head
        for name, param in self.distilbert.named_parameters():
            if "classifier" not in name:
                param.requires_grad = False

    def forward(self, sent_id, mask):
        out = self.distilbert(sent_id, attention_mask=mask)
        logits = out.logits                # (batch, num_labels)
        attn = out.attentions              # one tensor per layer
        hidden_states = out.hidden_states  # embedding output + one tensor per layer

        return logits, hidden_states, attn

distilbert = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased",
                                                                  num_labels=4,
                                                                  output_attentions=True,
                                                                  output_hidden_states=True)

model = BertClf(distilbert)

Now complete the following training and testing functions.
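
Before filling in the blanks, it may help to see the two quantities these functions track, the loss and the accuracy, computed by hand from raw logits. This is only a sketch in plain Python: cross-entropy is a common choice for multi-class classification, but treating it as the intended criterion is an assumption, and the helper names and numbers below are made up.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example: -log softmax(logits)[label]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]


def accuracy(batch_logits, labels):
    """Fraction of examples whose highest logit matches the label."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)


# Two made-up examples with four classes each.
batch_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]]
labels = [0, 1]
losses = [cross_entropy(row, y) for row, y in zip(batch_logits, labels)]
print(losses)
print(accuracy(batch_logits, labels))
```

The first example is classified correctly and gets a small loss; the second puts most of its probability on the wrong class and gets a much larger one.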

In [None]:
from tqdm.notebook import tqdm

def train_bert(model, optimizer, dataloader, epochs):
    model.train()
    for epoch in range(epochs):
        running_loss = 0.0
        running_corrects = 0
        total = 0
        t = tqdm(dataloader)
        for i, batch in enumerate(t):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)
            preds, _, _ = model(...)
            loss = criterion(...)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            _, predicted = preds.max(1)
            running_corrects += predicted.eq(labels).sum().item()
            total += labels.size(0)
            running_loss += loss.item()

            t.set_description(f"epoch:{epoch} loss: {(running_loss / (i+1)):.4f} current accuracy:{round(running_corrects / total * 100, 2)}%")

def test_bert(model, dataloader):
    model.eval()
    test_corrects = 0
    total = 0
    with torch.no_grad():
        for batch in tqdm(dataloader):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            preds, _, _ = model(...)
            _, predicted = preds.max(1)
            test_corrects += predicted.eq(labels).sum().item()
            total += labels.size(0)
    return test_corrects / total

We will now compare the performance of our transformer-based model with the previous models.  

In [None]:
# The AdamW shipped with transformers is deprecated; use the PyTorch implementation
from torch.optim import AdamW

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
optimizer = AdamW(model.parameters(), lr=1e-5)
criterion  = ...
n_epochs = 1

train_bert(model, optimizer, train_loader, n_epochs)
test_bert(model, test_loader)

The following function gets the embedding of each text.
BERT uses a special token, `[CLS]`, to represent the whole text.  
We will use the last-layer hidden state of this token as the embedding of the text.

In [None]:
import numpy as np
test_loader = DataLoader(test_dataset, batch_size=1)


def get_embeddings(model, dataloader):
    model.eval()
    embeddings = []
    labels = []
    with torch.no_grad():
      for batch in tqdm(dataloader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels.append(batch["labels"].item())

        _, emb, _ = model(input_ids, mask=attention_mask)
        # emb[-1] is the last layer, shape (1, seq_len, hidden); position 0 is [CLS]
        last_layer_cls = emb[-1][:, 0, :]
        embeddings.append(last_layer_cls.squeeze(0).cpu().numpy())
    embeddings = np.array(embeddings)
    return embeddings, labels

embeddings, labels = get_embeddings(model, test_loader)

Now project the embeddings of each text to two dimensions with t-SNE and plot them.  
What do you think of these representations?

In [None]:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import seaborn as sns

...

We can use the [**bertviz** library](https://github.com/jessevig/bertviz) to visualize the attention weights of the model.
Remember that the attention weights represent the relations between the tokens.
Attention weights are very useful for understanding how the model works and what it focuses on.
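
As a reminder of where these weights come from, here is a minimal sketch of scaled dot-product attention weights for a single head, written in plain Python. It deliberately leaves out the learned query and key projections, and the function name and the three toy vectors are made up for illustration.

```python
import math


def attention_weights(queries, keys):
    """Return softmax(Q K^T / sqrt(d)) row by row for small lists of vectors."""
    dim = len(keys[0])
    weights = []
    for q in queries:
        # Scaled dot product between this query and every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        max_score = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Three made-up 2-dimensional token vectors, used as both queries and keys.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(vectors, vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1 and tells how much one token attends to every token in the sequence; bertviz draws one such matrix per layer and per head.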

In [None]:
!pip install bertviz

Here is a text example:

In [None]:
from bertviz import model_view, head_view

sentence = test_df["text"].iloc[33]
tokenized = tokenizer(sentence)
print(sentence)
print(tokenized)

We will use the `model_view` function to visualize the attention weights of the model.  
Here, each row represents a layer of the model and each column represents a head of the model.
The darker the color, the higher the attention weight.  
You may observe that different attention heads capture different relations between words.


In [None]:
inputs = torch.tensor(tokenized["input_ids"]).unsqueeze(0).to(device)
mask = torch.tensor(tokenized["attention_mask"]).unsqueeze(0).to(device)
outputs = model(inputs, mask=mask)
attention = outputs[-1]
tokens = tokenizer.convert_ids_to_tokens(inputs[0])
model_view(attention, tokens)

The `head_view` function lets you visualize the attention weights of a particular head of the model.

In [None]:
head_view(attention, tokens)

That's it for this practical session.
By now, you should have a good understanding of how to encode text data and use it to train a machine learning model.  
You should also be familiar with the latest advancements in NLP, including transformers and pre-trained models.   
We hope you enjoyed this practical session and that you will be able to apply what you've learned to your own projects.