<a href="https://colab.research.google.com/github/GammelNuuk/pythonprojects/blob/main/02_encoder_transformers_nlp_tasks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoder Only Transformer for NLP Tasks
Encoder-only models focus solely on the **encoder part of the transformer architecture**. They are primarily designed for NLP tasks involving understanding and representing input text, such as classification, named entity recognition, and more. These models are typically pre-trained on large datasets to create rich contextual embeddings and then fine-tuned on specific tasks.

### BERT-ology
BERT, or **Bidirectional Encoder Representations from Transformers**, was presented in 2018 by Devlin et al., a team at Google AI. BERT uses a transformer-style encoder whose number of encoder blocks depends on the model size. The key contribution of this family of models is the masked language modeling objective used during pre-training: some tokens in the input are masked, and the model is trained to predict them. Key works in this group of architectures are **BERT**, **RoBERTa** (a robustly optimized BERT), **DistilBERT** (a lighter and more efficient BERT), **ELECTRA** and **ALBERT**.


In this notebook we will work through three NLP tasks: masked language modeling, text classification and question answering. We will:
+ Leverage the Hugging Face ``transformers`` library
+ Have a quick look at various task-specific datasets
+ Explore pretrained vs fine-tuned models on various tasks
+ Bonus: We will leverage ``bertviz`` to understand the internals of the different models



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Setup Required Libraries

In [1]:
# Remember to restart the kernel after installation completes.
!pip3 install transformers==4.42.4
!pip3 install datasets==2.20.0
!pip3 install bertviz==1.4.0



In [6]:
# To avoid "ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject"
# !pip uninstall numpy
# !pip install numpy==1.26.4



# Imports

### HuggingFace Pipelines
The Hugging Face ``transformers`` library provides preset pipelines for popular NLP tasks. A pipeline is a convenient abstraction that takes care of the under-the-hood details: tokenisation, preparing model inputs, and pre- and post-processing.

In [2]:
import torch
import transformers
from transformers import pipeline

## Configs

In [3]:
# Let us define some configs/constants
DISTILBET_BASE_UNCASED_CHECKPOINT = "distilbert/distilbert-base-uncased"
DISTILBET_QA_CHECKPOINT = "distilbert/distilbert-base-uncased-distilled-squad"
DISTILBET_CLASSIFICATION_CHECKPOINT = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

In [4]:
if torch.cuda.is_available():
    DEVICE = 'cuda'
    Tensor = torch.cuda.FloatTensor
    LongTensor = torch.cuda.LongTensor
    DEVICE_ID = 0
elif torch.backends.mps.is_available():
    DEVICE = 'mps'
    Tensor = torch.FloatTensor
    LongTensor = torch.LongTensor
    DEVICE_ID = 0
else:
    DEVICE = 'cpu'
    Tensor = torch.FloatTensor
    LongTensor = torch.LongTensor
    DEVICE_ID = -1
print(f"Backend Accelerator Device={DEVICE}")

Backend Accelerator Device=cpu


## Time To Test Some Models

### Predicting the Masked Token
Compared with conventional NLP tasks such as classification, this was an unusual objective when BERT was introduced. It requires a dataset in which a fraction of the input tokens (15% in the original BERT setup) is masked, and the model is trained to predict those tokens. This objective turns out to be very effective in helping the model learn the nuances of language.

In this first task we test the pre-trained model against this objective itself. For each candidate, the pipeline returns the predicted token (``token_str``), its vocabulary index (``token``), the completed sentence (``sequence``) and a ``score`` indicating the model's confidence.
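To make the masking step concrete, here is a minimal sketch of how masked inputs and labels could be prepared. It assumes the recipe from the original BERT paper (15% of tokens are selected; of those, 80% are replaced by `[MASK]`, 10% by a random token and 10% are left unchanged). The function name `mask_tokens`, the whitespace tokenisation and the toy vocabulary are illustrative assumptions, not the tokenizer or data collator behind the checkpoints used in this notebook.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a toy masked-language-modelling example.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else (those positions would be ignored by the loss).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)          # this position becomes a prediction target
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)          # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))   # 10%: replace with a random token
            else:
                masked.append(token)               # 10%: keep the original token
        else:
            labels.append(None)
            masked.append(token)
    return masked, labels


# Toy example; mask_prob is raised so that a short sentence is likely to show a mask.
tokens = "earth is a planet in our solar system".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The `fill-mask` pipeline used next corresponds to the inference side of this setup: we place the `[MASK]` token ourselves and ask the model for its most likely replacements.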

In [5]:
mlm_pipeline = pipeline(
    'fill-mask',
    model=DISTILBET_BASE_UNCASED_CHECKPOINT,
    device=DEVICE_ID
)
mlm_pipeline("Earth is a [MASK] in our solar system")

[{'score': 0.4104360342025757,
  'token': 4774,
  'token_str': 'planet',
  'sequence': 'earth is a planet in our solar system'},
 {'score': 0.05731069669127464,
  'token': 5871,
  'token_str': 'satellite',
  'sequence': 'earth is a satellite in our solar system'},
 {'score': 0.030489657074213028,
  'token': 4920,
  'token_str': 'hole',
  'sequence': 'earth is a hole in our solar system'},
 {'score': 0.02207716926932335,
  'token': 2732,
  'token_str': 'star',
  'sequence': 'earth is a star in our solar system'},
 {'score': 0.019248871132731438,
  'token': 4231,
  'token_str': 'moon',
  'sequence': 'earth is a moon in our solar system'}]
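
The output is a plain list of dictionaries, so it can be post-processed with ordinary Python. The short sketch below copies the five predictions from the output above (scores rounded to four decimals, `sequence` omitted) so that it runs without loading a model; the variable names are illustrative.

```python
# Top-5 fill-mask predictions, copied from the output above (scores rounded).
results = [
    {"score": 0.4104, "token": 4774, "token_str": "planet"},
    {"score": 0.0573, "token": 5871, "token_str": "satellite"},
    {"score": 0.0305, "token": 4920, "token_str": "hole"},
    {"score": 0.0221, "token": 2732, "token_str": "star"},
    {"score": 0.0192, "token": 4231, "token_str": "moon"},
]

# Pick the highest-scoring candidate and measure how much probability the top 5 cover.
best = max(results, key=lambda r: r["score"])
top5_mass = sum(r["score"] for r in results)

print(f"best guess: {best['token_str']} (id={best['token']}, p={best['score']:.2f})")
print(f"probability mass covered by the top 5: {top5_mass:.2f}")
```

The scores are probabilities over the whole vocabulary, which is why the top five add up to only about 0.54: the remaining mass is spread over all other tokens.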

### Sentiment Classification/Text Classification
This is a classic NLP task where the objective is to assign a positive or negative label to each input. It extends naturally to multi-class classification with labels such as very negative, negative, neutral, positive and very positive. In our setting, it is a binary classification task.

We will leverage the sentiment-analysis pipeline with both the _pretrained_ and _fine-tuned_ versions of **DistilBERT**. The fine-tuned version has been optimised for this task (on SST-2, as the checkpoint name indicates), though not on our specific inputs. Let us explore how these two models perform.

In [6]:
# `ft` in the variable name stands for fine-tuned
classification_ft_pipeline = pipeline(
    'sentiment-analysis',
    model=DISTILBET_CLASSIFICATION_CHECKPOINT,
    device=DEVICE_ID
)

# `pt` in the variable name stands for pretrained (not PyTorch ;) )
classification_pt_pipeline = pipeline(
    'sentiment-analysis',
    model=DISTILBET_BASE_UNCASED_CHECKPOINT,
    device=DEVICE_ID
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
SAMPLE_SA_INPUT = "What a messy place! I am never coming here again!"
pretrained_sa_results = classification_pt_pipeline(SAMPLE_SA_INPUT)
finetuned_sa_results = classification_ft_pipeline(SAMPLE_SA_INPUT)

In [8]:
# convincingly negative: look at the score
print(f"Predictions from Fine-Tuned Model={finetuned_sa_results}")
# the pre-trained checkpoint has a newly initialised classification head
# (see the warning above), so its generic label and near-0.5 score carry
# little information
print(f"Predictions from Pretrained Model={pretrained_sa_results}")

Predictions from Fine-Tuned Model=[{'label': 'NEGATIVE', 'score': 0.9995848536491394}]
Predictions from Pretrained Model=[{'label': 'LABEL_0', 'score': 0.556006669998169}]


In [9]:
# the fine-tuned model contains a mapping for ease of label conversion
classification_ft_pipeline.model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
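
As a rough sketch of what happens between the classifier head and the dictionaries printed above: the head emits one logit per class, a softmax turns the logits into probabilities, and `id2label` names the winning class. The logits below are hypothetical values chosen for illustration, and `softmax` and `to_prediction` are helper names introduced here, not part of the `transformers` API.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, id2label):
    """Map raw class logits to a {'label', 'score'} dict, pipeline-style."""
    probs = softmax(logits)
    best_id = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best_id], "score": probs[best_id]}


id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # mapping shown above

# Hypothetical logits: a confident head versus a nearly uninformative one.
print(to_prediction([4.0, -3.8], id2label))
print(to_prediction([0.12, -0.11], {0: "LABEL_0", 1: "LABEL_1"}))
```

Logits that are far apart give a score close to 1, while logits that are nearly equal give a score close to 0.5, which is the pattern seen for the fine-tuned and pretrained models respectively.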

### Question Answering
This is an interesting and fairly complex NLP task. The model receives a context together with a question and predicts the answer by selecting a span of text from the context. The training setup is somewhat involved; the following is an overview:
- Each training example is a triplet of context, question and answer
- This is transformed into a combined input of the form ``[CLS]question[SEP]context[SEP]`` or ``[CLS]context[SEP]question[SEP]``, with the answer acting as the label
- The model is trained to predict the start and end indices of the corresponding answer for each input.


For our current setting, we will leverage both _pretrained_ and _fine-tuned_ versions of **DistilBERT** via the _question-answering_ pipeline and understand the performance difference.

In [10]:
qa_ft_pipeline = pipeline(
    'question-answering',
    model=DISTILBET_QA_CHECKPOINT,
    device=DEVICE_ID
)
qa_pt_pipeline = pipeline(
    'question-answering',
    model=DISTILBET_BASE_UNCASED_CHECKPOINT,
    device=DEVICE_ID
)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
# we use a snippet about BERT-like models from the chapter itself
context = """The key contribution from this set of models is the masked language modeling objective during the pre-training phase, where some tokens in the input are masked, and the model is trained to predict them (we will cover these in the upcoming section). Key works in this group of architectures are BERT, RoBERTa (or optimized BERT), DistilBERT (lighter and more efficient BERT), ELECTRA and ALBERT.
In this notebook we will work through the task of Question Answering where our language model will learn to answer questions based on the context provided."""
question = "What are the key works in this set of models?"

In [12]:
ft_qa_result = qa_ft_pipeline(
    question=question,
    context=context
)

In [15]:
pt_qa_result = qa_pt_pipeline(
    question=question,
    context=context
)

In [16]:
print("*"*55)
print(f"Context:{context}")
print("*"*55)
print(f"Question:{question}")
print("-"*55)
print(f"Response from Fine-Tuned Model:\n{ft_qa_result}")
print()
print(f"Response from Pretrained Model:\n{pt_qa_result}")

*******************************************************
Context:The key contribution from this set of models is the masked language modeling objective during the pre-training phase, where some tokens in the input are masked, and the model is trained to predict them (we will cover these in the upcoming section). Key works in this group of architectures are BERT, RoBERTa (or optimized BERT), DistilBERT (lighter and more efficient BERT), ELECTRA and ALBERT.
In this notebook we will work through the task of Question Answering where our language model will learn to answer questions based on the context provided.
*******************************************************
Question:What are the key works in this set of models?
-------------------------------------------------------
Response from Fine-Tuned Model:
{'score': 0.010789151303470135, 'start': 294, 'end': 326, 'answer': 'BERT, RoBERTa (or optimized BERT'}

Response from Pretrained Model:
{'score': 0.0001677741383900866, 'start': 230, 'e

## Bonus: Visualising Things Under the Hood

Attention, especially multi-head attention, is a very powerful aspect of the overall transformer architecture. [A Multiscale Visualization of Attention in the Transformer Model](https://aclanthology.org/P19-3007.pdf) by Vig provides useful insight into visualising different attention heads across layers and how they pick up different aspects of the input and of language in general.

In this section, let us explore how our models differ in their understanding of inputs.
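
Before looking at the plots, it may help to recall what each head computes. The sketch below implements scaled dot-product attention weights for a single head in plain Python: every token's query is compared with every token's key, the scores are scaled by the square root of the key dimension, and a softmax turns each row into a distribution. The 2-dimensional query and key vectors are hypothetical toy values, and `attention_weights` is a helper name introduced here.

```python
import math


def attention_weights(queries, keys):
    """Scaled dot-product attention weights for one head.

    queries, keys: lists of equal-length float vectors, one per token.
    Returns a seq_len x seq_len matrix whose rows each sum to 1.
    """
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


tokens = ["dune", "was", "amazing"]
# Hypothetical 2-dimensional query/key vectors, one per token.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

for token, row in zip(tokens, attention_weights(queries, keys)):
    print(token, [round(w, 2) for w in row])
```

Each row of the matrix says how strongly one token attends to every other token. A model produces one such matrix per head and per layer, and these are the matrices the visualisations in this section draw as lines between tokens.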

In [17]:
from bertviz import head_view, model_view
from transformers import AutoTokenizer, AutoModel, utils

In [18]:
utils.logging.set_verbosity_error()

In [19]:
# All three checkpoints are DistilBERT base uncased variants, so we reuse one tokenizer
tokenizer = AutoTokenizer.from_pretrained(DISTILBET_BASE_UNCASED_CHECKPOINT)

# output_attentions=True makes each model return its attention weights
pt_viz_model = AutoModel.from_pretrained(
    DISTILBET_BASE_UNCASED_CHECKPOINT, output_attentions=True
)
ft_viz_model = AutoModel.from_pretrained(
    DISTILBET_CLASSIFICATION_CHECKPOINT, output_attentions=True
)
ft_viz_qa_model = AutoModel.from_pretrained(
    DISTILBET_QA_CHECKPOINT, output_attentions=True
)

In [20]:
input_text = "Dune was an amazing movie. Must Watch!"
inputs = tokenizer.encode(input_text, return_tensors='pt')  # Tokenize input text
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  # Convert input ids to token strings

In [21]:
pt_outputs = pt_viz_model(inputs)  # Run model
pt_attention = pt_outputs[-1]  # Retrieve attention from model outputs

In [22]:
ft_outputs = ft_viz_model(inputs)  # Run model
ft_attention = ft_outputs[-1]  # Retrieve attention from model outputs

In [23]:
ft_qa_outputs = ft_viz_qa_model(inputs)  # Run model
ft_qa_attention = ft_qa_outputs[-1]  # Retrieve attention from model outputs

In [24]:
# pre-trained attention: compare layer 4 (select from the drop-down), head 12 (double-click the last coloured cell on the right)
head_view(pt_attention, tokens)


In [25]:
# sentiment fine-tuned attention: compare layer 4 (select from the drop-down), head 12 (double-click the last coloured cell on the right)
head_view(ft_attention, tokens)


In [26]:
# QA fine-tuned attention: compare layer 4 (select from the drop-down), head 12 (double-click the last coloured cell on the right)
head_view(ft_qa_attention, tokens)


In [27]:
# pretrained model
model_view(pt_attention, tokens)


In [28]:
# sentiment fine tuned
model_view(ft_attention, tokens)


In [29]:
# qa finetuned
model_view(ft_qa_attention, tokens)
