<a href="https://colab.research.google.com/github/Firojpaudel/GenAI-Chronicles/blob/main/BERTs/BERT_implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **BERT**: _Using Huggingface_ 🤗

---

I'm learning from [the official Hugging Face Transformers docs](https://huggingface.co/docs/transformers/index), and I'll be using PyTorch throughout.

In [None]:
##@ First, let's install the Hugging Face transformers, datasets, evaluate and accelerate libraries
! pip install transformers datasets evaluate accelerate

In [29]:
from google.colab import userdata
my_token = userdata.get('HF_collab')  #Loading the Hugging Face Access Token through the secretKey

In [30]:
## Then Logging in:
from huggingface_hub import login

login(my_token)

### Getting Started:

#### **A. Pipeline**

**Important Catalogue before starting:**

---


| **Task**                     | **Description**                                                                                              | **Modality**    | **Pipeline identifier**                       |
|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|-----------------------------------------------|
| Text classification          | assign a label to a given sequence of text                                                                   | NLP             | pipeline(task="sentiment-analysis")           |
| Text generation              | generate text given a prompt                                                                                 | NLP             | pipeline(task="text-generation")              |
| Summarization                | generate a summary of a sequence of text or document                                                         | NLP             | pipeline(task="summarization")                |
| Image classification         | assign a label to an image                                                                                   | Computer vision | pipeline(task="image-classification")         |
| Image segmentation           | assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation) | Computer vision | pipeline(task="image-segmentation")           |
| Object detection             | predict the bounding boxes and classes of objects in an image                                                | Computer vision | pipeline(task="object-detection")             |
| Audio classification         | assign a label to some audio data                                                                            | Audio           | pipeline(task="audio-classification")         |
| Automatic speech recognition | transcribe speech into text                                                                                  | Audio           | pipeline(task="automatic-speech-recognition") |
| Visual question answering    | answer a question about the image, given an image and a question                                             | Multimodal      | pipeline(task="vqa")                          |
| Document question answering  | answer a question about a document, given an image and a question                                            | Multimodal      | pipeline(task="document-question-answering")  |
| Image captioning             | generate a caption for a given image                                                                         | Multimodal      | pipeline(task="image-to-text")                |


In [31]:
from transformers import pipeline

classifier= pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [32]:
classifier("Hey! so this way we can pass the value to classifier and its super easy. I'm liking this!")

[{'label': 'POSITIVE', 'score': 0.9996222257614136}]

In [33]:
#@ Testing for negative sentiment
classifier("You dummy!")

[{'label': 'NEGATIVE', 'score': 0.9874186515808105}]

Likewise, if we have more than one input, we can pass them as a list to `pipeline()`, which returns a list of dictionaries.


In [34]:
results = classifier(["You look beautiful", "You ugly hag!"])
for result in results:
  print(f"label: {result['label']}, score: {result['score']}")

label: POSITIVE, score: 0.9998769760131836
label: NEGATIVE, score: 0.9993403553962708


The `pipeline()` can also iterate over an entire dataset.

In [35]:
import torch

speech_recognizer = pipeline("automatic-speech-recognition", model= "facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


In [36]:
from datasets import load_dataset, Audio

dataset= load_dataset("PolyAI/minds14", name="en-US", split="train")

In the snippet above, we load the PolyAI MINDS-14 dataset from the Hugging Face Hub.

The `name="en-US"` argument selects a particular subset (config) of the dataset, in this case the English (US) subset.


Then `split="train"` specifies which split of the dataset to load.

In [37]:
dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate))

More explanations:

1. The `cast_column` method:
It's similar to type casting in basic Python. Here, it changes the "audio" column to the `Audio` type with a specified sampling rate, so the audio is resampled to that rate when it is read.

_Why `cast_column`?_

- So that every audio clip is standardized to the sampling rate the model's feature extractor expects (`speech_recognizer.feature_extractor.sampling_rate`).

In [38]:
result = speech_recognizer(dataset[:2]["audio"])
print([d["text"] for d in result])

['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE"]


> **Note**: \
For larger datasets, such as audio and images, we can use `generators` to avoid loading everything into memory at once. \
The 🤗 `pipeline()` API accepts these generators directly, so inputs are processed as they are produced.
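
The generator idea from the note can be sketched without any model at all. In this minimal sketch, `stream_samples` and `fake_pipeline` are hypothetical stand-ins (they are not part of 🤗 Transformers): the first yields inputs one at a time, and the second consumes any iterable lazily, roughly the way a pipeline consumes a generator.

```python
from typing import Iterable, Iterator


def stream_samples(n: int) -> Iterator[str]:
    """Hypothetical data source: yields one sample at a time instead of building a list."""
    for i in range(n):
        yield f"sample-{i}"


def fake_pipeline(inputs: Iterable[str]) -> Iterator[dict]:
    """Hypothetical stand-in for a pipeline call: processes inputs lazily, one by one."""
    for text in inputs:
        yield {"input": text, "length": len(text)}


# Nothing is computed until we iterate, so the million samples never sit in memory together.
results = fake_pipeline(stream_samples(1_000_000))
first_three = [next(results) for _ in range(3)]
print(first_three)
```

Only the three requested samples are ever produced here; the rest of the stream stays untouched until something asks for it.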

#### Using another model and tokenizer in the pipeline

Earlier, we didn't pass a `model` when creating the _"sentiment-analysis"_ pipeline, so it defaulted to `distilbert/distilbert-base-uncased-finetuned-sst-2-english`, which only classifies English text.

Now, let's try a model that also works with other languages, such as French, Spanish, and Dutch.

We will be using [this model](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment).

In [39]:
model_name  = "nlptown/bert-base-multilingual-uncased-sentiment"

In [40]:
classifier = pipeline("sentiment-analysis", model= model_name)

Device set to use cpu


In [41]:
##@ Let's try Dutch sentences: one with positive and one with negative sentiment
tests_dutch= classifier(["hey daar jochie, hoe gaat het? je ziet er onstuimig uit!!", "Hoe kan iemand er zo slecht uitzien?"])
for test_dutch in tests_dutch:
  print(f"label: {test_dutch['label']}, score: {test_dutch['score']}")

label: 5 stars, score: 0.3779810667037964
label: 1 star, score: 0.7339279651641846


#### **B. AutoClass**

An AutoClass is an abstraction that automatically selects and loads the appropriate model or tokenizer class for a given checkpoint, based on the model's architecture.

> _This allows us to work with the pretrained models without needing to know the exact details of their configuration or class implementation._

##### B.1 AutoTokenizer:


In [42]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [43]:
encoding= tokenizer("Hey bud! What's up! You good?")
encoding

{'input_ids': [101, 32821, 35070, 106, 11523, 112, 161, 10700, 106, 10855, 12050, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

What might have happened internally:

---
1. **Text Normalization:**\
The input is cleaned (lowercased if using an uncased model, etc.).

2. **Tokenization:**\
The text is split into tokens (words, subwords, or special symbols) using the tokenizer's predefined rules. \
*Example:* `"Hey bud! What's up! You good?"` might be split into:\
`["hey", "bud", "!", "what", "'", "s", "up", "!", "you", "good", "?"]`.

3. **Mapping to IDs:**\
Each token is mapped to its corresponding ID in the tokenizer's vocabulary. For example:\
`"hey" → 32821`\
`"!" → 106`

4. **Adding Special Tokens:**\
Special tokens like `[CLS]` (start of sequence, ID `101`) and `[SEP]` (end of sequence, ID `102`) are added.\
*Example:* `[CLS] hey bud ! what ' s up ! you good ? [SEP]`.
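
The four steps can be sketched in plain Python. This is a simplification, not the library's implementation: the real tokenizer uses WordPiece subword splitting, whereas the hypothetical `toy_encode` below only splits on words and punctuation. The IDs in the hypothetical `TOY_VOCAB` are read off the `input_ids` printed earlier, assuming the token split shown in step 2.

```python
import re
from typing import List

# Toy vocabulary (assumption): IDs taken from the `input_ids` output of the real tokenizer.
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "hey": 32821, "bud": 35070, "!": 106, "what": 11523, "'": 112,
    "s": 161, "up": 10700, "you": 10855, "good": 12050, "?": 136,
}


def toy_encode(text: str) -> List[int]:
    normalized = text.lower()                        # step 1: normalization (uncased model)
    tokens = re.findall(r"\w+|[^\w\s]", normalized)  # step 2: split into words and punctuation
    tokens = ["[CLS]", *tokens, "[SEP]"]             # step 4: wrap with special tokens
    return [TOY_VOCAB[tok] for tok in tokens]        # step 3: look up each token's ID


ids = toy_encode("Hey bud! What's up! You good?")
print(ids)
```

With this toy vocabulary the result reproduces the `input_ids` list from the cell above. Any word outside these twelve entries raises a `KeyError`, where a real tokenizer would fall back to subwords or `[UNK]`.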

A more convenient format for feeding a model is a padded batch of tensors, as shown below:

In [44]:
#@ The example code snippet from the docs itself
pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

print(pt_batch)

{'input_ids': tensor([[  101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103,   100,
         58263, 13299,   119,   102],
        [  101, 11312, 18763, 10855, 11530,   112,   162, 39487, 10197,   119,
           102,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}


Why this format?
- Padding brings all sequences in the batch to the same length, which is required to stack them into a single tensor. The `attention_mask` tells the model which positions hold real tokens (`1`) and which are just padding (`0`).
---
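What `padding=True` produces can be illustrated with a minimal pure-Python sketch. `pad_batch` is a hypothetical helper, not a library function, and the two short ID sequences are made up for illustration; the padding ID `0` matches the zeros in the output above.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of different lengths (4 and 3 tokens).
batch = pad_batch([[101, 11312, 10320, 102], [101, 11312, 102]])
print(batch)
```

The shorter sequence gains one trailing `0` in `input_ids` and a matching `0` in `attention_mask`, mirroring the three padded positions in the second row of `pt_batch`.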

##### B.2. AutoModel


- `AutoModel` is a generic class used to automatically load a pretrained transformer model.

- It’s designed to reduce the need for users to manually specify the model architecture (like BERT, RoBERTa, GPT, etc.).

- By specifying the name or path of a pretrained model, AutoModel figures out the right architecture for the task.

- For different tasks, there are specific variants of AutoModel:

  - **AutoModelForSequenceClassification**: For text classification tasks.
  - **AutoModelForTokenClassification**: For tasks like Named Entity Recognition (NER).
  - **AutoModelForQuestionAnswering**: For question-answering tasks.
  - **AutoModelForCausalLM**: For text generation tasks, etc.
---
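The idea of picking an architecture from a checkpoint name can be pictured as a lookup keyed on the `model_type` field of the checkpoint's config. The sketch below is a toy illustration under that assumption, not the library's actual code: `TOY_MODEL_MAPPING` and `toy_resolve` are hypothetical names, and the registry holds class names as plain strings.

```python
# Toy registry (assumption): maps a config's `model_type` to a model class name.
TOY_MODEL_MAPPING = {
    "bert": "BertModel",
    "distilbert": "DistilBertModel",
    "roberta": "RobertaModel",
}


def toy_resolve(config: dict) -> str:
    """Return the class name registered for this config's model_type."""
    model_type = config["model_type"]
    if model_type not in TOY_MODEL_MAPPING:
        raise ValueError(f"Unrecognized model_type: {model_type!r}")
    return TOY_MODEL_MAPPING[model_type]


resolved = toy_resolve({"model_type": "distilbert"})
print(resolved)
```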

In [45]:
from transformers import AutoModelForSequenceClassification as AmFSC

pt_model = AmFSC.from_pretrained(model_name)
pt_model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(105879, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

So `AutoModelForSequenceClassification` recognized the checkpoint as a BERT-based model and loaded `BertForSequenceClassification`.

In [46]:
pt_outputs= pt_model(**pt_batch)
pt_outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-2.6222, -2.7745, -0.8967,  2.0137,  3.3064],
        [ 0.0064, -0.1258, -0.0503, -0.1655,  0.1329]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

The `pt_model(**pt_batch)` line performs a forward pass through the model. The model computes the logits _(raw, un-normalized scores for each class)_ for the input text.

Now, if we want to turn these logits into probabilities, we can apply a softmax.

In [47]:
from torch import nn

pt_pred = nn.functional.softmax(pt_outputs.logits, dim=-1)
pt_pred

tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725],
        [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=<SoftmaxBackward0>)

Now, why `dim=-1`?

---

- `dim=-1` computes the softmax along the last dimension _(across the columns of each row)_, i.e. PyTorch normalizes row-wise, so each example's class probabilities sum to 1. For this 2-D tensor, `dim=1` does the same thing.

- `dim=0` would normalize column-wise instead of row-wise.
  - This is wrong for classification tasks because:
    - You’d mix up scores from different examples, which doesn’t make sense.

> _**Note:**_  
- `dim=-1` always refers to the last dimension, so it works for any tensor shape where the class scores sit on the last axis.
- `dim=1` is fine when the logits are 2-D _(batch × classes)_, but it silently normalizes the wrong axis if the tensor gains another dimension, e.g. token-level logits of shape _(batch, tokens, classes)_.

---
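To make the row-wise versus column-wise distinction concrete, here is a minimal pure-Python sketch (no PyTorch needed) that uses the logits printed by the forward pass. The `softmax` helper is written by hand for illustration; normalizing each row plays the role of `dim=-1`, and normalizing each column plays the role of `dim=0`.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a flat list of scores."""
    peak = max(values)                            # subtracting the max avoids overflow
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the forward pass: 2 examples (rows) x 5 classes (columns).
logits = [
    [-2.6222, -2.7745, -0.8967, 2.0137, 3.3064],
    [0.0064, -0.1258, -0.0503, -0.1655, 0.1329],
]

# Like dim=-1: normalize each row, so every example gets its own distribution.
row_wise = [softmax(row) for row in logits]

# Like dim=0: normalize each column, which mixes scores from different examples.
columns = [softmax(list(col)) for col in zip(*logits)]
col_wise = [list(row) for row in zip(*columns)]

print([round(sum(row), 4) for row in row_wise])  # each row sums to 1
print([round(sum(row), 4) for row in col_wise])  # rows no longer sum to 1
```

The row-wise result agrees with the `pt_pred` tensor to the printed precision, while the column-wise version yields per-example "probabilities" that do not sum to 1.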

#### **C. Saving the Model**
---

After fine-tuning the model, we can save it together with its tokenizer by calling `save_pretrained()` on each.

In [48]:
pt_save_directory = "./pt_save_pretrained"
tokenizer.save_pretrained(pt_save_directory)
pt_model.save_pretrained(pt_save_directory)

When we want to use the model again, we just load it from that directory:

In [49]:
pt_model = AmFSC.from_pretrained("./pt_save_pretrained")  #I imported AutoModelForSeqClassification as AmFSC

**The Cool Feature in 🤗 Transformers:**

---

The saved model can be reloaded as either a PyTorch or a TensorFlow model.

In [50]:
from transformers import TFAutoModelForSequenceClassification as TFAmFSC ##Note: TensorFlow has this instead :)
#@ If we want to convert this pt model to tf model,

tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)
tf_model = TFAmFSC.from_pretrained(pt_save_directory, from_pt= True)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


#### **D. Building the Custom Models**

---

In [51]:
from transformers import AutoConfig

my_config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased", n_heads=12)

In [52]:
print(my_config)

DistilBertConfig {
  "_name_or_path": "distilbert/distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.47.1",
  "vocab_size": 30522
}



In [53]:
##@ Using this custom configuration we could create a new model as well
from transformers import AutoModel
model_custom = AutoModel.from_config(my_config)

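> _We could then run the remaining tasks as before on this newly created model. Note that `from_config` starts from randomly initialized weights, so the model must be trained before its predictions are meaningful._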

---