<a href="https://colab.research.google.com/github/Firojpaudel/GenAI-Chronicles/blob/main/BERTs/BERT_implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Trying to Implement BERT**: _Using Huggingface_ 🤗

---

I'm learning from [the official Hugging Face Transformers docs](https://huggingface.co/docs/transformers/index), and I'll be using PyTorch the entire time.

In [None]:
##@ First, let's install Hugging Face transformers, datasets, evaluate and accelerate
! pip install transformers datasets evaluate accelerate

In [21]:
from google.colab import userdata
my_token = userdata.get('HF_collab')  #Loading the Hugging Face Access Token through the secretKey

In [27]:
## Then Logging in:
from huggingface_hub import login

login(my_token)

### Getting Started:

#### **A. Pipeline**

**Important Catalogue before starting:**

---


| **Task**                     | **Description**                                                                                              | **Modality**    | **Pipeline identifier**                       |
|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|-----------------------------------------------|
| Text classification          | assign a label to a given sequence of text                                                                   | NLP             | pipeline(task="sentiment-analysis")           |
| Text generation              | generate text given a prompt                                                                                 | NLP             | pipeline(task="text-generation")              |
| Summarization                | generate a summary of a sequence of text or document                                                         | NLP             | pipeline(task="summarization")                |
| Image classification         | assign a label to an image                                                                                   | Computer vision | pipeline(task="image-classification")         |
| Image segmentation           | assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation) | Computer vision | pipeline(task="image-segmentation")           |
| Object detection             | predict the bounding boxes and classes of objects in an image                                                | Computer vision | pipeline(task="object-detection")             |
| Audio classification         | assign a label to some audio data                                                                            | Audio           | pipeline(task="audio-classification")         |
| Automatic speech recognition | transcribe speech into text                                                                                  | Audio           | pipeline(task="automatic-speech-recognition") |
| Visual question answering    | answer a question about the image, given an image and a question                                             | Multimodal      | pipeline(task="vqa")                          |
| Document question answering  | answer a question about a document, given an image and a question                                            | Multimodal      | pipeline(task="document-question-answering")  |
| Image captioning             | generate a caption for a given image                                                                         | Multimodal      | pipeline(task="image-to-text")                |


In [28]:
from transformers import pipeline

classifier= pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
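
The warning above recommends naming the model and revision explicitly. Here is a minimal sketch of what that could look like, reusing the model name and revision hash printed in the log. The `PINNED` dictionary and the `build_classifier` helper are illustrative names of my own, not part of the library, and calling the helper downloads the model.

```python
# Model name and revision are copied from the log message above.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "714eb0f",
}


def build_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported lazily so that defining this helper does not load the library.
    from transformers import pipeline

    return pipeline(**PINNED)


# In the notebook: classifier = build_classifier()
```

Pinning both values should keep the notebook reproducible even if the library's default model changes later.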


In [29]:
classifier("Hey! so this way we can pass the value to classifier and its super easy. I'm liking this!")

[{'label': 'POSITIVE', 'score': 0.9996222257614136}]

In [24]:
#@ Testing for negative sentiment
classifier("You dummy!")

[{'label': 'NEGATIVE', 'score': 0.9874186515808105}]

Likewise, if we have more than one input, we can pass them as a list to `pipeline()`, which returns a list of dictionaries.


In [4]:
results = classifier(["You look beautiful", "You ugly hag!"])
for result in results:
  print(f"label: {result['label']}, score: {result['score']}")

label: POSITIVE, score: 0.9998769760131836
label: NEGATIVE, score: 0.9993403553962708
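
Since the predictions come back in the same order as the inputs, they can be zipped together with the original texts. The sketch below uses the two results printed above as sample data so it runs without loading a model. `pair_predictions` and the `min_score` threshold are hypothetical names and values chosen for illustration.

```python
# Sample data copied from the output above; in the notebook, `results`
# would be the list returned by classifier([...]).
texts = ["You look beautiful", "You ugly hag!"]
results = [
    {"label": "POSITIVE", "score": 0.9998769760131836},
    {"label": "NEGATIVE", "score": 0.9993403553962708},
]


def pair_predictions(texts, results, min_score=0.9):
    """Attach each input text to its prediction and flag low-confidence ones."""
    paired = []
    for text, result in zip(texts, results):
        paired.append({
            "text": text,
            "label": result["label"],
            "score": round(result["score"], 4),
            # min_score is an arbitrary cut-off used only for this example.
            "confident": result["score"] >= min_score,
        })
    return paired


for row in pair_predictions(texts, results):
    print(row)
```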


Also, `pipeline()` can iterate over an entire dataset. Let's try that with automatic speech recognition.

In [5]:
import torch

speech_recognizer = pipeline("automatic-speech-recognition", model= "facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu


In [8]:
from datasets import load_dataset, Audio

dataset= load_dataset("PolyAI/minds14", name="en-US", split="train")

The repository for PolyAI/minds14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/PolyAI/minds14.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



In the code snippet above, we are loading PolyAI's MInDS-14 dataset from the Hugging Face Hub.

The `name="en-US"` argument selects a particular subset (config) of the dataset. In this case, it loads the English (US) subset.


Then comes `split="train"`, which specifies which split of the dataset to load.

In [9]:
dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate))

More explanations:

1. The `cast_column` method:
It's similar to the type casting we do in basic Python. Here, it changes the "audio" column to the `Audio` feature type with the specified sampling rate, which is the rate the pipeline's feature extractor expects.

_Why cast_column?_

- So that every audio clip is resampled to the sampling rate the model was trained on, instead of being fed in at whatever rate it was recorded.
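
To make "standardized" a bit more concrete: resampling keeps the duration of a clip but changes how many samples represent it. The sketch below only does that arithmetic and does not resample any audio. The 8 kHz and 16 kHz rates and the 3-second clip are assumed values for illustration, and the real target rate is whatever `speech_recognizer.feature_extractor.sampling_rate` reports.

```python
def resampled_length(num_samples, source_rate, target_rate):
    """Number of samples after resampling; the clip duration stays the same."""
    duration_seconds = num_samples / source_rate
    return round(duration_seconds * target_rate)


# Hypothetical clip: 3 seconds recorded at 8 kHz, resampled to 16 kHz.
source_rate, target_rate = 8_000, 16_000
original = 3 * source_rate
converted = resampled_length(original, source_rate, target_rate)
print(original, "->", converted)  # 24000 -> 48000
```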

In [12]:
result = speech_recognizer(dataset[:2]["audio"])
print([d["text"] for d in result])

['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE"]


> **Note**: \
For larger datasets, such as audio or images, we can use `generators` to avoid loading everything into memory at once. \
The Hugging Face pipeline API accepts these generators directly, so the data is processed efficiently.
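
A rough sketch of the generator idea from the note, assuming each dataset row has an `"audio"` entry as in the cells above. `iter_audio` and `transcribe_all` are helper names invented for this example. The pipeline is passed in as an argument, so nothing is downloaded when the functions are defined.

```python
def iter_audio(samples):
    """Yield one audio entry at a time instead of building a full list."""
    for sample in samples:
        yield sample["audio"]


def transcribe_all(recognizer, samples):
    """Feed the generator to the pipeline and yield each transcription."""
    # When a pipeline receives a generator, it returns results one by one.
    for output in recognizer(iter_audio(samples)):
        yield output["text"]


# In the notebook (assumed usage, not run here):
# for text in transcribe_all(speech_recognizer, dataset):
#     print(text)
```

Because both functions are generators, only one audio sample at a time needs to be held in memory by this code.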

#### Using another model and tokenizer in the pipeline

Earlier, when we used the pipeline, we didn't specify a `model` for _"sentiment-analysis"_, so it defaulted to `distilbert/distilbert-base-uncased-finetuned-sst-2-english`, which only classifies English text.

Now, let's try a model that also works with other languages, such as French and Spanish.

We will be using [this model](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment)

In [13]:
model_name  = "nlptown/bert-base-multilingual-uncased-sentiment"

In [14]:
classifier = pipeline("sentiment-analysis", model= model_name)

Device set to use cpu
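
The cell above passes only the model name and lets the pipeline pick the matching tokenizer. A sketch of the more explicit route, loading the model and tokenizer separately and handing both to `pipeline()`, might look like the following. `build_multilingual_classifier` is a helper name made up for this example, and calling it downloads the model weights.

```python
MODEL_NAME = "nlptown/bert-base-multilingual-uncased-sentiment"


def build_multilingual_classifier(model_name=MODEL_NAME):
    """Load the model and tokenizer explicitly, then wrap them in a pipeline."""
    # Imported lazily so that defining this helper does not load the library.
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        pipeline,
    )

    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)


# In the notebook: classifier = build_multilingual_classifier()
```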


In [18]:
##@ Let's try two Dutch sentences, one with positive and the other with negative sentiment
tests_dutch= classifier(["hey daar jochie, hoe gaat het? je ziet er onstuimig uit!!", "Hoe kan iemand er zo slecht uitzien?"])
for test_dutch in tests_dutch:
  print(f"label: {test_dutch['label']}, score: {test_dutch['score']}")

label: 5 stars, score: 0.3779810667037964
label: 1 star, score: 0.7339279651641846
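
Unlike the first model, this one returns star ratings rather than POSITIVE/NEGATIVE labels. One possible way to compare the two is to collapse the stars into coarse buckets, as sketched below. The bucket boundaries (1 to 2 negative, 3 neutral, 4 to 5 positive) are my own assumption, not something the model defines, and `stars_to_sentiment` is a hypothetical helper.

```python
def stars_to_sentiment(label):
    """Map a label such as '4 stars' to a coarse sentiment bucket."""
    stars = int(label.split()[0])
    # Bucket boundaries are an assumption made for this example.
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Sample data copied from the output above.
outputs = [
    {"label": "5 stars", "score": 0.3779810667037964},
    {"label": "1 star", "score": 0.7339279651641846},
]
for output in outputs:
    print(output["label"], "->", stars_to_sentiment(output["label"]))
```

Note that the score for the first sentence is fairly low (about 0.38), so the "5 stars" label there is not a confident prediction.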
