
Exercise 1: Tokenization with BERT

In [1]:
from transformers import BertTokenizer  # import the BERT tokenizer class (no model is needed for tokenization)
import torch  # import PyTorch for tensor operations

In [2]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # load the pre-trained BERT (uncased) tokenizer
sentence = "Transformers are changing the world of NLP."

# Tokenize and prepare for BERT input
encoded_input = tokenizer(
    sentence,
    add_special_tokens=True,   # Add [CLS] and [SEP]
    padding='max_length',      # Pad up to max_length (16 here; the model maximum, 512, if unset)
    truncation=True,           # Truncate if sentence too long
    max_length=16,             # Limit length to 16 tokens
    return_tensors='pt'        # Return PyTorch tensors
)

# Get token IDs
input_ids = encoded_input['input_ids'][0]
# Convert token IDs back to token strings
tokens = tokenizer.convert_ids_to_tokens(input_ids)

# Print results
print("Input IDs:", input_ids)
print("Tokens:", tokens)

Input IDs: tensor([  101, 19081,  2024,  5278,  1996,  2088,  1997, 17953,  2361,  1012,
          102,     0,     0,     0,     0,     0])
Tokens: ['[CLS]', 'transformers', 'are', 'changing', 'the', 'world', 'of', 'nl', '##p', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
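
The split of "NLP" into `nl` and `##p` in the output above comes from WordPiece's greedy longest-match-first lookup on the lowercased word. The following is a minimal sketch of that idea, assuming a tiny made-up vocabulary (the real `bert-base-uncased` vocabulary has roughly 30,000 entries). The names `TOY_VOCAB`, `wordpiece_split`, and `pad_tokens` are illustrative and not part of the `transformers` library.

```python
# Toy vocabulary, assumed for illustration only; not the real BERT vocab.
TOY_VOCAB = {"transformers", "are", "nl", "##p", "##l", "n", "[UNK]"}

def wordpiece_split(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of one lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

def pad_tokens(tokens, max_length, pad="[PAD]"):
    """Right-pad a token list to max_length (no truncation in this sketch)."""
    return tokens + [pad] * max(0, max_length - len(tokens))

print(wordpiece_split("nlp"))
print(pad_tokens(["[CLS]", "nl", "##p", "[SEP]"], 6))
```

Because "nlp" is not in the toy vocabulary, the loop falls back to the longest prefix it can find (`nl`) and then matches the remainder as a continuation piece (`##p`), mirroring the tokens printed above.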


Exercise 2: Sentiment Analysis with BERT Pipeline

In [3]:
from transformers import pipeline

# Create a sentiment-analysis pipeline
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Sample sentence
sentence = "I love shopping!"

# Get sentiment prediction
result = sentiment_pipeline(sentence)[0]

# Print result
print("Sentence:", sentence)
print("Predicted label:", result['label'])
print("Confidence score:", round(result['score'], 4))


Sentence: I love shopping!
Predicted label: POSITIVE
Confidence score: 0.9998


Exercise 3: Building a Custom Sentiment Analyzer

In [4]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

In [5]:
# Define the BERTSentimentAnalyzer class
class BERTSentimentAnalyzer:
    def __init__(self, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
        # Load tokenizer and model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()  # Set model to evaluation mode

        # Label order for this SST-2 checkpoint: index 0 = NEGATIVE, index 1 = POSITIVE
        self.labels = ["NEGATIVE", "POSITIVE"]

    def preprocess(self, text):
        # Tokenize and encode the input text
        encoded = self.tokenizer(
            text,
            return_tensors='pt',        # Return PyTorch tensors
            truncation=True,
            padding=True,
            max_length=512
        )
        return encoded

    def predict(self, text):
        # Preprocess input
        inputs = self.preprocess(text)

        # Forward pass through the model
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits

        # Apply softmax to get probabilities
        probs = F.softmax(logits, dim=1)
        confidence, prediction = torch.max(probs, dim=1)
        label = self.labels[prediction.item()]
        return {"label": label, "confidence": round(confidence.item(), 4)}


In [6]:
# Test the analyzer with sample texts
# Create instance
analyzer = BERTSentimentAnalyzer()

# Test samples
samples = [
    "I absolutely love this movie!",
    "This is the worst experience I've ever had.",
    "The product is okay, not great but not terrible either.",
    "I'm so happy with the customer service!",
    "It was a complete waste of money."
]

# Analyze each sentence
for text in samples:
    result = analyzer.predict(text)
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']} (Confidence: {result['confidence']})\n")


Text: I absolutely love this movie!
Sentiment: POSITIVE (Confidence: 0.9999)

Text: This is the worst experience I've ever had.
Sentiment: NEGATIVE (Confidence: 0.9998)

Text: The product is okay, not great but not terrible either.
Sentiment: POSITIVE (Confidence: 0.9852)

Text: I'm so happy with the customer service!
Sentiment: POSITIVE (Confidence: 0.9999)

Text: It was a complete waste of money.
Sentiment: NEGATIVE (Confidence: 0.9998)
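
The third sample reads as neutral, yet it is labeled POSITIVE. The checkpoint has only two classes, so softmax always divides the probability between NEGATIVE and POSITIVE. The sketch below reproduces the softmax step from `predict` in plain Python and adds a confidence threshold. The logits are made up, and the `UNCERTAIN` label and the 0.9 threshold are assumptions for illustration, not part of the model.

```python
import math

LABELS = ["NEGATIVE", "POSITIVE"]  # same index order as in the class above

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_threshold(logits, threshold=0.9):
    """Return (label, confidence); the threshold is a hypothetical choice."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    label = LABELS[best] if confidence >= threshold else "UNCERTAIN"
    return label, round(confidence, 4)

# Made-up logits, not taken from the model run above.
print(label_with_threshold([-4.0, 4.5]))  # large gap between the two logits
print(label_with_threshold([0.1, 0.4]))   # nearly tied logits
```

A threshold like this only catches inputs whose two logits are close. It would not have flagged the 0.9852 prediction above, so handling neutral text reliably would likely need a model trained with a neutral class.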



Exercise 5: Comparing BERT and GPT

Both models are based on the Transformer architecture, but they use different parts of it and serve different goals:

BERT uses only the encoder.

GPT uses only the decoder.
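
One practical consequence of the encoder/decoder split is the attention mask. The sketch below builds both masks as plain nested lists, where 1 means "may attend". The function names are illustrative, and real implementations apply the mask to attention scores inside the model rather than printing it.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

tokens = ["the", "cat", "sat"]
for name, mask in [("encoder", bidirectional_mask(3)), ("decoder", causal_mask(3))]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>4} -> {row}")
```

In the decoder mask, the row for "the" contains a single 1: the first token cannot see anything to its right, which is what makes left-to-right generation possible.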



| Feature                   | **BERT**                                                                    | **GPT**                                                     |
| ------------------------- | --------------------------------------------------------------------------- | ----------------------------------------------------------- |
| 🏗 **Architecture**       | Encoder-only                                                                | Decoder-only                                                |
| 🔁 **Directionality**     | **Bidirectional** (reads left and right context)                            | **Unidirectional** (left-to-right)                          |
| 🎯 **Primary Purpose**    | **Understanding** language                                                  | **Generating** language                                     |
| 🧠 **Training Objective** | Masked Language Modeling (MLM)                                              | Causal Language Modeling (CLM)                              |
| 🔍 **Common Use Cases**   | Classification, QA, Named Entity Recognition                                | Text generation, Chatbots, Autocomplete                     |
| ✅ **Strengths**           | Deep contextual understanding, good for tasks needing full-sentence context | Excellent at fluent and coherent **text generation**        |
| ❌ **Weaknesses**          | Not designed for generation                                                 | Each token sees only the tokens to its left, so it lacks right-hand context |
| 🔧 **Fine-tuning**        | Often fine-tuned for **specific understanding tasks**                       | Often used **as-is** for generation or fine-tuned for style |
| 📚 **Popular Versions**   | BERT-base, BERT-large; variants such as RoBERTa                             | GPT-2, GPT-3, GPT-4                                         |
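
The two training objectives in the table differ mainly in how inputs and targets are built from the same sentence. The sketch below shows that difference on a token list. It is simplified: it always substitutes `[MASK]`, whereas BERT's original recipe sometimes keeps or randomly replaces the selected token. The function names and the fixed seed are assumptions for illustration.

```python
import random

def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Masked LM: hide some tokens; the targets are the hidden originals."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            targets.append(token)   # the model must recover this token
        else:
            inputs.append(token)
            targets.append(None)    # this position is not scored
    return inputs, targets

def clm_example(tokens):
    """Causal LM: at each position the target is simply the next token."""
    return tokens[:-1], tokens[1:]

sentence = ["transformers", "are", "changing", "the", "world"]
print(mlm_example(sentence, mask_prob=0.3))
print(clm_example(sentence))
```

In the masked setup the model can use words on both sides of each `[MASK]`. In the causal setup every target lies to the right of its input, so only left context is available.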


Use BERT when:
- You need to understand or classify a sentence (e.g. Is this review positive?)

- The task depends on full sentence context (left and right)

- You’re working on question answering, sentence similarity, or entity recognition

Use GPT when:
- You need to generate text (e.g. write an email, summarize an article)

- The task is creative or open-ended

- You want to build chatbots or autocomplete systems

Exercise 6: Exploring BERT Applications in Retrieval-Augmented Generation (RAG)

RAG is an architecture that combines:

- Retrieval: Finding relevant information from a large document base.

- Generation: Producing a response based on both the input and the retrieved documents.

BERT’s Role in the Retrieval Component

In the retrieval step, BERT acts as an encoder, not a generator.
It converts both the query and the documents into dense vectors (embeddings).

These embeddings represent the semantic meaning of the text.

The retrieval step uses these embeddings to find documents similar to the query.

How BERT Generates Embeddings

The input text is tokenized and passed through BERT.

The output from the [CLS] token (or mean pooling of all tokens) is used as the embedding.

This is done for both:

- Queries (e.g., “What are symptoms of diabetes?”)

- Documents (e.g., articles, knowledge base entries)
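
The two pooling options mentioned above can be sketched on toy data. The hidden states here are made-up 2-dimensional vectors (a real `bert-base` model produces 768 dimensions per token), and the function names are illustrative. In practice the same arithmetic would be done on tensors.

```python
def cls_pooling(hidden_states):
    """Use the vector at position 0, where BERT places [CLS]."""
    return list(hidden_states[0])

def mean_pooling(hidden_states, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(hidden_states[0])
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Made-up hidden states for [CLS], one word, [SEP], [PAD].
hidden = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
print(cls_pooling(hidden))         # [1.0, 0.0]
print(mean_pooling(hidden, mask))  # [2.0, 2.0]
```

The attention mask matters for mean pooling: without it, the `[PAD]` vector would be averaged in and distort the embedding.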

Vector Database: Matching Queries to Documents

A vector index or database (such as FAISS, Weaviate, or Pinecone) is used to:

- Store document embeddings.

- Compare the query embedding against all document embeddings using similarity (usually cosine similarity or dot product).

- Return the top-k most relevant documents.
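
The three steps above can be sketched as a brute-force search in plain Python. The embeddings are made-up 3-dimensional vectors and the document IDs are hypothetical. A production vector store would typically use an approximate nearest-neighbor index instead of scanning every document.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up embeddings; real ones would come from an encoder such as BERT.
doc_vecs = {
    "vaccines": [0.9, 0.1, 0.0],
    "diabetes": [0.1, 0.9, 0.1],
    "cooking":  [0.0, 0.1, 0.9],
}
q_vec = [0.8, 0.2, 0.0]
print(top_k(q_vec, doc_vecs, k=2))
```

With these toy vectors, the "vaccines" document ranks first because its direction is closest to the query vector.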

Example: BERT + GPT in a RAG System

User Query:
“How do vaccines work?”

1. Query Encoding with BERT

BERT encodes the query into a vector: q_vec

2. Document Retrieval

q_vec is matched against a vector database using cosine similarity.

Top-3 matching documents (e.g., medical texts about vaccines) are retrieved.

3. Pass to GPT

The retrieved documents are fed to a generative model such as GPT, together with the query.

GPT uses them as context to generate a response:

“Vaccines work by stimulating your immune system to recognize and fight pathogens…”
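
Step 3 usually amounts to assembling one prompt from the query and the retrieved passages. A minimal sketch, assuming a simple instruction-plus-context template: the wording of the template, the `build_prompt` name, and the two passages are placeholders, not output from a real retriever.

```python
def build_prompt(query, documents):
    """Concatenate retrieved passages and the question into one prompt string."""
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Placeholder passages standing in for the top-k retrieval results.
retrieved = [
    "Vaccines expose the immune system to a harmless form of a pathogen.",
    "Memory cells let the body respond faster on later exposure.",
]
prompt = build_prompt("How do vaccines work?", retrieved)
print(prompt)
```

The generative model then continues the text after `Answer:`, grounding its response in the numbered passages.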