🌟 Exercise 1: Tokenization With BERT

Objective: Learn how BERT tokenizes text and adds special tokens, preparing it for model input.

Why this matters:
Before any language model can process text, the text must be converted into tokens and then into numerical IDs. BERT also adds special tokens: [CLS] marks the start of the input, and [SEP] marks the end of each sentence or segment. This exercise helps you understand how BERT prepares raw text for analysis.
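To make the idea concrete, here is a minimal sketch of WordPiece-style tokenization, the greedy longest-match-first scheme BERT's tokenizer is based on. The tiny `VOCAB`, the `wordpiece` helper, and the `encode` helper are illustrative assumptions, not the library's implementation; the real tokenizer also normalizes text and splits punctuation.

```python
# Toy vocabulary (assumption for illustration); real BERT vocabularies
# hold roughly 30,000 entries.
VOCAB = {"[CLS]", "[SEP]", "[UNK]", "play", "##ing", "##ed", "the", "game"}


def wordpiece(word, vocab=VOCAB):
    """Split one lowercase word greedily, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no split exists, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(sentence):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word))
    tokens.append("[SEP]")
    return tokens


print(encode("Playing the game"))
# ['[CLS]', 'play', '##ing', 'the', 'game', '[SEP]']
```

Each token would then be mapped to its integer index in the vocabulary to produce the input IDs.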

Instructions:

1. Install the transformers and torch libraries.
2. Load the BERT tokenizer (bert-base-uncased).
3. Choose a sample sentence.
4. Tokenize the sentence and view how BERT breaks it down.
5. Prepare the sentence with special tokens, padding, and truncation for model input.
6. Review the token IDs and tokens, identifying the special tokens BERT adds.

Outcome: You will have a fully tokenized sentence, see the special tokens BERT adds, and understand how text becomes input for BERT.

In [5]:
# Import the libraries (install them first with: pip install transformers torch)
import torch
from transformers import BertTokenizer

# Load the BERT tokenizer (bert-base-uncased)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Choose a sample sentence.
text = "How can you leave me standing alone in a world that's so cold? Maybe I'm just too demanding... Maybe I'm just like my father, too bored!"

# Tokenize the sentence and view how BERT breaks it down
encoded_input = tokenizer(text, return_tensors='pt')
for i in range(encoded_input['input_ids'].shape[0]):
    input_ids = encoded_input['input_ids'][i].tolist()  # tensor -> list of ints
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    print(f"Tokens for input {i}:", tokens)

# Prepare the sentence with special tokens, padding, and truncation for model input.
encoded_input = tokenizer(
    text,
    padding='max_length',       # pad to max_length
    truncation=True,            # truncate if needed
    max_length=20,              # example fixed max length
    return_tensors='pt'         # return PyTorch tensors
)

# Review the token IDs and tokens, identifying the special tokens BERT adds.
for i in range(encoded_input['input_ids'].shape[0]):
    input_ids = encoded_input['input_ids'][i].tolist()  # tensor -> list of ints
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    print(f"Tokens for input {i}:", tokens)



Tokens for input 0: ['[CLS]', 'how', 'can', 'you', 'leave', 'me', 'standing', 'alone', 'in', 'a', 'world', 'that', "'", 's', 'so', 'cold', '?', 'maybe', 'i', "'", 'm', 'just', 'too', 'demanding', '.', '.', '.', 'maybe', 'i', "'", 'm', 'just', 'like', 'my', 'father', ',', 'too', 'bored', '!', '[SEP]']
Tokens for input 0: ['[CLS]', 'how', 'can', 'you', 'leave', 'me', 'standing', 'alone', 'in', 'a', 'world', 'that', "'", 's', 'so', 'cold', '?', 'maybe', 'i', '[SEP]']
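The second output line is exactly 20 tokens long: [CLS], the first 18 tokens of the sentence, and [SEP]. The sketch below imitates that behaviour in plain Python so the roles of truncation, padding, and the attention mask are visible. The `pad_or_truncate` helper is a hypothetical illustration, not part of the transformers API.

```python
def pad_or_truncate(tokens, max_length, pad_token="[PAD]"):
    """Fit tokens (without special tokens) into exactly max_length slots.

    Returns the final token list and an attention mask in which
    1 marks a real token and 0 marks padding.
    """
    body = tokens[: max_length - 2]          # reserve two slots for [CLS] and [SEP]
    out = ["[CLS]"] + body + ["[SEP]"]
    mask = [1] * len(out)
    n_pad = max_length - len(out)            # 0 when the input was truncated
    return out + [pad_token] * n_pad, mask + [0] * n_pad


short, short_mask = pad_or_truncate(["hello", "world"], max_length=6)
print(short)       # ['[CLS]', 'hello', 'world', '[SEP]', '[PAD]', '[PAD]']
print(short_mask)  # [1, 1, 1, 1, 0, 0]
```

With a short input the tail is filled with [PAD] tokens that the mask tells the model to ignore; with a long input, as in the exercise, the sequence is cut and [SEP] is still placed last.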


🌟 Exercise 2: Sentiment Analysis With BERT Pipeline

Objective: Use a pre-trained BERT model to perform sentiment analysis.

Why this matters:

Pre-trained models like BERT can quickly classify text, such as determining if a sentence is positive or negative. Pipelines simplify this process, allowing you to focus on the task without managing low-level details.

Instructions:

1. Import the pipeline class from transformers.
2. Create a sentiment analysis pipeline using the distilbert-base-uncased-finetuned-sst-2-english model.
3. Provide a sample sentence.
4. Use the pipeline to predict the sentiment.
5. Review the predicted label and confidence score.

Outcome: You will have a working sentiment analysis pipeline that can classify text as positive or negative.

In [9]:
#Import the pipeline class from transformers.
from transformers import pipeline

#Create a sentiment analysis pipeline using the distilbert-base-uncased-finetuned-sst-2-english model.
analyzer = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

#Provide a sample sentence.
sample_sentence = "When I walk through the valley of the shadow of death, I don't feel no evil!"

#Use the pipeline to predict the sentiment.
result = analyzer(sample_sentence)
print(result)

#Review the predicted label and confidence score.
label = result[0]['label']
score = result[0]['score']

print(f"Predicted Sentiment: {label}")
print(f"Confidence Score: {score:.4f}")



Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9990573525428772}]
Predicted Sentiment: POSITIVE
Confidence Score: 0.9991


🌟 Exercise 3: Building A Custom Sentiment Analyzer

Objective: Build a sentiment analyzer with direct control over the tokenizer, model, and processing pipeline.

Why this matters:

Using pipelines is convenient, but building a custom analyzer helps you understand how models process inputs and generate outputs. You gain full control over preprocessing, model handling, and post-processing.

Instructions:

1. Import AutoTokenizer and AutoModelForSequenceClassification.
2. Create a class BERTSentimentAnalyzer with methods for:

Initializing the tokenizer and model.
Preprocessing input text (cleaning, tokenizing, preparing tensors).
Predicting sentiment and returning results.
3. Test your analyzer with various sample texts.

Outcome: You will have a custom sentiment analyzer and understand each component’s role in the pipeline.

In [10]:
#Import AutoTokenizer and AutoModelForSequenceClassification.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import re
import torch
import torch.nn.functional as F


#Create a class BERTSentimentAnalyzer with methods for:
class BERTSentimentAnalyzer:
    #Initializing the tokenizer and model
    def __init__(self, model_name='distilbert-base-uncased-finetuned-sst-2-english'):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)

    #Preprocessing input text (cleaning, tokenizing, preparing tensors)
    def preprocess_input_text(self, text):
        # Basic cleaning (optional)
        text = text.strip()
        text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
        text = re.sub(r'[^\w\s\'!?.,]', '', text)  # remove unwanted characters

        # Tokenize and prepare tensors
        inputs = self.tokenizer(
            text,
            return_tensors='pt',
            truncation=True,
            padding=True
        )
        return inputs

    #Predicting sentiment and returning results.
    def predict_sentiment(self, text):
        # Preprocess: tokenize and prepare tensors
        inputs = self.preprocess_input_text(text)

        # Run model (disable gradient calculations)
        with torch.no_grad():
            outputs = self.model(**inputs)

        # Get logits and convert to probabilities
        probs = F.softmax(outputs.logits, dim=1)

        # Get predicted class index and confidence
        predicted_class_idx = torch.argmax(probs, dim=1).item()
        confidence = probs[0][predicted_class_idx].item()

        # Convert class index to label (e.g., 'POSITIVE', 'NEGATIVE')
        label = self.model.config.id2label[predicted_class_idx]

        return {'label': label, 'confidence': round(confidence, 4)}

#Test your analyzer with various sample texts.
sample_texts = [
    "Who lives by the sword will die by the sword",
    "Even a broken watch gives the right time twice a day",
    "Women can keep a secret, as long as they cooperate on it"
]

bert_sentiment_analyzer = BERTSentimentAnalyzer()
print([bert_sentiment_analyzer.predict_sentiment(text) for text in sample_texts])


[{'label': 'NEGATIVE', 'confidence': 0.9692}, {'label': 'POSITIVE', 'confidence': 0.9815}, {'label': 'POSITIVE', 'confidence': 0.9694}]
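The post-processing inside predict_sentiment (softmax, argmax, label lookup) can be followed without loading a model. The sketch below repeats those three steps in plain Python. The logits are made-up numbers, and the `ID2LABEL` mapping is assumed to match the model's `config.id2label` for this SST-2 checkpoint.

```python
import math

# Assumed to mirror model.config.id2label for the SST-2 checkpoint.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits):
    """Pick the most probable class and report its probability as confidence."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)   # argmax
    return {"label": ID2LABEL[idx], "confidence": round(probs[idx], 4)}


# Illustrative logits, not real model output.
print(postprocess([-2.0, 2.0]))  # {'label': 'POSITIVE', 'confidence': 0.982}
```

The confidence is simply the softmax probability of the winning class, so a high value reflects a large gap between the two logits rather than a guarantee that the label is right.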


🌟 Exercise 4: Understanding BERT For Named Entity Recognition (NER)

Objective: Explore how BERT identifies entities in text using the NER task.

Why this matters:

NER helps extract important information like names, locations, and organizations from text. BERT can be fine-tuned for NER using models trained with the B-I-O tagging scheme (Begin, Inside, Outside).
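As an illustration of the B-I-O scheme, the sketch below merges per-token tags into entity spans, which is roughly what an aggregation step does after the model has labelled each token. The `group_bio` helper and the hand-written tags are assumptions for illustration, not model output or library code.

```python
def group_bio(tokens, tags):
    """Merge B-/I- tagged tokens into (text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            if current:                       # close the span in progress
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)             # continue the current span
        else:                                 # "O": outside any entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]   # hand-written example tags
print(group_bio(tokens, tags))  # [('Barack Obama', 'PER'), ('Hawaii', 'LOC')]
```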

Instructions:

1. Import AutoTokenizer and AutoModelForTokenClassification.
2. Create a class BERTNamedEntityRecognizer with methods for:

Initializing the tokenizer and model.
Recognizing entities in a given text and mapping token predictions to labels.
3. Test your recognizer with sample text containing entities.

Outcome: You will build an NER system that identifies entities like names, places, and more using BERT.

In [25]:
#Import AutoTokenizer and AutoModelForTokenClassification.
from transformers import AutoModelForTokenClassification, AutoTokenizer
from transformers import pipeline

#Create a class BERTNamedEntityRecognizer with methods for:
class BERTNamedEntityRecognizer:
    #Initializing the tokenizer and model
    def __init__(self, model_name='dslim/bert-base-NER'):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)


    #Recognizing entities in a given text and mapping token predictions to labels.
    def recognize_entities(self, text):
        self.ner_pipeline = pipeline("ner", model=self.model, tokenizer=self.tokenizer, aggregation_strategy="simple")
        return self.ner_pipeline(text)

#Test your recognizer with sample texts containing entities.
texts = [
    "Barack Obama was born in Hawaii.",
    "Apple Inc. is based in Cupertino.",
    "Angela Merkel was Chancellor of Germany."
]

bert_ner = BERTNamedEntityRecognizer()
result = [bert_ner.recognize_entities(text) for text in texts]

for i, entities in enumerate(result):
    print(f"\nText {i + 1}:")
    if not entities:
        print("  No entities found.")
    for entity in entities:
        print(f"  {entity['word']} ({entity['entity_group']}, confidence: {entity['score']:.4f})")



Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
Device set to use cpu
Device set to use cpu



Text 1:
  Barack Obama (PER, confidence: 0.9993)
  Hawaii (LOC, confidence: 0.9997)

Text 2:
  Apple Inc (ORG, confidence: 0.9994)
  Cupertino (LOC, confidence: 0.9977)

Text 3:
  Angela Merkel (PER, confidence: 0.9982)
  Germany (LOC, confidence: 0.9996)


🌟 Exercise 5: Comparing BERT And GPT

Objective: Understand the architectural and functional differences between BERT and GPT models.

Why this matters:

BERT and GPT are foundational models in NLP but serve different purposes. Knowing their strengths, weaknesses, and use cases helps you choose the right model for your task.

Instructions:

1. Research the architectures and applications of BERT and GPT.
2. Create a comparison table based on:

Architecture (encoder, decoder, or both).
Primary purpose (understanding vs. generation).
Common use cases.
Strengths and weaknesses.
3. Reflect on the differences and similarities.

Outcome: You will have a clear comparison of BERT and GPT, helping you understand when to use each model.

| **Model Type**                             | **Architecture**               | **Primary Purpose** | **Common Use Cases**                                  | **Strengths**                                                                                           | **Weaknesses**                                                               |
| ------------------------------------------ | ------------------------------ | ------------------- | ----------------------------------------------------- | ------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- |
| **BERT** (e.g. `bert-base-uncased`)        | Encoder-only                   | Understanding       | Classification, NER, QA (extractive), embeddings      | - Deep contextual understanding<br>- Bidirectional attention<br>- Great for tasks needing comprehension | - Not designed for generation<br>- Limited input length (512 tokens for `bert-base-uncased`) |
| **GPT** (e.g. `gpt-3.5`, `gpt-4`)          | Decoder-only                   | Generation          | Text generation, chatbots, summarization, translation | - Strong generative capabilities<br>- Few-shot/in-context learning<br>- Long-form coherence             | - Not ideal for extractive QA or classification<br>- Needs careful prompting |
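The architectural difference in the table comes down to which positions each token may attend to. The sketch below builds the two attention patterns as 0/1 matrices: row i lists the positions token i can see. The function names are illustrative assumptions; real models apply such masks inside their attention layers.

```python
def bidirectional_mask(n):
    """Encoder-style (BERT): every token sees every other token."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style (GPT): token i sees only positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


print("BERT-style mask:")
for row in bidirectional_mask(4):
    print(row)

print("GPT-style mask:")
for row in causal_mask(4):
    print(row)
# GPT-style rows: [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]
```

Seeing context on both sides suits understanding tasks, while the left-to-right restriction is what lets a decoder generate text one token at a time.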


🌟 Exercise 6: Exploring BERT Applications In Retrieval-Augmented Generation (RAG)

Objective: Learn how BERT is used in RAG systems to enhance information retrieval.

Why this matters:

RAG systems combine retrieval and generation, allowing language models to access external knowledge. BERT plays a key role in retrieving relevant information, improving the quality of generated responses.

Instructions:

1. Research the concept of Retrieval-Augmented Generation (RAG).
2. Explain BERT’s role in the retrieval component.
3. Describe how BERT generates embeddings for documents and queries.
4. Discuss how a vector database is used to match queries with relevant documents.
5. Provide an example of how BERT and a generative model like GPT work together in a RAG system.

Outcome: You will understand BERT’s role in RAG systems and how it enhances retrieval for generation tasks.

RAG is a hybrid architecture with two stages: a retriever fetches relevant documents or passages from a knowledge base, and a generator produces an answer conditioned on the retrieved content.
This lets large language models (LLMs) ground their responses in external knowledge instead of relying only on what they learned during training.

In the retrieval component, a BERT-based encoder generates embeddings for the documents in the knowledge base during the indexing phase and for user queries at runtime. The result is a semantic map of texts in a vector space, where the distance between two vectors measures how similar their meanings are, even if the wording differs.

📐 3. How BERT Generates Embeddings

🔹 For Documents (at indexing time):

embedding = bert_model.encode("Document text here")

Here bert_model is a placeholder for a BERT-based sentence encoder that exposes an encode method.
The embedding is a dense vector (e.g., 384 or 768 dimensions).
It’s stored in a vector database for fast lookup.

🔹 For Queries (at retrieval time):

query_embedding = bert_model.encode("What is the capital of France?")

The query is embedded into the same space as the documents.
A BERT-based encoder fine-tuned for semantic similarity places related inputs (e.g., “capital of France” and “Paris”) closer together.

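BERT outputs one vector per token, so a pooling step is needed to get a single embedding per text. One common choice, assumed here because the text does not say which pooling is used, is mean pooling over the non-padding tokens. The sketch uses tiny hand-made 2-dimensional vectors in place of real hidden states.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vec[k]
    return [t / count for t in total]


# Hand-made stand-ins for per-token hidden states; the last row is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
print(mean_pool(token_vectors, attention_mask))  # [2.0, 3.0]
```

Another option is to take the vector at the [CLS] position; which works better depends on how the encoder was fine-tuned.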
📦 4. Vector Database for Retrieval

A vector database (like FAISS, Pinecone, Weaviate, or Qdrant):

- Stores all document embeddings
- Enables fast similarity search (using cosine similarity, dot product, etc.)
- Returns top-k most relevant documents for a query

Example:

similar_docs = vector_db.search(query_embedding, top_k=5)

These docs are then fed into the generative model.
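A minimal in-memory stand-in shows what such a search does. The `InMemoryVectorStore` class, its `add` and `search` methods, and the 2-dimensional embeddings are all hypothetical, chosen to mirror the placeholder `vector_db` calls in this section. Real vector databases use approximate indexes instead of a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Brute-force store: keeps (doc, embedding) pairs and scans them all."""

    def __init__(self):
        self.items = []

    def add(self, docs, embeddings):
        self.items.extend(zip(docs, embeddings))

    def search(self, query_embedding, top_k=5):
        ranked = sorted(
            self.items,
            key=lambda item: cosine(query_embedding, item[1]),
            reverse=True,                      # most similar first
        )
        return [doc for doc, _ in ranked[:top_k]]


store = InMemoryVectorStore()
store.add(
    ["Paris is the capital of France.", "Berlin is the capital of Germany."],
    [[0.9, 0.1], [0.1, 0.9]],                  # made-up embeddings
)
print(store.search([0.8, 0.2], top_k=1))  # ['Paris is the capital of France.']
```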

🤖 5. RAG in Action: BERT + GPT Example

Let’s walk through a simplified flow:
Step 1: Preprocessing

# Document index
docs = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
doc_embeddings = bert_model.encode(docs)
vector_db.add(docs, doc_embeddings)

Step 2: User asks a question

query = "What is the capital of France?"
query_embedding = bert_model.encode(query)
top_docs = vector_db.search(query_embedding, top_k=1)

Step 3: Prompt construction

prompt = f"Use the following passage to answer the question:\n\n{top_docs[0]}\n\nQ: {query}\nA:"

Step 4: GPT generates an answer

answer = gpt_model.generate(prompt)
print(answer)  # → "Paris"