# BERT: Bidirectional Encoder Representations from Transformers

## What is BERT and Why Does It Matter?

Unlike the static embedding methods you've learned (Word2Vec, GloVe), BERT produces a different vector for a word depending on the sentence it appears in. Word2Vec and GloVe assign each word one fixed vector regardless of context. ELMo made embeddings context-aware by combining the outputs of a left-to-right and a right-to-left language model; BERT goes further by letting every layer use the words on both the left and the right at the same time.

Think about the word "bank" in these sentences:
- "I went to the bank to deposit money"
- "I sat by the river bank"

BERT gives "bank" a different representation in each sentence because it conditions on the whole sentence, to the left and to the right of the word, rather than reading only left-to-right.

## The Foundation: Transformer Basics

Before diving into BERT, you need a basic picture of the transformer architecture. Transformers are built around an attention mechanism, which lets the model focus on the relevant parts of the input when processing each word. BERT uses only the encoder part of the transformer.

The attention mechanism answers: "When processing this word, which other words in the sentence should I pay attention to?"

For example, in "The animal didn't cross the street because it was too tired", the attention mechanism helps the model understand that "it" refers to "animal", not "street".
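To make the idea concrete, here is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. The 2-dimensional vectors are made-up toy values, not real BERT weights, and the function names are illustrative. Real transformers also apply learned query/key/value projections and run several attention heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Similarity between the query and every key, scaled by sqrt(dimension)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is the attention-weighted average of the value vectors
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output

# Toy vectors (assumed values, chosen so that "it" resembles "animal")
tokens = ["animal", "street", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = keys
query_for_it = [2.0, 0.0]

weights, output = attend(query_for_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
```

With these toy values, the query for "it" puts its largest weight on "animal" and its smallest on "street", which is the behavior the example sentence describes.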

## Key Concepts in BERT

**Tokenization in BERT**

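BERT uses WordPiece tokenization, which breaks words into subwords. This handles unknown words better than word-level tokenization, because a rare or unseen word can still be represented as a sequence of known pieces instead of a single unknown-word token.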

Example:
- "playing" might become ["play", "##ing"]
- "unhappiness" might become ["un", "##happiness"]

The "##" indicates a subword that continues from the previous token.
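The splitting step can be sketched as a greedy longest-match-first search over a vocabulary. The vocabulary below is a tiny hypothetical one chosen to reproduce the examples above, and `wordpiece_tokenize` is an illustrative name, not the library's function.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Tiny hypothetical vocabulary (the real bert-base-uncased vocabulary has 30,522 entries)
toy_vocab = {"play", "##ing", "##ed", "un", "##happiness", "##happy"}

print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("unhappiness", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

The actual split of a word depends entirely on which pieces are in the vocabulary, which is why the examples above say "might become".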

**Special Tokens**

BERT uses special tokens:
- `[CLS]` - Added at the start of every sequence, used for classification tasks
- `[SEP]` - Separates sentences in sentence-pair tasks
- `[MASK]` - Replaces selected tokens during pre-training so the model has to predict them
- `[PAD]` - Padding token that brings all sequences in a batch to the same length
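As a rough sketch of how these tokens fit together, the helper below packs two already-tokenized sentences into one BERT-style input. `build_pair_input` is a hypothetical helper written for illustration; in practice the tokenizer does this for you.

```python
def build_pair_input(tokens_a, tokens_b, max_length):
    """Pack two token lists into a single BERT-style input sequence."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers sentence B and its [SEP]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)

    pad_count = max_length - len(tokens)
    if pad_count < 0:
        raise ValueError("sequence is longer than max_length")
    # Padded positions get attention_mask 0 so the model ignores them
    tokens += ["[PAD]"] * pad_count
    token_type_ids += [0] * pad_count
    attention_mask += [0] * pad_count
    return tokens, token_type_ids, attention_mask

tokens, token_type_ids, attention_mask = build_pair_input(
    ["she", "finished", "her", "exam"],
    ["she", "felt", "relieved"],
    max_length=12,
)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

The `token_type_ids` and `attention_mask` lists correspond to the tensors of the same name that the tokenizer returns later in this notebook.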

---

### Pre-training Tasks

BERT is **pre-trained** on a large unlabeled text corpus using two self-supervised tasks before it is fine-tuned or used for a downstream task. These tasks help BERT learn both **word meaning** and **sentence relationships**.

1. **Masked Language Modeling (MLM):**
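Instead of reading text only from left to right, BERT learns from the full context. During pre-training, about 15% of the tokens in each sequence are selected for prediction: 80% of those are replaced with `[MASK]`, 10% are replaced with a random token, and 10% are left unchanged. BERT’s job is to predict the original token by looking at the tokens **before and after** it.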

Example:
"The cat sat on the [MASK]"
BERT uses surrounding words to predict: **"mat"**

This task teaches BERT grammar, word meaning, and how the same word can change meaning depending on context.

2. **Next Sentence Prediction (NSP):**
BERT also learns how sentences relate to each other. It is given two sentences together and must decide whether the second sentence actually followed the first in the original text or was picked at random from the corpus.

Example (valid pair):
Sentence A: "She finished her exam."
Sentence B: "She felt relieved."

Example (invalid pair):
Sentence A: "She finished her exam."
Sentence B: "The sky is blue."

This task helps BERT understand sentence flow, which is important for question answering and sentence-pair tasks.
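The two pre-training tasks above can be sketched as data-preparation functions. This is a simplified illustration under stated assumptions: each token is selected independently with probability 0.15, the random replacement comes from a tiny made-up vocabulary, and the negative sentence for NSP is drawn from the same short list rather than from another document. The names `mask_tokens` and `make_nsp_example` are hypothetical.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random):
    """Corrupt tokens for MLM; return (corrupted tokens, prediction targets)."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")           # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the original token
        else:
            labels.append(None)  # not selected, so nothing to predict here
            corrupted.append(token)
    return corrupted, labels

def make_nsp_example(sentences, index, rng=random):
    """Pair sentence `index` with its true successor or with a random other sentence."""
    first = sentences[index]
    if rng.random() < 0.5:
        return first, sentences[index + 1], "IsNext"
    candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return first, rng.choice(candidates), "NotNext"

rng = random.Random(0)  # fixed seed so repeated runs give the same result
toy_vocab = ["cat", "mat", "dog", "sat", "the", "on"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], toy_vocab, rng=rng))

sentences = [
    "She finished her exam.",
    "She felt relieved.",
    "The sky is blue.",
    "It rained all day.",
]
print(make_nsp_example(sentences, 0, rng=rng))
```

In both functions the labels come from the text itself, which is why these tasks need no human annotation.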

### Install Libraries

In [None]:
# Core libraries for BERT
%pip install transformers torch

## Load a Pretrained BERT Model


* Tokenizer → converts text to numbers
* Model → converts numbers to embeddings
* `eval()` → disables training behavior like dropout

In [4]:
from transformers import BertTokenizer, BertModel
import torch

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Set model to evaluation mode (important)
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

## Tokenize Text

* Text → tokens
* Tokens → token IDs
* Adds `[CLS]` and `[SEP]` automatically

In [5]:
text = "I love learning NLP"

# Tokenize and convert to tensors
inputs = tokenizer(
    text,
    return_tensors="pt",   # PyTorch tensors
    padding=True,
    truncation=True
)

print(inputs)

{'input_ids': tensor([[  101,  1045,  2293,  4083, 17953,  2361,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}


## Generate BERT Embeddings

Running the model on the token IDs produces one contextual vector per token.

What you get:

* Shape → `(batch_size, sequence_length, 768)`
* Each token (including `[CLS]`, `[SEP]`, and subword pieces) has a **768-dimensional vector**
* Same word in different sentences → different vectors


In [6]:
with torch.no_grad():  # no gradients needed
    outputs = model(**inputs)

# Extract embeddings
last_hidden_state = outputs.last_hidden_state

## Understand BERT Output Clearly

In [7]:
print(last_hidden_state.shape)

torch.Size([1, 7, 768])


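* `batch_size` → number of sentences (1 here)
* `sequence_length` → number of tokens (7 here: `[CLS]`, five word pieces, and `[SEP]`; "NLP" is split into two pieces)
* `768` → embedding size (the hidden size of `bert-base-uncased`)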

## Sentence Embedding 

You usually need **one vector per sentence**, not per word.

### CLS Token Embedding (Simple & Common)

In [8]:
# CLS token is the first token
sentence_embedding = last_hidden_state[:, 0, :]

print(sentence_embedding.shape)

torch.Size([1, 768])


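This gives:

* One **768D vector** per sentence
* A rough summary of the sentence: the `[CLS]` vector is the one BERT's classification heads read from, but without fine-tuning it is only a baseline sentence representation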
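A common alternative to the `[CLS]` vector is to average the token vectors, skipping padded positions. The sketch below assumes tensors laid out like BERT's output and uses random stand-in values so that it runs without loading a model; `mean_pool` is an illustrative helper, not part of the Transformers library.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts

# Stand-in tensors (random values, not real embeddings); the second sentence has 3 padded positions
hidden = torch.randn(2, 7, 768)
mask = torch.tensor([[1, 1, 1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 0, 0, 0]])

pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 768])
```

In this notebook the equivalent call would be `mean_pool(last_hidden_state, inputs["attention_mask"])`. Which pooling works better depends on the task, so it is worth trying both.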

## Compare Meaning of Words (Context Demo)

In [9]:
texts = [
    "I went to the bank to deposit money",
    "The river bank was beautiful"
]

inputs = tokenizer(texts, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state

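* The word **“bank”** gets a **different embedding** in each sentence because its neighbors differ
* This context sensitivity is what separates BERT from static embeddings such as Word2Vec and GloVe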

In [None]:
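# Extract embedding vectors for "bank"
from torch.nn.functional import cosine_similarity

# Find the position of "bank" in each sentence instead of hard-coding it
bank_id = tokenizer.convert_tokens_to_ids("bank")
bank_indices = [(ids == bank_id).nonzero()[0, 0].item() for ids in inputs["input_ids"]]  # [5, 3] here

bank_vec_1 = embeddings[0, bank_indices[0], :]
bank_vec_2 = embeddings[1, bank_indices[1], :]

# Compare the embeddings numerically - cosine similarity
similarity = cosine_similarity(bank_vec_1, bank_vec_2, dim=0)
print("Cosine similarity:", similarity.item())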


Cosine similarity: 0.5034990310668945


## Downstream Task: Text Classification (Basic Example)

In [None]:
# Load Classification Model

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2   # binary classification
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [24]:
# Tokenize Input for Classification

text = "This course is very helpful"

inputs = tokenizer(
    text,
    return_tensors="pt",
    padding=True,
    truncation=True
)

print(inputs)

{'input_ids': tensor([[  101,  2023,  2607,  2003,  2200, 14044,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}


In [23]:
# Get Prediction (No Training)

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
prediction = torch.argmax(logits, dim=1)

print(prediction)

tensor([0])


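This is **inference only**. Because the classification head (`classifier.weight`, `classifier.bias`) was randomly initialized, as the warning above says, the predicted label is arbitrary until the model is fine-tuned on labeled data.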
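To read logits as class probabilities, apply a softmax. The sketch below uses made-up logit values for illustration (they are not the output of the cell above) and plain Python so it runs on its own.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence and two classes
example_logits = [0.12, -0.05]

probs = softmax(example_logits)
predicted_class = probs.index(max(probs))  # same choice as argmax over the logits

print("Probabilities:", [round(p, 3) for p in probs])
print("Predicted class:", predicted_class)
```

Logits this close together give probabilities near 50/50, which is what you would typically expect from a classification head that has not been trained yet.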

## Question Answering with BERT

1. **Load a pretrained BERT QA model**

   * `deepset/bert-base-uncased-squad2` is already fine-tuned on a QA dataset (SQuAD2).
   * This means it **can find answers in a context paragraph** without further training.

2. **Take user input**

   * The user provides a **context paragraph** and a **question**.

3. **Prepare input for BERT**

   * The question and context are tokenized together.
   * Special tokens `[CLS]` (start) and `[SEP]` (separator) are added.
   * Attention masks and token type IDs are created to let BERT know which tokens belong to the question and which belong to the context.

4. **Predict answer span**

   * BERT outputs **start logits** and **end logits** for each token position in the input sequence (question, context, and padding).
   * The code picks the position with the **highest start logit** and the position with the **highest end logit**, independently, as the answer span.

5. **Extract the answer**

   * The tokens between the predicted start and end positions are converted back to text using the tokenizer.
   * If the predicted span is invalid (end < start), the function returns `"No Answer Found"`.

6. **Compute confidence**

   * Softmax probabilities of the start and end positions are averaged to give a simple **confidence score** for the predicted answer.

7. **Display result**

   * Prints the context, question, answer, and confidence.
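Picking the start and end positions independently, as in step 4, can produce an end that comes before the start. A common refinement is to search for the best-scoring span that is valid. The sketch below shows that idea on made-up logits; `best_span` and `max_answer_len` are hypothetical names, and this is not the code used later in this notebook.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring span with start <= end."""
    best = None
    for start, start_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best

# Toy logits where independent argmax picks start=2 and end=1 (an invalid span)
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1]
end_logits = [0.1, 2.5, 0.3, 2.0, 0.2]

independent = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
print("Independent argmax:", independent)

start, end, score = best_span(start_logits, end_logits)
print("Best valid span:", (start, end))
```

The search is quadratic in the sequence length in the worst case, which is why a cap on the answer length is a practical choice.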

---

### How it’s implemented

* **Transformers library** handles all tokenization and model operations.
* **`BertTokenizer`** converts text to token IDs and maps tokens back to words.
* **`BertForQuestionAnswering`** contains BERT + a small linear layer that predicts start and end positions.
* **`torch.no_grad()`** disables gradient calculations since we only want **inference**, not training.

---


> User gives context + question → BERT predicts start and end positions in the context → Tokens are decoded → Answer + confidence is shown.



In [30]:
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

# Load pretrained QA model (fine-tuned on SQuAD2)
tokenizer = BertTokenizer.from_pretrained("deepset/bert-base-uncased-squad2")
model = BertForQuestionAnswering.from_pretrained("deepset/bert-base-uncased-squad2")
model.eval()

def prepare_qa_input(question, context):
    """Encode a question/context pair as [CLS] question [SEP] context [SEP]."""
    # Calling the tokenizer directly replaces the deprecated encode_plus
    encoding = tokenizer(
        question,
        context,
        add_special_tokens=True,
        max_length=512,          # BERT's maximum sequence length
        padding='max_length',
        truncation=True,
        return_tensors='pt',
        return_token_type_ids=True
    )
    return encoding

def answer_question(question, context):
    """Extract answer from context for a given question"""
    encoding = prepare_qa_input(question, context)
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']
    token_type_ids = encoding['token_type_ids']

    with torch.no_grad():
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )

    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

    # Get most likely start and end positions
    # (batch size is 1, so the flat argmax is a token index)
    start_idx = torch.argmax(start_logits).item()
    end_idx = torch.argmax(end_logits).item()

    # No answer: the end precedes the start, or the span starts at [CLS] (position 0),
    # which is how SQuAD 2.0 models mark unanswerable questions
    if end_idx < start_idx or start_idx == 0:
        return "No Answer Found", 0.0

    answer_tokens = input_ids[0][start_idx:end_idx + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

    # Heuristic confidence: average of the top start and end probabilities
    start_score = torch.max(torch.softmax(start_logits, dim=1)).item()
    end_score = torch.max(torch.softmax(end_logits, dim=1)).item()
    confidence = (start_score + end_score) / 2

    return answer, confidence

# Take user input
context = input("Enter the context:\n")
question = input("\nEnter the question:\n")

answer, confidence = answer_question(question, context)
print(f"\nContext: {context}")
print(f"\nQuestion: {question}")
print(f"\nAnswer: {answer}")
print(f"Confidence: {confidence:.2%}")



Context: What are Newton’s Laws of Motion? An object at rest remains at rest, and an object in motion remains in motion at constant speed and in a straight line unless acted on by an unbalanced force. The acceleration of an object depends on the mass of the object and the amount of force applied. Whenever one object exerts a force on another object, the second object exerts an equal and opposite on the first. Sir Isaac Newton worked in many areas of mathematics and physics. He developed the theories of gravitation in 1666 when he was only 23 years old. In 1686, he presented his three laws of motion in the “Principia Mathematica Philosophiae Naturalis.”  By developing his three laws of motion, Newton revolutionized science. Newton’s laws together with Kepler’s Laws explained why planets move in elliptical orbits rather than in circles

Question: in which paper newton introduced laws of motion

Answer: principia mathematica philosophiae naturalis
Confidence: 54.24%
