# 🤖 BERT: Bidirectional Encoder Representations from Transformers

This notebook explores **BERT**, one of the most influential NLP models. BERT uses an **encoder-only** Transformer architecture and is pre-trained with **Masked Language Modeling (MLM)** together with Next Sentence Prediction (NSP).

## Key Concepts

| Concept | Description |
|:--------|:------------|
| **Architecture** | Encoder-only Transformer (12 layers in base) |
| **Attention** | Bidirectional - sees entire context |
| **Training** | Masked Language Modeling (MLM) + Next Sentence Prediction |
| **Output** | Contextual embeddings for each token |
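
To make the MLM objective concrete, here is a minimal sketch of the token corruption step. It assumes the recipe from the original BERT pre-training setup (select roughly 15% of tokens; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged). The function name `mask_tokens`, the toy vocabulary, and the seed are illustrative choices, not part of any library.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["cat", "dog", "tree", "book"]  # hypothetical vocabulary for random replacements

def mask_tokens(token_list, mask_prob=0.15, seed=0):
    """Return (corrupted, labels) using the 80/10/10 MLM corruption rule."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in token_list:
        # Special tokens are never selected; other tokens are selected with mask_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            corrupted.append(tok)
            labels.append(None)  # position does not contribute to the loss
            continue
        labels.append(tok)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                   # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(TOY_VOCAB))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                    # 10%: keep the token unchanged
    return corrupted, labels

example = ["[CLS]", "i", "am", "a", "ner", "##d", "[SEP]"]
# mask_prob is raised to 0.5 only so that a 7-token example is likely to show some masking.
corrupted, labels = mask_tokens(example, mask_prob=0.5)
print(corrupted)
print(labels)
```

Positions whose label is `None` are ignored by the loss; the model is only asked to recover the original token at the selected positions.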

---

## 1. Setup and Imports

We'll use the 🤗 Hugging Face `transformers` library, which provides pre-trained BERT models and tokenizers.

In [None]:
from transformers import BertTokenizer, BertModel

## 2. Loading the BERT Tokenizer

BERT uses **WordPiece tokenization**, which breaks words into subword units. This allows BERT to:
- Handle out-of-vocabulary words
- Keep vocabulary size manageable (~30,000 tokens)
- Capture morphological patterns
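
The splitting itself can be sketched as a greedy longest-match-first search over the vocabulary, which is the strategy WordPiece uses at tokenization time. The vocabulary below is a small hypothetical one, so its splits need not match what `bert-base-uncased` produces; the function name `wordpiece` is also just illustrative.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one lowercase word into the longest matching word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; the real one has roughly 30,000 entries.
toy_vocab = {"i", "am", "a", "ner", "##d", "read", "##ing", "book", "##s"}

print(wordpiece("nerd", toy_vocab))     # ['ner', '##d']
print(wordpiece("reading", toy_vocab))  # ['read', '##ing']
print(wordpiece("xyz", toy_vocab))      # ['[UNK]']
```

Because rare words decompose into known pieces, the vocabulary can stay small while still covering most text.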

The `bert-base-uncased` model:
- Lowercases text before tokenization (uncased)
- Has 12 transformer layers
- Produces 768-dimensional embeddings
- Has about 110M parameters
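
The "110M" figure can be checked with some arithmetic. The sketch below counts the weights of the bare encoder (what `BertModel` loads, without pre-training heads), assuming the usual `bert-base-uncased` configuration: a 30,522-token vocabulary, 512 positions, 2 token types, and a 3,072-wide feed-forward layer. These values are assumptions not stated elsewhere in this notebook.

```python
# Assumed bert-base-uncased hyperparameters.
vocab_size, max_positions, type_vocab = 30522, 512, 2
hidden, num_layers, intermediate = 768, 12, 3072

def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Token, position and token-type tables, followed by one LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

per_layer = (
    4 * linear(hidden, hidden)      # query, key, value and attention output projections
    + layer_norm                    # LayerNorm after attention
    + linear(hidden, intermediate)  # feed-forward expansion
    + linear(intermediate, hidden)  # feed-forward projection back to hidden size
    + layer_norm                    # LayerNorm after feed-forward
)

pooler = linear(hidden, hidden)  # dense layer applied to the [CLS] vector

total = embeddings + num_layers * per_layer + pooler
print(f"{total:,}")  # 109,482,240
```

Under these assumptions the 12 encoder layers hold about 85M parameters and the embedding tables about 24M, which together round to the commonly quoted 110M.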

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

## 3. Tokenizing a Single Sentence

Let's tokenize a simple sentence and examine the output:

```
Tokenizer Output:
├── input_ids:      Token IDs (integers representing each token)
├── token_type_ids: Segment IDs (0 for first sentence, 1 for second)
└── attention_mask: 1 for real tokens, 0 for padding
```

**Special Tokens:**
- `[CLS]` (ID: 101): Added at the start, used for classification
- `[SEP]` (ID: 102): Added at the end, separates sentences
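
As a rough illustration of how these pieces fit together, the sketch below assembles the three tokenizer outputs by hand for a single sentence and for a sentence pair. The helper `build_inputs` is hypothetical (it is not a `transformers` function), and the non-special IDs are the ones listed for "I am a nerd" later in this section; splitting that sentence into a pair is purely for demonstration.

```python
CLS_ID, SEP_ID = 101, 102  # special token IDs quoted above

def build_inputs(ids_a, ids_b=None):
    """Wrap one or two ID sequences as [CLS] A [SEP] (B [SEP]) with segment IDs."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)  # segment 0 covers [CLS], A and the first [SEP]
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)  # segment 1 covers B and the final [SEP]
    attention_mask = [1] * len(input_ids)  # no padding yet, so every position is real
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

single = build_inputs([1045, 2572, 1037, 11265, 4103])  # i am a ner ##d
pair = build_inputs([1045, 2572], [1037, 11265, 4103])  # "i am" / "a ner ##d"

print(single)
print(pair)
```

For a single sentence every `token_type_ids` entry is 0; the 1s only appear once a second segment is supplied.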

In [None]:
tokens = tokenizer('I am a nerd', return_tensors='pt')
tokens

### Understanding the Token IDs

Let's decode the token IDs to see what tokens they represent:

```
Token ID  →  Token
─────────────────────
101       →  [CLS]
1045      →  i
2572      →  am
1037      →  a
11265     →  ner
4103      →  ##d
102       →  [SEP]
```

Notice that "nerd" is split into "ner" + "##d" (the `##` prefix indicates a subword continuation).

## 4. Batch Tokenization with Padding

When processing multiple sentences of different lengths, we need **padding** to create uniform tensor shapes.

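```
Sentence                        Word pieces (with special tokens)         Length
"I am a nerd"                   [CLS] i am a ner ##d [SEP]                7
"reading books all day long"    [CLS] reading books all day long [SEP]    7
```

Padding works on tokens, not words: shorter sequences are filled with `[PAD]` up to the longest sequence in the batch. In this batch the 4-word and 5-word sentences both produce 7 tokens (because "nerd" becomes two word pieces), so no `[PAD]` tokens are added and the attention mask is all ones.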

In [None]:
tokens = tokenizer(['I am a nerd', 'reading books all day long'], padding=True, return_tensors='pt')
tokens
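
To see what padding does when the lengths really differ, here is a minimal pure-Python sketch of the idea behind `padding=True`. The helper `pad_batch` is hypothetical, and the sketch assumes that `[PAD]` has ID 0, as it does in the `bert-base-uncased` vocabulary.

```python
PAD_ID = 0  # assumed ID of [PAD] in bert-base-uncased

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every ID sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask

short_ids = [101, 1045, 2572, 102]                    # [CLS] i am [SEP]
long_ids = [101, 1045, 2572, 1037, 11265, 4103, 102]  # [CLS] i am a ner ##d [SEP]

padded_ids, masks = pad_batch([short_ids, long_ids])
print(padded_ids[0])  # [101, 1045, 2572, 102, 0, 0, 0]
print(masks[0])       # [1, 1, 1, 1, 0, 0, 0]
print(masks[1])       # [1, 1, 1, 1, 1, 1, 1]
```

The zeros in the attention mask tell the model to ignore the padded positions when computing attention.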

## 5. Loading the BERT Model

Now we'll load the pre-trained BERT model and pass our tokenized inputs through it.

```
BERT Model Architecture:

    Input IDs → Embedding Layer → 12× Transformer Encoder Layers → Output
                      ↓
            + Position Embedding (learned)
            + Token Type Embedding
```
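
The embedding layer in the diagram can be sketched as an element-wise sum of three table lookups. This is a toy version with random tables and a hidden size of 4 (BERT-base uses 768); the table sizes and the `embed` helper are illustrative assumptions, and the layer normalization and dropout that BERT applies after the sum are omitted.

```python
import random

rng = random.Random(0)
HIDDEN = 4  # toy hidden size; BERT-base uses 768

def make_table(rows):
    """Create a random lookup table with one HIDDEN-dimensional vector per row."""
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(rows)]

token_table = make_table(10)    # toy vocabulary of 10 token IDs
position_table = make_table(8)  # toy maximum sequence length of 8
segment_table = make_table(2)   # token type 0 and token type 1

def embed(input_ids, token_type_ids):
    """Sum token, position and token-type embeddings for each position."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(input_ids, token_type_ids)):
        vectors.append([
            t + p + s
            for t, p, s in zip(token_table[tok], position_table[pos], segment_table[seg])
        ])
    return vectors

# The same toy token ID (5) appears at positions 1 and 2.
embedded = embed([1, 5, 5, 2], [0, 0, 0, 0])
print(len(embedded), len(embedded[0]))  # 4 4
print(embedded[1] != embedded[2])       # True: position embeddings differ
```

Even before any attention layer runs, a repeated token already gets a different input vector at each position.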

In [None]:
model = BertModel.from_pretrained("bert-base-uncased")
output = model(**tokens)

## 6. Understanding BERT's Output

BERT returns two main outputs:

### `last_hidden_state`
Contextual embeddings for **every token** in the input.
- Shape: `(batch_size, sequence_length, hidden_size)`
- Each token has a 768-dimensional vector that depends on its context

### `pooler_output`
A single vector summarizing the **entire sentence** (one per input sequence).
- Shape: `(batch_size, hidden_size)`
- Computed from the final hidden state of the `[CLS]` token, passed through a dense layer with a tanh activation
- Often used for classification tasks
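
The relationship between the two outputs can be sketched in a few lines: the pooled vector is the `[CLS]` row of `last_hidden_state` passed through a dense layer and a tanh. The version below uses a toy hidden size of 4 and random weights, so the numbers are meaningless; the names `pooler`, `weight` and `bias` are illustrative rather than the library's attribute names.

```python
import math
import random

rng = random.Random(0)
HIDDEN_SIZE = 4  # toy size; BERT-base uses 768

def pooler(cls_vector, weight, bias):
    """Apply a dense layer followed by tanh to the [CLS] hidden state."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
        for row, b in zip(weight, bias)
    ]

# Toy stand-ins for one sequence of 3 token vectors and the pooler parameters.
hidden_states = [[rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(3)]
weight = [[rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(HIDDEN_SIZE)]
bias = [0.0] * HIDDEN_SIZE

pooled = pooler(hidden_states[0], weight, bias)  # index 0 is the [CLS] position
print(len(pooled))
print(all(-1.0 < value < 1.0 for value in pooled))
```

Because of the tanh, every component of the pooled vector lies strictly between -1 and 1, unlike the raw `[CLS]` hidden state.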

```
Input:  [CLS]  I   am  a   ner ##d [SEP]
          ↓    ↓   ↓   ↓   ↓   ↓   ↓
Output:   h₀   h₁  h₂  h₃  h₄  h₅  h₆   ← last_hidden_state
          ↓
        pooler_output (from [CLS])
```

In [None]:
output['last_hidden_state']

### Checking the Output Shapes

For our batch of 2 sentences with 7 tokens each:

In [None]:
output['last_hidden_state'].shape

**Shape breakdown:** `[2, 7, 768]`
- `2` = batch size (2 sentences)
- `7` = sequence length (7 tokens per sentence)
- `768` = hidden dimension (BERT-base embedding size)

In [None]:
output['pooler_output'].shape

**Shape breakdown:** `[2, 768]`
- `2` = batch size (one pooled output per sentence)
- `768` = hidden dimension

---

## 📚 Key Takeaways

| Concept | Description |
|:--------|:------------|
| **WordPiece Tokenization** | Splits words into subwords (e.g., "nerd" → "ner" + "##d") |
| **Special Tokens** | `[CLS]` for classification, `[SEP]` for separation |
| **last_hidden_state** | Contextual embeddings for each token |
| **pooler_output** | Single embedding for the whole sentence |
| **Bidirectional** | Each token sees the full context (left AND right) |
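
The last row can be visualized as an attention pattern. The sketch below contrasts the full (bidirectional) pattern used by an encoder like BERT with the lower-triangular (causal) pattern used by decoder-only models; row `i` marks which positions token `i` may attend to. Both helper functions are illustrative and ignore padding.

```python
def bidirectional_mask(n):
    """Every token may attend to every position (encoder-style)."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Each token may attend only to itself and earlier positions (decoder-style)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in bidirectional_mask(4):
    print(row)
print()
for row in causal_mask(4):
    print(row)
```

In the bidirectional case the first token already sees the last one, which is why BERT is trained by masking tokens rather than by predicting the next token.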

---

## 🚀 Next Steps

- See `spam_classification.ipynb` for using BERT for text classification
- See `GPT2.ipynb` for a decoder-only model comparison