# 🧠 From Text to Tensors: A Deep Dive into Tokenization & Embeddings

Welcome! This notebook is your guide to understanding two of the most fundamental concepts in Natural Language Processing (NLP): **Tokenization** and **Embeddings**.

Before a machine learning model can understand human language, we need to convert raw text into a numerical format it can process. This pipeline involves two key steps:

1.  **Tokenization**: The process of breaking down a piece of text into smaller units called **tokens**. These can be words, subwords, or characters.
2.  **Embeddings**: The process of converting these tokens into numerical vectors. These vectors are designed to capture the semantic meaning and relationships between tokens.

Let's explore how this works in practice.

---

## Part 1: Tokenization - The Art of Splitting Text

Think of tokenization like chopping vegetables before cooking. You take a large, unstructured piece of text and break it into small, manageable pieces (tokens) that the model can digest.

### 1.1 - Simple Word Tokenization

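The most basic approach is to split text on whitespace. As the cell below shows, that leaves punctuation attached to words, so dedicated tokenizers also split punctuation into separate tokens.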

In [None]:
sentence = "NLP is fascinating, isn't it?"

# Using Python's built-in split()
basic_tokens = sentence.split(' ')
print(f"Basic split: {basic_tokens}")

# This is a bit naive. Notice how "fascinating," and "it?" still have punctuation.
# A better approach uses libraries like NLTK or spaCy.

# Install NLTK (if you haven't already)
!pip install nltk -q

import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer resources (newer NLTK releases look for 'punkt_tab')
nltk.download('punkt')
nltk.download('punkt_tab')

nltk_tokens = word_tokenize(sentence)
print(f"NLTK split: {nltk_tokens}")

**Problem with Word Tokenization**: What happens when the model encounters a word it has never seen before during training (an "out-of-vocabulary" or OOV word)? Also, how does it handle variations of a word like "run", "running", and "ran"? They are treated as completely separate tokens.
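To make the OOV problem concrete, here is a minimal sketch of a word-level vocabulary. The toy corpus, the `<unk>` placeholder, and the helper name `encode_words` are assumptions made for illustration; none of them come from NLTK.

```python
# Toy word-level vocabulary built from a tiny "training" corpus (illustrative only).
train_corpus = ["i like to run", "she ran home", "they like running"]

# Reserve ID 0 for a hypothetical <unk> placeholder that stands in for unseen words.
vocab = {"<unk>": 0}
for line in train_corpus:
    for word in line.split():
        vocab.setdefault(word, len(vocab))

def encode_words(text):
    """Map each whitespace-separated word to its ID, or to <unk> if unseen."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

# "jogging" never appeared in the corpus, so it collapses to <unk>.
print(encode_words("i like jogging"))

# "run", "ran", and "running" receive unrelated IDs despite sharing a root.
print(vocab["run"], vocab["ran"], vocab["running"])
```

Every unseen word maps to the same ID, so the model cannot tell one unknown word from another.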

This leads us to a more modern and effective approach: **Subword Tokenization**.

### 1.2 - Subword Tokenization

The core idea of subword tokenization is to break down rare words into smaller, meaningful parts, while keeping common words as single tokens.

For example, a word like `"tokenization"` might be split into `"token"` and `"##ization"`.

**Advantages**:
* **Handles OOV words**: It can represent any new word by combining known subwords.
* **Reduces vocabulary size**: The model only needs to store a vocabulary of subwords, which is much smaller than storing every single word in a language.
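* **Captures morphology**: Words like "running" and "runner" can share a common piece such as `"run"`, which lets the model reuse what it learns about the root. The splits are driven by frequency statistics rather than grammar, so they only approximate true morphemes.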
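The `"token"` + `"##ization"` split above can be reproduced with a greedy longest-match-first rule, which is how WordPiece-style tokenizers segment a word once the vocabulary is fixed. The sketch below uses a hand-picked toy vocabulary and a hypothetical helper `wordpiece_like`; it is not the Hugging Face implementation.

```python
# Hypothetical toy vocabulary; real WordPiece vocabularies hold tens of thousands of entries.
toy_vocab = {"token", "##ization", "##ize", "run", "##ning", "##ner", "[UNK]"}

def wordpiece_like(word, vocab):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

for w in ["tokenization", "running", "runner", "xyz"]:
    print(w, "->", wordpiece_like(w, toy_vocab))
```

With a realistic vocabulary that contains every single character, the `[UNK]` fallback is rarely needed.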

Popular algorithms include **Byte-Pair Encoding (BPE)** (used by GPT) and **WordPiece** (used by BERT). Today, we'll use the industry-standard `transformers` library from Hugging Face to see this in action.
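Before loading the real tokenizers, it can help to see how BPE learns its vocabulary: it repeatedly merges the most frequent adjacent pair of symbols. The following is a toy sketch with made-up word frequencies and a hypothetical helper `learn_bpe`. It is character-level for readability, whereas GPT-2's tokenizer works on bytes and adds further pre-tokenization rules.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict (toy version)."""
    # Start with every word as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

word_freqs = {"low": 5, "lower": 2, "lowest": 3}  # made-up counts
merges, segmented = learn_bpe(word_freqs, num_merges=2)
print(merges)
print(segmented)
```

After two merges the frequent stem `low` has become a single symbol, while the rarer suffixes are still spelled out character by character.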

In [None]:
# Install the transformers library
!pip install transformers -q

from transformers import BertTokenizer, GPT2Tokenizer

# Load a pre-trained tokenizer for BERT
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load a pre-trained tokenizer for GPT-2
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

Let's see how these tokenizers handle a complex sentence.

In [None]:
text = "Subword tokenization is powerfully effective."

# Tokenize with BERT's WordPiece tokenizer
bert_tokens = bert_tokenizer.tokenize(text)
print(f"BERT Tokens: {bert_tokens}")
# Check the output: 'powerfully' is likely split into 'powerful' and '##ly' ('##' marks a piece that continues a word)

# Tokenize with GPT-2's BPE tokenizer
gpt2_tokens = gpt2_tokenizer.tokenize(text)
print(f"GPT-2 Tokens: {gpt2_tokens}")
# Notice the 'Ġ' prefix on tokens that begin a new word: it is GPT-2's marker for a preceding space

### 1.3 - From Tokens to IDs

Models don't understand strings like `"token"` or `"##ization"`. They need numbers. So, the next step is to convert these tokens into unique integer IDs from the tokenizer's vocabulary.
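Conceptually, this step is a dictionary lookup. Here is a minimal sketch with a hypothetical six-entry vocabulary; the IDs are arbitrary and are not BERT's real ones.

```python
# Hypothetical mini-vocabulary mapping each token string to an integer ID.
token_to_id = {"[PAD]": 0, "[UNK]": 1, "token": 2, "##ization": 3, "is": 4, "fun": 5}
id_to_token = {idx: tok for tok, idx in token_to_id.items()}

def tokens_to_ids(tokens):
    """Look up each token, falling back to the [UNK] ID for unknown strings."""
    return [token_to_id.get(tok, token_to_id["[UNK]"]) for tok in tokens]

def ids_to_tokens(ids):
    """Invert the mapping to recover token strings from IDs."""
    return [id_to_token[i] for i in ids]

ids = tokens_to_ids(["token", "##ization", "is", "fun", "!"])
print(ids)
print(ids_to_tokens(ids))
```

The `"!"` token is not in this toy vocabulary, so it comes back as `[UNK]` after the round trip.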

In [None]:
# The encode() method combines tokenization and conversion to IDs
bert_input_ids = bert_tokenizer.encode(text)
print(f"BERT Input IDs: {bert_input_ids}")

# Let's see what these IDs correspond to
decoded_bert_tokens = bert_tokenizer.convert_ids_to_tokens(bert_input_ids)
print(f"Decoded BERT Tokens: {decoded_bert_tokens}")

# Notice the special tokens [CLS] and [SEP] that BERT adds automatically.
# [CLS] (Classification) is used for sentence-level tasks.
# [SEP] (Separator) is used to separate multiple sentences.
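For a sentence pair, BERT's input layout is `[CLS] A [SEP] B [SEP]`, with a segment ID telling the model which sentence each token belongs to. The sketch below assembles that layout by hand. `build_pair_input` is a hypothetical helper; the real tokenizer returns the segment IDs as `token_type_ids` when you pass it two texts.

```python
def build_pair_input(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] plus segment IDs (0 for A, 1 for B)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS] and the first [SEP] belong to segment 0; the final [SEP] to segment 1.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(["how", "are", "you"], ["fine", "thanks"])
for tok, seg in zip(tokens, segment_ids):
    print(f"{tok:8s} segment {seg}")
```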

---

## Part 2: Embeddings - Giving Meaning to Numbers

Now that we have token IDs, we need to represent them in a way that captures their meaning. The raw IDs (e.g., 101 or 1143) aren't enough: they are arbitrary vocabulary indices, so two IDs being numerically close says nothing about whether the tokens are related.

This is where **embeddings** come in. An embedding is a dense vector of real numbers that represents a token. 

**The Goal**: Tokens with similar meanings should have similar vectors. For example, the vectors for `"king"` and `"queen"` should be closer to each other than either is to the vector for `"car"`.
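"Closer" is usually measured with cosine similarity, the cosine of the angle between two vectors. Here is a small sketch using made-up 3-dimensional vectors; the numbers are chosen by hand for illustration and do not come from any trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-picked toy vectors (not learned embeddings).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "car":   [0.1, 0.2, 0.9],
}

print(f"king vs queen: {cosine(vectors['king'], vectors['queen']):.3f}")
print(f"king vs car:   {cosine(vectors['king'], vectors['car']):.3f}")
```

Values near 1 mean the vectors point in almost the same direction; values near 0 mean they are nearly orthogonal.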

### 2.1 - Why Not One-Hot Encoding?
A simple approach would be to create a huge vector for each token, with a `1` at the index corresponding to its vocabulary ID and `0`s everywhere else. 

**Problems**:
* **Sparsity & High Dimensionality**: If our vocabulary has 30,000 tokens, each vector would have 30,000 dimensions, which is computationally expensive.
* **No Semantic Relationship**: The vectors are orthogonal, meaning the model can't learn that `"cat"` is more similar to `"kitten"` than to `"airplane"`.
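Both problems are easy to see in a few lines. The sketch below builds one-hot vectors for a three-word toy vocabulary and shows that every distinct pair has a dot product of zero. It also illustrates, with a made-up embedding matrix, that multiplying a one-hot vector by a matrix simply selects one row, which is why embedding layers are implemented as table lookups.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vocab = ["cat", "kitten", "airplane"]
hot = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Every pair of distinct words is equally dissimilar: the dot product is always 0.
print(dot(hot["cat"], hot["kitten"]), dot(hot["cat"], hot["airplane"]))

# Made-up 3 x 2 embedding matrix: one dense 2-dimensional row per vocabulary entry.
embedding_matrix = [[0.2, 0.7], [0.3, 0.6], [0.9, -0.4]]

def vec_mat(vec, matrix):
    """Row vector times matrix."""
    cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(cols)]

# One-hot times matrix equals a plain row lookup.
print(vec_mat(hot["kitten"], embedding_matrix), embedding_matrix[1])
```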

### 2.2 - Contextual Embeddings with Transformers

Older models like Word2Vec and GloVe produced **static embeddings**, where each word had a single, fixed vector. 

Modern Transformer models like BERT and GPT produce **contextual embeddings**. The vector for a word changes depending on the sentence it's in. The embedding for `"bank"` will be different in `"river bank"` vs. `"money bank"`.
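The mechanism behind this is self-attention: each token's output vector is a weighted mix of the vectors around it. The sketch below is a heavily simplified illustration with made-up 2-dimensional vectors and no learned weights. Real BERT adds learned query/key/value projections, multiple heads, positional information, and many stacked layers, so treat this only as intuition.

```python
import math

# Made-up static vectors: one fixed vector per word, as in Word2Vec or GloVe.
static = {"river": [1.0, 0.0], "money": [0.0, 1.0], "bank": [0.5, 0.5]}

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(words):
    """One simplified self-attention step (identity projections, single head)."""
    xs = [static[w] for w in words]
    outputs = []
    for query in xs:
        # Score the query against every token, then mix the vectors by those weights.
        weights = softmax([sum(a * b for a, b in zip(query, key)) for key in xs])
        mixed = [sum(w * x[d] for w, x in zip(weights, xs)) for d in range(len(query))]
        outputs.append(mixed)
    return outputs

bank_in_river = contextualize(["river", "bank"])[1]
bank_in_money = contextualize(["money", "bank"])[1]

print("static bank:        ", static["bank"])
print("bank in 'river bank':", bank_in_river)
print("bank in 'money bank':", bank_in_money)
```

The static lookup returns the same vector in both phrases, while the mixed vector leans toward whichever neighbor is present.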

Let's generate some embeddings using a Transformer model.

In [None]:
import torch
from transformers import BertModel, BertTokenizer

# Load pre-trained model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Sentences with the same word ("bank") in different contexts
sentence1 = "I sat by the river bank."
sentence2 = "I need to go to the bank to deposit money."

In [None]:
# Tokenize the sentences
# We use padding to make both sequences the same length and return PyTorch tensors
inputs = tokenizer([sentence1, sentence2], padding=True, truncation=True, return_tensors="pt")

print("Input IDs:")
print(inputs['input_ids'])
print("\nAttention Mask:") # The attention mask tells the model which tokens are real and which are padding
print(inputs['attention_mask'])
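The padding and attention mask printed above follow a simple rule, sketched here by hand. `pad_batch` is a hypothetical helper and the ID sequences are made up; `pad_id=0` is an assumption that happens to match the `[PAD]` ID in `bert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad ID lists to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Made-up ID sequences of different lengths.
batch = [[7, 8, 9], [7, 8, 9, 10, 11]]
padded_ids, mask = pad_batch(batch)
print(padded_ids)
print(mask)
```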

In [None]:
# Get the embeddings from the model
# We don't need to calculate gradients for this, so we use torch.no_grad()
with torch.no_grad():
    outputs = model(**inputs)

# The embeddings are in the 'last_hidden_state' attribute
last_hidden_states = outputs.last_hidden_state

print(f"Shape of embeddings tensor: {last_hidden_states.shape}")
# Shape is (batch_size, sequence_length, hidden_dim)
# batch_size = 2 (we passed two sentences)
# sequence_length = 13 (the longer sentence has 11 tokens plus [CLS] and [SEP]; the shorter one is padded to match)
# hidden_dim = 768 (the size of the embedding vector for BERT-base)

Now, let's check that the embeddings for `"bank"` differ between the two sentences.

In [None]:
# Find the index of the token 'bank' in each sentence
tokens1 = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
tokens2 = tokenizer.convert_ids_to_tokens(inputs['input_ids'][1])

bank_index1 = tokens1.index('bank')
bank_index2 = tokens2.index('bank')

print(f"Index of 'bank' in sentence 1: {bank_index1}")
print(f"Index of 'bank' in sentence 2: {bank_index2}")

# Get the embedding vectors for 'bank' from each sentence
bank_embedding1 = last_hidden_states[0, bank_index1]
bank_embedding2 = last_hidden_states[1, bank_index2]

# Compare the two embeddings using Cosine Similarity
from torch.nn.functional import cosine_similarity

similarity = cosine_similarity(bank_embedding1.unsqueeze(0), bank_embedding2.unsqueeze(0))

print(f"\nFirst 5 values of 'bank' embedding 1: {bank_embedding1[:5]}")
print(f"First 5 values of 'bank' embedding 2: {bank_embedding2[:5]}")
print(f"\nCosine Similarity between the two 'bank' embeddings: {similarity.item():.4f}")

# Let's also check the similarity of 'bank' with 'river' in the first sentence
river_index1 = tokens1.index('river')
river_embedding1 = last_hidden_states[0, river_index1]
river_bank_similarity = cosine_similarity(bank_embedding1.unsqueeze(0), river_embedding1.unsqueeze(0))
print(f"Cosine Similarity between 'river' and 'bank' in sentence 1: {river_bank_similarity.item():.4f}")

# And the similarity of 'bank' with 'money' in the second sentence
money_index2 = tokens2.index('money')
money_embedding2 = last_hidden_states[1, money_index2]
money_bank_similarity = cosine_similarity(bank_embedding2.unsqueeze(0), money_embedding2.unsqueeze(0))
print(f"Cosine Similarity between 'money' and 'bank' in sentence 2: {money_bank_similarity.item():.4f}")

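The cosine similarity between the two `"bank"` vectors is below 1, so BERT assigned the same word a different vector in each sentence. The other two scores show how close each `"bank"` vector sits to a context word from its own sentence (`"river"` in the first, `"money"` in the second). The cell does not compare `"bank"` in sentence 1 with `"money"`, or `"bank"` in sentence 2 with `"river"`, so read these scores as evidence of context sensitivity rather than as a ranking. This is the power of contextual embeddings!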

---

## Part 3: Visualizing Embeddings

It's hard to imagine a 768-dimensional space. To make this more intuitive, we can use dimensionality reduction techniques like **PCA** to project these vectors down to 2D and plot them.
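To show what PCA does under the hood, here is a dependency-free sketch that finds the top two principal directions by power iteration on the covariance matrix. This is a swapped-in method chosen for readability (scikit-learn's `PCA` uses SVD instead), and the 3-dimensional points and the helper `pca_2d` are made up for illustration.

```python
import math

def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def pca_2d(rows, iters=100):
    """Project rows onto their top two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for _ in range(2):
        v = [1.0 / math.sqrt(d)] * d  # fixed start vector keeps the result deterministic
        for _step in range(iters):
            w = mat_vec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigval = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
        components.append(v)
        # Deflate: subtract the direction just found so the next pass finds the second one.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)] for a in range(d)]
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]

# Two made-up clusters in 3-D: the first two points vs. the last two.
points = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.0], [0.1, 0.0, 1.0], [0.0, 0.1, 0.9]]
projected = pca_2d(points)
for original, reduced in zip(points, projected):
    print(original, "->", [round(x, 3) for x in reduced])
```

The first coordinate separates the two clusters. Its sign is arbitrary, as with any PCA implementation.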

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import numpy as np
%matplotlib inline

# Words we want to visualize
words = [
    # Animals
    'cat', 'dog', 'kitten', 'puppy',
    # Royalty
    'king', 'queen', 'prince', 'princess',
    # Objects
    'car', 'truck', 'boat', 'plane'
]

# We need to get the embeddings for these specific words from BERT's embedding layer
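# Note: These are static, context-independent vectors from the input embedding table (before any Transformer layer)
# Any word missing from the vocabulary would map to [UNK], so check word_ids if you change the list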
word_ids = tokenizer.convert_tokens_to_ids(words)
word_embeddings = model.embeddings.word_embeddings.weight[word_ids, :].detach().numpy()

# Use PCA to reduce dimensions from 768 to 2
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(word_embeddings)

# Plot the results
plt.figure(figsize=(12, 10))
for i, word in enumerate(words):
    x, y = embeddings_2d[i, :]
    plt.scatter(x, y)
    plt.annotate(word, (x, y), xytext=(5, 2), textcoords='offset points', ha='left', va='bottom')  # label sits just right of the point
plt.title('2D PCA Visualization of Word Embeddings')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

You should see rough clusters forming: the animals tend to group together, as do the royalty terms and the vehicles. With only 12 points and two principal components the separation will not be perfect, but the plot gives a sense of the semantic relationships captured by the embeddings.

---

## Conclusion & Next Steps

Congratulations! You've successfully journeyed from raw text to meaningful numerical representations.

**You've learned**:
1.  **Tokenization**: How to split text into manageable tokens using word and subword strategies.
2.  **Vocabulary & IDs**: How to map tokens to numerical IDs.
3.  **Embeddings**: Why embeddings are superior to one-hot encoding for representing meaning.
4.  **Contextual Embeddings**: How models like BERT create dynamic representations of words based on their context.

These token embeddings are the inputs for virtually all modern NLP tasks, from **text classification** and **question answering** to **text generation**.