# C2: TEXT PROCESSING - VECTORIZATION AND WORD EMBEDDING

## Vectorization

- **Definition:** The process of converting text (words, sentences, documents) into numerical representations (vectors) so that ML models can process them.  
- **Need:** Machines cannot understand raw text; they work with numbers.

### Terminologies

- **Corpus:** A collection of documents.  
- **Vocabulary:** The set of unique words (tokens) in the corpus.  
- **Document-Term Matrix (DTM):** A matrix where rows are documents, columns are vocabulary words, and each cell holds the count (or weight) of that word in that document.  
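
To make these three terms concrete, here is a minimal pure-Python sketch. The toy corpus and the naive lowercase/whitespace tokenisation are assumptions chosen for illustration only.

```python
# Toy corpus: a list of documents (illustrative data, not a real dataset)
corpus = ["I love dogs", "I love cats", "Cats and dogs are great"]

# Naive tokenisation: lowercase, then split on whitespace
tokenized = [doc.lower().split() for doc in corpus]

# Vocabulary: the sorted set of unique tokens in the corpus
vocabulary = sorted({token for doc in tokenized for token in doc})

# Document-term matrix: one row per document, one column per vocabulary word
dtm = [[doc.count(word) for word in vocabulary] for doc in tokenized]

print(vocabulary)
for row in dtm:
    print(row)
```

Each row of `dtm` is the vector for one document, and the columns are in the same order as `vocabulary`.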

### Tokenization

- **Definition:** The process of breaking text into smaller units (tokens) before giving it to a model.  
- **Token:** A token can be a word, subword, or even a character depending on the tokenizer.  
- **Model Mapping:** The model converts each token into a number (ID) which maps to an embedding vector.  

#### Need for Tokenization

- Provides a consistent way to split text.  
- Builds a vocabulary list mapping tokens to numbers.  
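
The token-to-number mapping can be sketched in a few lines. This is a simplified illustration, not how any particular library does it: the `[UNK]` entry at ID 0 and the helper names `build_vocab` and `encode` are assumptions of this sketch.

```python
def build_vocab(texts):
    """Assign each unique token an integer ID in order of first appearance."""
    vocab = {"[UNK]": 0}  # ID 0 is reserved for unknown tokens (assumption of this sketch)
    for text in texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    """Convert text to a list of token IDs, mapping unseen tokens to [UNK]."""
    return [vocab.get(token, vocab["[UNK]"]) for token in text.lower().split()]

vocab = build_vocab(["I love NLP", "NLP is fun"])
ids = encode("I love chess", vocab)
print(vocab)
print(ids)  # "chess" was never seen, so it maps to the [UNK] ID
```

In a neural model, each of these IDs would then select one row of an embedding table.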

#### Types of Tokenization

1. **Word-Level Tokenization**  
   - Splits text by spaces/punctuation.  
   - Example: `"I love NLP"` $\to$ `["I", "love", "NLP"]`  
   - **Problem:** Huge vocabulary; cannot handle new/rare words.  

2. **Character-Level Tokenization**  
   - Each character is treated as a token.  
   - Example: `"Chat"` $\to$ `["C", "h", "a", "t"]`  
   - **Advantage:** Very flexible.  
   - **Disadvantage:** Sequences become very long.  

3. **Subword-Level Tokenization**  
   - Breaks rare words into smaller units while keeping common words intact.  
   - Example (WordPiece-style; the exact split depends on the tokenizer's vocabulary): `"unhappiness"` $\to$ `["un", "##happi", "##ness"]`  
     - `##` indicates that this piece continues a word.  
   - **Advantage:** Can handle new words by combining smaller pieces.  
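
The three strategies can be compared side by side. The subword function below is a simplified greedy longest-match splitter in the style of WordPiece, and `toy_vocab` is a hypothetical vocabulary picked so that the example above is reproduced. Real tokenizers learn their vocabulary from data.

```python
def word_tokenize(text):
    """Word-level: split on whitespace (punctuation handling omitted for brevity)."""
    return text.split()

def char_tokenize(text):
    """Character-level: every character is its own token."""
    return list(text)

def subword_tokenize(word, vocab):
    """Greedy longest-match-first split; continuation pieces carry a '##' prefix.

    Returns ['[UNK]'] if the word cannot be covered by the vocabulary.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (assumption for illustration)
toy_vocab = {"un", "##happi", "##ness", "happy"}

print(word_tokenize("I love NLP"))
print(char_tokenize("Chat"))
print(subword_tokenize("unhappiness", toy_vocab))
```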

### Common Techniques of Vectorization

Some widely used methods to represent text as vectors:

1. **One-Hot Encoding**  
   - Represents each word as a binary vector.  
   - Simple but results in high-dimensional sparse vectors.  
   - No semantic meaning captured.  

2. **Bag of Words (BoW)**  
   - Represents text by word counts or frequencies.  
   - Ignores word order.  
   - **Limitation:** Cannot capture context.  

3. **TF-IDF (Term Frequency–Inverse Document Frequency)**  
   - Adjusts word counts by reducing the weight of common words and increasing the weight of rare words.  
   - Better than BoW but still ignores word order.  

4. **Word Embeddings**  
   - Dense vector representations learned from large corpora (e.g., Word2Vec, GloVe).  
   - Capture semantic meaning (e.g., "king" – "man" + "woman" ≈ "queen").  

5. **Contextual Embeddings**  
   - Dynamic embeddings generated from context (e.g., BERT, GPT).  
   - The same word can have different vectors depending on its usage.  
   - State-of-the-art for modern NLP.  


#### One-Hot Encoding

- **Definition:** Each word in the vocabulary is represented as a binary vector of size equal to the vocabulary.  
- Only one position is `1` (indicating the word’s index), and all other positions are `0`.  

**Example:**  
- Vocabulary: `["cat", "dog", "bat"]`  
- `"cat"` $\to$ `[1, 0, 0]`  
- `"dog"` $\to$ `[0, 1, 0]`  
- `"bat"` $\to$ `[0, 0, 1]`  

**Advantages:**  
- Simple and easy to implement.  
- Works well for very small vocabularies.  

**Disadvantages:**  
- High dimensionality: if the vocabulary has 10,000 words, the vector length will also be 10,000.  
- Sparse representation (mostly zeros).  
- No semantic meaning: words like `"cat"` and `"dog"` are represented as completely independent, with no relationship captured.  
- Cannot handle new words (out-of-vocabulary issue).  


In [1]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample words
words = np.array(["cat", "dog", "bat", "dog", "cat"]).reshape(-1, 1)

# Initialize encoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform
one_hot = encoder.fit_transform(words)

print("Vocabulary: \n", encoder.categories_, "\n")
print("One hot vectors :\n", one_hot)

Vocabulary: 
 [array(['bat', 'cat', 'dog'], dtype='<U3')] 

One hot vectors :
 [[0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
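
The "no semantic meaning" and out-of-vocabulary drawbacks can be checked directly with a small pure-Python sketch. The helpers `one_hot` and `cosine` are written for this illustration and are not part of scikit-learn.

```python
import math

vocabulary = ["bat", "cat", "dog"]

def one_hot(word, vocabulary):
    """Return a list with a single 1 at the word's vocabulary index."""
    if word not in vocabulary:
        raise KeyError(f"'{word}' is out of vocabulary")
    return [1 if w == word else 0 for w in vocabulary]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

cat, dog = one_hot("cat", vocabulary), one_hot("dog", vocabulary)
print(cosine(cat, dog))  # distinct one-hot vectors share no non-zero position
print(cosine(cat, cat))
```

Every pair of distinct words has the same similarity, so the encoding carries no information about which words are related. Calling `one_hot("puppy", vocabulary)` raises an error because the word has no index.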


#### Bag of Words (BoW)

- **Definition:** Represents documents as vectors of word frequency counts.  
- **Vocabulary:** Built from all unique words in the dataset.  
- **Conversion:** Each document is transformed into a vector of counts corresponding to the vocabulary.  
- **Limitation:** Ignores the order of words and syntactic/semantic relationships.  

**Example:**  
Documents:  
1. `"I love dogs"`  
2. `"I love cats"`  

Vocabulary: `["I", "love", "dogs", "cats"]`  

- Doc1 = `[1, 1, 1, 0]`  
- Doc2 = `[1, 1, 0, 1]`  

**Advantages:**  
- Simple and intuitive.  
- Easy to implement.  
- Works well for small datasets and basic models.  

**Disadvantages:**  
- High-dimensional and sparse for large vocabularies.  
- No semantic meaning (e.g., `"dog"` and `"puppy"` are unrelated).  
- Cannot handle synonyms or polysemy (same word with different meanings).  
- Sensitive to stopwords and noise.  
- Ignores word order, so `"dog bites man"` and `"man bites dog"` look the same.  

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus (documents)
corpus = [
    "I love dogs",
    "I love cats",
    "Cats and dogs are great"
]

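# Initialize vectorizer; this token_pattern keeps single-character tokens such as "I",
# which the default pattern (two or more characters) would drop
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")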

# Fit and transform
X = vectorizer.fit_transform(corpus)

# Vocabulary
print("Vocabulary :", vectorizer.get_feature_names_out(), "\n")

# Document-term matrix
print("Bag of words representation :\n", X.toarray())

Vocabulary : ['and' 'are' 'cats' 'dogs' 'great' 'i' 'love'] 

Bag of words representation :
 [[0 0 0 1 0 1 1]
 [0 0 1 0 0 1 1]
 [1 1 1 1 1 0 0]]
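
The loss of word order is easy to demonstrate with the standard library alone. This sketch uses `collections.Counter` as a stand-in for a bag-of-words vector; the helper name `bag_of_words` is an assumption of the example.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens; order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a == b)  # the two sentences have identical counts
```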


#### Term Frequency - Inverse Document Frequency (TF-IDF)

- **Key Concepts:**  
  - **TF (Term Frequency):** How often a word appears in a document.  
  - **IDF (Inverse Document Frequency):** How rare a word is across all documents.  
  - **TF-IDF:** Highlights words that are frequent in one document but not common across all documents.  

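- **Formula (IDF):**  
  $\mathrm{IDF}(\text{word}) = \log \frac{\text{Total Documents}}{\text{Documents containing word}}$  
  Some implementations add smoothing (for example, adding 1 to the denominator) to avoid division by zero for unseen words. The worked example below uses the unsmoothed form with the natural logarithm.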

**Example:**  
- Doc1 = `"I love NLP"`  
- Doc2 = `"I love Deep learning"`  
- Vocabulary = `[I, love, NLP, Deep, learning]`  

**Term Frequency (TF):**  
- Doc1 = `[1, 1, 1, 0, 0]`  
- Doc2 = `[1, 1, 0, 1, 1]`  

**Inverse Document Frequency (IDF):**  
- `"I"`: $\log(2/2) = 0$ (appears in both docs)  
- `"love"`: $\log(2/2) = 0$ (appears in both docs)  
- `"NLP"`: $\log(2/1) \approx 0.693$ (only in Doc1)  
- `"Deep"`: $\log(2/1) \approx 0.693$ (only in Doc2)  
- `"learning"`: $\log(2/1) \approx 0.693$ (only in Doc2)  

**TF-IDF = TF × IDF:**  
- Doc1 = `[0, 0, 0.693, 0, 0]`  
- Doc2 = `[0, 0, 0, 0.693, 0.693]`  

The representation now gives higher weight to `"NLP"` in Doc1 and to `"Deep"` and `"learning"` in Doc2, while words that appear in every document, such as `"I"` and `"love"`, receive a weight of zero.  
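
The hand calculation can be reproduced in pure Python. This sketch uses the unsmoothed IDF, $\log(N / \mathrm{df})$ with the natural logarithm, which is what the numbers in the worked example correspond to. The documents are lowercased here as an assumption of the sketch.

```python
import math

docs = [["i", "love", "nlp"], ["i", "love", "deep", "learning"]]
vocabulary = ["i", "love", "nlp", "deep", "learning"]

n_docs = len(docs)
# Document frequency: number of documents that contain each word
df = {w: sum(w in doc for doc in docs) for w in vocabulary}
# Unsmoothed IDF with the natural logarithm
idf = {w: math.log(n_docs / df[w]) for w in vocabulary}

# TF-IDF = raw term count * IDF
tfidf = [[doc.count(w) * idf[w] for w in vocabulary] for doc in docs]
for row in tfidf:
    print([round(x, 3) for x in row])
```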

**Advantages:**  
- Reduces the importance of common words (stopwords).  
- Useful for information retrieval, search engines, and keyword extraction.  
- Simple and interpretable.  

**Disadvantages:**  
- Still ignores word order and context.  
- Rare words may get high weights even if not meaningful.  
- Weights are dataset-dependent (new documents may change scores).  


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample documents
docs = [
    "I love NLP",
    "I love deep learning",
]

# Create TF-IDF vectorizer; this token_pattern keeps single-character tokens such as "I"
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")

# Fit and transform
tfidf_matrix = vectorizer.fit_transform(docs)

# Convert the sparse matrix to a DataFrame for readability
df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

print(df)

       deep         i  learning      love       nlp
0  0.000000  0.501549  0.000000  0.501549  0.704909
1  0.576152  0.409937  0.576152  0.409937  0.000000
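
The values printed above differ from the hand-worked example (where `"i"` and `"love"` score zero) because `TfidfVectorizer`, with its default settings (`smooth_idf=True`, `norm='l2'`), uses a smoothed IDF, $\ln\frac{1 + N}{1 + \mathrm{df}} + 1$, and then scales each row to unit length. The following pure-Python sketch reproduces those numbers under that assumption about the defaults.

```python
import math

docs = [["i", "love", "nlp"], ["i", "love", "deep", "learning"]]
vocabulary = ["deep", "i", "learning", "love", "nlp"]  # alphabetical, as in the output above
n_docs = len(docs)

def smoothed_idf(word):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    df = sum(word in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

rows = []
for doc in docs:
    weights = [doc.count(w) * smoothed_idf(w) for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in weights))  # L2 norm of the row
    rows.append([x / norm for x in weights])

for row in rows:
    print([round(x, 6) for x in row])
```

Because of the `+ 1` term, words that occur in every document keep a non-zero weight instead of being zeroed out.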


#### Word Embeddings

- **Definition:** A way of representing words as dense, low-dimensional vectors where semantic meaning and relationships between words are captured.  
- **Core Idea:**  
  - Similar words → vectors close together.  
  - Dissimilar words → vectors farther apart.  
- **Need:** Captures semantic and syntactic relationships beyond simple frequency counts.  

**Key Properties:**  
- Each word is represented by a learned vector.  
- The number of dimensions is a chosen hyperparameter, but what each dimension encodes is not predefined; it is learned automatically during training.  
- Words used in similar contexts tend to have similar vectors (distributional hypothesis).  

**Examples of Word Embedding Models:**  

1. **Word2Vec**  
   - Two training methods:  
     - **CBOW (Continuous Bag of Words):** Predicts a target word from its surrounding context.  
     - **Skip-Gram:** Predicts surrounding context given a target word.  

2. **GloVe (Global Vectors)**  
   - Learns embeddings using word co-occurrence statistics across the entire corpus.  
   - Captures both local (context-based) and global statistical information.  

3. **FastText**  
   - Extension of Word2Vec that uses subword (character n-grams) information.  
   - Advantage: Can generate embeddings for out-of-vocabulary words (e.g., rare or misspelled words).  
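
The difference between CBOW and Skip-Gram lies in how training examples are formed from a sliding context window. The sketch below only generates those (input, label) pairs; it does not train a network, and the helper names are assumptions of this example.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window=2):
    """CBOW: the whole context is the input, the target word is the label."""
    return [(context, target) for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    """Skip-Gram: the target word is the input, each context word is a label."""
    return [(target, c) for target, context in context_windows(tokens, window) for c in context]

sentence = ["i", "love", "deep", "learning"]
print(cbow_examples(sentence, window=1))
print(skipgram_examples(sentence, window=1))
```

CBOW produces one example per position, while Skip-Gram produces one example per (target, context word) pair.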

**Limitations:**  
- **Static embeddings:** Each word has only one vector.  
  - Example: `"bank"` (river bank vs. money bank) gets the same representation.  
- Cannot fully capture polysemy (multiple meanings of the same word).  
- Do not adapt based on sentence context.  

**Advantages:**  
- Dense, low-dimensional, and computationally efficient compared to one-hot vectors.  
- Capture semantic relationships:  
  - `"king" – "man" + "woman" ≈ "queen"`  
- Widely used as pretrained embeddings (e.g., Google’s Word2Vec, Stanford’s GloVe).  
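
The analogy arithmetic can be illustrated with a toy example. The 2-dimensional vectors below are hand-made for this sketch (roughly "royalty" and "gender" axes) and are not taken from any trained model; real embeddings have tens to hundreds of learned dimensions.

```python
import math

# Hand-made vectors for illustration only (assumption, not learned values)
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.7],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.8],
    "apple": [-0.8, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two non-zero 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))
```

With pretrained vectors, the same lookup is usually done through a library's nearest-neighbour search rather than a loop over the whole vocabulary.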


In [4]:
from gensim.models import Word2Vec

# Sample corpus
sentences = [
    ["i", "love", "nlp"],
    ["i", "love", "deep", "learning"],
    ["nlp", "is", "fun"],
    ["deep", "learning", "is", "powerful"]
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=2, sg=1)

# Get vector for a word
print("Vector for 'nlp':\n", model.wv['nlp'])

# Find most similar words
print("\nMost similar to 'nlp':")
print(model.wv.most_similar('nlp'))


Vector for 'nlp':
 [-0.01723938  0.00733148  0.01037977  0.01148388  0.01493384 -0.01233535
  0.00221123  0.01209456 -0.0056801  -0.01234705 -0.00082045 -0.0167379
 -0.01120002  0.01420908  0.00670508  0.01445134  0.01360049  0.01506148
 -0.00757831 -0.00112361  0.00469675 -0.00903806  0.01677746 -0.01971633
  0.01352928  0.00582883 -0.00986566  0.00879638 -0.00347915  0.01342277
  0.0199297  -0.00872489 -0.00119868 -0.01139127  0.00770164  0.00557325
  0.01378215  0.01220219  0.01907699  0.01854683  0.01579614 -0.01397901
 -0.01831173 -0.00071151 -0.00619968  0.01578863  0.01187715 -0.00309133
  0.00302193  0.00358008]

Most similar to 'nlp':
[('love', 0.16563552618026733), ('learning', 0.1267007291316986), ('powerful', 0.08872983604669571), ('is', 0.011071977205574512), ('i', -0.027841337025165558), ('deep', -0.15515567362308502), ('fun', -0.2187293916940689)]


In [3]:
# Download pretrained GloVe (the archive contains several sizes; the 50d file is loaded below)
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove.6B.zip

import numpy as np

def load_glove(path):
    embeddings = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

glove = load_glove("../glove/glove.6B.50d.txt")
print("Vector for 'computer':", glove["computer"][:10])  # first 10 dims


Vector for 'computer': [ 0.079084 -0.81504   1.7901    0.91653   0.10797  -0.55628  -0.84427
 -1.4951    0.13418   0.63627 ]


In [7]:
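from gensim.models import FastText

# FastText expects tokenised sentences (a list of token lists), as in the Word2Vec cell.
# If `sentences` holds raw strings, gensim iterates over their characters and builds a
# single-character vocabulary, which is what the recorded output below reflects.
tokenized_sentences = [["i", "love", "nlp"], ["i", "love", "deep", "learning"],
                       ["nlp", "is", "fun"], ["deep", "learning", "is", "powerful"]]

model = FastText(tokenized_sentences, vector_size=50, window=3, min_count=1)

print("Vector for 'learning':\n", model.wv['learning'])
print("Most similar to 'learning':\n", model.wv.most_similar("learning"))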

Vector for 'learning':
 [-1.0521135e-03 -2.9445016e-05 -2.6681324e-04 -2.0629806e-03
 -1.5533755e-03  1.5935432e-03  2.3113489e-03 -8.0846180e-04
 -3.6553531e-03  1.6169866e-03  2.6812166e-04 -1.2354901e-03
  2.0685391e-05  1.8805226e-03  1.1954642e-03 -1.3146290e-03
  1.6609058e-03 -8.6206105e-04 -3.7580152e-04  3.7063458e-03
  1.4002168e-03 -1.0286489e-03 -3.4085761e-03  2.4692728e-03
  1.8361167e-03 -2.4178838e-03 -2.5847720e-03  2.7235423e-03
 -1.2194831e-03  5.0885845e-03  3.2921617e-03  2.9445691e-03
 -2.5521258e-03  1.1273614e-03 -8.8686880e-04 -1.2647441e-03
  9.6110343e-05  1.0053181e-03  4.1550426e-03  4.9981056e-03
  3.1284827e-03  2.3335047e-04  3.5504177e-03  2.7294530e-04
 -1.5858869e-03 -1.0548470e-03 -3.9235768e-03 -4.2642830e-03
 -2.5969148e-03  9.5642585e-04]
Most similar to 'learning':
 [('A', 0.11139702051877975), (' ', 0.10953480750322342), ('d', 0.06675366312265396), ('h', 0.057259414345026016), ('.', 0.027056237682700157), ('p', 0.015873564407229424), ('s', 0.012
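
FastText's ability to embed out-of-vocabulary words comes from character n-grams. The sketch below shows only the n-gram extraction step, with the word wrapped in `<` and `>` boundary markers. The helper `char_ngrams` is written for this illustration, and the 3 to 6 range is assumed to mirror gensim's default `min_n` and `max_n`.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("where", n_min=3, n_max=3))

# A misspelled word still shares many n-grams with the correctly spelled one
known = set(char_ngrams("learning"))
unseen = set(char_ngrams("learnning"))
print(len(known & unseen), "shared n-grams")
```

Since a word's vector is built from the vectors of its n-grams, an unseen word that overlaps with known words still receives a usable embedding.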

#### Contextual Embeddings

- **Definition:** Word representations that change depending on the surrounding context (sentence/paragraph).  
- **Key Idea:** Unlike static embeddings (Word2Vec, GloVe), the same word can have different vectors depending on its meaning in context.  

**Examples of Models:**  
- **ELMo (Embeddings from Language Models):** Generates context-dependent embeddings using bidirectional LSTMs.  
- **BERT (Bidirectional Encoder Representations from Transformers):** Uses a Transformer encoder so that each token's embedding is conditioned on both its left and right context.  
- **GPT (Generative Pre-trained Transformer):** Produces contextual embeddings in a unidirectional (left-to-right) manner.  

**Example Sentences:**  
- `"I deposited money in the bank"` → vector close to **finance**.  
- `"We sat by the bank of the river"` → vector close to **nature**.  

**Uses / Advantages:**  
- Captures **polysemy** (multiple meanings of the same word).  
- Understands **word order, syntax, and grammar**.  
- Provides **dynamic embeddings** that adapt to context.  
- Learned using **large neural networks** (LSTMs, Transformers) trained on massive corpora.  
- Powers modern NLP tasks:  
  - Machine Translation  
  - Question Answering  
  - Sentiment Analysis  
  - Named Entity Recognition  

**Limitations:**  
- Computationally expensive to train and use.  
- Requires large datasets and high-performance hardware.  
- Interpretability is lower compared to simple models.  


In [5]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load pretrained BERT Model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Sentences with "Apple" in different meanings
sentences = [
    "I ate an apple.",
    "Apple released a new iPhone."
]

# Tokenize sentences
inputs = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)

# Get embeddings from BERT
with torch.no_grad():
    outputs = model(**inputs)
    # outputs[0] = last hidden states for each token
    last_hidden_states = outputs.last_hidden_state

# Extract embeddings for the word "apple"
for i, sentence in enumerate(sentences):
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][i])
    print(f"\nSentence: {sentence}")
    for idx, token in enumerate(tokens):
        if "apple" in token.lower():  # match token containing "apple"
            print(f"Token: {token}")
            print(f"Embedding vector shape: {last_hidden_states[i, idx].shape}")
            print(f"First 5 values: {last_hidden_states[i, idx][:5]}")


Sentence: I ate an apple.
Token: apple
Embedding vector shape: torch.Size([768])
First 5 values: tensor([ 0.1211,  0.7320, -0.5054, -0.6165,  1.0468])

Sentence: Apple released a new iPhone.
Token: apple
Embedding vector shape: torch.Size([768])
First 5 values: tensor([ 0.5733,  0.1726, -0.2070, -0.3598,  0.6186])


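## Pre-trained Embeddings in Keras

- Keras allows pre-trained embeddings to be loaded into neural networks. The first step is to convert text into padded integer sequences, whose IDs index the rows of the embedding matrix.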

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

docs = ["I love NLP", "NLP is fun", "I love machine learning"]

# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
sequences = tokenizer.texts_to_sequences(docs)
word_index = tokenizer.word_index

print("Word Index:", word_index)
print("Sequences:", sequences)

# Padding
padded = pad_sequences(sequences, padding='post')
print("Padded Sequences:\n", padded)

Word Index: {'i': 1, 'love': 2, 'nlp': 3, 'is': 4, 'fun': 5, 'machine': 6, 'learning': 7}
Sequences: [[1, 2, 3], [3, 4, 5], [1, 2, 6, 7]]
Padded Sequences:
 [[1 2 3 0]
 [3 4 5 0]
 [1 2 6 7]]
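
The step that connects these integer sequences to pre-trained vectors is building an embedding matrix with one row per token ID. The sketch below does this in pure Python. The 3-dimensional `pretrained` vectors are hypothetical values made up for illustration; in practice they would come from a file such as GloVe.

```python
# Word index in the format printed above (index 0 is reserved for padding)
word_index = {"i": 1, "love": 2, "nlp": 3, "is": 4, "fun": 5, "machine": 6, "learning": 7}

# Hypothetical "pretrained" vectors (assumption for illustration)
pretrained = {
    "i": [0.1, 0.2, 0.3],
    "love": [0.4, 0.5, 0.6],
    "nlp": [0.7, 0.8, 0.9],
    "learning": [0.2, 0.1, 0.0],
}
embedding_dim = 3

# One row per token ID; rows stay zero for padding and for words without a pretrained vector
embedding_matrix = [[0.0] * embedding_dim for _ in range(len(word_index) + 1)]
for word, idx in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[idx] = list(vector)

# Look up the vectors for one padded sequence
padded_sequence = [1, 2, 3, 0]
embedded = [embedding_matrix[token_id] for token_id in padded_sequence]
print(embedded)
```

In Keras, a matrix like this is typically supplied as the initial weights of an `Embedding` layer with `input_dim=len(word_index) + 1`, often with training of that layer disabled so the pre-trained vectors stay fixed.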


In [None]:
# Use Case: Sentiment Analysis (with TF-IDF + Logistic Regression)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["I love this movie", "This film is terrible", "Amazing storyline", "Worst acting ever"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(X, labels)

# Test
test = ["I love this film", "movie is worst"]
X_test = vectorizer.transform(test)
print(model.predict(X_test))  # [1, 0]


[1 0]
