# TEXT PROCESSING - VECTORISATION AND WORD EMBEDDING

## Vectorisation

- **Definition :** It is the process of converting text (words, sentences, documents) into numerical representations (vectors) so that the ML model can process them
- Need: machines can't process raw text; they work with numbers

### Terminologies

- Corpus: A collection of documents
- Vocabulary: Set of unique words in the corpus
- Document-term matrix: Representation where rows are documents, columns are vocabulary words, and each cell holds the count (or weight) of that word in that document
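
The three terms above can be sketched on a toy corpus. This is a minimal illustration in plain Python; the lowercase whitespace split is an assumption made for brevity, not a recommended tokenizer.

```python
# Corpus: a collection of documents (here, two short sentences).
corpus = ["I love dogs", "I love cats"]

# Naive tokenization: lowercase, then split on whitespace (assumption for this sketch).
tokenized = [doc.lower().split() for doc in corpus]

# Vocabulary: the sorted set of unique words across the whole corpus.
vocabulary = sorted({word for doc in tokenized for word in doc})

# Document-term matrix: one row per document, one column per vocabulary word,
# each cell holding how often that word occurs in that document.
doc_term_matrix = [[doc.count(word) for word in vocabulary] for doc in tokenized]

print(vocabulary)
for row in doc_term_matrix:
    print(row)
```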

### Tokenization

- **Definition :** It is the process of breaking text into smaller units before giving it to a model
- A token can be a word, subword or even a character, depending on the tokenizer
- The model converts each token into a number (ID), which maps to an embedding vector

#### Need of tokenization

- Consistent way to chop up text
- Vocabulary list mapping tokens to numbers
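
A rough sketch of the token-to-ID mapping described above. The helper names `build_vocab` and `encode`, and the `[UNK]` placeholder for unseen tokens, are illustrative assumptions rather than any particular library's API.

```python
def build_vocab(token_lists, unk_token="[UNK]"):
    """Assign an integer ID to every distinct token; ID 0 is reserved for unknown tokens."""
    vocab = {unk_token: 0}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, unk_token="[UNK]"):
    """Map tokens to IDs; tokens missing from the vocabulary fall back to the unknown ID."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


training_tokens = [["i", "love", "nlp"], ["i", "love", "deep", "learning"]]
vocab = build_vocab(training_tokens)

# "robots" was never seen while building the vocabulary, so it maps to [UNK].
ids = encode(["i", "love", "robots"], vocab)
print(vocab)
print(ids)
```

In a real model, each ID would then index a row of an embedding matrix.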

#### Types of tokenization

1. Word level tokenization
    - Splits text by spaces/punctuation
    - Eg: "I love NLP" $\to$ ["I", "love", "NLP"]
    - Problem: Huge vocabulary, can't handle new/rare words

2. Character level tokenization
    - Each character is a token
    - "Chat" $\to$ ["C", "h", "a", "t"]
    - Very flexible, but sequence becomes long

3. Subword-level tokenization
    - Breaks rare words into smaller units while keeping common words intact
    - "unhappiness" $\to$ ["un", "##happi", "##ness"]
        - `##`: this piece continues a word
    - Advantage: Can handle new words by combining pieces
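
The three tokenization levels can be compared side by side. The sketch below is illustrative only: the subword function is a simplified greedy longest-match split in the style of WordPiece, and `toy_vocab` is a hypothetical hand-written vocabulary, whereas real tokenizers learn theirs from data.

```python
import re


def word_tokenize(text):
    """Word level: runs of word characters, plus each punctuation mark on its own."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenize(text):
    """Character level: every character is its own token."""
    return list(text)


def subword_tokenize(word, vocab):
    """Greedy longest-match-first split of one word.

    Pieces that continue a word carry the '##' prefix. Returns ['[UNK]']
    when some part of the word cannot be matched against the vocabulary.
    """
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##happi", "##ness", "happy"}  # hypothetical vocabulary

print(word_tokenize("I love NLP"))
print(char_tokenize("Chat"))
print(subword_tokenize("unhappiness", toy_vocab))
```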

### Common techniques of Vectorisation

Some common techniques to vectorize text:
- One hot encoding
- Bag of words
- TF-IDF
- Word embeddings
- Contextual embeddings

#### One-Hot Encoding

- Each word in the vocabulary is represented as a vector of size equal to the vocabulary
- Exactly one position is 1; all others are 0
- Eg: Consider the vocabulary
    - vocab = ["cat", "dog", "bat"]
    - "cat" $\to$ [1, 0, 0]
    - "dog" $\to$ [0, 1, 0]
    - "bat" $\to$ [0, 0, 1]
- Advantages
    - Simple and easy
    - good for small vocabularies
- Disadvantages
    - With a vocabulary of 10,000 words, every vector has length 10,000 with a single non-zero entry, which is very sparse and memory heavy

In [1]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample words
words = np.array(["cat", "dog", "bat", "dog", "cat"]).reshape(-1, 1)

# Initialize encoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform
one_hot = encoder.fit_transform(words)

print("Vocabulary: \n", encoder.categories_, "\n")
print("One hot vectors :\n", one_hot)

Vocabulary: 
 [array(['bat', 'cat', 'dog'], dtype='<U3')] 

One hot vectors :
 [[0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]


#### Bag of words

- Represents documents as word frequency counts
- Vocabulary is built from all unique words in the dataset
- Each document is converted into a vector of counts
- Order of words is ignored
- Eg: 
    - Document:
        1. "I love dogs"
        2. "I love cats"
    - Vocabulary will be ["I", "love", "dogs", "cats"]
    - Doc1 = [1, 1, 1, 0]
    - Doc2 = [1, 1, 0, 1]

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus (documents)
corpus = [
    "I love dogs",
    "I love cats",
    "Cats and dogs are great"
]

# Initialize vectorizer
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")

# Fit and transform
X = vectorizer.fit_transform(corpus)

# Vocabulary
print("Vocabulary :", vectorizer.get_feature_names_out(), "\n")

# Document-term matrix
print("Bag of words representation :\n", X.toarray())

Vocabulary : ['and' 'are' 'cats' 'dogs' 'great' 'i' 'love'] 

Bag of words representation :
 [[0 0 0 1 0 1 1]
 [0 0 1 0 0 1 1]
 [1 1 1 1 1 0 0]]


#### Term frequency - Inverse document frequency (TF-IDF)

- Key concept:
    - TF: How often a word appears in the document
    - IDF: How rare a word is across all documents
    - TF-IDF: Highlights words that are frequent in one document but are not common everywhere
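- Formula: Inverse document frequency (unsmoothed form with natural log, as used in the example below)
    - $\mathrm{IDF}(\text{word}) = \log \frac{\text{Total docs}}{\text{Docs containing word}}$
    - Some implementations add 1 to the denominator (or smooth the ratio in other ways) to avoid division by zero for unseen words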
- Eg:
    - Doc1 = "I love NLP"
    - Doc2 = "I love Deep learning"
    - vocab = (I, love, NLP, Deep, learning)
    - Term frequency
        - Doc1 = [1, 1, 1, 0, 0]
        - Doc2 = [1, 1, 0, 1, 1]
    - Inverse document frequency (Formula)
        - "I": log(2/2) = 0 (appears in both docs)
        - "love": log(2/2) = 0 (appears in both docs)
        - "NLP": log(2/1) $\approx$ 0.693 (only in Doc1)
        - "Deep": log(2/1) $\approx$ 0.693 (only in Doc2)
        - "learning": log(2/1) $\approx$ 0.693 (only in Doc2)
    - TF-IDF: TF $\times$ IDF
        - Doc1 = [(1 $\times$ 0), (1 $\times$ 0), (1 $\times$ 0.693), (0 $\times$ 0.693), (0 $\times$ 0.693)] <br>
        => [0, 0, 0.693, 0, 0]
        - Doc2 = [(1 $\times$ 0), (1 $\times$ 0), (0 $\times$ 0.693), (1 $\times$ 0.693), (1 $\times$ 0.693)] <br>
        => [0, 0, 0, 0.693, 0.693]
    - Now the model knows that "NLP" in Doc1 and "Deep" & "learning" in Doc2 are distinctive
    - Words common to every document ("I", "love") get a weight of 0 and are effectively ignored
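
The hand calculation can be checked with a few lines of plain Python. This sketch assumes the unsmoothed form $\log(N / df)$ with natural log, which is what the worked example uses, and raw counts as term frequency.

```python
import math

docs = ["I love NLP", "I love Deep learning"]
vocab = ["I", "love", "NLP", "Deep", "learning"]
tokenized = [doc.split() for doc in docs]

# Term frequency: raw count of each vocabulary word in each document.
tf = [[tokens.count(word) for word in vocab] for tokens in tokenized]

# Document frequency: number of documents that contain each word.
n_docs = len(docs)
df = [sum(word in tokens for tokens in tokenized) for word in vocab]

# Unsmoothed inverse document frequency: ln(N / df).
idf = [math.log(n_docs / d) for d in df]

# TF-IDF: element-wise product of TF and IDF.
tfidf = [[t * w for t, w in zip(row, idf)] for row in tf]

for row in tfidf:
    print([round(value, 3) for value in row])
```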

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample documents
docs = [
    "I love NLP",
    "I love deep learning",
]

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")

# Fit and transform
tfidf_matrix = vectorizer.fit_transform(docs)

# Convert to a DataFrame for readability
df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

print(df)

       deep         i  learning      love       nlp
0  0.000000  0.501549  0.000000  0.501549  0.704909
1  0.576152  0.409937  0.576152  0.409937  0.000000
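
These values differ from the hand calculation because `TfidfVectorizer`, with its default settings, uses a smoothed IDF, $\ln\frac{1 + N}{1 + df} + 1$, and then scales each row to unit L2 norm. As a result, words that appear in every document keep a non-zero weight. The sketch below reproduces the printed numbers in plain Python; it assumes those defaults (`smooth_idf=True`, `norm='l2'`) and is not the library's actual implementation.

```python
import math

docs = ["I love NLP", "I love deep learning"]
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {}
for word in vocab:
    doc_freq = sum(word in tokens for tokens in tokenized)
    idf[word] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

rows = []
for tokens in tokenized:
    # Raw TF-IDF, then L2 normalisation so each row has unit length.
    raw = [tokens.count(word) * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    rows.append([value / norm for value in raw])

print(vocab)
for row in rows:
    print([round(value, 6) for value in row])
```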


#### Word embeddings

- **Definition :** A way of representing words as dense vectors where semantic meaning and relationships between words are captured
- similar words: vectors close together
- dissimilar words: vectors far apart
- Need: captures semantic relationships
- Key Idea:
    - Each word is represented by a vector
    - The dimensions are learned automatically
    - Context matters as words used in similar contexts have similar vectors
- Eg:
    - Word2Vec
        1. CBOW: Continuous bag of words, predicts word from its context
        2. Skip-Gram: predicts context from a word
    - GloVe: Global Vectors, uses word co-occurrence statistics across the whole corpus
    - FastText: similar to Word2Vec but uses subword information
- Limitation: static embedding
    - each word has only one vector (eg: "bank" gets the same vector in "river bank" and "money bank")
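
To make the CBOW and Skip-Gram distinction concrete, the sketch below generates the training examples each objective would see for one sentence. The helper names are hypothetical, and the fixed window of 1 is an assumption; real implementations add refinements such as subsampling and negative sampling that are omitted here.

```python
def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: Skip-Gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """(context_words, center) pairs: CBOW predicts the center word from its context."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


sentence = ["i", "love", "deep", "learning"]
print(skipgram_pairs(sentence))
print(cbow_pairs(sentence))
```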

In [5]:
from gensim.models import Word2Vec

# Sample corpus
sentences = [
    ["i", "love", "nlp"],
    ["i", "love", "deep", "learning"],
    ["nlp", "is", "fun"],
    ["deep", "learning", "is", "powerful"]
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=2, sg=1)

# Get vector for a word
print("Vector for 'nlp':\n", model.wv['nlp'])

# Find most similar words
print("\nMost similar to 'nlp':")
print(model.wv.most_similar('nlp'))


Vector for 'nlp':
 [-0.01723938  0.00733148  0.01037977  0.01148388  0.01493384 -0.01233535
  0.00221123  0.01209456 -0.0056801  -0.01234705 -0.00082045 -0.0167379
 -0.01120002  0.01420908  0.00670508  0.01445134  0.01360049  0.01506148
 -0.00757831 -0.00112361  0.00469675 -0.00903806  0.01677746 -0.01971633
  0.01352928  0.00582883 -0.00986566  0.00879638 -0.00347915  0.01342277
  0.0199297  -0.00872489 -0.00119868 -0.01139127  0.00770164  0.00557325
  0.01378215  0.01220219  0.01907699  0.01854683  0.01579614 -0.01397901
 -0.01831173 -0.00071151 -0.00619968  0.01578863  0.01187715 -0.00309133
  0.00302193  0.00358008]

Most similar to 'nlp':
[('love', 0.16563552618026733), ('learning', 0.1267007291316986), ('powerful', 0.08872983604669571), ('is', 0.011071977205574512), ('i', -0.027841337025165558), ('deep', -0.15515567362308502), ('fun', -0.2187293916940689)]
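
`most_similar` ranks words by cosine similarity. With a corpus of only four sentences, the scores above most likely reflect random initialisation more than meaning. The sketch below shows the measure itself in plain Python; the three vectors are hypothetical values chosen for illustration and do not come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings, for illustration only.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print("cat vs dog:", round(cosine_similarity(cat, dog), 3))
print("cat vs car:", round(cosine_similarity(cat, car), 3))
```

Vectors pointing in similar directions score close to 1, which is the sense in which similar words have vectors "close together".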


#### Contextual embeddings

- The same word gets a different vector depending on the surrounding words
- Example models: BERT, ELMo, GPT
- Eg: 
    - I deposited money in the bank <br>
    -> Vector closer to finance
    - We sat by the bank of the river <br>
    -> Vector closer to nature
- Properties:
    - They capture polysemy (one word with several meanings)
    - They are sensitive to word order and grammar
    - They are learned by large neural networks trained on massive text corpora

In [9]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load pretrained BERT Model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Sentences with "Apple" in different meanings
sentences = [
    "I ate an apple.",
    "Apple released a new iPhone."
]

# Tokenize sentences
inputs = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)

# Get embeddings from BERT
with torch.no_grad():
    outputs = model(**inputs)
    # outputs[0] = last hidden states for each token
    last_hidden_states = outputs.last_hidden_state

# Extract embeddings for the word "apple"
for i, sentence in enumerate(sentences):
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][i])
    print(f"\nSentence: {sentence}")
    for idx, token in enumerate(tokens):
        if "apple" in token.lower():  # match token containing "apple"
            print(f"Token: {token}")
            print(f"Embedding vector shape: {last_hidden_states[i, idx].shape}")
            print(f"First 5 values: {last_hidden_states[i, idx][:5]}")


Sentence: I ate an apple.
Token: apple
Embedding vector shape: torch.Size([768])
First 5 values: tensor([ 0.1211,  0.7320, -0.5054, -0.6165,  1.0468])

Sentence: Apple released a new iPhone.
Token: apple
Embedding vector shape: torch.Size([768])
First 5 values: tensor([ 0.5733,  0.1726, -0.2070, -0.3598,  0.6186])
