# Experiment 7: Feature Encoding

In machine learning, models require **numerical inputs**, but datasets often contain **categorical or text data**.  
Encoding is the process of converting **categorical or textual features into numeric representations** so that ML models can process them.

- **Categorical Encoding**: Converts discrete categories (e.g., color, species) into numbers.  
- **Text Encoding**: Converts text (sentences, documents) into numeric vectors.  
- The choice of encoding affects model performance: a poor encoding can hide useful structure or impose an order that the data does not have.

# Categorical Encoding Techniques

Common techniques for categorical data:

1. **One-Hot Encoding**
   - Creates a binary column for each category.
   - No ordinal assumptions are made.
   - Example: Color = {Red, Blue, Green} → [1,0,0], [0,1,0], [0,0,1]

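2. **Label Encoding**
   - Assigns a unique integer to each category.
   - Suitable for **ordinal features** (where order matters), provided the integers follow the category order; on nominal features the integers impose an order that does not exist.
   - Example: Size = {Small, Medium, Large} → 0, 1, 2

3. **Binary/Ordinal Encoding**
   - Efficient for **high-cardinality categorical features**.
   - Binary encoding writes each category's integer code in binary digits, so $k$ categories need about $\log_2 k$ columns instead of the $k$ columns of One-Hot.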
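
A minimal sketch of the three techniques above in plain Python. The helper names (`one_hot`, `ordinal_encode`, `binary_encode`) are hypothetical and exist only for illustration; in practice `pandas.get_dummies` or scikit-learn's `OneHotEncoder` and `OrdinalEncoder` are the usual tools.

```python
def one_hot(values, categories):
    """One binary column per category; exactly one 1 per row."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def ordinal_encode(values, ordered_categories):
    """Map each category to its position in an explicitly ordered list."""
    index = {c: i for i, c in enumerate(ordered_categories)}
    return [index[v] for v in values]

def binary_encode(values, categories):
    """Write each category's integer code in binary digits."""
    index = {c: i for i, c in enumerate(categories)}
    # Number of bits needed to represent the largest code (at least 1)
    width = max(1, (len(categories) - 1).bit_length())
    return [[(index[v] >> bit) & 1 for bit in reversed(range(width))] for v in values]

colors = ["Red", "Blue", "Green", "Blue"]
sizes = ["Small", "Large", "Medium"]

color_one_hot = one_hot(colors, ["Red", "Blue", "Green"])
size_codes = ordinal_encode(sizes, ["Small", "Medium", "Large"])
color_binary = binary_encode(colors, ["Red", "Blue", "Green"])

print(color_one_hot)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
print(size_codes)     # [0, 2, 1]
print(color_binary)   # [[0, 0], [0, 1], [1, 0], [0, 1]]
```

With three colors, One-Hot needs three columns while the binary code needs two; the saving grows with the number of categories.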

# Text Encoding Techniques

Text data requires encoding to be used in ML models:

1. **Bag of Words (BoW)**
   - Represents text as **word frequency vectors**.
   - Ignores word order and semantics.

2. **TF-IDF (Term Frequency–Inverse Document Frequency)**
   - Weighted version of BoW.
   - Common words get lower weight, rare words get higher weight.

3. **Word2Vec**
   - Neural network-based embeddings.
   - Captures **semantic meaning**; similar words are close in vector space.
   - Example: "King - Man + Woman ≈ Queen"

4. **FastText**
   - Extends Word2Vec using **subword information** (character n-grams).
   - Handles rare or unknown words better.
   - Useful for morphologically rich languages.

# Summary of Encoding

- **Categorical Encoding** → One-Hot, Label, Binary/Ordinal for discrete features.  
- **Text Encoding** → BoW, TF-IDF for classical ML; Word2Vec, FastText for modern NLP.  
- Proper encoding ensures **numerical representation** of all features, enabling ML models to learn patterns effectively.
- Scaling (from Experiment 6) is usually applied **after encoding**, if needed.

# Step 1: Loading Text Dataset

In this step, we will load the text data file `Text_Data.json`.  

- Each entry in the file represents a separate sentence.  
- We will read the file and store it as a **list of sentences** or a **Pandas DataFrame** for easier processing in later encoding steps.

In [1]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

ModuleNotFoundError: No module named 'google'

In [3]:
import pandas as pd
import json
import os
os.chdir("/content/drive/MyDrive/Classwork_UPES/EoAIML/Classwork/Codes")
# Load the text data file
with open('Text_Data.json','r') as file:
    sentences = json.load(file)

# Convert to DataFrame for easier processing
df_text = pd.DataFrame({'Sentence': sentences})

# Display first 5 sentences
print("First 5 sentences in the dataset:\n")
print(df_text.head())

RuntimeError: CPU dispatcher tracer already initlized

# Bag of Words (BoW) Encoding

**Bag of Words (BoW)** is one of the simplest and most commonly used text encoding techniques for classical machine learning.  

### Concept:
- Each sentence/document is represented as a **vector of word frequencies**.  
- The **order of words is ignored**; only the presence or frequency matters.  
- Creates a **vocabulary** of all unique words across the dataset.  

### Representation:
1. Let the dataset have **N sentences**:  
   $$
   D = \{S_1, S_2, ..., S_N\}
   $$  
2. Construct the **vocabulary V** containing all unique words:  
   $$
   V = \{w_1, w_2, ..., w_m\}
   $$  
   where $m$ = total number of unique words.  
3. For each sentence $S_i$, represent it as a **vector of length m**:  
   $$
   \mathbf{v}_i = [f(w_1, S_i), f(w_2, S_i), ..., f(w_m, S_i)]
   $$  
   where $f(w_j, S_i)$ = frequency of word $w_j$ in sentence $S_i$.  

### Key Points:
- **Sparse Representation:** Most sentences use only a small subset of vocabulary → many zeros.  
- **Simple & interpretable:** Easy to implement and visualize.  
- **Limitation:** Does **not capture word order or semantic meaning**.  
- Often used as input to classical ML models like **Naive Bayes, Logistic Regression, SVM**.

# Example

Consider 5 sentences:

1. "I love NLP"  
2. "NLP loves me"  
3. "I love machine learning"  
4. "Machine learning is fun fun"   ← "fun" appears twice  
5. "I love love NLP NLP NLP"       ← "love" appears twice, "NLP" appears thrice

### Step 1: Build Vocabulary
List all unique words, ignoring case (so "Machine" and "machine" count as the same word):  
V = {I, love, NLP, loves, me, machine, learning, is, fun}

### Step 2: BoW Vectors (frequency counts)

| Sentence                    | I | love | NLP | loves | me | machine | learning | is | fun |
|------------------------------|---|------|-----|-------|----|---------|----------|----|-----|
| I love NLP                   | 1 | 1    | 1   | 0     | 0  | 0       | 0        | 0  | 0   |
| NLP loves me                 | 0 | 0    | 1   | 1     | 1  | 0       | 0        | 0  | 0   |
| I love machine learning      | 1 | 1    | 0   | 0     | 0  | 1       | 1        | 0  | 0   |
| Machine learning is fun fun  | 0 | 0    | 0   | 0     | 0  | 1       | 1        | 1  | 2   |
| I love love NLP NLP NLP      | 1 | 2    | 3   | 0     | 0  | 0       | 0        | 0  | 0   |

### Notes:
- Each **row** represents a sentence vector.  
- Each **column** represents a word in the vocabulary.  
- Repeated words are counted multiple times → reflected in frequency.  
- This is the **core idea of Bag of Words**: frequency-based numeric representation.

# Algorithm

1. **Input:** Collection of sentences/documents.
2. **Preprocessing:** Lowercase, remove punctuation, tokenize words.
3. **Build Vocabulary:** List all unique words in the dataset.
4. **Vectorization:** For each sentence/document:
   - Initialize a vector of zeros of length = vocabulary size.
   - Count the frequency of each word in the sentence and update the vector.
5. **Output:** Matrix of size (number of sentences × vocabulary size) representing BoW features.

**Notes:**
- Optional: Apply **stopword removal** to reduce irrelevant words.
- Optional: Limit vocabulary to top-K frequent words to reduce dimensionality.
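
The two optional steps can be sketched in plain Python as follows. This is an illustration under assumptions, not a reference implementation: the stopword set is a tiny hypothetical list, and `bow_top_k` is a made-up helper name. scikit-learn's `CountVectorizer` exposes the same ideas through its `stop_words` and `max_features` parameters.

```python
from collections import Counter

STOPWORDS = {"i", "is", "me"}  # tiny hypothetical stopword list, for illustration only

def bow_top_k(sentences, top_k, stopwords=STOPWORDS):
    """Return (vocabulary, count matrix) using only the top_k most frequent non-stopwords."""
    tokenized = [[w for w in s.lower().split() if w not in stopwords] for s in sentences]
    totals = Counter(w for tokens in tokenized for w in tokens)
    # Rank by descending corpus frequency; break ties alphabetically for a stable result
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    vocab = sorted(ranked[:top_k])
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix

sentences = [
    "I love NLP",
    "NLP loves me",
    "I love machine learning",
    "Machine learning is fun fun",
    "I love love NLP NLP NLP",
]
vocab, matrix = bow_top_k(sentences, top_k=4)
print(vocab)
for row in matrix:
    print(row)
```

Words outside the retained vocabulary are simply dropped, so a sentence made up only of rare words or stopwords becomes an all-zero row.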

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the sentences
bow_matrix = vectorizer.fit_transform(df_text['Sentence'])

# Convert to DataFrame for easy viewing
df_bow = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Display first 5 rows of BoW representation
print("Bag of Words feature matrix (first 5 rows):\n")
print(df_bow.head())

# Display the vocabulary
print("\nVocabulary (words as columns):\n")
print(vectorizer.get_feature_names_out())

Bag of Words feature matrix (first 5 rows):

   about  above  abreast  abreastly  abundant  abuts  accessed  accommodate  \
0      0      0        0          0         0      0         0            0   
1      0      0        0          0         0      0         0            0   
2      0      0        0          0         0      0         0            0   
3      0      0        0          0         0      0         0            0   
4      0      0        0          0         0      0         0            0   

   accompanied  accompanying  ...  yards  years  yellow  yellowish  yet  you  \
0            0             0  ...      0      0       0          0    0    0   
1            0             0  ...      0      0       0          0    0    0   
2            0             0  ...      0      0       0          0    0    0   
3            0             0  ...      0      0       0          0    0    0   
4            0             0  ...      0      0       0          0    0    0   


In [None]:
# Scratch Code

class BagOfWords:
    """
    A class to implement Bag of Words (BoW) encoding from scratch.

    Attributes:
    -----------
    sentences : list of str
        List of sentences/documents to encode.

    Methods:
    --------
    fit_transform()
        Learns vocabulary from sentences and returns the BoW feature matrix.
    """

    def __init__(self, sentences):
        """
        Initialize the BagOfWords encoder with sentences.

        Parameters:
        -----------
        sentences : list of str
            List of sentences/documents to encode.
        """
        self.sentences = sentences
        self.vocab = []

    def fit_transform(self):
        """
        Convert the stored sentences into a Bag of Words feature matrix.

        Returns:
        --------
        df_bow : pandas.DataFrame
            DataFrame where each row represents a sentence and each column a word from vocabulary.
        """
        # Step 1: Tokenize sentences and build vocabulary
        tokenized_sentences = [sentence.lower().split() for sentence in self.sentences]
        self.vocab = sorted(set(word for sentence in tokenized_sentences for word in sentence))

        # Step 2: Initialize matrix
        import numpy as np
        matrix = np.zeros((len(self.sentences), len(self.vocab)), dtype=int)

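        # Step 3: Count word frequencies
        # A dict lookup avoids searching the vocabulary list for every word
        word_to_index = {word: j for j, word in enumerate(self.vocab)}
        for i, sentence in enumerate(tokenized_sentences):
            for word in sentence:
                matrix[i, word_to_index[word]] += 1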

        # Step 4: Convert to DataFrame
        import pandas as pd
        df_bow = pd.DataFrame(matrix, columns=self.vocab)

        return df_bow

In [None]:
# Small sentence set (from previous BoW example)
sentences_example = [
    "I love NLP",
    "NLP loves me",
    "I love machine learning",
    "Machine learning is fun fun",   # 'fun' repeated twice
    "I love love NLP NLP NLP"        # 'love' twice, 'NLP' thrice
]

# Initialize BagOfWords object
bow_encoder_example = BagOfWords(sentences_example)

# Generate BoW matrix
df_bow_example = bow_encoder_example.fit_transform()

# Display BoW matrix
print("Bag of Words feature matrix for small example:\n")
print(df_bow_example)

Bag of Words feature matrix for small example:

   fun  i  is  learning  love  loves  machine  me  nlp
0    0  1   0         0     1      0        0   0    1
1    0  0   0         0     0      1        0   1    1
2    0  1   0         1     1      0        1   0    0
3    2  0   1         1     0      0        1   0    0
4    0  1   0         0     2      0        0   0    3


# TF-IDF (Term Frequency–Inverse Document Frequency) – Theory

TF-IDF is a **weighted version of Bag of Words** that emphasizes important words in a document while reducing the impact of common words.

### Components:

1. **Term Frequency (TF)**  
   Measures how often a term appears in a document:
   $$
   TF_{t,d} = \frac{\text{Number of times term t appears in document d}}{\text{Total number of terms in document d}}
   $$

2. **Inverse Document Frequency (IDF)**  
   Measures how unique a term is across all documents:
   $$
   IDF_t = \log \frac{N}{1 + \text{Number of documents containing term t}}
   $$  
   - $N$ = total number of documents  
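   - Common words across many documents have lower IDF
   - The $1$ in the denominator avoids division by zero for a term that appears in no document; as a side effect, a term that appears in every document gets a slightly negative IDF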

3. **TF-IDF Score**  
   Combines TF and IDF:
   $$
   TF\text{-}IDF_{t,d} = TF_{t,d} \times IDF_t
   $$

### Key Points:
- Words that appear frequently in a document but rarely across documents get **high scores**.  
- Reduces weight of stopwords like "the", "is", "and".  
- Output is a **sparse numeric vector**, one per document.  
- Widely used in **text classification, clustering, and information retrieval**.

# Example

Consider 3 sentences:

1. "I love NLP"  
2. "NLP loves me"  
3. "I love machine learning"

### Step 1: Bag of Words Frequencies
| Sentence           | I | love | NLP | loves | me | machine | learning |
|-------------------|---|------|-----|-------|----|---------|----------|
| I love NLP         | 1 | 1    | 1   | 0     | 0  | 0       | 0        |
| NLP loves me       | 0 | 0    | 1   | 1     | 1  | 0       | 0        |
| I love machine learning | 1 | 1 | 0   | 0     | 0  | 1       | 1        |

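### Step 2: Compute TF-IDF
- TF is the term count divided by the number of terms in the sentence.  
- IDF uses the formula above with $N = 3$ and the natural logarithm: words found in two sentences ("I", "love", "NLP") get $\ln\frac{3}{1+2} = 0$, and words found in one sentence get $\ln\frac{3}{1+1} \approx 0.405$.  
- TF-IDF vectors reflect the **importance of each word in context**:

| Sentence                | I | love | NLP | loves | me    | machine | learning |
|-------------------------|---|------|-----|-------|-------|---------|----------|
| I love NLP              | 0 | 0    | 0   | 0     | 0     | 0       | 0        |
| NLP loves me            | 0 | 0    | 0   | 0.135 | 0.135 | 0       | 0        |
| I love machine learning | 0 | 0    | 0   | 0     | 0     | 0.101   | 0.101    |

With only three sentences, the shared words are weighted all the way down to zero. The first sentence contains no distinctive word, so its vector is all zeros.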

# TF-IDF Algorithm

1. **Input:** Collection of sentences/documents.  
2. **Preprocessing:** Lowercase, remove punctuation, tokenize.  
3. **Build Vocabulary:** List all unique words across documents.  
4. **Compute Term Frequency (TF):** Count occurrences of each word in each document and normalize by document length.  
5. **Compute Inverse Document Frequency (IDF):**  
   $$
   IDF_t = \log \frac{N}{1 + df_t}
   $$  
   where $df_t$ = number of documents containing word $t$, and $N$ = total number of documents.  
6. **Compute TF-IDF:** Multiply TF by IDF for each term:  
   $$
   TF\text{-}IDF_{t,d} = TF_{t,d} \times IDF_t
   $$  
7. **Output:** Matrix of size (number of documents × vocabulary size) representing TF-IDF features.  

**Notes:**  
- Can use libraries like `TfidfVectorizer` from scikit-learn for implementation.  
- Optional: Remove stopwords or limit vocabulary size to reduce dimensionality.
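
Note that `TfidfVectorizer` does not use exactly the IDF formula above. With its default settings (`smooth_idf=True`, `norm='l2'`) it uses raw counts as TF, computes IDF as $\ln\frac{1+N}{1+df_t} + 1$, and then scales each row to unit length, so its numbers differ from a direct implementation of the steps above. The sketch below reproduces both variants in plain Python for comparison. The function names are hypothetical, and the whitespace tokenizer is a simplification: scikit-learn's default tokenizer also drops single-character tokens such as "I".

```python
import math

def tokenize(sentence):
    return sentence.lower().split()

def document_frequency(docs):
    """Sorted vocabulary and, for each word, the number of documents containing it."""
    vocab = sorted({w for d in docs for w in d})
    return vocab, {w: sum(w in d for d in docs) for w in vocab}

def tfidf_textbook(sentences):
    """TF = count / document length, IDF = ln(N / (1 + df)); assumes non-empty sentences."""
    docs = [tokenize(s) for s in sentences]
    vocab, df = document_frequency(docs)
    n = len(docs)
    rows = [[d.count(w) / len(d) * math.log(n / (1 + df[w])) for w in vocab] for d in docs]
    return vocab, rows

def tfidf_smoothed_l2(sentences):
    """Raw-count TF, IDF = ln((1 + N) / (1 + df)) + 1, rows scaled to unit L2 norm."""
    docs = [tokenize(s) for s in sentences]
    vocab, df = document_frequency(docs)
    n = len(docs)
    rows = []
    for d in docs:
        row = [d.count(w) * (math.log((1 + n) / (1 + df[w])) + 1) for w in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # guard against an all-zero row
        rows.append([x / norm for x in row])
    return vocab, rows

sentences = ["I love NLP", "NLP loves me", "I love machine learning"]
vocab, textbook = tfidf_textbook(sentences)
_, smoothed = tfidf_smoothed_l2(sentences)

print(vocab)
print([round(x, 3) for x in textbook[1]])   # "NLP loves me", textbook formula
print([round(x, 3) for x in smoothed[1]])   # "NLP loves me", smoothed + L2-normalized
```

In the smoothed variant no weight can reach zero for a word that is present, which is one reason library output and hand-computed values rarely match digit for digit.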

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the sentences
tfidf_matrix = tfidf_vectorizer.fit_transform(df_text['Sentence'])

# Convert to DataFrame for easy viewing
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Display first 5 rows of TF-IDF feature matrix
print("TF-IDF feature matrix (first 5 rows):\n")
print(df_tfidf.head())

# Display the vocabulary
print("\nVocabulary (words as columns):\n")
print(tfidf_vectorizer.get_feature_names_out())

TF-IDF feature matrix (first 5 rows):

   about  above  abreast  abreastly  abundant  abuts  accessed  accommodate  \
0    0.0    0.0      0.0        0.0       0.0    0.0       0.0          0.0   
1    0.0    0.0      0.0        0.0       0.0    0.0       0.0          0.0   
2    0.0    0.0      0.0        0.0       0.0    0.0       0.0          0.0   
3    0.0    0.0      0.0        0.0       0.0    0.0       0.0          0.0   
4    0.0    0.0      0.0        0.0       0.0    0.0       0.0          0.0   

   accompanied  accompanying  ...  yards  years  yellow  yellowish  yet  you  \
0          0.0           0.0  ...    0.0    0.0     0.0        0.0  0.0  0.0   
1          0.0           0.0  ...    0.0    0.0     0.0        0.0  0.0  0.0   
2          0.0           0.0  ...    0.0    0.0     0.0        0.0  0.0  0.0   
3          0.0           0.0  ...    0.0    0.0     0.0        0.0  0.0  0.0   
4          0.0           0.0  ...    0.0    0.0     0.0        0.0  0.0  0.0   

   zi

In [None]:
# Scratch code

import numpy as np
import pandas as pd

class TFIDF:
    """
    A class to implement TF-IDF encoding from scratch.

    Attributes:
    -----------
    sentences : list of str
        List of sentences/documents to encode.
    vocab : list of str
        Unique words across all sentences.

    Methods:
    --------
    fit_transform()
        Computes the TF-IDF feature matrix for the stored sentences.
    """

    def __init__(self, sentences):
        """
        Initialize the TFIDF encoder with sentences.

        Parameters:
        -----------
        sentences : list of str
            List of sentences/documents to encode.
        """
        self.sentences = sentences
        self.vocab = []

    def fit_transform(self):
        """
        Compute TF-IDF feature matrix from the stored sentences.

        Returns:
        --------
        df_tfidf : pandas.DataFrame
            DataFrame where each row represents a sentence and each column a word from vocabulary.
        """
        # Step 1: Tokenize sentences
        tokenized_sentences = [sentence.lower().split() for sentence in self.sentences]

        # Step 2: Build vocabulary
        self.vocab = sorted(set(word for sentence in tokenized_sentences for word in sentence))

        N = len(tokenized_sentences)  # number of documents

        # Step 3: Compute document frequency for each word
        df_count = {word: 0 for word in self.vocab}
        for word in self.vocab:
            df_count[word] = sum(word in sentence for sentence in tokenized_sentences)

        # Step 4: Initialize TF-IDF matrix
        tfidf_matrix = np.zeros((N, len(self.vocab)))

        # Step 5: Compute TF-IDF values
        for i, sentence in enumerate(tokenized_sentences):
            total_terms = len(sentence)
            for j, word in enumerate(self.vocab):
                tf = sentence.count(word) / total_terms  # term frequency
                idf = np.log(N / (1 + df_count[word]))   # inverse document frequency
                tfidf_matrix[i, j] = tf * idf

        # Step 6: Convert to DataFrame
        df_tfidf = pd.DataFrame(tfidf_matrix, columns=self.vocab)
        return df_tfidf

In [None]:
# Small 5-sentence example
sentences_example = [
    "I love NLP",
    "NLP loves me",
    "I love machine learning",
    "Machine learning is fun fun",
    "I love love NLP NLP NLP"
]

# Initialize TF-IDF object
tfidf_encoder = TFIDF(sentences_example)

# Generate TF-IDF matrix
df_tfidf_example = tfidf_encoder.fit_transform()

# Display TF-IDF matrix
print("TF-IDF feature matrix for small example:\n")
print(df_tfidf_example)

TF-IDF feature matrix for small example:

        fun         i        is  learning      love    loves   machine  \
0  0.000000  0.074381  0.000000  0.000000  0.074381  0.00000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.30543  0.000000   
2  0.000000  0.055786  0.000000  0.127706  0.055786  0.00000  0.127706   
3  0.366516  0.000000  0.183258  0.102165  0.000000  0.00000  0.102165   
4  0.000000  0.037191  0.000000  0.000000  0.074381  0.00000  0.000000   

        me       nlp  
0  0.00000  0.074381  
1  0.30543  0.074381  
2  0.00000  0.000000  
3  0.00000  0.000000  
4  0.00000  0.111572  
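
As a check on the output above, two entries can be reproduced by hand from the TF and IDF formulas, with $N = 5$ and the natural logarithm (as used by `np.log`). Row 3 is "Machine learning is fun fun" (5 terms) and row 0 is "I love NLP" (3 terms).

```latex
\begin{aligned}
\text{``fun'' in row 3:}\quad & TF = \tfrac{2}{5},\quad df = 1,\quad IDF = \ln\tfrac{5}{1+1} \approx 0.9163,\quad TF\text{-}IDF \approx 0.4 \times 0.9163 \approx 0.3665 \\
\text{``i'' in row 0:}\quad & TF = \tfrac{1}{3},\quad df = 3,\quad IDF = \ln\tfrac{5}{1+3} \approx 0.2231,\quad TF\text{-}IDF \approx \tfrac{0.2231}{3} \approx 0.0744
\end{aligned}
```

Both values agree with the `fun` and `i` columns of the printed matrix.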


# Word2Vec – Word Embeddings

**Word2Vec** is a neural network-based model that converts words into dense vectors of fixed size, capturing semantic meaning and relationships between words.

### Key Concepts:
- Each word is represented as a **dense vector** in a high-dimensional space.
- Words appearing in **similar contexts** have similar vectors.
- Captures semantic relationships, e.g.,
  $$
  \text{vec}(king) - \text{vec}(man) + \text{vec}(woman) \approx \text{vec}(queen)
  $$

### Models:
1. **CBOW (Continuous Bag of Words):**
   - Predicts a target word from surrounding context words.
   - Faster and works well with smaller datasets.

2. **Skip-gram:**
   - Predicts surrounding context words given a target word.
   - Performs better with rare words.
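
The difference between the two models is easiest to see in the training examples they generate from the same sentence. The sketch below only builds those examples; it does not train a network, and the function names are hypothetical. Real implementations add refinements such as negative sampling.

```python
def context_window(tokens, position, window):
    """Words within `window` positions of tokens[position], excluding the word itself."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    """CBOW: all context words together predict the target word."""
    return [(context_window(tokens, i, window), target) for i, target in enumerate(tokens)]

def skipgram_examples(tokens, window):
    """Skip-gram: the target word predicts each context word, one pair at a time."""
    return [(target, ctx)
            for i, target in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]

tokens = "i love machine learning".split()

for context, target in cbow_examples(tokens, window=1):
    print(f"CBOW      {context} -> {target}")
for target, ctx in skipgram_examples(tokens, window=1):
    print(f"Skip-gram {target} -> {ctx}")
```

CBOW produces one example per word, while Skip-gram produces one example per (word, neighbor) pair, so each occurrence of a rare word contributes several separate updates.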

### Key Points:
- Output is a **dense vector** for each word (usually 100–300 dimensions).
- Preserves **semantic similarity** between words.
- Commonly used in NLP tasks: sentiment analysis, text classification, recommendation systems.

# Word2Vec Example

Sentences:
1. "I love NLP"
2. "NLP loves me"
3. "I love machine learning"

- Word2Vec will create a **vector representation** for each word.
- Similar words in context get **closer vectors**.
- Example: After training a 5-dimensional Word2Vec:
  $$
  \text{vec}(NLP) = [0.12, -0.05, 0.33, 0.21, 0.08]
  $$
  $$
  \text{vec}(love) = [0.25, 0.11, 0.30, -0.02, 0.15]
  $$
- Vectors can then be used as features for ML models.
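
A short sketch of two things that can be done with such vectors, using the illustrative 5-dimensional values above: measuring how close two words are with cosine similarity, and averaging word vectors into one fixed-length sentence vector. Averaging is one simple, common choice assumed here for illustration, and the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Average word vectors component-wise into one sentence-level feature vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Illustrative values copied from the example above, not the output of a trained model
embeddings = {
    "nlp":  [0.12, -0.05, 0.33, 0.21, 0.08],
    "love": [0.25, 0.11, 0.30, -0.02, 0.15],
}

similarity = cosine(embeddings["nlp"], embeddings["love"])
sentence_features = mean_vector([embeddings[w] for w in ["love", "nlp"]])

print(round(similarity, 3))
print([round(x, 3) for x in sentence_features])
```

The sentence vector has the same length as a word vector regardless of how many words the sentence contains, which is what lets it feed a classical ML model.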

# Word2Vec Algorithm

1. **Input:** Collection of sentences/documents.  
2. **Preprocessing:** Tokenize sentences, lowercase words, remove punctuation.  
3. **Choose model:** CBOW or Skip-gram.  
4. **Train Neural Network:** Learn dense vector representations for each word.  
5. **Output:** Dictionary mapping each word to its vector representation.

**Notes:**
- Libraries: `gensim` (Word2Vec), `tensorflow` or `pytorch` for custom training.
- Hyperparameters: vector size, window size, min_count, epochs.

In [None]:
# install gensim
# !pip install -U gensim



In [None]:
from gensim.models import Word2Vec
import pandas as pd

# Tokenize sentences (split into words)
tokenized_sentences = [sentence.lower().split() for sentence in sentences]

# Train Word2Vec model
# CBOW: sg=0 (default), Skip-gram: sg=1
w2v_model = Word2Vec(
    sentences=tokenized_sentences,  # Tokenized corpus: list of list of words
    vector_size=1024,               # Dimensionality of the word embeddings
    window=10,                      # Context window size
    min_count=1,                    # Include all words (min frequency threshold)
    sg=0,                           # Training algorithm: 0 = CBOW, 1 = Skip-gram
    workers=4,                      # Number of CPU cores for training
    epochs=50                       # Number of training iterations over the corpus
)


# Example: Get the vector and the most similar words for a word
word = "buildings"
if word in w2v_model.wv:
    print(f"Vector representation for '{word}':\n", w2v_model.wv.get_vector(word))

    # most_similar raises KeyError for out-of-vocabulary words, so keep it inside the check
    similar_words = w2v_model.wv.most_similar(word, topn=5)
    print(f"\nTop 5 words similar to '{word}':\n", similar_words)
else:
    print(f"'{word}' is not in the vocabulary")

Vector representation for 'buildings':
 [-0.24368595  0.67628956 -0.07799989 ...  0.07771645  0.08644833
  0.16326168]

Top 5 words similar to 'buildings':
 [('building', 0.5575284361839294), ('houses', 0.556564211845398), ('meadows', 0.3854692876338959), ('trees', 0.3841174244880676), ('green', 0.3694857954978943)]


In [None]:
from gensim.models import Word2Vec
import pandas as pd

# Small sentence set for demonstration
csentences = [
    "I love NLP",
    "I like NLP",
    "NLP loves me",
    "NLP likes me",
    "I love machine learning",
    "I like machine learning",
    "Machine learning is fun",
    "I love coding",
    "I like coding",
]

# Tokenize sentences
tokenized_sentences = [s.lower().split() for s in csentences]

# Train Word2Vec model
# CBOW: sg=0 (default), Skip-gram: sg=1
w2v_model = Word2Vec(
    sentences=tokenized_sentences,  # Tokenized corpus: list of list of words
    vector_size=100,               # Dimensionality of the word embeddings
    window=10,                      # Context window size
    min_count=1,                    # Include all words (min frequency threshold)
    sg=0,                           # Training algorithm: 0 = CBOW, 1 = Skip-gram
    workers=4,                      # Number of CPU cores for training
    epochs=50                       # Number of training iterations over the corpus
)

# Vocabulary
vocabulary = sorted(w2v_model.wv.index_to_key)  # all words in sorted order

# Example: Get vector for all words
for word in vocabulary:
    print(f"Shape of vector representation for '{word}':\t", w2v_model.wv.get_vector(word).shape)

Shape of vector representation for 'coding':	 (100,)
Shape of vector representation for 'fun':	 (100,)
Shape of vector representation for 'i':	 (100,)
Shape of vector representation for 'is':	 (100,)
Shape of vector representation for 'learning':	 (100,)
Shape of vector representation for 'like':	 (100,)
Shape of vector representation for 'likes':	 (100,)
Shape of vector representation for 'love':	 (100,)
Shape of vector representation for 'loves':	 (100,)
Shape of vector representation for 'machine':	 (100,)
Shape of vector representation for 'me':	 (100,)
Shape of vector representation for 'nlp':	 (100,)


In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Vocabulary
vocabulary = sorted(w2v_model.wv.index_to_key)  # all words in sorted order

# Build matrix of word vectors
vectors = np.array([w2v_model.wv[word] for word in vocabulary])

# Compute cosine similarity between all words
similarity_matrix = cosine_similarity(vectors)

# For each word, get words in descending similarity
df_data = [
    [vocabulary[idx] for idx in np.argsort(-similarity_matrix[i])]
    for i in range(len(vocabulary))
]

# Create DataFrame
df = pd.DataFrame(df_data, index=vocabulary, columns=[f'Rank_{i+1}' for i in range(len(vocabulary))])

df

Unnamed: 0,Rank_1,Rank_2,Rank_3,Rank_4,Rank_5,Rank_6,Rank_7,Rank_8,Rank_9,Rank_10,Rank_11,Rank_12
coding,coding,learning,likes,machine,like,love,i,fun,me,is,loves,nlp
fun,fun,learning,machine,love,is,loves,nlp,i,likes,coding,like,me
i,i,is,me,loves,likes,fun,coding,nlp,like,learning,love,machine
is,is,i,fun,likes,coding,learning,loves,me,nlp,love,machine,like
learning,learning,fun,loves,like,coding,love,me,machine,nlp,is,i,likes
like,like,learning,love,coding,nlp,loves,i,me,machine,fun,is,likes
likes,likes,loves,me,i,coding,fun,love,is,machine,learning,nlp,like
love,love,loves,like,machine,learning,fun,likes,coding,nlp,me,i,is
loves,loves,learning,love,me,likes,i,machine,fun,nlp,like,is,coding
machine,machine,love,fun,nlp,coding,loves,likes,learning,me,like,i,is


# FastText – Theory

**FastText** is an extension of Word2Vec developed by Facebook AI Research (FAIR).  
Unlike Word2Vec, which learns a vector for each *whole word*, FastText represents each word as a **bag of character n-grams**.  
This makes it more powerful for handling rare words, misspellings, and morphologically rich languages.

---

## Key Idea

- Each word $w$ is represented as a combination of its character *n-grams*.  
- Example: For the word *"where"* with $n = 3$:  
  - n-grams = `"<wh"`, `"whe"`, `"her"`, `"ere"`, `"re>"`  
- The vector for $w$ is computed as the **sum of vectors of its n-grams**.

---

## Mathematical Representation

If a word $w$ is represented by character n-grams  
$(g_{1}, g_{2}, \ldots, g_{k})$, then its vector is:

$$
\vec{w} = \sum_{i=1}^{k} \vec{g_i}
$$

where:  
- $\vec{g_i}$ = vector of the $i^{th}$ n-gram  
- $k$ = total number of n-grams for word $w$
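
The sum can be sketched with a toy lookup table. Everything here is an assumption for illustration: the 4-dimensional size, the random stand-in values (a trained model learns them), and the helper names `vector_for` and `word_vector`.

```python
import random

random.seed(0)
DIM = 4  # tiny dimensionality, for readability only

# Hypothetical n-gram vector table; in a trained model these values are learned
ngram_vectors = {}

def vector_for(ngram):
    """Look up an n-gram's vector, creating a random stand-in on first use."""
    if ngram not in ngram_vectors:
        ngram_vectors[ngram] = [random.uniform(-1, 1) for _ in range(DIM)]
    return ngram_vectors[ngram]

def word_vector(ngrams):
    """Sum the vectors of the n-grams g_1..g_k component-wise."""
    total = [0.0] * DIM
    for g in ngrams:
        total = [t + x for t, x in zip(total, vector_for(g))]
    return total

where_ngrams = ["<wh", "whe", "her", "ere", "re>"]  # 3-grams of "where" from the example above
w = word_vector(where_ngrams)
print(w)
```

Because the word vector is assembled from n-gram vectors, a word never seen in training can still get a vector as long as some of its n-grams were seen.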

---

## Advantages
- Handles **out-of-vocabulary (OOV) words** by using subword n-grams.  
- Better at capturing **morphological variations** (e.g., "play", "playing", "played").  
- Works well for **low-resource languages** and noisy text.

# FastText – Small Example

Suppose we use character 3-grams ($n=3$).  

### Word: "play"
- 3-grams: `"<pl"`, `"pla"`, `"lay"`, `"ay>"`  

### Word: "player"
- 3-grams: `"<pl"`, `"pla"`, `"lay"`, `"aye"`, `"yer"`, `"er>"`  

### Key Observation
- Both words share the n-grams `"<pl"`, `"pla"`, `"lay"`.  
- This makes their vectors **similar**, even though "play" and "player" are different words.  
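
The n-gram lists above can be reproduced with a few lines of Python. `char_ngrams` is a hypothetical helper written for this example; it wraps the word in the boundary markers `<` and `>` and slides a window of length `n` across it.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in the boundary markers '<' and '>'."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

play = char_ngrams("play")
player = char_ngrams("player")
shared = [g for g in play if g in player]

print(play)    # ['<pl', 'pla', 'lay', 'ay>']
print(player)  # ['<pl', 'pla', 'lay', 'aye', 'yer', 'er>']
print(shared)  # ['<pl', 'pla', 'lay']
```

Three of the four n-grams of "play" also occur in "player", which is why their summed vectors end up close together.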

# FastText – Algorithm

1. **Input:** Collection of sentences/documents.  
2. **Preprocessing:** Tokenize text, lowercase, remove punctuation (optional).  
3. **Generate n-grams:** For each word, break into character n-grams of length $n$.  
4. **Assign vectors:** Initialize random vectors for each n-gram.  
5. **Training:**  
   - Similar to Word2Vec (CBOW or Skip-gram).  
   - Instead of predicting using word vectors, use the **sum of its n-gram vectors**.  
6. **Output:** Word embeddings that are enriched by subword information.  

**Notes:**  
- In practice, use `gensim.models.FastText` for training.  
- Choose n-gram range carefully (commonly $n = 3$ to $6$).  

In [None]:
from gensim.models import FastText

class FastTextEncoder:
    """
    A class to generate FastText embeddings for a corpus of tokenized sentences.

    Methods:
    --------
    train(vector_size=100, window=5, min_count=1, sg=0, epochs=50)
        Train the FastText model on the tokenized sentences passed to the constructor.

    get_vector(word)
        Return the embedding vector for a given word.

    most_similar(word, topn=5)
        Return the top-n most similar words to the given word.
    """

    def __init__(self, sentences):
        """
        Initialize with tokenized sentences.
        Parameters:
        -----------
        sentences : list of list of str
            Tokenized sentences (list of words).
        """
        self.sentences = sentences
        self.model = None

    def train(self, vector_size=100, window=5, min_count=1, sg=0, epochs=50):
        """
        Train FastText model on the corpus.
        """
        self.model = FastText(
            sentences=self.sentences,
            vector_size=vector_size,
            window=window,
            min_count=min_count,
            sg=sg,
            workers=4,
            epochs=epochs
        )

    def get_vector(self, word):
        """Return vector of a word."""
        if self.model and word in self.model.wv:
            return self.model.wv[word]
        else:
            return None

    def most_similar(self, word, topn=5):
        """Return top-n most similar words."""
        if self.model and word in self.model.wv:
            return self.model.wv.most_similar(word, topn=topn)
        else:
            return None

In [None]:
# Tokenize sentences (split into words)
tokenized_sentences = [sentence.lower().split() for sentence in sentences]

# Initialize encoder with tokenized sentences
ft_encoder = FastTextEncoder(tokenized_sentences)

# Train FastText embeddings
ft_encoder.train(vector_size=50, window=3, sg=0, epochs=50)  # CBOW, small vector for demo

# Example: Get vector for a word
word = "trees"
vector = ft_encoder.get_vector(word)
print(f"Vector for '{word}':\n", vector)

# Example: Most similar words
similar_words = ft_encoder.most_similar(word, topn=5)
print(f"\nTop 5 words similar to '{word}':\n", similar_words)

Vector for 'trees':
 [ 0.6530846   0.878089    0.22609228  1.0980083   4.0503297  -2.1577334
 -1.019514   -0.9750183   0.11853927 -1.746333    0.20947021  1.0823535
 -2.3126569  -0.25367138 -4.2428236  -0.25693595  0.29158935  0.40239558
  1.5529388   0.14716645  1.4325798  -1.4569715  -0.5504494  -2.5085375
 -0.05441535  0.8548568  -2.7766485  -2.186073    2.923112   -0.7620086
 -2.645418   -0.54623264  0.1195144   0.21797189 -2.3204353   0.48574346
  1.8264054   0.5877467  -0.21056727  0.18775581 -1.0848439   3.507628
  1.7511572  -0.9868019  -0.74996454  0.29289952  1.5102203   0.6893213
  2.2077527  -1.3493694 ]

Top 5 words similar to 'trees':
 [('plants', 0.8836347460746765), ('tree', 0.7652081847190857), ('meadows', 0.7073495388031006), ('buildings', 0.665357768535614), ('plans', 0.610248327255249)]


In [None]:
# Small sentence set for demonstration
csentences = [
    "I love NLP",
    "I like NLP",
    "NLP loves me",
    "NLP likes me",
    "I love machine learning",
    "I like machine learning",
    "Machine learning is fun",
    "I love coding",
    "I like coding",
]

# Tokenize sentences (split into words)
tokenized_sentences = [sentence.lower().split() for sentence in csentences]

# Initialize encoder with tokenized sentences
ft_model = FastTextEncoder(tokenized_sentences)

# Train FastText embeddings
ft_model.train(vector_size=100, window=3, sg=0, epochs=50)  # CBOW, small vector for demo

# Vocabulary (all words in sorted order)
vocabulary = sorted(ft_model.model.wv.index_to_key)

# Example: Get vector for all words
for word in vocabulary:
    print(f"Shape of vector representation for '{word}':\t", ft_model.get_vector(word).shape)

Shape of vector representation for 'coding':	 (100,)
Shape of vector representation for 'fun':	 (100,)
Shape of vector representation for 'i':	 (100,)
Shape of vector representation for 'is':	 (100,)
Shape of vector representation for 'learning':	 (100,)
Shape of vector representation for 'like':	 (100,)
Shape of vector representation for 'likes':	 (100,)
Shape of vector representation for 'love':	 (100,)
Shape of vector representation for 'loves':	 (100,)
Shape of vector representation for 'machine':	 (100,)
Shape of vector representation for 'me':	 (100,)
Shape of vector representation for 'nlp':	 (100,)


In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Vocabulary (all words in sorted order)
vocabulary = sorted(ft_model.model.wv.index_to_key)

# Build matrix of word vectors
vectors = np.array([ft_model.get_vector(word) for word in vocabulary])

# Compute cosine similarity between all words
similarity_matrix = cosine_similarity(vectors)

# For each word, get words in descending similarity order
df_data = [
    [vocabulary[idx] for idx in np.argsort(-similarity_matrix[i])]
    for i in range(len(vocabulary))
]

# Create DataFrame
df = pd.DataFrame(df_data, index=vocabulary, columns=[f'Rank_{i+1}' for i in range(len(vocabulary))])

# Display the DataFrame
df

Unnamed: 0,Rank_1,Rank_2,Rank_3,Rank_4,Rank_5,Rank_6,Rank_7,Rank_8,Rank_9,Rank_10,Rank_11,Rank_12
coding,coding,learning,is,loves,machine,love,likes,i,like,fun,nlp,me
fun,fun,like,likes,learning,love,coding,nlp,machine,me,loves,i,is
i,i,likes,nlp,loves,is,like,learning,coding,love,machine,me,fun
is,is,love,coding,me,likes,i,loves,nlp,learning,machine,like,fun
learning,learning,coding,i,loves,fun,like,likes,is,love,nlp,machine,me
like,like,likes,machine,fun,me,i,love,learning,nlp,coding,loves,is
likes,likes,like,me,loves,i,fun,is,love,machine,learning,coding,nlp
love,love,loves,is,machine,coding,nlp,likes,like,fun,me,i,learning
loves,loves,love,coding,nlp,likes,i,machine,learning,is,me,fun,like
machine,machine,like,love,coding,nlp,likes,loves,me,fun,learning,is,i


# 🧠 Transformer-Based Embeddings

In machine learning, data must be **numerically encoded** before it can be used for training or inference.  
Encoding transforms symbolic information into numerical form so that mathematical models can process it.

- **Categorical Encoding**: Converts discrete categories (e.g., color, species) into numbers.  
- **Text Encoding**: Converts text (sentences, documents) into numeric vectors.  
- Proper encoding is **critical for model performance**, as it determines how effectively a model can learn underlying patterns.

## Evolution of Text Encoding

1. **Bag of Words (BoW)** and **TF-IDF**
   - Represent text as frequency-based vectors.
   - Ignore word order and context.

2. **Word2Vec** and **FastText**
   - Learn word embeddings from local context using shallow neural networks.
   - Capture semantic similarity but are **context-independent** (same embedding for “bank” in *river bank* and *money bank*).

3. **Transformer-Based Embeddings**
   - Introduced with the **Transformer architecture**, which uses *self-attention* instead of recurrence.
   - Generate **contextual embeddings** — word meaning adapts based on sentence context.
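
The difference between context-independent and contextual embeddings can be illustrated with a toy sketch. It is not how BERT or FNet are implemented: the vocabulary, the random 8-dimensional vectors, and the single attention step with identity projections are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary with random static embeddings (dimension 8)
vocab = ["the", "river", "money", "bank"]
static_emb = {w: rng.normal(size=8) for w in vocab}


def static_vectors(tokens):
    """Context-independent lookup: a token always maps to the same vector."""
    return np.stack([static_emb[t] for t in tokens])


def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


sent_a = ["the", "river", "bank"]
sent_b = ["the", "money", "bank"]

# Static lookup: "bank" gets the same vector in both sentences
static_a = static_vectors(sent_a)[2]
static_b = static_vectors(sent_b)[2]
print(np.allclose(static_a, static_b))  # True

# After mixing with its neighbours, "bank" differs between the two sentences
ctx_a = self_attention(static_vectors(sent_a))[2]
ctx_b = self_attention(static_vectors(sent_b))[2]
print(np.allclose(ctx_a, ctx_b))  # False
```

The static lookup corresponds to the Word2Vec/FastText case, while the mixed vectors mimic, in a very reduced form, what a Transformer layer does to each token.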

## Advantages of Transformer-Based Embeddings

- **Context Awareness**: Captures meaning depending on surrounding words.  
- **Bidirectionality**: Considers both left and right contexts.  
- **Long-Range Dependencies**: Understands relationships between distant words.  
- **Transfer Learning**: Pre-trained models (like BERT, FNet, RoBERTa) can be fine-tuned for downstream tasks efficiently.

### Examples of Transformer Embedding Models

| Model       | Core Mechanism       | Key Advantage                    |
|------------|--------------------|----------------------------------|
| BERT       | Encoder-only       | Deep bidirectional understanding |
| RoBERTa    | Optimized BERT     | Better training dynamics          |
| DistilBERT | Lightweight BERT   | Faster inference                  |
| FNet       | Fourier-based      | Faster and simpler alternative to attention |

# 🔬 FNetWord — Word-Level Embedding using FNet

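**FNet** (Fourier Network) is a Transformer-inspired model that replaces the costly *self-attention* mechanism  
with a **2D Fourier Transform** applied across the sequence and hidden dimensions of the token embeddings, keeping only the real part.  
The Fourier mixing layer has no learned parameters and can be computed with the FFT in O(n log n) time in the sequence length n, instead of the O(n²) cost of self-attention. Its authors report that it retains most of BERT's accuracy on standard NLP benchmarks.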

## How FNet Works

- Input tokens are converted into embeddings.  
- Instead of computing pairwise attention weights, FNet applies a **Fourier Transform** to mix information across tokens.  
- The resulting transformed representations pass through a **feed-forward network** to produce contextual embeddings.

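Because every output of the Fourier transform depends on every input token, this mechanism captures **global context** without computing pairwise attention weights, which makes it cheaper to run than BERT-like attention models.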
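
The steps above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the `google/fnet-base` implementation: the weights are random, the sizes are arbitrary, and ReLU is used in the feed-forward part only to keep the sketch short.

```python
import numpy as np


def layer_norm(x, eps=1e-12):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def fourier_mixing(x):
    """Real part of the 2D DFT over the (sequence, hidden) axes."""
    return np.fft.fft2(x).real


def fnet_block(x, w1, b1, w2, b2):
    """One FNet-style block: Fourier mixing, then a position-wise feed-forward net."""
    x = layer_norm(x + fourier_mixing(x))      # token mixing + residual
    hidden = np.maximum(0.0, x @ w1 + b1)      # ReLU here is an assumption
    return layer_norm(x + hidden @ w2 + b2)    # feed-forward + residual


rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 8, 16              # arbitrary toy sizes
x = rng.normal(size=(seq_len, d_model))        # stand-in for token embeddings
w1, b1 = rng.normal(scale=0.1, size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(scale=0.1, size=(d_ff, d_model)), np.zeros(d_model)

out = fnet_block(x, w1, b1, w2, b2)
print(out.shape)  # (6, 8)

# Perturbing only the first token changes the mixed output at every position
x_perturbed = x.copy()
x_perturbed[0] += 1.0
delta = np.abs(fourier_mixing(x_perturbed) - fourier_mixing(x))
print((delta.max(axis=1) > 0).all())  # True
```

The last check shows the global mixing property: a change in one token reaches all positions after a single Fourier layer, with no learned parameters involved in the mixing itself.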

## The `FNetWord` Class

- A custom PyTorch module, defined as `TransformerEmbedder` in the code cell below, built on top of the pretrained **FNet encoder** (`google/fnet-base` by default).  
- Loads a JSON file containing a list of sentences.  
- Tokenizes the sentences and extracts contextualized embeddings for **each word**.  
- Aggregates subword tokens into a single **word-level vector** using the **mean pooling strategy**.  
- Removes all special tokens (e.g., `[CLS]`, `[SEP]`, `[PAD]`), returning only actual words from the JSON file.
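
The mean pooling step can be sketched in isolation. The subword split, the `word_ids` list, and the 2-dimensional stand-in vectors below are hypothetical and chosen so the averages are easy to check by hand. In the real class the vectors come from the encoder's last hidden state and the word indices from the tokenizer.

```python
from collections import defaultdict

import numpy as np


def pool_subwords(token_vectors, word_ids):
    """Average the vectors of subword tokens that share a word index.

    Positions whose word index is None (special tokens) are skipped.
    """
    buckets = defaultdict(list)
    for vec, wid in zip(token_vectors, word_ids):
        if wid is not None:
            buckets[wid].append(vec)
    return {wid: np.mean(vecs, axis=0) for wid, vecs in buckets.items()}


# Hypothetical tokenization: [CLS] un believ able results [SEP]
words = ["unbelievable", "results"]
word_ids = [None, 0, 0, 0, 1, None]

# Stand-in for hidden states: 6 tokens, 2 dimensions each
token_vectors = np.arange(12, dtype=float).reshape(6, 2)

pooled = pool_subwords(token_vectors, word_ids)
word_vectors = {words[wid]: vec for wid, vec in pooled.items()}

print(word_vectors["unbelievable"])  # [4. 5.]
print(word_vectors["results"])       # [8. 9.]
```

Mean pooling is only one possible aggregation. Taking the first subword vector is a common alternative, but the class described here averages.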

## Output

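Returns a Python dictionary that maps each lowercased word to its embedding, averaged over all occurrences of that word and truncated to the first `vector_size` dimensions: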

```python
{
    "word": embedding_vector
}
```

In [None]:
import json
from collections import defaultdict
from typing import Dict, List
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModel
from torch.optim import AdamW  # PyTorch's AdamW; the AdamW in transformers is deprecated


class TransformerEmbedder(nn.Module):
    """
    Transformer-based word-level embedding extractor using FNet.
    Produces contextualized embeddings for each word (aggregated from subwords).
    """

    def __init__(self, json_path: str, model_name: str = "google/fnet-base"):
        super().__init__()

        # Load input sentences
        with open(json_path, "r", encoding="utf-8") as f:
            self.sentences = json.load(f)

        if not isinstance(self.sentences, list):
            raise ValueError("JSON file must contain a list of sentences (list[str]).")

        # Initialize tokenizer and model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)

    def fine_tune(self, lr: float = 2e-5, batch_size: int = 8, epochs: int = 2):
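        """
        Placeholder fine-tuning loop over the first 30 sentences.

        The loss is the mean of the last hidden state, which is not a meaningful
        training objective; replace it with a task-specific loss for real use.
        """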
        self.model.train()
        optimizer = AdamW(self.model.parameters(), lr=lr)
        dataloader = DataLoader(self.sentences[:30], batch_size=batch_size, shuffle=True)
        for epoch in range(epochs):
            for batch in dataloader:
                # Tokenize batch
                encoded = self.tokenizer(
                    batch,
                    is_split_into_words=False,
                    return_tensors="pt",
                    padding=True,
                    truncation=True
                )
                device = next(self.model.parameters()).device
                encoded = {k: v.to(device) for k, v in encoded.items()}

                # Forward pass
                outputs = self.model(**encoded)
                # Use mean of embeddings as dummy loss (placeholder)
                loss = outputs.last_hidden_state.mean()
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()

    def forward(self, vector_size: int = 768, fine_tune: bool = False) -> Dict[str, List[float]]:
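        """
        Compute word embeddings for all words in the loaded sentences.

        Returns a dict mapping each lowercased word to its embedding, averaged
        over all occurrences and truncated to the first `vector_size` dimensions.
        If `fine_tune` is True, the placeholder fine-tuning loop runs first.
        """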
        if fine_tune:
            self.fine_tune(epochs=2)

        # Always extract in eval mode so dropout does not perturb the embeddings
        self.model.eval()
        word_vectors = defaultdict(list)

        for sentence in self.sentences:
            words = sentence.strip().split()
            if not words:
                continue

            encoded = self.tokenizer(
                words,
                is_split_into_words=True,
                return_tensors="pt",
                padding=True,
                truncation=True
            )

            # Get word_ids before converting to device
            word_ids = encoded.word_ids(batch_index=0)
            device = next(self.model.parameters()).device
            encoded = {k: v.to(device) for k, v in encoded.items()}

            with torch.no_grad():  # gradients are not needed for extraction
                outputs = self.model(**encoded)
                last_hidden = outputs.last_hidden_state.squeeze(0)

            token_buckets = defaultdict(list)
            for idx, wid in enumerate(word_ids):
                if wid is not None:
                    token_buckets[wid].append(last_hidden[idx])

            for wid, sub_embeds in token_buckets.items():
                word = words[wid].lower()
                vec = torch.stack(sub_embeds).mean(dim=0)
                word_vectors[word].append(vec.detach().cpu())

        final_embeddings = {
            w: torch.stack(v).mean(dim=0)[:vector_size].tolist()
            for w, v in word_vectors.items()
            if w not in self.tokenizer.all_special_tokens
        }

        return final_embeddings

In [40]:
embedding = TransformerEmbedder("Text_Data.json")
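# Fine-tune first, then keep only the first 10 dimensions of each word vector
embed_vectors = embedding(vector_size=10, fine_tune=True)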

4
