<center>

# NLP_9 : FastText Embeddings

</center>


## 1. Introduction

**FastText** is a word embedding technique developed by **Facebook AI Research (FAIR)**.  
It is an extension of **Word2Vec**, designed to overcome Word2Vec's limitation in handling
**rare words and out-of-vocabulary (OOV) words**.

Unlike Word2Vec, FastText represents a word as a **collection of character n-grams**, rather than a single token.

## 2. Motivation Behind FastText

Traditional word embedding models like **Word2Vec** treat each word as an atomic unit.
As a result:
- Rare words get poor embeddings  
- OOV words are ignored completely  
- Morphological information is lost  

FastText solves this by learning embeddings for **subwords (character n-grams)**.

## 3. Core Idea of FastText

A word is broken into **character n-grams**, and the word embedding is computed as the **sum (or average)** of its n-gram vectors.

### Example

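Word: `playing`

Character n-grams (n = 3), after wrapping the word in the boundary markers `<` and `>`: `<pl`, `pla`, `lay`, `ayi`, `yin`, `ing`, `ng>`

The full sequence `<playing>` is also kept as its own unit.

Final word vector:

$$\vec{v}_{\text{playing}} = \sum_{g \in G(\text{playing})} \vec{z}_g$$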
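
The n-gram split can be reproduced with a few lines of Python. This is a minimal sketch, not fastText's own code: `char_ngrams` is a hypothetical helper name, and it assumes the word is padded with `<` and `>` so that prefixes and suffixes stay distinguishable from n-grams inside the word.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, using < and > as boundary markers."""
    wrapped = f"<{word}>"
    # Slide a window of width n over the padded word.
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
```

In practice a range of n-gram lengths is typically used rather than a single n, so the function would be called once per length and the results combined.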

## 4. Model Architecture

FastText uses the same architectures as Word2Vec:
- **CBOW**
- **Skip-Gram**

But instead of learning embeddings for whole words only, it learns embeddings for:
- Words  
- Character n-grams  
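
To make the "words plus n-grams" idea concrete, here is a toy sketch of how a vector could be assembled from both tables. Everything in it is an illustrative assumption: the table sizes, the random (untrained) values, the `crc32` stand-in hash, and the helper names `bucket` and `word_vector`. A real model learns these vectors during training.

```python
import random
import zlib

DIM = 4         # toy embedding size (assumption)
BUCKETS = 1000  # toy number of n-gram hash buckets (assumption)

random.seed(0)
# Shared table of n-gram vectors, indexed by hash bucket (random, untrained values).
ngram_table = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(BUCKETS)]
# Separate table for whole words seen during training.
word_table = {"playing": [random.uniform(-0.5, 0.5) for _ in range(DIM)]}


def char_ngrams(word, n=3):
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def bucket(ngram):
    # Stand-in hash for illustration only.
    return zlib.crc32(ngram.encode("utf-8")) % BUCKETS


def word_vector(word):
    """Average the word's n-gram vectors, plus its own vector if it is in the vocabulary."""
    rows = [ngram_table[bucket(g)] for g in char_ngrams(word)]
    if word in word_table:
        rows.append(word_table[word])
    if not rows:  # word too short to produce any n-gram
        return [0.0] * DIM
    return [sum(col) / len(rows) for col in zip(*rows)]


print(word_vector("playing"))  # in-vocabulary: word vector + n-gram vectors
print(word_vector("played"))   # out-of-vocabulary: n-gram vectors only
```

Because `playing` and `played` share the n-grams `<pl`, `pla`, and `lay`, part of what is learned for one word carries over to the other.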


## 5. Learning Objective (Skip-Gram FastText)

Given a **target word**, FastText predicts surrounding context words using subword information.

Mathematically:

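$$s(w_{target}, w_{context}) = \sum_{g \in G(w_{target})} \vec{z}_g^{\top} \vec{v}_{context}$$

$$P(w_{context} \mid w_{target}) = \frac{\exp\big(s(w_{target}, w_{context})\big)}{\sum_{w'} \exp\big(s(w_{target}, w')\big)}$$

where:

$$
\begin{aligned}
G(w) & = \text{set of character n-grams of word } w \\
\vec{z}_g & = \text{embedding vector of n-gram } g \\
\vec{v}_{context} & = \text{output vector of the context word}
\end{aligned}
$$

In practice the full softmax is usually approximated, for example with negative sampling.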
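
The objective can be illustrated with toy numbers. The sketch below assumes the usual skip-gram formulation: the n-gram vectors $\vec{z}_g$ of the target word are summed, the sum is scored against each candidate context vector with a dot product, and a softmax turns the scores into probabilities. The vectors, the two candidate words, and the function names are made up for illustration.

```python
import math


def score(ngram_vectors, context_vector):
    """Dot product between the summed n-gram vectors and one context vector."""
    target = [sum(col) for col in zip(*ngram_vectors)]
    return sum(t * c for t, c in zip(target, context_vector))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy n-gram vectors z_g for one target word (made-up values).
ngram_vectors = [[0.1, 0.3], [0.2, -0.1], [0.0, 0.4]]
# Toy output vectors for two candidate context words (made-up values).
context_vectors = {"game": [1.0, 0.5], "tax": [-1.0, 0.2]}

scores = [score(ngram_vectors, v) for v in context_vectors.values()]
probs = dict(zip(context_vectors, softmax(scores)))
print(probs)  # "game" receives the higher probability
```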


## 6. Advantages of FastText

- Handles **out-of-vocabulary (OOV)** words  
- Performs well on **small datasets**  
- Captures **morphological structure**  
- Better embeddings for **rare words**  
- Works well for **morphologically rich languages**


## 7. Limitations

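- Larger model, because n-gram vectors are stored in addition to word vectors  
- Higher memory usage compared to Word2Vec  
- Slower training, because every update to a word also touches its n-gram vectors  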
- Still limited in capturing long-range context compared to transformers  
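
The memory cost can be estimated with simple arithmetic. The sketch below assumes 32-bit floats and illustrative sizes: a hypothetical vocabulary of 100,000 words, and 2,000,000 n-gram buckets with 100 dimensions, which mirror commonly cited fastText defaults but should be treated as assumptions here.

```python
def table_size_mb(rows, dim, bytes_per_value=4):
    """Size of an embedding table in megabytes (10^6 bytes)."""
    return rows * dim * bytes_per_value / 1e6


vocab_size = 100_000  # hypothetical vocabulary size
buckets = 2_000_000   # assumed number of n-gram hash buckets
dim = 100             # assumed embedding dimension

print(table_size_mb(vocab_size, dim))  # word table only: 40.0 MB
print(table_size_mb(buckets, dim))     # n-gram table: 800.0 MB
```

Under these assumptions the n-gram table dominates the model size, which is why reducing the bucket count or the dimension is the usual lever when memory is tight.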

## 8. Comparison with Word2Vec

| Feature | Word2Vec | FastText |
|-------|---------|----------|
| Uses subword information | No | Yes |
| Handles OOV words | No | Yes |
| Rare word performance | Weak | Strong |
| Model size (stored vectors) | Smaller | Larger |
| Training speed | Faster | Slower |


## 9. Applications of FastText

- Text classification  
- Sentiment analysis  
- Language identification  
- Named Entity Recognition (NER)  
- Low-resource languages  


## 10. Summary

- FastText extends Word2Vec using **character n-grams**
- Generates embeddings even for **unseen words**
- Ideal for **rare words and morphologically complex languages**
- Widely used in real-world NLP systems


##### Dataset Credits: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification

We have a dataset of e-commerce item descriptions with 4 categories:
1. Household
2. Electronics
3. Clothing & Accessories
4. Books

The task is to classify a product into one of these 4 categories based on its description.

In [1]:
import pandas as pd

df = pd.read_csv("ecommerce_dataset.csv", names=["category", "description"], header=None)
print(df.shape)
df.head(3)

(50425, 2)


Unnamed: 0,category,description
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...


**Drop NA values**

In [2]:
df.dropna(inplace=True)
df.shape

(50424, 2)

In [3]:
df.category.unique()

array(['Household', 'Books', 'Clothing & Accessories', 'Electronics'],
      dtype=object)

In [4]:
# Assign the result back to the column. Calling replace(..., inplace=True) on
# df.category triggers a pandas warning and will stop working in pandas 3.0.
df["category"] = df["category"].replace("Clothing & Accessories", "Clothing_Accessories")


In [5]:
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

When you train a supervised fastText model, it expects each label to carry the `__label__` prefix. We first add this prefix to the category column, then create a third column in the dataframe that combines the prefixed label with the product description.

In [6]:
df['category'] = '__label__' + df['category'].astype(str)
df.head(5)

Unnamed: 0,category,description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...


In [7]:
df['category_description'] = df['category'] + ' ' + df['description']
df.head(3)

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__Household Paper Plane Design Framed W...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__Household SAF 'Floral' Framed Paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__Household SAF 'UV Textured Modern Art...
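
The line format built in the cells above can also be sketched for a single record. `to_fasttext_line` is a hypothetical helper, not part of the fasttext package; it assumes the default `__label__` prefix and joins any whitespace in the label with underscores so the label stays a single token.

```python
def to_fasttext_line(label, text, prefix="__label__"):
    """Build one training line: prefixed label, a space, then the text."""
    # A label containing spaces would be split into several tokens, so join it.
    clean_label = "_".join(label.split())
    return f"{prefix}{clean_label} {text}"


print(to_fasttext_line("Household", "wooden wall shelf"))
# __label__Household wooden wall shelf
```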


**Pre-processing**
1. Remove punctuation
2. Remove extra spaces
3. Convert the entire sentence to lower case

In [8]:
import re

text = "  VIKI's | Bookcase/Bookshelf (3-Shelf/Shelve, White) | ? . hi"
text = re.sub(r'[^\w\s\']',' ', text)
text = re.sub(' +', ' ', text)
text.strip().lower()

"viki's bookcase bookshelf 3 shelf shelve white hi"

In [9]:
def preprocess(text):
    text = re.sub(r'[^\w\s\']',' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip().lower() 

In [10]:
df['category_description'] = df['category_description'].map(preprocess)
df.head()

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__household paper plane design framed w...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__household saf 'floral' framed paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__household saf 'uv textured modern art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__household saf flower print framed pai...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,__label__household incredible gifts india wood...


**Train Test Split**

In [11]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

In [12]:
train.shape, test.shape

((40339, 3), (10085, 3))

In [13]:
train.to_csv("ecommerce.train", columns=["category_description"], index=False, header=False)
test.to_csv("ecommerce.test", columns=["category_description"], index=False, header=False)

**Train the model and evaluate performance**

In [14]:
import fasttext

model = fasttext.train_supervised(input="ecommerce.train")
model.test("ecommerce.test")

(10085, 0.969360436291522, 0.969360436291522)

The first value (10085) is the number of test samples. The second and third values are precision@1 and recall@1, respectively. Both are about 0.97, which is a strong result for a model trained with default settings.
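
The two scores returned by `model.test` are identical here because every description has exactly one true label and the model is asked for exactly one prediction. The sketch below shows the arithmetic on made-up labels; `precision_recall_at_1` is a hypothetical helper, not a fasttext function.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Precision and recall when each sample has one true label and one prediction."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    n_predictions = len(predicted_labels)  # denominator for precision
    n_true = len(true_labels)              # denominator for recall
    return correct / n_predictions, correct / n_true


y_true = ["books", "household", "electronics", "books"]
y_pred = ["books", "household", "books", "books"]
print(precision_recall_at_1(y_true, y_pred))
# (0.75, 0.75)
```

With several true labels per sample, or more than one prediction per sample, the two denominators differ and the scores are no longer equal.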

**Now let's predict the category for a few product descriptions**

In [15]:
model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")

(('__label__electronics',), array([0.99720895]))

In [16]:
model.predict("ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric")

(('__label__clothing_accessories',), array([1.00001001]))

In [17]:
model.predict("think and grow rich deluxe edition")

(('__label__books',), array([1.00000966]))

In [18]:
model.get_nearest_neighbors("painting")

[(0.9975425601005554, 'vacuum'),
 (0.996705949306488, 'equipment'),
 (0.9960797429084778, 'door'),
 (0.9955933690071106, 'mist'),
 (0.9955537915229797, 'temperature'),
 (0.995466947555542, 'canopy'),
 (0.9952825903892517, 'heating'),
 (0.995228111743927, 'pour'),
 (0.9950337409973145, 'ro'),
 (0.9949843883514404, 'lint')]

In [19]:
model.get_nearest_neighbors("sony")

[(0.9983882308006287, 'telescope'),
 (0.9980239272117615, 'tripod'),
 (0.9978470802307129, 'antenna'),
 (0.9976548552513123, 'ram'),
 (0.9970351457595825, 'whey'),
 (0.9969323873519897, 'charger'),
 (0.9967682361602783, 'macbook'),
 (0.9967113137245178, 'viewing'),
 (0.996561586856842, 'receiver'),
 (0.9964932799339294, 'iball')]

In [20]:
model.get_nearest_neighbors("banglore")

[(0.0, 'to'),
 (0.0, 'and'),
 (0.0, 'a'),
 (0.0, 'with'),
 (0.0, 'for'),
 (0.0, 'is'),
 (0.0, '</s>'),
 (0.0, '588973113'),
 (0.0, 'councils'),
 (0.0, 'schuon')]

<div style="text-align: right;">
    <b>Author:</b> Monower Hossen <br>
    <b>Date:</b> January 10, 2026
</div>
