<h3>NLP: Text Classification Using fastText</h3>

In [46]:
# pip install fasttext-wheel




##### Dataset Credits: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification

We have a dataset of e-commerce item descriptions, each belonging to one of 4 categories:
1. Household
2. Electronics
3. Clothing and Accessories
4. Books

The task is to classify a product into one of these 4 categories based on its description.

In [183]:
import pandas as pd

df = pd.read_csv("ecommerce_dataset.csv", names=["category", "description"], header=None)
print(df.shape)
df.head(3)

(50425, 2)


Unnamed: 0,category,description
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...


**Drop NA values**

In [185]:
df.dropna(inplace=True)
df.shape

(50424, 2)

In [187]:
df.category.unique()

array(['Household', 'Books', 'Clothing & Accessories', 'Electronics'],
      dtype=object)

In [190]:
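# Assign the result back: inplace=True on a selected column may not update df in newer pandas versions.
# The renamed label has no spaces, so fastText reads it as a single token.
df["category"] = df["category"].replace("Clothing & Accessories", "Clothing_Accessories")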

In [192]:
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

When you train a fastText model, it expects each label to carry the `__label__` prefix. We first add that prefix to the category column, then create a third column in the dataframe that combines the prefixed label with the product description.
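As a rough illustration of the line format described above, here is a minimal sketch of a helper that builds one training line. The function name `to_fasttext_line` is hypothetical and not part of fastText or of this notebook; it only shows why labels must be a single whitespace-free token and why each example has to fit on one line.

```python
import re


def to_fasttext_line(label: str, text: str) -> str:
    """Build one supervised training line: '__label__<label> <text>'."""
    # fastText splits on whitespace, so a label with spaces would be cut
    # after its first word; replace runs of non-word characters with "_".
    safe_label = re.sub(r"\W+", "_", label.strip())
    # One example per line: collapse newlines, tabs and repeated spaces.
    flat_text = " ".join(text.split())
    return f"__label__{safe_label} {flat_text}"


print(to_fasttext_line("Clothing & Accessories", "Men's cotton  t-shirt"))
# __label__Clothing_Accessories Men's cotton t-shirt
```

The notebook reaches the same format with pandas string operations; the helper is only meant to make the rules explicit.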

In [195]:
df['category'] = '__label__' + df['category'].astype(str) 
df.head(5)

# Each category now carries the __label__ prefix that fastText expects.

Unnamed: 0,category,description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...


In [197]:
df['category_description'] = df['category'] + ' ' + df['description']
df.head(3)

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__Household Paper Plane Design Framed W...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__Household SAF 'Floral' Framed Paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__Household SAF 'UV Textured Modern Art...


**Pre-processing**
1. Remove punctuation (apostrophes are kept)
2. Remove extra spaces
3. Convert the entire sentence to lower case

In [200]:
import re

text = "  VIKI's | Bookcase/Bookshelf (3-Shelf/Shelve, White) | ? . hi"
text = re.sub(r'[^\w\s\']',' ', text)
text = re.sub(' +', ' ', text)
text.strip().lower()

"viki's bookcase bookshelf 3 shelf shelve white hi"

In [202]:
def preprocess(text):
    text = re.sub(r'[^\w\s\']',' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip().lower() 

In [204]:
df['category_description'] = df['category_description'].map(preprocess)
df.head()

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__household paper plane design framed w...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__household saf 'floral' framed paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__household saf 'uv textured modern art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__household saf flower print framed pai...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,__label__household incredible gifts india wood...


**Train Test Split**

In [206]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

In [207]:
train.shape, test.shape

((40339, 3), (10085, 3))

In [208]:
# Generating files for fasttext.

train.to_csv("ecommerce.train", columns=["category_description"], index=False, header=False)
test.to_csv("ecommerce.test", columns=["category_description"], index=False, header=False)
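Before training, it can be worth checking that every line in the generated files really starts with a label. One possible reason it might not: pandas quotes a CSV field that contains a double quote or a line break, which would put a `"` in front of `__label__`. The sketch below is an optional sanity check; `check_fasttext_file` is a hypothetical helper, and the demo uses made-up lines rather than the real dataset.

```python
import os
import tempfile
from collections import Counter


def check_fasttext_file(path, prefix="__label__"):
    """Count examples per label; raise if a line has no leading label."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            tokens = line.split()
            if not tokens or not tokens[0].startswith(prefix):
                raise ValueError(
                    f"line {line_number} has no leading label: {line[:60]!r}"
                )
            counts[tokens[0]] += 1
    return counts


# Demo on a small temporary file (made-up lines, not the real dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.train")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("__label__books think and grow rich\n")
        handle.write("__label__household framed wall painting\n")
    demo_counts = check_fasttext_file(demo_path)

print(demo_counts)
```

On the real files this would be called as `check_fasttext_file("ecommerce.train")`; the returned counts also give a quick view of the class balance.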

**Train the model and evaluate performance**

In [210]:
import fasttext

model = fasttext.train_supervised(input="ecommerce.train")
model.test("ecommerce.test")

# (size, precision, recall)

(10085, 0.9682697074863659, 0.9682697074863659)

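The first value (10085) is the number of test examples. The second and third values are precision and recall at 1, respectively. They are equal here because every example has exactly one label and the model predicts exactly one label per example. The model reaches about 96.8% precision, which is a strong result for a model trained with default settings.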
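To make the two metrics concrete, here is a small sketch of how precision and recall at k can be computed by hand. It is an illustration of the definitions, not fastText's own code; the function name `precision_recall_at_k` and the toy labels are assumptions made for this example.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """gold: list of sets of true labels; predicted: list of ranked label lists."""
    correct = n_predicted = n_gold = 0
    for true_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        # A prediction counts as correct if it is one of the true labels.
        correct += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    # Precision divides by predictions made, recall by true labels present.
    return correct / n_predicted, correct / n_gold


# Toy data: one true label per example, two ranked guesses per example.
gold = [{"books"}, {"electronics"}, {"household"}, {"books"}]
predicted = [
    ["books", "household"],
    ["household", "electronics"],
    ["household", "books"],
    ["books", "electronics"],
]

print(precision_recall_at_k(gold, predicted, k=1))  # (0.75, 0.75)
print(precision_recall_at_k(gold, predicted, k=2))  # (0.5, 1.0)
```

With k=1 and a single true label per example, both denominators equal the number of examples, so the two numbers coincide. With k=2 they diverge, since each example now receives two predictions but still has only one true label.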

**Now let's predict the category for a few product descriptions**

In [213]:
model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")
# The output is a tuple: the predicted label(s) in fastText format, then their probabilities.

(('__label__electronics',), array([0.98561972]))

In [214]:
model.predict("ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric")

(('__label__clothing_accessories',), array([1.00001001]))

In [215]:
model.predict("think and grow rich deluxe edition")

(('__label__books',), array([1.00000989]))

In [219]:
model.get_nearest_neighbors("painting")

[(0.9976388216018677, 'vacuum'),
 (0.9968333840370178, 'guard'),
 (0.9968314170837402, 'heating'),
 (0.9966275095939636, 'lid'),
 (0.9962871670722961, 'lamp'),
 (0.9962404370307922, 'microwave'),
 (0.9952072501182556, 'mist'),
 (0.9951967000961304, 'juicer'),
 (0.994998037815094, 'bulb'),
 (0.994970977306366, 'toaster')]

In [221]:
model.get_nearest_neighbors("sony")

[(0.9988397359848022, 'external'),
 (0.998672366142273, 'binoculars'),
 (0.9981507658958435, 'dvd'),
 (0.9975149631500244, 'nikon'),
 (0.9973592162132263, 'glossy'),
 (0.9968783259391785, 'laptop'),
 (0.9967763423919678, 'binocular'),
 (0.9965054392814636, 'mic'),
 (0.9964385628700256, 'devices'),
 (0.9964382648468018, 'stereo')]

In [222]:
model.get_nearest_neighbors("banglore")

[(0.0, 'to'),
 (0.0, 'and'),
 (0.0, 'a'),
 (0.0, 'with'),
 (0.0, 'for'),
 (0.0, 'is'),
 (0.0, '</s>'),
 (0.0, 'biswaer'),
 (0.0, 'dwadash'),
 (0.0, 'colourants')]
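A score of 0.0 for every neighbour most likely means that "banglore" never occurred in the training data, so the model has no learned vector for it. This assumes the default supervised settings, which should not use character n-grams to build vectors for unseen words. The sketch below shows one way to check a list of query words against a vocabulary first; `split_by_vocabulary` is a hypothetical helper, and the small vocabulary is a stand-in for what `model.get_words()` would return.

```python
def split_by_vocabulary(words, vocabulary):
    """Return (known, unknown) words relative to a model vocabulary."""
    vocab = set(vocabulary)
    known = [word for word in words if word in vocab]
    unknown = [word for word in words if word not in vocab]
    return known, unknown


# Stand-in vocabulary for illustration; with a trained model one would
# pass model.get_words() instead.
demo_vocabulary = ["painting", "sony", "laptop", "to", "and"]
known, unknown = split_by_vocabulary(
    ["painting", "sony", "banglore"], demo_vocabulary
)

print(known)    # ['painting', 'sony']
print(unknown)  # ['banglore']
```

Calling `get_nearest_neighbors` only for the known words avoids reading meaning into neighbour lists like the one above.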