### NLP Tutorial: Text Classification Using FastText
**Dataset Credits:** https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification

We have a dataset of ecommerce item descriptions, divided into 4 categories:

1. Household
2. Electronics
3. Clothing and Accessories
4. Books

The task at hand is to classify a product into one of the above 4 categories based on its description.

In [5]:
import pandas as pd

df = pd.read_csv("ecommerce_dataset.csv", names=["category", "description"], header=None)
print(df.shape)
df.head(10)

(50425, 2)


| | category | description |
|---|---|---|
| 0 | Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... |
| 3 | Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 4 | Household | Incredible Gifts India Wooden Happy Birthday U... |
| 5 | Household | Pitaara Box Romantic Venice Canvas Painting 6m... |
| 6 | Household | Paper Plane Design Starry Night Vangoh Wall Ar... |
| 7 | Household | Pitaara Box Romantic Venice Canvas Painting 6m... |
| 8 | Household | SAF 'Ganesh Modern Art Print' Painting (Synthe... |
| 9 | Household | Paintings Villa UV Textured Modern Art Print F... |


In [2]:
df.dropna(inplace=True)
df.shape

(50424, 2)

In [3]:
df.category.unique()

array(['Household', 'Books', 'Clothing & Accessories', 'Electronics'],
      dtype=object)

In [4]:
df.category.value_counts()

Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8670
Name: category, dtype: int64

In [6]:
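# fastText splits on whitespace, so the label must not contain spaces.
# Assigning the result back avoids relying on in-place replacement through attribute access.
df["category"] = df["category"].replace("Clothing & Accessories", "Clothing_Accessories")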

In [7]:
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

In [8]:
df['category'] = '__label__' + df['category'].astype(str)
df.head(5)

| | category | description |
|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... |
| 3 | __label__Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 4 | __label__Household | Incredible Gifts India Wooden Happy Birthday U... |


In [34]:
df['category_description'] = df['category'] + ' ' + df['description']
df.head(3)

| | category | description | category_description |
|---|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... | __label__Household Paper Plane Design Framed W... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | __label__Household SAF 'Floral' Framed Paintin... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... | __label__Household SAF 'UV Textured Modern Art... |


### Pre-processing

1. Remove punctuation
2. Remove extra spaces
3. Make the entire sentence lower case

In [28]:
import re

text = "  VIKI's | Bookcase/Bookshelf (3-Shelf/Shelve, White) | ? . hi"
text = re.sub(r'[^\w\s\']',' ', text)
text = re.sub(' +', ' ', text)
text.strip().lower()

"viki's bookcase bookshelf 3 shelf shelve white hi"

In [30]:
def preprocess(text):
    text = re.sub(r'[^\w\s\']',' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip().lower()

In [31]:
preprocess("  VIKI's | Bookcase/Bookshelf (3-Shelf/Shelve, White) | ? . hi")

"viki's bookcase bookshelf 3 shelf shelve white hi"

In [52]:
# df['category_description'] = df['category_description'].map(preprocess)
# df.head()
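
If the cleaning step is applied, one option is to clean only the description and attach the label afterwards, so the `__label__` prefix and the category's casing stay untouched. The sketch below is a minimal illustration of that idea; `to_fasttext_line` is a hypothetical helper name, not part of the notebook or of fastText.

```python
import re

def preprocess(text):
    # Same cleaning steps as above: drop punctuation, squeeze spaces, lowercase.
    text = re.sub(r"[^\w\s']", " ", text)
    text = re.sub(" +", " ", text)
    return text.strip().lower()

def to_fasttext_line(label, description):
    # Hypothetical helper: clean only the description, then prepend the label.
    return f"__label__{label} {preprocess(description)}"

line = to_fasttext_line("Household", "SAF 'Floral' Framed Painting (Wood, 30 inch)")
print(line)
```

Whichever variant is used, the same `preprocess` function should also be applied to any text passed to the model at prediction time.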

### Train Test Split

In [36]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

In [37]:
train.shape, test.shape

((40340, 3), (10085, 3))

In [38]:
test.head(3)

| | category | description | category_description |
|---|---|---|---|
| 18212 | __label__Household | WULF Wood Chisel Set, 0.25-1inch - Set of 4 Th... | __label__Household WULF Wood Chisel Set, 0.25-... |
| 43640 | __label__Electronics | Samsung 23.5 inch (59.8 cm) LED Monitor - Full... | __label__Electronics Samsung 23.5 inch (59.8 c... |
| 28719 | __label__Books | Ethics and Law in Dental Hygiene: Pageburst E-... | __label__Books Ethics and Law in Dental Hygien... |


In [39]:
train.to_csv("ecommerce.train", columns=["category_description"], index=False, header=False)
test.to_csv("ecommerce.test", columns=["category_description"], index=False, header=False)
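
One thing worth checking in the files written above: by default `to_csv` wraps any field that contains a comma, a quote, or a newline in double quotes, so such lines start with `"__label__...` rather than `__label__...`. fastText identifies labels by the `__label__` prefix, so those lines may end up without a usable label, which would be consistent with `model.test` counting far fewer examples than the test split contains. A minimal sketch of an alternative, assuming one example per line is wanted, writes the strings directly; `write_fasttext_file` and the file name `example.train` are hypothetical.

```python
def write_fasttext_file(lines, path):
    # Hypothetical helper: write one example per line with no CSV quoting.
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            # Collapse embedded newlines and repeated whitespace so each
            # example stays on a single line.
            f.write(" ".join(str(line).split()) + "\n")

examples = [
    "__label__Household WULF Wood Chisel Set, 0.25-1inch - Set of 4",
    "__label__Books Ethics and Law in\nDental Hygiene",
]
write_fasttext_file(examples, "example.train")

with open("example.train", encoding="utf-8") as f:
    written = f.read().splitlines()
print(written)
```

With the notebook's DataFrame, this could be called as `write_fasttext_file(train["category_description"], "ecommerce.train")`, and likewise for the test split.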

### Train the model and evaluate performance

In [41]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.11.1-py3-none-any.whl (227 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp310-cp310-linux_x86_64.whl size=4199775 sha256=6899c45b8993127d57436495e6d1ce4facc23dc092e9e1b0487049a9d7d5cf66
  Stored in directory: /root/.cache/pip/wheels/a5/13/75/f811c84a8ab36eedbaef977a6a58a98990e8e0f1967f98f394
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.11.1


In [42]:
import fasttext

model = fasttext.train_supervised(input="ecommerce.train")
model.test("ecommerce.test")

(2318, 0.9590163934426229, 0.9590163934426229)
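
`model.test` returns a tuple holding the number of examples evaluated, precision at 1, and recall at 1. When every example has exactly one true label and only the top prediction is scored, the two metrics coincide, which is why the last two numbers above are identical. The sketch below computes them by hand on made-up labels to show the arithmetic; it is an illustration, not fastText's own code.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    # Each example has one true label and one predicted label.
    n = len(true_labels)
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    precision = correct / n  # correct predictions / predictions made
    recall = correct / n     # correct predictions / true labels
    return n, precision, recall

# Made-up labels for illustration only.
true = ["Books", "Household", "Electronics", "Books"]
pred = ["Books", "Household", "Household", "Books"]
result = precision_recall_at_1(true, pred)
print(result)
```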

### Now let's predict the category for a few product descriptions

In [43]:
model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")

(('__label__Electronics',), array([0.51310372]))

In [44]:
model.predict("ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric")

(('__label__Clothing_Accessories',), array([0.99995899]))

In [45]:
model.predict("think and grow rich deluxe edition")

(('__label__Books',), array([0.98485148]))
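
If the training text was preprocessed, the text passed to `predict` should go through the same cleaning. The sketch below wraps that idea in a hypothetical `predict_category` helper that accepts any object with a fastText-style `predict` method; the `StubModel` class exists only so the example runs without a trained model and is not part of fastText.

```python
import re

def preprocess(text):
    text = re.sub(r"[^\w\s']", " ", text)
    text = re.sub(" +", " ", text)
    return text.strip().lower()

def predict_category(model, text):
    # Hypothetical helper: clean the text, keep it on one line,
    # and strip the label prefix from the top prediction.
    cleaned = " ".join(preprocess(text).split())
    labels, probs = model.predict(cleaned)
    return labels[0].replace("__label__", ""), float(probs[0])

class StubModel:
    # Stand-in for a trained model; returns a fixed answer shaped like
    # the (labels, probabilities) pair seen in the outputs above.
    def predict(self, text):
        self.last_input = text
        return ("__label__Books",), [0.98]

stub = StubModel()
label, prob = predict_category(stub, "Think and Grow Rich:\nDeluxe Edition")
print(label, prob, stub.last_input)
```

With the model trained above, the call would be `predict_category(model, "some product description")`.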

In [46]:
model.get_nearest_neighbors("painting")

[(0.5098688006401062, 'because'),
 (0.5074371695518494, '850'),
 (0.5039157867431641, 'Fibre'),
 (0.4930494725704193, 'toy'),
 (0.489446759223938, 'Proof'),
 (0.4789711534976959, 'hours'),
 (0.4782071113586426, 'Led'),
 (0.47722262144088745, 'Charms'),
 (0.47673627734184265, '(Blue'),
 (0.4721719026565552, 'bar')]

In [47]:
model.get_nearest_neighbors("sony")

[(0.9676256775856018, 'phone'),
 (0.967024564743042, 'eSATA'),
 (0.9669581651687622, 'A521X'),
 (0.9669325947761536, '(left'),
 (0.9669070839881897, 'name:58'),
 (0.9668970108032227, 'fan)'),
 (0.9668970108032227, 'COMPUTER'),
 (0.9668970108032227, '(Single'),
 (0.9668970108032227, 'name:PHOENIX'),
 (0.9668970108032227, 'INTL')]

In [48]:
model.get_nearest_neighbors("banglore")

[(0.0, 'to'),
 (0.0, 'the'),
 (0.0, 'a'),
 (0.0, 'is'),
 (0.0, 'for'),
 (0.0, 'with'),
 (0.0, '</s>'),
 (0.0, 'Charged.'),
 (0.0, 'Lifespan'),
 (0.0, 'over-Discharge')]
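
The all-zero scores for "banglore" suggest the word never appeared in the training data. Assuming the model was trained without subword n-grams (to our knowledge the default for `train_supervised`, though this is worth checking against the fastText documentation), an unseen word has no learned vector, so its similarity to every other word is 0 and the neighbours returned carry no meaning. The sketch below shows the arithmetic with made-up vectors; it is not fastText's implementation.

```python
import math

def cosine_similarity(a, b):
    # Treat a zero-length vector as having no similarity to anything.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

unseen = [0.0, 0.0, 0.0]   # made-up vector for a word with no learned embedding
known = [0.2, -0.5, 0.1]   # made-up vector for a word seen in training
score_unseen = cosine_similarity(unseen, known)
score_self = cosine_similarity(known, known)
print(score_unseen, score_self)
```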