Text classification for e-commerce items using the fastText library
 

We have a dataset of e-commerce item descriptions, covering 4 categories in total:

1. Household
2. Electronics
3. Clothing and Accessories
4. Books

The task at hand is to classify a product into one of the above 4 categories based on its description.

In [3]:
# importing data from Kaggle
! pip install kaggle
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
mkdir: cannot create directory ‘/root/.kaggle’: File exists


In [4]:
! kaggle datasets download -d saurabhshahane/ecommerce-text-classification

Downloading ecommerce-text-classification.zip to /content
100% 7.86M/7.86M [00:00<00:00, 69.2MB/s]


In [5]:
! unzip /content/ecommerce-text-classification.zip

Archive:  /content/ecommerce-text-classification.zip
  inflating: ecommerceDataset.csv    


In [6]:
import pandas as pd
df = pd.read_csv("ecommerceDataset.csv", names= ["category", "description"], header= None)
print(df.shape)
df.head()


(50425, 2)


Unnamed: 0,category,description
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,Household,Incredible Gifts India Wooden Happy Birthday U...


In [7]:
df.category.value_counts()

Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8671
Name: category, dtype: int64

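The classes are uneven (Household has roughly 2.2 times as many items as Clothing & Accessories), but the dataset is not heavily imbalanced.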
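
To put a number on the class balance, the counts printed above can be turned into class shares and a largest-to-smallest ratio. This is a minimal sketch in plain Python; the `counts` dictionary is copied by hand from the `value_counts()` output, and the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "Household": 19313,
    "Books": 11820,
    "Electronics": 10621,
    "Clothing & Accessories": 8671,
}

total = sum(counts.values())

# Share of each class in the whole dataset.
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label:<24} {share:.1%}")

# Ratio between the most and the least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

A ratio this small usually does not call for resampling, though per-class metrics are still worth checking after training.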

In [8]:
# check for missing values

df.isnull().sum()

category       0
description    1
dtype: int64

In [9]:
df.dropna(inplace= True)
df.shape

(50424, 2)

In [12]:
# rename the "Clothing & Accessories" category so the label contains no spaces
df["category"] = df["category"].replace("Clothing & Accessories", "Clothing_Accessories")

In [13]:
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

When you train a fastText model, it expects every label to carry a prefix, which is `__label__` by default. We will first add that prefix to the category, and then create another column in the dataframe that holds the prefixed label followed by the product description.
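
As a rough illustration of that format, each line of a fastText training file starts with a label token, followed by the text. The helper below, `to_fasttext_line`, is a hypothetical name introduced here and is not part of fastText; it assumes labels should contain no whitespace so that each label stays a single token.

```python
def to_fasttext_line(label: str, text: str, prefix: str = "__label__") -> str:
    """Build one training line: '<prefix><label> <text>'."""
    # Join whitespace-separated pieces of the label with underscores
    # so the label remains a single token.
    clean_label = "_".join(label.split())
    return f"{prefix}{clean_label} {text}"


line = to_fasttext_line("Household", "Paper Plane Design Framed Wall Hanging")
print(line)
```

The notebook cells that follow do the same thing column-wise with pandas.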



In [14]:
df['category'] = "__label__" + df["category"].astype(str)
df.head()

Unnamed: 0,category,description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...


Merge the category and the description into a single column:

In [28]:
df['category_description'] = df["category"] + " " + df["description"]
df.head()

Unnamed: 0,category,description,category_description,category_description.1
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__HouseholdPaper Plane Design Framed Wa...,__label__Household Paper Plane Design Framed W...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__HouseholdSAF 'Floral' Framed Painting...,__label__Household SAF 'Floral' Framed Paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__HouseholdSAF 'UV Textured Modern Art ...,__label__Household SAF 'UV Textured Modern Art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__HouseholdSAF Flower Print Framed Pain...,__label__Household SAF Flower Print Framed Pai...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,__label__HouseholdIncredible Gifts India Woode...,__label__Household Incredible Gifts India Wood...


In [29]:
df.category_description[0]

'__label__Household Paper Plane Design Framed Wall Hanging Motivational Office Decor Art Prints (8.7 X 8.7 inch) - Set of 4 Painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it. This is an special series of paintings which makes your wall very beautiful and gives a royal touch. This painting is ready to hang, you would be proud to possess this unique painting that is a niche apart. We use only the most modern and efficient printing technology on our prints, with only the and inks and precision epson, roland and hp printers. This innovative hd printing technique results in durable and spectacular looking prints of the highest that last a lifetime. We print solely with top-notch 100% inks, to achieve brilliant and true colours. Due to their high level of uv resistance, our prints retain their beautiful colours for many years. Add colour and style to your living space with this digitally printed painting. Some are for pleasure and so

# Preprocessing

1. Remove punctuation
2. Remove extra spaces
3. Make the entire sentence lower case

In [30]:
# USING REGEX
import re

text = "  VIKI's | Bookcase/Bookshelf (3-Shelf/Shelve, White) | ? . hi"
text = re.sub(r'[^\w\s\']',' ', text)
text = re.sub(' +', ' ', text)
text.strip().lower()

"viki's bookcase bookshelf 3 shelf shelve white hi"

In [31]:
def preprocess(text):
  text = re.sub(r'[^\w\s\']',' ', text)
  text = re.sub(' +', ' ', text)
  return text.strip().lower()

In [32]:
df['category_description'] = df['category_description'].map(preprocess)

In [33]:
df.head()

Unnamed: 0,category,description,category_description,category_description.1
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__HouseholdPaper Plane Design Framed Wa...,__label__household paper plane design framed w...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__HouseholdSAF 'Floral' Framed Painting...,__label__household saf 'floral' framed paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__HouseholdSAF 'UV Textured Modern Art ...,__label__household saf 'uv textured modern art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__HouseholdSAF Flower Print Framed Pain...,__label__household saf flower print framed pai...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,__label__HouseholdIncredible Gifts India Woode...,__label__household incredible gifts india wood...


In [35]:
df.category_description[0]

'__label__household paper plane design framed wall hanging motivational office decor art prints 8 7 x 8 7 inch set of 4 painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it this is an special series of paintings which makes your wall very beautiful and gives a royal touch this painting is ready to hang you would be proud to possess this unique painting that is a niche apart we use only the most modern and efficient printing technology on our prints with only the and inks and precision epson roland and hp printers this innovative hd printing technique results in durable and spectacular looking prints of the highest that last a lifetime we print solely with top notch 100 inks to achieve brilliant and true colours due to their high level of uv resistance our prints retain their beautiful colours for many years add colour and style to your living space with this digitally printed painting some are for pleasure and some for eternal bli

Train/test split

In [36]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size = 0.2)

In [37]:
train.shape, test.shape


((40339, 4), (10085, 4))

In [41]:
train.to_csv("ecommerce.train", columns= ["category_description"], index= False, header= False)
test.to_csv("ecommerce.test", columns= ["category_description"], index= False, header= False)

Train the model and evaluate performance

In [40]:
! pip install fasttext

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
Collecting pybind11>=2.2
  Using cached pybind11-2.10.0-py3-none-any.whl (213 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp37-cp37m-linux_x86_64.whl size=3168454 sha256=77ad9b06684cab6cdf9944e7d60f19c4f4817ce655524cffa976e8c77f75a08a
  Stored in directory: /root/.cache/pip/wheels/4e/ca/bf/b020d2be95f7641801a6597a29c8f4f19e38f9c02a345bab9b
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.10.0


In [42]:
import fasttext

model = fasttext.train_supervised (input= "ecommerce.train")
model.test("ecommerce.test")


(10083, 0.9677675295051076, 0.9677675295051076)

The first value (10083) is the number of test examples. The second and third values are precision and recall, respectively. We are getting about 96.8% precision, which is pretty good.
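
Because every product here has exactly one true label and the model returns one prediction per product by default, precision and recall come out identical, and both equal plain accuracy. A small sketch of unpacking that tuple and deriving an F1 score follows; the tuple is copied from the output above rather than recomputed, and the variable names are illustrative.

```python
# Result tuple copied from the model.test(...) output above:
# (number of examples, precision at 1, recall at 1)
n_examples, precision, recall = (10083, 0.9677675295051076, 0.9677675295051076)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# With one true label and one prediction per example,
# precision * n_examples is the number of correct predictions.
n_correct = round(precision * n_examples)

print(f"examples: {n_examples}")
print(f"precision: {precision:.4f}, recall: {recall:.4f}, F1: {f1:.4f}")
print(f"correct predictions: {n_correct}")
```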



Prediction on several product descriptions



In [43]:

model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")


(('__label__electronics',), array([0.99084079]))

In [44]:
model.predict("ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric")


(('__label__clothing_accessories',), array([1.00001001]))

In [45]:
model.predict("think and grow rich deluxe edition")


(('__label__books',), array([1.00000989]))
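
The strings passed to `model.predict` above are already lower-case and free of punctuation, matching how the training text was cleaned. A raw description should go through the same cleaning first. The wrapper below is a sketch under that assumption: `predict_category` is a hypothetical helper, `preprocess` repeats the function defined earlier, and `StubModel` is a stand-in used only so the example runs without a trained model.

```python
import re


def preprocess(text):
    # Same cleaning as applied to the training text.
    text = re.sub(r"[^\w\s']", " ", text)
    text = re.sub(" +", " ", text)
    return text.strip().lower()


def predict_category(model, raw_text, prefix="__label__"):
    """Clean raw_text, predict, and return (category, probability)."""
    labels, probs = model.predict(preprocess(raw_text))
    top = labels[0]
    category = top[len(prefix):] if top.startswith(prefix) else top
    # Reported probabilities can come out marginally above 1.0; clamp them.
    return category, min(float(probs[0]), 1.0)


class StubModel:
    """Stand-in (assumption) that mimics the shape of predict()'s return value."""

    def predict(self, text):
        self.last_input = text
        return (("__label__books",), [1.00000989])


stub = StubModel()
category, prob = predict_category(stub, "Think and Grow Rich: Deluxe Edition!")
print(category, prob, repr(stub.last_input))
```

With a trained model, `predict_category(model, raw_text)` would be called in the same way, passing the fastText model in place of the stub.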

In [46]:
model.get_nearest_neighbors("painting")


[(0.9988938570022583, 'saints'),
 (0.9988886713981628, 'televisions'),
 (0.9988877177238464, 'husker'),
 (0.9988875389099121, 'multifold'),
 (0.9988871216773987, 'electrade'),
 (0.9988827705383301, 'attributed'),
 (0.9988793134689331, '200kgs'),
 (0.9988771080970764, 'alt2035'),
 (0.9988765716552734, 'cheerful'),
 (0.9988734126091003, 'playfully')]

In [47]:
model.get_nearest_neighbors("sony")


[(0.9994961023330688, 'optimise'),
 (0.9994891881942749, 'notification'),
 (0.9994875192642212, '1mw'),
 (0.9994750618934631, 'ubuntu'),
 (0.9994686245918274, 'mgen2hn'),
 (0.9994666576385498, 'perfectionists'),
 (0.9994592070579529, 'ultraflex'),
 (0.9994592070579529, 'cradlefit'),
 (0.9994590282440186, 'ssa'),
 (0.9994538426399231, 'inspection')]