In [None]:
import pandas as pd

df = pd.read_csv('ecommerce_dataset.csv', names=['category', 'description'], header=None)

In [None]:
df.shape

(50425, 2)

In [None]:
df.head()

Unnamed: 0,category,description
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,Household,Incredible Gifts India Wooden Happy Birthday U...


In [None]:
df.category.value_counts()

category,count
Household,19313
Books,11820
Electronics,10621
Clothing & Accessories,8671


In [None]:
df.dropna(inplace=True)
df.shape

(50424, 2)

In [None]:
# Assign the result back to the column. Calling replace(..., inplace=True) on df.category
# raises a pandas FutureWarning and will no longer modify df in pandas 3.0.
df['category'] = df['category'].replace("Clothing & Accessories", "Clothing_Accessories")


In [None]:
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

**NOTE:**  
**fastText** performs `text classification` with **supervised learning**, and it expects each line of the training file to look like this:  
`__label__`{category} {Text}  
**Example:**  
`__label__`electronics Apple iPhone 6S.

If an example has **multiple** labels, list them all before the text:  
`__label__{category1}` `__label__{category2}` {Text}  
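
As a minimal sketch of the format described above, the helper below builds one training line from a list of labels and a text. The name `to_fasttext_line` is hypothetical and is not part of fastText. Replacing spaces inside labels and collapsing newlines are assumptions made here so that each example stays on a single line.

```python
def to_fasttext_line(labels, text):
    """Format one example as '__label__a __label__b some text' (hypothetical helper)."""
    # Labels are whitespace-delimited tokens, so spaces inside a label become underscores.
    prefix = " ".join("__label__" + label.replace(" ", "_") for label in labels)
    # Each example must fit on one line, so collapse newlines and repeated spaces.
    return prefix + " " + " ".join(text.split())


print(to_fasttext_line(["electronics"], "Apple iPhone 6S"))
# __label__electronics Apple iPhone 6S
print(to_fasttext_line(["books", "self help"], "Think and\nGrow Rich"))
# __label__books __label__self_help Think and Grow Rich
```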



In [None]:
# Adding prefix "__label__" to category.
df['category'] = "__label__" + df['category'].astype(str)

In [None]:
df.head()

Unnamed: 0,category,description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...


In [None]:
# Merging these two columns.
df['category_description'] = df['category'] + " " + df['description']

In [None]:
df.head()

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__Household Paper Plane Design Framed W...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__Household SAF 'Floral' Framed Paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__Household SAF 'UV Textured Modern Art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__Household SAF Flower Print Framed Pai...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,__label__Household Incredible Gifts India Wood...


In [None]:
df['category_description'][0]

'__label__Household Paper Plane Design Framed Wall Hanging Motivational Office Decor Art Prints (8.7 X 8.7 inch) - Set of 4 Painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it. This is an special series of paintings which makes your wall very beautiful and gives a royal touch. This painting is ready to hang, you would be proud to possess this unique painting that is a niche apart. We use only the most modern and efficient printing technology on our prints, with only the and inks and precision epson, roland and hp printers. This innovative hd printing technique results in durable and spectacular looking prints of the highest that last a lifetime. We print solely with top-notch 100% inks, to achieve brilliant and true colours. Due to their high level of uv resistance, our prints retain their beautiful colours for many years. Add colour and style to your living space with this digitally printed painting. Some are for pleasure and so

In [None]:
# Preprocessing using regular expression
import re

text = '__label__Household Paper Plane Design Framed Wall Hanging Motivational Office Decor Art Prints (8.7 X 8.7 inch) - Set of 4 Painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it. This is an special series of paintings which makes your wall very beautiful and gives a royal touch. This painting is ready to hang, you would be proud to possess this unique painting that is a niche apart.'
text = re.sub(r'[^\w\s\'.]|(?<!\d)\.(?!\d)', ' ', text)

In [None]:
text

'__label__Household Paper Plane Design Framed Wall Hanging Motivational Office Decor Art Prints  8.7 X 8.7 inch    Set of 4 Painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it  This is an special series of paintings which makes your wall very beautiful and gives a royal touch  This painting is ready to hang  you would be proud to possess this unique painting that is a niche apart '

In [None]:
# removing extra spaces
text = re.sub(r' +', ' ', text)

In [None]:
text

'__label__Household Paper Plane Design Framed Wall Hanging Motivational Office Decor Art Prints 8.7 X 8.7 inch Set of 4 Painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it This is an special series of paintings which makes your wall very beautiful and gives a royal touch This painting is ready to hang you would be proud to possess this unique painting that is a niche apart '

In [None]:
text.strip().lower() # strip() removes leading and trailing whitespace; lower() converts the text to lowercase.

'__label__household paper plane design framed wall hanging motivational office decor art prints 8.7 x 8.7 inch set of 4 painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it this is an special series of paintings which makes your wall very beautiful and gives a royal touch this painting is ready to hang you would be proud to possess this unique painting that is a niche apart'

In [None]:
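# Putting it all together

def preprocess(text):
  # Replace punctuation with spaces, keeping apostrophes and any period that touches a digit (e.g. 8.7).
  text = re.sub(r'[^\w\s\'.]|(?<!\d)\.(?!\d)', ' ', text)
  # Collapse runs of spaces into a single space.
  text = re.sub(r' +', ' ', text)
  # Trim surrounding whitespace and lowercase the result.
  return text.strip().lower()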
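
One subtlety of the regular expression is worth a quick check: a period is removed only when there is no digit on either side of it, so `8.7` survives, but so does the period in `4.` at the end of a sentence. The sketch below repeats `preprocess` so it runs on its own and compares it with `preprocess_strict`, a hypothetical variant (not used elsewhere in this notebook) that keeps a period only when digits sit on both sides and that also collapses tabs and newlines.

```python
import re


def preprocess(text):
    # Same logic as the notebook's function.
    text = re.sub(r"[^\w\s'.]|(?<!\d)\.(?!\d)", " ", text)
    text = re.sub(r" +", " ", text)
    return text.strip().lower()


def preprocess_strict(text):
    # Hypothetical variant: drop a period unless it has a digit on BOTH sides.
    text = re.sub(r"[^\w\s'.]|(?<!\d)\.|\.(?!\d)", " ", text)
    # Collapse every kind of whitespace (spaces, tabs, newlines).
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


sample = "Prints (8.7 X 8.7 inch) - Set of 4. Ready to hang, today!"
print(preprocess(sample))
# prints 8.7 x 8.7 inch set of 4. ready to hang today
print(preprocess_strict(sample))
# prints 8.7 x 8.7 inch set of 4 ready to hang today
```

Whether the stray `4.` token matters depends on the data; the original pattern is kept in the rest of this notebook.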

In [None]:
df['category_description'] = df['category_description'].map(preprocess)

In [None]:
df.head()

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__household paper plane design framed w...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__household saf 'floral' framed paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__household saf 'uv textured modern art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__household saf flower print framed pai...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,__label__household incredible gifts india wood...


In [None]:
# train_test_split
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

In [None]:
train.shape, test.shape

((40339, 3), (10085, 3))

Here `train` and `test` are the individual dataframes.

In [None]:
train.to_csv("ecommerce_train", columns=['category_description'], index=False, header=False)
test.to_csv("ecommerce_test", columns=['category_description'], index=False, header=False)
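
Before training, it can be worth confirming that every line of the written files really starts with `__label__`. A CSV writer may wrap a field in double quotes (for instance when the text contains a line break), and such a line would no longer begin with the label prefix. The check below is a sketch using only the standard library; `count_bad_lines` is a hypothetical helper, and it is demonstrated on a small temporary file rather than on the real `ecommerce_train`.

```python
import os
import tempfile


def count_bad_lines(path, prefix="__label__"):
    """Return (total_lines, lines_not_starting_with_prefix)."""
    total = bad = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            total += 1
            if not line.startswith(prefix):
                bad += 1
    return total, bad


# Demonstration file: the second line mimics a field quoted by a CSV writer.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("__label__books think and grow rich\n")
    tmp.write('"__label__household framed painting set of 4"\n')
    demo_path = tmp.name

result = count_bad_lines(demo_path)
os.remove(demo_path)
print(result)  # (2, 1)
```

On the real data the call would be `count_bad_lines("ecommerce_train")`, and a non-zero second value would point to lines that need cleaning.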

In [None]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73.4/73.4 kB 2.1 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Using cached pybind11-2.13.6-py3-none-any.whl (243 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... done
  Created wheel for fasttext: filename=fasttext-0.9.3-cp310-cp310-linux_x86_64.whl size=4296189 sha256=6f42fd1803a31d3f6bf26df31d044da62fd9c3240919d30d937bcb72a504158d
  Stored in directory: /root/.cache/pip/wheels/0d/a2/00/81db54d3e6a8199b829d58

fastText's **unsupervised** method is used to generate **word embeddings**,  
whereas its **supervised** method is used for **text classification**.

In [None]:
import fasttext

model = fasttext.train_supervised(input="ecommerce_train")


In [None]:
model.test("ecommerce_test")

(10085, 0.9669806643529995, 0.9669806643529995)

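`model.test` returns `(number of test examples, precision@1, recall@1)`, here `(10085, 0.9669806643529995, 0.9669806643529995)`. Because every example has exactly one true label and receives exactly one prediction, precision and recall are equal.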
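
To see why the two numbers match, here is a small pure-Python sketch of precision and recall over predicted label lists. It is an illustration of the definitions (correct predictions divided by the number of predictions, and by the number of true labels), not fastText's own code, and the toy labels are made up for the example.

```python
def precision_recall(true_labels, predicted_labels):
    """Both arguments are lists of label lists, one inner list per example."""
    correct = sum(len(set(t) & set(p)) for t, p in zip(true_labels, predicted_labels))
    n_predicted = sum(len(p) for p in predicted_labels)
    n_true = sum(len(t) for t in true_labels)
    return correct / n_predicted, correct / n_true


# Single label per example, one prediction each: the denominators are identical.
true = [["books"], ["electronics"], ["household"], ["books"]]
pred = [["books"], ["electronics"], ["books"], ["books"]]
print(precision_recall(true, pred))  # (0.75, 0.75)

# With two true labels but one prediction, the values diverge.
print(precision_recall([["books", "household"]], [["books"]]))  # (1.0, 0.5)
```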

In [None]:
model.predict("wintech assemble desktop pc cpu 500gb data hdd 4gb ram intel c2d processor")

(('__label__electronics',), array([0.99969673]))

In [None]:
model.predict("think and grow rich")

(('__label__books',), array([1.00000894]))
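
The inputs above are already lowercase and free of punctuation, matching the training data. For raw text, a sketch like the following applies the same cleaning before calling `predict`. `predict_category` is a hypothetical wrapper, and `FakeModel` is only a stand-in with the same return shape as the trained model so the block runs without fastText installed. Newlines are removed because `predict` handles one line at a time.

```python
import re


def preprocess(text):
    # Same cleaning as used for the training data.
    text = re.sub(r"[^\w\s'.]|(?<!\d)\.(?!\d)", " ", text)
    text = re.sub(r" +", " ", text)
    return text.strip().lower()


def predict_category(model, raw_text):
    """Clean raw_text like the training data, then return (label, score)."""
    cleaned = " ".join(preprocess(raw_text).split())  # also removes newlines and tabs
    labels, scores = model.predict(cleaned)
    # Strip the "__label__" prefix for readability.
    return labels[0].replace("__label__", "", 1), float(scores[0])


class FakeModel:
    """Stand-in for demonstration only; it always answers 'books'."""

    def predict(self, text):
        self.last_input = text
        return (("__label__books",), [0.99])


demo_model = FakeModel()
print(predict_category(demo_model, "Think and\nGrow Rich!"))  # ('books', 0.99)
print(demo_model.last_input)  # think and grow rich
```

With the real model, the call would be `predict_category(model, "Think and Grow Rich!")`.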

In [None]:
# The supervised model also learns word vectors, so we can query a word's nearest neighbors.
model.get_nearest_neighbors("cotton")

[(0.9998396039009094, '125cm'),
 (0.9996483325958252, 'mannat'),
 (0.9995973706245422, 'lustrous'),
 (0.999276340007782, 'loosing'),
 (0.9991471767425537, '21cm'),
 (0.9989436268806458, 'mart'),
 (0.9982795715332031, 'warmness'),
 (0.9982308745384216, 'paneling'),
 (0.9977583289146423, 'opal'),
 (0.9974756240844727, 'flo')]

In [None]:
model.get_nearest_neighbors("sony")

[(0.9994274973869324, 'seizes'),
 (0.9994274973869324, 'glass4'),
 (0.9994274973869324, 'continuum'),
 (0.9994274973869324, '3340'),
 (0.9994223713874817, 'mylar'),
 (0.9994188547134399, 'z500'),
 (0.9994120597839355, 'wayne'),
 (0.9994071125984192, '7601'),
 (0.9994071125984192, '0oc'),
 (0.9994071125984192, 'ralink')]
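
The first number in each pair is a similarity score between word vectors. Assuming it is a cosine similarity, the sketch below shows how such a score is computed for two plain Python vectors; the toy vectors are illustrative and are not taken from the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

With the trained model, the two vectors could come from `model.get_word_vector("cotton")` and `model.get_word_vector("lustrous")`. Scores this close to 1 for loosely related words suggest that vectors learned for classification are not a strong substitute for embeddings trained with the unsupervised method.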