## 📦 E-commerce Text Classification Dataset

**Dataset Source**: [Kaggle - Ecommerce Text Classification](https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification)

We are working with a dataset containing product descriptions from an e-commerce platform.  
The goal is to classify each product into one of the following four categories:

- 🏠 **Household**
- 💻 **Electronics**
- 👗 **Clothing & Accessories**
- 📚 **Books**

### 🔍 Task Objective

Given a product description, the task is to **predict the correct category** from the four listed above using text classification techniques.


In [7]:
import pandas as pd

df = pd.read_csv(r"C:\Codes\NLP\datasets\ecommerceDataset.csv", names=["category", "description"], header=None)
print(df.shape)

(50425, 2)


In [8]:
# Drop NA values
df.dropna(inplace=True)
df.shape

(50424, 2)

In [9]:
df.category.unique()

array(['Household', 'Books', 'Clothing & Accessories', 'Electronics'],
      dtype=object)

In [10]:
# Assign the result back to the column; the chained `inplace=True` form
# triggers a pandas FutureWarning and will stop working in pandas 3.0.
df["category"] = df["category"].replace("Clothing & Accessories", "Clothing_Accessories")
df.category.unique()


array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

fastText expects every label in the training file to carry a prefix, `__label__` by default. We first add this prefix to the `category` column; a later step combines the prefixed label with the product description in a third column.

In [11]:
df['category'] = '__label__' + df['category']
df.head(3)

| | category | description |
|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... |


Create a new column that merges the two existing columns into a single fastText training line.

In [12]:
df['category_description'] = df['category'] + ' ' + df['description']
df.head(3)

| | category | description | category_description |
|---|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... | __label__Household Paper Plane Design Framed W... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | __label__Household SAF 'Floral' Framed Paintin... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... | __label__Household SAF 'UV Textured Modern Art... |
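The line format can also be sketched without pandas. The helper name `to_fasttext_line` is hypothetical and only for illustration. The sketch assumes fastText's default `__label__` prefix and that a label must be a single whitespace-free token, which is why "Clothing & Accessories" was renamed earlier.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix


def to_fasttext_line(category: str, description: str) -> str:
    """Build one training line: '<prefix><label> <text>' (hypothetical helper)."""
    # A label has to be one token, so spaces and ' & ' become underscores.
    label = category.replace(" & ", "_").replace(" ", "_")
    return f"{LABEL_PREFIX}{label} {description}"


line = to_fasttext_line("Clothing & Accessories", "men's cotton t shirt")
print(line)  # __label__Clothing_Accessories men's cotton t shirt
```

The first whitespace-separated token of each line is the label; everything after it is the text to classify.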


**Pre-processing**

- Replace punctuation (except apostrophes) with spaces
- Collapse repeated spaces into one
- Convert the entire sentence to lower case

In [14]:
import re

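def preprocess(text):
    # Replace everything except word characters, whitespace, and apostrophes with a space
    text = re.sub(r'[^\w\s\']',' ', text)
    # Collapse runs of spaces into a single space
    text = re.sub(' +', ' ', text)
    return text.strip().lower()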
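To see what the function does to a single line, here is a small standalone check. The sample string is made up for illustration; the function body is the same as in the cell above.

```python
import re


def preprocess(text):
    text = re.sub(r'[^\w\s\']', ' ', text)  # punctuation (except apostrophes) -> space
    text = re.sub(' +', ' ', text)          # collapse repeated spaces
    return text.strip().lower()


sample = "__label__Household SAF 'Floral' Framed Painting (Wood, 30 inch x 10 inch)"
cleaned = preprocess(sample)
print(cleaned)
# __label__household saf 'floral' framed painting wood 30 inch x 10 inch
```

Underscores count as word characters, so the `__label__` prefix survives. The label itself is lowercased along with the rest of the line.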

In [15]:
df['category_description'] = df['category_description'].map(preprocess)
df.head()

| | category | description | category_description |
|---|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... | __label__household paper plane design framed w... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | __label__household saf 'floral' framed paintin... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... | __label__household saf 'uv textured modern art... |
| 3 | __label__Household | SAF Flower Print Framed Painting (Synthetic, 1... | __label__household saf flower print framed pai... |
| 4 | __label__Household | Incredible Gifts India Wooden Happy Birthday U... | __label__household incredible gifts india wood... |


In [16]:
# Train Test split
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

In [17]:
train.to_csv("ecommerce.train", columns=["category_description"], index=False, header=False)
test.to_csv("ecommerce.test", columns=["category_description"], index=False, header=False)
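Because `to_csv` is a CSV writer, it may wrap a field in quotes when the field contains special characters such as a line break, and fastText would then no longer see `__label__` at the start of that line. A quick sanity check could look like the sketch below. The helper `count_bad_lines` and the demo file are hypothetical; in the notebook you would pass `"ecommerce.train"` instead.

```python
import os
import tempfile


def count_bad_lines(path: str, prefix: str = "__label__") -> int:
    """Count lines that do not start with the label prefix (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        return sum(1 for line in fh if not line.startswith(prefix))


# Demo on a throwaway file with one good line and one line with a stray quote.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.train")
    with open(demo, "w", encoding="utf-8") as fh:
        fh.write("__label__books a tale of two cities\n")
        fh.write('"__label__household wall hanging\n')
    bad = count_bad_lines(demo)

print(bad)  # 1
```

A result of 0 on the real files suggests every line is in the expected format.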

In [21]:
# Train the model and evaluate the performance
import fasttext

model = fasttext.train_supervised(input="ecommerce.train")
model.test("ecommerce.test")

(10085, 0.9676747645017353, 0.9676747645017353)
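The tuple returned by `model.test` is (number of test samples, precision at 1, recall at 1). With exactly one label per product and one prediction per product, the two scores are equal, which is why the same number appears twice. The short sketch below unpacks the values printed above; the variable names are our own.

```python
# Values copied from the output above.
n_samples, precision_at_1, recall_at_1 = (10085, 0.9676747645017353, 0.9676747645017353)

# Number of test descriptions whose top prediction matched the true category.
n_correct = round(n_samples * precision_at_1)

print(n_samples, n_correct)     # 10085 9759
print(f"{precision_at_1:.2%}")  # 96.77%
```

The 10085 samples correspond to the 20% test split of the 50424 rows.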

**Prediction**

In [25]:
model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")

ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

In [24]:
model.predict("ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric")

ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.
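Both errors come from the NumPy 2.0 change described in the message: `np.array(obj, copy=False)` now raises when a copy cannot be avoided. It appears that the installed `fasttext` release makes such a call inside `predict`, so the model itself is probably fine. The sketch below reproduces the behaviour with NumPy alone; the `probs` tuple is a made-up stand-in for fastText's probabilities.

```python
import numpy as np

probs = (0.98, 0.02)  # hypothetical probabilities, not real model output

try:
    # Works on NumPy 1.x, raises ValueError on NumPy 2.x for a tuple input.
    arr = np.array(probs, copy=False)
except ValueError:
    # The replacement suggested by the error message; copies only when needed.
    arr = np.asarray(probs)

print(arr.shape)  # (2,)
```

A common workaround, assuming nothing else in the environment needs NumPy 2, is to install a NumPy version below 2.0 and rerun the prediction cells.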

In [23]:
model.get_nearest_neighbors("painting")

[(0.9976692199707031, 'temperature'),
 (0.996240496635437, 'guard'),
 (0.9961748719215393, 'alarm'),
 (0.9960747361183167, 'handmade'),
 (0.9956462979316711, 'lint'),
 (0.995530903339386, 'primer'),
 (0.9951908588409424, 'vacuum'),
 (0.9948580265045166, 'artificial'),
 (0.9947441220283508, 'pipe'),
 (0.9944864511489868, 'steam')]
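Each pair above is (score, word). To our understanding the score is the cosine similarity between the word vectors, so values close to 1.0 mean the vectors point in nearly the same direction. A minimal sketch of that measure, using made-up three-dimensional vectors in place of real fastText vectors:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u, so similarity is close to 1
w = [3.0, 0.0, -1.0]  # orthogonal to u, so similarity is 0

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

All ten neighbours listed for "painting" score above 0.99, even though most of the words are unrelated to it.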