<a href="https://colab.research.google.com/github/Arman001/nlp-projects/blob/main/Text_Classification/Ecommerce_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ecommerce Classification
**By Muhammad Saad**
22 Jan 2025
***
This project demonstrates a complete NLP pipeline for classifying short e-commerce product descriptions into four categories: *Household, Books, Electronics, and Clothing & Accessories*. It covers:

- Loading and cleaning a large CSV dataset (50k+ rows)
- Handling missing values and class imbalance
- Text preprocessing using spaCy (lemmatization, stopword and punctuation removal)
- TF-IDF vectorization of cleaned text
- SMOTE for oversampling underrepresented classes
- Training and evaluating two models:
  - **Multinomial Naive Bayes** (weighted F1-score ~94%)
  - **K-Nearest Neighbors** (weighted F1-score ~86%)
- Evaluation using precision, recall, F1-score, and accuracy

📊 **Best Model (Naive Bayes) Accuracy:** 94% on test set

Tech stack: Python, scikit-learn, spaCy, imbalanced-learn, pandas, tqdm

## Importing Libraries

In [None]:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
import spacy
from tqdm import tqdm
from sklearn.naive_bayes import MultinomialNB


## Importing Dataset

In [None]:
dataset = pd.read_csv("./datasets/ecommerce-text-classification/ecommerceDataset.csv", header=None)
dataset.columns = ['label', 'text']

In [None]:
dataset.head()

Unnamed: 0,label,text
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,Household,Incredible Gifts India Wooden Happy Birthday U...


In [None]:
print(dataset.count())

label    50425
text     50424
dtype: int64


In [None]:
dataset['label_num'] = dataset.label.map({
  'Household':0,
  'Books' : 1,
  'Electronics':2,
  'Clothing & Accessories':3
})
dataset.head()

Unnamed: 0,label,text,label_num
0,Household,Paper Plane Design Framed Wall Hanging Motivat...,0
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",0
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...,0
3,Household,"SAF Flower Print Framed Painting (Synthetic, 1...",0
4,Household,Incredible Gifts India Wooden Happy Birthday U...,0
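
The mapping above goes from category name to number only. A minimal sketch, assuming the same four labels, of keeping it in a named dict so that numeric predictions can be translated back into names later. `LABEL_TO_ID`, `ID_TO_LABEL` and `TARGET_NAMES` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical names; the label/id pairs are the ones used in the notebook.
LABEL_TO_ID = {
    "Household": 0,
    "Books": 1,
    "Electronics": 2,
    "Clothing & Accessories": 3,
}

# Invert the mapping so numeric predictions can be reported by name.
ID_TO_LABEL = {idx: label for label, idx in LABEL_TO_ID.items()}

# Category names ordered by id, e.g. for classification_report(target_names=...).
TARGET_NAMES = [ID_TO_LABEL[i] for i in sorted(ID_TO_LABEL)]
print(TARGET_NAMES)
```

Passing a list like `TARGET_NAMES` to `classification_report` via its `target_names` parameter would label the report rows with category names instead of 0 to 3.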


## Data Preprocessing (Phase 1)

### Handling Missing Values

In [None]:
print("Number of Missing Values:\n")
print(dataset.isnull().sum())
missing_rows = dataset[dataset.isnull().any(axis=1)]
print(missing_rows)

Number of Missing Values:

label        0
text         1
label_num    0
dtype: int64
                        label text  label_num
39330  Clothing & Accessories  NaN          3


In [None]:
# Drop the single row with missing text (1 of 50,425 rows)
dataset = dataset.drop(missing_rows.index)
print("Number of Missing Values:\n")
print(dataset.isnull().sum())

Number of Missing Values:

label        0
text         0
label_num    0
dtype: int64
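
An equivalent way to remove rows with missing text is `DataFrame.dropna` with a `subset`. The sketch below runs on a toy frame with made-up values rather than the real dataset; resetting the index afterwards is optional and is not done in the notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "label": ["Books", "Clothing & Accessories", "Household"],
    "text": ["A paperback novel", None, "Framed wall painting"],
})

# Drop rows whose 'text' is missing and renumber the index.
cleaned = toy.dropna(subset=["text"]).reset_index(drop=True)
print(cleaned)
```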


### Data Balance Check

In [None]:
label_distribution = dataset['label'].value_counts()
print("Label Distribution:")
print(label_distribution)

Label Distribution:
label
Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8670
Name: count, dtype: int64
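
To put a number on the imbalance, the counts printed above can be turned into class shares and a largest-to-smallest ratio. This is a small sketch using only those printed counts; the variable names are illustrative.

```python
# Class counts copied from the label distribution printed above.
counts = {
    "Household": 19313,
    "Books": 11820,
    "Electronics": 10621,
    "Clothing & Accessories": 8670,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label:<24} {share:.1%}")
print(f"Largest / smallest class: {imbalance_ratio:.2f}")
```

The largest class is a little more than twice the size of the smallest, a moderate imbalance that motivates the oversampling step later in the notebook.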


### Text Preprocessing

In [None]:
# Load the small English model; NER and the dependency parser are disabled to speed up processing
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

In [None]:
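def preprocess_with_spacy_pipe(df, column):
    """Lemmatize df[column] with spaCy, dropping stop words and punctuation."""
    processed_texts = []

    # nlp.pipe streams the texts in batches, which is faster than calling nlp() per row
    for doc in tqdm(nlp.pipe(df[column], batch_size=900), total=len(df), desc="Preprocessing"):
        # Keep the lemma of every token that is neither a stop word nor punctuation
        tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
        processed_texts.append(" ".join(tokens))

    return processed_texts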


In [None]:
dataset['preprocessed_text'] = preprocess_with_spacy_pipe(dataset, 'text')


Preprocessing: 100%|██████████| 50424/50424 [10:04<00:00, 83.35it/s]  


In [None]:
dataset.head()

Unnamed: 0,label,text,label_num,preprocessed_text
0,Household,Paper Plane Design Framed Wall Hanging Motivat...,0,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",0,SAF Floral Framed Painting Wood 30 inch x 10 i...
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...,0,SAF UV Textured Modern Art Print Framed Painti...
3,Household,"SAF Flower Print Framed Painting (Synthetic, 1...",0,SAF Flower Print Framed Painting Synthetic 13....
4,Household,Incredible Gifts India Wooden Happy Birthday U...,0,Incredible Gifts India Wooden Happy Birthday U...


Applying SMOTE before the train/test split would leak synthetic samples into the test set, so the evaluation would no longer reflect real data only. We therefore split the data first, then fit TF-IDF on the training set and apply SMOTE to the training set only.

## Train Test Split

In [None]:
X = dataset['preprocessed_text']
y = dataset['label_num']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2025, stratify=y
)
print("Before SMOTE:")
print("Train class distribution:", y_train.value_counts())

Before SMOTE:
Train class distribution: label_num
0    15450
1     9456
2     8497
3     6936
Name: count, dtype: int64
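
Because the split uses `stratify=y`, each class should keep nearly the same share in the training set as in the full dataset. A quick check using only the counts printed in this notebook (variable names are illustrative):

```python
# Full-dataset and training-set class counts, copied from the outputs above.
full_counts = {0: 19313, 1: 11820, 2: 10621, 3: 8670}
train_counts = {0: 15450, 1: 9456, 2: 8497, 3: 6936}

full_total = sum(full_counts.values())
train_total = sum(train_counts.values())

# With stratify=y each class should keep (almost) the same share after the split.
max_share_gap = max(
    abs(full_counts[c] / full_total - train_counts[c] / train_total)
    for c in full_counts
)
print(f"Train rows: {train_total}, largest share difference: {max_share_gap:.6f}")
```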


In [None]:
print(X_train.shape)

(40339,)


## Data Preprocessing (Phase 2)

### Applying TF-IDF

In [None]:
# Convert text to TF-IDF features: fit the vocabulary on the training set only,
# then reuse it to transform the test set
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
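
The fit-on-train, transform-on-test pattern can be seen on a toy corpus. The sketch below uses invented product phrases, not rows from the dataset. Words that appear only in the test text are not in the learned vocabulary and are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents for illustration only.
train_docs = ["red cotton shirt", "wireless bluetooth speaker", "cotton bed sheet"]
test_docs = ["cotton speaker drone"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_docs)        # reuses them, learns nothing new

print(train_matrix.shape, test_matrix.shape)
print("drone" in vectorizer.vocabulary_)  # unseen test word is not in the vocabulary
```

Both matrices have the same number of columns, which is what allows a model trained on one to predict on the other.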

### Applying SMOTE for Oversampling

In [None]:
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_tfidf, y_train)

In [None]:
print("After SMOTE:")
print("Train class distribution:", y_train_balanced.value_counts())


After SMOTE:
Train class distribution: label_num
3    15450
0    15450
2    15450
1    15450
Name: count, dtype: int64
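
SMOTE creates each synthetic row by interpolating between a minority-class sample and one of its nearest same-class neighbours. The sketch below illustrates only that interpolation step with NumPy on made-up 2-D points; it is a simplification for intuition, not imbalanced-learn's implementation, and `interpolate` is a hypothetical helper. It also works out how many synthetic rows the resampling above must have added, using the printed counts.

```python
import numpy as np

def interpolate(sample, neighbor, rng):
    """Return a random point on the segment between two same-class samples."""
    gap = rng.uniform(0.0, 1.0)
    return sample + gap * (neighbor - sample)

rng = np.random.default_rng(0)
sample = np.array([0.0, 1.0])     # made-up minority sample
neighbor = np.array([1.0, 3.0])   # made-up nearest neighbour of the same class
synthetic = interpolate(sample, neighbor, rng)

# Synthetic rows added per class = majority count minus the class's own count.
train_counts = {0: 15450, 1: 9456, 2: 8497, 3: 6936}
target = max(train_counts.values())
added = {c: target - n for c, n in train_counts.items()}
total_added = sum(added.values())
print(synthetic, added, total_added)
```

Roughly a third of the balanced training set is therefore synthetic, which is why the untouched test set is the only fair basis for evaluation.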


## Training the Model

### K-Neighbors Classifier

In [None]:
knn_model = KNeighborsClassifier()

In [None]:
knn_model.fit(X_train_balanced, y_train_balanced)
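
The classifier above is created without arguments, so it runs with scikit-learn's defaults. A short sketch for inspecting the settings that matter most for k-NN:

```python
from sklearn.neighbors import KNeighborsClassifier

# Inspect the default hyperparameters of an unconfigured classifier.
params = KNeighborsClassifier().get_params()
print(params["n_neighbors"], params["weights"], params["metric"], params["p"])
```

With these defaults each prediction is a uniform vote among the 5 nearest training rows under Euclidean distance (Minkowski with `p=2`). Tuning `n_neighbors` would be a natural next experiment, though it is not explored in this notebook.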

### Naive Bayes Classifier

In [None]:
nb_model = MultinomialNB()
nb_model.fit(X_train_balanced, y_train_balanced)
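
To see how Multinomial Naive Bayes uses TF-IDF features, here is a toy end-to-end sketch. The four training phrases are invented and SMOTE is omitted because the toy classes are already balanced. The labels reuse the notebook's ids (2 for Electronics, 3 for Clothing & Accessories).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented examples; labels follow the notebook's mapping (2 = Electronics, 3 = Clothing).
train_docs = [
    "cotton shirt for men",
    "denim jeans for women",
    "bluetooth wireless speaker",
    "usb charging cable",
]
train_labels = [3, 3, 2, 2]

vectorizer = TfidfVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(train_docs), train_labels)

# Each new text is assigned to the class whose training vocabulary it matches best.
predictions = model.predict(vectorizer.transform(["cotton jeans", "usb speaker"]))
print(predictions)
```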

## Testing the Model

### Naive Bayes Classifier

In [None]:
y_pred_nb = nb_model.predict(X_test_tfidf)

In [None]:
print(classification_report(y_test, y_pred_nb))

              precision    recall  f1-score   support

           0       0.94      0.94      0.94      3863
           1       0.97      0.91      0.94      2364
           2       0.91      0.93      0.92      2124
           3       0.94      0.98      0.96      1734

    accuracy                           0.94     10085
   macro avg       0.94      0.94      0.94     10085
weighted avg       0.94      0.94      0.94     10085
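
The `macro avg` and `weighted avg` rows differ only in how the per-class scores are combined: the macro average treats every class equally, while the weighted average weights each class by its support. A small sketch recomputing both F1 averages from the rounded values in the report above (so the results are approximate):

```python
# Per-class F1 and support copied from the Naive Bayes report above.
f1 = {0: 0.94, 1: 0.94, 2: 0.92, 3: 0.96}
support = {0: 3863, 1: 2364, 2: 2124, 3: 1734}

macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())
print(f"macro avg F1: {macro_f1:.2f}, weighted avg F1: {weighted_f1:.2f}")
```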



### K-Neighbors Classifier

In [None]:
y_pred_knn = knn_model.predict(X_test_tfidf)

In [None]:
print(classification_report(y_test, y_pred_knn))

              precision    recall  f1-score   support

           0       0.99      0.72      0.83      3863
           1       0.70      0.98      0.82      2364
           2       0.84      0.97      0.90      2124
           3       0.99      0.87      0.92      1734

    accuracy                           0.86     10085
   macro avg       0.88      0.88      0.87     10085
weighted avg       0.89      0.86      0.86     10085
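
The k-NN report shows a clear precision/recall trade-off: class 0 (Household) has high precision but low recall, and class 1 (Books) has the reverse. F1 is the harmonic mean of the two, so it penalises whichever is lower. A sketch recomputing F1 for those two classes from the rounded precision and recall values in the report (`f1_score_from` is a hypothetical helper):

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the k-NN report above.
household_f1 = f1_score_from(0.99, 0.72)
books_f1 = f1_score_from(0.70, 0.98)
print(f"Household F1: {household_f1:.2f}, Books F1: {books_f1:.2f}")
```

A plausible reading of these numbers is that many Household items are being predicted as Books, which would lower Household recall and Books precision together. A confusion matrix would be needed to confirm this.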

