# **`Ecommerce`** Text Classification

**`About Dataset`**

This is a text classification dataset of e-commerce product listings in four categories: "Electronics", "Household", "Books", and "Clothing & Accessories". Together these categories cover almost 80% of a typical e-commerce website.

The dataset is a ".csv" file with two columns: the first holds the class name and the second holds the data point for that class. Each data point is the product name and description taken from the e-commerce website.

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import re
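import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # WordNet data needed by the lemmatizer (one-time download)
stem = WordNetLemmatizer()  # a lemmatizer, despite the variable name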

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

### Import Dataset

In [2]:
df = pd.read_csv("Datasets/ecommerceDataset.csv", header=None, names=['labels', 'datapoints'])
df.head()

| | labels | datapoints |
|---|---|---|
| 0 | Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... |
| 3 | Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 4 | Household | Incredible Gifts India Wooden Happy Birthday U... |


In [3]:
df.shape

(50425, 2)

### Data Cleaning

In [4]:
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)

### Data Preprocessing

In [5]:
corpus = []

for i in range(df.shape[0]):
    
    # clean the datapoint with regular expressions
    datapoint = re.sub(r'\W', " ", df['datapoints'][i]) # replace special characters with spaces
    datapoint = re.sub(r'\s+[a-zA-Z]\s+', " ", datapoint) # remove single characters between spaces
    datapoint = re.sub(r'^[a-zA-Z]\s+', " ", datapoint) # remove a single character at the beginning
    datapoint = re.sub(r'\s+', " ", datapoint) # collapse multiple spaces

    datapoint = datapoint.lower() # convert to lower case

    datapoint = datapoint.split() # split the string into words
    datapoint = [stem.lemmatize(word) for word in datapoint] # lemmatize as nouns (default POS), e.g. books -> book
    datapoint = " ".join(datapoint) # join the words back into a string

    corpus.append(datapoint)
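
The same cleaning steps are needed again later when a new text is classified, so they could be collected in one helper. The following is a minimal sketch, not part of the original notebook: the name `clean_text` and the optional `lemmatize` hook are assumptions for illustration, and the third substitution assumes the intent is to drop a single leading letter, so it anchors on `^`.

```python
import re

def clean_text(text, lemmatize=None):
    """Normalize one product description with the steps used in the loop.

    `lemmatize` is a hypothetical optional callable (for example
    `stem.lemmatize`), so the function also runs without NLTK data.
    """
    text = re.sub(r"\W", " ", text)              # special characters -> spaces
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # single letters between spaces
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)    # single letter at the start
    text = re.sub(r"\s+", " ", text)             # collapse repeated whitespace
    words = text.lower().split()
    if lemmatize is not None:
        words = [lemmatize(word) for word in words]
    return " ".join(words)

sample = "A SAF 'Floral' Framed Painting (Wood, 30 inch x 10 inch)"
print(clean_text(sample))
```

With such a helper, the corpus could be built as `corpus = [clean_text(t, stem.lemmatize) for t in df['datapoints']]`.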

### Split the Data

In [6]:
x = corpus
y = df.labels

In [7]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
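
The split above samples rows at random, so the class proportions in the test set can drift from those of the full dataset. One option, not used in the original notebook, is the `stratify` argument of `train_test_split`. A minimal sketch on made-up data (the toy `texts` and `labels` lists are assumptions for illustration):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy, deliberately imbalanced data (hypothetical, not taken from the dataset).
texts = [f"product {i}" for i in range(20)]
labels = ["Household"] * 16 + ["Books"] * 4

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# With stratify, both parts keep the 4:1 class ratio of the full list.
print(Counter(y_tr))
print(Counter(y_te))
```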

### Feature Extraction using **`CV / Tf-Idf`** model

In [8]:
cv = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words='english')

tfidf = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words='english')

In [9]:
cv_train = cv.fit_transform(x_train)
cv_test = cv.transform(x_test)

tfidf_train = tfidf.fit_transform(x_train)
tfidf_test = tfidf.transform(x_test)
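
To make the difference between the two feature sets concrete, here is a small pure-Python sketch of count features versus Tf-Idf weights. It is an illustration only, on a made-up three-document corpus; the idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization is what scikit-learn documents for `TfidfVectorizer` with its default settings, and the helper names are assumptions.

```python
import math
from collections import Counter

# Tiny made-up corpus (hypothetical, not from the dataset).
docs = [
    "wooden wall painting",
    "wall clock wooden frame",
    "history book author",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each word.
doc_freq = Counter(word for doc in tokenized for word in set(doc))

def count_vector(tokens):
    """Raw term counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def tfidf_vector(tokens):
    """Term count times smoothed idf, scaled to unit L2 length."""
    counts = Counter(tokens)
    weights = [
        counts[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
        for w in vocab
    ]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

print(vocab)
print(count_vector(tokenized[0]))
print([round(x, 3) for x in tfidf_vector(tokenized[0])])
```

In the first document every word has a count of 1, but "painting" gets a larger Tf-Idf weight than "wall" or "wooden" because it appears in fewer documents.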

### **`Naive Bayes Classifier`**

In [10]:
model = MultinomialNB()

In [11]:
model.fit(cv_train, y_train)
predict_cv = model.predict(cv_test)

In [12]:
accuracy_score(y_test, predict_cv)

0.913534952900347

In [13]:
model.fit(tfidf_train, y_train)
predict_tfidf = model.predict(tfidf_test)

In [14]:
accuracy_score(y_test, predict_tfidf)

0.9185919682697075
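
A single accuracy figure does not show which categories are confused with each other. The sketch below illustrates, on made-up labels, what accuracy measures and how a per-class recall could be computed next to it; the helper names and the toy label lists are assumptions. In practice `sklearn.metrics.classification_report(y_test, predict_tfidf)` gives a similar breakdown.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Toy labels (hypothetical, not the notebook's predictions).
toy_true = ["Books", "Books", "Household", "Electronics",
            "Household", "Clothing & Accessories"]
toy_pred = ["Books", "Household", "Household", "Electronics",
            "Household", "Clothing & Accessories"]

print(accuracy(toy_true, toy_pred))
print(per_class_recall(toy_true, toy_pred))
```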

### Predict the Label

In [15]:
datapoints = "Inquilab: Bhagat Singh on Religion & Revolution About the Author S Irfan Habib is an Indian historian of science, a widely published author, and a public intellectual. He was the Abul Kalam Azad Chair at the National Institute of Educational Planning and Administration (NIEPA), New Delhi. Before joining NIEPA, he was a scientist at the National Institute of Science, Technology and Development Studies (NISTADS), New Delhi."

In [16]:
datapoint = re.sub(r'\W', " ", datapoints) # replace special characters with spaces
datapoint = re.sub(r'\s+[a-zA-Z]\s+', " ", datapoint) # remove single characters between spaces
datapoint = re.sub(r'^[a-zA-Z]\s+', " ", datapoint) # remove a single character at the beginning
datapoint = re.sub(r'\s+', " ", datapoint) # collapse multiple spaces

datapoint = datapoint.lower() # convert to lower case

datapoint = datapoint.split() # split the string into words
datapoint = [stem.lemmatize(word) for word in datapoint] # lemmatize as nouns (default POS), e.g. books -> book
datapoint = " ".join(datapoint) # join the words back into a string

In [17]:
datapoint = tfidf.transform([datapoint])

In [18]:
predict = model.predict(datapoint)
predict

array(['Books'], dtype='<U22')
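
The vectorizer and the classifier can also be bundled so that new text goes through the same steps in one call. The sketch below uses scikit-learn's `Pipeline` on a tiny made-up corpus; the training sentences and the `classifier` name are illustrative assumptions, and the regex cleaning and lemmatization from the notebook are left out for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set (hypothetical, not from the dataset).
train_texts = [
    "wooden framed wall painting for living room",
    "kitchen storage container set with lid",
    "novel by an award winning author",
    "history book about the indian revolution",
]
train_labels = ["Household", "Household", "Books", "Books"]

# Vectorizer and Naive Bayes model chained into a single estimator.
classifier = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])
classifier.fit(train_texts, train_labels)

# Raw text in, label out: the pipeline applies transform() before predict().
prediction = classifier.predict(
    ["a book by a widely published author and historian"]
)
print(prediction[0])
```

Keeping both steps in one object avoids applying a classifier to features from a vectorizer it was not trained with.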