# README

This project classifies e-commerce product descriptions into four categories: Household, Books, Clothing & Accessories, and Electronics. The dataset is available at https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification?select=ecommerceDataset.csv. Several text classification approaches are applied so that their accuracy can be compared.

# Analyze data

In [27]:
# import modules
import pandas as pd

# sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split 
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# spacy
import spacy
nlp = spacy.load("en_core_web_sm")

In [2]:
# import csv file
df = pd.read_csv('ecommerceDataset.csv', header=None)
df.rename(columns = {0 : 'Label', 1: 'Text'}, inplace=True)

In [3]:
# list the unique labels
df['Label'].unique()

array(['Household', 'Books', 'Clothing & Accessories', 'Electronics'],
      dtype=object)
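
Beyond listing the labels, it can help to check how many examples fall under each one. The following is a minimal sketch on a small hypothetical DataFrame; the rows in `toy_df` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# hypothetical toy data with the same two columns as the notebook's df
toy_df = pd.DataFrame({
    "Label": ["Household", "Books", "Household", "Electronics"],
    "Text": ["wall art", "a novel", "sofa cover", "usb cable"],
})

# count how many rows carry each label, most frequent first
label_counts = toy_df["Label"].value_counts()
print(label_counts)
```

Applied to the real data, the same call would be `df['Label'].value_counts()`.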

In [4]:
# print the first example
print(f"Label: {df['Label'][0]}")
print(f"Text: {df['Text'][0]}")

Label: Household
Text: Paper Plane Design Framed Wall Hanging Motivational Office Decor Art Prints (8.7 X 8.7 inch) - Set of 4 Painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it. This is an special series of paintings which makes your wall very beautiful and gives a royal touch. This painting is ready to hang, you would be proud to possess this unique painting that is a niche apart. We use only the most modern and efficient printing technology on our prints, with only the and inks and precision epson, roland and hp printers. This innovative hd printing technique results in durable and spectacular looking prints of the highest that last a lifetime. We print solely with top-notch 100% inks, to achieve brilliant and true colours. Due to their high level of uv resistance, our prints retain their beautiful colours for many years. Add colour and style to your living space with this digitally printed painting. Some are for pleasure and

In [5]:
# detect Null values
print(f"Number of Null values in the text is {df['Text'].isnull().sum()}")

# drop Null values
df.dropna(inplace=True)

# confirm that no Null values remain
print(f"Number of Null values in the text is {df['Text'].isnull().sum()}")

Number of Null values in the text is 1
Number of Null values in the text is 0


# Bag of words with SVM

In [6]:
# encode the 4 label categories as integers 0-3
LE = LabelEncoder()
y = LE.fit_transform(df['Label'])
print(y.shape)
print(y[0])

(50424,)
3
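
`LabelEncoder` assigns integers in sorted label order, which is why the first example (Household) is encoded as 3. The sketch below shows how the integers can be mapped back to label names; it uses only the four label strings printed earlier, and the variable names (`labels`, `encoder`) are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Household", "Books", "Clothing & Accessories", "Electronics"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted; position i holds the label for integer i
print(list(encoder.classes_))
# -> ['Books', 'Clothing & Accessories', 'Electronics', 'Household']

# map the integers back to label names
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```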


In [None]:
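from sklearn.preprocessing import OneHotEncoder

# optional: one-hot encode the labels (the SVM cells below use y directly)
enc = OneHotEncoder()
enc.fit(y.reshape(-1, 1))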

In [7]:
%%time
# Create a CountVectorizer object
vectorizer = CountVectorizer(lowercase=True, stop_words='english', max_df=0.5)
# vectorizer = CountVectorizer()

X = vectorizer.fit_transform(df['Text'])

# dimensions of X matrix - bag of words
print(X.shape)

# number of feature names (vocabulary size)
names = vectorizer.get_feature_names_out()
print(len(names))

(50424, 78571)
78571
CPU times: total: 516 ms
Wall time: 2.71 s


In [8]:
# divide into test and train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(40339, 78571)
(40339,)
(10085, 78571)
(10085,)
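
The split above is purely random. If the class proportions should be preserved in both sets, `train_test_split` also accepts a `stratify` argument. This is a sketch of that variant on hypothetical toy data, not the split used in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# hypothetical imbalanced toy data: 15 samples of class 0, 5 of class 1
toy_X = [[i] for i in range(20)]
toy_y = [0] * 15 + [1] * 5

# stratify keeps the 75/25 class ratio in both the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1, stratify=toy_y
)
print(Counter(y_tr), Counter(y_te))
```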


In [9]:
%%time
# apply SVM to text classification
from sklearn import svm
clf = svm.SVC()
clf.fit(X_train, y_train)

CPU times: total: 2min 21s
Wall time: 6min 5s


In [10]:
y_pred = clf.predict(X_test)

In [11]:
# classification report
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94      2327
           1       0.97      0.97      0.97      1702
           2       0.98      0.93      0.95      2119
           3       0.96      0.96      0.96      3937

    accuracy                           0.96     10085
   macro avg       0.96      0.96      0.96     10085
weighted avg       0.96      0.96      0.96     10085

Accuracy: 0.9565691621219633


# Bag of words + Lemmatization with SVM

In [38]:
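# create function to apply lemmatization and remove stop words and punctuation
def lemmatize_stop_punct(text):
    """Return the lowercased lemmas of text, skipping stop words and punctuation."""
    lemmas = [
        token.lemma_
        for token in nlp(text)
        if not token.is_stop and not token.is_punct
    ]
    # join with single spaces (no trailing space)
    return " ".join(lemmas).lower()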

In [41]:
df['Text'][:10]

0    Paper Plane Design Framed Wall Hanging Motivat...
1    SAF 'Floral' Framed Painting (Wood, 30 inch x ...
2    SAF 'UV Textured Modern Art Print Framed' Pain...
3    SAF Flower Print Framed Painting (Synthetic, 1...
4    Incredible Gifts India Wooden Happy Birthday U...
5    Pitaara Box Romantic Venice Canvas Painting 6m...
6    Paper Plane Design Starry Night Vangoh Wall Ar...
7    Pitaara Box Romantic Venice Canvas Painting 6m...
8    SAF 'Ganesh Modern Art Print' Painting (Synthe...
9    Paintings Villa UV Textured Modern Art Print F...
Name: Text, dtype: object

In [46]:
%%time
# apply function (the output below has Length: 100, so it appears to come from a 100-row sample)
df['Text_lemma'] = df['Text'].apply(lemmatize_stop_punct)
df['Text_lemma']

CPU times: total: 406 ms
Wall time: 1.28 s


0     paper plane design framed wall hanging motivat...
1     saf floral framed painting wood 30 inch x 10 i...
2     saf uv texture modern art print framed paintin...
3     saf flower print framed painting synthetic 13....
4     incredible gifts india wooden happy birthday u...
                            ...                        
95    embroiderymaterial aari embroidery needles bea...
96    imported universal quilting embroidery presser...
97    wooden embroidery hoop frame crafters designer...
98    sewn golden colour seed bead embroidery superi...
99    icraft rn281 cotton embroidery thread set mult...
Name: Text_lemma, Length: 100, dtype: object

In [48]:
# rough estimate, in seconds, of lemmatizing about 60,000 rows at roughly 1.2 s per 100 rows
60000/100 * 1.2

720.0
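
The arithmetic above appears to extrapolate the lemmatization time linearly from a 100-row run to the full dataset. Written as a small helper, under that assumption (the function name `estimate_seconds` is illustrative):

```python
def estimate_seconds(total_rows, sample_rows, sample_seconds):
    """Linearly extrapolate a timing measured on a sample to more rows."""
    return total_rows / sample_rows * sample_seconds

# figures from the cell above: about 1.2 s per 100 rows, scaled to 60,000 rows
estimate = estimate_seconds(60000, 100, 1.2)
print(f"{estimate:.0f} s, about {estimate / 60:.0f} min")
# -> 720 s, about 12 min
```

Linear scaling is only a rough guide, since per-row time depends on text length.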

In [None]:
df['Text_lemma'].to_csv('saved_data.csv')