
**Goal of the project**

Assigning products to the right categories is crucial for helping customers find what they are looking for. Online marketplaces therefore commonly use product classification models to ensure that products listed by third parties end up in the correct categories.

In this project, we will use a multinomial Naive Bayes model and apply Natural Language Processing (NLP) techniques to predict product categories from product titles.

In [280]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [281]:
# Load dataset
df = pd.read_csv('/content/product_comparison_platform.csv', names = ['product_id', 'product_title', 'vendor_id', 'cluster_id', 'cluster_label', 'category_id', 'category_label'])

In [282]:
# Examine the data
df.head()

| | product_id | product_title | vendor_id | cluster_id | cluster_label | category_id | category_label |
|---|---|---|---|---|---|---|---|
| 0 | 1 | apple iphone 8 plus 64gb silver | 1 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
| 1 | 2 | apple iphone 8 plus 64 gb spacegrau | 2 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
| 2 | 3 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | 3 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
| 3 | 4 | apple iphone 8 plus 64gb space grey | 4 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |
| 4 | 5 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | 5 | 1 | Apple iPhone 8 Plus 64GB | 2612 | Mobile Phones |


In [283]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   product_id      35311 non-null  int64 
 1   product_title   35311 non-null  object
 2   vendor_id       35311 non-null  int64 
 3   cluster_id      35311 non-null  int64 
 4   cluster_label   35311 non-null  object
 5   category_id     35311 non-null  int64 
 6   category_label  35311 non-null  object
dtypes: int64(4), object(3)
memory usage: 1.9+ MB


Calling value_counts() on the category_label column, which is the target we are trying to predict, shows that the dataset contains 10 classes with between 2212 and 5501 product titles each. The classes are not perfectly balanced, but every class has plenty of examples from which to learn.

In [284]:
df['category_label'].value_counts()

Fridge Freezers     5501
Mobile Phones       4081
Washing Machines    4044
CPUs                3862
Fridges             3584
TVs                 3564
Dishwashers         3424
Digital Cameras     2697
Microwaves          2342
Freezers            2212
Name: category_label, dtype: int64
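
The raw counts above can also be read as proportions, which makes the degree of imbalance easier to judge (here the largest class is roughly 2.5 times the smallest, 5501 / 2212). The snippet below is a minimal sketch of that check on a toy Series; the labels and counts in it are made up for illustration, and on the real data one would call the same methods on df['category_label'].

```python
import pandas as pd

# Toy stand-in for df['category_label']; labels and counts are illustrative only
labels = pd.Series(['CPUs'] * 5 + ['Fridges'] * 3 + ['TVs'] * 2, name='category_label')

# Relative frequency of each class, largest first
proportions = labels.value_counts(normalize=True)

# Ratio between the largest and smallest class as a quick imbalance check
imbalance_ratio = proportions.max() / proportions.min()

print(proportions)
print(f'Imbalance ratio: {imbalance_ratio:.2f}')
```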

**Split the train and test data**

In [285]:
X = df['product_title']

In [286]:
y = df['category_label']

In [287]:
from sklearn.model_selection import train_test_split

For the train-test split we pass stratify = y so that the training and test sets keep the same proportions of category labels as the full dataset.

In [288]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
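
To see what stratification does, the following sketch splits a small synthetic dataset and compares the class shares in each part. The toy data and the variable names (X_toy, y_toy and so on) are assumptions made for this example; the same comparison could be run on y_train and y_test.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 titles, 80 in class 'a' and 20 in class 'b' (illustrative only)
X_toy = pd.Series([f'item {i}' for i in range(100)])
y_toy = pd.Series(['a'] * 80 + ['b'] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both parts keep the original 80/20 class ratio
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)

print(train_share)
print(test_share)
```

Without stratify, the shares in a small test set can drift away from those of the full dataset purely by chance.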

**Fit the model**

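Now that the data have been prepared, we fit the multinomial Naive Bayes model. The pipeline first converts each product title into a vector of word counts with CountVectorizer and then passes those counts to the classifier.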

In [289]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [290]:
pipeline = Pipeline(steps = [('vect', CountVectorizer()),
                             ('clf', MultinomialNB())])

In [291]:
pipeline.fit(X_train, y_train)
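
As a rough illustration of the first pipeline step, the sketch below runs CountVectorizer with its default settings on two short titles. The second title is invented for the example. Note that the default tokenizer lowercases text and keeps only tokens of two or more characters, so a single-character token such as "8" is dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two short titles; the second one is made up for illustration
titles = ['apple iphone 8 plus 64gb silver', 'bosch fridge freezer silver']

vect = CountVectorizer()
counts = vect.fit_transform(titles)  # sparse matrix: one row per title

# Learned vocabulary, sorted alphabetically
vocab = sorted(vect.vocabulary_)

print(vocab)
print(counts.toarray())
```

Each column of the count matrix corresponds to one vocabulary word, and these counts are the features that MultinomialNB is trained on.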

**Assess performance**

Once the model has been fitted, we can make predictions on X_test and calculate the macro-averaged F1 score, which weights all ten classes equally regardless of their size. The model scores well on the test data, with a macro F1 of 0.951.

In [292]:
y_pred = pipeline.predict(X_test)

In [293]:
from sklearn.metrics import f1_score

In [294]:
f1_score(y_test, y_pred, average = 'macro')

0.9514703704135764
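
The macro average is simply the unweighted mean of the per-class F1 scores. The sketch below checks this on a tiny hand-made example; the labels are invented and unrelated to the product data.

```python
from sklearn.metrics import f1_score

# Tiny hand-made example (labels are illustrative only)
y_true_toy = ['a', 'a', 'a', 'a', 'b', 'b']
y_pred_toy = ['a', 'a', 'a', 'b', 'b', 'b']

# One F1 score per class, in the order given by labels
per_class = f1_score(y_true_toy, y_pred_toy, average=None, labels=['a', 'b'])

# Macro F1: every class counts equally, regardless of its support
macro = f1_score(y_true_toy, y_pred_toy, average='macro')

print(per_class)
print(macro, per_class.mean())
```

Because each class counts equally, a weak class pulls the macro score down even when it has few examples.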

**Examine the predictions**

To check how well the model did in more detail, we can examine the precision, recall, and F1 score for each class using a classification report.

In [295]:
from sklearn.metrics import classification_report

In [296]:
print(classification_report(y_test, y_pred))

                  precision    recall  f1-score   support

            CPUs       1.00      1.00      1.00       773
 Digital Cameras       0.99      0.99      0.99       540
     Dishwashers       0.96      0.97      0.97       685
        Freezers       0.99      0.72      0.84       442
 Fridge Freezers       0.86      0.96      0.91      1100
         Fridges       0.90      0.87      0.88       717
      Microwaves       0.99      0.97      0.98       468
   Mobile Phones       0.99      0.99      0.99       816
             TVs       0.99      0.98      0.99       713
Washing Machines       0.96      0.98      0.97       809

        accuracy                           0.95      7063
       macro avg       0.96      0.94      0.95      7063
    weighted avg       0.96      0.95      0.95      7063
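
In the report above, Freezers has the lowest recall (0.72), while Fridge Freezers has the lowest precision (0.86). One plausible reading, not verified here, is that some Freezers titles are being predicted as Fridge Freezers. A confusion matrix would confirm or rule this out. The sketch below shows the idea on a handful of invented labels; on the real data one would pass y_test and y_pred instead.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Invented labels for illustration; replace with y_test and y_pred on the real data
class_names = ['Freezers', 'Fridge Freezers', 'Fridges']
y_true_toy = ['Freezers', 'Freezers', 'Fridge Freezers', 'Fridges', 'Freezers']
y_pred_toy = ['Freezers', 'Fridge Freezers', 'Fridge Freezers', 'Fridges', 'Fridge Freezers']

# Rows are true classes, columns are predicted classes
cm = pd.DataFrame(
    confusion_matrix(y_true_toy, y_pred_toy, labels=class_names),
    index=class_names,
    columns=class_names,
)

print(cm)
```

Large off-diagonal cells show which pairs of categories the model confuses most often.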

