# **Machine Learning Models for Product Categorization**

In this notebook, several famous Machine Learning models are trained on the unbalanced dataset, balanced dataset using oversampling and balanced dataset using undersampling in order to check which gives the best accuracy. These models are then then evaluated with the help of Classification Report, Confusion Matrix, ROC Curves, Accuracy Score, etc.

The ML Models are trained for all the 3 datasets in the following order:


1.   Imbalanced Dataset
2.   Balanced Dataset (using Oversampling)
3.   Balanced Dataset (using Undersampling)



### ***Machine Learning Models Used:***
* Logistic Regression (both Binary and Multiclass Variants)
* Multinomial Naive Bayes
* Linear Support Vector Machine
* Decision Trees Classifier
* Random Forest Classifier
* K Nearest Neighbours







## **Importing the required libraries**

In [None]:
#importing the libraries for matrix and dataframe handling, plotting, etc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# importing the other miscellaneous libraries used
import re
import warnings
warnings.filterwarnings("ignore")

# importing the NLTK related libraries and functions along with evaluation metircs
import nltk
import string
nltk.download("all")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import roc_curve, auc, roc_auc_score

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_rus to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |  

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive




## PART 1: Loading and Preparing the Imbalanced Dataset

In this section, we will load the dataset, which was saved as a CSV file from a previous notebook. This dataset includes both noise and products belonging to the 13 primary categories. The Product Descriptions in this dataset have already undergone extensive preprocessing in the previous notebook, including steps such as:

1. **Lowercasing**: Converting all text to lowercase to ensure uniformity.
2. **Stopword Removal**: Eliminating common words that do not contribute significant meaning (e.g., 'the', 'and', 'in').
3. **Tokenization**: Breaking down the text into individual words or tokens.
4. **Lemmatization**: Reducing words to their base or root form (e.g., 'running' to 'run').

After loading the dataset into a Pandas DataFrame, we will proceed to remove any noise present in the data. This will involve filtering out irrelevant or extraneous information to ensure that only the products belonging to the 13 primary categories are retained. The following steps outline this process in detail.

In [None]:
unbalanced_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/un_balanced_products.csv')
unbalanced_df.head(15)

Unnamed: 0.1,Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,...,description,product_rating,overall_rating,brand,product_specifications,primary_categories,main_category,desc_pol,desc_len,cleaned_desc
0,0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,Clothing,SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",...,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",clothing,1,0.14375,410.0,key features of alisha solid women's cycling s...
1,1,f449ec65dcbc041b6ae5e6a32717d01b,2016-03-25 22:59:23 +0000,http://www.flipkart.com/aw-bellies/p/itmeh4grg...,AW Bellies,Footwear,SHOEH4GRSUBJGZXE,999.0,499.0,"[""http://img5a.flixcart.com/image/shoe/7/z/z/r...",...,Key Features of AW Bellies Sandals Wedges Heel...,No rating available,No rating available,AW,"{""product_specification""=>[{""key""=>""Ideal For""...",footwear,1,0.027778,650.0,key features of aw bellies sandals wedges heel...
2,2,0973b37acd0c664e3de26e97e5571454,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,Clothing,SRTEH2F6HUZMQ6SJ,699.0,267.0,"[""http://img5a.flixcart.com/image/short/6/2/h/...",...,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",clothing,1,0.122917,403.0,key features of alisha solid women's cycling s...
3,3,ce5a6818f7707e2cb61fdcdbba61f5ad,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,Clothing,SRTEH2FVVKRBAXHB,1199.0,479.0,"[""http://img6a.flixcart.com/image/short/p/j/z/...",...,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",clothing,1,0.127778,416.0,key features of alisha solid women's cycling s...
4,4,29c8d290caa451f97b1c32df64477a2c,2016-03-25 22:59:23 +0000,http://www.flipkart.com/dilli-bazaaar-bellies-...,"dilli bazaaar Bellies, Corporate Casuals, Casuals",Footwear,SHOEH3DZBFR88SCK,699.0,349.0,"[""http://img6a.flixcart.com/image/shoe/b/p/n/p...",...,"Key Features of dilli bazaaar Bellies, Corpora...",No rating available,No rating available,dilli bazaaar,"{""product_specification""=>[{""key""=>""Occasion"",...",footwear,1,-0.032143,428.0,"key features of dilli bazaaar bellies, corpora..."
5,5,4044c0ac52c1ee4b28777417651faf42,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,Clothing,SRTEH2FVUHAAVH9X,1199.0,479.0,"[""http://img5a.flixcart.com/image/short/5/z/c/...",...,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",clothing,1,0.098333,419.0,key features of alisha solid women's cycling s...
6,6,e54bc0a7c3429da2ebef0b30331fe3d2,2016-03-25 22:59:23 +0000,http://www.flipkart.com/ladela-bellies/p/itmeh...,Ladela Bellies,Footwear,SHOEH4KM2W3Z6EH5,1724.0,950.0,"[""http://img5a.flixcart.com/image/shoe/s/g/m/b...",...,Key Features of Ladela Bellies Brand: LADELA C...,5,5,Ladela,"{""product_specification""=>[{""key""=>""Occasion"",...",footwear,1,0.215,358.0,key features of ladela bellies brand: ladela c...
7,7,c73e78fb440ff8972e0762daed4fc109,2016-03-25 22:59:23 +0000,http://www.flipkart.com/carrel-printed-women-s...,Carrel Printed Women's,Clothing,SWIEHF3EF5PZAZUY,2299.0,910.0,"[""http://img6a.flixcart.com/image/swimsuit/5/v...",...,Key Features of Carrel Printed Women's Fabric:...,No rating available,No rating available,Carrel,"{""product_specification""=>[{""key""=>""Neck"", ""va...",clothing,1,0.339474,1182.0,key features of carrel printed women's fabric:...
8,8,9aacdecceb404c74abddc513fd2756a8,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,Clothing,SRTEH2FGBDJGX8FW,999.0,379.0,"[""http://img6a.flixcart.com/image/short/q/z/v/...",...,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",clothing,1,0.109259,414.0,key features of alisha solid women's cycling s...
9,9,83c53f8948f508f51d2249b489ca8e7d,2016-03-25 22:59:23 +0000,http://www.flipkart.com/freelance-vacuum-bottl...,Freelance Vacuum Bottles 350 ml Bottle,Pens & Stationery,BOTEGYTZ2T6WUJMM,699.0,699.0,"[""http://img5a.flixcart.com/image/bottle/j/m/m...",...,Specifications of Freelance Vacuum Bottles 350...,No rating available,No rating available,Freelance,"{""product_specification""=>[{""key""=>""Body Mater...",toys&schoolsupplies,1,0.010417,216.0,specifications of freelance vacuum bottles 350...


# Product Classes Encoding

To plot the ROC curves and calculate the AUC score, it was necessary to have an appropriate encoding for the 13 primary categories in both directions. Therefore, two dictionaries were created to establish this mapping.

In [None]:
### Helper Dictionaries for Manipulating Testing Output Before Plotting ROC Curves

##Helper dictionaries were created to transform the testing output into the appropriate format needed for plotting the ROC curves.

category_mapping = {  0	: "homefurnishing/kitchen",
                      1	: "clothing",
                      2	: "jewellery",
                      3	: "personalaccessories",
                      4	: "electronics",
                      5	: "footwear",
                      6	: "automotive",
                      7	: "toys&schoolsupplies",
                      8	: "tools&hardware",
                      9	: "babycare",
                      10 : "sports&fitness",
                      11 : "petsupplies",
                      12 : "ebooks"	}

reverse_category_mapping = {"homefurnishing/kitchen":0,
                            "clothing":1,
                            "jewellery":2,
                            "personalaccessories":3,
                            "electronics":4,
                            "footwear":5,
                            "automotive":6,
                            "toys&schoolsupplies":7,
                            "tools&hardware":8,
                            "babycare":9,
                            "sports&fitness":10,
                            "petsupplies":11,
                            "ebooks":12}


**Removing the noise from the dataset**



In [None]:
##Noise can introduce irrelevant or misleading information, which can negatively impact the accuracy and robustness of the model. By removing noise,
#the model can focus on the most relevant features and patterns, leading to better performance.

In [None]:
#dropping the noise in the dataset by considering only the above mentioned categories

print(unbalanced_df.shape)
unbalanced_df = unbalanced_df[unbalanced_df["main_category"]==1]
print(unbalanced_df.shape)

(14999, 21)
(14999, 21)


## **Training and Evaluating Multiple Machine Learning Models**


Training and Evaluating Multiple Machine Learning Models
For each of the six specified machine learning algorithms, we have created dedicated functions. These functions perform the following steps:

Dataset Splitting: Randomly split the dataset into training and testing sets using the train_test_split function.

Bag of Words Model: Create a Bag of Words representation for the training dataset to convert the text data into numerical form interpretable by the model.

TF-IDF Vectorization: Convert the Bag of Words model into corresponding TF-IDF vectors, as these tend to improve performance in machine learning models.

Model Training: Fit the training dataset to the respective machine learning model.

Prediction: Predict the output for the testing dataset.

Accuracy Calculation: Calculate the accuracy of the predicted categories compared to the actual categories in the testing dataset.

Evaluation Metrics: Print evaluation metrics such as the Confusion Matrix and Classification Report to obtain an in-depth understanding of the model's accuracy and performance.



## **1) Logistic Regression (Binary Classification Method)**

In [None]:
def logistic_regression(x,y):
  from sklearn.linear_model import LogisticRegression

  #splitting the dataset into training and test parts
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

  #bag of words implmentation
  cv = CountVectorizer()
  x_train = cv.fit_transform(x_train).toarray()

  #TF-IDF implementation
  vector = TfidfTransformer()
  x_train = vector.fit_transform(x_train).toarray()
  x_test = cv.transform(x_test)

  #fitting the training dataset to the model
  lr_model = LogisticRegression()
  lr_model.fit(x_train,y_train)
  lr_predict = lr_model.predict(x_test)
  lr_pred_prob = lr_model.predict_proba(x_test)

  #evaluation metrics for the dataset
  print("Validation Accuracy: ",accuracy_score(y_test, lr_predict))

  print("\n")
  print("*********** CONFUSION MATRIX **************")
  print(confusion_matrix(y_test,lr_predict))

  print("\n")
  print("*********** CLASSIFICATION REPORT **************")
  print(classification_report(y_test,lr_predict))

  return y_test, lr_predict, lr_pred_prob

## **2) Logistic Regression (Multiclass Classification Method)**

This multiclass classification model has been trained to plot ROC curves and calculate AUC scores, giving us a clear understanding of the model's accuracy and effectiveness.

In [None]:
def logistic_regression_multiclass(x,y):
  from sklearn.linear_model import LogisticRegression

  #splitting the dataset into training and test parts
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

  #bag of words implmentation
  cv = CountVectorizer()
  x_train = cv.fit_transform(x_train).toarray()

  #TF-IDF implementation
  vector = TfidfTransformer()
  x_train = vector.fit_transform(x_train).toarray()
  x_test = cv.transform(x_test)

  reg = 0.1

  #fitting the training dataset to the multiclass classification Logistic Regression model
  lr_model = LogisticRegression(C=1/reg, solver='lbfgs', multi_class='auto', max_iter=1000).fit(x_train,y_train)
  lr_predict = lr_model.predict(x_test)
  lr_pred_prob = lr_model.predict_proba(x_test)

  #evaluation metrics for the dataset
  print("Validation Accuracy: ",accuracy_score(y_test, lr_predict))

  print("\n")
  print("*********** CONFUSION MATRIX **************")
  print(confusion_matrix(y_test,lr_predict))

  print("\n")
  print("*********** CLASSIFICATION REPORT **************")
  print(classification_report(y_test,lr_predict))

  return y_test, lr_predict, lr_pred_prob

## **3) Multinomial Naive Bayes Classifier**

In [None]:
def naive_bayes(x,y):
  from sklearn.naive_bayes import MultinomialNB

  #splitting the dataset into training and test parts
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

  #bag of words implmentation
  cv = CountVectorizer()
  x_train = cv.fit_transform(x_train).toarray()

  #TF-IDF implementation
  vector = TfidfTransformer()
  x_train = vector.fit_transform(x_train).toarray()
  x_test = cv.transform(x_test)

  #fitting the tarining dataset to the model
  nb_model = MultinomialNB()
  nb_model.fit(x_train,y_train)
  nb_predict = nb_model.predict(x_test)
  nb_pred_prob = nb_model.predict_proba(x_test)

  #evaluation metrics for the dataset
  print("Validation Accuracy: ",accuracy_score(y_test, nb_predict))

  print("\n")
  print("*********** CONFUSION MATRIX **************")
  print(confusion_matrix(y_test,nb_predict))

  print("\n")
  print("*********** CLASSIFICATION REPORT **************")
  print(classification_report(y_test,nb_predict))

  return y_test, nb_predict, nb_pred_prob

## **4) Linear Support Vector Machine**

In [None]:
def linear_svm(x,y):
  from sklearn.svm import LinearSVC

  #splitting the dataset into training and test parts
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

  #bag of words implmentation
  cv = CountVectorizer()
  x_train = cv.fit_transform(x_train).toarray()

  #TF-IDF implementation
  vector = TfidfTransformer()
  x_train = vector.fit_transform(x_train).toarray()
  x_test = cv.transform(x_test)

  #fitting the training dataset to the model
  svc_model = LinearSVC(random_state=42, max_iter=2000)
  svc_model.fit(x_train,y_train)
  svc_predict = svc_model.predict(x_test)

  #evaluation metrics for the dataset
  print("Validation Accuracy: ",accuracy_score(y_test, svc_predict))

  print("\n")
  print("*********** CONFUSION MATRIX **************")
  print(confusion_matrix(y_test,svc_predict))

  print("\n")
  print("*********** CLASSIFICATION REPORT **************")
  print(classification_report(y_test,svc_predict))

  return y_test, svc_predict

## **5) Decision Trees Classifier**

In [None]:
def decision_trees(x,y):
  from sklearn.tree import DecisionTreeClassifier

  #splitting the dataset into training and test parts
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

  #bag of words implmentation
  cv = CountVectorizer()
  x_train = cv.fit_transform(x_train).toarray()

  #TF-IDF implementation
  vector = TfidfTransformer()
  x_train = vector.fit_transform(x_train).toarray()
  x_test = cv.transform(x_test)

  #fitting the training dataset to the model
  dtree_model = DecisionTreeClassifier(max_depth = 30)
  dtree_model.fit(x_train,y_train)
  dtree_predict = dtree_model.predict(x_test)
  dtree_pred_prob = dtree_model.predict_proba(x_test)

  #evaluation metrics for the dataset
  print("Validation Accuracy: ",accuracy_score(y_test, dtree_predict))

  print("\n")
  print("*********** CONFUSION MATRIX **************")
  print(confusion_matrix(y_test,dtree_predict))

  print("\n")
  print("*********** CLASSIFICATION REPORT **************")
  print(classification_report(y_test,dtree_predict))

  return y_test, dtree_predict, dtree_pred_prob

## **6) Random Forest Classifier**

In [None]:
def random_forest(x,y):
  from sklearn.ensemble import RandomForestClassifier

  #splitting the dataset into training and test parts
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

  #bag of words implmentation
  cv = CountVectorizer()
  x_train = cv.fit_transform(x_train).toarray()

  #TF-IDF implementation
  vector = TfidfTransformer()
  x_train = vector.fit_transform(x_train).toarray()
  x_test = cv.transform(x_test)

  #fitting the training dataset to the model
  rfc_model = RandomForestClassifier(random_state=42, max_depth = 30)
  rfc_model.fit(x_train,y_train)
  rfc_predict = rfc_model.predict(x_test)
  rfc_pred_prob = rfc_model.predict_proba(x_test)

  #evaluation metrics for the dataset
  print("Validation Accuracy: ",accuracy_score(y_test, rfc_predict))

  print("\n")
  print("*********** CONFUSION MATRIX **************")
  print(confusion_matrix(y_test, rfc_predict))

  print("\n")
  print("*********** CLASSIFICATION REPORT **************")
  print(classification_report(y_test, rfc_predict))

  return y_test, rfc_predict, rfc_pred_prob

## **7) K Nearest Neighbours**

In [None]:
def k_nearest_neighbours(x,y):
  from sklearn.neighbors import KNeighborsClassifier

  #splitting the dataset into training and test parts
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

  #bag of words implmentation
  cv = CountVectorizer()
  x_train = cv.fit_transform(x_train).toarray()

  #TF-IDF implementation
  vector = TfidfTransformer()
  x_train = vector.fit_transform(x_train).toarray()
  x_test = cv.transform(x_test)

  #fitting the training dataset to the model
  knn_model = KNeighborsClassifier(algorithm='brute')
  knn_model.fit(x_train,y_train)
  knn_predict = knn_model.predict(x_test)
  knn_pred_prob = knn_model.predict_proba(x_test)

  #evaluation metrics for the dataset
  print("Validation Accuracy: ",accuracy_score(y_test, knn_predict))

  print("\n")
  print("*********** CONFUSION MATRIX **************")
  print(confusion_matrix(y_test, knn_predict))

  print("\n")
  print("*********** CLASSIFICATION REPORT **************")
  print(classification_report(y_test, knn_predict))

  return y_test, knn_predict, knn_pred_prob

## **Plotting the ROC Curves for Multiclass Logistic Regression**

The ROC Curves are plotted and the corresponding AUC score for each of the categories has been plotted to get an idea about the accuracy of the model. Later, an aggregate AUC Score is also calculated which an average for all the categories' One VS Rest ROC Curves.

In [None]:
def plot_roc_curve(y_test, y_pred, no_categories = 13, lw=2):

  #calculating the ROC curve and area for each class
  false_positive_rate = dict()
  true_positive_rate = dict()

  for i in range(no_categories):
    false_positive_rate[i], true_positive_rate[i], _ = roc_curve(y_test[:,i], y_pred[:, i])

  #Compute micro-average ROC curve and area under the curve
  false_positive_rate["micro"], true_positive_rate["micro"], _ = roc_curve(y_test.ravel(), y_pred.ravel())

  #plotting the ROC Curves for each of the 13 main categories in our model
  for category in range(no_categories):
    plt.figure()
    plt.plot(false_positive_rate[category],
             true_positive_rate[category],
             color='deeppink',
             lw=lw)
    plt.plot([0, 1], [0, 1], color='navy', lw=lw)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Primary Category: ' + category_mapping[category])
  plt.show()

## **Evaluation of the ML Model on unbalanced testing dataset**

The above mentioned functions are called and for each of the 6 algorithms, evaluation metrics are printed. This helps us in understanding which ML Model performs the best.

In [None]:
# the cleaned product description corresponds to the x value
x = unbalanced_df['cleaned_desc']
# the 13 labels/ product categories mentioned above correspond to the y value
y = unbalanced_df['primary_categories']

# **1) Evaluation using the Logistic Regression Model**

In [None]:
# the cleaned product description corresponds to the x value
x = unbalanced_df['cleaned_desc']

# Handle missing values in 'cleaned_desc'
x.fillna('', inplace=True)  # Replace missing values with empty strings

# the 13 labels/ product categories mentioned above correspond to the y value
y = unbalanced_df['primary_categories']

print("********** LOGISTIC REGRESSION **********")
y_test, lr_predict, lr_pred_prob = logistic_regression(x,y)

********** LOGISTIC REGRESSION **********
Validation Accuracy:  0.973


*********** CONFUSION MATRIX **************
[[ 184    0    0    6    0    0    0    0    0    0]
 [   1   28    3    0    1    4    0    0    0    1]
 [   0    6 1058    0    0    5    1    0    0    0]
 [   2    0    0  252    4    0    0    0    0    0]
 [   0    0    1    0  212    1    0    0    0    0]
 [   0    2    0    0    0  254    0    0    0    2]
 [   0    0    0    1    4    1  639    0    0    0]
 [   0    0    2    0    7    1    0  137    0    0]
 [   0    0    0    0    0    3    0    0   65    2]
 [   2    0    2    1    8    5    0    1    1   90]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.97      0.97      0.97       190
              babycare       0.78      0.74      0.76        38
              clothing       0.99      0.99      0.99      1070
           electronics       0.97      0.98  

**2) Evaluation using the Multiclass Variant of Logistic Regression**

In [None]:
print("********** MULTICLASS LOGISTIC REGRESSION **********")
y_test, lr_predict, lr_pred_prob = logistic_regression_multiclass(x,y)

********** MULTICLASS LOGISTIC REGRESSION **********
Validation Accuracy:  0.983


*********** CONFUSION MATRIX **************
[[ 187    0    0    3    0    0    0    0    0    0]
 [   2   29    3    0    0    3    0    0    0    1]
 [   2    7 1058    0    0    2    1    0    0    0]
 [   2    0    0  256    0    0    0    0    0    0]
 [   0    0    1    0  212    0    0    1    0    0]
 [   0    2    0    0    0  256    0    0    0    0]
 [   0    0    0    2    2    1  640    0    0    0]
 [   0    0    0    0    0    1    0  146    0    0]
 [   0    0    0    0    0    3    0    0   66    1]
 [   1    1    0    0    3    5    0    1    0   99]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.96      0.98      0.97       190
              babycare       0.74      0.76      0.75        38
              clothing       1.00      0.99      0.99      1070
           electronics       0.98 

### **Plotting the ROC Curves for each of the 13 categories and calculating the AUC score for them**

In [None]:
print("TESTING LABELS: {}".format(y_test.shape))
print("PREDICTED LABELS: {}".format(lr_predict.shape))
print("PROBABILITY OF THE PREDICTED LABELS: {}".format(lr_pred_prob.shape))

TESTING LABELS: (3000,)
PREDICTED LABELS: (3000,)
PROBABILITY OF THE PREDICTED LABELS: (3000, 10)


In [None]:
#converting the Test Classes (y_test) from Pandas Series object to Numpy array
y_test = y_test.to_numpy()
length = y_test.shape

#conerting the string classes into the respective numbers based on their mapping as described previously
for i in range(length[0]):
  y_test[i] = reverse_category_mapping[y_test[i]]
  lr_predict[i] = reverse_category_mapping[lr_predict[i]]

print("The last 10 actual labels: {}".format(y_test[:10]))
print("The last 10 predicted labels: {}".format(lr_predict[:10]))

The last 10 actual labels: [4 7 2 1 2 3 1 5 2 1]
The last 10 predicted labels: [4 7 2 1 2 3 1 5 2 1]


In [None]:
#converting the actual test labels into a binary 2d numpy array according to their classes

n_classes = 13
temp_array = [[0 for i in range(n_classes)] for i in range(length[0])]

j=0
for i in y_test:
  temp_array[j][i] = 1
  j+=1

#converting the temporary array into a numpy array
y_test = np.array(temp_array)
print(y_test)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [None]:
def plot_roc_curve(y_test, y_pred, lw=2):

    no_categories = y_test.shape[1] # Get the number of columns from y_test

    false_positive_rate = dict()
    true_positive_rate = dict()
    roc_auc = dict()

    for i in range(no_categories):
        false_positive_rate[i], true_positive_rate[i], _ = roc_curve(y_test[:,i], y_pred[:, i])

    #Compute micro-average ROC curve and area under the curve
    false_positive_rate["micro"], true_positive_rate["micro"], _ = roc_curve(y_test.ravel(), y_pred.ravel())
    roc_auc["micro"] = auc(false_positive_rate["micro"], true_positive_rate["micro"])

    #Plot of a ROC curve for a specific class
    plt.figure()
    plt.plot(false_positive_rate[2], true_positive_rate[2], label='ROC curve (area = %0.2f)' % roc_auc[2])
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver')

In [None]:
# Check the shapes of y_test and lr_pred_prob
print("Shape of y_test:", y_test.shape)
print("Shape of lr_pred_prob:", lr_pred_prob.shape)

# Modify the plot_roc_curve function to handle potential shape mismatches
def plot_roc_curve(y_test, y_pred, lw=2):
    """
    Plots the ROC curve for multi-class classification.

    Args:
        y_test: True labels.
        y_pred: Predicted probabilities.
        lw: Line width for the plot.
    """

    # Get the number of categories from the array with fewer columns
    no_categories = min(y_test.shape[1], y_pred.shape[1])

    # Initialize variables
    false_positive_rate = dict()
    true_positive_rate = dict()

    # Calculate ROC curve for each category
    for i in range(no_categories):
        false_positive_rate[i], true_positive_rate[i], _ = roc_curve(y_test[:,i], y_pred[:, i])

    # ... rest of the plot_roc_curve function ...

# Call the modified plot_roc_curve function
plot_roc_curve(y_test, lr_pred_prob)

Shape of y_test: (3000, 13)
Shape of lr_pred_prob: (3000, 10)


**Average Area Under the Curve**

The aggregate area under the curve score that is averaged across all of the One VS Rest curves is calculated below.

In [None]:
auc = roc_auc_score(y_test, lr_pred_prob, multi_class='ovr')
print('Average AUC score for all the categories is {}'.format(auc))

Average AUC score for all the categories is 0.5347936541610551


**3) Multinomial Naive Bayes Classifier**

In [None]:
print("********* NAIVE BAYES CLASSIFIER *********")
y_test, nb_predict, nb_pred_prob = naive_bayes(x,y)

********* NAIVE BAYES CLASSIFIER *********
Validation Accuracy:  0.856


*********** CONFUSION MATRIX **************
[[ 178    0    0    3    0    0    9    0    0    0]
 [   1    0   27    0    0    1    9    0    0    0]
 [   0    0 1064    0    0    0    6    0    0    0]
 [   0    0    1  200    0    0   57    0    0    0]
 [   0    0   37    0   83    0   94    0    0    0]
 [   0    0    8    1    0  242    7    0    0    0]
 [   0    0   11    0    0    2  632    0    0    0]
 [   0    0   26    0    0    1   11  109    0    0]
 [   0    0   12    1    0    5    6    0   46    0]
 [   1    0   15    2    0    8   70    0    0   14]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.99      0.94      0.96       190
              babycare       0.00      0.00      0.00        38
              clothing       0.89      0.99      0.94      1070
           electronics       0.97      0.78 

4) **Linear Support Vector Machine**

In [None]:
print("********* LINEAR SVM *********")
y_test, svm_predict = linear_svm(x,y)

********* LINEAR SVM *********
Validation Accuracy:  0.9866666666666667


*********** CONFUSION MATRIX **************
[[ 189    0    0    1    0    0    0    0    0    0]
 [   1   29    3    0    0    4    0    0    0    1]
 [   0    4 1064    0    0    1    1    0    0    0]
 [   2    0    0  256    0    0    0    0    0    0]
 [   0    1    0    1  212    0    0    0    0    0]
 [   0    2    0    0    0  256    0    0    0    0]
 [   0    0    0    2    0    1  642    0    0    0]
 [   0    0    0    0    0    1    0  146    0    0]
 [   0    0    0    0    0    3    0    0   67    0]
 [   1    0    0    1    0    8    0    1    0   99]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.98      0.99      0.99       190
              babycare       0.81      0.76      0.78        38
              clothing       1.00      0.99      1.00      1070
           electronics       0.98      0.99

**5) Decision Trees Classifier**

In [None]:
print("********** DECISION TREES CLASSIFIER *************")
y_test, dtree_predict, dtree_pred_prob = decision_trees(x,y)

********** DECISION TREES CLASSIFIER *************
Validation Accuracy:  0.777


*********** CONFUSION MATRIX **************
[[168   0   1   1   0   5   0   2   2  11]
 [  0  25   9   0   1   1   0   0   1   1]
 [  2   8 891   0   9   2   2 155   0   1]
 [  4   0  17 154  56   5   0   3   0  19]
 [  1   0   7   0 201   3   2   0   0   0]
 [  0   1 143   0   4  73   9   4   0  24]
 [  0   3   7   0  54  21 547   0   0  13]
 [  0   0   4   1   8   3   0 130   0   1]
 [  0   0   5   0   1   1   0   0  61   2]
 [  0   3  15   0   4   4   0   1   2  81]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.96      0.88      0.92       190
              babycare       0.62      0.66      0.64        38
              clothing       0.81      0.83      0.82      1070
           electronics       0.99      0.60      0.74       258
              footwear       0.59      0.94      0.73       214
homefurn

**6) Random Forest Classifier**

In [None]:
print("********** RANDOM FOREST CLASSIFIER *************")
y_test, rfc_predict, rfc_pred_prob = random_forest(x,y)

********** RANDOM FOREST CLASSIFIER *************
Validation Accuracy:  0.7936666666666666


*********** CONFUSION MATRIX **************
[[ 181    0    5    1    0    3    0    0    0    0]
 [   0   11   21    0    0    6    0    0    0    0]
 [   0    0 1067    0    0    2    1    0    0    0]
 [   1    0    4  185   58   10    0    0    0    0]
 [   0    0   26    0  188    0    0    0    0    0]
 [   0    1   11    0    0  246    0    0    0    0]
 [   0    0    4    0  350    0  291    0    0    0]
 [   0    0   11    0    8    2    0  126    0    0]
 [   0    0    8    0    0    3    1    0   58    0]
 [   0    0   11    1   55   15    0    0    0   28]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.99      0.95      0.97       190
              babycare       0.92      0.29      0.44        38
              clothing       0.91      1.00      0.95      1070
           electronics  

**7) K Nearest Neighbours**

In [None]:
print("********** K NEAREST NEIGHBOURS *************")
y_test, knn_predict, knn_pred_prob = k_nearest_neighbours(x,y)

********** K NEAREST NEIGHBOURS *************
Validation Accuracy:  0.9433333333333334


*********** CONFUSION MATRIX **************
[[ 183    0    0    3    1    3    0    0    0    0]
 [   0   23    7    3    1    2    2    0    0    0]
 [   0    4 1060    2    1    1    2    0    0    0]
 [   1    0    0  215   42    0    0    0    0    0]
 [   0    0    6    0  206    1    1    0    0    0]
 [   0    1    7    2    3  240    3    0    1    1]
 [   0    0    5    0   14    1  624    1    0    0]
 [   0    0    2    1   10    0    2  132    0    0]
 [   0    2    1    1    0    2    1    0   63    0]
 [   1    1    2    2   12    6    2    0    0   84]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.99      0.96      0.98       190
              babycare       0.74      0.61      0.67        38
              clothing       0.97      0.99      0.98      1070
           electronics      

## **PART 2) Reading the Balanced Dataset created using the Oversampling & Resampling Technique**

The dataset that was balanced using the oversampling technique (implemented uusing resampling) that was saved from the earlier [Notebook](https://colab.research.google.com/drive/1Ht6pbVFlkudK7PzrDPmepiytHxPyhVBe?usp=sharing) in the form of a csv file is loaded below in the form of a Pandas dataframe. This dataset consists consists of only those products which belong to the 13 primary categories (no noise). The Product Description in this dataset has also already been cleaned in the previous notebook (lowercasing, stopword removal, tokenization,lemmatization, etc).

In [None]:
oversampled_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/oversampling_balanced_products.csv")
oversampled_df

Unnamed: 0.1,Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,...,description,product_rating,overall_rating,brand,product_specifications,primary_categories,main_category,desc_pol,desc_len,cleaned_desc
0,14890,44fc0504c377109b5b137f540a8cf826,2015-12-20 08:26:17 +0000,http://www.flipkart.com/kenway-retail-metal-co...,"Kenway Retail Metal, Cotton Dori Silver Bracel...",Jewellery,BBAE9HP7CZME8UKA,1080.0,250.0,"[""http://img6a.flixcart.com/image/bangle-brace...",...,"Kenway Retail Metal, Cotton Dori Silver Bracel...",No rating available,No rating available,Kenway Retail,"{""product_specification""=>[{""key""=>""Stretchabl...",jewellery,1,0.062500,696.0,"kenway retail metal, cotton dori silver bracel..."
1,4859,deb35354e9d126c778d3243aafa7c7f7,2015-12-01 12:40:44 +0000,http://www.flipkart.com/willmore-bone-choker/p...,Willmore Bone Choker,Jewellery,NKCEAVZ3QEWNHQNR,399.0,250.0,"[""http://img5a.flixcart.com/image/necklace-cha...",...,Willmore Bone Choker - Buy Willmore Bone Choke...,No rating available,No rating available,Willmore,"{""product_specification""=>[{""key""=>""Collection...",jewellery,1,0.225000,169.0,willmore bone choker - buy willmore bone choke...
2,5788,a70f82d9f70603d9b4ebf4f450f496c9,2015-12-01 06:13:00 +0000,http://www.flipkart.com/galz4ever-multi-seed-b...,Galz4ever Multi Seed Bead Alloy Necklace,Jewellery,NKCEAGXHHUXGZKAV,249.0,221.0,"[""http://img5a.flixcart.com/image/necklace-cha...",...,Galz4ever Multi Seed Bead Alloy Necklace - Buy...,No rating available,No rating available,Galz4ever,"{""product_specification""=>[{""key""=>""Collection...",jewellery,1,0.225000,209.0,galz4ever multi seed bead alloy necklace - buy...
3,5517,d3621deb6d1cf32864b2a185628db9a8,2015-12-01 06:13:00 +0000,http://www.flipkart.com/dressberry-metal-neckl...,DressBerry Metal Necklace,Jewellery,NKCE9YX2HGNZSZRV,500.0,225.0,"[""http://img6a.flixcart.com/image/necklace-cha...",...,DressBerry Metal Necklace - Buy DressBerry Met...,No rating available,No rating available,DressBerry,"{""product_specification""=>[{""key""=>""Brand"", ""v...",jewellery,1,0.225000,179.0,dressberry metal necklace - buy dressberry met...
4,5457,18258fe5d0b1bb850cd477e784c11956,2015-12-01 06:13:00 +0000,http://www.flipkart.com/galz4ever-red-black-lo...,Galz4ever Red & Black Long Resin Alloy Necklace,Jewellery,NKCEAGXHGFZZZKBK,249.0,229.0,"[""http://img5a.flixcart.com/image/necklace-cha...",...,Galz4ever Red & Black Long Resin Alloy Necklac...,No rating available,No rating available,Galz4ever,"{""product_specification""=>[{""key""=>""Collection...",jewellery,1,0.046667,223.0,galz4ever red & black long resin alloy necklac...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,7966,725af0527b8b2bcccd020bd7a4ccb5f3,2016-01-07 05:50:25 +0000,http://www.flipkart.com/lamex-3326-s-black-ana...,Lamex 3326 S Black And Black Analog Watch - F...,Watches,WATE3NWH5FZWAMDA,725.0,725.0,"[""http://img6a.flixcart.com/image/watch/m/d/a/...",...,Lamex 3326 S Black And Black Analog Watch - F...,No rating available,No rating available,,"{""product_specification""=>[{""key""=>""Chronograp...",personalaccessories,1,0.063636,292.0,lamex 3326 s black and black analog watch - f...
29996,12145,de39565c5a59238698a60162052809bc,2016-06-03 11:49:18 +0000,http://www.flipkart.com/edel-shoulder-bag/p/it...,Edel Shoulder Bag,"Bags, Wallets & Belts",HMBEJAFGZEDFYZ5V,3499.0,999.0,"[""http://img5a.flixcart.com/image/hand-messeng...",...,Key Features of Edel Shoulder Bag Ideal For: G...,No rating available,No rating available,Edel,"{""product_specification""=>[{""key""=>""Closure"", ...",personalaccessories,1,0.364815,674.0,key features of edel shoulder bag ideal for: g...
29997,252,b707fb4e10ebf63c88be8c6a8e50d038,2015-12-04 07:25:36 +0000,http://www.flipkart.com/zoop-c3030pp05-analog-...,"Zoop C3030PP05 Analog Watch - For Boys, Girls",Watches,WATDZDSVWQRRGERC,650.0,650.0,"[""http://img5a.flixcart.com/image/watch/e/r/c/...",...,"Zoop C3030PP05 Analog Watch - For Boys, Girls...",4.8,4.8,,"{""product_specification""=>[{""key""=>""Diameter"",...",personalaccessories,1,0.242857,339.0,"zoop c3030pp05 analog watch - for boys, girls..."
29998,7763,11869c9a7b5d62385364c5739b711f4d,2016-01-07 05:50:25 +0000,http://www.flipkart.com/lamex-4250-rg-white-an...,Lamex 4250 Rg White And White Analog Watch - ...,Watches,WATE3NWHWHGSECNQ,849.0,849.0,"[""http://img5a.flixcart.com/image/watch/c/n/q/...",...,Lamex 4250 Rg White And White Analog Watch - ...,No rating available,No rating available,,"{""product_specification""=>[{""key""=>""Chronograp...",personalaccessories,1,0.154545,295.0,lamex 4250 rg white and white analog watch - ...


## **Evaluation of the ML Model on Balanced (Oversampling) testing dataset**

The 6 ML algorithms are then trained on the oversampled balanced dataset and evaluated on the testing dataset. Evaluation metrics are printed in order to ease the process of comparison to find which dataset and model works the best.

In [None]:
x = oversampled_df['cleaned_desc']
y = oversampled_df['primary_categories']

**1) Evaluation using the Logistic Regression Model**

In [None]:
# Assuming 'x' from ipython-input-53-a1ef3cfaf5d8 contains the data you want to preprocess
x = x.fillna('')  # Handle missing values in the 'x' variable

#bag of words implmentation
cv = CountVectorizer()
x_train = cv.fit_transform(x).toarray() # Now define x_train using the preprocessed 'x'

In [None]:
# Handle missing values by filling them with empty strings before converting to a NumPy array
x = x.fillna('')

#bag of words implmentation
cv = CountVectorizer()
x_train = cv.fit_transform(x).toarray()

In [None]:
print("********** LOGISTIC REGRESSION **********")
y_test, lr_predict, lr_pred_prob = logistic_regression(x,y)

********** LOGISTIC REGRESSION **********
Validation Accuracy:  0.9838333333333333


*********** CONFUSION MATRIX **************
[[616   0   0   2   0   0   0   0   0   0]
 [  0 568   7   0   0   0   0   0   0   1]
 [  0  22 569   1   0   2   0   1   0   0]
 [  7   0   0 597   0   0   0   2   0   3]
 [  0   0   1   0 624   0   0   0   0   0]
 [  0   4   0   1   2 621   2   1   3   1]
 [  0   2   0   4   2   5 553   1   0   0]
 [  0   0   0   0   0   0   0 571   0   1]
 [  0   0   0   0   0   1   0   0 590   0]
 [  1   0   0   0   4   6   0   7   0 594]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.99      1.00      0.99       618
              babycare       0.95      0.99      0.97       576
              clothing       0.99      0.96      0.97       595
           electronics       0.99      0.98      0.98       609
              footwear       0.99      1.00      0.99       625
home

**2) Evaluation using the Multiclass Variant of Logistic Regression Model**

In [None]:
print("********** LOGISTIC REGRESSION MULTICLASS **********")
y_test, lr_predict, lr_pred_prob = logistic_regression_multiclass(x,y)

********** LOGISTIC REGRESSION MULTICLASS **********
Validation Accuracy:  0.9935


*********** CONFUSION MATRIX **************
[[617   0   0   1   0   0   0   0   0   0]
 [  0 576   0   0   0   0   0   0   0   0]
 [  1  17 575   0   0   0   0   2   0   0]
 [  2   0   0 605   0   0   0   2   0   0]
 [  0   0   1   0 624   0   0   0   0   0]
 [  0   1   0   1   2 630   0   1   0   0]
 [  0   0   0   6   0   0 560   1   0   0]
 [  0   0   0   0   0   0   0 572   0   0]
 [  0   0   0   0   0   0   0   0 591   0]
 [  0   0   0   0   0   0   0   1   0 611]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       1.00      1.00      1.00       618
              babycare       0.97      1.00      0.98       576
              clothing       1.00      0.97      0.98       595
           electronics       0.99      0.99      0.99       609
              footwear       1.00      1.00      1.00       625
homef

### **Plotting the ROC Curves for each of the 13 categories and calculating the AUC score for them**

In [None]:
print("TESTING LABELS: {}".format(y_test.shape))
print("PREDICTED LABELS: {}".format(lr_predict.shape))
print("PROBABILITY OF THE PREDICTED LABELS: {}".format(lr_pred_prob.shape))

TESTING LABELS: (6000,)
PREDICTED LABELS: (6000,)
PROBABILITY OF THE PREDICTED LABELS: (6000, 10)


In [None]:
#converting the Test Classes (y_test) from Pandas Series object to Numpy array
y_test = y_test.to_numpy()
length = y_test.shape

#conerting the string classes into the respective numbers based on their mapping as described previously
for i in range(length[0]):
  y_test[i] = reverse_category_mapping[y_test[i]]
  lr_predict[i] = reverse_category_mapping[lr_predict[i]]

print("The last 10 actual labels: {}".format(y_test[:10]))
print("The last 10 predicted labels: {}".format(lr_predict[:10]))

The last 10 actual labels: [2 1 1 6 2 4 7 4 0 9]
The last 10 predicted labels: [2 1 1 6 2 4 7 4 0 9]


In [None]:
#converting the actual test labels into a binary 2d numpy array according to their classes

n_classes = 13
temp_array = [[0 for i in range(n_classes)] for i in range(length[0])]

j=0
for i in y_test:
  temp_array[j][i] = 1
  j+=1

#converting the temporary array into a numpy array
y_test = np.array(temp_array)
print(y_test)

[[0 0 1 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [None]:
plot_roc_curve(y_test, lr_pred_prob)

**Average Area Under the Curve**

The aggregate area under the curve score that is averaged across all of the One VS Rest curves is calculated below.

In [None]:
auc = roc_auc_score(y_test, lr_pred_prob, multi_class='ovr')
print('Average AUC score for all the categories is {}'.format(auc))

Average AUC score for all the categories is 0.5154072401808458


**3) Evaluation using the Multinomial Naive Bayes Classifier**

In [None]:
print("********** NAIVE BAYES CLASSIFIER **********")
y_test, nb_predict, nb_pred_prob = naive_bayes(x,y)

********** NAIVE BAYES CLASSIFIER **********
Validation Accuracy:  0.9386666666666666


*********** CONFUSION MATRIX **************
[[602   0   0   5   0   0   0   0   2   9]
 [  0 511  28   0   0   3   0   0  11  23]
 [  0  74 514   0   0   4   1   0   2   0]
 [  4   0   0 561   1   0   0   2   0  41]
 [  0   0   2   0 623   0   0   0   0   0]
 [  0   6   0   1   0 604   2   0  20   2]
 [  0   4   4   4   0   8 546   1   0   0]
 [  0   4   3   4   3   4   0 519   1  34]
 [  0   0   0   0   0  16   0   0 575   0]
 [  0   8   0   8   0  11   0   6   2 577]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.99      0.97      0.98       618
              babycare       0.84      0.89      0.86       576
              clothing       0.93      0.86      0.90       595
           electronics       0.96      0.92      0.94       609
              footwear       0.99      1.00      1.00       625
h

**4) Evaluation using Linear Support Vector Machine**

In [None]:
print("********* LINEAR SVM *********")
y_test, svm_predict = linear_svm(x,y)

********* LINEAR SVM *********
Validation Accuracy:  0.997


*********** CONFUSION MATRIX **************
[[618   0   0   0   0   0   0   0   0   0]
 [  0 576   0   0   0   0   0   0   0   0]
 [  1   5 587   0   0   0   0   2   0   0]
 [  2   0   0 605   0   0   0   2   0   0]
 [  0   0   0   0 625   0   0   0   0   0]
 [  0   1   0   2   0 632   0   0   0   0]
 [  0   0   0   2   0   0 564   1   0   0]
 [  0   0   0   0   0   0   0 572   0   0]
 [  0   0   0   0   0   0   0   0 591   0]
 [  0   0   0   0   0   0   0   0   0 612]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       1.00      1.00      1.00       618
              babycare       0.99      1.00      0.99       576
              clothing       1.00      0.99      0.99       595
           electronics       0.99      0.99      0.99       609
              footwear       1.00      1.00      1.00       625
homefurnishing/kitchen      

**5) Evaluation using Decision Trees Classifier**

In [None]:
print("********** DECISION TREES CLASSIFIER *************")
y_test, dtree_predict, dtree_pred_prob = decision_trees(x,y)

********** DECISION TREES CLASSIFIER *************
Validation Accuracy:  0.7208333333333333


*********** CONFUSION MATRIX **************
[[473  10  12  97   2  21   0   0   1   2]
 [  0 507  42   6  16   2   0   0   0   3]
 [  0  17 541   3  27   5   1   1   0   0]
 [ 32  39   8 321  11   2   1 162   7  26]
 [  1   0  16   4 354   3   0 246   0   1]
 [  0 304   8  22  12 263   2  13   5   6]
 [  1   1   8   4   5   9 355 182   0   2]
 [  3   5  19   4   2   3   0 536   0   0]
 [  0  20   0   0   1  10   0   0 557   3]
 [  0  60   1   7  27  15   0  80   4 418]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.93      0.77      0.84       618
              babycare       0.53      0.88      0.66       576
              clothing       0.83      0.91      0.87       595
           electronics       0.69      0.53      0.60       609
              footwear       0.77      0.57      0.65      

**6) Evaluation using the Random Forest Classifier**

In [None]:
print("********** RANDOM FOREST CLASSIFIER *************")
y_test, rfc_predict, rfc_pred_prob = random_forest(x,y)

********** RANDOM FOREST CLASSIFIER *************
Validation Accuracy:  0.8636666666666667


*********** CONFUSION MATRIX **************
[[606   0   0   8   0   1   0   0   2   1]
 [  0 528  34   0   0   0   0   0  14   0]
 [  0   8 582   0   0   1   0   0   4   0]
 [  1   0   0 485 107  13   0   2   1   0]
 [  0   0   0   0 625   0   0   0   0   0]
 [  1   1   0   1   2 547   2   0  79   2]
 [  0   0   1   0 385   0 180   1   0   0]
 [  0   1   0   4  34   0   0 533   0   0]
 [  0   0   0   0   0   8   0   0 583   0]
 [  1   2   0   3  38   5   0   2  48 513]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       1.00      0.98      0.99       618
              babycare       0.98      0.92      0.95       576
              clothing       0.94      0.98      0.96       595
           electronics       0.97      0.80      0.87       609
              footwear       0.52      1.00      0.69       

**7) Evaluation using the K Nearest Neighbours Model**

In [None]:
print("********** K NEAREST NEIGHBOURS *************")
y_test, knn_predict, knn_pred_prob = k_nearest_neighbours(x,y)

********** K NEAREST NEIGHBOURS *************
Validation Accuracy:  0.976


*********** CONFUSION MATRIX **************
[[610   0   0   6   0   1   0   0   0   1]
 [  0 576   0   0   0   0   0   0   0   0]
 [  0  17 563   2   6   2   1   2   0   2]
 [  2   0   0 558  41   5   0   0   0   3]
 [  0   0   0   1 619   1   0   4   0   0]
 [  0   1   0   1   2 621   0   1   3   6]
 [  0   0   1   1   9   3 551   1   0   1]
 [  0   1   0   1   5   0   0 564   0   1]
 [  0   0   0   0   0   0   0   0 591   0]
 [  0   0   0   1   7   1   0   0   0 603]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       1.00      0.99      0.99       618
              babycare       0.97      1.00      0.98       576
              clothing       1.00      0.95      0.97       595
           electronics       0.98      0.92      0.95       609
              footwear       0.90      0.99      0.94       625
homefurnishin

## **PART 3) Reading the Balanced Dataset created using the Undersampling & Resampling Technique**

The dataset that was balanced using the undersampling technique (implemented uusing resampling) that was saved from the earlier [Notebook](https://colab.research.google.com/drive/1Ht6pbVFlkudK7PzrDPmepiytHxPyhVBe?usp=sharing) in the form of a csv file is loaded below in the form of a Pandas dataframe. This dataset consists consists of only those products which belong to the 13 primary categories (no noise). The Product Description in this dataset has also already been cleaned in the previous notebook (lowercasing, stopword removal, tokenization,lemmatization, etc).

In [None]:
undersampled_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/undersampling_balanced_products.csv")
undersampled_df

Unnamed: 0.1,Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,...,description,product_rating,overall_rating,brand,product_specifications,primary_categories,main_category,desc_pol,desc_len,cleaned_desc
0,0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,Clothing,SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",...,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",clothing,1,0.143750,410.0,key features of alisha solid women's cycling s...
1,1,f449ec65dcbc041b6ae5e6a32717d01b,2016-03-25 22:59:23 +0000,http://www.flipkart.com/aw-bellies/p/itmeh4grg...,AW Bellies,Footwear,SHOEH4GRSUBJGZXE,999.0,499.0,"[""http://img5a.flixcart.com/image/shoe/7/z/z/r...",...,Key Features of AW Bellies Sandals Wedges Heel...,No rating available,No rating available,AW,"{""product_specification""=>[{""key""=>""Ideal For""...",footwear,1,0.027778,650.0,key features of aw bellies sandals wedges heel...
2,2,0973b37acd0c664e3de26e97e5571454,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,Clothing,SRTEH2F6HUZMQ6SJ,699.0,267.0,"[""http://img5a.flixcart.com/image/short/6/2/h/...",...,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",clothing,1,0.122917,403.0,key features of alisha solid women's cycling s...
3,3,ce5a6818f7707e2cb61fdcdbba61f5ad,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,Clothing,SRTEH2FVVKRBAXHB,1199.0,479.0,"[""http://img6a.flixcart.com/image/short/p/j/z/...",...,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",clothing,1,0.127778,416.0,key features of alisha solid women's cycling s...
4,4,29c8d290caa451f97b1c32df64477a2c,2016-03-25 22:59:23 +0000,http://www.flipkart.com/dilli-bazaaar-bellies-...,"dilli bazaaar Bellies, Corporate Casuals, Casuals",Footwear,SHOEH3DZBFR88SCK,699.0,349.0,"[""http://img6a.flixcart.com/image/shoe/b/p/n/p...",...,"Key Features of dilli bazaaar Bellies, Corpora...",No rating available,No rating available,dilli bazaaar,"{""product_specification""=>[{""key""=>""Occasion"",...",footwear,1,-0.032143,428.0,"key features of dilli bazaaar bellies, corpora..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14994,14994,abda697c6da997f66c78d91d6c88078c,2015-12-20 08:26:17 +0000,http://www.flipkart.com/thelostpuppy-back-cove...,TheLostPuppy Back Cover for Apple iPad Air,Mobiles & Accessories,ACCE9ZY9K4BHVYNA,2199.0,599.0,"[""http://img6a.flixcart.com/image/cases-covers...",...,TheLostPuppy Back Cover for Apple iPad Air (Mu...,No rating available,No rating available,TheLostPuppy,"{""product_specification""=>[{""key""=>""Brand"", ""v...",electronics,1,0.444898,639.0,thelostpuppy back cover for apple ipad air (mu...
14995,14995,87bcdd46bb48bfc1045d7ee84aef7b7a,2015-12-20 08:26:17 +0000,http://www.flipkart.com/kenway-retail-brass-co...,Kenway Retail Brass Copper Cuff,Jewellery,BBAEA49HNDNQYGJU,529.0,295.0,"[""http://img5a.flixcart.com/image/bangle-brace...",...,Kenway Retail Brass Copper Cuff\n ...,No rating available,No rating available,Kenway Retail,"{""product_specification""=>[{""key""=>""Collection...",jewellery,1,0.500000,669.0,kenway retail brass copper cuff\n ...
14996,14996,b4fad612a9f72f1ffd10134f9be7cfe8,2015-12-20 08:26:17 +0000,http://www.flipkart.com/thelostpuppy-back-cove...,TheLostPuppy Back Cover for Apple iPad Air 2,Mobiles & Accessories,ACCE9Z2HKHDGH7JY,2199.0,599.0,"[""http://img5a.flixcart.com/image/cases-covers...",...,TheLostPuppy Back Cover for Apple iPad Air 2 (...,No rating available,No rating available,TheLostPuppy,"{""product_specification""=>[{""key""=>""Brand"", ""v...",electronics,1,0.444898,641.0,thelostpuppy back cover for apple ipad air 2 (...
14997,14997,1336909e5468b63c9b1281350eba647d,2015-12-20 08:26:17 +0000,http://www.flipkart.com/kenway-retail-brass-co...,Kenway Retail Brass Copper Cuff,Jewellery,BBAEA49HHKJTPHWV,547.0,322.0,"[""http://img5a.flixcart.com/image/bangle-brace...",...,Kenway Retail Brass Copper Cuff\n ...,No rating available,No rating available,Kenway Retail,"{""product_specification""=>[{""key""=>""Collection...",jewellery,1,0.083929,675.0,kenway retail brass copper cuff\n ...


## **Evaluation of the ML Model on Balanced (Undersampling) testing dataset**

The 6 ML algorithms are then trained on the undersampled balanced dataset and evaluated on the testing dataset. Evaluation metrics are printed in order to ease the process of comparison to find which dataset and model works the best.

In [None]:
x = undersampled_df['cleaned_desc']
y = undersampled_df['primary_categories']

**1) Evaluation using the Logistic Regression Model**

In [None]:
# Handle missing values in 'cleaned_desc' before vectorization
undersampled_df['cleaned_desc'] = undersampled_df['cleaned_desc'].fillna('')  # Replace NaN with empty strings

x = undersampled_df['cleaned_desc']
y = undersampled_df['primary_categories']

# Now proceed with your logistic regression function
print("********** LOGISTIC REGRESSION **********")
y_test, lr_predict, lr_pred_prob = logistic_regression(x,y)

********** LOGISTIC REGRESSION **********
Validation Accuracy:  0.973


*********** CONFUSION MATRIX **************
[[ 184    0    0    6    0    0    0    0    0    0]
 [   1   28    3    0    1    4    0    0    0    1]
 [   0    6 1058    0    0    5    1    0    0    0]
 [   2    0    0  252    4    0    0    0    0    0]
 [   0    0    1    0  212    1    0    0    0    0]
 [   0    2    0    0    0  254    0    0    0    2]
 [   0    0    0    1    4    1  639    0    0    0]
 [   0    0    2    0    7    1    0  137    0    0]
 [   0    0    0    0    0    3    0    0   65    2]
 [   2    0    2    1    8    5    0    1    1   90]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.97      0.97      0.97       190
              babycare       0.78      0.74      0.76        38
              clothing       0.99      0.99      0.99      1070
           electronics       0.97      0.98  

**2) Evaluation using the Multiclass Variant of Logistic Regression Model**

In [None]:
print("********** LOGISTIC REGRESSION MULTICLASS **********")
y_test, lr_predict, lr_pred_prob = logistic_regression_multiclass(x,y)

********** LOGISTIC REGRESSION MULTICLASS **********
Validation Accuracy:  0.983


*********** CONFUSION MATRIX **************
[[ 187    0    0    3    0    0    0    0    0    0]
 [   2   29    3    0    0    3    0    0    0    1]
 [   2    7 1058    0    0    2    1    0    0    0]
 [   2    0    0  256    0    0    0    0    0    0]
 [   0    0    1    0  212    0    0    1    0    0]
 [   0    2    0    0    0  256    0    0    0    0]
 [   0    0    0    2    2    1  640    0    0    0]
 [   0    0    0    0    0    1    0  146    0    0]
 [   0    0    0    0    0    3    0    0   66    1]
 [   1    1    0    0    3    5    0    1    0   99]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.96      0.98      0.97       190
              babycare       0.74      0.76      0.75        38
              clothing       1.00      0.99      0.99      1070
           electronics       0.98 

### **Plotting the ROC Curves for each of the 13 categories and calculating the AUC score for them**

In [None]:
print("TESTING LABELS: {}".format(y_test.shape))
print("PREDICTED LABELS: {}".format(lr_predict.shape))
print("PROBABILITY OF THE PREDICTED LABELS: {}".format(lr_pred_prob.shape))

TESTING LABELS: (3000,)
PREDICTED LABELS: (3000,)
PROBABILITY OF THE PREDICTED LABELS: (3000, 10)


In [None]:
#converting the Test Classes (y_test) from Pandas Series object to Numpy array
y_test = y_test.to_numpy()
length = y_test.shape

#conerting the string classes into the respective numbers based on their mapping as described previously
for i in range(length[0]):
  y_test[i] = reverse_category_mapping[y_test[i]]
  lr_predict[i] = reverse_category_mapping[lr_predict[i]]

print("The last 10 actual labels: {}".format(y_test[:10]))
print("The last 10 predicted labels: {}".format(lr_predict[:10]))

The last 10 actual labels: [4 7 2 1 2 3 1 5 2 1]
The last 10 predicted labels: [4 7 2 1 2 3 1 5 2 1]


In [None]:
#converting the actual test labels into a binary 2d numpy array according to their classes

n_classes = 13
temp_array = [[0 for i in range(n_classes)] for i in range(length[0])]

j=0
for i in y_test:
  temp_array[j][i] = 1
  j+=1

#converting the temporary array into a numpy array
y_test = np.array(temp_array)
print(y_test)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [None]:
plot_roc_curve(y_test, lr_pred_prob)

**Average Area Under the Curve**

The aggregate area under the curve score that is averaged across all of the One VS Rest curves is calculated below.

In [None]:
auc = roc_auc_score(y_test, lr_pred_prob, multi_class='ovr')
print('Average AUC score for all the categories is {}'.format(auc))

Average AUC score for all the categories is 0.5347936541610551


**3) Evaluation using the Multinomial Naive Bayes Classifier**

In [None]:
print("********** NAIVE BAYES CLASSIFIER **********")
y_test, nb_predict, nb_pred_prob = naive_bayes(x,y)

********** NAIVE BAYES CLASSIFIER **********
Validation Accuracy:  0.856


*********** CONFUSION MATRIX **************
[[ 178    0    0    3    0    0    9    0    0    0]
 [   1    0   27    0    0    1    9    0    0    0]
 [   0    0 1064    0    0    0    6    0    0    0]
 [   0    0    1  200    0    0   57    0    0    0]
 [   0    0   37    0   83    0   94    0    0    0]
 [   0    0    8    1    0  242    7    0    0    0]
 [   0    0   11    0    0    2  632    0    0    0]
 [   0    0   26    0    0    1   11  109    0    0]
 [   0    0   12    1    0    5    6    0   46    0]
 [   1    0   15    2    0    8   70    0    0   14]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.99      0.94      0.96       190
              babycare       0.00      0.00      0.00        38
              clothing       0.89      0.99      0.94      1070
           electronics       0.97      0.7

**4) Evaluation using Linear Support Vector Machine**

In [None]:
print("********* LINEAR SVM *********")
y_test, svm_predict = linear_svm(x,y)

********* LINEAR SVM *********
Validation Accuracy:  0.9866666666666667


*********** CONFUSION MATRIX **************
[[ 189    0    0    1    0    0    0    0    0    0]
 [   1   29    3    0    0    4    0    0    0    1]
 [   0    4 1064    0    0    1    1    0    0    0]
 [   2    0    0  256    0    0    0    0    0    0]
 [   0    1    0    1  212    0    0    0    0    0]
 [   0    2    0    0    0  256    0    0    0    0]
 [   0    0    0    2    0    1  642    0    0    0]
 [   0    0    0    0    0    1    0  146    0    0]
 [   0    0    0    0    0    3    0    0   67    0]
 [   1    0    0    1    0    8    0    1    0   99]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.98      0.99      0.99       190
              babycare       0.81      0.76      0.78        38
              clothing       1.00      0.99      1.00      1070
           electronics       0.98      0.99

**5) Evaluation using Decision Trees Classifier**

In [None]:
print("********** DECISION TREES CLASSIFIER *************")
y_test, dtree_predict, dtree_pred_prob = decision_trees(x,y)

********** DECISION TREES CLASSIFIER *************
Validation Accuracy:  0.8243333333333334


*********** CONFUSION MATRIX **************
[[ 169    0    2    0    0    8    0    0    0   11]
 [   0   24   10    1    0    0    0    1    1    1]
 [   0   10 1024    0   21    2    6    6    0    1]
 [   9    0   19  158   56    1    0    3    0   12]
 [   0    0   16    0  191    3    3    0    0    1]
 [   2    3  146    0    2   74    6    1    0   24]
 [   0    0    4    0   55    9  559    2    0   16]
 [   0    0    5    0    9    1    0  130    0    2]
 [   0    1    5    0    0    0    0    0   61    3]
 [   0    1   17    1    3    3    0    0    2   83]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.94      0.89      0.91       190
              babycare       0.62      0.63      0.62        38
              clothing       0.82      0.96      0.88      1070
           electronics 

**6) Evaluation using the Random Forest Classifier**

In [None]:
print("********** RANDOM FOREST CLASSIFIER *************")
y_test, rfc_predict, rfc_pred_prob = random_forest(x,y)

********** RANDOM FOREST CLASSIFIER *************
Validation Accuracy:  0.7936666666666666


*********** CONFUSION MATRIX **************
[[ 181    0    5    1    0    3    0    0    0    0]
 [   0   11   21    0    0    6    0    0    0    0]
 [   0    0 1067    0    0    2    1    0    0    0]
 [   1    0    4  185   58   10    0    0    0    0]
 [   0    0   26    0  188    0    0    0    0    0]
 [   0    1   11    0    0  246    0    0    0    0]
 [   0    0    4    0  350    0  291    0    0    0]
 [   0    0   11    0    8    2    0  126    0    0]
 [   0    0    8    0    0    3    1    0   58    0]
 [   0    0   11    1   55   15    0    0    0   28]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.99      0.95      0.97       190
              babycare       0.92      0.29      0.44        38
              clothing       0.91      1.00      0.95      1070
           electronics  

**7) Evaluation using the K Nearest Neighbours Model**

In [None]:
print("********** K NEAREST NEIGHBOURS *************")
y_test, knn_predict, knn_pred_prob = k_nearest_neighbours(x,y)

********** K NEAREST NEIGHBOURS *************
Validation Accuracy:  0.9433333333333334


*********** CONFUSION MATRIX **************
[[ 183    0    0    3    1    3    0    0    0    0]
 [   0   23    7    3    1    2    2    0    0    0]
 [   0    4 1060    2    1    1    2    0    0    0]
 [   1    0    0  215   42    0    0    0    0    0]
 [   0    0    6    0  206    1    1    0    0    0]
 [   0    1    7    2    3  240    3    0    1    1]
 [   0    0    5    0   14    1  624    1    0    0]
 [   0    0    2    1   10    0    2  132    0    0]
 [   0    2    1    1    0    2    1    0   63    0]
 [   1    1    2    2   12    6    2    0    0   84]]


*********** CLASSIFICATION REPORT **************
                        precision    recall  f1-score   support

            automotive       0.99      0.96      0.98       190
              babycare       0.74      0.61      0.67        38
              clothing       0.97      0.99      0.98      1070
           electronics      