<a href="https://colab.research.google.com/github/KelvinLam05/product_classification/blob/main/product_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Goal of the project**

Assigning products to the right categories is crucial for helping customers find what they are looking for. Online marketplaces therefore commonly use product classification models to make sure that products listed by third parties end up in the right product categories.
In this project, we will use a fastText classifier to predict product categories from product names.

In [266]:
import pandas as pd
import fasttext

**Load the data**

The dataset I’m using is the [PriceRunner](https://www.kaggle.com/datasets/lakritidis/product-classification-and-categorization?select=pricerunner_aggregate.csv) dataset, which is well suited to product classification problems. It includes 35,311 product titles as listed by various online retailers, which map to 12,849 product names assigned to 10 different product categories.

The titles that different vendors use for the same product all differ slightly. Our aim is to predict a product's `category_label` from its product title.

In [267]:
# Load dataset
df = pd.read_csv('/content/pricerunner_aggregate.csv', names = ['product_id', 'product_title', 'vendor_id', 'cluster_id', 'cluster_label', 'category_id', 'category_label'])

In [268]:
df = df[['category_label', 'product_title']]

In [269]:
# Examine the data
df.head()

| | category_label | product_title |
|---|---|---|
| 0 | Mobile Phones | apple iphone 8 plus 64gb silver |
| 1 | Mobile Phones | apple iphone 8 plus 64 gb spacegrau |
| 2 | Mobile Phones | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... |
| 3 | Mobile Phones | apple iphone 8 plus 64gb space grey |
| 4 | Mobile Phones | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... |


In [270]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   category_label  35311 non-null  object
 1   product_title   35311 non-null  object
dtypes: object(2)
memory usage: 551.9+ KB


**Split the training and test data**

Before training our classifier, we need to split the data into a training set and a test set. We will use the test set to evaluate how well the learned classifier performs on new data.

In [271]:
X = df['product_title']

In [272]:
y = df['category_label']

For the train-test split, we pass `stratify = y` so that the training and test sets keep the same proportion of each category label as the full dataset.

In [273]:
from sklearn.model_selection import train_test_split

In [274]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
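To see what stratification does, here is a minimal sketch on made-up data (the titles, labels, and class sizes below are illustrative assumptions, not taken from the PriceRunner file). It compares the label counts in each split after a stratified 80/20 split.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 placeholder titles in a 50/30/20 class mix.
demo_titles = [f"product {i}" for i in range(100)]
demo_labels = ["TVs"] * 50 + ["Fridges"] * 30 + ["CPUs"] * 20

# Same call pattern as above: 20% test split, stratified on the labels.
_, _, y_train_demo, y_test_demo = train_test_split(
    demo_titles,
    demo_labels,
    test_size=0.2,
    random_state=42,
    stratify=demo_labels,
)

train_counts = Counter(y_train_demo)
test_counts = Counter(y_test_demo)

# Both splits should keep the 50/30/20 class mix of the full toy dataset.
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Without `stratify`, the split is purely random, so a small category could end up over- or under-represented in the test set.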

In [275]:
X_train = pd.DataFrame(X_train)

In [276]:
train = X_train.join(y_train)

In [277]:
X_test = pd.DataFrame(X_test)

In [278]:
test = X_test.join(y_test)

**Formatting the labels**

The input to fastText is a list of strings, one per example, each containing the text together with its label. Every label in fastText starts with the `__label__` prefix, which is how fastText tells labels apart from ordinary words. The model is then trained to predict the labels from the words in each example. So we now add `__label__` in front of each category so that fastText reads it as a label, and then combine the label and the product title into a single string.

In [279]:
train['category_label'] = '__label__' + train['category_label'] + ' ' + train['product_title']

In [280]:
test['category_label'] = '__label__' + test['category_label'] + ' ' + test['product_title']
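One detail worth keeping in mind: fastText splits each line on whitespace, so in a string such as `__label__Fridge Freezers lec ...` only `__label__Fridge` is read as the label and `Freezers` is treated as an ordinary word. The sketch below shows one possible way to keep a multi-word category as a single label token by replacing its spaces with underscores. The helper name `to_fasttext_line` is hypothetical and not part of this notebook or of fastText, and applying it would change the files and possibly the scores reported further down.

```python
def to_fasttext_line(label: str, text: str) -> str:
    """Build one fastText line: a single-token label followed by the text."""
    # Assumption: spaces inside a category name become underscores so the
    # whole category survives as one whitespace-delimited token.
    label_token = "__label__" + "_".join(label.split())
    # Collapse any repeated whitespace in the title as well.
    clean_text = " ".join(text.split())
    return f"{label_token} {clean_text}"


# Two rows taken from the previews in this notebook.
demo_rows = [
    ("Fridge Freezers", "lec aff90185s/aff90185s"),
    ("TVs", "samsung ue55hu8500 55 ultra hd 3d smart led tv"),
]
demo_lines = [to_fasttext_line(label, title) for label, title in demo_rows]

for line in demo_lines:
    print(line)
```

Single-word categories such as `TVs` pass through unchanged, so only the multi-word categories are affected.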

In [281]:
train.head()

| | product_title | category_label |
|---|---|---|
| 22570 | candy gvs148dc3b freestanding 8kg washing mach... | __label__Washing Machines candy gvs148dc3b fre... |
| 31978 | liebherr k2330 tall fridge white | __label__Fridges liebherr k2330 tall fridge white |
| 6110 | sharp lc 55cug8362ks 55 inch 4k ultra hd smart... | __label__TVs sharp lc 55cug8362ks 55 inch 4k u... |
| 7588 | samsung ue55hu8500 55 ultra hd 3d smart led tv | __label__TVs samsung ue55hu8500 55 ultra hd 3d... |
| 19133 | siemens iq300 sn436s01me unterbau 13stellen a | __label__Dishwashers siemens iq300 sn436s01me ... |


In [282]:
test.head()

| | product_title | category_label |
|---|---|---|
| 28703 | lec aff90185s/aff90185s | __label__Fridge Freezers lec aff90185s/aff90185s |
| 18841 | siemens geschirrsp ler sn636x01ke | __label__Dishwashers siemens geschirrsp ler sn... |
| 4500 | oled65b8 65 inch oled 4k ultra hd premium smar... | __label__TVs oled65b8 65 inch oled 4k ultra hd... |
| 24503 | bosch gsn36aw31g upright freezer freestanding ... | __label__Freezers bosch gsn36aw31g upright fre... |
| 19305 | nordmende dssn60ix semi integrated dishwasher ... | __label__Dishwashers nordmende dssn60ix semi i... |


**Saving DataFrames as text files**

fastText reads its training data from a plain text file with one example per line, so in the following step we save our DataFrames as text files.

In [283]:
import csv

In [284]:
# Write test and train into files
train.to_csv('train.txt', 
             index = False, 
             sep = ' ',
             header = None, 
             quoting = csv.QUOTE_NONE, 
             quotechar = "", 
             escapechar = " ")

test.to_csv('test.txt', 
            index = False, 
            sep = ' ',
            header = None, 
            quoting = csv.QUOTE_NONE, 
            quotechar = "", 
            escapechar = " ")
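Before training, it can help to check which label tokens a saved file actually contains. The sketch below counts every whitespace-delimited token that starts with `__label__`. The function name `summarize_fasttext_file` and the three demo lines are assumptions for illustration. In practice one would point it at `train.txt` or `test.txt`.

```python
import os
import tempfile
from collections import Counter


def summarize_fasttext_file(path: str, prefix: str = "__label__") -> Counter:
    """Count how often each label token occurs in a fastText-style text file."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Labels are the whitespace-delimited tokens carrying the prefix.
            for token in line.split():
                if token.startswith(prefix):
                    counts[token] += 1
    return counts


# Demo on a small temporary file; these lines are illustrative only.
demo_file_lines = [
    "__label__TVs samsung ue55hu8500 55 ultra hd 3d smart led tv",
    "__label__Fridges liebherr k2330 tall fridge white",
    "__label__Fridge Freezers lec aff90185s/aff90185s",
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(demo_file_lines) + "\n")
    label_counts = summarize_fasttext_file(demo_path)

print(dict(label_counts))
```

In the demo, the last line is counted under `__label__Fridge` rather than a "Fridge Freezers" label, which makes shortened multi-word categories easy to spot.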

**Our classifier**

We are now ready to train our classifier:

In [285]:
model = fasttext.train_supervised('train.txt')

Now let's see how the model does on the test set.

In [286]:
model.test('test.txt')  

(7063, 0.9835763839728161, 0.9835763839728161)

The output is the number of test samples (here 7,063), the precision at one (0.98), and the recall at one (0.98).
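The two scores are identical because each example has exactly one label and the model makes exactly one prediction per example, so both reduce to plain accuracy. The following sketch works this through on toy labels (the helper `precision_recall_at_1` and the demo lists are assumptions for illustration) and then converts the reported score back into a count of correct predictions.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Return (precision@1, recall@1) when every example has one true label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    precision = correct / len(predicted_labels)  # one prediction per example
    recall = correct / len(true_labels)          # one true label per example
    return precision, recall


# Toy example (assumption): three of four predictions are right.
true_demo = ["TVs", "Fridges", "TVs", "CPUs"]
pred_demo = ["TVs", "Fridges", "CPUs", "CPUs"]
p_at_1, r_at_1 = precision_recall_at_1(true_demo, pred_demo)
print(p_at_1, r_at_1)

# Applied to the result above: number of correctly classified test products.
n_samples = 7063
reported_score = 0.9835763839728161
n_correct = round(n_samples * reported_score)
print(n_correct, "of", n_samples, "correct")
```

The two metrics would only diverge if examples carried several labels or if more than one prediction per example were evaluated.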