In [None]:
# Papers

""" 
1. Don’t Classify, Translate

In this paper, researchers from the National University of Singapore and the Rakuten Institute of Technology propose and explain a novel machine translation approach to product categorization.
Their method converts a product’s description into a sequence of tokens, which represent a root-to-leaf path to the correct category.
Using this method, they are also able to propose meaningful new paths in the taxonomy.
"""
""" 
2. Multi-Label Product Categorization Using Multi-Modal Fusion Models

In this paper, researchers from New York University and U.S. Bank investigate multi-modal approaches to categorizing products on Amazon.
Their approach uses separate classifiers, each trained on one type of input data from the product listings.
Using a dataset of 9.4 million Amazon products, they developed a tri-modal model for product classification based on product images, titles, and descriptions.
Their tri-modal late fusion model achieves an F1 score of 88.2%.
"""
""" 
3. Bag of Tricks for Efficient Text Classification

This paper explores a simple and efficient baseline for text classification.
The authors' experiments show that their fast text classifier, fastText, is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation.
They train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.
"""
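
The late-fusion idea from paper 2 can be illustrated with a minimal sketch. It assumes that each modality-specific classifier (image, title, description) outputs one probability per label and that fusion is a plain average followed by a threshold; the paper's actual fusion policy may differ. The function name `late_fusion`, the labels, and the scores are made up for illustration.

```python
def late_fusion(per_modality_probs, threshold=0.5):
    """Average per-label probabilities across modalities, then threshold.

    per_modality_probs: dict mapping modality name -> {label: probability}.
    Returns the fused scores and the sorted labels whose score passes the threshold.
    """
    labels = sorted({label for probs in per_modality_probs.values() for label in probs})
    n_modalities = len(per_modality_probs)
    fused = {
        label: sum(probs.get(label, 0.0) for probs in per_modality_probs.values()) / n_modalities
        for label in labels
    }
    predicted = [label for label in labels if fused[label] >= threshold]
    return fused, predicted


# Hypothetical classifier outputs for a single product
scores = {
    "image": {"Boots": 0.9, "Shoes": 0.6, "Jackets": 0.1},
    "title": {"Boots": 0.8, "Shoes": 0.3, "Jackets": 0.0},
    "description": {"Boots": 0.7, "Shoes": 0.9, "Jackets": 0.2},
}
fused, predicted = late_fusion(scores)
print(predicted)
```

Because every label is thresholded independently, a product can receive several labels at once, which is what makes the setup multi-label.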

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from colour import Color
import re
%matplotlib inline

In [49]:
# Reading data from .csv file
dframe = pd.read_csv('productsfull.csv')
dframe.head()

Unnamed: 0,colorname,desc2,description,fit,fitinfo,fulldescription,itemid,name,url
0,[],[],"[""The world's #1-selling hunting knife in its ...",[],[],"[""The world's #1-selling hunting knife in its ...",['219529'],Buck 110 Folding Hunter's Knife,https://www.llbean.com/llb/shop/38592?page=buc...
1,"['Bright Sapphire', 'Cactus', 'Night', 'Teal S...",[],"[""We've combined the freedom of a sandal with ...",[],['Half sizes order down.'],"[""We've combined the freedom of a sandal with ...",['504069'],Kids' Explorer Sandals,https://www.llbean.com/llb/shop/120346?page=ki...
2,['Mossy Oak Country'],"[""Maine State Game Wardens are the pinnacle of...","[""This is the ultimate sportsman's day pack. D...",[],[],"[""This is the ultimate sportsman's day pack. D...",['505356'],"Maine Warden's Day Pack, Camo",https://www.llbean.com/llb/shop/121924?page=ma...
3,['Owl'],[],['These fur-ocious kids’ animal slippers will ...,[],"['Half sizes order up.', 'Sized for toddlers.']",['These fur-ocious kids’ animal slippers will ...,['219943'],Toddlers' Animal Paws Slippers,https://www.llbean.com/llb/shop/38106?page=tod...
4,['Dark Gray Multi'],[],['Our Baby Bogs set the standard in toddlers’ ...,[],[],['Our Baby Bogs set the standard in toddlers’ ...,['306187'],"Toddlers' Baby Bogs Boots, Classic Dino",https://www.llbean.com/llb/shop/118942?page=to...


In [50]:
# Dropping useless columns
df = dframe.drop(['desc2', 'fit', 'fitinfo', 'fulldescription', 'itemid', 'name', 'url'], axis=1)
df.head()

Unnamed: 0,colorname,description
0,[],"[""The world's #1-selling hunting knife in its ..."
1,"['Bright Sapphire', 'Cactus', 'Night', 'Teal S...","[""We've combined the freedom of a sandal with ..."
2,['Mossy Oak Country'],"[""This is the ultimate sportsman's day pack. D..."
3,['Owl'],['These fur-ocious kids’ animal slippers will ...
4,['Dark Gray Multi'],['Our Baby Bogs set the standard in toddlers’ ...


In [51]:
# Check whether a string is a color name that the `colour` package recognizes
def check_color(col):
    if col == "":
        return False
    try:
        Color(col)  # raises ValueError for unknown color names
        return True
    except ValueError:
        return False

In [54]:
# Extracting the color names mentioned in each row's description
# (the column is selected by name; positional indexing with [1] on a row is deprecated in recent pandas)
colnames = []
for text in df['description']:
    words = re.findall(r'\w+', text.lower())
    colnames.append([w for w in words if check_color(w)])

In [55]:
colnames_from_df = df['colorname'].apply(lambda x: re.findall(r'\w+', x.lower()))

In [58]:
# Finding intersection between colorname column and our extracted color names from description column
compared = []
for i in range(len(colnames)):
    if not colnames[i]:
        continue
    intersect = set(colnames[i]).intersection(colnames_from_df[i])
    if intersect:
        compared.append((i, list(intersect)[0])) # adding row indexes
        

In [59]:
compared

[(172, 'red'),
 (267, 'blue'),
 (319, 'black'),
 (624, 'red'),
 (630, 'blue'),
 (667, 'red'),
 (723, 'orange'),
 (733, 'orange'),
 (736, 'orange'),
 (748, 'orange'),
 (763, 'snow'),
 (781, 'snow'),
 (1007, 'black'),
 (1094, 'orange'),
 (1162, 'black'),
 (1254, 'red'),
 (1290, 'snow'),
 (1306, 'snow'),
 (1331, 'orange'),
 (1341, 'orange'),
 (1355, 'orange'),
 (1455, 'red'),
 (1674, 'indigo'),
 (1794, 'indigo'),
 (1840, 'indigo'),
 (1848, 'indigo'),
 (1958, 'black'),
 (2249, 'navy'),
 (2569, 'black'),
 (2783, 'red'),
 (2871, 'orange'),
 (3111, 'linen'),
 (3212, 'white'),
 (3216, 'blue'),
 (3435, 'indigo'),
 (3461, 'red'),
 (3817, 'red'),
 (3928, 'khaki'),
 (3946, 'indigo'),
 (4271, 'black'),
 (4369, 'black'),
 (4907, 'black'),
 (5079, 'black'),
 (5373, 'khaki'),
 (5379, 'khaki'),
 (5424, 'indigo'),
 (5512, 'indigo'),
 (6579, 'indigo'),
 (6646, 'orange'),
 (6892, 'khaki'),
 (7435, 'khaki'),
 (7498, 'khaki'),
 (7516, 'khaki'),
 (7571, 'khaki'),
 (7586, 'red'),
 (8499, 'navy'),
 (8786, 'blu
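
`list(intersect)[0]` keeps only one shared color per row, and which one it keeps depends on set ordering. A minimal sketch, assuming we would rather keep every shared color in a stable order; `shared_colors` and the sample lists are hypothetical and not part of the notebook.

```python
def shared_colors(from_description, from_colorname):
    """Return (row_index, sorted shared colors) for each row with at least one match."""
    matches = []
    for i, (desc_colors, listed_colors) in enumerate(zip(from_description, from_colorname)):
        common = sorted(set(desc_colors) & set(listed_colors))
        if common:
            matches.append((i, common))
    return matches


# Made-up stand-ins for `colnames` and `colnames_from_df`
desc = [["red", "blue"], [], ["khaki"]]
listed = [["blue", "red", "navy"], ["black"], ["olive"]]
result = shared_colors(desc, listed)
print(result)  # [(0, ['blue', 'red'])]
```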

In [6]:
#!pip install fasttext

In [349]:
# Reading second .csv file

df_classified = pd.read_csv('productsclassified.csv')
df_classified.drop(['desc2', 'fit', 'fitinfo', 'fulldescription', 'itemid', 'name', 'url', 'colorname'], axis=1, inplace=True)

In [350]:
# Train-test splitting our data

from sklearn.model_selection import train_test_split
import csv

train, test = train_test_split(df_classified, test_size=0.2, shuffle=False)
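
With `shuffle=False` the split follows the file order, so it can be worth checking whether every class in the test part also occurs in the training part. A minimal sketch under that assumption; `unseen_labels` and the label list are hypothetical.

```python
def unseen_labels(train_labels, test_labels):
    """Return the sorted test labels that never occur among the training labels."""
    return sorted(set(test_labels) - set(train_labels))


# Made-up labels in file order, cut 80/20 without shuffling
labels = ["Boots", "Jackets", "Shoes", "Boots", "Pillows", "Shirts", "Dresses"]
cut = int(len(labels) * 0.8)
missing = unseen_labels(labels[:cut], labels[cut:])
print(missing)  # ['Dresses', 'Shirts']
```

In the notebook this could be called as `unseen_labels(train['Classification'], test['Classification'])`; a non-empty result would mean the model is tested on classes it never saw.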

In [351]:
train.head()

Unnamed: 0,description,Classification
0,['These waterproof hiking boots for men are ru...,Boots
1,"[""The next level of weather protection. This l...",Jackets
2,['Great grip and extra breathability make thes...,Shoes
3,"[""This lightweight fly rod delivers outstandin...",Fishing Tools
4,"['Add a pop of paddling fun to your bed, chair...",Pillows


In [354]:
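# Building fastText training lines of the form "__label__<Class> <description>"
# (whitespace is removed from class names; x[2:-2] strips the surrounding [' and '] from each description)

labeled_data = "__label__" + train['Classification'].apply(lambda x: "".join(x.split())) + " " + train['description'].apply(lambda x: x[2:-2])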

In [355]:
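# Writing the labeled data to a .txt file, one example per line
# (with QUOTE_NONE and a space as escapechar, every space inside a line appears to be written twice;
# fastText splits on whitespace, so this should be harmless)

labeled_data.to_csv('train.txt', index=False, sep=' ', header=None, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")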
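
fastText's supervised mode reads one example per line, with each label marked by the `__label__` prefix. A minimal sketch of writing such a file with plain file I/O instead of `to_csv`, which avoids the CSV quoting and escaping options altogether; `to_fasttext_line` and the sample rows are hypothetical.

```python
import os
import tempfile


def to_fasttext_line(label, text):
    """Format one example as '__label__<Label> <text>' on a single line."""
    clean_label = "".join(label.split())  # 'Fishing Tools' -> 'FishingTools'
    clean_text = " ".join(text.split())   # collapse runs of whitespace, including newlines
    return f"__label__{clean_label} {clean_text}"


# Made-up (label, description) pairs standing in for the notebook's train rows
rows = [
    ("Fishing Tools", "A lightweight   fly rod\nfor small streams."),
    ("Boots", "Waterproof hiking boots."),
]
lines = [to_fasttext_line(label, text) for label, text in rows]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    with open(path, encoding="utf-8") as f:
        written = f.read().splitlines()

print(written)
```

Collapsing whitespace matters because a newline inside a description would otherwise split one example into two lines.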

In [358]:
# Importing and training our model

import fasttext
model = fasttext.train_supervised(input='train.txt', epoch=2000, wordNgrams=2, lr=1)

In [359]:
# Predicting for a random description

model.predict('These  waterproof  hiking  boots  for  men  are  rugged  enough  for  peak  performance  yet  light  and  quick  enough  to  keep  feet  from  feeling  weighed  down.')

(('__label__Boots',), array([0.99993289]))

In [360]:
# Stripping the surrounding [' and '] from the test descriptions

test = test.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
test['description'] = test['description'].apply(lambda x: x[2:-2])

In [361]:
# Normalizing the known and predicted class names so they can be compared

prefix_len = len('__label__')  # fastText prepends this prefix to every predicted label
test['Classified'] = test['description'].apply(lambda x: model.predict(x)[0][0][prefix_len:])
test['Classification'] = test['Classification'].apply(lambda x: ''.join(x.split()))

In [362]:
test.head(18)

Unnamed: 0,description,Classification,Classified
197,Our popular men's long-sleeve quarter-zip cycl...,Shirts,Shirts
198,"A technical ultralight pack, the Osprey Atmos ...",Bags,Bags
199,"Perfectly cool and casual, this striped linen ...",Dresses,Bags
200,With their athletic profile and heavy-duty tre...,Shoes,Shoes
201,Few things are as rewarding as trading in your...,Slippers,Pants
202,"Field tested in a variety of situations, from ...",FishingTools,FishingTools
203,"100% unshrinkable, and exclusively ours. This ...",Shirts,Shirts
204,"Now lighter, these are the perfect trail shoes...",Shoes,Shoes
205,We've put a colorful spin on a perennial favor...,Boots,Jackets
206,A lightweight three-season sleeping pad that o...,SleepingBags,Shorts


In [363]:
# Computing the model's accuracy on the test set

correct = (test['Classified'] == test['Classification']).sum()
print(f"Accuracy: {correct} / {test.shape[0]}")

Accuracy: 23 / 50
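
A single overall number hides which classes the model gets wrong. A minimal sketch of a per-class breakdown, using only the ten rows shown in the `test.head(18)` output above; `per_class_accuracy` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def per_class_accuracy(true_labels, predicted_labels):
    """Return {label: (correct, total)} for every label that occurs in true_labels."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: (correct[label], totals[label]) for label in sorted(totals)}


# The ten (Classification, Classified) pairs visible in the output above
true = ["Shirts", "Bags", "Dresses", "Shoes", "Slippers",
        "FishingTools", "Shirts", "Shoes", "Boots", "SleepingBags"]
pred = ["Shirts", "Bags", "Bags", "Shoes", "Pants",
        "FishingTools", "Shirts", "Shoes", "Jackets", "Shorts"]

report = per_class_accuracy(true, pred)
for label, (hits, total) in report.items():
    print(f"{label}: {hits} / {total}")
```

In the notebook the same helper could be called with `test['Classification']` and `test['Classified']`.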


In [None]:
"""
You have successfully created an ML model that performs well, and it has been in use for 2 years.
Then the model's performance suddenly drops. What could have caused this?
Could you have prevented it? If yes, how? If not, what can you do now?
"""

"""
The degradation of a model's performance over time is commonly called concept drift (or model drift). The main cause is training data that was incomplete or has become outdated.
Trends and worldwide events such as crises or pandemics can change the data, and there is no way to avoid them.
A solution is to refresh the old data with new data and train a new model on it.
"""
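
Retraining on fresh data presupposes that the drop is noticed in the first place. A minimal sketch of one way to detect it, assuming true labels for recent predictions become available; the class `AccuracyMonitor`, the window size, and the threshold are hypothetical choices.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent `window` labeled predictions."""

    def __init__(self, window=100, min_accuracy=0.8):
        self.results = deque(maxlen=window)  # old results fall out automatically
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        self.results.append(predicted == actual)

    def accuracy(self):
        return sum(self.results) / len(self.results) if self.results else None

    def needs_retraining(self):
        """True once the window is full and its accuracy is below the threshold."""
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.accuracy() < self.min_accuracy


monitor = AccuracyMonitor(window=4, min_accuracy=0.75)
for predicted, actual in [("Boots", "Boots"), ("Shoes", "Shoes"),
                          ("Bags", "Bags"), ("Shirts", "Shirts")]:
    monitor.record(predicted, actual)
healthy = monitor.needs_retraining()

# Two recent mistakes push the last four results down to 2 correct out of 4
monitor.record("Bags", "Dresses")
monitor.record("Pants", "Slippers")
drifted = monitor.needs_retraining()
print(healthy, drifted)
```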