# Mercor ML Engineer Role Vetting Project

The aim of this project is to build an ML model that takes a product description as input and returns links to the 10 most similar products.

## Import the necessary libraries
For this project, I use pandas to load and clean data stored in CSV format, nltk for its natural-language processing utilities (stop word lists and tokenization), gensim to extract features from the product descriptions, and scikit-learn to randomly shuffle the dataset.

In [46]:
import pandas as pd
import numpy as np
import multiprocessing
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.utils import shuffle

In [6]:
english_stopwords = stopwords.words('english') #gather a list of english stopwords that is built-in to the nltk library.
english_stopwords.extend(['ikat', 'aline', '&']) #extend the list of stopwords to include special words in the dataset.

In [7]:
def remove_stop_words(string):
    """A helper function that removes commonly occurring stop words such as 'and', 'with', 'of', etc.
       from a given string.
       input: a string of words
       returns: the input without stop words."""
    tokens = word_tokenize(str(string))
    tokens_wo_stops = [t for t in tokens if t not in english_stopwords]
    clean_string = " ".join(tokens_wo_stops)
    return clean_string
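
One detail of the helper above is worth illustrating: the membership test `t not in english_stopwords` is case-sensitive, and NLTK's English stop word list is all lowercase, so text needs to be lowercased before the helper is applied. The sketch below reproduces the filtering logic with a tiny hand-written stop list and plain `str.split` in place of `word_tokenize`. Both substitutions, and the names `DEMO_STOPWORDS` and `remove_stop_words_demo`, are assumptions made so the example runs without downloading NLTK data.

```python
# Stand-in stop list; the notebook uses nltk's stopwords.words('english') instead.
DEMO_STOPWORDS = {"and", "with", "of", "the", "&"}

def remove_stop_words_demo(text, stopwords=DEMO_STOPWORDS):
    """Drop tokens found in `stopwords`; a whitespace split stands in for word_tokenize."""
    tokens = str(text).split()
    return " ".join(t for t in tokens if t not in stopwords)

mixed_case = "Striped Shirt With Patch Pocket"
print(remove_stop_words_demo(mixed_case))          # 'With' survives: the match is case-sensitive
print(remove_stop_words_demo(mixed_case.lower()))  # 'with' is removed after lowercasing
```

The sketch keeps the stop words in a `set`, which makes each membership test constant-time. The same change could be applied to `english_stopwords` if the cleaning step turns out to be slow on the full dataset.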

## Myntra Dataset Cleaning
This dataset contains product information from the e-commerce website Myntra, one of India's largest online fashion retail stores. The dataset is available on Kaggle.

https://www.kaggle.com/datasets/manishmathias/myntra-fashion-dataset

In [8]:
dfm = pd.read_csv('../datasets/myntra_dataset.csv', usecols=['URL', 'Description', 'BrandName'])
dfm.head()

| | URL | BrandName | Description |
|---|---|---|---|
| 0 | https://www.myntra.com/jeans/roadster/roadster... | Roadster | roadster men navy blue slim fit mid rise clean... |
| 1 | https://www.myntra.com/track-pants/locomotive/... | LOCOMOTIVE | locomotive men black white solid slim fit tra... |
| 2 | https://www.myntra.com/shirts/roadster/roadste... | Roadster | roadster men navy white black geometric print... |
| 3 | https://www.myntra.com/shapewear/zivame/zivame... | Zivame | zivame women black saree shapewear zi3023core0... |
| 4 | https://www.myntra.com/tshirts/roadster/roadst... | Roadster | roadster women white solid v neck pure cotton ... |


In [9]:
dfm.columns = ['url', 'brand', 'desc'] #rename column names for better accessibility.

In [10]:
dfm['brand'], dfm['desc'] = dfm['brand'].str.lower(), dfm['desc'].str.lower() #convert columns to lower case.

In [11]:
dfm['brand'] = dfm['brand'].str.replace('here&now', 'herenow') #replace a pesky brand name that is not consistent with the rest.

In [12]:
dfm['brand'] = dfm['brand'].str.replace('&', '') #replace the '&' character in brand name with empty string for better handling.
dfm['desc'] = dfm['desc'].str.replace('-', '') #replace commonly occurring hyphen with empty string for better model performance.

In [13]:
dfm['desc'] = dfm.apply(lambda row : row['desc'].replace(str(row['brand']), ''), axis=1) #remove brand name from product descriptions.
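
The cell above removes the brand with a plain substring replace, which also removes the brand wherever it appears inside a longer word. The sketch below contrasts that with a whole-word variant built on `re`. The helper name `strip_brand` and the sample brand and description are illustrative assumptions, not values taken from the dataset.

```python
import re

def strip_brand(desc, brand):
    """Remove `brand` from `desc` only where it appears as a whole word or phrase."""
    if not brand:
        return desc
    pattern = r"\b" + re.escape(brand) + r"\b"
    cleaned = re.sub(pattern, " ", desc)
    return " ".join(cleaned.split())  # collapse the gaps left behind

desc, brand = "max women blue maxi dress", "max"
print(repr(desc.replace(brand, "")))   # plain substring replace, as in the cell above
print(repr(strip_brand(desc, brand)))  # whole-word variant
```

One limitation of this sketch: `\b` only matches next to a letter, digit, or underscore, so a brand name that begins or ends with punctuation would not be stripped by it.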

In [14]:
dfm_copy = dfm.copy()
dfm['desc'] = dfm['desc'].str.replace(r'(\s[A-Za-z]){1}\s{1}(\w+)', r'\1\2', regex=True) #combine commonly occurring words such as 't shirt', 'v neck'.
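
To make the effect of this pattern concrete, the sketch below applies the same regular expression and replacement to plain strings with the standard `re` module. The function name `merge_single_letters` and the sample sentences are assumptions for illustration.

```python
import re

# Same pattern and replacement as the pandas call above, applied to plain strings.
PATTERN = re.compile(r"(\s[A-Za-z]){1}\s{1}(\w+)")

def merge_single_letters(text):
    """Join a standalone single letter to the word that follows it."""
    return PATTERN.sub(r"\1\2", text)

print(merge_single_letters("women white solid v neck t shirt"))  # joins 'v neck' and 't shirt'
print(merge_single_letters("men shirt with a collar"))           # also joins 'a collar'
```

Two things follow from the pattern as written. It requires whitespace before the single letter, so a description that starts with `t shirt` is left unchanged. It also merges any standalone letter, including the article `a`, and because stop words are removed only after this step, a merged token such as `acollar` would no longer be filtered out.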

In [15]:
dfm_copy = dfm.copy() #create a backup in case something goes wrong.

In [52]:
dfm['desc'] = dfm['desc'].apply(remove_stop_words) #remove stop words in product descriptions using the remove_stop_words function.

In [21]:
dfm.head()

| | url | brand | desc |
|---|---|---|---|
| 500 | https://www.myntra.com/shirts/highlander/highl... | highlander | men black slim fit casual shirt |
| 501 | https://www.myntra.com/tops/pluss/pluss-women-... | pluss | women blue printed peplum top |
| 502 | https://www.myntra.com/dresses/pluss/pluss-wom... | pluss | women red embroidered fit flare dress |
| 503 | https://www.myntra.com/shorts/dressberry/dress... | dressberry | women stylish black solid fringed shorts |
| 504 | https://www.myntra.com/dresses/street-9/street... | street 9 | navy blue one shoulder fit flare dress |
| ... | ... | ... | ... |
| 595 | https://www.myntra.com/trousers/invictus/invic... | invictus | black slim fit formal trousers |
| 596 | https://www.myntra.com/jackets/hrx-by-hrithik-... | hrx by hrithik roshan | hrx active by hrithik roshan men grey solid sp... |
| 597 | https://www.myntra.com/jackets/highlander/high... | highlander | men blue solid denim jacket jacket |
| 598 | https://www.myntra.com/shirts/highlander/highl... | highlander | men rust slim fit solid casual shirt |


## Ajio Dataset Cleaning
This dataset contains product information from the e-commerce website Ajio, another large online fashion retailer in India, owned by the Reliance corporation. This dataset is also available on Kaggle.

https://www.kaggle.com/datasets/manishmathias/ajio-clothing-fashion

In [57]:
dfa = pd.read_csv('../datasets/ajio_dataset.csv', usecols=['Product_URL', 'Description', 'Color', 'Category_by_gender'])
dfa.head()

| | Product_URL | Description | Category_by_gender | Color |
|---|---|---|---|---|
| 0 | https://www.ajio.com/netplay-checked-polo-t-sh... | Checked Polo T-shirt | Men | white |
| 1 | https://www.ajio.com/netplay-tapered-fit-flat-... | Tapered Fit Flat-Front Trousers | Men | navy |
| 2 | https://www.ajio.com/the-indian-garage-co-stri... | Striped Slim Fit Shirt with Patch Pocket | Men | white |
| 3 | https://www.ajio.com/performax-heathered-crew-... | Heathered Crew-Neck T-shirt | Men | charcoal |
| 4 | https://www.ajio.com/john-players-jeans-washed... | Washed Skinny Fit Jeans with Whiskers | Men | jetblack |


In [59]:
dfa.columns = ['url', 'desc', 'gender', 'color'] #change column names for better accessibility.

In [60]:
dfa['desc'], dfa['gender'], dfa['color'] = dfa['desc'].str.lower(), dfa['gender'].str.lower(), dfa['color'].str.lower() #convert columns to lower case.

In [62]:
dfa_copy = dfa.copy() #create a copy in case something goes wrong.

In [63]:
dfa['desc'] = dfa['desc'].str.replace('-', '') #replace commonly occurring hyphen with empty string for better model performance.

In [66]:
dfa['desc'] = dfa['desc'].str.replace('dresss', 'dress') #correct the spelling for dress.

In [68]:
#combine the three columns into a single product description in the same format as the myntra dataset descriptions.
dfa['desc'] = dfa['gender']+' '+dfa['color']+' '+dfa['desc']

In [70]:
dfa['desc'] = dfa['desc'].apply(remove_stop_words) #remove stop words in the product descriptions using the remove_stop_words function.

In [72]:
dfa.head()

| | url | desc | gender | color |
|---|---|---|---|---|
| 0 | https://www.ajio.com/netplay-checked-polo-t-sh... | men white checked polo tshirt | men | white |
| 1 | https://www.ajio.com/netplay-tapered-fit-flat-... | men navy tapered fit flatfront trousers | men | navy |
| 2 | https://www.ajio.com/the-indian-garage-co-stri... | men white striped slim fit shirt patch pocket | men | white |
| 3 | https://www.ajio.com/performax-heathered-crew-... | men charcoal heathered crewneck tshirt | men | charcoal |
| 4 | https://www.ajio.com/john-players-jeans-washed... | men jetblack washed skinny fit jeans whiskers | men | jetblack |


## Dataset Merging & Final Cleaning

In [75]:
df_full = pd.concat([dfm[['url', 'desc']], dfa[['url', 'desc']]], axis=0) #merge the two datasets into one dataframe, named df_full as in the cells below.

In [77]:
df_full.to_csv('../datasets/mercor_final_production_dataset_v1.csv', index_label='pid') #save the dataframe to csv as backup.

In [1]:
#replace any alphanumeric strings or numbers with empty string for better model performance.
df_full['desc'] = df_full['desc'].str.replace(r'\b\w*\d\w*\b', '', regex=True) 
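
As a quick check of what this pattern removes, the sketch below applies the same regular expression to plain strings with the standard `re` module. The function name `drop_digit_tokens` and the whitespace tidy-up are additions for illustration and are not part of the cell above.

```python
import re

TOKEN_WITH_DIGIT = re.compile(r"\b\w*\d\w*\b")  # same pattern as the cell above

def drop_digit_tokens(text):
    """Remove every token that contains at least one digit."""
    stripped = TOKEN_WITH_DIGIT.sub("", text)
    return " ".join(stripped.split())  # extra step: tidy leftover whitespace

print(drop_digit_tokens("women black saree shapewear zi3023core0"))
print(drop_digit_tokens("men pack of 2 solid tshirts"))
```

The pattern removes product codes such as `zi3023core0` as well as bare numbers. The leftover double spaces are harmless in this notebook because the descriptions are tokenized again before training.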

## Model Training & Testing
Once the dataset has been cleaned and prepared, a doc2vec model is trained on it. Doc2vec extends word2vec: where word2vec learns a vector for each word, doc2vec also learns a vector for each sentence or document, so the similarity between two product descriptions can be computed directly from their vectors. For more information, refer to the following paper.

Le, Q. V., & Mikolov, T. (2014). Distributed Representations of Sentences and Documents. arXiv:1405.4053. https://arxiv.org/abs/1405.4053

In [None]:
df_shuf = shuffle(df_full) #randomly shuffle the rows of the dataframe.
df_shuf = df_shuf.reset_index(drop=True) #reset the index so that row positions run from 0 upwards.

In [None]:
df_shuf.head()

In [None]:
list_desc = list(df_shuf['desc']) #convert the Series of product descriptions into a python list.

In [None]:
#Convert the list into a model suitable format for training.
tag_data = [TaggedDocument(words=word_tokenize(str(_d).lower()), tags=[str(i)]) for i, _d in enumerate(list_desc)]
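
The sketch below shows the shape of the data this cell produces and why the tags matter later. It uses a `namedtuple` with the same two fields as gensim's `TaggedDocument` (`words` and `tags`) and `str.split` in place of `word_tokenize`. Both are stand-ins assumed here so the example runs without gensim or NLTK, and the name `TaggedDoc` is hypothetical.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (words, tags).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

descriptions = [
    "men black slim fit casual shirt",
    "women blue printed peplum top",
]

# str.split stands in for word_tokenize; the tag is the row position as a string.
tagged = [TaggedDoc(words=d.lower().split(), tags=[str(i)]) for i, d in enumerate(descriptions)]

for doc in tagged:
    print(doc.tags, doc.words)

# A tag returned by a similarity query maps back to its row via int(tag).
print(descriptions[int(tagged[1].tags[0])])
```

Because each tag is the row position in the shuffled dataframe, the lookup only works against the same `df_shuf` that was used for training. Reshuffling the data changes which row each tag refers to.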

In [None]:
#initialize the model hyperparameters.
workers = multiprocessing.cpu_count() #the number of CPU cores for training
epochs = 30 #the number of passes over the training data.
vector_size = 12 #the dimensionality of the vector learned for each description.
min_count = 2 #words that occur fewer than this many times in the dataset are ignored.

In [None]:
model = Doc2Vec(vector_size=vector_size, min_count=min_count, epochs=epochs, workers=workers) #initialize the model.

In [None]:
model.build_vocab(tag_data) #build a vocabulary that is required by the model.
#the vocabulary records each unique word in our dataset together with its number of occurrences.
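
As a rough illustration of how `min_count` affects the vocabulary, the sketch below counts words across a toy corpus and keeps only those that reach the threshold. It uses the standard library's `Counter` and is a simplified picture of the frequency pruning, not gensim's actual implementation. The toy corpus is made up from descriptions shown earlier.

```python
from collections import Counter

corpus = [
    ["men", "black", "slim", "fit", "casual", "shirt"],
    ["men", "rust", "slim", "fit", "solid", "casual", "shirt"],
    ["women", "blue", "printed", "peplum", "top"],
]
min_count = 2  # same value as in the hyperparameter cell

# Count every token across all documents, then keep the frequent ones.
counts = Counter(token for doc in corpus for token in doc)
vocab = sorted(word for word, n in counts.items() if n >= min_count)
print(vocab)
```

With a threshold of 2, every word that appears only once is dropped, so the third description contributes no words to the vocabulary in this toy example.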

In [None]:
model.train(tag_data, total_examples=model.corpus_count, epochs=model.epochs) #train the model on the tagged data with the hyperparameters.

In [None]:
model.save('mercor_fashion_model.model') #after model training, save the model

## Testing the model

In [3]:
test_model = Doc2Vec.load('mercor_fashion_model.model') #load the saved model.

In [4]:
test_string = "women lemon graphic print roundneck top"
test_string_lst = test_string.split()

In [6]:
vec_string = test_model.infer_vector(test_string_lst) #get the vector representation of our test string from the model.

In [8]:
ranks = test_model.dv.most_similar(vec_string) #returns (tag, similarity) pairs for the 10 most similar descriptions in the dataset (topn defaults to 10).

In [9]:
ranks

[('652492', 0.9663132429122925),
 ('821398', 0.9590542912483215),
 ('688935', 0.9586809277534485),
 ('573220', 0.9551078081130981),
 ('739014', 0.9541603922843933),
 ('572264', 0.9533801674842834),
 ('148495', 0.9511833786964417),
 ('800582', 0.9510716795921326),
 ('281112', 0.9483296275138855),
 ('621408', 0.9480392336845398)]
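
The second element of each pair is a cosine similarity between the inferred query vector and a stored document vector. The sketch below shows that ranking in plain Python on made-up two-dimensional vectors. The function names `cosine_similarity` and `top_n` and the vectors themselves are assumptions for illustration, and gensim's own implementation is vectorized rather than written this way.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n(query, doc_vectors, n=10):
    """Return the n (tag, similarity) pairs with the highest similarity."""
    scored = [(tag, cosine_similarity(query, vec)) for tag, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Toy document vectors keyed by tag, mirroring the string tags used in training.
doc_vectors = {"0": [1.0, 0.0], "1": [0.6, 0.8], "2": [0.0, 1.0]}
print(top_n([1.0, 0.0], doc_vectors, n=2))
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why scores close to 1 indicate descriptions the model places in nearly the same direction as the query.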

In [12]:
for ind, similarity in ranks:
    print(df_shuf.iloc[int(ind)])

url     https://www.ajio.com/kotty-colourblock-round-n...
desc              women pink colourblock roundneck tshirt
Name: 652492, dtype: object
url     https://www.ajio.com/ovs-textured-panelled-rou...
desc       women burgundy textured panelled roundneck top
Name: 821398, dtype: object
url     https://www.ajio.com/fusion-geometric-print-st...
desc    women grey geometric print straight kurta high...
Name: 688935, dtype: object
url     https://www.ajio.com/project-eve-round-neck-to...
desc        women rust roundneck top placement embroidery
Name: 573220, dtype: object
url     https://www.ajio.com/project-eve-sleeveless-po...
desc              women red sleeveless polkadot print top
Name: 739014, dtype: object
url     https://www.myntra.com/dresses/mayra/mayra-pin...
desc               pink animal print ruffles detail dress
Name: 572264, dtype: object
url     https://www.ajio.com/style-quotient-animal-pri...
desc                  women brown animal print fitted top
Name: 148495, dtype: