# RECOMMENDATION SYSTEM
Dealing with Research Environment

## Problem Statement
This project recommends Myntra fashion products similar to the product a user selects, which makes it a content-based recommendation problem. The dataset has the following variables: index, product ID, title, price in INR, colour name, brand name, image URL, count of ratings, average rating, product description and product attributes.

## Data Collection
The dataset was collected from Kaggle.

## Importing the libraries

In [1]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Importing the dataset

In [2]:
fashion = pd.read_csv('Fashion Dataset.csv')

## Showing Top 5 Records 

In [3]:
fashion.head()

Unnamed: 0.1,Unnamed: 0,p_id,name,price,colour,brand,img,ratingCount,avg_rating,description,p_attributes
0,0,17048614.0,Khushal K Women Black Ethnic Motifs Printed Ku...,5099.0,Black,Khushal K,http://assets.myntassets.com/assets/images/170...,4522.0,4.418399,Black printed Kurta with Palazzos with dupatta...,"{'Add-Ons': 'NA', 'Body Shape ID': '443,333,32..."
1,1,16524740.0,InWeave Women Orange Solid Kurta with Palazzos...,5899.0,Orange,InWeave,http://assets.myntassets.com/assets/images/165...,1081.0,4.119334,Orange solid Kurta with Palazzos with dupatta<...,"{'Add-Ons': 'NA', 'Body Shape ID': '443,333,32..."
2,2,16331376.0,Anubhutee Women Navy Blue Ethnic Motifs Embroi...,4899.0,Navy Blue,Anubhutee,http://assets.myntassets.com/assets/images/163...,1752.0,4.16153,Navy blue embroidered Kurta with Trousers with...,"{'Add-Ons': 'NA', 'Body Shape ID': '333,424', ..."
3,3,14709966.0,Nayo Women Red Floral Printed Kurta With Trous...,3699.0,Red,Nayo,http://assets.myntassets.com/assets/images/147...,4113.0,4.088986,Red printed kurta with trouser and dupatta<br>...,"{'Add-Ons': 'NA', 'Body Shape ID': '333,424', ..."
4,4,11056154.0,AHIKA Women Black & Green Printed Straight Kurta,1350.0,Black,AHIKA,http://assets.myntassets.com/assets/images/110...,21274.0,3.978377,"Black and green printed straight kurta, has a ...","{'Body Shape ID': '424', 'Body or Garment Size..."


## Getting the shape of the dataset

In [4]:
fashion.shape

(14330, 11)

## Insights
- We have 14330 Observations and 11 Features in the dataset

## Dataset Information


- Unnamed: 0 = A serial number that uniquely identifies each observation.
- p_id = The product ID, which uniquely identifies each product; we will use it to fetch the product image in the development environment.
- name = The title of the fashion product.
- price = The price of the product in INR.
- colour = The colour name of the fashion product.
- brand = The brand name of the fashion product.
- img = The image URL of the fashion product.
- ratingCount = The number of ratings users have given the fashion product.
- avg_rating = The average rating of the fashion product.
- description = A brief description of the fashion product; it contains some HTML elements.
- p_attributes = The product attributes, such as Add-Ons or whether the product comes with a dupatta.

## Dropping the Index Feature
The index column only numbers the observations and carries no information useful for recommendation, so we drop it.

In [5]:
fashion.drop(["Unnamed: 0"],axis=1,inplace=True)

## Checking the existence of missing values

In [6]:
fashion.isna().sum()

p_id              18
name              18
price             18
colour            21
brand             18
img               18
ratingCount     7749
avg_rating      7749
description       18
p_attributes      18
dtype: int64

## There are missing values in the dataset

In [7]:
# Columns that contain at least one missing value
features_with_na = [feature for feature in fashion.columns if fashion[feature].isnull().sum() > 0]

for feature in features_with_na:
    # isnull().mean() is a fraction, so scale by 100 to report a percentage
    print(feature, np.round(fashion[feature].isnull().mean() * 100, 2), '% missing values')

p_id 0.13 % missing values
name 0.13 % missing values
price 0.13 % missing values
colour 0.15 % missing values
brand 0.13 % missing values
img 0.13 % missing values
ratingCount 54.08 % missing values
avg_rating 54.08 % missing values
description 0.13 % missing values
p_attributes 0.13 % missing values


In [8]:
features_with_na

['p_id',
 'name',
 'price',
 'colour',
 'brand',
 'img',
 'ratingCount',
 'avg_rating',
 'description',
 'p_attributes']

## Planning to deal with missing values

- We are building a content-based recommendation system, so features like ratingCount and avg_rating are not essential; since more than half of their values are missing, we drop both columns.
- After dropping them, the remaining features have very few missing values (at most 21 rows each), so we simply drop those rows; this will not noticeably affect the system.

In [9]:
fashion.drop(["ratingCount","avg_rating"],axis=1,inplace=True)

In [10]:
fashion.dropna(inplace=True)

In [11]:
fashion.isna().sum()

p_id            0
name            0
price           0
colour          0
brand           0
img             0
description     0
p_attributes    0
dtype: int64

In [12]:
fashion.shape

(14309, 8)
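
The two-step plan above (drop the sparsely populated rating columns, then drop the few remaining incomplete rows) can be sketched on a toy frame. The frame `toy` and its values are made up for illustration; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "p_id": [1.0, 2.0, np.nan, 4.0],
    "colour": ["Black", None, "Red", "Blue"],
    "ratingCount": [10.0, np.nan, np.nan, np.nan],
    "avg_rating": [4.2, np.nan, np.nan, np.nan],
})

# Share of missing values per column, as a percentage.
missing_pct = (toy.isna().mean() * 100).round(2)

# Step 1: drop columns that are mostly empty and not needed for content-based matching.
cleaned = toy.drop(columns=["ratingCount", "avg_rating"])

# Step 2: drop the few rows that still contain a missing value.
cleaned = cleaned.dropna()
```

Dropping the columns first matters: calling `dropna()` before that step would discard every row that lacks a rating, which is more than half of the real dataset.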

## Checking the duplicates

In [13]:
fashion.duplicated().sum()

np.int64(44)

In [14]:
fashion.drop_duplicates(inplace=True)

In [15]:
fashion.duplicated().sum()

np.int64(0)

In [16]:
fashion.shape

(14265, 8)

## Checking data types

In [17]:
fashion.info()

<class 'pandas.core.frame.DataFrame'>
Index: 14265 entries, 0 to 14329
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   p_id          14265 non-null  float64
 1   name          14265 non-null  object 
 2   price         14265 non-null  float64
 3   colour        14265 non-null  object 
 4   brand         14265 non-null  object 
 5   img           14265 non-null  object 
 6   description   14265 non-null  object 
 7   p_attributes  14265 non-null  object 
dtypes: float64(2), object(6)
memory usage: 1003.0+ KB


## Checking the number of unique values of each column

In [18]:
fashion.nunique()

p_id            14220
name            13879
price            1208
colour             49
brand            1021
img             14220
description     14232
p_attributes    13094
dtype: int64
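
The counts above show 14220 distinct `p_id` values across 14265 rows, so some product IDs still occur on more than one row after `drop_duplicates()`, presumably because those rows differ in some other field. A minimal sketch of one way to handle this, assuming it is acceptable to keep the first row per `p_id` (the notebook itself does not do this); the toy frame is hypothetical:

```python
import pandas as pd

# Hypothetical rows: the same product id listed twice with different names.
toy = pd.DataFrame({
    "p_id": [101.0, 101.0, 202.0],
    "name": ["Kurta A", "Kurta A (updated)", "Top B"],
})

# A full-row duplicate check finds nothing, because the names differ.
full_row_dupes = int(toy.duplicated().sum())

# Keying on p_id alone reveals the repeated product.
id_dupes = int(toy.duplicated(subset="p_id").sum())

# Keep the first row seen for each product id.
deduped = toy.drop_duplicates(subset="p_id", keep="first")
```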

## Check statistics of data set

In [19]:
fashion.describe()

Unnamed: 0,p_id,price
count,14265.0,14265.0
mean,15688540.0,2967.732072
std,3155228.0,2566.815871
min,70166.0,169.0
25%,14136180.0,1599.0
50%,16382030.0,2200.0
75%,18076050.0,3497.0
max,19415760.0,47999.0


## Insight

The minimum price of a women's fashion product in this Myntra dataset is 169 INR and the maximum is 47999 INR.

In [20]:
fashion.head(2)

Unnamed: 0,p_id,name,price,colour,brand,img,description,p_attributes
0,17048614.0,Khushal K Women Black Ethnic Motifs Printed Ku...,5099.0,Black,Khushal K,http://assets.myntassets.com/assets/images/170...,Black printed Kurta with Palazzos with dupatta...,"{'Add-Ons': 'NA', 'Body Shape ID': '443,333,32..."
1,16524740.0,InWeave Women Orange Solid Kurta with Palazzos...,5899.0,Orange,InWeave,http://assets.myntassets.com/assets/images/165...,Orange solid Kurta with Palazzos with dupatta<...,"{'Add-Ons': 'NA', 'Body Shape ID': '443,333,32..."


In [21]:
fashion.shape

(14265, 8)

## Working with the description feature

In [22]:
fashion.iloc[0]['description']

"Black printed Kurta with Palazzos with dupatta <br> <br> <b> Kurta design:  </b> <ul> <li> Ethnic motifs printed </li> <li> Anarkali shape </li> <li> Regular style </li> <li> Mandarin collar,  three-quarter regular sleeves </li> <li> Calf length with flared hem </li> <li> Viscose rayon machine weave fabric </li> </ul> <br> <b> Palazzos design:  </b> <ul> <li> Printed Palazzos </li> <li> Elasticated waistband </li> <li> Slip-on closure </li> </ul>Dupatta Length 2.43 meters Width:&nbsp;88 cm<br>The model (height 5'8) is wearing a size S100% Rayon<br>Machine wash"

## Removing HTML elements from the description feature using regular expressions

In [23]:
import re

regex1 = re.compile(r'<[^>]+>')

def remove_html(string):
    return regex1.sub('', string)


## Removing HTML entities such as &nbsp; from the string using a regular expression

In [24]:
clean_nbsp = re.compile(r'&.*?;')
def remove_nbsp(string):
    return clean_nbsp.sub('',string)

In [25]:
fashion['description']=fashion['description'].apply(remove_html)
fashion['description']=fashion['description'].apply(remove_nbsp)

In [26]:
fashion.iloc[0]['description']

"Black printed Kurta with Palazzos with dupatta    Kurta design:     Ethnic motifs printed   Anarkali shape   Regular style   Mandarin collar,  three-quarter regular sleeves   Calf length with flared hem   Viscose rayon machine weave fabric     Palazzos design:     Printed Palazzos   Elasticated waistband   Slip-on closure  Dupatta Length 2.43 meters Width:88 cmThe model (height 5'8) is wearing a size S100% RayonMachine wash"

In [27]:
fashion.head(5)

Unnamed: 0,p_id,name,price,colour,brand,img,description,p_attributes
0,17048614.0,Khushal K Women Black Ethnic Motifs Printed Ku...,5099.0,Black,Khushal K,http://assets.myntassets.com/assets/images/170...,Black printed Kurta with Palazzos with dupatta...,"{'Add-Ons': 'NA', 'Body Shape ID': '443,333,32..."
1,16524740.0,InWeave Women Orange Solid Kurta with Palazzos...,5899.0,Orange,InWeave,http://assets.myntassets.com/assets/images/165...,Orange solid Kurta with Palazzos with dupattaK...,"{'Add-Ons': 'NA', 'Body Shape ID': '443,333,32..."
2,16331376.0,Anubhutee Women Navy Blue Ethnic Motifs Embroi...,4899.0,Navy Blue,Anubhutee,http://assets.myntassets.com/assets/images/163...,Navy blue embroidered Kurta with Trousers with...,"{'Add-Ons': 'NA', 'Body Shape ID': '333,424', ..."
3,14709966.0,Nayo Women Red Floral Printed Kurta With Trous...,3699.0,Red,Nayo,http://assets.myntassets.com/assets/images/147...,Red printed kurta with trouser and dupattaKurt...,"{'Add-Ons': 'NA', 'Body Shape ID': '333,424', ..."
4,11056154.0,AHIKA Women Black & Green Printed Straight Kurta,1350.0,Black,AHIKA,http://assets.myntassets.com/assets/images/110...,"Black and green printed straight kurta, has a ...","{'Body Shape ID': '424', 'Body or Garment Size..."
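
In the cleaned description shown earlier, neighbouring words are fused (`cmThe`, `RayonMachine`) because each tag is replaced with an empty string. A possible alternative, sketched here and not what the notebook runs, replaces tags with a space, decodes entities with the standard library's `html.unescape`, and collapses repeated whitespace. The function name `clean_description` and the shortened sample string are hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")

def clean_description(text: str) -> str:
    """Strip HTML tags and entities without gluing adjacent words together."""
    text = TAG_RE.sub(" ", text)            # a space keeps the words on both sides apart
    text = html.unescape(text)              # "&nbsp;" becomes a non-breaking space
    return SPACE_RE.sub(" ", text).strip()  # \s also matches the non-breaking space

# Hypothetical sample modelled on the description format in the dataset.
sample = "Width:&nbsp;88 cm<br>The model is wearing a size S<br>100% Rayon"
cleaned = clean_description(sample)
```

Words that are fused in the raw text itself, with no tag between them, would still need separate handling.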


## Converting the description feature into a list of words

In [28]:
fashion['description'] = fashion['description'].apply(lambda x:x.split())
fashion.sample(4)

Unnamed: 0,p_id,name,price,colour,brand,img,description,p_attributes
4360,18290008.0,Go Colors Women Black Tapered Fit Trousers,799.0,Black,Go Colors,http://assets.myntassets.com/assets/images/182...,"[Black, woven, trousers, Tapered, fit, Mid-ris...","{'Add-Ons': 'NA', 'Body Shape ID': '443,333,42..."
2190,19171276.0,Kraus Jeans Women Blue Straight Fit High-Rise ...,1795.0,Blue,Kraus Jeans,http://assets.myntassets.com/assets/images/191...,"[Light, shade,, no, fade, blue, jeans, Straigh...","{'Add-Ons': 'NA', 'Body or Garment Size': 'To-..."
1771,14447568.0,Tokyo Talkies Pink Off-Shoulder Bardot Top,699.0,Pink,Tokyo Talkies,http://assets.myntassets.com/assets/images/pro...,"[Pink, regular, bardot, top, Solid, Off-should...","{'Body Shape ID': '333,324,424', 'Body or Garm..."
13178,9338261.0,SCORPIOUS Women Pack of Shrug & Top,2198.0,Multi,SCORPIUS,http://assets.myntassets.com/assets/images/933...,"[Pack, of, shrug, and, topSCORPIUS, Women, Red...",{'Body or Garment Size': 'Garment Measurements...


In [29]:
fashion.iloc[0]['description']

['Black',
 'printed',
 'Kurta',
 'with',
 'Palazzos',
 'with',
 'dupatta',
 'Kurta',
 'design:',
 'Ethnic',
 'motifs',
 'printed',
 'Anarkali',
 'shape',
 'Regular',
 'style',
 'Mandarin',
 'collar,',
 'three-quarter',
 'regular',
 'sleeves',
 'Calf',
 'length',
 'with',
 'flared',
 'hem',
 'Viscose',
 'rayon',
 'machine',
 'weave',
 'fabric',
 'Palazzos',
 'design:',
 'Printed',
 'Palazzos',
 'Elasticated',
 'waistband',
 'Slip-on',
 'closure',
 'Dupatta',
 'Length',
 '2.43',
 'meters',
 'Width:88',
 'cmThe',
 'model',
 '(height',
 "5'8)",
 'is',
 'wearing',
 'a',
 'size',
 'S100%',
 'RayonMachine',
 'wash']

## Working with the product attributes

In [30]:
fashion.iloc[0]['p_attributes']

"{'Add-Ons': 'NA', 'Body Shape ID': '443,333,324,424', 'Body or Garment Size': 'Garment Measurements in', 'Bottom Closure': 'Slip-On', 'Bottom Fabric': 'Viscose Rayon', 'Bottom Pattern': 'Printed', 'Bottom Type': 'Palazzos', 'Character': 'NA', 'Dupatta': 'With Dupatta', 'Dupatta Border': 'Solid', 'Dupatta Fabric': 'Viscose Rayon', 'Dupatta Pattern': 'Printed', 'Main Trend': 'Indie Prints', 'Neck': 'Mandarin Collar', 'Number of Pockets': 'NA', 'Occasion': 'Festive', 'Ornamentation': 'NA', 'Pattern Coverage': 'Placement', 'Print or Pattern Type': 'Ethnic Motifs', 'Sleeve Length': 'Three-Quarter Sleeves', 'Sleeve Styling': 'Regular Sleeves', 'Slit Detail': 'NA', 'Stitch': 'Ready to Wear', 'Sustainable': 'Regular', 'Technique': 'Screen', 'Top Design Styling': 'Regular', 'Top Fabric': 'Viscose Rayon', 'Top Hemline': 'Flared', 'Top Length': 'Calf Length', 'Top Pattern': 'Printed', 'Top Shape': 'Anarkali', 'Top Type': 'Kurta', 'Waistband': 'Elasticated', 'Wash Care': 'Machine Wash', 'Weave Pa

## Converting the dict into a list of its values

In [31]:
import ast
def dict_to_list(input_string):
    input_dict = ast.literal_eval(input_string)
    return list(input_dict.values())

In [32]:
fashion['p_attributes'] = fashion['p_attributes'].apply(dict_to_list)
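
`ast.literal_eval` parses the dictionary literal stored as text without executing arbitrary code, and `dict_to_list` then keeps every value, including the `'NA'` placeholders. A hedged variation, assuming those placeholders carry no signal for similarity, filters them out. `dict_to_clean_list` is a hypothetical helper, not part of the notebook.

```python
import ast

def dict_to_clean_list(input_string: str) -> list[str]:
    """Parse a dict literal stored as text and return its non-placeholder values."""
    parsed = ast.literal_eval(input_string)  # evaluates literals only, never arbitrary code
    return [value for value in parsed.values() if value != "NA"]

# Shortened sample in the same format as the p_attributes column.
raw = "{'Add-Ons': 'NA', 'Bottom Type': 'Palazzos', 'Occasion': 'Festive'}"
values = dict_to_clean_list(raw)
```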


In [33]:
fashion

Unnamed: 0,p_id,name,price,colour,brand,img,description,p_attributes
0,17048614.0,Khushal K Women Black Ethnic Motifs Printed Ku...,5099.0,Black,Khushal K,http://assets.myntassets.com/assets/images/170...,"[Black, printed, Kurta, with, Palazzos, with, ...","[NA, 443,333,324,424, Garment Measurements in,..."
1,16524740.0,InWeave Women Orange Solid Kurta with Palazzos...,5899.0,Orange,InWeave,http://assets.myntassets.com/assets/images/165...,"[Orange, solid, Kurta, with, Palazzos, with, d...","[NA, 443,333,324,424, Garment Measurements in,..."
2,16331376.0,Anubhutee Women Navy Blue Ethnic Motifs Embroi...,4899.0,Navy Blue,Anubhutee,http://assets.myntassets.com/assets/images/163...,"[Navy, blue, embroidered, Kurta, with, Trouser...","[NA, 333,424, Garment Measurements in, Slip-On..."
3,14709966.0,Nayo Women Red Floral Printed Kurta With Trous...,3699.0,Red,Nayo,http://assets.myntassets.com/assets/images/147...,"[Red, printed, kurta, with, trouser, and, dupa...","[NA, 333,424, Garment Measurements in, Slip-On..."
4,11056154.0,AHIKA Women Black & Green Printed Straight Kurta,1350.0,Black,AHIKA,http://assets.myntassets.com/assets/images/110...,"[Black, and, green, printed, straight, kurta,,...","[424, Garment Measurements in, Monochrome, Reg..."
...,...,...,...,...,...,...,...,...
14325,15415116.0,Flying Machine Women Blue Solid Mock-Collar Fr...,2299.0,Blue,Flying Machine,http://assets.myntassets.com/assets/images/154...,"[Blue, solid, front-open, sweatshirt, has, a, ...","[443,424,324, Garment Measurements in, Yes, NA..."
14326,16470114.0,Juelle Women Green Printed Hooded Sweatshirt,2299.0,Green,Juelle,http://assets.myntassets.com/assets/images/164...,"[Green, printed, sweatshirt, has, a, hooded,, ...","[443,424,324, Garment Measurements in, No, NA,..."
14327,16382150.0,Vero Moda Women Pink Sweatshirt,2299.0,Pink,Vero Moda,http://assets.myntassets.com/assets/images/163...,"[Pink, solid, sweatshirt, has, a, mock, collar...","[443,424,324, To-Fit Denotes Body Measurements..."
14328,16379664.0,Vero Moda Women Blue Sweatshirt,2299.0,Blue,Vero Moda,http://assets.myntassets.com/assets/images/163...,"[Blue, solid, sweatshirt, has, a, round, neck,...","[443,424,324, To-Fit Denotes Body Measurements..."


In [34]:
fashion.iloc[0]['p_attributes']

['NA',
 '443,333,324,424',
 'Garment Measurements in',
 'Slip-On',
 'Viscose Rayon',
 'Printed',
 'Palazzos',
 'NA',
 'With Dupatta',
 'Solid',
 'Viscose Rayon',
 'Printed',
 'Indie Prints',
 'Mandarin Collar',
 'NA',
 'Festive',
 'NA',
 'Placement',
 'Ethnic Motifs',
 'Three-Quarter Sleeves',
 'Regular Sleeves',
 'NA',
 'Ready to Wear',
 'Regular',
 'Screen',
 'Regular',
 'Viscose Rayon',
 'Flared',
 'Calf Length',
 'Printed',
 'Anarkali',
 'Kurta',
 'Elasticated',
 'Machine Wash',
 'Regular',
 'Machine Weave']

In [35]:
fashion['p_attributes']

0        [NA, 443,333,324,424, Garment Measurements in,...
1        [NA, 443,333,324,424, Garment Measurements in,...
2        [NA, 333,424, Garment Measurements in, Slip-On...
3        [NA, 333,424, Garment Measurements in, Slip-On...
4        [424, Garment Measurements in, Monochrome, Reg...
                               ...                        
14325    [443,424,324, Garment Measurements in, Yes, NA...
14326    [443,424,324, Garment Measurements in, No, NA,...
14327    [443,424,324, To-Fit Denotes Body Measurements...
14328    [443,424,324, To-Fit Denotes Body Measurements...
14329    [443,424,324, To-Fit Denotes Body Measurements...
Name: p_attributes, Length: 14265, dtype: object

In [36]:
fashion['description']

0        [Black, printed, Kurta, with, Palazzos, with, ...
1        [Orange, solid, Kurta, with, Palazzos, with, d...
2        [Navy, blue, embroidered, Kurta, with, Trouser...
3        [Red, printed, kurta, with, trouser, and, dupa...
4        [Black, and, green, printed, straight, kurta,,...
                               ...                        
14325    [Blue, solid, front-open, sweatshirt, has, a, ...
14326    [Green, printed, sweatshirt, has, a, hooded,, ...
14327    [Pink, solid, sweatshirt, has, a, mock, collar...
14328    [Blue, solid, sweatshirt, has, a, round, neck,...
14329    [Grey, and, green, colourblocked, sweatshirt, ...
Name: description, Length: 14265, dtype: object

In [37]:
fashion.head(2)

Unnamed: 0,p_id,name,price,colour,brand,img,description,p_attributes
0,17048614.0,Khushal K Women Black Ethnic Motifs Printed Ku...,5099.0,Black,Khushal K,http://assets.myntassets.com/assets/images/170...,"[Black, printed, Kurta, with, Palazzos, with, ...","[NA, 443,333,324,424, Garment Measurements in,..."
1,16524740.0,InWeave Women Orange Solid Kurta with Palazzos...,5899.0,Orange,InWeave,http://assets.myntassets.com/assets/images/165...,"[Orange, solid, Kurta, with, Palazzos, with, d...","[NA, 443,333,324,424, Garment Measurements in,..."


## Converting the price from float to string for concatenation

In [38]:
def convert_float_to_string(input_float):
    return str(input_float)


In [39]:
fashion['price'] = fashion['price'].apply(convert_float_to_string)


## Converting colour, brand and price into lists

In [40]:
def convert_string_to_list(input_string):
    word_list = input_string.split()
    return word_list


In [41]:
fashion['colour'] = fashion['colour'].apply(convert_string_to_list)
fashion['brand'] = fashion['brand'].apply(convert_string_to_list)
fashion['price'] = fashion['price'].apply(convert_string_to_list)


In [42]:
fashion.head(2)

Unnamed: 0,p_id,name,price,colour,brand,img,description,p_attributes
0,17048614.0,Khushal K Women Black Ethnic Motifs Printed Ku...,[5099.0],[Black],"[Khushal, K]",http://assets.myntassets.com/assets/images/170...,"[Black, printed, Kurta, with, Palazzos, with, ...","[NA, 443,333,324,424, Garment Measurements in,..."
1,16524740.0,InWeave Women Orange Solid Kurta with Palazzos...,[5899.0],[Orange],[InWeave],http://assets.myntassets.com/assets/images/165...,"[Orange, solid, Kurta, with, Palazzos, with, d...","[NA, 443,333,324,424, Garment Measurements in,..."


## Concatenating the lists into a single feature called tags that holds all the information about the fashion product

In [43]:
fashion['tags'] = fashion['price'] + fashion['colour'] + fashion['brand'] + fashion['description'] + fashion['p_attributes']
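
Adding two Series of lists applies Python's list `+` row by row, so each product's lists are joined end to end in the order written. The same operation on plain lists, with values abbreviated from the first row shown earlier:

```python
# Values abbreviated from the first product; in pandas the same `+` runs once per row.
price = ["5099.0"]
colour = ["Black"]
brand = ["Khushal", "K"]
description = ["Black", "printed", "Kurta"]
p_attributes = ["Viscose Rayon", "Mandarin Collar"]

tags = price + colour + brand + description + p_attributes

# Multi-word attribute values stay single list items until the list is joined into one string.
joined = " ".join(tags)
```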


In [44]:
fashion.head()

Unnamed: 0,p_id,name,price,colour,brand,img,description,p_attributes,tags
0,17048614.0,Khushal K Women Black Ethnic Motifs Printed Ku...,[5099.0],[Black],"[Khushal, K]",http://assets.myntassets.com/assets/images/170...,"[Black, printed, Kurta, with, Palazzos, with, ...","[NA, 443,333,324,424, Garment Measurements in,...","[5099.0, Black, Khushal, K, Black, printed, Ku..."
1,16524740.0,InWeave Women Orange Solid Kurta with Palazzos...,[5899.0],[Orange],[InWeave],http://assets.myntassets.com/assets/images/165...,"[Orange, solid, Kurta, with, Palazzos, with, d...","[NA, 443,333,324,424, Garment Measurements in,...","[5899.0, Orange, InWeave, Orange, solid, Kurta..."
2,16331376.0,Anubhutee Women Navy Blue Ethnic Motifs Embroi...,[4899.0],"[Navy, Blue]",[Anubhutee],http://assets.myntassets.com/assets/images/163...,"[Navy, blue, embroidered, Kurta, with, Trouser...","[NA, 333,424, Garment Measurements in, Slip-On...","[4899.0, Navy, Blue, Anubhutee, Navy, blue, em..."
3,14709966.0,Nayo Women Red Floral Printed Kurta With Trous...,[3699.0],[Red],[Nayo],http://assets.myntassets.com/assets/images/147...,"[Red, printed, kurta, with, trouser, and, dupa...","[NA, 333,424, Garment Measurements in, Slip-On...","[3699.0, Red, Nayo, Red, printed, kurta, with,..."
4,11056154.0,AHIKA Women Black & Green Printed Straight Kurta,[1350.0],[Black],[AHIKA],http://assets.myntassets.com/assets/images/110...,"[Black, and, green, printed, straight, kurta,,...","[424, Garment Measurements in, Monochrome, Reg...","[1350.0, Black, AHIKA, Black, and, green, prin..."


In [45]:
fashion.iloc[0]['tags']

['5099.0',
 'Black',
 'Khushal',
 'K',
 'Black',
 'printed',
 'Kurta',
 'with',
 'Palazzos',
 'with',
 'dupatta',
 'Kurta',
 'design:',
 'Ethnic',
 'motifs',
 'printed',
 'Anarkali',
 'shape',
 'Regular',
 'style',
 'Mandarin',
 'collar,',
 'three-quarter',
 'regular',
 'sleeves',
 'Calf',
 'length',
 'with',
 'flared',
 'hem',
 'Viscose',
 'rayon',
 'machine',
 'weave',
 'fabric',
 'Palazzos',
 'design:',
 'Printed',
 'Palazzos',
 'Elasticated',
 'waistband',
 'Slip-on',
 'closure',
 'Dupatta',
 'Length',
 '2.43',
 'meters',
 'Width:88',
 'cmThe',
 'model',
 '(height',
 "5'8)",
 'is',
 'wearing',
 'a',
 'size',
 'S100%',
 'RayonMachine',
 'wash',
 'NA',
 '443,333,324,424',
 'Garment Measurements in',
 'Slip-On',
 'Viscose Rayon',
 'Printed',
 'Palazzos',
 'NA',
 'With Dupatta',
 'Solid',
 'Viscose Rayon',
 'Printed',
 'Indie Prints',
 'Mandarin Collar',
 'NA',
 'Festive',
 'NA',
 'Placement',
 'Ethnic Motifs',
 'Three-Quarter Sleeves',
 'Regular Sleeves',
 'NA',
 'Ready to Wear',
 'Re

## Keeping only the columns needed for recommendation

In [46]:
# .copy() gives an independent DataFrame that can be modified safely
new_df = fashion[['p_id','name','img','tags']].copy()

In [47]:
new_df.head()

Unnamed: 0,p_id,name,img,tags
0,17048614.0,Khushal K Women Black Ethnic Motifs Printed Ku...,http://assets.myntassets.com/assets/images/170...,"[5099.0, Black, Khushal, K, Black, printed, Ku..."
1,16524740.0,InWeave Women Orange Solid Kurta with Palazzos...,http://assets.myntassets.com/assets/images/165...,"[5899.0, Orange, InWeave, Orange, solid, Kurta..."
2,16331376.0,Anubhutee Women Navy Blue Ethnic Motifs Embroi...,http://assets.myntassets.com/assets/images/163...,"[4899.0, Navy, Blue, Anubhutee, Navy, blue, em..."
3,14709966.0,Nayo Women Red Floral Printed Kurta With Trous...,http://assets.myntassets.com/assets/images/147...,"[3699.0, Red, Nayo, Red, printed, kurta, with,..."
4,11056154.0,AHIKA Women Black & Green Printed Straight Kurta,http://assets.myntassets.com/assets/images/110...,"[1350.0, Black, AHIKA, Black, and, green, prin..."


## Converting the list to a string

In [48]:
new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x))
new_df.head()


Unnamed: 0,p_id,name,img,tags
0,17048614.0,Khushal K Women Black Ethnic Motifs Printed Ku...,http://assets.myntassets.com/assets/images/170...,5099.0 Black Khushal K Black printed Kurta wit...
1,16524740.0,InWeave Women Orange Solid Kurta with Palazzos...,http://assets.myntassets.com/assets/images/165...,5899.0 Orange InWeave Orange solid Kurta with ...
2,16331376.0,Anubhutee Women Navy Blue Ethnic Motifs Embroi...,http://assets.myntassets.com/assets/images/163...,4899.0 Navy Blue Anubhutee Navy blue embroider...
3,14709966.0,Nayo Women Red Floral Printed Kurta With Trous...,http://assets.myntassets.com/assets/images/147...,3699.0 Red Nayo Red printed kurta with trouser...
4,11056154.0,AHIKA Women Black & Green Printed Straight Kurta,http://assets.myntassets.com/assets/images/110...,1350.0 Black AHIKA Black and green printed str...


In [49]:
new_df.iloc[0]['tags']

"5099.0 Black Khushal K Black printed Kurta with Palazzos with dupatta Kurta design: Ethnic motifs printed Anarkali shape Regular style Mandarin collar, three-quarter regular sleeves Calf length with flared hem Viscose rayon machine weave fabric Palazzos design: Printed Palazzos Elasticated waistband Slip-on closure Dupatta Length 2.43 meters Width:88 cmThe model (height 5'8) is wearing a size S100% RayonMachine wash NA 443,333,324,424 Garment Measurements in Slip-On Viscose Rayon Printed Palazzos NA With Dupatta Solid Viscose Rayon Printed Indie Prints Mandarin Collar NA Festive NA Placement Ethnic Motifs Three-Quarter Sleeves Regular Sleeves NA Ready to Wear Regular Screen Regular Viscose Rayon Flared Calf Length Printed Anarkali Kurta Elasticated Machine Wash Regular Machine Weave"

## Converting to lower case

In [50]:
new_df['tags'] = new_df['tags'].apply(lambda x:x.lower())


In [51]:
new_df.head()

Unnamed: 0,p_id,name,img,tags
0,17048614.0,Khushal K Women Black Ethnic Motifs Printed Ku...,http://assets.myntassets.com/assets/images/170...,5099.0 black khushal k black printed kurta wit...
1,16524740.0,InWeave Women Orange Solid Kurta with Palazzos...,http://assets.myntassets.com/assets/images/165...,5899.0 orange inweave orange solid kurta with ...
2,16331376.0,Anubhutee Women Navy Blue Ethnic Motifs Embroi...,http://assets.myntassets.com/assets/images/163...,4899.0 navy blue anubhutee navy blue embroider...
3,14709966.0,Nayo Women Red Floral Printed Kurta With Trous...,http://assets.myntassets.com/assets/images/147...,3699.0 red nayo red printed kurta with trouser...
4,11056154.0,AHIKA Women Black & Green Printed Straight Kurta,http://assets.myntassets.com/assets/images/110...,1350.0 black ahika black and green printed str...


In [52]:
new_df.iloc[0]['tags']

"5099.0 black khushal k black printed kurta with palazzos with dupatta kurta design: ethnic motifs printed anarkali shape regular style mandarin collar, three-quarter regular sleeves calf length with flared hem viscose rayon machine weave fabric palazzos design: printed palazzos elasticated waistband slip-on closure dupatta length 2.43 meters width:88 cmthe model (height 5'8) is wearing a size s100% rayonmachine wash na 443,333,324,424 garment measurements in slip-on viscose rayon printed palazzos na with dupatta solid viscose rayon printed indie prints mandarin collar na festive na placement ethnic motifs three-quarter sleeves regular sleeves na ready to wear regular screen regular viscose rayon flared calf length printed anarkali kurta elasticated machine wash regular machine weave"

## Using the nltk library to perform stemming, which reduces words to their base form: for example, love, loving and lovely all become love, since they convey the same meaning

In [53]:
import nltk
from nltk.stem import PorterStemmer

In [54]:
ps = PorterStemmer()

In [55]:
def stems(text):
    """Stem every whitespace-separated word and join the stems back into one string."""
    return " ".join(ps.stem(word) for word in text.split())

In [56]:
new_df['tags'] = new_df['tags'].apply(stems)


In [57]:
new_df.iloc[0]['tags']

"5099.0 black khushal k black print kurta with palazzo with dupatta kurta design: ethnic motif print anarkali shape regular style mandarin collar, three-quart regular sleev calf length with flare hem viscos rayon machin weav fabric palazzo design: print palazzo elast waistband slip-on closur dupatta length 2.43 meter width:88 cmthe model (height 5'8) is wear a size s100% rayonmachin wash na 443,333,324,424 garment measur in slip-on viscos rayon print palazzo na with dupatta solid viscos rayon print indi print mandarin collar na festiv na placement ethnic motif three-quart sleev regular sleev na readi to wear regular screen regular viscos rayon flare calf length print anarkali kurta elast machin wash regular machin weav"

In [58]:
fashion.shape

(14265, 9)
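
The stemmed output above still carries punctuation inside tokens (`design:`, `collar,`, `(height`), so `design:` and `design` would count as different words downstream. One possible refinement, sketched under the assumption that punctuation adds nothing to similarity, extracts alphanumeric runs before stemming. The `tokenize` helper is hypothetical and keeps hyphens, apostrophes and decimal points that sit inside a token.

```python
import re

# A token is a run of letters/digits, optionally continued by "-", "." or "'" plus more letters/digits.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-.'][a-z0-9]+)*")

def tokenize(text: str) -> list[str]:
    """Lowercase the text and return tokens without leading or trailing punctuation."""
    return TOKEN_RE.findall(text.lower())

tokens = tokenize("Kurta design: Mandarin collar, three-quarter sleeves (height 5'8) 2.43 meters")
```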

## Extracting text features with sentence embeddings (SentenceTransformer) for semantic similarity and BM25 (rank_bm25) for keyword-based relevance

In [59]:
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi


In [60]:
new_df['tags']

0        5099.0 black khushal k black print kurta with ...
1        5899.0 orang inweav orang solid kurta with pal...
2        4899.0 navi blue anubhute navi blue embroid ku...
3        3699.0 red nayo red print kurta with trouser a...
4        1350.0 black ahika black and green print strai...
                               ...                        
14325    2299.0 blue fli machin blue solid front-open s...
14326    2299.0 green juell green print sweatshirt ha a...
14327    2299.0 pink vero moda pink solid sweatshirt ha...
14328    2299.0 blue vero moda blue solid sweatshirt ha...
14329    2299.0 grey vero moda grey and green colourblo...
Name: tags, Length: 14265, dtype: object

In [61]:
new_df = new_df.reset_index(drop=True)

In [62]:
model = SentenceTransformer('all-MiniLM-L6-v2')
# encode() expects a list of strings, so convert the Series explicitly
embeddings = model.encode(new_df['tags'].tolist(), convert_to_tensor=True)

In [63]:
from sentence_transformers import util

In [64]:
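# Move the tensor to the CPU first so .numpy() also works when the embeddings were computed on a GPU
similarity = util.pytorch_cos_sim(embeddings, embeddings).cpu().numpy()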
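
Cosine similarity between two embedding vectors is their dot product divided by the product of their norms, so it depends on direction and not on length. A NumPy sketch of the same all-pairs computation on made-up 2-D vectors (real `all-MiniLM-L6-v2` embeddings have 384 dimensions); `cosine_similarity_matrix` is a hypothetical helper:

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return the matrix of pairwise cosine similarities between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms   # scale every row to length 1
    return unit @ unit.T     # dot products of unit vectors are cosines

# Rows 0 and 1 point the same way; row 2 is orthogonal to both.
toy_embeddings = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy_embeddings)
```

For the 14265 products here the full matrix has 14265 × 14265 entries, roughly 0.8 GB if stored as 32-bit floats, which is worth keeping in mind before pickling it.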

In [65]:
tokenized_corpus = [doc.split() for doc in new_df['tags']]
bm25 = BM25Okapi(tokenized_corpus)
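
BM25 scores a document against a query by summing, for each query term, an inverse-document-frequency weight multiplied by a term-frequency factor that saturates and is normalised by document length. The sketch below is a simplified textbook variant written for illustration; it is not the `rank_bm25` implementation, and its IDF formula and the defaults `k1=1.5`, `b=0.75` are assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every tokenised document in `corpus` against the tokenised `query`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            # Rare terms get a larger weight; the "+ 1" keeps the weight positive.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Made-up, already stemmed documents.
corpus = [["black", "print", "kurta"], ["blue", "solid", "sweatshirt"], ["black", "jean"]]
scores = bm25_scores(["black", "kurta"], corpus)
```

The first document matches both query terms and ranks highest, the third matches only the common term `black`, and the second matches nothing and scores zero.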


In [66]:
def hybrid_recommend(name, alpha=0.5):
    """Hybrid recommendation combining BM25 and Semantic Similarity"""
    if name not in new_df['name'].values:
        print("Item not found!")
        return

    index = new_df[new_df['name'] == name].index[0]

    # BM25 Scores
    bm25_scores = bm25.get_scores(new_df['tags'][index].split())
    bm25_scores = np.array(bm25_scores)
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())  # Normalize

    # Semantic Similarity Scores
    sem_scores = similarity[index]
    sem_scores = (sem_scores - sem_scores.min()) / (sem_scores.max() - sem_scores.min())  # Normalize

    # Hybrid Score Calculation
    hybrid_scores = alpha * sem_scores + (1 - alpha) * bm25_scores
    ranked_items = sorted(list(enumerate(hybrid_scores)), key=lambda x: x[1], reverse=True)

    print("Your Recommendations:")
    for i in ranked_items[1:6]:  # Skip the first one as it's the input itself
        print(f"{new_df.iloc[i[0]]['name']} - Score: {i[1]:.4f}")
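
The min-max step in `hybrid_recommend` divides by `max - min`, which is zero whenever all scores happen to be equal. A defensive sketch of the normalisation and the weighted blend, where `min_max` is a hypothetical helper and the score arrays are made up:

```python
import numpy as np

def min_max(scores: np.ndarray) -> np.ndarray:
    """Rescale scores to [0, 1]; return zeros when every score is identical."""
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores, dtype=float)
    return (scores - scores.min()) / span

bm25_toy = np.array([12.0, 3.0, 6.0])     # lexical scores on an arbitrary scale
semantic_toy = np.array([1.0, 0.2, 0.8])  # cosine similarities

alpha = 0.5  # weight on the semantic score, as in hybrid_recommend
hybrid = alpha * min_max(semantic_toy) + (1 - alpha) * min_max(bm25_toy)

# Indices ordered from most to least similar; index 0 plays the query item here.
ranking = np.argsort(hybrid)[::-1]
```

Normalising both score types to the same range before blending is what lets `alpha` act as a meaningful weight, since raw BM25 scores and cosine similarities live on very different scales.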


## Getting recommendations: the hybrid score combines cosine similarity between sentence embeddings with BM25 scores, as we are building a content-based recommendation system

In [70]:
hybrid_recommend('Libas Women Navy Blue Pure Cotton Floral Print Kurta with Palazzos & Dupatta')

Your Recommendations:
Indo Era Women Peach-Coloured & Pink Pure Cotton Printed Kurta with Trousers & Dupatta - Score: 0.6558
Indo Era Women Blue Floral Printed Regular Pure Cotton Kurta with Palazzos & With Dupatta - Score: 0.6526
Khushal K Women Green & Pink Printed Pure Cotton Kurta with Palazzos & Dupatta - Score: 0.6463
Rudra Bazaar Women Grey & Pink Floral Pleated Pure Cotton Kurta & Trousers & Dupatta - Score: 0.6440
Biba Women Red Floral Printed Regular Pure Cotton Kurta with Palazzos & With Dupatta - Score: 0.6439


In [71]:
import pickle


In [72]:
with open('artifacts/product_list.pkl','wb') as f:
    pickle.dump(new_df, f)
with open('artifacts/similarity.pkl','wb') as f:
    pickle.dump(similarity, f)
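
On the serving side the artifacts would be read back with `pickle.load`. A self-contained round trip, using a temporary directory and stand-in objects instead of the real `new_df` and `similarity` (both stand-ins are hypothetical):

```python
import pickle
import tempfile
from pathlib import Path

# Stand-ins for the real artifacts, so the example runs anywhere.
product_list = [{"p_id": 17048614.0, "name": "Example Kurta"}]
similarity_rows = [[1.0]]

with tempfile.TemporaryDirectory() as tmp:
    artifacts = Path(tmp) / "artifacts"
    artifacts.mkdir(parents=True, exist_ok=True)  # open() does not create missing folders

    with open(artifacts / "product_list.pkl", "wb") as f:
        pickle.dump(product_list, f)
    with open(artifacts / "similarity.pkl", "wb") as f:
        pickle.dump(similarity_rows, f)

    # Loading mirrors dumping; only unpickle files you created or trust.
    with open(artifacts / "product_list.pkl", "rb") as f:
        restored_products = pickle.load(f)
    with open(artifacts / "similarity.pkl", "rb") as f:
        restored_similarity = pickle.load(f)
```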