# Product Recommendation System
- Recommends similar products based on text features.
- Recommendation systems are now very common, especially at e-commerce companies such as Amazon, Flipkart and Myntra.

# Business Objective
- Many people use e-commerce platforms to buy products, and they expect to see similar products when they select one.
- The business objective is a model that recommends similar products to shoppers.

# Constraints

1. Low latency - The model should recommend similar products in real time, within a few nanoseconds.

2. Speed - We need a fast model rather than a 100% accurate one, so speed takes priority over accuracy.

3. Repeated recommendations - People don't like seeing the same product again and again.

4. Interpretability is important.

# Data
- The data covers women's tops listed on Amazon. The raw data has 183K rows and 19 features.
- Of the 19 features, we will use only the 7 important ones.
- Cleaning the data brings the number of data points down from 183K to 25K.

# Approach
- Titles are fairly descriptive of what the product is.
- We will use the title feature to recommend products because titles are short and informative.
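
As a rough illustration of title-based recommendation, the sketch below compares titles with a bag-of-words cosine similarity written in plain Python. The helper names (`title_to_counts`, `cosine_similarity_counts`, `most_similar`) are hypothetical and not part of this notebook. At full scale, the scikit-learn vectorizers and distance functions imported below would likely be the better fit.

```python
import math
import re
from collections import Counter


def title_to_counts(title):
    """Lowercase a title and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", title.lower()))


def cosine_similarity_counts(a, b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_index, titles, top_n=2):
    """Return (index, score) pairs for the titles closest to the query, excluding the query itself."""
    vectors = [title_to_counts(t) for t in titles]
    scores = [
        (i, cosine_similarity_counts(vectors[query_index], v))
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Three titles taken from the sample rows of this dataset.
sample_titles = [
    "FIG Clothing Womens Izo Tunic",
    "FIG Clothing Womens Won Top",
    "Focal18 Sailor Collar Bubble Sleeve Blouse Shirt Women Mori Girl Casual Top Harajuku",
]
print(most_similar(0, sample_titles))
```

Excluding the query index from the results is one simple way to respect the "repeated recommendations" constraint for the query product itself.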

In [113]:
#importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import gridspec
import seaborn as sns
import re
import math
import time
import os
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity  
from sklearn.metrics import pairwise_distances
from scipy.sparse import hstack
from io import BytesIO
from PIL import Image
import warnings
warnings.filterwarnings("ignore")

In [114]:
# Load the JSON file that holds all the product information.
data = pd.read_json('tops_fashion.json')

In [115]:
print ('Number of data points :', data.shape[0], \
       '\nNumber of features/variables:', data.shape[1])

Number of data points : 183138 
Number of features/variables: 19


In [116]:
data.head() 

Unnamed: 0,sku,asin,product_type_name,formatted_price,author,color,brand,publisher,availability,reviews,large_image_url,availability_type,small_image_url,editorial_review,title,model,medium_image_url,manufacturer,editorial_reivew
0,,B016I2TS4W,SHIRT,,,,FNC7C,,,"[False, https://www.amazon.com/reviews/iframe?...",https://images-na.ssl-images-amazon.com/images...,,https://images-na.ssl-images-amazon.com/images...,Minions Como Superheroes Ironman Women's O Nec...,Minions Como Superheroes Ironman Long Sleeve R...,,https://images-na.ssl-images-amazon.com/images...,,
1,,B01N49AI08,SHIRT,,,,FIG Clothing,,,"[False, https://www.amazon.com/reviews/iframe?...",https://images-na.ssl-images-amazon.com/images...,,https://images-na.ssl-images-amazon.com/images...,Sizing runs on the small side. FIG® recommends...,FIG Clothing Womens Izo Tunic,,https://images-na.ssl-images-amazon.com/images...,,
2,,B01JDPCOHO,SHIRT,,,,FIG Clothing,,,"[False, https://www.amazon.com/reviews/iframe?...",https://images-na.ssl-images-amazon.com/images...,,https://images-na.ssl-images-amazon.com/images...,Sizing runs on the small side. FIG® recommends...,FIG Clothing Womens Won Top,,https://images-na.ssl-images-amazon.com/images...,,
3,,B01N19U5H5,SHIRT,,,,Focal18,,,"[True, https://www.amazon.com/reviews/iframe?a...",https://images-na.ssl-images-amazon.com/images...,,https://images-na.ssl-images-amazon.com/images...,100% Brand New & Fashion<br> Quantity: 1 Piece...,Focal18 Sailor Collar Bubble Sleeve Blouse Shi...,,https://images-na.ssl-images-amazon.com/images...,,
4,,B004GSI2OS,SHIRT,$26.26,,Onyx Black/ Stone,FeatherLite,,Usually ships in 6-10 business days,"[False, https://www.amazon.com/reviews/iframe?...",https://images-na.ssl-images-amazon.com/images...,now,https://images-na.ssl-images-amazon.com/images...,,Featherlite Ladies' Long Sleeve Stain Resistan...,,https://images-na.ssl-images-amazon.com/images...,,


In [117]:
# Let's see which features the dataset has.
data.columns

Index(['sku', 'asin', 'product_type_name', 'formatted_price', 'author',
       'color', 'brand', 'publisher', 'availability', 'reviews',
       'large_image_url', 'availability_type', 'small_image_url',
       'editorial_review', 'title', 'model', 'medium_image_url',
       'manufacturer', 'editorial_reivew'],
      dtype='object')

Of these 19 features, we will use only 7 in this project:
1. asin (Amazon Standard Identification Number)
2. brand (brand name of the product)
3. medium_image_url (URL of the product image)
4. product_type_name (type of the product)
5. color (color information of the product)
6. title (title of the product)
7. formatted_price (price of the product)

In [118]:
data = data[['asin', 'brand', 'medium_image_url', 'product_type_name', 'color', 'title', 'formatted_price']]

In [119]:
print ('Number of data points :', data.shape[0], \
       '\nNumber of features:', data.shape[1])
data.head()

Number of data points : 183138 
Number of features: 7


Unnamed: 0,asin,brand,medium_image_url,product_type_name,color,title,formatted_price
0,B016I2TS4W,FNC7C,https://images-na.ssl-images-amazon.com/images...,SHIRT,,Minions Como Superheroes Ironman Long Sleeve R...,
1,B01N49AI08,FIG Clothing,https://images-na.ssl-images-amazon.com/images...,SHIRT,,FIG Clothing Womens Izo Tunic,
2,B01JDPCOHO,FIG Clothing,https://images-na.ssl-images-amazon.com/images...,SHIRT,,FIG Clothing Womens Won Top,
3,B01N19U5H5,Focal18,https://images-na.ssl-images-amazon.com/images...,SHIRT,,Focal18 Sailor Collar Bubble Sleeve Blouse Shi...,
4,B004GSI2OS,FeatherLite,https://images-na.ssl-images-amazon.com/images...,SHIRT,Onyx Black/ Stone,Featherlite Ladies' Long Sleeve Stain Resistan...,$26.26


- Let's look at each feature in detail.

In [120]:
data.describe()

Unnamed: 0,asin,brand,medium_image_url,product_type_name,color,title,formatted_price
count,183138,182987,183138,183138,64956,183138,28395
unique,183138,10577,170782,72,7380,175985,3135
top,B016I2TS4W,Zago,https://images-na.ssl-images-amazon.com/images...,SHIRT,Black,Nakoda Cotton Self Print Straight Kurti For Women,$19.99
freq,1,223,23,167794,13207,77,945


# 1. "asin" feature

In [121]:
print("Products",data.shape[0])
print(data['asin'].describe())
if data['asin'].isnull().sum() == 0:
    print("- All products have 'asin' information")
else:
    print("- {0} products does not have 'asin' information.".format(data['asin'].isnull().sum()))
if data['asin'].duplicated().sum() == 0:
    print("- Each products have unique 'asin' information")
else:
    print("- {0} products have same 'asin' information.".format(data['asin'].duplicated().sum()))

Products 183138
count         183138
unique        183138
top       B016I2TS4W
freq               1
Name: asin, dtype: object
- All products have 'asin' information
- Each products have unique 'asin' information
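
The same missing-value and repeated-value checks are applied to every feature in the sections that follow. One way to factor them into a reusable helper is sketched below; `summarize_feature` and the toy DataFrame are hypothetical and shown only for illustration.

```python
import pandas as pd


def summarize_feature(df, column):
    """Return the row count, missing count and repeated-value count for one column."""
    values = df[column]
    return {
        "rows": len(df),
        "missing": int(values.isnull().sum()),
        # duplicated() marks every repeat after the first occurrence of a value
        "repeated": int(values.duplicated().sum()),
    }


# Toy data standing in for the real product table.
toy = pd.DataFrame({
    "asin": ["A1", "A2", "A3", "A4"],
    "brand": ["Zago", "Zago", None, "XQS"],
})
for name in ["asin", "brand"]:
    print(name, summarize_feature(toy, name))
```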


# 2. "brand" feature 

In [122]:
print("Products ",data.shape[0])
print(data['brand'].describe())
if data['brand'].isnull().sum() == 0:
    print("- All products have 'brand' information")
else:
    print("- {0} products does not have 'brand' information.".format(data['brand'].isnull().sum()))
if data['brand'].duplicated().sum() == 0:
    print("- Each products have unique 'brand' information")
else:
    print("- {0} products have same 'brand' information.".format(data['brand'].duplicated().sum()))

Products  183138
count     182987
unique     10577
top         Zago
freq         223
Name: brand, dtype: object
- 151 products does not have 'brand' information.
- 172560 products have same 'brand' information.


- 182987 of 183138 products have 'brand' information. That's approximately 99.92%.
- We have 10577 unique brands.
- 'Zago' is the most frequent brand, with 223 products (about 0.12% of the 182987 products that have 'brand' information).
- The same brand covers many different products, so repeated brands are expected and we will drop only the rows with a missing 'brand' value.
- Let's drop the 151 data points / rows that do not have 'brand' information.
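
The shares quoted for the 'brand' feature can be recomputed directly from the counts in the output above. The short sketch below does that arithmetic; the variable names are illustrative only. Note that the share of the most frequent brand is measured against the number of products, not the number of unique brands.

```python
total_products = 183138   # rows in the raw data
with_brand = 182987       # rows with a non-null 'brand'
unique_brands = 10577     # distinct brand names
zago_products = 223       # frequency of the most common brand

coverage_pct = 100 * with_brand / total_products
zago_share_pct = 100 * zago_products / with_brand
avg_products_per_brand = with_brand / unique_brands

print(f"brand coverage: {coverage_pct:.2f}%")
print(f"Zago share of branded products: {zago_share_pct:.2f}%")
print(f"average products per brand: {avg_products_per_brand:.1f}")
```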

In [124]:
# Take the first 'brand' that repeats and show up to six of its products.
first_repeat = data.loc[data['brand'].duplicated(), 'brand'].iloc[0]
duplicate_df = data[data['brand'] == first_repeat].reset_index(drop=True)
for i in range(min(6, duplicate_df.shape[0])):
    print(duplicate_df['title'][i])
    print(duplicate_df['brand'][i])
    print()

FIG Clothing Womens Izo Tunic
FIG Clothing

FIG Clothing Womens Won Top
FIG Clothing

FIG Clothing Womens Mag Top
FIG Clothing

FIG Clothing Womens Zof Top
FIG Clothing

FIG Clothing Womens Tuc Top
FIG Clothing

FIG Clothing Women's Dublin Top Macaroon Shirt
FIG Clothing



In [125]:
data = data.loc[~data['brand'].isnull()]

# Reset dataframe index.
data = data.reset_index(drop=True)
print("- Number of data points after eliminating 'brand' = NULL :", data.shape[0])

- Number of data points after eliminating 'brand' = NULL : 182987


In [126]:
# Let's print the top 10 brands and their frequencies
brand_count = Counter(data['brand'])
for brand, count in brand_count.most_common(10):
    print("{0} - {1}".format(brand, count))

Zago - 223
XQS - 222
Yayun - 215
YUNY - 198
XiaoTianXin-women clothes - 193
Generic - 192
Boohoo - 190
Alion - 188
Abetteric - 187
TheMogan - 187


# 3. "medium_image_url" feature 

In [127]:
print("Products",data.shape[0])
print(data['medium_image_url'].describe())
if data['medium_image_url'].isnull().sum() == 0:
    print("- All products have 'medium_image_url' information")
else:
    print("- {0} products does not have 'medium_image_url' information.".format(data['medium_image_url'].isnull().sum()))
if data['medium_image_url'].duplicated().sum() == 0:
    print("- Each products have unique 'medium_image_url' information")
else:
    print("- {0} products have same 'medium_image_url' information.".format(data['medium_image_url'].duplicated().sum()))

Products 182987
count                                                182987
unique                                               170667
top       https://images-na.ssl-images-amazon.com/images...
freq                                                     23
Name: medium_image_url, dtype: object
- All products have 'medium_image_url' information
- 12320 products have same 'medium_image_url' information.


- All of the products have an image URL.
- But we have only 170,667 unique 'medium_image_url' values.
- Some products share the same image URL, so let's check the 'title' information for those URLs.

In [None]:
# duplicate = 0
# for duplicate in range(len(data['medium_image_url'].duplicated())):
#     # print(data['medium_image_url'].duplicated()[duplicate])
#     # print(data['medium_image_url'][duplicate])

#     if data['medium_image_url'].duplicated()[duplicate] == True:
#         duplicate_df = data[data['medium_image_url'] == data['medium_image_url'][duplicate]]
#         print(data['brand'][duplicate])
#         print(data['title'][duplicate])
#         print(data['medium_image_url'][duplicate])
#         break
    

In [112]:
# Take the first image URL that repeats and print every product that uses it.
first_repeat = data.loc[data['medium_image_url'].duplicated(), 'medium_image_url'].iloc[0]
duplicate_df = data[data['medium_image_url'] == first_repeat].reset_index(drop=True)
for i in range(duplicate_df.shape[0]):
    print(duplicate_df['title'][i])
    print(duplicate_df['medium_image_url'][i])
    print()

Supernatural Chibis Sam Dean And Castiel Short Sleeve T Shirts For Lady Purple S
https://images-na.ssl-images-amazon.com/images/I/41Cn7jZ3O9L._SL160_.jpg

Supernatural Chibis Sam Dean And Castiel O Neck T-shirts For Female Purple L
https://images-na.ssl-images-amazon.com/images/I/41Cn7jZ3O9L._SL160_.jpg

Supernatural Chibis Sam Dean And Castiel 100% Cotton T Shirts For Women Purple XXL
https://images-na.ssl-images-amazon.com/images/I/41Cn7jZ3O9L._SL160_.jpg



- From the titles and URLs above it is clear that some products share the same image URL but have different titles.
- So we will 'not' drop the data points that share an image URL.
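
The check above looks at a single shared URL. A broader version, sketched below on hypothetical toy data, groups all rows by image URL and counts how many distinct titles each shared URL carries. The column names match the dataset; everything else is assumed for illustration.

```python
import pandas as pd

# Toy rows: 'u1' is shared by three differently titled products,
# 'u3' is shared by two rows with the same title.
toy = pd.DataFrame({
    "medium_image_url": ["u1", "u1", "u1", "u2", "u3", "u3"],
    "title": ["Tee Purple S", "Tee Purple L", "Tee Purple XXL", "Blouse", "Tank", "Tank"],
})

# For every URL, count the rows that use it and the distinct titles among them.
per_url = toy.groupby("medium_image_url")["title"].agg(rows="size", distinct_titles="nunique")

# Keep only URLs used by more than one row.
shared = per_url[per_url["rows"] > 1]
print(shared)
```

URLs where `distinct_titles` is greater than 1 are the cases that argue against dropping rows by image URL alone.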

In [69]:
# Not run: rows that share an image URL are kept, because their titles differ.
# The lines below are retained for reference only.
# # Let's drop duplicate URLs
# data['medium_image_url'] = data['medium_image_url'].drop_duplicates()

# # Now we will drop data points that have a null 'medium_image_url'.
# data = data.loc[~data['medium_image_url'].isnull()]

# # Reset index of the dataframe.
# data = data.reset_index(drop=True)
# print("Number of data points after eliminating 'medium_image_url' = NULL :", data.shape[0])


# 4. "product_type_name" feature 

In [128]:
print("Products ",data.shape[0])
print(data['product_type_name'].describe())
if data['product_type_name'].isnull().sum() == 0:
    print("- All products have 'product_type_name' information")
else:
    print("- {0} products does not have 'product_type_name' information.".format(data['product_type_name'].isnull().sum()))
if data['product_type_name'].duplicated().sum() == 0:
    print("- Each products have unique 'product_type_name' information")
else:
    print("- {0} products have same 'product_type_name' information.".format(data['product_type_name'].duplicated().sum()))

Products  182987
count     182987
unique        72
top        SHIRT
freq      167741
Name: product_type_name, dtype: object
- All products have 'product_type_name' information
- 182915 products have same 'product_type_name' information.


- All products have a 'product_type_name'.
- We have 72 unique values of 'product_type_name'.
- About 91.67% (167741 of 182987) of the products are shirts.
- Let's check the 'title' information for products with the same 'product_type_name'.

In [129]:
# Take the first 'product_type_name' that repeats and show up to six of its titles.
first_repeat = data.loc[data['product_type_name'].duplicated(), 'product_type_name'].iloc[0]
duplicate_df = data[data['product_type_name'] == first_repeat].reset_index(drop=True)
for i in range(min(6, duplicate_df.shape[0])):
    print(duplicate_df['product_type_name'][i])
    print(duplicate_df['title'][i])
    print()

SHIRT
Minions Como Superheroes Ironman Long Sleeve Round Neck T-Shirt For Women

SHIRT
FIG Clothing Womens Izo Tunic

SHIRT
FIG Clothing Womens Won Top

SHIRT
Focal18 Sailor Collar Bubble Sleeve Blouse Shirt Women Mori Girl Casual Top Harajuku

SHIRT
Featherlite Ladies' Long Sleeve Stain Resistant Tapered Twill Shirt, 2XL, Onyx Black/ Stone

SHIRT
[Fits Cloth] Grape Solid Modern Long Sleeve Plain T Shirt



- It is clear that products with the same 'product_type_name' have different 'title' information.

In [130]:
# Let's print the top 10 'product_type_name' values and their frequencies
product_list = Counter(data['product_type_name'])
for product_type, count in product_list.most_common(10):
    print("{0} - {1}".format(product_type, count))

SHIRT - 167741
APPAREL - 3548
BOOKS_1973_AND_LATER - 3336
DRESS - 1584
SPORTING_GOODS - 1272
SWEATER - 837
OUTERWEAR - 796
OUTDOOR_RECREATION_PRODUCT - 729
ACCESSORY - 636
UNDERWEAR - 425
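
Raw counts like those above are easier to compare as shares. A minimal sketch using `value_counts(normalize=True)` on a hypothetical toy series (not the real data) is:

```python
import pandas as pd

# Toy stand-in for data['product_type_name']: 8 shirts, 1 apparel item, 1 dress.
toy_types = pd.Series(["SHIRT"] * 8 + ["APPAREL", "DRESS"], name="product_type_name")

# normalize=True returns fractions instead of counts; scale to percentages.
shares = toy_types.value_counts(normalize=True).mul(100).round(2)
print(shares.head(10))
```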


# 5. "color" feature 

In [131]:
print("Products ",data.shape[0])
print(data['color'].describe())
if data['color'].isnull().sum() == 0:
    print("- All products have 'color' information")
else:
    print("- {0} products does not have 'color' information.".format(data['color'].isnull().sum()))
if data['color'].duplicated().sum() == 0:
    print("- Each products have unique 'color' information")
else:
    print("- {0} products have same 'color' information.".format(data['color'].duplicated().sum()))

Products  182987
count     64843
unique     7377
top       Black
freq      13181
Name: color, dtype: object
- 118144 products does not have 'color' information.
- 175609 products have same 'color' information.


- 64843 of 182987 products have color information. That's approximately 35.4%.
- We have 7377 unique colors.
- About 20.33% (13181 of 64843) of the products with color information are black.
- Let's drop the records that do not have color information.

In [142]:
data = data.loc[~data['color'].isnull()]

# Reset dataframe index.
data = data.reset_index(drop=True)
print('Number of data points after eliminating color = NULL :', data.shape[0])

Number of data points after eliminating color = NULL : 64843


- Let's check the 'title' information for products with the same 'color'.

In [143]:
# Take the first 'color' that repeats and show up to six of its titles.
first_repeat = data.loc[data['color'].duplicated(), 'color'].iloc[0]
duplicate_df = data[data['color'] == first_repeat].reset_index(drop=True)
for i in range(min(6, duplicate_df.shape[0])):
    print(duplicate_df['color'][i])
    print(duplicate_df['title'][i])
    print()

Black
Standing on His Promises Rhinestones T-Shirt Ripped Cut Out Red Long

Black
Ladies Cotton Tank 2x1 Ribbed Tank Top

Black
Sexy Open Shoulder X Long Soft Shark Hemline Womans Plus Blouse Top 1x2x3x

Black
Women Roses Patch & Rhinestones T-Shirt Ripped Cut Out Short 01

Black
[Fits Cloth] Black Days Like Bless Plain T Shirt

Black
Fifth Degree Womens Gold Foil Graphic Tees Junior Top Short Sleeve Printed T Shirt - L



- It is clear that products with the same color have different titles.

In [144]:
# Let's print the top 10 colors and their frequencies
color_count = Counter(data['color'])
for color, count in color_count.most_common(10):
    print("{0} - {1}".format(color, count))

Black - 13181
White - 8599
Blue - 3558
Red - 2283
Pink - 1837
Grey - 1496
* - 1388
Green - 1257
Multi - 1203
Gray - 1189


# 6. "title" feature 

In [145]:
print("Products",data.shape[0])
print(data['title'].describe())
if data['title'].isnull().sum() == 0:
    print("- All products have 'title' information")
else:
    print("- {0} products does not have 'title' information.".format(data['title'].isnull().sum()))
if data['title'].duplicated().sum() == 0:
    print("- Each products have unique 'title' information")
else:
    print("- {0} products have same 'title' information.".format(data['title'].duplicated().sum()))

Products 64843
count                                                 64843
unique                                                61568
top       FINEJO Casual Women Long Sleeve Lace Irregular...
freq                                                     47
Name: title, dtype: object
- All products have 'title' information
- 3275 products have same 'title' information.


- All 64843 products have a title.
- We have 61568 unique titles.
- Let's check the duplicate titles.

In [153]:
# Take the first 'title' that repeats and show up to six rows that share it.
first_repeat = data.loc[data['title'].duplicated(), 'title'].iloc[0]
duplicate_df = data[data['title'] == first_repeat].reset_index(drop=True)
for i in range(min(6, duplicate_df.shape[0])):
    print("asin - {0} , Brand - {1} , Color - {2}".format(duplicate_df['asin'][i], duplicate_df['brand'][i], duplicate_df['color'][i]))
    print(duplicate_df['medium_image_url'][i])
    print(duplicate_df['title'][i])
    print()

asin - B003BTPVGA , Brand - FeatherLite , Color - Pacific Blue
https://images-na.ssl-images-amazon.com/images/I/41ygbhq127L._SL160_.jpg
FeatherLite Ladies Long Sleeve Stain Resistant Tapered Twill Shirt, Pacific B.

asin - B003BTPVHE , Brand - FeatherLite , Color - Pacific Blue
https://images-na.ssl-images-amazon.com/images/I/41ygbhq127L._SL160_.jpg
FeatherLite Ladies Long Sleeve Stain Resistant Tapered Twill Shirt, Pacific B.

asin - B003BTPVEM , Brand - FeatherLite , Color - Pacific Blue
https://images-na.ssl-images-amazon.com/images/I/41ygbhq127L._SL160_.jpg
FeatherLite Ladies Long Sleeve Stain Resistant Tapered Twill Shirt, Pacific B.

asin - B003BTOHHO , Brand - FeatherLite , Color - Pacific Blue
https://images-na.ssl-images-amazon.com/images/I/41ygbhq127L._SL160_.jpg
FeatherLite Ladies Long Sleeve Stain Resistant Tapered Twill Shirt, Pacific B.

asin - B003BTOHG0 , Brand - FeatherLite , Color - Pacific Blue
https://images-na.ssl-images-amazon.com/images/I/41ygbhq127L._SL160_.jpg
FeatherLite Ladies Long Sleeve Stain Resistant Tapered Twill Shirt, Pacific B.


- It is clear that rows with the same 'title' match on every feature except 'asin'.
- So we will drop these duplicate titles and keep one row per title.
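
The inspection above covers only the first repeated title. A way to test the same claim for every repeated title is sketched below on hypothetical toy data: group by title and count the distinct values in the other columns. The column names follow the dataset; the rows and the `check_columns` choice are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "asin": ["B1", "B2", "B3", "B4"],
    "brand": ["FeatherLite", "FeatherLite", "Acme", "Other"],
    "color": ["Pacific Blue", "Pacific Blue", "Red", "Blue"],
    "title": ["Twill Shirt", "Twill Shirt", "Tank Top", "Tank Top"],
})

check_columns = ["brand", "color"]

# For each title, count how many distinct values each checked column takes.
distinct_per_title = toy.groupby("title")[check_columns].nunique()

# A title is a "pure" duplicate when every checked column has exactly one value.
pure_duplicate = (distinct_per_title == 1).all(axis=1)
print(pure_duplicate)
```

Titles flagged `False` share a title but differ in brand or color, so dropping them by title alone would discard distinct products.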

In [154]:
# Keep the first row for each title and drop the later duplicates.
data = data.drop_duplicates(subset='title', keep='first')

# Reset dataframe index.
data = data.reset_index(drop=True)
print('Number of data points after eliminating duplicate titles :', data.shape[0])

Number of data points after eliminating duplicate titles : 61568


# 7. "formatted_price" feature 

In [156]:
print("Products ",data.shape[0])
print(data['formatted_price'].describe())
if data['formatted_price'].isnull().sum() == 0:
    print("- All products have 'formatted_price' information")
else:
    print("- {0} products does not have 'formatted_price' information.".format(data['formatted_price'].isnull().sum()))
if data['formatted_price'].duplicated().sum() == 0:
    print("- Each products have unique 'formatted_price' information")
else:
    print("- {0} products have same 'formatted_price' information.".format(data['formatted_price'].duplicated().sum()))

Products  61568
count      25955
unique      3095
top       $19.99
freq         899
Name: formatted_price, dtype: object
- 35613 products does not have 'formatted_price' information.
- 58472 products have same 'formatted_price' information.


- Only 25,955 of 61,568 products (about 42.16%) have price information.
- The most common price is "$19.99", shared by 899 products.
- Some products have the same price but a different 'title', 'brand' and other information.
- So we will drop only the products that do not have price information.
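
'formatted_price' is stored as text such as "$19.99". If the price is later needed as a number, for example to compare recommended products by price, a conversion along these lines could be used. `parse_price` is a hypothetical helper and assumes dollar-prefixed prices like those in the samples.

```python
def parse_price(formatted_price):
    """Convert a string such as '$19.99' to a float; return None if it cannot be parsed."""
    if formatted_price is None:
        return None
    # Strip whitespace, the currency symbol and thousands separators.
    cleaned = str(formatted_price).strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


for raw in ["$19.99", "$1,299.00", None, "n/a"]:
    print(repr(raw), "->", parse_price(raw))
```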

In [157]:
data = data.loc[~data['formatted_price'].isnull()]

# Reset dataframe index.
data = data.reset_index(drop=True)
print('Number of data points after eliminating price = NULL :', data.shape[0])

Number of data points after eliminating price = NULL : 25955


In [159]:
# Let's print the top 10 'formatted_price' values and their frequencies
price_count = Counter(data['formatted_price'])
for price, count in price_count.most_common(10):
    print("{0} - {1}".format(price, count))

$19.99 - 899
$9.99 - 664
$9.50 - 541
$14.99 - 412
$24.99 - 400
$7.50 - 395
$29.99 - 355
$16.99 - 288
$12.99 - 273
$8.99 - 263


- Cleaning brought the number of data points down from 183,138 to 25,955 (roughly 183K to 25K).
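
The four filtering steps used in this notebook can be gathered into one function that also records the row count after each step. The sketch below runs on hypothetical toy data. `clean_products` is an illustrative name, and the step order mirrors the order used above.

```python
import pandas as pd


def clean_products(df):
    """Apply the four filtering steps in order and record the row count after each."""
    steps = [
        ("drop missing brand", lambda d: d.dropna(subset=["brand"])),
        ("drop missing color", lambda d: d.dropna(subset=["color"])),
        ("drop duplicate titles", lambda d: d.drop_duplicates(subset="title", keep="first")),
        ("drop missing price", lambda d: d.dropna(subset=["formatted_price"])),
    ]
    counts = [("raw", len(df))]
    for label, step in steps:
        df = step(df)
        counts.append((label, len(df)))
    return df.reset_index(drop=True), counts


toy = pd.DataFrame({
    "brand": ["A", None, "B", "B", "C"],
    "color": ["Black", "Red", None, "Blue", "Blue"],
    "title": ["t1", "t2", "t3", "t4", "t4"],
    "formatted_price": ["$9.99", "$5.00", "$7.50", None, "$19.99"],
})
cleaned, counts = clean_products(toy)
for label, n in counts:
    print(f"{label}: {n}")
```

In the toy data, the two rows titled "t4" show a side effect of this ordering: the first duplicate is kept even though it has no price, so the priced duplicate is lost. Dropping rows with missing prices before removing duplicate titles would retain it, at the cost of a different final row count.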

In [25]:
data.to_csv('tops_fashion.csv',index=False)