<a href="https://colab.research.google.com/github/DSabarish/Product-Recommenders--Amazon/blob/main/working.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><b><font color='blue'>Product Recommendation Model using Content-Based Filtering</font></b></h1>


## <b>Introduction</b>

This **IPython notebook** builds a **Product Recommendation Model** intended to improve user engagement on e-commerce platforms such as **Amazon** and **Flipkart**. Recommendation systems typically use two main approaches:

## <b>Methods of Recommendation</b>

>### <b>Content-Based Filtering</b>
In a **content-based recommendation system**, product recommendations are generated based on the **attributes** of the products themselves, such as **textual descriptions** or **image features**. This approach ensures that products similar in content to those searched or viewed by the user are suggested.

>### <b>Collaborative Filtering</b>
**Collaborative filtering**, another widely used method, relies on **user behavior data**. It suggests items based on the preferences and behavior patterns of similar users. However, due to data constraints in this project, I focus exclusively on **content-based recommendation**.
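
As a rough illustration of the content-based idea, the sketch below ranks a few made-up product titles by the cosine similarity of their word counts. The catalog titles and the helper names `cosine_similarity_counts` and `recommend` are assumptions made for this example; they are not part of the dataset or of the code used later in the notebook.

```python
import math
from collections import Counter

def cosine_similarity_counts(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(query_title, catalog, top_n=2):
    """Return the top_n catalog titles most similar to query_title."""
    query_vec = Counter(query_title.lower().split())
    scored = [
        (cosine_similarity_counts(query_vec, Counter(title.lower().split())), title)
        for title in catalog
    ]
    # Highest similarity first; ties keep their catalog order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_n]]

# Hypothetical catalog, for illustration only.
catalog = [
    "blue cotton long sleeve shirt",
    "red silk evening dress",
    "blue cotton short sleeve shirt",
    "leather hiking boots",
]
print(recommend("blue cotton shirt", catalog))
```

Only the product text is used here; no user behavior data is needed, which is why this approach fits the data available in this project.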


---


### [4.3] Overview of the data

In [4]:
# Import All Necessary Packages

from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings

from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import math
import time
import re
import os

from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances
from matplotlib import gridspec
from scipy.sparse import hstack
import plotly
import plotly.figure_factory as ff
from plotly.graph_objs import Scatter, Layout

plotly.offline.init_notebook_mode(connected=True)
warnings.filterwarnings("ignore")

In [5]:
import gdown
import pandas as pd

# Google Drive URL
url = "https://drive.google.com/file/d/1y3tPyepPOweL1oez05g1O_bAZZShiaoZ/view?usp=drive_link"

# Extract the file ID from the URL
file_id = url.split('/')[5]
download_url = f'https://drive.google.com/uc?id={file_id}'

# Download the file
output = 'tops_fashion.json'
gdown.download(download_url, output, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=1y3tPyepPOweL1oez05g1O_bAZZShiaoZ
From (redirected): https://drive.google.com/uc?id=1y3tPyepPOweL1oez05g1O_bAZZShiaoZ&confirm=t&uuid=12f909b3-7b33-4dcf-a10b-f3a1295c456f
To: /content/tops_fashion.json
100%|██████████| 263M/263M [00:03<00:00, 82.6MB/s]


'tops_fashion.json'

In [6]:
data = pd.read_json(output)
"Data Loaded Successfully"

'Data Loaded Successfully'

In [7]:
print(f"Number of data points: {data.shape[0]}")
print(f"Number of features/variables: {data.shape[1]}")

Number of data points: 183138
Number of features/variables: 19


### **Terminology:**
- **Dataset:** A collection of data points or observations.
- **Rows and columns:** The rows represent individual data points, while columns represent variables or features describing each data point.
- **Data-point:** A single instance or observation within a dataset.
- **Feature/variable:** A measurable property or characteristic of a data-point, stored in columns.


In [8]:
# Each product/item has 19 features in the raw dataset
print("Number of features per product/item:", data.shape[1])
print("Column names (feature names) and data types:")
print(data.info())

Number of features per product/item: 19
Column names (feature names) and data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183138 entries, 0 to 183137
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   sku                363 non-null     object
 1   asin               183138 non-null  object
 2   product_type_name  183138 non-null  object
 3   formatted_price    28395 non-null   object
 4   author             1 non-null       object
 5   color              64956 non-null   object
 6   brand              182987 non-null  object
 7   publisher          42899 non-null   object
 8   availability       24532 non-null   object
 9   reviews            183138 non-null  object
 10  large_image_url    183138 non-null  object
 11  availability_type  24559 non-null   object
 12  small_image_url    183138 non-null  object
 13  editorial_review   2758 non-null    object
 14  title              183138 non-n

**Features:**

Of these 19 features, we will be using only 7 features in this analysis and modeling:

1. **asin**: Amazon Standard Identification Number
2. **brand**: Brand to which the product belongs
3. **color**: Color information of the apparel (can contain multiple colors)
4. **product_type_name**: Type of the apparel (e.g., SHIRT, TSHIRT)
5. **medium_image_url**: URL of the product image
6. **title**: Title of the product
7. **formatted_price**: Price of the product


In [9]:
# List of columns to extract
columns_to_extract = ['asin', 'brand', 'color', 'medium_image_url',
                      'product_type_name', 'title', 'formatted_price']

# Extracting specified columns from the DataFrame
data = data[columns_to_extract]
data.head()

Unnamed: 0,asin,brand,color,medium_image_url,product_type_name,title,formatted_price
0,B016I2TS4W,FNC7C,,https://images-na.ssl-images-amazon.com/images...,SHIRT,Minions Como Superheroes Ironman Long Sleeve R...,
1,B01N49AI08,FIG Clothing,,https://images-na.ssl-images-amazon.com/images...,SHIRT,FIG Clothing Womens Izo Tunic,
2,B01JDPCOHO,FIG Clothing,,https://images-na.ssl-images-amazon.com/images...,SHIRT,FIG Clothing Womens Won Top,
3,B01N19U5H5,Focal18,,https://images-na.ssl-images-amazon.com/images...,SHIRT,Focal18 Sailor Collar Bubble Sleeve Blouse Shi...,
4,B004GSI2OS,FeatherLite,Onyx Black/ Stone,https://images-na.ssl-images-amazon.com/images...,SHIRT,Featherlite Ladies' Long Sleeve Stain Resistan...,$26.26


In [10]:
print(f"Number of data points: {data.shape[0]}")
print(f"Number of features/variables: {data.shape[1]}")

Number of data points: 183138
Number of features/variables: 7


### [5.1] Missing data for various features.

In [11]:
def describe_feature(df, column_name):
    """
    Describe the given feature with various statistics and details.

    Parameters:
    - df: Pandas DataFrame, the DataFrame containing the feature to analyze.
    - column_name: str, the name of the column (feature) to analyze.

    Returns:
    - None (outputs results directly).
    """
    # Extract the feature (column) from the DataFrame
    feature = df[column_name]

    # Calculate top group statistics
    top_group = feature.mode().iloc[0]
    top_group_count = feature.value_counts().max()
    total_count = df.shape[0]
    top_group_percentage = (top_group_count / total_count) * 100

    # Display top 10 values and their counts
    print("\nTop 10 values and their counts:")
    print(feature.value_counts().head(10))

    # Display percentage of items from top group
    print(f"\n◼ {top_group_percentage:.2f}% of items are from the '{top_group}' group.")

    # Display number of unique values
    unique_values_count = feature.nunique()
    print(f"◼ Number of unique values: {unique_values_count}")

    # Display number of null values
    null_values_count = feature.isnull().sum()
    null_percentage = (null_values_count / df.shape[0]) * 100
    print(f"◼ Number of null values: {null_values_count} ({null_percentage:.2f}% of total)")

# Example usage:
# Assuming 'data' is your DataFrame and 'brand' is the column of interest
# describe_feature(data, 'brand')


####  Basic stats for the feature: product_type_name

In [12]:
describe_feature(data, 'product_type_name')


Top 10 values and their counts:
product_type_name
SHIRT                         167794
APPAREL                         3549
BOOKS_1973_AND_LATER            3336
DRESS                           1584
SPORTING_GOODS                  1281
SWEATER                          837
OUTERWEAR                        796
OUTDOOR_RECREATION_PRODUCT       729
ACCESSORY                        636
UNDERWEAR                        425
Name: count, dtype: int64

◼ 91.62% of items are from the 'SHIRT' group.
◼ Number of unique values: 72
◼ Number of null values: 0 (0.00% of total)


In [13]:
# Names of different product types
data['product_type_name'].unique()

array(['SHIRT', 'SWEATER', 'APPAREL', 'OUTDOOR_RECREATION_PRODUCT',
       'BOOKS_1973_AND_LATER', 'PANTS', 'HAT', 'SPORTING_GOODS', 'DRESS',
       'UNDERWEAR', 'SKIRT', 'OUTERWEAR', 'BRA', 'ACCESSORY',
       'ART_SUPPLIES', 'SLEEPWEAR', 'ORCA_SHIRT', 'HANDBAG',
       'PET_SUPPLIES', 'SHOES', 'KITCHEN', 'ADULT_COSTUME',
       'HOME_BED_AND_BATH', 'MISC_OTHER', 'BLAZER',
       'HEALTH_PERSONAL_CARE', 'TOYS_AND_GAMES', 'SWIMWEAR',
       'CONSUMER_ELECTRONICS', 'SHORTS', 'HOME', 'AUTO_PART',
       'OFFICE_PRODUCTS', 'ETHNIC_WEAR', 'BEAUTY',
       'INSTRUMENT_PARTS_AND_ACCESSORIES', 'POWERSPORTS_PROTECTIVE_GEAR',
       'SHIRTS', 'ABIS_APPAREL', 'AUTO_ACCESSORY', 'NONAPPARELMISC',
       'TOOLS', 'BABY_PRODUCT', 'SOCKSHOSIERY',
       'POWERSPORTS_RIDING_SHIRT', 'EYEWEAR', 'SUIT', 'OUTDOOR_LIVING',
       'POWERSPORTS_RIDING_JACKET', 'HARDWARE', 'SAFETY_SUPPLY',
       'ABIS_DVD', 'VIDEO_DVD', 'GOLF_CLUB', 'MUSIC_POPULAR_VINYL',
       'HOME_FURNITURE_AND_DECOR', 'TABLET_COMPUTER',

####  Basic stats for the feature: brand

In [14]:
describe_feature(data, 'brand')


Top 10 values and their counts:
brand
Zago                         223
XQS                          222
Yayun                        215
YUNY                         198
XiaoTianXin-women clothes    193
Generic                      192
Boohoo                       190
Alion                        188
TheMogan                     187
Abetteric                    187
Name: count, dtype: int64

◼ 0.12% of items are from the 'Zago' group.
◼ Number of unique values: 10577
◼ Number of null values: 151 (0.08% of total)


####  Basic stats for the feature: color

In [15]:
describe_feature(data, 'color')


Top 10 values and their counts:
color
Black    13207
White     8616
Blue      3570
Red       2289
Pink      1842
Grey      1499
*         1388
Green     1258
Multi     1203
Gray      1189
Name: count, dtype: int64

◼ 7.21% of items are from the 'Black' group.
◼ Number of unique values: 7380
◼ Number of null values: 118182 (64.53% of total)


####  Basic stats for the feature: formatted_price

In [16]:
describe_feature(data, 'formatted_price')


Top 10 values and their counts:
formatted_price
$19.99    945
$9.99     749
$9.50     601
$14.99    472
$7.50     463
$24.99    414
$29.99    370
$8.99     343
$9.01     336
$16.99    317
Name: count, dtype: int64

◼ 0.52% of items are from the '$19.99' group.
◼ Number of unique values: 3135
◼ Number of null values: 154743 (84.50% of total)


#### Basic stats for the feature: title


In [17]:
# All products have a title.
# Titles are typically descriptive of the product.
# Titles are used extensively because they are concise and informative.

describe_feature(data, 'title')


Top 10 values and their counts:
title
Nakoda Cotton Self Print Straight Kurti For Women                                77
Q-rious Women's Racerback Cotton Lycra Camsioles                                 56
FINEJO Casual Women Long Sleeve Lace Irregular Hem Blouse Tops                   47
Girlzwalk Women Cami Sleeveless Printed Swing Vest Top Plus Sizes                44
ELINA FASHION Women's Indo-Western Tunic Top Cotton Kurti                        43
Victoria Scoop Neck Front Lace Floral High-Low Top in 4 Sizes                    40
Cenizas Women's Indian Tunic Top Cotton Kurti                                    39
Indistar Womens Premium Cotton Half Sleeves Printed T-Shirts/Tops (Pack of 3)    37
Rajnandini Women's Cotton Printed Kurti                                          35
Long Sleeve Mock Neck Top                                                        32
Name: count, dtype: int64

◼ 0.04% of items are from the 'Nakoda Cotton Self Print Straight Kurti For Women' group.
◼ Num

In [18]:
# Create a directory for pickle files if it doesn't exist
pickle_folder = 'pickles'
os.makedirs(pickle_folder, exist_ok=True)

In [19]:
# Save DataFrame to pickle file
data.to_pickle(os.path.join(pickle_folder, '180k_apparel_data.pkl'))
print(f"DataFrame saved as '180k_apparel_data.pkl' in '{pickle_folder}' folder.")

DataFrame saved as '180k_apparel_data.pkl' in 'pickles' folder.


In [20]:
# We save data files at every major step in our processing as "pickle" files.
# If you are stuck or if some code takes too long to run on your laptop,
# you may use the pickle files we provide to speed things up.

# Load DataFrame from pickle file
data = pd.read_pickle('pickles/180k_apparel_data.pkl')
"pickle Loaded"

'pickle Loaded'

In [21]:
# Include only products with available price information.
# data['formatted_price'].isnull() identifies rows where price is None or Null.
# Exclude rows where price is Null.

data = data[data['formatted_price'].notnull()]
print('Number of data points after removing price=NULL:', data.shape[0])


Number of data points after removing price=NULL: 28395


In [22]:
# Consider products with color information.
# Filter out rows where color is Null using data['color'].notnull().

data = data[data['color'].notnull()]
print('Number of data points after filtering out color=NULL:', data.shape[0])

Number of data points after filtering out color=NULL: 28385


✅ We've reduced the dataset from 183K to 28K points. <br>
✅ This change ensures most workshop participants can run the code on their laptops in a reasonable time.<br>
✅ For those with powerful computers and more time, using all 183K images is recommended.


In [23]:
# Save DataFrame to pickle file
data.to_pickle(os.path.join(pickle_folder, '28k_apparel_data.pkl'))
print(f"DataFrame saved as '28k_apparel_data.pkl' in '{pickle_folder}' folder.")

DataFrame saved as '28k_apparel_data.pkl' in 'pickles' folder.


In [24]:
# Load DataFrame from pickle file
data = pd.read_pickle('pickles/28k_apparel_data.pkl')
"pickle Loaded"

'pickle Loaded'

In [25]:
data.columns

Index(['asin', 'brand', 'color', 'medium_image_url', 'product_type_name',
       'title', 'formatted_price'],
      dtype='object')

In [26]:

# '''
# from PIL import Image
# import requests
# from io import BytesIO

# for index, row in images.iterrows():
#         url = row['medium_image_url']
#         response = requests.get(url)
#         img = Image.open(BytesIO(response.content))
#         img.save('images/28k_images/'+row['asin']+'.jpeg')


# '''

### [5.2] Remove near duplicate items

#### [5.2.1] Understand about duplicates.

In [27]:
data = pd.read_pickle('pickles/28k_apparel_data.pkl')                                              # Read data from the pickle file from the previous stage

duplicate_titles_count = sum(data.duplicated('title'))                                             # Find the number of products that have duplicate titles
print(f"Number of products with duplicate titles: {duplicate_titles_count}")

print(f"We have {duplicate_titles_count} products that have the same title but different colors.") # Output a specific note about the duplicates

Number of products with duplicate titles: 2325
We have 2325 products that have the same title but different colors.


#### These shirts are exactly the same, except for size (S, M, L, XL)

<table>
<tr>
<td><img src="dedupe/B00AQ4GMCK.jpeg" width="100" height="100"> :B00AQ4GMCK</td>
<td><img src="dedupe/B00AQ4GMTS.jpeg" width="100" height="100"> :B00AQ4GMTS</td>
</tr>
<tr>
<td><img src="dedupe/B00AQ4GMLQ.jpeg" width="100" height="100"> :B00AQ4GMLQ</td>
<td><img src="dedupe/B00AQ4GN3I.jpeg" width="100" height="100"> :B00AQ4GN3I</td>
</tr>
</table>


#### These shirts are exactly the same, except for color

<table>
<tr>
<td><img src="dedupe/B00G278GZ6.jpeg" width="100" height="100"> : B00G278GZ6</td>
<td><img src="dedupe/B00G278W6O.jpeg" width="100" height="100"> : B00G278W6O</td>
</tr>
<tr>
<td><img src="dedupe/B00G278Z2A.jpeg" width="100" height="100"> : B00G278Z2A</td>
<td><img src="dedupe/B00G2786X8.jpeg" width="100" height="100"> : B00G2786X8</td>
</tr>
</table>


#### Our data contains many duplicate products like the examples above; we need to de-duplicate them for better results.


#### [5.2.2] Remove duplicates : Part 1

In [28]:
# Read data from the pickle file from the previous stage
data = pd.read_pickle('pickles/28k_apparel_data.pkl')
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 28385 entries, 4 to 183136
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   asin               28385 non-null  object
 1   brand              28292 non-null  object
 2   color              28385 non-null  object
 3   medium_image_url   28385 non-null  object
 4   product_type_name  28385 non-null  object
 5   title              28385 non-null  object
 6   formatted_price    28385 non-null  object
dtypes: object(7)
memory usage: 1.7+ MB


In [29]:
# Remove All products with very few words in title
data_sorted = data[data['title'].apply(lambda x: len(x.split())>4)]
print("After removal of products with short description:", data_sorted.shape[0])

After removal of products with short description: 27949


In [30]:
# Sort the whole data by title (reverse alphabetical order, since ascending=False)
data_sorted = data_sorted.sort_values('title', ascending=False)
data_sorted[:5]

Unnamed: 0,asin,brand,color,medium_image_url,product_type_name,title,formatted_price
61973,B06Y1KZ2WB,Éclair,Black/Pink,https://images-na.ssl-images-amazon.com/images...,SHIRT,Éclair Women's Printed Thin Strap Blouse Black...,$24.99
133820,B010RV33VE,xiaoming,Pink,https://images-na.ssl-images-amazon.com/images...,SHIRT,xiaoming Womens Sleeveless Loose Long T-shirts...,$18.19
81461,B01DDSDLNS,xiaoming,White,https://images-na.ssl-images-amazon.com/images...,SHIRT,xiaoming Women's White Long Sleeve Single Brea...,$21.58
75995,B00X5LYO9Y,xiaoming,Red Anchors,https://images-na.ssl-images-amazon.com/images...,SHIRT,xiaoming Stripes Tank Patch/Bear Sleeve Anchor...,$15.91
151570,B00WPJG35K,xiaoming,White,https://images-na.ssl-images-amazon.com/images...,SHIRT,xiaoming Sleeve Sheer Loose Tassel Kimono Woma...,$14.32


#### Some examples of duplicate titles that differ only in the last few words.

<pre>
Title 1:
16. woman's place is in the house and the senate shirts for Womens XXL White
17. woman's place is in the house and the senate shirts for Womens M Grey

Title 2:
25. tokidoki The Queen of Diamonds Women's Shirt X-Large
26. tokidoki The Queen of Diamonds Women's Shirt Small
27. tokidoki The Queen of Diamonds Women's Shirt Large

Title 3:
61. psychedelic colorful Howling Galaxy Wolf T-shirt/Colorful Rainbow Animal Print Head Shirt for woman Neon Wolf t-shirt
62. psychedelic colorful Howling Galaxy Wolf T-shirt/Colorful Rainbow Animal Print Head Shirt for woman Neon Wolf t-shirt
63. psychedelic colorful Howling Galaxy Wolf T-shirt/Colorful Rainbow Animal Print Head Shirt for woman Neon Wolf t-shirt
64. psychedelic colorful Howling Galaxy Wolf T-shirt/Colorful Rainbow Animal Print Head Shirt for woman Neon Wolf t-shirt
</pre>
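
The comparison behind this stage can be sketched on its own: count the word positions at which two titles differ, and treat titles that differ in at most two positions as the same product. The helper names `count_differing_words` and `dedupe_sorted` are illustrative, and this simplified version keeps the first title of every run of near-duplicates, so it is a sketch of the idea rather than a line-for-line copy of the loop in the following cells.

```python
import itertools

def count_differing_words(title_a, title_b):
    """Number of word positions at which the two titles differ."""
    a, b = title_a.split(), title_b.split()
    matches = sum(1 for x, y in itertools.zip_longest(a, b) if x == y)
    return max(len(a), len(b)) - matches

def dedupe_sorted(titles, max_diff=2):
    """Assumes titles are already sorted; keeps the first title of each run."""
    kept = []
    for title in titles:
        # Compare against the most recently kept title only (adjacent runs).
        if not kept or count_differing_words(kept[-1], title) > max_diff:
            kept.append(title)
    return kept

titles = [
    "tokidoki The Queen of Diamonds Women's Shirt X-Large",
    "tokidoki The Queen of Diamonds Women's Shirt Small",
    "tokidoki The Queen of Diamonds Women's Shirt Large",
    "woman's place is in the house and the senate shirts for Womens XXL White",
    "woman's place is in the house and the senate shirts for Womens M Grey",
]
print(dedupe_sorted(titles))
```

With the example titles above, the three tokidoki listings collapse to one entry, and so do the two "woman's place" listings.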

In [31]:
indices = []
for i,row in data_sorted.iterrows():
    indices.append(i)

In [32]:
import itertools
stage1_dedupe_asins = []
i = 0
j = 0
num_data_points = data_sorted.shape[0]
while i < num_data_points and j < num_data_points:

    previous_i = i

    # store the list of words of ith string in a, ex: a = ['tokidoki', 'The', 'Queen', 'of', 'Diamonds', 'Women's', 'Shirt', 'X-Large']
    a = data['title'].loc[indices[i]].split()

    # search for the similar products sequentially
    j = i+1
    while j < num_data_points:

        # store the list of words of jth string in b, ex: b = ['tokidoki', 'The', 'Queen', 'of', 'Diamonds', 'Women's', 'Shirt', 'Small']
        b = data['title'].loc[indices[j]].split()

        # store the maximum length of two strings
        length = max(len(a), len(b))

        # count is used to store the number of words that are matched in both strings
        count  = 0

        # itertools.zip_longest(a,b) pairs up the corresponding words in both strings and pads the shorter one with None
        # example: a =['a', 'b', 'c', 'd']
        # b = ['a', 'b', 'd']
        # itertools.zip_longest(a,b): will give [('a','a'), ('b','b'), ('c','d'), ('d', None)]
        for k in itertools.zip_longest(a,b):
            if (k[0] == k[1]):
                count += 1

        # if the two strings differ in more than 2 words, we consider the two apparels to be different
        # if the two strings differ in 2 words or fewer, we consider the two apparels to be the same and ignore the second one
        if (length - count) > 2: # number of words in which the two titles differ
            # the strings differ by more than 2 words, so we include the 1st string's asin
            stage1_dedupe_asins.append(data_sorted['asin'].loc[indices[i]])

            # if this comparison involves the last title (j == num_data_points-1) and the two differ in more than 2 words, we include both
            if j == num_data_points-1: stage1_dedupe_asins.append(data_sorted['asin'].loc[indices[j]])

            # start searching for apparels similar to the 2nd string
            i = j
            break
        else:
            j += 1
    if previous_i == i:
        break

In [33]:
data = data.loc[data['asin'].isin(stage1_dedupe_asins)]

#### We removed the duplicates that differ only at the end.

In [34]:
print('Number of data points : ', data.shape[0])

Number of data points :  17593


In [35]:
# Save DataFrame to pickle file
pickle_folder = 'pickles'  # Ensure this directory exists
data.to_pickle(os.path.join(pickle_folder, '17k_apparel_data.pkl'))
print(f"DataFrame saved as '17k_apparel_data.pkl' in '{pickle_folder}' folder.")

DataFrame saved as '17k_apparel_data.pkl' in 'pickles' folder.


#### [5.2.3] Remove duplicates : Part 2

<pre>

In the previous stage, we sorted the whole data in alphabetical order of titles. Then we removed titles that are adjacent and very similar.

But there are some products whose titles are very similar but not adjacent.

Examples:

Titles-1
86261.  UltraClub Women's Classic Wrinkle-Free Long Sleeve Oxford Shirt, Pink, XX-Large
115042. UltraClub Ladies Classic Wrinkle-Free Long-Sleeve Oxford Light Blue XXL

Titles-2
75004.  EVALY Women's Cool University Of UTAH 3/4 Sleeve Raglan Tee
109225. EVALY Women's Unique University Of UTAH 3/4 Sleeve Raglan Tees
120832. EVALY Women's New University Of UTAH 3/4-Sleeve Raglan Tshirt

</pre>
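
One way to picture this second stage is an all-pairs check: a title is kept only if it differs from every title kept so far by at least three words. The sketch below is an assumption-laden simplification (the function names are made up, and it scans from the front rather than popping from the end as the cell below does), but it uses the same word-by-word comparison and the same threshold. Building a new `kept` list also avoids modifying a list while iterating over it.

```python
import itertools

def count_differing_words(title_a, title_b):
    """Number of word positions at which the two titles differ."""
    a, b = title_a.split(), title_b.split()
    matches = sum(1 for x, y in itertools.zip_longest(a, b) if x == y)
    return max(len(a), len(b)) - matches

def dedupe_all_pairs(titles, min_diff=3):
    """Keep a title only if it differs from every kept title by >= min_diff words.

    Worst case O(n^2) comparisons, like the notebook's second stage.
    """
    kept = []
    for title in titles:
        if all(count_differing_words(title, k) >= min_diff for k in kept):
            kept.append(title)
    return kept

titles = [
    "EVALY Women's Cool University Of UTAH 3/4 Sleeve Raglan Tee",
    "UltraClub Women's Classic Wrinkle-Free Long Sleeve Oxford Shirt, Pink, XX-Large",
    "EVALY Women's Unique University Of UTAH 3/4 Sleeve Raglan Tees",
]
print(dedupe_all_pairs(titles))
```

Here the two EVALY titles are not adjacent, yet the second one is dropped because it differs from the first in only two word positions.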

In [36]:
# Read data from the pickle file from the previous stage
data = pd.read_pickle('pickles/17k_apparel_data.pkl')
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17593 entries, 4 to 183120
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   asin               17593 non-null  object
 1   brand              17543 non-null  object
 2   color              17593 non-null  object
 3   medium_image_url   17593 non-null  object
 4   product_type_name  17593 non-null  object
 5   title              17593 non-null  object
 6   formatted_price    17593 non-null  object
dtypes: object(7)
memory usage: 1.1+ MB


In [39]:
# This code snippet takes significant amount of time.
# O(n^2) time.
# Takes about an hour to run on a decent computer.

indices = []
for i,row in data.iterrows():
    indices.append(i)

stage2_dedupe_asins = []
while len(indices)!=0:
    i = indices.pop()
    stage2_dedupe_asins.append(data['asin'].loc[i])
    # consider the first apparel's title
    a = data['title'].loc[i].split()
    # store the list of words of the ith string in a, ex: a = ['tokidoki', 'The', 'Queen', 'of', 'Diamonds', 'Women's', 'Shirt', 'X-Large']
    # NOTE: indices.remove(j) below mutates the list being iterated, so the element right after each removed one is skipped in this pass;
    # iterating over a copy (for j in list(indices)) would check every candidate, but may change the count printed below.
    for j in indices:

        b = data['title'].loc[j].split()
        # store the list of words of the jth string in b, ex: b = ['tokidoki', 'The', 'Queen', 'of', 'Diamonds', 'Women's', 'Shirt', 'X-Large']

        length = max(len(a),len(b))

        # count is used to store the number of words that are matched in both strings
        count  = 0

        # itertools.zip_longest(a,b) pairs up the corresponding words in both strings and pads the shorter one with None
        # example: a =['a', 'b', 'c', 'd']
        # b = ['a', 'b', 'd']
        # itertools.zip_longest(a,b): will give [('a','a'), ('b','b'), ('c','d'), ('d', None)]
        for k in itertools.zip_longest(a,b):
            if (k[0]==k[1]):
                count += 1

        # if the two strings differ in fewer than 3 words, we consider the two apparels to be the same and drop the second one
        if (length - count) < 3:
            indices.remove(j)

In [40]:
# from whole previous products we will consider only
# the products that are found in previous cell
data = data.loc[data['asin'].isin(stage2_dedupe_asins)]

In [41]:
print('Number of data points after stage two of dedupe: ',data.shape[0])
# from 17k apparel items we reduced to 16k apparel items

Number of data points after stage two of dedupe:  16435


In [45]:
# Store the de-duplicated products in a pickle file.
# Readers who prefer not to process the full 180K dataset
# can download and use this file from the Google Drive folder instead.
pickle_folder = 'pickles'  # Ensure this directory exists
data.to_pickle(os.path.join(pickle_folder, '16k_apperal_data.pkl'))
print(f"DataFrame saved as '16k_apperal_data.pkl' in '{pickle_folder}' folder.")

DataFrame saved as '16k_apperal_data.pkl' in 'pickles' folder.


# 6. Text pre-processing

In [46]:
# Load the file saved in the previous cell (folder 'pickles', with the .pkl extension)
data = pd.read_pickle('pickles/16k_apperal_data.pkl')

# Download the NLTK stop words. [RUN ONLY ONCE]
# Open a Terminal (Linux/Mac) or Command Prompt (Windows)
# and type these commands:
# $ python3
# >>> import nltk
# >>> nltk.download('stopwords')

In [None]:
# we use the list of stop words downloaded from the nltk library.
stop_words = set(stopwords.words('english'))
print('list of stop words:', stop_words)

def nlp_preprocessing(total_text, index, column):
    if isinstance(total_text, str):
        string = ""
        for words in total_text.split():
            # remove special characters such as '"#$@!%^&*()_+-~?>< from the title
            word = ("".join(e for e in words if e.isalnum()))
            # convert all letters to lower-case
            word = word.lower()
            # stop-word removal
            if not word in stop_words:
                string += word + " "
        # .loc avoids the chained assignment data[column][index], which may not write back to data
        data.loc[index, column] = string

In [None]:
start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
# we take each title and preprocess its text.
for index, row in data.iterrows():
    nlp_preprocessing(row['title'], index, 'title')
# we print the time it took to preprocess all the titles
print(time.perf_counter() - start_time, "seconds")

In [None]:
data.head()
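
The cleaning steps above (strip non-alphanumeric characters, lower-case, drop stop words) can also be written as a small function that returns the cleaned string instead of writing into the DataFrame. This is only a sketch: `clean_title` is a hypothetical helper, and the tiny stop-word set stands in for the NLTK list so that the example runs without any download. Unlike `nlp_preprocessing`, it skips tokens that become empty and leaves no trailing space.

```python
def clean_title(title, stop_words):
    """Lower-case a title, strip non-alphanumeric characters, drop stop words."""
    cleaned = []
    for token in title.split():
        word = "".join(ch for ch in token if ch.isalnum()).lower()
        # Skip tokens that were pure punctuation, and skip stop words.
        if word and word not in stop_words:
            cleaned.append(word)
    return " ".join(cleaned)

# Small stand-in stop-word set (an assumption; the notebook uses NLTK's English list).
example_stop_words = {"the", "of", "for", "and", "in"}

print(clean_title("The Queen of Diamonds Women's Shirt, X-Large!", example_stop_words))
# prints: queen diamonds womens shirt xlarge
```

A function of this shape could be applied to the whole column with something like `data['title'].apply(lambda t: clean_title(t, stop_words))`, which avoids assigning row by row.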

In [None]:
data.to_pickle('pickles/16k_apperal_data_preprocessed')

## Stemming

In [None]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('arguing'))
print(stemmer.stem('fishing'))


# We tried stemming our titles, and it did not work very well.


# [8] Text based product similarity

In [None]:
data = pd.read_pickle('pickles/16k_apperal_data_preprocessed')
data.head()

In [None]:
# Utility functions that we will use throughout the rest of the workshop.

import requests  # needed by display_img to download product images

# Display an image
def display_img(url,ax,fig):
    # we get the url of the apparel and download the image
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    # we display it in the notebook
    plt.imshow(img)

#plotting code to understand the algorithm's decision.
def plot_heatmap(keys, values, labels, url, text):
        # keys: list of words of recommended title
        # values: len(values) ==  len(keys), values(i) represents the occurence of the word keys(i)
        # labels: len(labels) == len(keys), the values of labels depends on the model we are using
                # if model == 'bag of words': labels(i) = values(i)
                # if model == 'tfidf weighted bag of words':labels(i) = tfidf(keys(i))
                # if model == 'idf weighted bag of words':labels(i) = idf(keys(i))
        # url : apparel's url

        # we will divide the whole figure into two parts
        gs = gridspec.GridSpec(2, 2, width_ratios=[4,1], height_ratios=[4,1])
        fig = plt.figure(figsize=(25,3))

        # 1st, plot a heat map that represents the counts of the words in title2 that also occur in title1
        ax = plt.subplot(gs[0])
        # it displays a cell in white color if the word is in the intersection (list of words of title1 and list of words of title2), in black if not
        ax = sns.heatmap(np.array([values]), annot=np.array([labels]))
        ax.set_xticklabels(keys) # set the x-axis labels to the words of the title
        ax.set_title(text) # apparel title

        # 2nd, plot the image of the apparel
        ax = plt.subplot(gs[1])
        # we don't want any grid lines for the image, and no ticks on the x-axis and y-axis
        ax.grid(False)
        ax.set_xticks([])
        ax.set_yticks([])

        # we call display_img with the url parameter
        display_img(url, ax, fig)

        # displays combine figure ( heat map and image together)
        plt.show()

def plot_heatmap_image(doc_id, vec1, vec2, url, text, model):

    # doc_id : index of title1
    # vec1 : input apparel's vector, a dict of the form {word: count}
    # vec2 : recommended apparel's vector, a dict of the form {word: count}
    # url : apparel's image url
    # text: title of the recommended apparel (used as the title of the plot)
    # model: it can be any of the models,
        # 1. bag_of_words
        # 2. tfidf
        # 3. idf

    # we find the words common to both titles, because only these words contribute to the dot product between the two title vectors
    intersection = set(vec1.keys()) & set(vec2.keys())

    # we set the values of non intersecting words to zero, this is just to show the difference in heatmap
    for i in vec2:
        if i not in intersection:
            vec2[i]=0

    # for labeling heatmap, keys contains list of all words in title2
    keys = list(vec2.keys())
    #  if ith word in intersection(lis of words of title1 and list of words of title2): values(i)=count of that word in title2 else values(i)=0
    values = [vec2[x] for x in vec2.keys()]

    # labels: len(labels) == len(keys), the values of labels depends on the model we are using
        # if model == 'bag of words': labels(i) = values(i)
        # if model == 'tfidf weighted bag of words':labels(i) = tfidf(keys(i))
        # if model == 'idf weighted bag of words':labels(i) = idf(keys(i))

    if model == 'bag_of_words':
        labels = values
    elif model == 'tfidf':
        labels = []
        for x in vec2.keys():
            # tfidf_title_vectorizer.vocabulary_ it contains all the words in the corpus
            # tfidf_title_features[doc_id, index_of_word_in_corpus] will give the tfidf value of word in given document (doc_id)
            if x in  tfidf_title_vectorizer.vocabulary_:
                labels.append(tfidf_title_features[doc_id, tfidf_title_vectorizer.vocabulary_[x]])
            else:
                labels.append(0)
    elif model == 'idf':
        labels = []
        for x in vec2.keys():
            # idf_title_vectorizer.vocabulary_ it contains all the words in the corpus
            # idf_title_features[doc_id, index_of_word_in_corpus] will give the idf value of word in given document (doc_id)
            if x in  idf_title_vectorizer.vocabulary_:
                labels.append(idf_title_features[doc_id, idf_title_vectorizer.vocabulary_[x]])
            else:
                labels.append(0)

    plot_heatmap(keys, values, labels, url, text)


# this function returns the words in "text" along with the frequency
# of each word
def text_to_vector(text):
    word = re.compile(r'\w+')
    words = word.findall(text)
    # words stores the list of all words in the given string; for titles without punctuation, 'words = text.split()' gives the same result
    return Counter(words) # Counter counts the occurrences of each word in the list; it returns a dict-like object {word1: count}



def get_result(doc_id, content_a, content_b, url, model):
    text1 = content_a
    text2 = content_b

    # vector1 = dict{word11:#count, word12:#count, etc.}
    vector1 = text_to_vector(text1)

    # vector2 = dict{word21:#count, word22:#count, etc.}
    vector2 = text_to_vector(text2)

    plot_heatmap_image(doc_id, vector1, vector2, url, text2, model)
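
To make the word-count vectors concrete, here is a small sketch of what `text_to_vector` returns and how the shared-word set used by `plot_heatmap_image` is formed. The two titles are made up for illustration and are not taken from the dataset.

```python
import re
from collections import Counter

def text_to_vector(text):
    # same logic as the notebook's helper: count every \w+ token
    return Counter(re.findall(r'\w+', text))

# hypothetical titles, for illustration only
vec1 = text_to_vector("red cotton top red")
vec2 = text_to_vector("blue cotton top")

# words present in both titles
intersection = set(vec1) & set(vec2)

# counts of title 2, with words missing from title 1 set to zero
values = [vec2[w] if w in intersection else 0 for w in vec2]

print(vec1["red"])           # 2
print(sorted(intersection))  # ['cotton', 'top']
print(values)                # [0, 1, 1]
```

The zeroed entries are what the heatmap shows in a different color: words of the recommended title that the query title does not contain.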

## [8.2] Bag of Words (BoW) on product titles.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
title_vectorizer = CountVectorizer()
title_features   = title_vectorizer.fit_transform(data['title'])
title_features.get_shape() # get number of rows and columns in feature matrix.
# title_features.shape = #data_points * #words_in_corpus
# CountVectorizer().fit_transform(corpus) returns
# a sparse matrix of dimensions #data_points * #words_in_corpus

# What is a sparse vector?

# title_features[doc_id, index_of_word_in_corpus] = number of times the word occurred in that doc
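
As a rough answer to the question "What is a sparse vector?": a title uses only a handful of the words in the corpus vocabulary, so almost every entry of its count vector is zero, and a sparse matrix stores only the non-zero entries. The sketch below mimics that idea with plain dictionaries on a made-up three-title corpus. It illustrates the concept only; it is not how `CountVectorizer` is implemented.

```python
from collections import Counter

# hypothetical mini corpus, for illustration only
corpus = ["red cotton top", "blue denim jacket", "red silk scarf red"]

# vocabulary: every distinct word gets a column index
all_words = sorted({w for title in corpus for w in title.split()})
vocabulary = {w: i for i, w in enumerate(all_words)}

# sparse rows: store only {column_index: count} for words that occur
sparse_rows = [
    {vocabulary[w]: c for w, c in Counter(title.split()).items()}
    for title in corpus
]

n_rows, n_cols = len(corpus), len(vocabulary)
stored = sum(len(row) for row in sparse_rows)

print((n_rows, n_cols))                   # (3, 8)
print(stored, "of", n_rows * n_cols)      # 9 of 24
print(sparse_rows[2][vocabulary["red"]])  # 2
```

With a real corpus of thousands of titles the fraction of stored entries is far smaller, which is why the feature matrix is kept in sparse form.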



In [None]:
def bag_of_words_model(doc_id, num_results):
    # doc_id: apparel's id in given corpus

    # pairwise_dist will store the distance from the given input apparel to every apparel in the corpus (including itself)
    # pairwise_distances uses the Euclidean metric by default: d(X, Y) = ||X - Y||
    # pass metric='cosine' to use the cosine distance 1 - <X, Y> / (||X||*||Y||) instead
    # http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity
    pairwise_dist = pairwise_distances(title_features,title_features[doc_id])

    # np.argsort will return the indices of the num_results smallest distances
    indices = np.argsort(pairwise_dist.flatten())[0:num_results]
    # pdists will store the num_results smallest distances
    pdists  = np.sort(pairwise_dist.flatten())[0:num_results]

    # data frame indices of the num_results smallest distances
    df_indices = list(data.index[indices])

    for i in range(0,len(indices)):
        # we will pass 1. doc_id, 2. title1, 3. title2, 4. url, 5. model
        get_result(indices[i],data['title'].loc[df_indices[0]], data['title'].loc[df_indices[i]], data['medium_image_url'].loc[df_indices[i]], 'bag_of_words')
        print('ASIN :',data['asin'].loc[df_indices[i]])
        print ('Brand:', data['brand'].loc[df_indices[i]])
        print ('Title:', data['title'].loc[df_indices[i]])
        print ('Euclidean distance from the query image :', pdists[i])
        print('='*60)

#call the bag-of-words model for a product to get similar products.
bag_of_words_model(12566, 20) # change the index if you want to.
# In the output heat map each value represents the count
# of the label word; the color represents the intersection
# with the input's title.

#try 12566
#try 931
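
The ranking above depends on which metric `pairwise_distances` uses. Called without a `metric` argument it computes the Euclidean distance; the cosine distance requires `metric='cosine'`. The sketch below computes both by hand on made-up count vectors (the vectors and helper names are illustrative assumptions) to show that the two metrics can rank neighbours differently.

```python
import math

def euclidean(x, y):
    # straight-line distance between two count vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    # 1 - cosine similarity; ignores vector length, compares direction only
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

query = [1, 1, 0]
long_title = [3, 3, 0]   # same words as the query, each repeated three times
other = [1, 0, 1]        # shares only one word with the query

print(euclidean(query, long_title), euclidean(query, other))
print(cosine_distance(query, long_title), cosine_distance(query, other))
```

Under the Euclidean metric `other` is the closer neighbour, while under the cosine metric `long_title` is (its distance is zero up to rounding), so the choice of metric matters when titles differ in length.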

## [8.5] TF-IDF based product similarity

In [None]:
# min_df=1 (the default) keeps every word; recent scikit-learn versions reject the integer 0
tfidf_title_vectorizer = TfidfVectorizer(min_df = 1)
tfidf_title_features = tfidf_title_vectorizer.fit_transform(data['title'])
# tfidf_title_features.shape = #data_points * #words_in_corpus
# TfidfVectorizer().fit_transform(corpus) returns a sparse matrix of dimensions #data_points * #words_in_corpus
# tfidf_title_features[doc_id, index_of_word_in_corpus] = tfidf value of the word in the given doc

In [None]:
def tfidf_model(doc_id, num_results):
    # doc_id: apparel's id in given corpus

    # pairwise_dist will store the distance from the given input apparel to every apparel in the corpus (including itself)
    # pairwise_distances uses the Euclidean metric by default: d(X, Y) = ||X - Y||
    # pass metric='cosine' to use the cosine distance 1 - <X, Y> / (||X||*||Y||) instead
    # http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity
    pairwise_dist = pairwise_distances(tfidf_title_features,tfidf_title_features[doc_id])

    # np.argsort will return the indices of the num_results smallest distances
    indices = np.argsort(pairwise_dist.flatten())[0:num_results]
    # pdists will store the num_results smallest distances
    pdists  = np.sort(pairwise_dist.flatten())[0:num_results]

    # data frame indices of the num_results smallest distances
    df_indices = list(data.index[indices])

    for i in range(0,len(indices)):
        # we will pass 1. doc_id, 2. title1, 3. title2, 4. url, 5. model
        get_result(indices[i], data['title'].loc[df_indices[0]], data['title'].loc[df_indices[i]], data['medium_image_url'].loc[df_indices[i]], 'tfidf')
        print('ASIN :',data['asin'].loc[df_indices[i]])
        print('BRAND :',data['brand'].loc[df_indices[i]])
        print ('Euclidean distance from the given image :', pdists[i])
        print('='*125)
tfidf_model(12566, 20)
# in the output heat map each value represents the tfidf value of the label word; the color represents the intersection with the input's title
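
For reference, with its default settings scikit-learn's `TfidfVectorizer` uses a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and then L2-normalises each row. The sketch below reproduces that arithmetic by hand on a made-up corpus; the titles and helper names are assumptions for illustration, and it does not replicate the vectorizer's tokenisation.

```python
import math
from collections import Counter

# hypothetical titles, for illustration only
corpus = ["red cotton top", "blue cotton top", "red silk scarf"]
docs = [title.split() for title in corpus]
n_docs = len(docs)

# document frequency: in how many titles each word appears
df = Counter(w for d in docs for w in set(d))

def smooth_idf(word):
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf_row(doc):
    counts = Counter(doc)
    raw = {w: c * smooth_idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}  # L2-normalised row

row = tfidf_row(docs[2])
print(row)
```

In the third title, "silk" and "scarf" occur in one title only and therefore receive a larger weight than "red", which occurs in two.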

## [8.5] IDF based product similarity

In [None]:
idf_title_vectorizer = CountVectorizer()
idf_title_features = idf_title_vectorizer.fit_transform(data['title'])

# idf_title_features.shape = #data_points * #words_in_corpus
# CountVectorizer().fit_transform(corpus) returns a sparse matrix of dimensions #data_points * #words_in_corpus
# idf_title_features[doc_id, index_of_word_in_corpus] = number of times the word occurred in that doc

In [None]:
def n_containing(word):
    # return the number of documents which had the given word
    return sum(1 for blob in data['title'] if word in blob.split())

def idf(word):
    # idf = log(#number of docs / #number of docs which had the given word)
    return math.log(data.shape[0] / (n_containing(word)))
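
As a quick sanity check of the formula, the sketch below applies the same definition, idf(w) = log(N / n_w), to a made-up four-title corpus. The list `titles` stands in for `data['title']` and is an assumption for illustration.

```python
import math

# hypothetical titles standing in for data['title']
titles = ["red cotton top", "blue cotton top", "green cotton shirt", "red silk scarf"]

def n_containing(word, titles):
    # number of titles that contain the word as a whole token
    return sum(1 for t in titles if word in t.split())

def idf(word, titles):
    # idf = log(number of titles / number of titles containing the word)
    return math.log(len(titles) / n_containing(word, titles))

print(idf("cotton", titles))  # log(4/3), roughly 0.29
print(idf("silk", titles))    # log(4/1), roughly 1.39
```

A frequent word such as "cotton" gets a low weight and a rare word such as "silk" a high one. Note that a word contained in no title would make the denominator zero, so the function is only safe for words that appear at least once under the same whitespace tokenisation.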

In [None]:
# we need to convert the values into float
# np.float was removed in NumPy 1.24; np.float64 is the explicit equivalent
idf_title_features  = idf_title_features.astype(np.float64)

for i in idf_title_vectorizer.vocabulary_.keys():
    # for every word in whole corpus we will find its idf value
    idf_val = idf(i)

    # to calculate idf_title_features we need to replace the count values with the idf values of the word
    # idf_title_features[:, idf_title_vectorizer.vocabulary_[i]].nonzero()[0] will return all documents in which the word i is present
    for j in idf_title_features[:, idf_title_vectorizer.vocabulary_[i]].nonzero()[0]:

        # we replace the count value of word i in document j with the idf value of word i
        # idf_title_features[doc_id, index_of_word_in_corpus] = idf value of word
        idf_title_features[j,idf_title_vectorizer.vocabulary_[i]] = idf_val


In [None]:
def idf_model(doc_id, num_results):
    # doc_id: apparel's id in given corpus

    # pairwise_dist will store the distance from the given input apparel to every apparel in the corpus (including itself)
    # pairwise_distances uses the Euclidean metric by default: d(X, Y) = ||X - Y||
    # pass metric='cosine' to use the cosine distance 1 - <X, Y> / (||X||*||Y||) instead
    # http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity
    pairwise_dist = pairwise_distances(idf_title_features,idf_title_features[doc_id])

    # np.argsort will return the indices of the num_results smallest distances
    indices = np.argsort(pairwise_dist.flatten())[0:num_results]
    # pdists will store the num_results smallest distances
    pdists  = np.sort(pairwise_dist.flatten())[0:num_results]

    # data frame indices of the num_results smallest distances
    df_indices = list(data.index[indices])

    for i in range(0,len(indices)):
        get_result(indices[i],data['title'].loc[df_indices[0]], data['title'].loc[df_indices[i]], data['medium_image_url'].loc[df_indices[i]], 'idf')
        print('ASIN :',data['asin'].loc[df_indices[i]])
        print('Brand :',data['brand'].loc[df_indices[i]])
        print ('euclidean distance from the given image :', pdists[i])
        print('='*125)



idf_model(12566,20)
# in the output heat map each value represents the idf values of the label word, the color represents the intersection with inputs title

# [9] Text Semantics based product similarity

In [None]:

# credits: https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors
# Custom Word2Vec using your own text data.
# Do NOT RUN this code.
# It is meant as a reference to build your own Word2Vec when you have
# lots of data.

'''
# Set values for various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 1    # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

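# Initialize and train the model (this will take some time)
# Note: gensim >= 4.0 names the dimensionality parameter vector_size (older releases used size)
from gensim.models import word2vec
print ("Training model...")
model = word2vec.Word2Vec(sen_corpus, workers=num_workers, \
            vector_size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)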

'''

In [None]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

# in this project we are using a pretrained model by Google
# it's a 3.3G file; once you load it into memory
# it occupies ~9GB, so please do this step only if you have >12GB of RAM
# we will provide a pickle file which contains a dict
# with all our corpus words as keys and model[word] as values
# To use this code snippet, download "GoogleNews-vectors-negative300.bin"
# from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# it's 1.9GB in size.

'''
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
'''

#if you do NOT have RAM >= 12GB, use the code below.
with open('word2vec_model', 'rb') as handle:
    model = pickle.load(handle)


In [None]:
# Utility functions

def get_word_vec(sentence, doc_id, m_name):
    # sentence : title of the apparel
    # doc_id: document id in our corpus
    # m_name: model information it will take two values
        # if  m_name == 'avg', we will append the model[i], w2v representation of word i
        # if m_name == 'weighted', we will multiply each w2v[word] with the idf(word)
    vec = []
    for i in sentence.split():
        if i in vocab:
            if m_name == 'weighted' and i in  idf_title_vectorizer.vocabulary_:
                vec.append(idf_title_features[doc_id, idf_title_vectorizer.vocabulary_[i]] * model[i])
            elif m_name == 'avg':
                vec.append(model[i])
        else:
            # if a word of our corpus is not in the Google word2vec vocabulary, we represent it with a zero vector
            vec.append(np.zeros(shape=(300,)))
    # we will return a numpy array of shape (#number of words in title * 300 ) 300 = len(w2v_model[word])
    # each row represents the word2vec representation of each word (weighted/avg) in the given sentence
    return  np.array(vec)

def get_distance(vec1, vec2):
    # vec1 = np.array(#number_of_words_title1 * 300), each row is a vector of length 300 corresponding to a word in the given title
    # vec2 = np.array(#number_of_words_title2 * 300), each row is a vector of length 300 corresponding to a word in the given title

    final_dist = []
    # for each vector in vec1 we calculate the distance (euclidean) to all vectors in vec2
    for i in vec1:
        dist = []
        for j in vec2:
            # np.linalg.norm(i-j) will result the euclidean distance between vectors i, j
            dist.append(np.linalg.norm(i-j))
        final_dist.append(np.array(dist))
    # final_dist = np.array(#number of words in title1 * #number of words in title2)
    # final_dist[i,j] = euclidean distance between vectors i, j
    return np.array(final_dist)


def heat_map_w2v(sentence1, sentence2, url, doc_id1, doc_id2, model):
    # sentence1 : title1, input apparel
    # sentence2 : title2, recommended apparel
    # url: apparel image url
    # doc_id1: document id of input apparel
    # doc_id2: document id of recommended apparel
    # model: it can have two values, 1. avg 2. weighted

    #s1_vec = np.array(#number_of_words_title1 * 300), each row is a vector (weighted/avg) of length 300 corresponding to a word in the given title
    s1_vec = get_word_vec(sentence1, doc_id1, model)
    #s2_vec = np.array(#number_of_words_title2 * 300), each row is a vector (weighted/avg) of length 300 corresponding to a word in the given title
    s2_vec = get_word_vec(sentence2, doc_id2, model)

    # s1_s2_dist = np.array(#number of words in title1 * #number of words in title2)
    # s1_s2_dist[i,j] = euclidean distance between words i, j
    s1_s2_dist = get_distance(s1_vec, s2_vec)



    # divide the whole figure into 2 parts: the 1st part displays the heatmap, the 2nd part displays the image of the apparel
    gs = gridspec.GridSpec(2, 2, width_ratios=[4,1],height_ratios=[2,1])
    fig = plt.figure(figsize=(15,15))

    ax = plt.subplot(gs[0])
    # plotting the heat map based on the pairwise distances
    ax = sns.heatmap(np.round(s1_s2_dist,4), annot=True)
    # set the x axis labels as recommended apparels title
    ax.set_xticklabels(sentence2.split())
    # set the y axis labels as input apparels title
    ax.set_yticklabels(sentence1.split())
    # set title as recommended apparels title
    ax.set_title(sentence2)

    ax = plt.subplot(gs[1])
    # we remove all grids and axis labels for image
    ax.grid(False)
    ax.set_xticks([])
    ax.set_yticks([])
    display_img(url, ax, fig)

    plt.show()

In [None]:
# vocab = stores all the words that are there in google w2v model
# vocab = model.wv.vocab.keys() # if you are using Google word2Vec (in gensim >= 4.0: model.key_to_index.keys())

vocab = model.keys()
# this function adds the vectors of each word and returns the avg vector of the given sentence
def build_avg_vec(sentence, num_features, doc_id, m_name):
    # sentence: the title of the apparel
    # num_features: the length of the word2vec vector, its value = 300
    # m_name: model information, it will take two values
        # if  m_name == 'avg', we will add model[i], the w2v representation of word i
        # if m_name == 'weighted', we will multiply each w2v[word] with the idf(word)

    featureVec = np.zeros((num_features,), dtype="float32")
    # we initialize a vector of size 300 with all zeros
    # we add each word2vec(wordi) to this featureVec
    nwords = 0

    for word in sentence.split():
        nwords += 1
        if word in vocab:
            if m_name == 'weighted' and word in  idf_title_vectorizer.vocabulary_:
                featureVec = np.add(featureVec, idf_title_features[doc_id, idf_title_vectorizer.vocabulary_[word]] * model[word])
            elif m_name == 'avg':
                featureVec = np.add(featureVec, model[word])
    if(nwords>0):
        featureVec = np.divide(featureVec, nwords)
    # returns the avg vector of the given sentence, its shape is (300,)
    return featureVec
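
The averaging step can be sketched with toy 3-dimensional vectors in place of the 300-dimensional Word2Vec vectors. `toy_model` and `avg_vec` are hypothetical stand-ins for `model` and `build_avg_vec` (in 'avg' mode), written with plain lists so the example runs on its own.

```python
# hypothetical 3-dimensional "word vectors", for illustration only
toy_model = {"red": [1.0, 0.0, 0.0], "top": [0.0, 1.0, 0.0]}

def avg_vec(sentence, model, num_features):
    total = [0.0] * num_features
    nwords = 0
    for word in sentence.split():
        nwords += 1                      # every word counts, known or not
        if word in model:
            total = [t + v for t, v in zip(total, model[word])]
    if nwords > 0:
        total = [t / nwords for t in total]
    return total

print(avg_vec("red top", toy_model, 3))          # [0.5, 0.5, 0.0]
print(avg_vec("red zzz top xyz", toy_model, 3))  # [0.25, 0.25, 0.0]
```

As in `build_avg_vec`, the sum is divided by the total number of words in the title, including words that have no vector, so titles with many out-of-vocabulary words end up with shorter average vectors.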

### [9.2] Average Word2Vec product similarity.

In [None]:
doc_id = 0
w2v_title = []
# for every title we build an avg vector representation
for i in data['title']:
    w2v_title.append(build_avg_vec(i, 300, doc_id,'avg'))
    doc_id += 1

# w2v_title = np.array(# number of docs in corpus * 300), each row corresponds to a doc
w2v_title = np.array(w2v_title)


In [None]:
def avg_w2v_model(doc_id, num_results):
    # doc_id: apparel's id in given corpus

    # dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))
    pairwise_dist = pairwise_distances(w2v_title, w2v_title[doc_id].reshape(1,-1))

    # np.argsort will return the indices of the num_results smallest distances
    indices = np.argsort(pairwise_dist.flatten())[0:num_results]
    # pdists will store the num_results smallest distances
    pdists  = np.sort(pairwise_dist.flatten())[0:num_results]

    # data frame indices of the num_results smallest distances
    df_indices = list(data.index[indices])

    for i in range(0, len(indices)):
        heat_map_w2v(data['title'].loc[df_indices[0]],data['title'].loc[df_indices[i]], data['medium_image_url'].loc[df_indices[i]], indices[0], indices[i], 'avg')
        print('ASIN :',data['asin'].loc[df_indices[i]])
        print('BRAND :',data['brand'].loc[df_indices[i]])
        print ('euclidean distance from given input image :', pdists[i])
        print('='*125)


avg_w2v_model(12566, 20)
# in the given heat map, each cell contains the euclidean distance between words i, j

### [9.4]  IDF weighted Word2Vec for product similarity

In [None]:
doc_id = 0
w2v_title_weight = []
# for every title we build a weighted vector representation
for i in data['title']:
    w2v_title_weight.append(build_avg_vec(i, 300, doc_id,'weighted'))
    doc_id += 1
# w2v_title_weight = np.array(# number of docs in corpus * 300), each row corresponds to a doc
w2v_title_weight = np.array(w2v_title_weight)

In [None]:
def weighted_w2v_model(doc_id, num_results):
    # doc_id: apparel's id in given corpus

    # pairwise_dist will store the distance from the given input apparel to every apparel in the corpus (including itself)
    # pairwise_distances uses the Euclidean metric by default: d(X, Y) = ||X - Y||
    # pass metric='cosine' to use the cosine distance 1 - <X, Y> / (||X||*||Y||) instead
    # http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity
    pairwise_dist = pairwise_distances(w2v_title_weight, w2v_title_weight[doc_id].reshape(1,-1))

    # np.argsort will return the indices of the num_results smallest distances
    indices = np.argsort(pairwise_dist.flatten())[0:num_results]
    # pdists will store the num_results smallest distances
    pdists  = np.sort(pairwise_dist.flatten())[0:num_results]

    # data frame indices of the num_results smallest distances
    df_indices = list(data.index[indices])

    for i in range(0, len(indices)):
        heat_map_w2v(data['title'].loc[df_indices[0]],data['title'].loc[df_indices[i]], data['medium_image_url'].loc[df_indices[i]], indices[0], indices[i], 'weighted')
        print('ASIN :',data['asin'].loc[df_indices[i]])
        print('Brand :',data['brand'].loc[df_indices[i]])
        print('euclidean distance from input :', pdists[i])
        print('='*125)

weighted_w2v_model(12566, 20)
#931
#12566
# in the given heat map, each cell contains the euclidean distance between words i, j

### [9.6] Weighted similarity using brand and color.

In [None]:
# some of the brand values are empty.
# Need to replace Null with the string "Not given"
data['brand'] = data['brand'].fillna(value="Not given")

# replace spaces with hyphens
brands = [x.replace(" ", "-") for x in data['brand'].values]
types = [x.replace(" ", "-") for x in data['product_type_name'].values]
colors = [x.replace(" ", "-") for x in data['color'].values]

brand_vectorizer = CountVectorizer()
brand_features = brand_vectorizer.fit_transform(brands)

type_vectorizer = CountVectorizer()
type_features = type_vectorizer.fit_transform(types)

color_vectorizer = CountVectorizer()
color_features = color_vectorizer.fit_transform(colors)

extra_features = hstack((brand_features, type_features, color_features)).tocsr()
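
It may be worth checking how the hyphenated strings are tokenised. `CountVectorizer` lowercases its input and, by default, extracts tokens with the pattern `(?u)\b\w\w+\b`, for which a hyphen is a word boundary. The sketch below applies that pattern directly with `re` on a made-up brand name; joining with an underscore is shown only as one possible alternative and is not what the notebook does.

```python
import re

# default token_pattern of scikit-learn's CountVectorizer
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokens(text):
    # mimic the default behaviour: lowercase, then extract tokens
    return re.findall(TOKEN_PATTERN, text.lower())

print(tokens("Calvin Klein"))   # ['calvin', 'klein']
print(tokens("Calvin-Klein"))   # ['calvin', 'klein']
print(tokens("Calvin_Klein"))   # ['calvin_klein']
```

Under that default, a multi-word brand still contributes one feature per word. If a single feature per brand is wanted, an underscore join or a custom `token_pattern` would be options to consider.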

In [None]:
def heat_map_w2v_brand(sentance1, sentance2, url, doc_id1, doc_id2, df_id1, df_id2, model):

    # sentance1 : title1, input apparel
    # sentance2 : title2, recommended apparel
    # url: apparel image url
    # doc_id1: document id of input apparel
    # doc_id2: document id of recommended apparel
    # df_id1: index of document1 in the data frame
    # df_id2: index of document2 in the data frame
    # model: it can have two values, 1. avg 2. weighted

    #s1_vec = np.array(#number_of_words_title1 * 300), each row is a vector (weighted/avg) of length 300 corresponding to a word in the given title
    s1_vec = get_word_vec(sentance1, doc_id1, model)
    #s2_vec = np.array(#number_of_words_title2 * 300), each row is a vector (weighted/avg) of length 300 corresponding to a word in the given title
    s2_vec = get_word_vec(sentance2, doc_id2, model)

    # s1_s2_dist = np.array(#number of words in title1 * #number of words in title2)
    # s1_s2_dist[i,j] = euclidean distance between words i, j
    s1_s2_dist = get_distance(s1_vec, s2_vec)

    data_matrix = [['Asin','Brand', 'Color', 'Product type'],
               [data['asin'].loc[df_id1],brands[doc_id1], colors[doc_id1], types[doc_id1]], # input apparel's features
               [data['asin'].loc[df_id2],brands[doc_id2], colors[doc_id2], types[doc_id2]]] # recommended apparel's features

    colorscale = [[0, '#1d004d'],[.5, '#f2e5ff'],[1, '#f2e5d1']] # to color the headings of each column

    # we create a table with the data_matrix
    table = ff.create_table(data_matrix, index=True, colorscale=colorscale)
    # plot it with plotly
    plotly.offline.iplot(table, filename='simple_table')

    # divide the whole figure space into a 25 * 15 grid
    gs = gridspec.GridSpec(25, 15)
    fig = plt.figure(figsize=(25,5))

    # in the first 10 columns (25 * 0:10) we plot the heatmap
    ax1 = plt.subplot(gs[:, :-5])
    # plotting the heat map based on the pairwise distances
    ax1 = sns.heatmap(np.round(s1_s2_dist,6), annot=True)
    # set the x axis labels as recommended apparels title
    ax1.set_xticklabels(sentance2.split())
    # set the y axis labels as input apparels title
    ax1.set_yticklabels(sentance1.split())
    # set title as recommended apparels title
    ax1.set_title(sentance2)

    # in the last 5 columns (25 * 10:15) we display the image
    ax2 = plt.subplot(gs[:, 10:16])
    # we don't display grid lines and axis labels for images
    ax2.grid(False)
    ax2.set_xticks([])
    ax2.set_yticks([])

    # pass the url to display the image
    display_img(url, ax2, fig)

    plt.show()

In [None]:
def idf_w2v_brand(doc_id, w1, w2, num_results):
    # doc_id: apparel's id in given corpus
    # w1: weight for  w2v features
    # w2: weight for brand and color features

    # pairwise_dist will store the distance from the given input apparel to every apparel in the corpus (including itself)
    # pairwise_distances uses the Euclidean metric by default: d(X, Y) = ||X - Y||
    # pass metric='cosine' to use the cosine distance 1 - <X, Y> / (||X||*||Y||) instead
    # http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity
    idf_w2v_dist  = pairwise_distances(w2v_title_weight, w2v_title_weight[doc_id].reshape(1,-1))
    ex_feat_dist = pairwise_distances(extra_features, extra_features[doc_id])
    pairwise_dist   = (w1 * idf_w2v_dist +  w2 * ex_feat_dist)/float(w1 + w2)

    # np.argsort will return the indices of the num_results smallest distances
    indices = np.argsort(pairwise_dist.flatten())[0:num_results]
    # pdists will store the num_results smallest distances
    pdists  = np.sort(pairwise_dist.flatten())[0:num_results]

    # data frame indices of the num_results smallest distances
    df_indices = list(data.index[indices])


    for i in range(0, len(indices)):
        heat_map_w2v_brand(data['title'].loc[df_indices[0]],data['title'].loc[df_indices[i]], data['medium_image_url'].loc[df_indices[i]], indices[0], indices[i],df_indices[0], df_indices[i], 'weighted')
        print('ASIN :',data['asin'].loc[df_indices[i]])
        print('Brand :',data['brand'].loc[df_indices[i]])
        print('euclidean distance from input :', pdists[i])
        print('='*125)

idf_w2v_brand(12566, 5, 5, 20)
# in the given heat map, each cell contains the euclidean distance between words i, j

In [None]:
# brand and color weight =50
# title vector weight = 5

idf_w2v_brand(12566, 5, 50, 20)
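
The effect of the two weights can be seen with a little arithmetic. The sketch below applies the same weighted-average formula, (w1 * d_title + w2 * d_extra) / (w1 + w2), to made-up distances for three candidates; `combine` and the numbers are assumptions for illustration.

```python
def combine(d_title, d_extra, w1, w2):
    # weighted average of the two distance lists, as in idf_w2v_brand
    return [(w1 * a + w2 * b) / float(w1 + w2) for a, b in zip(d_title, d_extra)]

# hypothetical distances from the query to three candidates
d_title = [0.0, 1.0, 3.0]   # title (IDF-weighted Word2Vec) distances
d_extra = [0.0, 2.0, 0.5]   # brand / type / color distances

equal = combine(d_title, d_extra, 5, 5)
heavy = combine(d_title, d_extra, 5, 50)

print(equal)   # [0.0, 1.5, 1.75]
print(heavy)   # the third candidate now has the smaller distance
```

With equal weights the second candidate ranks ahead of the third; once the brand and color weight is raised to 50, the third candidate, which matches better on brand and color, moves ahead.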

# [10.2] Keras and Tensorflow to extract features

In [None]:
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dropout, Flatten, Dense
from keras import applications
from sklearn.metrics import pairwise_distances
import matplotlib.pyplot as plt
import requests
from PIL import Image
import pandas as pd
import pickle

In [None]:
# https://gist.github.com/fchollet/f35fbc80e066a49d65f1688a7e99f069
# Code reference: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html



# This code takes 40 minutes to run on a modern GPU (graphics card)
# like Nvidia  1050.
# GPU (NVidia 1050): 0.175 seconds per image

# This code takes 160 minutes to run on a high end i7 CPU
# CPU (i7): 0.615 seconds per image.

#Do NOT run this code unless you want to wait a few hours for it to generate output

# each image is converted into 25088 length dense-vector


'''
# dimensions of our images.
img_width, img_height = 224, 224

top_model_weights_path = 'bottleneck_fc_model.h5'
train_data_dir = 'images2/'
nb_train_samples = 16042
epochs = 50
batch_size = 1


def save_bottlebeck_features():

    #Function to compute VGG-16 CNN for image feature extraction.

    asins = []
    datagen = ImageDataGenerator(rescale=1. / 255)

    # build the VGG16 network
    model = applications.VGG16(include_top=False, weights='imagenet')
    generator = datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_width, img_height),
        batch_size=batch_size,
        class_mode=None,
        shuffle=False)

    for i in generator.filenames:
        asins.append(i[2:-5])

    bottleneck_features_train = model.predict_generator(generator, nb_train_samples // batch_size)
    bottleneck_features_train = bottleneck_features_train.reshape((16042,25088))

    np.save(open('16k_data_cnn_features.npy', 'wb'), bottleneck_features_train)
    np.save(open('16k_data_cnn_feature_asins.npy', 'wb'), np.array(asins))


save_bottlebeck_features()

'''
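
The length 25088 follows from the VGG16 architecture: with `include_top=False`, a 224 x 224 input passes through five 2 x 2 max-pooling stages and the last convolutional block has 512 channels. A short check of that arithmetic (the variable names are illustrative):

```python
img_width, img_height = 224, 224
num_pool_stages = 5      # VGG16 halves the spatial size five times
final_channels = 512     # channels in the last convolutional block

out_w = img_width // 2 ** num_pool_stages
out_h = img_height // 2 ** num_pool_stages
feature_length = out_w * out_h * final_channels

print(out_w, out_h, feature_length)   # 7 7 25088
```

This is why the bottleneck features are reshaped to `(16042, 25088)`: one flattened 7 x 7 x 512 block per image.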

# [10.3] Visual features based product similarity.

In [None]:
#load the features and corresponding ASINS info.
bottleneck_features_train = np.load('16k_data_cnn_features.npy')
asins = np.load('16k_data_cnn_feature_asins.npy')
asins = list(asins)

# load the original 16K dataset
data = pd.read_pickle('pickels/16k_apperal_data_preprocessed')
df_asins = list(data['asin'])


from IPython.display import display, Image, SVG, Math, YouTubeVideo


#get similar products using CNN features (VGG-16)
def get_similar_products_cnn(doc_id, num_results):
    doc_id = asins.index(df_asins[doc_id])
    pairwise_dist = pairwise_distances(bottleneck_features_train, bottleneck_features_train[doc_id].reshape(1,-1))

    indices = np.argsort(pairwise_dist.flatten())[0:num_results]
    pdists  = np.sort(pairwise_dist.flatten())[0:num_results]

    for i in range(len(indices)):
        rows = data[['medium_image_url','title']].loc[data['asin']==asins[indices[i]]]
        for indx, row in rows.iterrows():
            display(Image(url=row['medium_image_url'], embed=True))
            print('Product Title: ', row['title'])
            print('Euclidean Distance from input image:', pdists[i])
            print('Amazon Url: www.amazon.com/dp/'+ asins[indices[i]])

get_similar_products_cnn(12566, 20)


In [None]:
## Assignment
=========================

In [None]:
def weighted_similarity(doc_id,num_results,text_vector="idf_w2v",wt = 1,wb = 1,wc = 1,wi = 1):
    """
    This function considers 4 vectors.
    1. Text (IDF-weighted Word2Vec, "idf_w2v", is used by default.)
    2. Brand
    3. Color
    4. Feature vector extracted from the image

    The weighted distance is calculated from the weights provided.
    By default all weights are 1.

    Pass text_vector as a string to choose the text vectorizer:
    "bow", "tfidf", "avg_w2v" or "idf_w2v".

    """

    # pairwise_dist will store the distance from the given input apparel to every apparel in the corpus
    # collect the distances for the given doc_id from each feature set.

    if text_vector == "bow":
        # bow
        text_dist = pairwise_distances(title_features,title_features[doc_id])
    elif text_vector =="tfidf":
        # tfidf
        text_dist = pairwise_distances(tfidf_title_features,tfidf_title_features[doc_id])
    elif text_vector == "avg_w2v":
        # avg_w2v
        text_dist = pairwise_distances(w2v_title, w2v_title[doc_id].reshape(1,-1))
    elif text_vector == "idf_w2v":
        # idf weighted avg_w2v
        text_dist  = pairwise_distances(w2v_title_weight, w2v_title_weight[doc_id].reshape(1,-1))
    else:
        raise ValueError("text_vector must be one of 'bow', 'tfidf', 'avg_w2v', 'idf_w2v'")

    brand_dist = pairwise_distances(brand_features, brand_features[doc_id])
    color_dist = pairwise_distances(color_features, color_features[doc_id])
    image_dist = pairwise_distances(bottleneck_features_train, bottleneck_features_train[doc_id].reshape(1,-1))

    # calculating the weighted average of the distances.
    pairwise_dist   = ((wt * text_dist) +(wb * brand_dist) +(wc * color_dist) + (wi * image_dist))/float(wt + wb + wc + wi)

    # np.argsort will return the indices of the num_results smallest distances
    indices = np.argsort(pairwise_dist.flatten())[0:num_results]
    # pdists will store the num_results smallest distances
    pdists  = np.sort(pairwise_dist.flatten())[0:num_results]

    # data frame indices of the num_results smallest distances
    df_indices = list(data.index[indices])


    for i in range(0, len(indices)):
        heat_map_w2v_brand(data['title'].loc[df_indices[0]],data['title'].loc[df_indices[i]], data['medium_image_url'].loc[df_indices[i]], indices[0], indices[i],df_indices[0], df_indices[i], 'weighted')
        print('ASIN :',data['asin'].loc[df_indices[i]])
        print('Brand :',data['brand'].loc[df_indices[i]])
        print('Color :',data['color'].loc[df_indices[i]])
        print('euclidean distance from input :', pdists[i])
        print('='*125)
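
One assumption in `weighted_similarity` deserves a check: `bottleneck_features_train[doc_id]` is indexed with the same `doc_id` as the text features, whereas `get_similar_products_cnn` first maps the id through `asins.index(...)`. If the saved CNN features are not in the same order as `data`, the rows would need to be re-aligned before the distances are combined. A minimal sketch of such an alignment, with made-up ASINs and one-dimensional features (`row_of`, `cnn_features` and `aligned` are hypothetical names):

```python
# hypothetical orders: dataframe order vs. order of the saved CNN features
df_asins = ["A1", "A2", "A3"]
asins = ["A3", "A1", "A2"]
cnn_features = [[0.3], [0.1], [0.2]]   # row i belongs to asins[i]

# map each ASIN to its row in the CNN feature matrix
row_of = {asin: i for i, asin in enumerate(asins)}

# reorder the CNN rows so that row k corresponds to df_asins[k]
aligned = [cnn_features[row_of[a]] for a in df_asins]

print(aligned)   # [[0.1], [0.2], [0.3]]
```

With a NumPy array, the same reordering can be written as indexing the array with the list `[row_of[a] for a in df_asins]`. Whether this step is needed depends on how the feature file was generated.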

#### 1. Consider [text bow] + [brand] + [color] + [image]
======================================================
- Weights are assigned for above
- By default all weights are 1.

In [None]:
weighted_similarity(12566,10,text_vector="bow")

Weight to text_vector = 0

In [None]:
weighted_similarity(12566,10,text_vector="tfidf",wt=0,wb=1,wc=3,wi=5)

Weight to text_vector = "bow"

In [None]:
weighted_similarity(12566,10,text_vector="bow",wt=5,wb=2,wc=2,wi=5)

Weight to text_vector = "avg_w2v"

In [None]:
weighted_similarity(12566,10,text_vector="avg_w2v",wt=5,wb=2,wc=2,wi=5)

#### 2. Consider  [text tfidf ] + [brand] + [color] + [image]
======================================================
- Weights are assigned for above
- By default all weights are 1.

In [None]:
weighted_similarity(12566,10,text_vector="tfidf",wt=5,wb=5,wc=5,wi=5)

More weight to text, brand and color compared to image

In [None]:
weighted_similarity(12566,10,text_vector="tfidf",wt=10,wb=10,wc=10,wi=5)

Brand weight set to zero. More weight to text, color and image

In [None]:
weighted_similarity(12566,10,text_vector="tfidf",wt=0.10,wb=0,wc=0.10,wi=0.25)

#### 3. Consider [text avg_w2v] + [brand] + [color] + [image]
======================================================
- Weights are assigned for above
- By default all weights are 1.

In [None]:
weighted_similarity(12566,10,text_vector="avg_w2v",wt=10,wb=10,wc=10,wi=10)

- Equal weights to text, brand and color.
- Less weight to image.

In [None]:
weighted_similarity(12566,10,text_vector="avg_w2v",wt=10,wb=10,wc=10,wi=1)

In [None]:
weighted_similarity(12566,10,text_vector="avg_w2v",wt=10,wb=10,wc=0,wi=20)

#### 4. Consider [text idf_w2v] + [brand] + [color] + [image]
======================================================
- Weights are assigned for above
- By default all weights are 1.

In [None]:
weighted_similarity(12566,10,text_vector="idf_w2v",wt=10,wb=10,wc=0,wi=20)

Equal weights to text,brand,color and lesser weight to image

In [None]:
weighted_similarity(12566,10,text_vector="idf_w2v",wt=10,wb=10,wc=10,wi=2)

- Zero weight to brand.
- Lesser to image.
- More weight to text.

In [None]:
weighted_similarity(12566,10,text_vector="idf_w2v",wt=15,wb=0,wc=10,wi=5)

In [None]:
weighted_similarity(12566,10,text_vector="idf_w2v",wt=15,wb=10,wc=15,wi=5)

### Observations:
=====================

1. Results were best when only text, brand and color were considered.
2. Recommended products were completely different when a combination of all the vectors was considered.
3. A wide range of different apparels was recommended when the **feature extracted image vector** was considered.
4. When the feature extracted image vector was considered
    - An increase in the similarity distance was found.
    - Recommended products were completely different and not relevant.
    - Performance deteriorated when higher weights were assigned.

5. The reason behind it could be that the **apparel image dataset** is very different from the **imagenet dataset**.
6. The feature extracted image vector could improve the product recommendation if **VGG16 is fine-tuned on the apparel image dataset.**
7. Among the various text encodings, the best result was found with tfidf_avg_w2v (`idf_w2v` in the code).


### CaseStudy Flow:
=========================

- The objective of the case study is to recommend similar apparel products (women's tops).
- The data was obtained from the **Amazon Product Advertising API.**
- A total of **183k data points were obtained, with 19 features** such as asin, author, availability, availability_type, brand, color, formatted_price, etc.
- We are considering only 9 features.
    1. asin  (Amazon standard identification number)
    2. brand (brand to which the product belongs)
    3. color (color information of the apparel; it can contain many colors as a value, e.g. red and black stripes)
    4. product_type_name (type of the apparel, e.g. SHIRT/TSHIRT)
    5. medium_image_url  (url of the image)
    6. title (title of the product)
    7. formatted_price (price of the product)
- Data pre-processing and cleaning was done. Null values for brand and color were replaced with a hyphen.
- EDA was done on the product titles to remove similar/analogous titles.
- The final dataframe contains 16k data points with 9 features.
- Categorical features were one-hot encoded with the help of CountVectorizer.
- The text feature was encoded using bow, tfidf, avg_w2v and tfidf_avg_w2v.
- Similar products based on text were shown to check whether it is working correctly or not.
- **VGG16** was used to extract **features from product images**.
- A combination of weighted vectors [text + (brand + color) + image] is used to get similar products.
- Observations are noted down wherever necessary.
- Results of the case study are summarized at the end.
