#**RECOMMENDATION SYSTEM BASED ON CONTENT**

A content-based recommendation system using TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer suggests items by comparing the similarity between item descriptions and a user's preferences. It analyzes the content of items a user has interacted with and recommends similar items by comparing their characteristics.
**TF-IDF** transforms text into numerical vectors, reflecting the importance of words in relation to the entire dataset. By calculating the cosine similarity between these vectors, the system recommends items most similar to those the user has interacted with previously.

**TF-IDF** stands for **Term Frequency-Inverse Document Frequency**. It is a numerical statistic used to evaluate the importance of a word in a document relative to a collection of documents (or corpus). A **TF-IDF Vectorizer** computes this statistic for every term in every document. It is widely used in information retrieval and text mining, as it transforms text data into a numerical format that machine learning algorithms can work with.

 **Term Frequency (TF)**: *This measures how frequently a term appears in a document. It’s calculated by dividing the number of times a term appears in a document by the total number of terms in that document. Higher frequencies indicate that the term is more important in that particular document.*

 **Inverse Document Frequency (IDF)**: *This measures how rare, and therefore how informative, a term is across the entire corpus. It is calculated by taking the logarithm of the total number of documents divided by the number of documents containing the term. This reduces the weight of common terms (like "the" or "and") that appear in many documents.*

**Combining TF and IDF**: *The TF-IDF score is the product of TF and IDF. This means that a term will have a high score if it appears frequently in a specific document but is rare across the entire corpus, highlighting its significance*.
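
The three definitions above can be worked through by hand on a tiny corpus. The sketch below is illustrative only: the three documents are made up and the helper names `tf`, `idf` and `tf_idf` are hypothetical, not part of any library.

```python
import math

# Toy corpus (illustrative only, not taken from the Myntra dataset).
docs = [
    "blue slim fit jeans",
    "blue cotton shirt",
    "black leather belt",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    """Term frequency: occurrences of `term` divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing `term`)."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)

def tf_idf(term, tokens, corpus):
    """TF-IDF score: the product of TF and IDF."""
    return tf(term, tokens) * idf(term, corpus)

# Both terms occur once in the first document, so their TF is the same.
blue_score = tf_idf("blue", tokenized[0], tokenized)    # "blue" is in 2 of 3 documents
jeans_score = tf_idf("jeans", tokenized[0], tokenized)  # "jeans" is in 1 of 3 documents
print(f"blue: {blue_score:.4f}, jeans: {jeans_score:.4f}")
```

The rarer term "jeans" scores higher than "blue" even though both appear once in the document. Note that scikit-learn's `TfidfVectorizer` uses a slightly different variant by default (raw counts for TF, a smoothed IDF, and L2 normalization of each document vector), so its numbers will not match this textbook calculation exactly.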



In [28]:
# Importing libraries

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [29]:
# Loading Dataset

data = pd.read_csv('myntra_products_catalog.csv')

#The Dataset.


In [30]:
data.head()

Unnamed: 0,Product_ID,Product_Name,Product_Brand,Gender,Price,Num_Images,Description,Primary_Color
0,10017413,DKNY Unisex Black & Grey Printed Medium Trolle...,DKNY,Unisex,11745,7,"Black and grey printed medium trolley bag, sec...",Black
1,10016283,EthnoVogue Women Beige & Grey Made to Measure ...,EthnoVogue,Women,5810,7,Beige & Grey made to measure kurta with churid...,Beige
2,10009781,SPYKAR Women Pink Alexa Super Skinny Fit High-...,SPYKAR,Women,899,7,Pink coloured wash 5-pocket high-rise cropped ...,Pink
3,10015921,Raymond Men Blue Self-Design Single-Breasted B...,Raymond,Men,5599,5,Blue self-design bandhgala suitBlue self-desig...,Blue
4,10017833,Parx Men Brown & Off-White Slim Fit Printed Ca...,Parx,Men,759,5,"Brown and off-white printed casual shirt, has ...",White


#The number of rows and columns in the dataset.


In [31]:
print('Number of rows: ', data.shape[0])
print('Number of columns: ', data.shape[1])

Number of rows:  12491
Number of columns:  8


#Information about the dataset

In [32]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12491 entries, 0 to 12490
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Product_ID     12491 non-null  int64 
 1   Product_Name   12491 non-null  object
 2   Product_Brand  12491 non-null  object
 3   Gender         12491 non-null  object
 4   Price          12491 non-null  int64 
 5   Num_Images     12491 non-null  int64 
 6   Description    12491 non-null  object
 7   Primary_Color  11597 non-null  object
dtypes: int64(3), object(5)
memory usage: 780.8+ KB


#Display the summary statistics of the dataset.

In [33]:
data.describe()

Unnamed: 0,Product_ID,Price,Num_Images
count,12491.0,12491.0,12491.0
mean,9917160.0,1452.660956,4.913698
std,1438006.0,2118.503976,1.092333
min,101206.0,90.0,1.0
25%,10062150.0,649.0,5.0
50%,10154630.0,920.0,5.0
75%,10215650.0,1499.0,5.0
max,10275140.0,63090.0,10.0


In [34]:
# Summary of the dataset restricted to columns with object (string) values

data.describe(include='object')

Unnamed: 0,Product_Name,Product_Brand,Gender,Description,Primary_Color
count,12491,12491,12491,12491,11597
unique,10761,677,6,10435,27
top,Parx Men Blue Slim Fit Checked Casual Shirt,Indian Terrain,Women,"Blue medium wash 5-pocket mid-rise jeans, clea...",Blue
freq,16,971,5126,54,3443



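#The highest price in each Gender category.

Note: `'Product_Name': 'max'` returns the alphabetically last product name in each group, not the name of the product with the highest price, so the two columns below do not describe the same product.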

In [35]:
data.groupby(['Gender']).agg({'Price': 'max', 'Product_Name':'max'})

Gender,Price,Product_Name
Boys,3999,t-base Boys Yellow Colourblocked Lightweight J...
Girls,3800,t-base Girls Red Printed Lightweight Puffer Ja...
Men,58854,plusS Men Navy Blue Solid Straight-Fit Trackpants
Unisex,63090,fancy mart Yellow & Green Artificial Flowers a...
Unisex Kids,1799,berrytree Unisex Navy Blue Solid Polo Collar S...
Women,56192,yelloe Black Solid Sling Bag
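
The aggregation above takes the maximum of `Price` and of `Product_Name` independently, and the maximum of a string column is its alphabetically last value. A minimal sketch of one way to keep price and name from the same row, using `idxmax` on a made-up toy catalog (the rows and the names `toy`, `top_rows` and `most_expensive` are assumptions for illustration):

```python
import pandas as pd

# Toy catalog with the same column names as the dataset (values are made up).
toy = pd.DataFrame({
    "Gender": ["Men", "Men", "Women", "Women"],
    "Price": [500, 900, 1200, 300],
    "Product_Name": ["Zed Shirt", "Alpha Jacket", "Beta Saree", "Yara Bag"],
})

# idxmax returns, for each group, the index label of the row with the highest Price.
top_rows = toy.groupby("Gender")["Price"].idxmax()

# Selecting those rows keeps Price and Product_Name from the same product.
most_expensive = toy.loc[top_rows, ["Gender", "Price", "Product_Name"]].set_index("Gender")
print(most_expensive)
```

On the real data the same pattern would be applied to `data` instead of `toy`. When several products tie for the highest price, `idxmax` keeps only the first one.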


#Preprocess the data and handle missing values as necessary


In [36]:
data.isnull().sum()

Unnamed: 0,0
Product_ID,0
Product_Name,0
Product_Brand,0
Gender,0
Price,0
Num_Images,0
Description,0
Primary_Color,894


In [37]:
# The Primary_Color column has 894 null values; fill them with an empty string so later processing does not have to deal with NaN

data['Primary_Color'] = data['Primary_Color'].fillna('')

In [38]:
data.isnull().sum()

Unnamed: 0,0
Product_ID,0
Product_Name,0
Product_Brand,0
Gender,0
Price,0
Num_Images,0
Description,0
Primary_Color,0


#Using cosine similarity to calculate the similarity between products based on their description to get content based recommendation


In [39]:
# TF-IDF evaluates the importance of a word in a document relative to a collection of documents (the corpus)
# Create a TfidfVectorizer that removes English stop words

tfidf = TfidfVectorizer(stop_words='english')

In [40]:
# Forming a matrix for transformed data after transforming 'Description' column
# Fit and transform the data to a tfidf matrix

tfidf_matrix = tfidf.fit_transform(data['Description'])

tfidf_matrix

<12491x8418 sparse matrix of type '<class 'numpy.float64'>'
	with 204097 stored elements in Compressed Sparse Row format>

In [41]:
# Shape of tfidf_matrix: (number of products, number of distinct terms)

tfidf_matrix.shape

(12491, 8418)

In [42]:
# All features in tfidf

tfidf.get_feature_names_out()

array(['000', '01', '015', ..., 'zones', 'zoom', 'zoop'], dtype=object)

In [43]:
# Build a DataFrame from the TF-IDF matrix (toarray() converts it to a dense array), with one column per term

tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())

In [44]:
# Insert the original 'Description' text as the first column

tfidf_df.insert(0, 'Description', data['Description'])

In [45]:
#tf-idf transformed dataset

tfidf_df

Unnamed: 0,Description,000,01,015,01shade,01what,02shade,02what,03shade,03what,...,zirconiaclosure,zirconiaplating,zirconiasecured,zit,zonal,zoneallover,zoned,zones,zoom,zoop
0,"Black and grey printed medium trolley bag, sec...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Beige & Grey made to measure kurta with churid...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Pink coloured wash 5-pocket high-rise cropped ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Blue self-design bandhgala suitBlue self-desig...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Brown and off-white printed casual shirt, has ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12486,"Black dark wash 5-pocket low-rise jeans, clean...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12487,"A pair of gold-toned open toe heels, has regul...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12488,Navy Blue and White printed mid-rise denim sho...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12489,Bvlgari Men Aqva Pour Homme Marine Eau de Toil...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
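
The column names above include fused tokens such as `zirconiaclosure` and `zoneallover`, and the descriptions themselves contain run-together words (`suitBlue`, `BrownLens`), which suggests that whitespace was lost between sentences in the source text. A minimal, heuristic sketch of one possible cleanup step, assuming a lowercase letter directly followed by an uppercase letter marks a missing space (`split_fused_words` is a hypothetical helper):

```python
import re

def split_fused_words(text):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase letter."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

print(split_fused_words("Blue self-design bandhgala suitBlue self-design"))
print(split_fused_words("Lens colour: BrownLens feature: Regular Lens"))
```

If this were adopted, it would be applied to the `Description` column before `fit_transform`. The heuristic is deliberately crude: it would also split legitimate mixed-case words such as some brand names, and it cannot recover boundaries where both sides are lowercase.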


In [46]:
# Use a single term to find every product whose description contains it (non-zero TF-IDF weight)

tfidf_df.loc[tfidf_df['zoom'] > 0, ['Description', 'zoom']]

Unnamed: 0,Description,zoom
7924,Lens colour: BrownLens feature: Regular LensFr...,0.285286
7947,Lens colour: BrownLens feature: Regular LensFr...,0.285286
7953,Lens colour: BrownLens feature: Polarised lens...,0.276952
7978,Lens colour: PurpleLens feature: Regular LensF...,0.282289
8014,Lens colour: BrownLens feature: Regular LensFr...,0.285286
8015,Lens colour: BrownLens feature: Polarised Lens...,0.274956
8127,Lens colour: BrownLens feature: Regular LensFr...,0.285286
8189,Lens colour: PurpleLens feature: Regular LensF...,0.282289
8250,Lens colour: GreyLens feature: Polarised LensF...,0.274673
8302,Lens colour: GreenLens feature: Polarised lens...,0.272969
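
The lookup above relies on the dense `tfidf_df`, which at 12491 x 8418 float64 values occupies roughly 840 MB even though the sparse matrix stores only 204097 non-zero entries. The same kind of term lookup can be done directly on the sparse matrix. A minimal sketch on a made-up toy corpus (the documents and variable names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions (illustrative only).
docs = [
    "sunglasses with zoom lens",
    "blue denim jeans",
    "camera zoom lens kit",
]
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index in the matrix.
col = vectorizer.vocabulary_["zoom"]

# Slice one column (still sparse) and read off the rows with a non-zero weight.
column = matrix[:, [col]]
rows = column.nonzero()[0]
weights = column.toarray().ravel()[rows]

for row, weight in zip(rows, weights):
    print(row, docs[row], round(weight, 4))
```

In the notebook, `tfidf` and `tfidf_matrix` would take the place of `vectorizer` and `matrix`, and `data['Description'].iloc[rows]` would supply the text.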


#**Cosine Similarity**


**Cosine Similarity is a measure of the similarity between two vectors of an inner product space.**

For two vectors, A and B, the Cosine Similarity is calculated as:

$$\text{Cosine Similarity}(A, B) = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$$
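
As a worked instance of the formula, the sketch below computes the cosine similarity of two small made-up vectors with NumPy (`cosine_similarity_1d` is a hypothetical helper name, chosen to avoid confusion with scikit-learn's `cosine_similarity`):

```python
import numpy as np

def cosine_similarity_1d(a, b):
    """Cosine similarity of two 1-D vectors: dot(a, b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = [1, 0, 1]
b = [1, 1, 0]
# Dot product = 1 and each norm = sqrt(2), so mathematically the similarity is 1/2.
print(cosine_similarity_1d(a, b))
```

Identical directions give a similarity of 1 and vectors with no shared non-zero components give 0. TF-IDF weights are never negative, so similarities between TF-IDF vectors fall between 0 and 1.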

#**Linear kernel**
Linear kernel measures the similarity between text documents by calculating the dot product of their TF-IDF vectors. Because `TfidfVectorizer` L2-normalizes each document vector by default (`norm='l2'`), every vector has length 1 and the dot product equals the cosine similarity. `linear_kernel` therefore produces cosine similarity scores without repeating the normalization.
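
The notebook uses `linear_kernel` to obtain cosine similarities. A minimal check, on a made-up toy corpus, that the two agree for the default `TfidfVectorizer` output (the documents and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy descriptions (illustrative only).
docs = [
    "blue slim fit jeans",
    "blue cotton shirt",
    "black leather belt",
    "blue slim fit cotton jeans",
]
matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

# With the default norm='l2', every row has unit Euclidean length.
row_norms = np.sqrt(matrix.multiply(matrix).sum(axis=1))
print(np.allclose(row_norms, 1.0))

# For unit-length rows, the plain dot product and the cosine similarity coincide.
dot_products = linear_kernel(matrix, matrix)
cosines = cosine_similarity(matrix, matrix)
print(np.allclose(dot_products, cosines))
```

If the vectorizer were created with `norm=None`, the rows would no longer have unit length and `linear_kernel` would return unnormalized dot products instead.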

In [47]:
# Compute the cosine similarity between every pair of rows (product descriptions) of the matrix using linear_kernel

linear_kernel(tfidf_matrix, tfidf_matrix)

array([[1.        , 0.01646711, 0.05571126, ..., 0.01420408, 0.        ,
        0.01911564],
       [0.01646711, 1.        , 0.00637368, ..., 0.0065001 , 0.        ,
        0.00679858],
       [0.05571126, 0.00637368, 1.        , ..., 0.11172552, 0.        ,
        0.08409877],
       ...,
       [0.01420408, 0.0065001 , 0.11172552, ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.01911564, 0.00679858, 0.08409877, ..., 0.        , 0.        ,
        1.        ]])

In [48]:
# Compute the cosine similarity between every pair of product descriptions
# Store the linear_kernel result (a NumPy array) in a variable

lk_df = linear_kernel(tfidf_matrix, tfidf_matrix)

In [49]:
# Build a DataFrame of similarity scores from the linear_kernel output, using the product names as both row index and column labels

similar_score = pd.DataFrame(lk_df, columns=data['Product_Name'], index=data['Product_Name'])

In [50]:
# Selecting a product's column gives its similarity score against every product, based on the descriptions
# Sort in descending order so the most similar products come first; iloc[1:] skips the top entry, which is normally the product itself (score 1.0)

similar_score['Kvsfab Women Yellow Embroidered Poly Georgette Saree'].sort_values(ascending=False).iloc[1: ]

Product_Name,Kvsfab Women Yellow Embroidered Poly Georgette Saree
Kvsfab Navy Blue Embroidered Poly Georgette Saree,0.985614
Kvsfab Cream-Coloured Embroidered Poly Georgette Saree,0.967374
Kvsfab Women Beige & Grey Cotton Blend Printed Saree,0.925197
Kvsfab Women Beige Solid Poly Georgette Fringed Saree,0.906082
Kvsfab Women Pink Solid Poly Georgette Fringed Saree,0.893958
...,...
Kappa Men Turquoise Blue Solid Polo T-shirt,0.000000
Archies Love Gifts Multicolured Printed Ceramic Coffee Mug,0.000000
Indian Terrain Boys Blue Printed Polo Collar T-shirt,0.000000
Archies Love Gifts White & Pink Printed Coffee Mug,0.000000


In [51]:
# Content-based recommendation engine: suggest the products most similar to a given product
# iloc[1:number+1] skips the first entry (normally the product itself) and keeps the next `number` products

def get_recommendations(product_name, number):
    """Return the `number` products whose descriptions are most similar to `product_name`."""
    return similar_score[product_name].sort_values(ascending=False).iloc[1:number+1]
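
Two properties of this dataset can trip up a name-based lookup. Product names are not unique (10761 distinct names across 12491 rows), so selecting a column by name can return several columns at once. Products with identical descriptions all score 1.0, so the first sorted entry is not guaranteed to be the query product. The full 12491 x 12491 float64 similarity matrix also needs about 1.2 GB. A minimal sketch of a variant that looks products up by row position and scores one row at a time, on a made-up toy catalog (`recommend_by_row`, `toy` and `toy_matrix` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy catalog (made-up rows) with a duplicated product name.
toy = pd.DataFrame({
    "Product_Name": ["Brown Shorts", "Khaki Shorts", "Brown Shorts", "Ceramic Mug"],
    "Description": [
        "brown solid slim fit regular shorts",
        "khaki solid slim fit regular shorts",
        "brown printed regular fit shorts",
        "white printed ceramic coffee mug",
    ],
})
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy["Description"])

def recommend_by_row(row, number, matrix=toy_matrix, catalog=toy):
    """Hypothetical helper: top `number` products most similar to the product at position `row`."""
    # One row against the whole matrix: a vector of length n instead of an n x n matrix.
    scores = linear_kernel(matrix[[row]], matrix).ravel()
    scores[row] = -1.0  # exclude the query product by position rather than by rank
    top = np.argsort(scores)[::-1][:number]
    return pd.DataFrame(
        {"Product_Name": catalog["Product_Name"].to_numpy()[top], "score": scores[top]},
        index=top,
    )

result = recommend_by_row(0, 2)
print(result)
```

The result is indexed by row position, so two products that share a name stay distinguishable. Whether this trade-off (positions instead of names as the lookup key) suits the notebook depends on how the recommendations are consumed.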

In [52]:
# Get the top 20 recommended products similar to the given product

get_recommendations('SHOWOFF Men Brown Solid Slim Fit Regular Shorts', 20)

Product_Name,SHOWOFF Men Brown Solid Slim Fit Regular Shorts
SHOWOFF Men Khaki Solid Slim Fit Regular Shorts,0.789576
Indian Terrain Men Brown Brooklyn Slim Fit Solid Regular Trousers,0.766781
Sera Women Black Solid Loose Fit Regular Shorts,0.738331
Gini and Jony Boys Brown Solid Regular Fit Shorts,0.735919
Indian Terrain Men Brown Brooklyn Slim Fit Solid Chinos,0.718679
Parx Men Brown Slim Fit Solid Regular Trousers,0.718679
Park Avenue Men Beige Solid Slim Fit Regular Shorts,0.697789
Indian Terrain Men Khaki Printed Slim Fit Regular Shorts,0.69687
Indian Terrain Men Blue Slim Fit Printed Regular Shorts,0.691987
Gini and Jony Boys Blue Printed Regular Fit Shorts,0.691987


In [53]:
# Get the top 15 recommended products similar to the given product

get_recommendations('ID Men Brown Leather Formal Slip-Ons', 15)

Product_Name,ID Men Brown Leather Formal Slip-Ons
ID Men Black Leather Formal Slip-Ons,0.897947
ID Men Black Leather Formal Slip-Ons,0.876427
Red Tape Men Black Leather Formal Slip-Ons,0.798744
Red Tape Men Black Leather Formal Slip-Ons,0.798744
Red Tape Men Coffee Brown Leather Semiformal Slip-Ons,0.787637
Red Tape Men Black Leather Semiformal Slip-Ons,0.742805
Red Tape Men Black Leather Formal Slip-Ons,0.65824
Arrow Men Tan Brown Formal Leather Slip-Ons Shoes,0.628414
Franco Leone Men Brown Leather Formal Slip-On Shoes,0.613538
ID Men Brown Formal Leather Derbys,0.609459
