# **Content Based Method**

## **Loading Dataset**

In [2]:
import pandas as pd
products_path="/content/drive/MyDrive/BigBasket Products.csv"
products_df = pd.read_csv(products_path)

In [3]:
products_df

Unnamed: 0,index,product,category,sub_category,brand,sale_price,market_price,type,rating,description
0,1,Garlic Oil - Vegetarian Capsule 500 mg,Beauty & Hygiene,Hair Care,Sri Sri Ayurveda,220.00,220.0,Hair Oil & Serum,4.1,This Product contains Garlic Oil that is known...
1,2,Water Bottle - Orange,"Kitchen, Garden & Pets",Storage & Accessories,Mastercook,180.00,180.0,Water & Fridge Bottles,2.3,"Each product is microwave safe (without lid), ..."
2,3,"Brass Angle Deep - Plain, No.2",Cleaning & Household,Pooja Needs,Trm,119.00,250.0,Lamp & Lamp Oil,3.4,"A perfect gift for all occasions, be it your m..."
3,4,Cereal Flip Lid Container/Storage Jar - Assort...,Cleaning & Household,Bins & Bathroom Ware,Nakoda,149.00,176.0,"Laundry, Storage Baskets",3.7,Multipurpose container with an attractive desi...
4,5,Creme Soft Soap - For Hands & Body,Beauty & Hygiene,Bath & Hand Wash,Nivea,162.00,162.0,Bathing Bars & Soaps,4.4,Nivea Creme Soft Soap gives your skin the best...
...,...,...,...,...,...,...,...,...,...,...
27550,27551,"Wottagirl! Perfume Spray - Heaven, Classic",Beauty & Hygiene,Fragrances & Deos,Layerr,199.20,249.0,Perfume,3.9,Layerr brings you Wottagirl Classic fragrant b...
27551,27552,Rosemary,Gourmet & World Food,Cooking & Baking Needs,Puramate,67.50,75.0,"Herbs, Seasonings & Rubs",4.0,Puramate rosemary is enough to transform a dis...
27552,27553,Peri-Peri Sweet Potato Chips,Gourmet & World Food,"Snacks, Dry Fruits, Nuts",FabBox,200.00,200.0,Nachos & Chips,3.8,We have taken the richness of Sweet Potatoes (...
27553,27554,Green Tea - Pure Original,Beverages,Tea,Tetley,396.00,495.0,Tea Bags,4.2,"Tetley Green Tea with its refreshing pure, ori..."


Since we won't need the column "index" in the following steps, we just drop it.

In [4]:
products_df = products_df.drop(["index"],axis=1)
products_df.head(3)

Unnamed: 0,product,category,sub_category,brand,sale_price,market_price,type,rating,description
0,Garlic Oil - Vegetarian Capsule 500 mg,Beauty & Hygiene,Hair Care,Sri Sri Ayurveda,220.0,220.0,Hair Oil & Serum,4.1,This Product contains Garlic Oil that is known...
1,Water Bottle - Orange,"Kitchen, Garden & Pets",Storage & Accessories,Mastercook,180.0,180.0,Water & Fridge Bottles,2.3,"Each product is microwave safe (without lid), ..."
2,"Brass Angle Deep - Plain, No.2",Cleaning & Household,Pooja Needs,Trm,119.0,250.0,Lamp & Lamp Oil,3.4,"A perfect gift for all occasions, be it your m..."


In [5]:
products_df.shape

(27555, 9)

In [6]:
products_df.columns

Index(['product', 'category', 'sub_category', 'brand', 'sale_price',
       'market_price', 'type', 'rating', 'description'],
      dtype='object')

In [7]:
products_df.nunique()

product         23540
category           11
sub_category       90
brand            2313
sale_price       3256
market_price     1348
type              426
rating             40
description     21944
dtype: int64

Keep in mind that the column "category" has only 11 unique values. This will be useful when encoding the categorical columns.

In [8]:
products_df.describe()

Unnamed: 0,sale_price,market_price,rating
count,27555.0,27555.0,18929.0
mean,322.514808,382.056664,3.94341
std,486.263116,581.730717,0.739063
min,2.45,3.0,1.0
25%,95.0,100.0,3.7
50%,190.0,220.0,4.1
75%,359.0,425.0,4.3
max,12500.0,12500.0,5.0


## **Null Values**

In this step, we check whether the dataset contains any null values. If it does, we need to decide how to handle them, either by removing the affected rows or by filling in the empty cells.

In [10]:
df_null =  products_df.copy(deep=True)

In [12]:
import numpy as np
null_columns=df_null.columns[df_null.isna().any()].tolist()

In [16]:
def show_null(null_columns):
 null_values=pd.DataFrame(df_null[null_columns].isna().sum(), columns=['Number of Null Values'])
 null_values['Percentage of Null Values']=np.round(100*null_values['Number of Null Values']/len(df_null),2)
 print(null_values)

In [17]:
show_null(null_columns)

             Number of Null Values  Percentage of Null Values
product                          1                       0.00
brand                            1                       0.00
rating                        8626                      31.30
description                    115                       0.42


The null values shown above call for two different treatments.
First, the share of null values in description, product and brand is very low (at most 0.42% of the rows), so we can simply drop the rows that contain them.
Second, rating is a numerical feature with about 31% of its values missing, so instead of dropping rows we fill the gaps using either the mean or the median.
The following steps apply these changes.
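As a minimal sketch of these two treatments on a toy frame (the values below are invented for illustration and are not taken from the BigBasket data): rows with a missing text field are dropped, and the missing rating is filled with the median, which is less sensitive to outliers than the mean.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "product": ["A", "B", None, "D", "E"],
    "description": ["text a", None, "text c", "text d", "text e"],
    "rating": [4.0, 3.0, 5.0, np.nan, 1.0],
})

# Treatment 1: drop rows where a rarely-missing text column is null.
toy = toy.dropna(subset=["product", "description"])

# Treatment 2: fill the numeric column with its median,
# computed on the rows that remain after the drop.
median_rating = toy["rating"].median()
toy = toy.assign(rating=toy["rating"].fillna(median_rating))

print(toy)
```

Note that the order of the two steps matters slightly: the median is computed only over the rows that survive the drop.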

In [18]:
df_null = df_null.dropna(subset=['description'], axis=0)

In [20]:
show_null(null_columns)

             Number of Null Values  Percentage of Null Values
product                          1                       0.00
brand                            0                       0.00
rating                        8599                      31.34
description                      0                       0.00


As you can see, the null value in brand was also eliminated when we removed the rows with a null description.

In [22]:
df_null = df_null.dropna(subset=['product'], axis=0)

In [23]:
show_null(null_columns)

             Number of Null Values  Percentage of Null Values
product                          0                       0.00
brand                            0                       0.00
rating                        8599                      31.34
description                      0                       0.00


In [24]:
df_null['rating'] = df_null['rating'].fillna(df_null['rating'].median())

In [25]:
show_null(null_columns)

             Number of Null Values  Percentage of Null Values
product                          0                        0.0
brand                            0                        0.0
rating                           0                        0.0
description                      0                        0.0


Now there are no null values in our dataset and we can move on to the next part.

## **Encoding**

In this step we are going to encode all of the categorical columns of our dataset.

In [28]:
df_encode = df_null.copy(deep=True)

**Category**: As I mentioned earlier, we can easily use one-hot encoding for the "category" column, because it has only 11 unique values, so one-hot encoding will add only 11 columns to our dataset, which is acceptable.

In [29]:
onehot_BC = pd.get_dummies(df_encode['category'], prefix='category')
df_encode = pd.concat([df_encode, onehot_BC], axis=1)

In [30]:
#now we just drop the "category" column
df_encode = df_encode.drop(["category"],axis=1)
df_encode.tail()

Unnamed: 0,product,sub_category,brand,sale_price,market_price,type,rating,description,category_Baby Care,"category_Bakery, Cakes & Dairy",category_Beauty & Hygiene,category_Beverages,category_Cleaning & Household,"category_Eggs, Meat & Fish","category_Foodgrains, Oil & Masala",category_Fruits & Vegetables,category_Gourmet & World Food,"category_Kitchen, Garden & Pets",category_Snacks & Branded Foods
27550,"Wottagirl! Perfume Spray - Heaven, Classic",Fragrances & Deos,Layerr,199.2,249.0,Perfume,3.9,Layerr brings you Wottagirl Classic fragrant b...,0,0,1,0,0,0,0,0,0,0,0
27551,Rosemary,Cooking & Baking Needs,Puramate,67.5,75.0,"Herbs, Seasonings & Rubs",4.0,Puramate rosemary is enough to transform a dis...,0,0,0,0,0,0,0,0,1,0,0
27552,Peri-Peri Sweet Potato Chips,"Snacks, Dry Fruits, Nuts",FabBox,200.0,200.0,Nachos & Chips,3.8,We have taken the richness of Sweet Potatoes (...,0,0,0,0,0,0,0,0,1,0,0
27553,Green Tea - Pure Original,Tea,Tetley,396.0,495.0,Tea Bags,4.2,"Tetley Green Tea with its refreshing pure, ori...",0,0,0,1,0,0,0,0,0,0,0
27554,United Dreams Go Far Deodorant,Men's Grooming,United Colors Of Benetton,214.53,390.0,Men's Deodorants,4.5,The new mens fragrance from the United Dreams ...,0,0,1,0,0,0,0,0,0,0,0


In [31]:
df_encode.shape

(27439, 19)

**Sub_Category**: For sub_category we have to choose a different strategy. We could use frequency encoding, but because sub_category plays a crucial role in characterizing a product, I prefer label encoding, which gives every sub-category its own code. The reason is that several sub-categories are very likely to share the same frequency, and frequency encoding would then make them indistinguishable, which could mislead us when grouping products with similar features.
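The collision concern can be illustrated with a small sketch. The toy values are invented, and pandas category codes are used here as a stand-in for `LabelEncoder` (both assign codes in sorted order of the distinct values).

```python
import pandas as pd

# Toy sub-categories, invented for illustration only.
toy = pd.Series(["Tea", "Tea", "Hair Care", "Hair Care", "Pooja Needs"])

# Frequency encoding: each value is replaced by its relative frequency.
freq = toy.map(toy.value_counts(normalize=True))

# Label encoding: each distinct value gets its own integer code
# (codes follow the alphabetical order of the distinct values).
labels = toy.astype("category").cat.codes

print(pd.DataFrame({"sub_category": toy, "freq": freq, "label": labels}))
```

In this toy example "Tea" and "Hair Care" receive the same frequency value, while label encoding keeps them apart. The integer codes themselves carry no meaning beyond identity, so their numeric order should not be read as a ranking.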

In [33]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df_encode['sub_category'] = label_encoder.fit_transform(df_encode['sub_category'])

In [34]:
df_encode.head(7)

Unnamed: 0,product,sub_category,brand,sale_price,market_price,type,rating,description,category_Baby Care,"category_Bakery, Cakes & Dairy",category_Beauty & Hygiene,category_Beverages,category_Cleaning & Household,"category_Eggs, Meat & Fish","category_Foodgrains, Oil & Masala",category_Fruits & Vegetables,category_Gourmet & World Food,"category_Kitchen, Garden & Pets",category_Snacks & Branded Foods
0,Garlic Oil - Vegetarian Capsule 500 mg,49,Sri Sri Ayurveda,220.0,220.0,Hair Oil & Serum,4.1,This Product contains Garlic Oil that is known...,0,0,1,0,0,0,0,0,0,0,0
1,Water Bottle - Orange,86,Mastercook,180.0,180.0,Water & Fridge Bottles,2.3,"Each product is microwave safe (without lid), ...",0,0,0,0,0,0,0,0,0,1,0
2,"Brass Angle Deep - Plain, No.2",73,Trm,119.0,250.0,Lamp & Lamp Oil,3.4,"A perfect gift for all occasions, be it your m...",0,0,0,0,1,0,0,0,0,0,0
3,Cereal Flip Lid Container/Storage Jar - Assort...,9,Nakoda,149.0,176.0,"Laundry, Storage Baskets",3.7,Multipurpose container with an attractive desi...,0,0,0,0,1,0,0,0,0,0,0
4,Creme Soft Soap - For Hands & Body,8,Nivea,162.0,162.0,Bathing Bars & Soaps,4.4,Nivea Creme Soft Soap gives your skin the best...,0,0,1,0,0,0,0,0,0,0,0
5,Germ - Removal Multipurpose Wipes,0,Nature Protect,169.0,199.0,Disinfectant Spray & Cleaners,3.3,Stay protected from contamination with Multipu...,0,0,0,0,1,0,0,0,0,0,0
6,Multani Mati,80,Satinance,58.0,58.0,Face Care,3.6,Satinance multani matti is an excellent skin t...,0,0,1,0,0,0,0,0,0,0,0


**Brand:** A brand's frequency reflects how wide its product range is and, roughly, how popular it is, so frequency encoding is a good choice here. With this method, brands of similar popularity are treated as roughly equivalent.

In [35]:
brand_num = df_encode["brand"].value_counts(normalize=True)
df_encode["brand_freq"] = df_encode["brand"].map(brand_num) * 100
#and just drop the original column
df_encode = df_encode.drop(["brand"],axis=1)

In [36]:
df_encode.tail(6)

Unnamed: 0,product,sub_category,sale_price,market_price,type,rating,description,category_Baby Care,"category_Bakery, Cakes & Dairy",category_Beauty & Hygiene,category_Beverages,category_Cleaning & Household,"category_Eggs, Meat & Fish","category_Foodgrains, Oil & Masala",category_Fruits & Vegetables,category_Gourmet & World Food,"category_Kitchen, Garden & Pets",category_Snacks & Branded Foods,brand_freq
27549,Papad - Garlic Disco,75,61.0,61.0,"Papads, Ready To Fry",4.0,Papads are prepared from urad dal flour and sp...,0,0,0,0,0,0,0,0,0,0,1,0.061956
27550,"Wottagirl! Perfume Spray - Heaven, Classic",41,199.2,249.0,Perfume,3.9,Layerr brings you Wottagirl Classic fragrant b...,0,0,1,0,0,0,0,0,0,0,0,0.127556
27551,Rosemary,20,67.5,75.0,"Herbs, Seasonings & Rubs",4.0,Puramate rosemary is enough to transform a dis...,0,0,0,0,0,0,0,0,1,0,0,0.306134
27552,Peri-Peri Sweet Potato Chips,82,200.0,200.0,Nachos & Chips,3.8,We have taken the richness of Sweet Potatoes (...,0,0,0,0,0,0,0,0,1,0,0,0.255111
27553,Green Tea - Pure Original,87,396.0,495.0,Tea Bags,4.2,"Tetley Green Tea with its refreshing pure, ori...",0,0,0,1,0,0,0,0,0,0,0,0.047378
27554,United Dreams Go Far Deodorant,59,214.53,390.0,Men's Deodorants,4.5,The new mens fragrance from the United Dreams ...,0,0,1,0,0,0,0,0,0,0,0,0.069245


**Type:** Again, we use label encoding for this column, so that each distinct type gets its own code.

In [38]:
df_encode['type'] = label_encoder.fit_transform(df_encode['type'])

In [39]:
df_encode.head()

Unnamed: 0,product,sub_category,sale_price,market_price,type,rating,description,category_Baby Care,"category_Bakery, Cakes & Dairy",category_Beauty & Hygiene,category_Beverages,category_Cleaning & Household,"category_Eggs, Meat & Fish","category_Foodgrains, Oil & Masala",category_Fruits & Vegetables,category_Gourmet & World Food,"category_Kitchen, Garden & Pets",category_Snacks & Branded Foods,brand_freq
0,Garlic Oil - Vegetarian Capsule 500 mg,49,220.0,220.0,204,4.1,This Product contains Garlic Oil that is known...,0,0,1,0,0,0,0,0,0,0,0,0.043733
1,Water Bottle - Orange,86,180.0,180.0,420,2.3,"Each product is microwave safe (without lid), ...",0,0,0,0,0,0,0,0,0,1,0,0.211378
2,"Brass Angle Deep - Plain, No.2",73,119.0,250.0,249,3.4,"A perfect gift for all occasions, be it your m...",0,0,0,0,1,0,0,0,0,0,0,0.153067
3,Cereal Flip Lid Container/Storage Jar - Assort...,9,149.0,176.0,250,3.7,Multipurpose container with an attractive desi...,0,0,0,0,1,0,0,0,0,0,0,0.375378
4,Creme Soft Soap - For Hands & Body,8,162.0,162.0,39,4.4,Nivea Creme Soft Soap gives your skin the best...,0,0,1,0,0,0,0,0,0,0,0,0.317067


**Description:** There is no standard categorical encoding we can apply to this column, because it has a very large number of unique values and each value is a long piece of text. For this type of feature we can use the techniques of word embedding pipelines; here we tokenize each description into a sequence of integer word ids, the usual first step of such methods.
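To make the idea of integer sequences concrete, here is a rough pure-Python approximation of what a word-index tokenizer does. It is a sketch under simplifying assumptions (toy texts, a simple regular expression for splitting), not the Keras implementation used in the cells that follow.

```python
import re
from collections import Counter

# Toy descriptions, invented for illustration only.
texts = [
    "Green tea with a refreshing taste",
    "Refreshing green tea bags",
]

# Lowercase and split into alphanumeric word tokens.
tokenized = [re.findall(r"[a-z0-9]+", t.lower()) for t in texts]

# Build the word index: more frequent words get smaller ids, starting at 1
# (0 is left free so it can be used for padding later).
counts = Counter(word for tokens in tokenized for word in tokens)
word_index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

# Replace every word by its integer id.
sequences = [[word_index[w] for w in tokens] for tokens in tokenized]
print(sequences)
```

The resulting lists have different lengths, one entry per word, which is why a padding step is needed before they can be stored as fixed-width columns.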

In [40]:
desc_values = df_encode['description'].values.tolist()

In [43]:
desc_values

['This Product contains Garlic Oil that is known to help proper digestion, maintain proper cholesterol levels, support cardiovascular and also build immunity.  For Beauty tips, tricks & more visit https://bigbasket.blog/',
 'Each product is microwave safe (without lid), refrigerator safe, dishwasher safe and can also be used for re-heating food and not for cooking. All containers come with airtight lids and a wide variety of attractive colours. Stack these stylish and colourful containers in your kitchen with ease and for a look-good factor.',
 'A perfect gift for all occasions, be it your mother, sister, in-laws, boss or your friends, this beautiful designer piece wherever placed, is sure to beautify the surroundings Traditional design This type diya has been used for Diwali and All other Festivals for centuries. Sturdy and easy to carry The feet keep it balanced to ensure safety. Wonderful Oil Lamp made in Brass also called as Jyoti. This is a handcrafted piece of Indian brass Deepak

In [45]:
unique_lengths = set(len(x) for x in desc_values)
print(unique_lengths)

{7, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 

In [46]:
len(unique_lengths)

1895

In [47]:
min(unique_lengths)

7

In [48]:
max(unique_lengths)

4486

In [50]:
#import the required libraries
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

In [52]:
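def word_embeding(max_w, max_l, column_):
  # max_w: vocabulary cap; only the most frequent words are kept in the sequences
  # max_l: intended maximum sequence length (not used here; no padding is applied yet)
  tokenizer = Tokenizer(num_words=max_w)
  # learn the word index from the texts
  tokenizer.fit_on_texts(column_)
  # map each text to a list of integer word ids
  sequences = tokenizer.texts_to_sequences(column_)
  word_index = tokenizer.word_index
  print('Found %s unique tokens.' % len(word_index))
  return word_index, sequences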

In [53]:
word_index, sequences = word_embeding(20000, 4500, df_encode['description'])

Found 32831 unique tokens.


In [54]:
df_encode['desc_seq'] = sequences

In [56]:
df_encode

Unnamed: 0,product,sub_category,sale_price,market_price,type,rating,description,category_Baby Care,"category_Bakery, Cakes & Dairy",category_Beauty & Hygiene,category_Beverages,category_Cleaning & Household,"category_Eggs, Meat & Fish","category_Foodgrains, Oil & Masala",category_Fruits & Vegetables,category_Gourmet & World Food,"category_Kitchen, Garden & Pets",category_Snacks & Branded Foods,brand_freq,desc_seq
0,Garlic Oil - Vegetarian Capsule 500 mg,49,220.00,220.0,204,4.1,This Product contains Garlic Oil that is known...,0,0,1,0,0,0,0,0,0,0,0,0.043733,"[15, 62, 71, 574, 29, 14, 5, 144, 6, 83, 1582,..."
1,Water Bottle - Orange,86,180.00,180.0,420,2.3,"Each product is microwave safe (without lid), ...",0,0,0,0,0,0,0,0,0,1,0,0.211378,"[210, 62, 5, 599, 96, 87, 329, 1664, 96, 607, ..."
2,"Brass Angle Deep - Plain, No.2",73,119.00,250.0,249,3.4,"A perfect gift for all occasions, be it your m...",0,0,0,0,1,0,0,0,0,0,0,0.153067,"[4, 48, 667, 10, 30, 1284, 27, 8, 11, 1466, 45..."
3,Cereal Flip Lid Container/Storage Jar - Assort...,9,149.00,176.0,250,3.7,Multipurpose container with an attractive desi...,0,0,0,0,1,0,0,0,0,0,0,0.375378,"[1016, 366, 7, 23, 684, 208, 1, 24, 17, 40, 19..."
4,Creme Soft Soap - For Hands & Body,8,162.00,162.0,39,4.4,Nivea Creme Soft Soap gives your skin the best...,0,0,1,0,0,0,0,0,0,0,0,0.317067,"[1481, 1028, 73, 223, 97, 11, 16, 2, 54, 107, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27550,"Wottagirl! Perfume Spray - Heaven, Classic",41,199.20,249.0,329,3.9,Layerr brings you Wottagirl Classic fragrant b...,0,0,1,0,0,0,0,0,0,0,0,0.127556,"[5097, 199, 13, 9161, 431, 787, 60, 5354, 10, ..."
27551,Rosemary,20,67.50,75.0,215,4.0,Puramate rosemary is enough to transform a dis...,0,0,0,0,0,0,0,0,1,0,0,0.306134,"[3384, 2062, 5, 865, 6, 1636, 4, 428, 130, 238..."
27552,Peri-Peri Sweet Potato Chips,82,200.00,200.0,286,3.8,We have taken the richness of Sweet Potatoes (...,0,0,0,0,0,0,0,0,1,0,0,0.255111,"[66, 57, 2148, 2, 1730, 3, 109, 1902, 19064, 1..."
27553,Green Tea - Pure Original,87,396.00,495.0,395,4.2,"Tetley Green Tea with its refreshing pure, ori...",0,0,0,1,0,0,0,0,0,0,0,0.047378,"[7117, 127, 55, 7, 32, 187, 129, 677, 70, 71, ..."


It doesn't hurt to check once more whether there are any null values in the dataset.

In [57]:
null_df = df_encode.copy(deep=True)

In [58]:
null_columns = null_df.columns[null_df.isna().any()].tolist()

In [59]:
show_null(null_columns)

Empty DataFrame
Columns: [Number of Null Values, Percentage of Null Values]
Index: []


In [62]:
df_encode = df_encode.drop('description', axis = 1)
df_encode.head()

Unnamed: 0,product,sub_category,sale_price,market_price,type,rating,category_Baby Care,"category_Bakery, Cakes & Dairy",category_Beauty & Hygiene,category_Beverages,category_Cleaning & Household,"category_Eggs, Meat & Fish","category_Foodgrains, Oil & Masala",category_Fruits & Vegetables,category_Gourmet & World Food,"category_Kitchen, Garden & Pets",category_Snacks & Branded Foods,brand_freq,desc_seq
0,Garlic Oil - Vegetarian Capsule 500 mg,49,220.0,220.0,204,4.1,0,0,1,0,0,0,0,0,0,0,0,0.043733,"[15, 62, 71, 574, 29, 14, 5, 144, 6, 83, 1582,..."
1,Water Bottle - Orange,86,180.0,180.0,420,2.3,0,0,0,0,0,0,0,0,0,1,0,0.211378,"[210, 62, 5, 599, 96, 87, 329, 1664, 96, 607, ..."
2,"Brass Angle Deep - Plain, No.2",73,119.0,250.0,249,3.4,0,0,0,0,1,0,0,0,0,0,0,0.153067,"[4, 48, 667, 10, 30, 1284, 27, 8, 11, 1466, 45..."
3,Cereal Flip Lid Container/Storage Jar - Assort...,9,149.0,176.0,250,3.7,0,0,0,0,1,0,0,0,0,0,0,0.375378,"[1016, 366, 7, 23, 684, 208, 1, 24, 17, 40, 19..."
4,Creme Soft Soap - For Hands & Body,8,162.0,162.0,39,4.4,0,0,1,0,0,0,0,0,0,0,0,0.317067,"[1481, 1028, 73, 223, 97, 11, 16, 2, 54, 107, ..."


Next, we expand the vectors in the desc_seq column into separate columns of our dataset. More precisely, we create one column per vector position and store each element in its corresponding column, padding shorter vectors with zeros.
To do this, we define a function and then apply it to our dataset.
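The padding-and-expansion idea can be sketched on a toy frame as follows. This variant is an assumption-labelled alternative, not the function used in the notebook: it builds all new columns in a single DataFrame and joins them once, rather than creating a `pd.Series` per row.

```python
import pandas as pd

# Toy frame with variable-length integer sequences (invented for illustration).
toy = pd.DataFrame({"product": ["A", "B", "C"],
                    "desc_seq": [[5, 2], [7], [1, 3, 4]]})

max_length = max(len(seq) for seq in toy["desc_seq"])

# Right-pad every sequence with zeros so all rows have the same length.
padded = [seq + [0] * (max_length - len(seq)) for seq in toy["desc_seq"]]

# Build all new columns in one step and join them to the original frame.
new_columns = [f"desc_seq{i + 1}" for i in range(max_length)]
padded_df = pd.DataFrame(padded, columns=new_columns, index=toy.index)
result = pd.concat([toy.drop(columns="desc_seq"), padded_df], axis=1)
print(result)
```

Adding the columns in one `pd.concat` call tends to be faster on large frames and likely avoids the repeated warnings that column-by-column insertion can produce.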

In [63]:
def conversion(con_df,column):
  max_length = max(len(v) for v in con_df[column])
  new_columns = [f'{column}{i + 1}' for i in range(max_length)]
  con_df[new_columns] = con_df[column].apply(lambda x: pd.Series(x + [0] * (max_length - len(x))))
  print(con_df)

In [64]:
conversion(df_encode,'desc_seq')

  con_df[new_columns] = con_df[column].apply(lambda x: pd.Series(x + [0] * (max_length - len(x))))

                                                 product  sub_category  \
0                 Garlic Oil - Vegetarian Capsule 500 mg            49   
1                                  Water Bottle - Orange            86   
2                         Brass Angle Deep - Plain, No.2            73   
3      Cereal Flip Lid Container/Storage Jar - Assort...             9   
4                     Creme Soft Soap - For Hands & Body             8   
...                                                  ...           ...   
27550         Wottagirl! Perfume Spray - Heaven, Classic            41   
27551                                           Rosemary            20   
27552                       Peri-Peri Sweet Potato Chips            82   
27553                          Green Tea - Pure Original            87   
27554                     United Dreams Go Far Deodorant            59   

       sale_price  market_price  type  rating  category_Baby Care  \
0          220.00         220.0   204     

In [65]:
df_encode

Unnamed: 0,product,sub_category,sale_price,market_price,type,rating,category_Baby Care,"category_Bakery, Cakes & Dairy",category_Beauty & Hygiene,category_Beverages,...,desc_seq721,desc_seq722,desc_seq723,desc_seq724,desc_seq725,desc_seq726,desc_seq727,desc_seq728,desc_seq729,desc_seq730
0,Garlic Oil - Vegetarian Capsule 500 mg,49,220.00,220.0,204,4.1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,Water Bottle - Orange,86,180.00,180.0,420,2.3,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Brass Angle Deep - Plain, No.2",73,119.00,250.0,249,3.4,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Cereal Flip Lid Container/Storage Jar - Assort...,9,149.00,176.0,250,3.7,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Creme Soft Soap - For Hands & Body,8,162.00,162.0,39,4.4,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27550,"Wottagirl! Perfume Spray - Heaven, Classic",41,199.20,249.0,329,3.9,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
27551,Rosemary,20,67.50,75.0,215,4.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
27552,Peri-Peri Sweet Potato Chips,82,200.00,200.0,286,3.8,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
27553,Green Tea - Pure Original,87,396.00,495.0,395,4.2,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


As you can see, 730 columns (desc_seq1 to desc_seq730) are added to our dataset as a result of applying our function.

In [67]:
#just drop the desc_seq column since we no longer need it
df_encode = df_encode.drop('desc_seq', axis = 1)
df_encode

Unnamed: 0,product,sub_category,sale_price,market_price,type,rating,category_Baby Care,"category_Bakery, Cakes & Dairy",category_Beauty & Hygiene,category_Beverages,...,desc_seq721,desc_seq722,desc_seq723,desc_seq724,desc_seq725,desc_seq726,desc_seq727,desc_seq728,desc_seq729,desc_seq730
0,Garlic Oil - Vegetarian Capsule 500 mg,49,220.00,220.0,204,4.1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,Water Bottle - Orange,86,180.00,180.0,420,2.3,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Brass Angle Deep - Plain, No.2",73,119.00,250.0,249,3.4,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Cereal Flip Lid Container/Storage Jar - Assort...,9,149.00,176.0,250,3.7,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Creme Soft Soap - For Hands & Body,8,162.00,162.0,39,4.4,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27550,"Wottagirl! Perfume Spray - Heaven, Classic",41,199.20,249.0,329,3.9,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
27551,Rosemary,20,67.50,75.0,215,4.0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
27552,Peri-Peri Sweet Potato Chips,82,200.00,200.0,286,3.8,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
27553,Green Tea - Pure Original,87,396.00,495.0,395,4.2,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## **Create Our Dictionary**

When measuring the similarity of two products, we don't necessarily use all of their available features; sometimes only the most informative ones are used.
In our case, to find similar products we need a vector representation of each product, and then we compare those vectors. The more distant two vectors are, the less similar the products are.
So, we first build the vector representation of each row (product) of our dataset and create a dictionary for our products, in which the keys are the product names and the values are the vector representations of the products' characteristics.
With this in place, whenever we receive a product name as input, we look up its vector in the dictionary, compare it with all the other vectors, and return the names of the products that are similar enough (the similarity threshold is ours to choose).
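The lookup-and-compare idea can be sketched as below. This is only an illustration: the product names and vectors are hypothetical, and Euclidean distance is used as one possible notion of "distance", whereas the recommender later in this notebook uses Jaccard similarity.

```python
import math

# Hypothetical product vectors, invented for illustration only.
product_vectors = {
    "Hand Wash - A": [8.0, 4.4, 1.0, 0.0],
    "Hand Wash - B": [8.0, 4.1, 1.0, 0.0],
    "Green Tea": [87.0, 4.2, 0.0, 1.0],
}

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_products(name, vectors, max_distance):
    """Return names of products whose vector lies within max_distance of `name`."""
    target = vectors[name]
    return [other for other, vec in vectors.items()
            if other != name and euclidean_distance(target, vec) <= max_distance]

print(closest_products("Hand Wash - A", product_vectors, max_distance=1.0))
```

Because dictionary keys must be unique, products that share the same name would end up with a single entry, so a unique identifier may be a safer key when names repeat.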

In [70]:
vec_df = df_encode.copy(deep=True)

In [71]:
columns_for_vector = vec_df.columns

In [75]:
#these columns are not required as a feature of a product
exclude = ['product','sale_price']

In [76]:
def vectorize_rows(dataframe, columns_to_exclude=None):
    if columns_to_exclude:
        dataframe = dataframe.drop(columns=columns_to_exclude)
    return dataframe.values.tolist()

In [77]:
vectorized_rows = vectorize_rows(vec_df,exclude)

In [78]:
for idx, row in enumerate(vectorized_rows[:3]):
    print(f"Row {idx + 1}: {row}")

Row 1: [49.0, 220.0, 204.0, 4.1, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.04373337220744196, 15.0, 62.0, 71.0, 574.0, 29.0, 14.0, 5.0, 144.0, 6.0, 83.0, 1582.0, 570.0, 419.0, 1582.0, 372.0, 481.0, 837.0, 4094.0, 1.0, 33.0, 1142.0, 406.0, 10.0, 138.0, 195.0, 250.0, 46.0, 317.0, 235.0, 133.0, 193.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 

In [80]:
dictionary = dict(zip(vec_df['product'], vectorized_rows))

## **Recommender System**

In [115]:
def get_input():
  target_product = input("Please input the product that you're searching for the similar items to it:")
  similarity_threshold = input("How similar you want these items to be, compared to your product? " +
                             "1)Very similar " +
                             "2)Fairly similar " )
  return target_product , similarity_threshold

In [110]:
def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    similarity = intersection / union if union != 0 else 0
    return similarity

def find_similar_products(customer_product, product_vectors, threshold=0.25):
    customer_set = set(product_vectors.get(customer_product, []))
    similar_products = []

    for product, vector in product_vectors.items():
        if product != customer_product:
            vector_set = set(vector)
            similarity = jaccard_similarity(customer_set, vector_set)
            if similarity > threshold:
                similar_products.append(product)

    return similar_products
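A small worked example may help in reading the Jaccard score. The two vectors below are hypothetical, and the helper is a compact restatement of the same formula (intersection size divided by union size).

```python
def jaccard(set1, set2):
    """Size of the intersection divided by size of the union (0 if both empty)."""
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0

# Hypothetical feature vectors; converting to sets drops order and duplicates.
a = [8.0, 4.4, 1.0, 0.0, 0.0, 15.0, 62.0]
b = [8.0, 4.1, 1.0, 0.0, 0.0, 15.0, 99.0]

set_a, set_b = set(a), set(b)
print(set_a & set_b)          # values present in both vectors
print(jaccard(set_a, set_b))  # 4 shared values out of 8 distinct values
```

Since sets ignore position, a value counts as shared regardless of which column it came from, and repeated values such as the padding zeros collapse into a single element. This is worth keeping in mind when choosing the threshold.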

In [119]:
def get_list(target_product,similarity_threshold):
 if(similarity_threshold == "1"):
  found_items = find_similar_products(target_product, dictionary, threshold=0.4)
 else:
  found_items = find_similar_products(target_product, dictionary, threshold=0.3)
 print("The similar items are:",  found_items)


In [124]:
while(True):
 target_product , similarity_threshold = get_input()
 get_list(target_product,similarity_threshold)
 con = input("Do you want to continue?" + " Yes or No : ")
 if(con != "Yes"):
   break

Please input the product that you're searching for the similar items to it:Hand Wash - Green Apple
How similar you want these items to be, compared to your product? 1)Very similar 2)Fairly similar 1
The similar items are: ['Hand Wash - Orange Peel', 'Hand Wash - White Lily', 'Hand Wash - Peach & Apricot', 'Hand Wash - Marina']
Do you want to continue? Yes or No : Yes
Please input the product that you're searching for the similar items to it:Hand Wash - Green Apple
How similar you want these items to be, compared to your product? 1)Very similar 2)Fairly similar 2
The similar items are: ['Pain Relief - Oil', 'Hand Wash - Orange Peel', 'Code Vaporisateur Natural Spray for Men', 'Hand Wash - White Lily', 'Bathing Soap (Lavender & Milk Cream)', 'Laboratory Reagent CH3, CO, CH3', 'Cotton Balls', 'Shampoo - for Normal Hair', 'Fruity Soap Enriched with Narural Grape Extract', 'Bathing Bar - Ice Cube', 'Perfume - Fresh Blossom Absolute', 'Body Bath+Scrub - Orange, Peach Essence', 'Energy Vapori