### Training the Recommender System with Outliers (as in the Dataset)

#### Approach: Content Based Filtering: Use of features directly related to the products.
#### Model Used: NearestNeighbors (scikit-learn)


Features Used:
- BrandName 
- Category
- Individual_category 
- DiscountPrice
- OriginalPrice
- DiscountOffer

Remarks: 
1. All categorical features were one-hot encoded, with PCA dimensionality reduction applied.
2. No outliers were removed for this training, though they exist in each of the numerical features.
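
The preprocessing described in the remarks happens outside this notebook. As a rough illustration only, the sketch below shows how one-hot encoding followed by PCA could produce columns such as `pca_brandname_0`. The toy DataFrame and the helper name `one_hot_pca` are assumptions, and the component count of 2 is illustrative (the column names in the dataset suggest 20 components per feature).

```python
import pandas as pd
from sklearn.decomposition import PCA


def one_hot_pca(df, column, n_components, prefix):
    """One-hot encode `column`, then compress the dummies with PCA (hypothetical helper)."""
    dummies = pd.get_dummies(df[column], dtype=float)
    pca = PCA(n_components=n_components, random_state=42)
    reduced = pca.fit_transform(dummies.to_numpy())
    names = [f"{prefix}_{i}" for i in range(n_components)]
    return pd.DataFrame(reduced, columns=names, index=df.index)


# Toy data, not taken from the real dataset
toy = pd.DataFrame({"BrandName": ["roadster", "hrx", "zivame", "roadster", "puma"]})
pca_brand = one_hot_pca(toy, "BrandName", n_components=2, prefix="pca_brandname")
print(pca_brand.shape)
```

Each product ends up with a short dense vector per categorical feature instead of one column per brand.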

### Importing the necessary libraries

In [44]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors  
from scipy.sparse import csr_matrix
import pickle

#### Load the dataset

In [45]:
path =  './clothes_df_clean_with_outliers.csv'
clothing_df= pd.read_csv(path)
clothing_df

Unnamed: 0,URL,Product_id,Description,DiscountPrice,OriginalPrice,DiscountOffer,SizeOption,Ratings,Reviews,product_name,...,pca_individual_category_18,pca_individual_category_19,Category_Indian Wear,Category_Inner Wear & Sleep Wear,Category_Lingerie & Sleep Wear,Category_Plus Size,Category_Sports Wear,Category_Topwear,Category_Western,category_by_Gender_Women
0,https://www.myntra.com/jeans/roadster/roadster...,2296012,roadster men navy blue slim fit mid rise clean...,824.0,1499.0,45% OFF,"28, 30, 32, 34, 36",3.9,999.0,roadster-men-navy-blue-slim-fit-mid-rise-clean...,...,-0.005001,-0.004482,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,https://www.myntra.com/track-pants/locomotive/...,13780156,locomotive men black white solid slim fit tra...,517.0,1149.0,55% OFF,"S, M, L, XL",4.0,999.0,locomotive-men-black--white-solid-slim-fit-tra...,...,-0.008943,-0.007922,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,https://www.myntra.com/shirts/roadster/roadste...,11895958,roadster men navy white black geometric print...,629.0,1399.0,55% OFF,"38, 40, 42, 44, 46, 48",4.3,999.0,roadster-men-navy-white--black-geometric-print...,...,-0.005118,-0.004585,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,https://www.myntra.com/shapewear/zivame/zivame...,4335679,zivame women black saree shapewear zi3023core0...,893.0,1295.0,31% OFF,"S, M, L, XL, XXL",4.2,999.0,zivame-women-black-saree-shapewear-zi3023core0...,...,0.041040,0.048508,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4,https://www.myntra.com/trousers/highlander/hig...,6744434,highlander men olive green slim fit solid regu...,599.0,1499.0,60% OFF,"30, 32, 34, 36",3.9,998.0,highlander-men-olive-green-slim-fit-solid-regu...,...,-0.007880,-0.007002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117536,https://www.myntra.com/tshirts/hrx-by-hrithik-...,8379269,hrx by hrithik roshan women navy blue nautical...,404.0,899.0,55% OFF,"XS, S, M, L, XL",4.4,0.0,hrx-by-hrithik-roshan-women-navy-blue-nautical...,...,-0.003031,-0.002733,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
117537,https://www.myntra.com/track-pants/stylestone/...,12767048,stylestone women black solid track pants,467.0,899.0,48% OFF,"S, M, L, XL",4.2,0.0,stylestone-women-black-solid-track-pants,...,-0.008943,-0.007922,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
117538,https://www.myntra.com/tshirts/hrx-by-hrithik-...,10106141,hrx by hrithik roshan women black green print...,404.0,899.0,55% OFF,"S/M, L/XL",4.4,0.0,hrx-by-hrithik-roshan-women-black--green-print...,...,-0.003031,-0.002733,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
117539,https://www.myntra.com/tshirts/hrx-by-hrithik-...,11640324,hrx by hrithik roshan women north sea printed ...,494.0,899.0,45% OFF,"XS, S, M, L, XL",4.4,0.0,hrx-by-hrithik-roshan-women-north-sea-printed-...,...,-0.003031,-0.002733,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0


### Upon further investigation, the DiscountOffer features need a little more cleaning

In [46]:
# Extract the numeric part of DiscountOffer (e.g. '45% OFF' -> 45.0), in case earlier cleaning left it as text
clothing_df['DiscountOffer'] = clothing_df['DiscountOffer'].str.extract(r'(\d+)').astype(float)
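
To make the extraction step concrete, here is a small sketch on made-up values in the same style as the `DiscountOffer` column. The `"Rs. 300 OFF"` entry is hypothetical and is included only to show that the pattern captures the first run of digits whatever the unit, so a non-percentage offer would be read as a percentage.

```python
import pandas as pd

# Toy values in the same format as the DiscountOffer column (the last one is hypothetical)
offers = pd.Series(["45% OFF", "55% OFF", None, "Rs. 300 OFF"])

# expand=False returns a Series; only the first run of digits is captured
percent = offers.str.extract(r"(\d+)", expand=False).astype(float)
print(percent.tolist())
```

Missing offers stay missing (NaN) after the conversion, which is why the feature matrix is later filled with 0.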

#### Verify the data types of the features once more

In [47]:
clothing_df.dtypes

URL                                   object
Product_id                             int64
Description                           object
DiscountPrice                        float64
OriginalPrice                        float64
DiscountOffer                        float64
SizeOption                            object
Ratings                              float64
Reviews                              float64
product_name                          object
pca_brandname_0                      float64
pca_brandname_1                      float64
pca_brandname_2                      float64
pca_brandname_3                      float64
pca_brandname_4                      float64
pca_brandname_5                      float64
pca_brandname_6                      float64
pca_brandname_7                      float64
pca_brandname_8                      float64
pca_brandname_9                      float64
pca_brandname_10                     float64
pca_brandname_11                     float64
pca_brandn

#### Remarks: Data types look good (as expected)

#### Split dataset (80, 20)

In [48]:
# Split the data
train_data, test_data = train_test_split(clothing_df, test_size=0.2, random_state=42)


##### cb_features: To be used for the training. 

Because there are many feature columns, we exclude the `non_cb_features` from all columns and keep the rest as our `cb_features`.

In [49]:
# 'train_data' is our dataframe and 'cb_features' are the columns to be used
non_cb_features = ['Product_id', 'URL', 'SizeOption', 'Ratings', 'Reviews', 'Description', 'product_name']
cb_features = [col for col in train_data.columns if col not in non_cb_features]


In [50]:
cb_features

['DiscountPrice',
 'OriginalPrice',
 'DiscountOffer',
 'pca_brandname_0',
 'pca_brandname_1',
 'pca_brandname_2',
 'pca_brandname_3',
 'pca_brandname_4',
 'pca_brandname_5',
 'pca_brandname_6',
 'pca_brandname_7',
 'pca_brandname_8',
 'pca_brandname_9',
 'pca_brandname_10',
 'pca_brandname_11',
 'pca_brandname_12',
 'pca_brandname_13',
 'pca_brandname_14',
 'pca_brandname_15',
 'pca_brandname_16',
 'pca_brandname_17',
 'pca_brandname_18',
 'pca_brandname_19',
 'pca_individual_category_0',
 'pca_individual_category_1',
 'pca_individual_category_2',
 'pca_individual_category_3',
 'pca_individual_category_4',
 'pca_individual_category_5',
 'pca_individual_category_6',
 'pca_individual_category_7',
 'pca_individual_category_8',
 'pca_individual_category_9',
 'pca_individual_category_10',
 'pca_individual_category_11',
 'pca_individual_category_12',
 'pca_individual_category_13',
 'pca_individual_category_14',
 'pca_individual_category_15',
 'pca_individual_category_16',
 'pca_individual_ca

In [51]:
len(cb_features)

51

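#### Remarks: We are using 51 features for the training. The count is higher than the list of input features because the categorical features were expanded by one-hot encoding (with PCA applied to BrandName and Individual_category, judging by the column names).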

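#### Let's create the feature matrix (named `tfidf_matrix`):
Despite the name, no TF-IDF (Term Frequency-Inverse Document Frequency) weighting is computed here. The matrix holds the numeric `cb_features` of each product, indexed by `Product_id`, with missing values filled with 0.

In [52]:
# Create the feature matrix for CB filtering (no TF-IDF weighting is applied)
tfidf_matrix = train_data.set_index('Product_id')[cb_features].fillna(0)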
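
For comparison, a TF-IDF representation in the usual sense is built from text. The sketch below is a minimal illustration on toy descriptions written in the style of the `Description` column; it is not a step this notebook applies to its features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions in the style of the Description column
descriptions = [
    "roadster men navy blue slim fit jeans",
    "roadster men blue skinny fit jeans",
    "zivame women black saree shapewear",
]

vectorizer = TfidfVectorizer()
# Sparse matrix with one row per description and one column per distinct word
text_tfidf = vectorizer.fit_transform(descriptions)

print(text_tfidf.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```

Words shared by many descriptions (such as "men") receive lower weights than words that single out a few products.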

In [53]:
tfidf_matrix

Product_id,DiscountPrice,OriginalPrice,DiscountOffer,pca_brandname_0,pca_brandname_1,pca_brandname_2,pca_brandname_3,pca_brandname_4,pca_brandname_5,pca_brandname_6,...,pca_individual_category_18,pca_individual_category_19,Category_Indian Wear,Category_Inner Wear & Sleep Wear,Category_Lingerie & Sleep Wear,Category_Plus Size,Category_Sports Wear,Category_Topwear,Category_Western,category_by_Gender_Women
14685592,324.0,1299.0,75.0,-0.076477,-0.094640,-0.200341,0.934253,0.166078,0.043500,0.034887,...,-0.003831,-0.003446,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
13673098,1559.0,2399.0,35.0,-0.047953,-0.036005,-0.047985,-0.046413,-0.062067,-0.061202,-0.092845,...,-0.005118,-0.004585,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
11790004,399.0,799.0,50.0,-0.033569,-0.021034,-0.025610,-0.021041,-0.024074,-0.016253,-0.017834,...,-0.016033,-0.013911,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16878312,1264.0,2299.0,45.0,-0.067500,-0.069379,-0.117166,-0.250873,0.932096,0.070445,0.050385,...,-0.016033,-0.013911,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15403992,849.0,999.0,15.0,-0.038227,-0.025310,-0.031593,-0.027050,-0.032042,-0.023109,-0.026458,...,0.034426,0.034708,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17151468,1259.0,1799.0,30.0,-0.033238,-0.020748,-0.025219,-0.020665,-0.023594,-0.015866,-0.017368,...,-0.005118,-0.004585,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
12497676,479.0,2178.0,78.0,-0.032215,-0.019876,-0.024040,-0.019544,-0.022173,-0.014737,-0.016021,...,-0.010445,-0.009211,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1692131,489.0,699.0,30.0,-0.032480,-0.020100,-0.024341,-0.019829,-0.022532,-0.015020,-0.016357,...,-0.029704,-0.024784,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
8255321,1291.0,3149.0,59.0,-0.039855,-0.026920,-0.033930,-0.029528,-0.035501,-0.026498,-0.030992,...,-0.005143,-0.004607,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [54]:
# Keep the index when saving: it holds Product_id, which is needed to look products up later
tfidf_matrix.to_csv('tfidf_matrix.csv', index=True)

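#### Transforming the data to a sparse matrix
The CSR format stores only the non-zero elements and their positions, which saves memory when most entries are zero. Here about 87% of the entries are non-zero (see the stored-element count in the output below), so the conversion mainly provides a convenient input format rather than a memory saving.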

In [55]:
# Transform to a sparse matrix
tfidf_sparse = csr_matrix(tfidf_matrix)

In [71]:
tfidf_sparse

<94032x51 sparse matrix of type '<class 'numpy.float64'>'
	with 4163997 stored elements in Compressed Sparse Row format>
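
The figures in the summary above allow a quick back-of-the-envelope check of how sparse the matrix really is. The sketch assumes float64 values and int32 index arrays, which are common defaults but are not confirmed by the output.

```python
# Figures taken from the sparse-matrix summary above
n_rows, n_cols, nnz = 94032, 51, 4163997

# Fraction of entries that are actually stored (non-zero)
density = nnz / (n_rows * n_cols)

# Assumption: float64 values (8 bytes) and int32 index arrays (4 bytes)
dense_bytes = n_rows * n_cols * 8
csr_bytes = nnz * (8 + 4) + (n_rows + 1) * 4

print(f"density: {density:.1%}")
print(f"dense: {dense_bytes / 1e6:.1f} MB, CSR: {csr_bytes / 1e6:.1f} MB")
```

Under these assumptions the CSR representation is larger than a plain dense array, because every stored value also carries a column index.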

#### Instantiate our learning model: NearestNeighbors

In [56]:
# NearestNeighbors with brute-force search: exact (not approximate) neighbours under cosine distance
nn = NearestNeighbors(metric='cosine', algorithm='brute')
nn.fit(tfidf_sparse)
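
Cosine distance on unscaled features may be dominated by the columns with the largest magnitudes, which here are the prices. The sketch below uses made-up rows to illustrate the effect and shows column standardisation as one common remedy. The standardisation step is an assumption for illustration, not something this notebook does.

```python
import numpy as np


def cosine_similarity_1d(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical rows: [DiscountPrice, OriginalPrice, DiscountOffer, cat_a, cat_b]
jeans = np.array([824.0, 1499.0, 45.0, 1.0, 0.0])
saree = np.array([830.0, 1510.0, 45.0, 0.0, 1.0])  # different category, similar price

raw = cosine_similarity_1d(jeans, saree)

# Standardise each column over a small toy matrix (zero mean, unit variance)
X = np.vstack([jeans, saree,
               [404.0, 899.0, 55.0, 1.0, 0.0],
               [1559.0, 2399.0, 35.0, 0.0, 1.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = cosine_similarity_1d(Z[0], Z[1])

print(f"raw: {raw:.4f}, standardised: {scaled:.4f}")
```

On the raw rows the two products look almost identical despite belonging to different categories; after standardisation the category columns carry comparable weight and the similarity drops sharply.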

#### Create a function `get_recommendations` that finds top 10 similar products

In [58]:
# def get_recommendations(item_index, model, data, n_neighbors=10):
#     distances, indices = model.kneighbors(data[item_index], n_neighbors=n_neighbors)
#     return indices.flatten(), distances.flatten() 


In [60]:
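def get_recommendations(product_id, model, data, n_neighbors=10):
    # `data` is the Product_id-indexed feature DataFrame that the model was fitted on
    product_index = data.index.get_loc(product_id)
    # Select the row by position; data[product_index] would look up a column instead
    query = data.iloc[[product_index]].to_numpy()
    distances, indices = model.kneighbors(query, n_neighbors=n_neighbors)
    recommended_product_ids = data.index[indices.flatten()].tolist()
    # The nearest neighbour is the product itself, so drop the first entry
    return recommended_product_ids[1:], distances.flatten()[1:]

#### A dictionary to store similar products

In [61]:
recommendations = {}

# Iterate over product IDs and pass the DataFrame rather than the csr_matrix,
# because get_recommendations resolves IDs through `data.index`
for product_id in tfidf_matrix.index:
    recommended_ids, _ = get_recommendations(product_id, nn, tfidf_matrix)
    recommendations[product_id] = recommended_ids  # the item itself is already excluded

#### Remarks: Calling `get_recommendations` with `tfidf_sparse` fails with `AttributeError: 'csr_matrix' object has no attribute 'index'`, so the DataFrame is used for the lookup.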
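
Querying the model once per product can be slow for tens of thousands of rows. As a hedged alternative, the sketch below builds the whole dictionary from a single batched `kneighbors` call. The toy feature matrix and the product IDs are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy feature matrix indexed by made-up product IDs
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.random((6, 4)), index=[101, 102, 103, 104, 105, 106])

model = NearestNeighbors(metric="cosine", algorithm="brute").fit(features.to_numpy())

# One call returns the neighbours of every row; ask for k + 1 because each
# row's nearest neighbour is normally the row itself
k = 3
_, indices = model.kneighbors(features.to_numpy(), n_neighbors=k + 1)

recommendations = {}
for product_id, neighbour_rows in zip(features.index, indices):
    ids = features.index[neighbour_rows].tolist()
    # Drop the query product by ID rather than by position, in case of ties
    recommendations[product_id] = [pid for pid in ids if pid != product_id][:k]

print(recommendations[101])
```

Filtering by ID instead of slicing off the first entry keeps the result correct even when two products have identical feature vectors.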

In [57]:
# Save the recommendations dictionary to a file
with open('recommendations.pkl', 'wb') as f:
    pickle.dump(recommendations, f)

print("Recommendations saved successfully.")

Recommendations saved successfully.


In [58]:
train_data.to_csv('data_with_recommendations.csv', index=False) 

#### Load the model and get some recommendations

In [15]:
# Load the recommendations dictionary from the file
try:
    with open('recommendations.pkl', 'rb') as f:
        loaded_recommendations = pickle.load(f)
    print("Recommendations loaded successfully.")
    # The dictionary is not printed in full: it is large enough to exceed Jupyter's IOPub data rate limit
except EOFError:
    print("Error: The file is empty or corrupted.")

Recommendations loaded successfully.




In [16]:
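# Function to get similar products with descriptions
def get_similar_products_with_descriptions(product_id, n_neighbors=10):
    # Row position of the product in the training feature matrix
    product_index = tfidf_matrix.index.get_loc(product_id)
    # Query the fitted model with that row; n_neighbors includes the product itself
    _, indices = nn.kneighbors(tfidf_sparse[product_index], n_neighbors=n_neighbors)
    # Skip the first neighbour (the query product) and map positions back to Product_id
    recommended_products = [tfidf_matrix.index[idx] for idx in indices.flatten()[1:]]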
    
    # Get descriptions for the recommended products
    similar_products_with_descriptions = []
    for product in recommended_products:
        product_name = clothing_df.loc[clothing_df['Product_id'] == product, 'product_name'].values[0]
        similar_products_with_descriptions.append((product, product_name))
    
    return similar_products_with_descriptions

In [17]:
data_with_recommendations = pd.read_csv('./data_with_recommendations.csv')

In [30]:
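# Note: `.head` without parentheses displays the bound method (with the frame's repr); call `.head()` to see the first five rows
data_with_recommendations.head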

<bound method NDFrame.head of                                                      URL  Product_id  \
0      https://www.myntra.com/kurtas/sangria/sangria-...    14685592   
1      https://www.myntra.com/shirts/wrogn/wrogn-men-...    13673098   
2      https://www.myntra.com/shorts/ajile-by-pantalo...    11790004   
3      https://www.myntra.com/shorts/puma/puma-men-bl...    16878312   
4      https://www.myntra.com/thermal-tops/levis/levi...    15403992   
...                                                  ...         ...   
94027  https://www.myntra.com/shirts/van-heusen-sport...    17151468   
94028  https://www.myntra.com/sarees/silk-bazar/silk-...    12497676   
94029  https://www.myntra.com/briefs/basiics-by-la-in...     1692131   
94030  https://www.myntra.com/kurta-sets/vishudh/vish...     8255321   
94031  https://www.myntra.com/bra/dressberry/dressber...    14461768   

                                             Description  DiscountPrice  \
0            sangria women pin

In [29]:
loaded_tfidf_matrix = pd.read_csv('./tfidf_matrix.csv')
# read_csv gives a default RangeIndex, so restore Product_id as the index.
# data_with_recommendations was saved from train_data in the same row order.
loaded_tfidf_matrix.index = data_with_recommendations['Product_id'].to_numpy()

In [18]:
# Function to get similar products with descriptions, using the reloaded matrix for lookups
def get_similar_products_(product_id, n_neighbors=10):
    # Row position of the product in the training feature matrix
    product_index = loaded_tfidf_matrix.index.get_loc(product_id)
    # Query the fitted model with that row; the first neighbour is the product itself
    _, indices = nn.kneighbors(tfidf_sparse[product_index], n_neighbors=n_neighbors)
    recommended_products = [loaded_tfidf_matrix.index[idx] for idx in indices.flatten()[1:]]
    
    # Get descriptions for the recommended products
    similar_products_with_descriptions = []
    for product in recommended_products:
        product_name = clothing_df.loc[clothing_df['Product_id'] == product, 'product_name'].values[0]
        similar_products_with_descriptions.append((product, product_name))
    
    return similar_products_with_descriptions

In [23]:
# Example: Get top 5 similar products for a given product_id
product_id_example = data_with_recommendations['Product_id'].iloc[1]
similar_products = get_similar_products_(product_id_example, n_neighbors=5)
for product in similar_products:
    print(f"Product ID: {product[0]}, Product: {product[1]}")

In [24]:
# Example: Get top 5 similar products for a given product_id
product_id_example = train_data['Product_id'].iloc[1]
similar_products = get_similar_products_(product_id_example, n_neighbors=5)
for product in similar_products:
    print(f"Product ID: {product[0]}, Product: {product[1]}")

#### Remarks: `loaded_tfidf_matrix` must be indexed by Product_id. With the default RangeIndex that `read_csv` creates, `get_loc` raises `KeyError: 13673098` for both examples above.

In [74]:
# Example: n_neighbors=5 includes the query product itself, so 4 similar products are returned
product_id_example = train_data['Product_id'].iloc[500]
similar_products = get_similar_products_with_descriptions(product_id_example, n_neighbors=5)
for product in similar_products:
    print(f"Product ID: {product[0]}, Product: {product[1]}")

Product ID: 13249718, Product: roadster-men-navy-blue-skinny-fit-light-fade-stretchable-jeans
Product ID: 13249912, Product: roadster-men-blue-carrot-fit-low-distress-light-fade-stretchable-jeans
Product ID: 12303566, Product: roadster-men-blue-slim-fit-mid-rise-clean-look-stretchable-jeans
Product ID: 11691082, Product: roadster-men-blue-skinny-fit-mid-rise-clean-look-stretchable-jeans


In [39]:
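def get_recommendations(product_id, model, data, n_neighbors=10):
    # `data` is the Product_id-indexed feature DataFrame that the model was fitted on
    product_index = data.index.get_loc(product_id)
    # Select the row by position; data[product_index] would look up a column instead
    query = data.iloc[[product_index]].to_numpy()
    distances, indices = model.kneighbors(query, n_neighbors=n_neighbors)
    recommended_product_ids = data.index[indices.flatten()].tolist()
    return recommended_product_ids[1:], distances.flatten()[1:]  # Exclude the item itself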

def get_similar_products_with_descriptions(product_id, n_neighbors=10):
    recommended_product_ids, _ = get_recommendations(product_id, nn, tfidf_matrix, n_neighbors=n_neighbors)
    similar_products_with_descriptions = []
    for prod_id in recommended_product_ids:
        product_name = clothing_df.loc[clothing_df['Product_id'] == prod_id, 'product_name'].values[0]
        similar_products_with_descriptions.append((prod_id, product_name))
    return similar_products_with_descriptions


In [40]:
def evaluate_recommendations(data, model, tfidf_matrix, n_neighbors=10):
    # Accept Product_id either as a column or as the index
    features = data.set_index('Product_id') if 'Product_id' in data.columns else data
    # Test products are not rows of the training matrix, so query the model with
    # their feature vectors instead of looking them up by Product_id
    queries = features[tfidf_matrix.columns].fillna(0).to_numpy()
    distances, _ = model.kneighbors(queries, n_neighbors=n_neighbors)
    # Cosine similarity = 1 - cosine distance; average over all products and neighbours
    return float((1 - distances).mean())


In [43]:
# Evaluate the recommendations
average_similarity = evaluate_recommendations(test_data, nn, tfidf_matrix)
print(f"Average Similarity of Recommendations: {average_similarity:.4f}")

#### Remarks: In the `test_data` shown below, `Product_id` is the index rather than a column, so reading it with `row['Product_id']` raises `KeyError: 'Product_id'`. The function therefore accepts either layout and queries the model with feature vectors.

In [42]:
test_data

Product_id,DiscountPrice,OriginalPrice,DiscountOffer,pca_brandname_0,pca_brandname_1,pca_brandname_2,pca_brandname_3,pca_brandname_4,pca_brandname_5,pca_brandname_6,...,pca_individual_category_18,pca_individual_category_19,Category_Indian Wear,Category_Inner Wear & Sleep Wear,Category_Lingerie & Sleep Wear,Category_Plus Size,Category_Sports Wear,Category_Topwear,Category_Western,category_by_Gender_Women
11898988,1299.0,2599.0,50.0,-0.036214,-0.023402,-0.028891,-0.024268,-0.028287,-0.019869,-0.022254,...,-0.005118,-0.004585,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
14345966,699.0,1999.0,65.0,-0.038535,-0.025609,-0.032024,-0.027499,-0.032659,-0.023724,-0.027277,...,-0.014266,-0.012441,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
16925756,4199.0,5999.0,30.0,-0.034613,-0.021952,-0.026869,-0.022264,-0.025652,-0.017548,-0.019406,...,-0.005001,-0.004482,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15201176,519.0,1999.0,74.0,-0.032729,-0.020311,-0.024627,-0.020100,-0.022875,-0.015292,-0.016681,...,-0.003831,-0.003446,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
13503210,439.0,1099.0,60.0,-0.032000,-0.019696,-0.023796,-0.019315,-0.021885,-0.014512,-0.015754,...,0.660176,-0.603985,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2132777,1799.0,4390.0,59.0,-0.034164,-0.021555,-0.026323,-0.021730,-0.024961,-0.016977,-0.018709,...,-0.010445,-0.009211,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
16650730,600.0,1250.0,52.0,-0.032906,-0.020462,-0.024832,-0.020295,-0.023123,-0.015489,-0.016916,...,-0.029704,-0.024784,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
15875414,539.0,899.0,40.0,-0.032235,-0.019893,-0.024062,-0.019565,-0.022199,-0.014758,-0.016046,...,-0.016033,-0.013911,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15688410,1499.0,1999.0,25.0,-0.067500,-0.069379,-0.117166,-0.250873,0.932096,0.070445,0.050385,...,-0.016033,-0.013911,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
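# Get the top 5 similar training products for a product from the test set.
# Test products are not rows of tfidf_matrix, so the model is queried with the feature vector.
query = test_data.iloc[[100]][cb_features].fillna(0).to_numpy()
_, indices = nn.kneighbors(query, n_neighbors=5)
for prod_id in tfidf_matrix.index[indices.flatten()]:
    product_name = clothing_df.loc[clothing_df['Product_id'] == prod_id, 'product_name'].values[0]
    print(f"Product ID: {prod_id}, Product: {product_name}")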