# 4. Modelling
The goal of the project is to build a product recommendation system for skincare products from Sephora. In this section, I tried a couple of different models to build recommendation systems: 
1. A Naive recommender: using a weighted popularity metric that takes into account both a product's average rating and its number of reviews, I computed the top 50 products among those in the top 10% by number of reviews and randomly recommended these products to new customers. In other words, each product in this list has received more reviews than 90% of the products on the platform. 
- Pros: this naive recommender works quite well because the products in this list have accumulated many ratings and their quality has been validated. Thus, there is a high probability that new customers will like them.
- Cons: this model is not personalized and doesn't give new products a chance. It is also very sensitive to the cut-off value for the number of reviews. 

2. Content-based recommendation system: uses item features to recommend other products similar to a particular product. In this project, I computed the pairwise cosine similarity scores between all products using different features such as product name, short description, ingredients, etc. The idea is that if a customer likes a particular product, he/she will also like something similar to it. 
- Pros: the product names are fairly descriptive (for most skincare products); thus, recommendations made by scanning for similarity between the products' names, descriptions, etc. work quite well. This is a simple recommender to build and deploy since most websites/stores have a finite number of products, and the pairwise similarity scores can easily be pre-computed. It only requires data on the products' characteristics and can recommend a product that hasn't been reviewed by any customer.
- Cons: the recommendation is not personalized to a customer's preferences.



In [1]:
import pandas as pd
import numpy as np
from itertools import chain # to flatten a 2-D list
import math
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_colwidth', None) # display the whole DF content

In [2]:
products = (pd.read_csv('Cleaned_data/cleaned_all_products.csv', index_col = 0, 
                        dtype = {'brand_name': str, 'product_name': str, 'price': float,
                                 'category': str, 'loves_count': int, 'product_id': str, 'no_reviews': int, 
                                 'rating': float, 'short_description': str, 'ingredients': str, 
                                 'unique_product_name': str, 'Total_reviews': 'float64', 'RecommendedCount': 'float64', 
                                 'AverageOverallRating': 'float64', '1star': 'float64', '2star': 'float64', '3star': 'float64', 
                                 '4star': 'float64', '5star': 'float64', 'normal': 'float64', 'combination': 'float64', 'dry': 'float64',
                                 'oily': 'float64', 'acneConcern': 'float64', 'agingConcern': 'float64', 'blackheadsConcern': 'float64',
                                 'dullnessConcern': 'float64', 'rednessConcern': 'float64', 'sensitivityConcern': 'float64',
                                 'darkCirclesConcern': 'float64', 'sunDamageConcern': 'float64', 'nonStaffReviews': 'float64',
                                 'staffReviews': 'float64', 'incentivizedReviews': 'float64', '13to17': int, '18to24': int, 
                                 '25to34': int, '35to44': int, '45to54': int, 'over54': int, 'size_oz': float, 
                                 'price_per_oz': float, 'fragrance_free': int, 'alcohol_free': int, 
                                 'phthalates_free': int, 'sulfate_free': int, 'formaldehydes_free': int, 
                                 'parabens_free': int, 'mineral_oil_free': int, 'oil_free': int, 'gluten_free': int, 
                                 'cruelty_free': int, 'clean_at_sephora': int, 'vegan': int, 'planet_positive': int, 
                                 'community_favorite': int, 'ha_acid': int, 'niacinamide': int, 'salicylic_acid': int, 
                                 'AHA_Glycolic_acid': int, 'retinol': int, 'vitamin_C': int, 'uv_protection': int, 
                                 'collagen': int, 'dry_skin': int, 'combo_skin': int, 'normal_skin': int,
                                 'oily_skin': int, 'firmness': int, 'dullness': int, 'pores': int, 'acne': int, 
                                 'dark_cirlces': int, 'dark_spots': int, 'redness': int, 'allure_award': int, 'highlights': str}))


In [3]:
reviews = pd.read_csv('Cleaned_data/cleaned_all_reviews.csv', index_col = 0,
                      dtype = {'UserNickname': str, 'Rating': int, 'ReviewText': str, 'userSkinType': str, 
                                  'userEyeColor': str,'isSephoraStaff': int, 'isVerifiedPurchase': int, 
                                  'userHairColor': str, 'userSkinTone': str,'isIncentivizedReview': int, 
                                  'p_id': str, 'author_id': str, 'TotalPositiveFeedbackCount': int, 
                                  'TotalNegativeFeedbackCount': int, 'TotalFeedbackCount': int, 
                                  'userSkinConcern': str})

In [4]:
# rename 'rating' columns to avoid confusion
products = products.rename(columns = {'rating': 'product_rating'})
reviews = reviews.rename(columns = {'Rating': 'user_rating'})

When making recommendations, only intrinsic product features should be considered, so rating-related features are removed for now. 

In [5]:
# drop columns that are not relevant
# drop 'highlights' and keep 'short_description' (see Data Wrangling)
cleaned_products = (products.copy().drop(columns = ['no_reviews',  
                                             '1star', '2star', '3star', '4star', '5star', 
                                             'normal', 'combination', 'dry', 'oily', 'acneConcern', 'agingConcern', 
                                             'blackheadsConcern', 'dullnessConcern', 'rednessConcern', 'sensitivityConcern', 
                                             'darkCirclesConcern', 'sunDamageConcern', 'product_rating',
                                             'nonStaffReviews', 'staffReviews', 'incentivizedReviews', 
                                             '13to17', '18to24', '25to34', '35to44', '45to54', 'over54', 
                                             'loves_count', 'size_oz', 'community_favorite', 'RecommendedCount', 
                                             'price_per_oz']))

In [6]:
cleaned_reviews = (reviews.copy().drop(columns = ['UserNickname', 'userSkinType', 'userEyeColor', 'isSephoraStaff', 
                                                  'isVerifiedPurchase', 'userHairColor', 'userSkinTone', 
                                                  'isIncentivizedReview', 'TotalPositiveFeedbackCount', 
                                                  'TotalNegativeFeedbackCount', 'userSkinConcern', 'TotalFeedbackCount']))

In [7]:
# if a customer reviewed a product multiple times, retain only the first record
cleaned_reviews.drop_duplicates(subset = ['p_id', 'author_id'], keep = 'first', inplace = True, ignore_index = True)

## 4.1 Naive recommender
This simple system randomly recommends the top products to every customer. The top products are chosen based on ratings and number of reviews, assuming that well-received products have a higher probability of being liked by customers in general. 


### Choosing a popularity metric:
I used a weighted rating that takes into account both the average rating and the number of ratings a product has received. This avoids the situation where a product with just a few reviews has a very high or almost perfect rating. As the number of reviews increases, a product's rating stabilizes and approaches a value that reflects its quality, which gives the customer a much better idea of which product to choose. 

I use the metric recommended in the DataCamp tutorial (credited in section 4.2), which according to that tutorial is how the IMDB Top 250 movies were ranked:

$$\text{WR} = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

in which: 
v is the number of reviews (votes) for the product;
m is the minimum number of reviews required to be listed in the chart;
R is the average rating of the product;
C is the mean rating across all products.

This weighted rating ensures that a highly rated product with 100,000 reviews gets a (far) higher score than a product with the same rating but only a few hundred reviews.
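
To make the formula concrete, here is a minimal sketch on toy numbers. The function name `weighted_rating` and the values of `m` and `C` are assumptions chosen for illustration, not values from the Sephora data. The sketch takes `C` to be a mean rating on the same 1-5 scale as `R`, so the result always lies between `R` and `C`.

```python
def weighted_rating(v, R, m, C):
    """Shrink a product's average rating R toward the global mean rating C.

    v: number of reviews for the product
    R: average rating of the product
    m: minimum number of reviews required to qualify (assumed cut-off)
    C: mean rating across all products (same 1-5 scale as R)
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


# Toy values (hypothetical, not taken from the Sephora data)
m, C = 900, 4.2

# Perfect rating but very little evidence: the score stays close to C
few_reviews = weighted_rating(v=10, R=5.0, m=m, C=C)
# Slightly lower rating but a lot of evidence: the score stays close to R
many_reviews = weighted_rating(v=5000, R=4.6, m=m, C=C)

print(round(few_reviews, 3), round(many_reviews, 3))
```

With these toy values the product with 5,000 reviews ends up ranked above the one with 10 reviews, even though its raw average is lower.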

For this model, I chose the 85th percentile of the number of reviews as the value of m. In other words, to make the top 50, a product must have more reviews than 85% of the products on the platform. To test the approach, I also looked at the case where m is the 90th percentile. The resulting top 50 list was completely different. 

Pros: this simple recommender works quite well given that all of the top products have received a lot of reviews with very good ratings and their quality has been validated. The probability of a new customer liking them is pretty high.  

Cons: this model is not personalized to a customer's preferences and doesn't give new products a chance. It is also very sensitive to the cut-off value. 

### 4.1.1 Top 50 products using 85th percentile cut-off

In [14]:
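print(cleaned_products.Total_reviews.quantile([.8, .85, .9, .95, .99]))
# 20% of the products received more than ~674 reviews and 15% more than ~923;
# the 85th percentile is used as the cut-off m here
no_reviews_cutoff = cleaned_products.Total_reviews.quantile(.85)
# NOTE: this is the mean *number of reviews*, not the mean rating. C in the formula is a
# mean rating, so using this value as C pushes the scores far above the 1-5 rating scale
# (see the weighted_rating column in the output tables).
mean_reviews = cleaned_products.Total_reviews.mean()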

0.80     674.40
0.85     922.95
0.90    1311.80
0.95    1968.10
0.99    4332.42
Name: Total_reviews, dtype: float64


In [10]:
def compute_weighted_rating(data, m = no_reviews_cutoff, C = mean_reviews):
    """ 
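    Compute the weighted rating from the average rating and the number of reviews.
    Note: the defaults for m and C are evaluated once, when this cell is run; pass m and C
    explicitly if the cut-off is changed afterwards.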
    """
    v = data['Total_reviews']
    R = data['AverageOverallRating']
    return (v/(v + m)*R)+(m/(v+m)*C)
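
A small aside on the signature above: Python evaluates default argument values once, when the `def` statement runs, so a cut-off that is reassigned in a later cell is only picked up if it is passed explicitly. The sketch below illustrates this with toy names (`cutoff` and `score` are hypothetical and not part of the notebook).

```python
cutoff = 100


def score(v, m=cutoff):
    # m defaults to the value `cutoff` had when `def` ran (100 here)
    return v / (v + m)


cutoff = 300                     # rebinding the global later does not touch the default
stale = score(300)               # still uses m = 100
explicit = score(300, m=cutoff)  # uses m = 300

print(stale, explicit)
```

Passing `m` and `C` as keyword arguments at each call site avoids depending on the order in which cells were executed.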

In [15]:
# filter out all qualified products:
top_products = cleaned_products[cleaned_products['Total_reviews'] >= no_reviews_cutoff].copy()
top_products['weighted_rating'] = compute_weighted_rating(top_products)

In [16]:
#Print the top 20 products with the highest weighted rating score: 
(top_products[['unique_product_name', 'price', 
               'category', 'weighted_rating',
               'Total_reviews', 'AverageOverallRating']].sort_values(by = 'weighted_rating', 
                                                                                    ascending=False).head(20))

Unnamed: 0,unique_product_name,price,category,weighted_rating,Total_reviews,AverageOverallRating
1113,Rose Deep Hydration Oil-Infused Serum fresh,59.0,facial treatments,207.744101,927.0,4.5955
978,Ceramic Slip French Green Clay Cleanser Sunday Riley,35.0,cleanser,207.18448,928.0,3.8481
1921,The Body Wash - With Niacinamide Nécessaire,25.0,vegan skin care,207.085045,932.0,4.5494
203,Rose Floral Toner fresh,40.0,moisturizing cream oils mists,206.953662,931.0,4.1053
129,Ultra Repair Face Moisturizer First Aid Beauty,28.0,moisturizing cream oils mists,205.763538,941.0,4.2179
966,Blueberry Bounce Gentle Cleanser Glow Recipe,34.0,cleanser,205.538161,941.0,3.831
1101,Banana Bright 15% Vitamin C Serum OLEHENRIKSEN,68.0,facial treatments,205.415725,943.0,4.0488
46,Glow Lip Pop Lip Balm Glow Recipe,22.0,lip balm lip care,204.653446,952.0,4.6502
152,C.E.O. Vitamin C Brightening Rich Hydration Moisturizer Sunday Riley,65.0,moisturizing cream oils mists,204.409587,951.0,4.0231
313,Midnight Recovery Concentrate Moisturizing Face Oil Kiehl's Since 1851,52.0,moisturizing cream oils mists,204.401983,952.0,4.2206


### 4.1.2 Top 50 products using 90th percentile cut-off

In [17]:
# change the cut-off to the 90th percentile and check the results:
no_reviews_cutoff = cleaned_products.Total_reviews.quantile(.9) # each product has at least ~1312 reviews
mean_reviews = cleaned_products.Total_reviews.mean()
# filter out all qualified products:
top_products = cleaned_products[cleaned_products['Total_reviews'] >= no_reviews_cutoff].copy()
# the new cut-off only affects the filter above; compute_weighted_rating still uses the
# default m and C captured when it was defined unless they are passed explicitly
top_products['weighted_rating'] = compute_weighted_rating(top_products)
#Print the top 20 products with the highest weighted rating score: 
(top_products[['unique_product_name', 'price', 
               'category', 'weighted_rating',
               'Total_reviews', 'AverageOverallRating']].sort_values(by = 'weighted_rating', 
                                                                                    ascending=False).head(20))

Unnamed: 0,unique_product_name,price,category,weighted_rating,Total_reviews,AverageOverallRating
901,Aqua Bomb Cleansing Balm belif,34.0,cleanser,168.200375,1313.0,4.4631
73,Crème de la Mer Moisturizer La Mer,530.0,moisturizing cream oils mists,167.877353,1314.0,4.0989
622,Mini Crème de la Mer Moisturizer La Mer,95.0,moisturizing cream oils mists,167.79549,1315.0,4.0996
1945,The Silk Powder Protective Setting Powder Tatcha,48.0,vegan skin care,167.37029,1323.0,4.4475
139,Cicapair™ Tiger Grass Cream Dr. Jart+,49.0,moisturizing cream oils mists,167.307167,1323.0,4.3522
1248,Cold Plasma+ Advanced Serum Concentrate Perricone MD,149.0,facial treatments,167.304992,1321.0,4.1022
1132,The Serum Stick: Treatment & Touch Up Balm Tatcha,48.0,facial treatments,167.163315,1325.0,4.3811
1897,Umbra Sheer™ Physical Daily Defense SPF 30 Drunk Elephant,34.0,sunscreen sun protection,167.035411,1318.0,3.3232
122,High-Potency Night-a-Mins™ Resurfacing Cream with Fruit-Derived AHAs Origins,49.0,moisturizing cream oils mists,166.794653,1329.0,4.3153
157,Neo Nude Foundation Armani Beauty,40.0,moisturizing cream oils mists,166.421421,1333.0,4.2408


## 4.2 Content-based recommendation system or non-personalized recommendation system
Content-based filtering uses item features to recommend other products similar to a particular product. In this project, I computed the pairwise cosine similarity scores between all products using different features such as product name, short description, ingredients, etc. The idea is that if a customer likes a particular product, he/she will also like something similar to it. 

This is best suited for suggesting similar items (as on the Sephora website itself) or items that are frequently bought together (as on Amazon).

Credits: https://medium.com/@bindhubalu/content-based-recommender-system-4db1b3de03e7
and https://www.datacamp.com/community/tutorials/recommender-systems-python



In [18]:
print('No. reviews :', cleaned_reviews.shape[0])
print('No. products :', cleaned_products.shape[0])

No. reviews : 345270
No. products : 2184


In [19]:
import sklearn
from nltk.corpus import stopwords  # requires the NLTK 'stopwords' corpus to be downloaded
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel, euclidean_distances

#### Cosine similarity
Comparing two documents (in this case, product descriptions) is an NLP problem. It is not possible to compute the similarity between two descriptions in their raw form. To do this, I need to:
1. Represent each description (document) as a vector of words, i.e. a vectorized representation of the words in the document.
2. Compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document. This gives a matrix in which each column represents a word in the vocabulary (all the words that appear in at least one document) and each row represents a product. The TF-IDF score is the frequency of a word in a document, down-weighted by the number of documents in which it occurs. This reduces the importance of words that occur frequently across product descriptions.
3. Use the TF-IDF matrix to compute a similarity score (other similarity metrics to consider are the Manhattan, Euclidean, Pearson, and soft cosine similarity scores). 
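
The three steps above can be sketched from scratch on a few toy descriptions. This is only an illustration under stated assumptions: the descriptions are made up, and the weighting is the textbook `tf * ln(N / df)` variant. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy product descriptions (hypothetical, not from the dataset)
docs = {
    "gel_moisturizer": "oil free gel moisturizer",
    "cream_moisturizer": "rich cream moisturizer",
    "clay_cleanser": "green clay cleanser",
}

# Step 1: split each description into words and build the shared vocabulary
tokenized = {name: text.split() for name, text in docs.items()}
vocabulary = sorted({word for words in tokenized.values() for word in words})
n_docs = len(tokenized)

# Document frequency: in how many descriptions does each word occur?
doc_freq = {w: sum(w in words for words in tokenized.values()) for w in vocabulary}


def tfidf_vector(words):
    """Step 2: TF-IDF vector over the shared vocabulary, using tf * ln(N / df)."""
    counts = Counter(words)
    return [
        (counts[w] / len(words)) * math.log(n_docs / doc_freq[w])
        for w in vocabulary
    ]


def cosine(a, b):
    """Step 3: cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vectors = {name: tfidf_vector(words) for name, words in tokenized.items()}

sim_moisturizers = cosine(vectors["gel_moisturizer"], vectors["cream_moisturizer"])
sim_unrelated = cosine(vectors["gel_moisturizer"], vectors["clay_cleanser"])
print(sim_moisturizers, sim_unrelated)
```

The two moisturizers share the word "moisturizer" and get a positive score, while the gel moisturizer and the clay cleanser share no words and score zero.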

In [20]:
# TF(t): term frequency = (Number of times term t appears in a document) / (Total number of terms in the document).
# IDF(t): inverse document frequency = log_e(Total number of documents / Number of documents with term t in it).

def compute_cosine_similarity(col):
    """
    This function takes in a column, calculates the TF-IDF vector for each reviews/descriptions, then calculates 
    the pairwise similarity with each other reviews/descriptions in the dataset
    
    Input: 
        col - The text column for TFIDF analysis (nx1)
    Output:
        cosine_similarity - Pairwise similarity scores of all products (nxn)
    """
    # Stop words are words that add no significant value to the system, like 'an', 'is', 'the', and hence 
    # are ignored.
    stopwords_list = stopwords.words('english') 
    # scikit-learn provides a pre-built TF-IDF vectorizer that computes the TF-IDF score of each word in 
    # each document's description.
    
    vectorizer = TfidfVectorizer(analyzer='word', stop_words=stopwords_list) # remove English stop words such 
                                                                             # as 'the', 'a', 'is'
    # tfidf_matrix: one row per document (item), one column per word, holding the TF-IDF scores
    tfidf_matrix = vectorizer.fit_transform(col)
    
    # Calculate the pairwise cosine similarity between each item and every other item in the dataset.
    # TfidfVectorizer L2-normalises each row by default, so the dot product from linear_kernel() equals
    # the cosine similarity and is faster to compute than cosine_similarity().
    cosine_similarity = linear_kernel(tfidf_matrix, tfidf_matrix) 
    
    return cosine_similarity


In [21]:
# Function to get the top-n most similar products
def recommend(index, method, name_col=cleaned_products['unique_product_name'], top_n=5):
    """
    Get the pairwise similarity scores of all products against one product,
    sort the scores, and return the top n.
    Input:
        index - Positional index of the query product in the similarity matrix
        method - The pairwise similarity matrix (n x n)
        name_col - The column of product names
        top_n - The number of recommended items to return
    """
    # Get the pairwise similarity scores of all products with the input product
    similarity_scores = list(enumerate(method[index]))
    # Sort the scores in descending order; the first entry is the product itself (highest score), so skip it
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[1:1+top_n]
    
    # Get the positional indices of the recommended products
    pd_index = [i[0] for i in similarity_scores]
    
    # Return the names of the top_n most similar products
    return name_col.iloc[pd_index]


In [23]:
def generate_recommendation_DF(method, top_n = 5):
    """
    Create the recommendation dataframe.
    Input:
        method - The similarity method
    """
    similar_pds = [recommend(i, method, top_n = top_n).values for i in range(len(products))]
    recom_df = pd.DataFrame(similar_pds)
    recom_df.columns = [f'Recommedation{i+1}' for i in range(top_n)]
    recom_df['Product_Name'] = products['unique_product_name']
    col_names = ['Product_Name'] + [f'Recommedation{i+1}' for i in range(top_n)]
    return recom_df[col_names]


In [25]:
# Function that takes in a product name as input and outputs the most similar products
def get_recommendations(name, recom_product_name, products = cleaned_products):
    """
    Search the recommendation dataframe for the desired product and returns the top n recommended products
    """
    indices = pd.Series(recom_product_name.index, index=recom_product_name['Product_Name']).drop_duplicates()
    # Get the index of the product that matches the name
    product_ind = indices[name]
    name_list = list(recom_product_name.loc[product_ind,:])
    ind_list = [products[products['unique_product_name'] == name_i].index[0] for name_i in name_list]

    single_product_recom_DF = products.loc[ind_list, ['product_name', 'brand_name', 'category', 'price', 'AverageOverallRating']]

    # Return the queried product followed by its recommended products
    return single_product_recom_DF

### 4.2.1 Recommendations based on product names
Suppose we have a very simple recommendation system that suggests the next product based on how similar its name is to the name of the product the customer is currently viewing or has just purchased. This system tends to recommend other products from the same brand or even the same line/collection. In reality, a customer who enjoyed this moisturizer may want to try similar products from other brands. 

In [26]:
# compute all pairwise cosine similarity scores 
product_name_similarity = compute_cosine_similarity(cleaned_products['product_name']) 
# generate the recommendation DF
recom_product_name_DF = generate_recommendation_DF(product_name_similarity, 5)
# return the top 5 recommended products for a given product
get_recommendations(cleaned_products['unique_product_name'][5], recom_product_name_DF)

Unnamed: 0,product_name,brand_name,category,price,AverageOverallRating
5,Dramatically Different Moisturizing Gel,CLINIQUE,moisturizing cream oils mists,32.5,4.4416
293,Dramatically Different Moisturizing Cream,CLINIQUE,moisturizing cream oils mists,32.5,4.4012
304,Dramatically Different Moisturizing Lotion+,CLINIQUE,moisturizing cream oils mists,32.5,3.9172
578,Dramatically Different™ Moisturizing BB-Gel Tinted Moisturizer,CLINIQUE,moisturizing cream oils mists,17.0,3.219
545,Dramatically Different Hydrating Jelly,CLINIQUE,moisturizing cream oils mists,32.5,3.8892
580,Dramatically Different™ Hydrating Clearing Jelly,CLINIQUE,moisturizing cream oils mists,17.0,3.6154


In [27]:
# compute all pairwise cosine similarity scores 
product_name_similarity = compute_cosine_similarity(cleaned_products['unique_product_name']) 
# generate the recommendation DF
recom_product_name_DF = generate_recommendation_DF(product_name_similarity, 5)
# return the top 5 recommended products for a given product
get_recommendations(cleaned_products['unique_product_name'][5], recom_product_name_DF)


Unnamed: 0,product_name,brand_name,category,price,AverageOverallRating
5,Dramatically Different Moisturizing Gel,CLINIQUE,moisturizing cream oils mists,32.5,4.4416
293,Dramatically Different Moisturizing Cream,CLINIQUE,moisturizing cream oils mists,32.5,4.4012
304,Dramatically Different Moisturizing Lotion+,CLINIQUE,moisturizing cream oils mists,32.5,3.9172
578,Dramatically Different™ Moisturizing BB-Gel Tinted Moisturizer,CLINIQUE,moisturizing cream oils mists,17.0,3.219
545,Dramatically Different Hydrating Jelly,CLINIQUE,moisturizing cream oils mists,32.5,3.8892
580,Dramatically Different™ Hydrating Clearing Jelly,CLINIQUE,moisturizing cream oils mists,17.0,3.6154


### 4.2.2 Recommendations based on product descriptions

In [28]:
# compute all pairwise cosine similarity scores 
product_descrip_similarity = compute_cosine_similarity(cleaned_products['short_description']) 
# generate the recommendation DF
recom_product_descrip_DF = generate_recommendation_DF(product_descrip_similarity, 5)
# return the top 5 recommended products for a given product
get_recommendations(cleaned_products['unique_product_name'][5], recom_product_descrip_DF)

Unnamed: 0,product_name,brand_name,category,price,AverageOverallRating
5,Dramatically Different Moisturizing Gel,CLINIQUE,moisturizing cream oils mists,32.5,4.4416
304,Dramatically Different Moisturizing Lotion+,CLINIQUE,moisturizing cream oils mists,32.5,3.9172
1574,Water Bank Eye Gel,LANEIGE,eye treatment dark circle treatment,39.0,4.601
2069,Moisturizer Mini knockout brightening gel,tarte,vegan skin care,14.0,4.5
798,Balancing Toner with Green Tea,innisfree,cleanser,17.0,4.4737
115,Daily Greens Oil-Free Gel Moisturizer with Moringa and Papaya,Farmacy,moisturizing cream oils mists,38.0,4.1632


### 4.2.3 Recommendations based on product's ingredients

In [29]:
# compute all pairwise cosine similarity scores 
product_ingredient_similarity = compute_cosine_similarity(cleaned_products['ingredients']) 
# generate the recommendation DF
recom_product_ingredient_DF = generate_recommendation_DF(product_ingredient_similarity, 5)
# return the top 5 recommended products for a given product
get_recommendations(cleaned_products['unique_product_name'][5], recom_product_ingredient_DF)

Unnamed: 0,product_name,brand_name,category,price,AverageOverallRating
5,Dramatically Different Moisturizing Gel,CLINIQUE,moisturizing cream oils mists,32.5,4.4416
578,Dramatically Different™ Moisturizing BB-Gel Tinted Moisturizer,CLINIQUE,moisturizing cream oils mists,17.0,3.219
432,DayWear Matte Oil-Control Anti-Oxidant Moisturizer Gel Creme,Estée Lauder,moisturizing cream oils mists,55.0,4.6383
1318,Even Better Clinical™ Radical Dark Spot Corrector + Interrupter Serum,CLINIQUE,facial treatments,85.0,3.7397
1326,Mini Even Better Clinical™ Radical Dark Spot Corrector + Interrupter Serum,CLINIQUE,facial treatments,22.0,3.7397
846,Redness Solutions Soothing Cleanser,CLINIQUE,cleanser,25.0,4.1614


#### Recommendations based on product characteristics:

In [None]:
products[['unique_product_name', 'category', 'price', 'AverageOverallRating', 'clean_at_sephora']]

#### Jaccard Similarity Scores
The number of attributes shared by two products divided by the total number of distinct attributes across both products.
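
As a minimal sketch of this definition on binary attribute flags: the helper `jaccard_similarity` and the three toy products below are hypothetical. SciPy's `pdist(..., metric='jaccard')` returns the Jaccard *distance*, which is one minus this quantity.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two binary attribute vectors: |A and B| / |A or B|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Hypothetical flags in the order: parabens_free, clean_at_sephora, niacinamide, retinol
product_a = [1, 1, 1, 0]
product_b = [1, 1, 0, 0]
product_c = [0, 0, 0, 1]

sim_ab = jaccard_similarity(product_a, product_b)  # 2 shared flags out of 3 set in either
sim_ac = jaccard_similarity(product_a, product_c)  # no shared flags
print(sim_ab, sim_ac)
```

Attributes that are absent from both products do not count toward the score, which is why Jaccard suits sparse 0/1 flags like these.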


In [68]:
from sklearn.metrics import jaccard_score
from scipy.spatial.distance import pdist, squareform

product_features = (cleaned_products[['parabens_free', 'clean_at_sephora', 'ha_acid', 'niacinamide', 'AHA_Glycolic_acid', 
                                      'retinol', 'vitamin_C']].copy())
# drop out products with no classifications 
product_features = product_features[(product_features.sum(axis = 1) > 0).values]
# compute the pairwise jaccard distance
jaccard_distance = pdist(product_features.values, metric = 'jaccard') # this returns a 1-D array
jaccard_similarity = 1 - squareform(jaccard_distance) # jaccard distance measures how different the 2 items are

In [69]:
pid_index = cleaned_products.iloc[product_features.index]['product_id']
jaccard_similarity_df = pd.DataFrame(data = jaccard_similarity, columns = pid_index, index = pid_index) 
jaccard_similarity_df

ValueError: Shape of passed values is (1, 1), indices imply (0, 0)

### 4.2.4 Recommendations based on user profiles
This model extends the simple content-based recommender by taking a customer's profile into account in addition to the similarity scores. 
It first computes TF-IDF vectors for all products, builds a profile for the target customer from the products he/she has tried, and then finds the untried products most similar to that profile. 

The goal is to recommend items similar to what a user liked or rated in the past. Applications include suggesting alternatives to out-of-stock products or suggesting the next movie to watch. 
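
The idea can be sketched on a toy example before applying it to the real TF-IDF matrix. The item vectors and product names below are made up for illustration; the profile is taken to be the element-wise mean of the vectors of the products the customer has tried, and untried products are ranked by cosine similarity to it.

```python
import math

# Hypothetical feature vectors (e.g. TF-IDF scores over a 3-word vocabulary)
item_vectors = {
    "neck_cream": [0.9, 0.1, 0.0],
    "lip_mask": [0.0, 0.2, 0.9],
    "eye_cream": [0.8, 0.3, 0.0],
    "clay_cleanser": [0.0, 1.0, 0.0],
}
tried = ["neck_cream", "lip_mask"]  # products the customer has reviewed


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Customer profile = element-wise mean of the vectors of the products they tried
profile = [sum(col) / len(tried) for col in zip(*(item_vectors[p] for p in tried))]

# Score only the products the customer has not tried yet, highest first
scores = {
    name: cosine(profile, vec)
    for name, vec in item_vectors.items()
    if name not in tried
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Averaging is the simplest way to build a profile; weighting each product by the customer's rating would be a natural variation, though it is not what this section does.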

In [75]:
cleaned_products.set_index(cleaned_products['product_id'], inplace = True)
cleaned_reviews.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 345270 entries, 0 to 345269
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   user_rating  345270 non-null  int64 
 1   ReviewText   345270 non-null  object
 2   p_id         345270 non-null  object
 3   author_id    345270 non-null  object
dtypes: int64(1), object(3)
memory usage: 10.5+ MB


In [76]:
# build the TF-IDF matrix as a DataFrame (one row per product, one column per word)
stopwords_list = stopwords.words('english') 
# scikit-learn provides a pre-built TF-IDF vectorizer that computes the TF-IDF score of each word in 
# each document's description.
vectorizer = TfidfVectorizer(analyzer='word', stop_words=stopwords_list) # remove English stop words such 
                                                                         # as 'the', 'a'
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it
tfidf_df = pd.DataFrame(data = vectorizer.fit_transform(cleaned_products['short_description']).toarray(),
                        columns = vectorizer.get_feature_names_out(),
                        index = cleaned_products.product_id)
tfidf_df.sample(10)


product_id,000,001,01,10,100,10004,1000mg,100mg,107,11,...,zeaxanthin,zero,zeroes,zerumbet,zesty,zinc,zits,zone,zones,åland
P458759,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
P471549,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
P4050,0.0,0.0,0.0,0.0,0.0,0.30054,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
P211304,0.0,0.0,0.0,0.0,0.0,0.391596,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
P434360,0.0,0.0,0.0,0.0,0.0,0.159232,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
P452904,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
P454806,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
P456421,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
P469515,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
P443834,0.0,0.0,0.0,0.0,0.288196,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [77]:
# let's pick user 1250517772 as an example: 
list_of_product_bought = cleaned_reviews[cleaned_reviews.author_id == '1250517772']['p_id'] 
# get the TF-IDF vectors of the products this customer reviewed
customer_products = tfidf_df.reindex(list_of_product_bought)
# customer profile: column-wise mean of those vectors
customer_prof = customer_products.mean()


In [78]:
# only keep the products that the user hasn't bought
non_purchase_products = tfidf_df.drop(list_of_product_bought, axis = 0)

# similarity between the customer profile and every product not yet bought
# (linear_kernel returns dot products between the profile and each TF-IDF row)
customer_prof_similarity = linear_kernel(customer_prof.values.reshape(1,-1), non_purchase_products)

#convert this to DF
customer_prof_similarity = pd.DataFrame(data = customer_prof_similarity.T, 
                                        columns = ['similarity_score'],
                                        index = non_purchase_products.index)


In [79]:
# now sort the similarity score
sorted_similarity_score = (pd.DataFrame(data = customer_prof_similarity.sort_values(by = 'similarity_score', ascending = False),
                                        columns = ['similarity_score']))
sorted_similarity_score['product_name'] = cleaned_products.reindex(sorted_similarity_score.index)['product_name']
sorted_similarity_score['price'] = cleaned_products.reindex(sorted_similarity_score.index)['price']
sorted_similarity_score['brand'] = cleaned_products.reindex(sorted_similarity_score.index)['brand_name']
print('products purchased')
print(cleaned_products.reindex(list_of_product_bought)[['product_name', 'price', 'brand_name']])
print('Recommended products')
sorted_similarity_score.head(10)

products purchased
                                                                    product_name  \
p_id                                                                               
P474832                                 Sugar Recovery Lip Mask Advanced Therapy   
P474113       TL Advanced ™ Tightening Neck Cream PLUS for Firming & Brightening   
P474117  Mini TL Advanced ™ Tightening Neck Cream PLUS for Firming & Brightening   

         price  brand_name  
p_id                        
P474832   26.0       fresh  
P474113   95.0  StriVectin  
P474117   15.0  StriVectin  
Recommended products


product_id,similarity_score,product_name,price,brand
P474112,0.222257,Super-B Barrier Strengthening Oil with Vitamin B3 and Prebiotics,72.0,StriVectin
P474122,0.208607,Advanced Retinol Nightly Renewal Face Moisturizer with Retinol,119.0,StriVectin
P455610,0.197449,Confidence in a Neck Cream,54.0,IT Cosmetics
P461165,0.18442,GOOPGENES All-In-One Nourishing Eye Cream,55.0,goop
P454837,0.180965,Confidence in Your Beauty Sleep Night Cream,54.0,IT Cosmetics
P456194,0.179105,Extra-Firming Neck & Décolleté Cream,90.0,Clarins
P420652,0.174989,Lip Sleeping Mask,22.0,LANEIGE
P443377,0.174328,White Lucent Day Emulsion Broad Spectrum SPF 23,70.0,Shiseido
P411401,0.172967,Bye Bye Under Eye Brightening Eye Cream,49.0,IT Cosmetics
P459127,0.169074,Platinum Lip Plump SPF 30,50.0,Dr. Lara Devgan Scientific Beauty


This is a fairly good list of suggestions: it contains masks and facial creams within a similar price range, along with a brand the customer has already tried (StriVectin) and several that she/he might want to try.

## 4.3 Collaborative filtering or personalized recommendation system

The task here is essentially to predict which products a target user might like, based either on the preferences of similar users (user-based collaborative filtering) or on products they have liked in the past (item-based collaborative filtering).

Collaborative filters can further be classified into two types:

User-based filtering: these systems recommend products that similar users have liked.

Item-based filtering: this is very similar to the content-based recommendation engine in section 4.2.4, but instead of using similarity scores computed from the products' intrinsic features, these systems identify similar items based on how people have rated them in the past. For example, if customers A and B have both given 5 stars to a Clinique moisturizer and a Chanel cleanser, the model treats the two items as similar.
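
The Clinique/Chanel example can be sketched on a tiny made-up ratings table. Everything in this block (customer labels, product names, ratings) is hypothetical and only illustrates the idea; it is not the Sephora data.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy ratings: rows are customers, columns are products
toy_ratings = pd.DataFrame(
    {
        "clinique_moisturizer": [5.0, 5.0, 1.0, 2.0],
        "chanel_cleanser": [5.0, 5.0, 2.0, 1.0],
        "generic_toner": [1.0, 2.0, 5.0, 5.0],
    },
    index=["A", "B", "C", "D"],
)

# Center each customer's ratings around that customer's own mean
toy_centered = toy_ratings.sub(toy_ratings.mean(axis=1), axis=0)

# Item-item similarity: transpose so that each row is a product
toy_item_similarity = pd.DataFrame(
    cosine_similarity(toy_centered.T),
    index=toy_ratings.columns,
    columns=toy_ratings.columns,
)
print(toy_item_similarity.round(2))
```

The moisturizer and the cleanser come out as similar because the same customers rated both highly, while the toner, liked by a different group, gets a negative score against both.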

User-based or item-based? The choice depends on whether there are more users or more items, and on which similarity matrix is easier to maintain.


In [81]:
# first pivot our review data
products_name = cleaned_products[['product_id', 'unique_product_name']].reset_index(drop = True)
cleaned_reviews = cleaned_reviews.merge(products_name, how = 'left', left_on = 'p_id', right_on = 'product_id')
cleaned_reviews.drop(['product_id'], axis = 1, inplace = True)

# customers with too few reviews are filtered out after centering (see the next cell)

customer_ratings_pivot = cleaned_reviews.pivot_table(values = 'user_rating', index = 'author_id', columns = 'p_id')

This matrix is very sparse, which is a problem. We can't simply drop the NA values because that would leave us with almost nothing. We also can't replace the missing values with 0, because the algorithms could no longer distinguish products a customer hasn't tried from products they dislike.
One strategy is to center each user's ratings around 0 (subtract that user's mean rating) and then replace the missing values with 0, which gives every missing value a neutral score. This is not a perfect solution; however, since our primary purpose here is to compare users with one another, it should be sufficient.
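
To see why 0 becomes a neutral value after centering, here is a minimal sketch on two made-up customers (the names and ratings are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one generous rater and one harsh rater
toy = pd.DataFrame(
    {"p1": [5.0, 2.0], "p2": [4.0, np.nan], "p3": [np.nan, 1.0]},
    index=["generous_rater", "harsh_rater"],
)

toy_mean = toy.mean(axis=1)               # NaNs are skipped by default
toy_centered = toy.sub(toy_mean, axis=0)  # each row now averages to 0
toy_filled = toy_centered.fillna(0.0)     # missing = "no opinion"
print(toy_filled)
```

Both customers end up with +0.5 for their favourite product and -0.5 for their least favourite, even though their raw ratings were on very different levels, and the products they never reviewed sit exactly at 0.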

In [None]:
customer_average = customer_ratings_pivot.mean(axis = 1)
customer_ratings_centered_pivot = customer_ratings_pivot.sub(customer_average, axis = 0).copy()
# only keep customers that have reviewed at least 3 products
customer_ratings_centered_pivot.dropna(thresh = 3, inplace = True)
# unreviewed products get the neutral (centered) score of 0
customer_ratings_centered_pivot.fillna(0.0, inplace = True)
customer_ratings_centered_pivot.sample(10)

cleaned_products

### 4.3.1 Item-based filtering

Transposing the user-based pivot table gives us the item-based view, with one row per product.

In [None]:
product_ratings_centered_pivot = customer_ratings_centered_pivot.T
product_ratings_centered_pivot

In [None]:
# calculate pairwise similarity
from sklearn.metrics.pairwise import cosine_similarity

product_similarity_df = pd.DataFrame(data = cosine_similarity(product_ratings_centered_pivot),
                                     columns = product_ratings_centered_pivot.index,
                                     index = product_ratings_centered_pivot.index)

In [None]:
# given a product, return the top 10 similar products 
sample_prod_name = 'Hydra Zen Glow Liquid Lightweight Moisturizer with Hyaluronic Acid'
sample_prod_pid = cleaned_products[cleaned_products['product_name'] == sample_prod_name]['product_id']
rec = (cleaned_products[cleaned_products['product_id'].isin(product_similarity_df.loc[:, sample_prod_pid].squeeze().sort_values(ascending = False)[1:11].index)])
print('The most similar products to')
print(cleaned_products[cleaned_products['product_name'] == sample_prod_name][['product_name', 'brand_name', 'price', 'category']])
rec[['product_name', 'brand_name', 'price', 'category']]

This similarity score doesn't seem to work well in terms of price: it suggests products that are far off in price range.

### 4.3.2 User-based filtering

Predict how a user would rate an item even if it is not similar to anything they have seen.
KNN: find the closest users to the user in question, then average their ratings for the given item. This lets us predict how a user will feel about an item they haven't seen before.

In [None]:
# calculate user-user similarity
customer_similarity_df = pd.DataFrame(data = cosine_similarity(customer_ratings_centered_pivot), 
                                      columns= customer_ratings_centered_pivot.index, 
                                      index= customer_ratings_centered_pivot.index)   
customer_similarity_df.sample(10)

#### Manual approach

In [None]:
# given a customer and a product that this customer hasnt used, predict how he/she would rate it
sample_user_id = '944662881'
sample_product_id = 'P122900'

# find the top 3 most similar customers 
nearest_customers = customer_similarity_df.loc[sample_user_id].sort_values(ascending = False)[1:4].index

In [None]:
# Go back to our original uncentered dataset to check the mean rating these nearest customers gave our product
# of interest
customer_ratings_pivot = cleaned_reviews.pivot_table(values = 'user_rating', index = 'author_id', columns = 'p_id')
customer_ratings_pivot.dropna(thresh = 3, inplace = True) #only keep customers with at least 3 ratings
customer_ratings_pivot.loc[nearest_customers, sample_product_id].mean()

#### Using KNN

In [None]:
# Let's approach this with the built-in KNN function
# drop the product of interest from the dataframe
ratings_centered_train = customer_ratings_centered_pivot.drop(sample_product_id, axis = 1) 
ratings_train = customer_ratings_pivot.copy() 

# X_pred: get the centered rating profile for our target customer
target_customer_x = ratings_centered_train.loc[[sample_user_id]]

# y_train: the original ratings of the product given by other customers
other_customer_y = ratings_train[[sample_product_id]].copy()

# X_train: the centered ratings by other customers who HAVE REVIEWED OUR TARGET PRODUCT
# let's drop customers who haven't reviewed this product from X_train and y_train
other_customer_x = ratings_centered_train.loc[other_customer_y.notnull()[sample_product_id]]
other_customer_y.dropna(inplace = True)

In [None]:
from sklearn.neighbors import KNeighborsRegressor
# use cosine similarity as the metric
customer_knn = KNeighborsRegressor(metric = 'cosine', n_neighbors = 5)
customer_knn.fit(other_customer_x, other_customer_y)
customer_knn.predict(target_customer_x)

In [None]:
# we can then examine different numbers of neighbors
# and use RMSE to measure the performance of KNN
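
One way this comparison could be set up is sketched below. It is only an illustration on randomly generated data (`toy_x`, `toy_y` and the candidate values of k are assumptions, not the notebook's variables), using scikit-learn's cross-validation with RMSE as the score.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical stand-ins for other_customer_x / other_customer_y
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(60, 8))                    # centered rating profiles
toy_y = rng.integers(1, 6, size=60).astype(float)   # ratings from 1 to 5

rmse_by_k = {}
for k in (1, 3, 5, 10):
    knn = KNeighborsRegressor(metric="cosine", n_neighbors=k)
    # scikit-learn reports negated RMSE so that higher is better
    scores = cross_val_score(
        knn, toy_x, toy_y, cv=5, scoring="neg_root_mean_squared_error"
    )
    rmse_by_k[k] = -scores.mean()

best_k = min(rmse_by_k, key=rmse_by_k.get)
print(rmse_by_k)
print("lowest cross-validated RMSE at k =", best_k)
```

On the real data, the same loop would use the customers who have reviewed the target product as training rows, so the largest usable k is limited by how many such customers there are.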


#### Recap
Item-based recommendations: predict how a user would rate an item by looking at how that user rated other similar items.
- Pros: item-based recommendations are more consistent over time (price, category and product descriptions are likely to remain the same).
- Pros: they are easier to explain, since telling a user that we recommend something similar to what they have enjoyed in the past makes intuitive sense.
- Pros: they are easier to deploy, since every online store has a finite number of items and their similarity scores can be pre-computed.
- Cons: the recommendations can be pretty obvious, for example recommending a cleanser from the same line and brand after the customer reviewed the moisturizer.

User-based recommendations: predict how a user would rate an item they haven't seen before by looking at similar users and their average rating for the item of interest.
- Pros: they can produce much more interesting suggestions (for example, less popular items that would not have come up from the item-based recommendation).
- Cons: they are less suitable when more conservative recommendations are preferred.
- Cons: they are more difficult to explain to customers (a product was recommended because somebody they don't know liked it).
- Cons: they are harder to maintain, because a user's preferences might change over time and we have to accommodate the growing customer profiles.
- Cons: new users won't benefit as much as old users.


#### Problem with sparsity -  SVD
Only 1.2% of the rating dataframe is filled. This is a big problem for KNN: with so few overlapping ratings, the similarity between customers cannot be measured reliably. See the example above, where it predicted a rating of 5.

We can leverage matrix factorization in this case: the idea is to decompose the original matrix into a product of smaller matrices.

Every user has given at least 1 rating and every product has been reviewed at least once

For example, using SVD (Singular Value Decomposition): user_rating (m x n) ≈ (m x rank)\*(rank x n)
m: number of customers
n: number of products
rank: number of latent features, which could represent hidden product characteristics and each customer's preference for them
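
The shapes in this decomposition can be checked on a small synthetic matrix. The sketch below uses made-up sizes (`m = 8`, `n = 6`, `rank = 2`) and folds the singular values into the customer-side factor, which is one common convention and an assumption here.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
m, n, rank = 8, 6, 2

# Hypothetical matrix built to have exactly `rank` latent features
toy_matrix = rng.normal(size=(m, rank)) @ rng.normal(size=(rank, n))

# Truncated SVD keeping `rank` singular values
U, sigma, Vt = svds(toy_matrix, k=rank)

user_factors = U @ np.diag(sigma)   # (m x rank): customers in latent space
item_factors = Vt                   # (rank x n): products in latent space
approx = user_factors @ item_factors

print(user_factors.shape, item_factors.shape)
print("max reconstruction error:", np.abs(approx - toy_matrix).max())
```

Because the toy matrix really has only two latent features, two singular values are enough to rebuild it. A real ratings matrix is only approximately low-rank, so the reconstruction there is an estimate rather than an exact copy.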


In [None]:
print("fraction of null values")
customer_ratings_pivot.isnull().values.sum()/customer_ratings_pivot.size

In [None]:
# checking the dimensions of the centered and original rating data
print(customer_ratings_pivot.shape)
print(customer_ratings_centered_pivot.shape)

In [None]:
customer_ratings_centered_pivot

In [None]:
from scipy.sparse.linalg import svds
# svds needs a numeric array; k is the number of latent features (scipy's default is 6)
U, sigma, Vt = svds(customer_ratings_centered_pivot.to_numpy(), k = 6)
# convert sigma from a 1-D array of singular values to a diagonal matrix
sigma = np.diag(sigma)

# check the dimensions
print(U.shape)
print(sigma.shape)
print(Vt.shape)

In [None]:
svds_ratings = np.dot(np.dot(U, sigma), Vt)
# now we need to transform these centered values back to the original rating scales
svds_ratings = svds_ratings + customer_ratings_pivot.mean(axis = 1).values.reshape(-1,1)
svds_ratings

### 4.3.3 Validating our predictions using RMSE (used in the Netflix Prize competition)
Challenges: every customer reviewed a different number of products and each product received reviews from a different set of customers, so we can't split off a holdout dataset by rows the way we usually do.
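
One possible way around this is to hold out individual ratings rather than whole rows: hide a random subset of the known ratings, factorize what remains, and score only the hidden entries. The sketch below does this on randomly generated data; the matrix size, the 20% holdout fraction and `k = 3` are all assumptions for illustration, not values from this project.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)

# Hypothetical ratings matrix (customers x products) with about 40% missing
toy_ratings = rng.integers(1, 6, size=(30, 12)).astype(float)
toy_ratings[rng.random(toy_ratings.shape) < 0.4] = np.nan

# Hold out roughly 20% of the observed ratings, entry by entry
observed = ~np.isnan(toy_ratings)
holdout = observed & (rng.random(toy_ratings.shape) < 0.2)
train = toy_ratings.copy()
train[holdout] = np.nan

# Center on each customer's training mean; fall back to the global mean
# for a customer left with no training ratings
counts = (~np.isnan(train)).sum(axis=1, keepdims=True)
sums = np.nansum(train, axis=1, keepdims=True)
user_mean = np.where(counts > 0, sums / np.maximum(counts, 1), np.nanmean(train))
centered = np.where(np.isnan(train), 0.0, train - user_mean)

# Factorize the training data only, then move back to the rating scale
U, sigma, Vt = svds(centered, k=3)
predicted = U @ np.diag(sigma) @ Vt + user_mean

# RMSE on the held-out ratings, which the factorization never saw
rmse = np.sqrt(np.mean((toy_ratings[holdout] - predicted[holdout]) ** 2))
print("held-out ratings:", int(holdout.sum()), "RMSE:", round(float(rmse), 3))
```

The important detail is the order of operations: the ratings are hidden before centering and factorizing, so the score reflects predictions for ratings the model has not seen.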

Common metrics:
Precision/Recall/F-Score
ROC Curves
Cost Curves

In [None]:
# Separate a holdout block: the ratings of the first 2000 customers for the first 40 products
actual_ratings = customer_ratings_pivot.iloc[:2000, :40].copy().values

# blank out the holdout block; the ratings would then need to be re-centered and the SVD refit
# for these to be true out-of-sample predictions (svds_ratings above was computed before masking)
customer_ratings_pivot.iloc[:2000, :40] = np.nan

# predicted ratings
predicted_ratings = svds_ratings[:2000, :40]

In [None]:
from sklearn.metrics import mean_squared_error

# we only want to compare non-missing ratings
mask = ~np.isnan(actual_ratings)
# RMSE; np.sqrt avoids the `squared` argument, which newer scikit-learn versions no longer accept
np.sqrt(mean_squared_error(actual_ratings[mask], predicted_ratings[mask]))

In [None]:
customer_ratings_pivot.iloc[:2000, :40].values

In [None]:
np.unique(actual_ratings)

#### Hybrid System:

#### 4.3.3.2 ROC Curves

#### 4.3.3.3 Cost Curves