In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.sparse as sparse
from sklearn.preprocessing import MaxAbsScaler, RobustScaler, StandardScaler
import random
import implicit
from scipy.sparse.linalg import spsolve
from pandas.api.types import CategoricalDtype


In [2]:
full_df = pd.read_csv('../dataset/cleaned/combined_cleansed.csv')

## Collaborative Filtering

In [3]:
collaborative_df = full_df.groupby(['user_id', 'product_name', 'product_id'])['product_id'].agg('count').to_frame('purchase_count').reset_index()

In [4]:
# get a list of unique users
users = list(np.sort(collaborative_df['user_id'].unique()))
# get a list of unique products
products = list(collaborative_df['product_id'].unique())
# get a list of purchase count
purchase_count = list(collaborative_df['purchase_count'])

# get the column indices (one column per user)
cols = collaborative_df['user_id'].astype(CategoricalDtype(categories = users)).cat.codes
# get the row indices (one row per product)
rows = collaborative_df['product_id'].astype(CategoricalDtype(categories = products)).cat.codes

collaborative_sparse = sparse.csr_matrix((purchase_count, (rows, cols)), shape = (len(products), len(users)))

# purchases_sparse = sparse.csr_matrix((quantity, (rows, cols)), shape=(len(customers), len(products)))

In [5]:
collaborative_sparse

<48422x206209 sparse matrix of type '<class 'numpy.int32'>'
	with 13266179 stored elements in Compressed Sparse Row format>

We have 206,209 customers and 48,422 items. Let's check our sparsity.

## Scaling

In [6]:
mas = MaxAbsScaler()
collaborative_sparse = mas.fit_transform(collaborative_sparse)

Centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales.

MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data, and are the recommended way to go about this.
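
To make the point about sparsity concrete, here is a small sketch on a made-up 3 x 2 count matrix (the values are illustrative, not taken from the dataset). It shows that `MaxAbsScaler` divides each column by its largest absolute value, so zeros stay zero and the number of stored elements does not change.

```python
import numpy as np
import scipy.sparse as sparse
from sklearn.preprocessing import MaxAbsScaler

# Toy item-by-user count matrix (3 items x 2 users); values are made up.
counts = sparse.csr_matrix(np.array([[4, 0],
                                     [0, 10],
                                     [2, 5]], dtype=float))

scaled = MaxAbsScaler().fit_transform(counts)

# Each column is divided by its own maximum absolute value,
# so every entry ends up in [-1, 1] and zeros are left untouched.
print(scaled.toarray())
print(counts.nnz, scaled.nnz)
```

Because the scaler works column by column, in a matrix with one column per user each user's counts are divided by that user's largest count.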

In [7]:
# get the number of possible interactions
matrix_size = collaborative_sparse.shape[0] * collaborative_sparse.shape[1]

# number of actual interactions
purchase_num = len(collaborative_sparse.nonzero()[0])
print('Sparcity is {}'.format(100*(1-(purchase_num/matrix_size))))

Sparcity is 99.86713961292403


For collaborative filtering to work well, the sparsity should be at most about 99.5%. Ours is about 99.87%, roughly 0.37 percentage points above that, which may affect our results.

## Alternating Least Squares

In [8]:
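model = implicit.als.AlternatingLeastSquares(factors=100, regularization=0.01, iterations=30)
# multiply the scaled purchase counts by alpha to use them as confidence weights
alpha_val = 80
data_conf = (collaborative_sparse * alpha_val).astype('double')
# data_conf is item x user; fit on it, then keep the user x item transpose for recommend()
model.fit(data_conf)
user_items = data_conf.T.tocsr()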







In [9]:
def get_recommendations(df, model, fitted, user):
    recommendations = model.recommend(user, fitted, filter_already_liked_items = True)
    product_dict = dict(zip(df.product_id, df.product_name))
    
    print('Recommended items for user {} are: \n'.format(user))
    for i in recommendations:
        print(i[0], product_dict.get(i[0]))
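
A note on IDs: `model.recommend` works with positions in the sparse matrix (the category codes), not with the original `product_id` values, so a lookup keyed by `product_id` can print the wrong name or `None`. The sketch below uses a made-up toy frame to show how positions built from an explicit category list can be translated back to the original IDs. It assumes the codes were created with `CategoricalDtype(categories=products)`; the toy IDs are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical toy data; the IDs are made up for illustration.
toy = pd.DataFrame({'product_id': [300, 100, 300, 200]})
products = list(toy['product_id'].unique())   # order of first appearance

# Codes are positions in `products`, not the IDs themselves.
codes = toy['product_id'].astype(CategoricalDtype(categories=products)).cat.codes

# Translate matrix positions back to the original IDs.
recovered = [products[c] for c in codes]
print(list(codes), recovered)
```

The same lookup (`products[index]`, then the name for that ID) could be applied to each index returned by the recommender before printing.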

In [10]:
get_recommendations(collaborative_df, model, user_items, 1)

Recommended items for user 1 are: 

13517 Whole Wheat Bread
25173 Glucosamine + Chondroitin
20493 All Natural Chocolate Hemp
45715 Veggie & Fruit Pops, Meteorite Mango
41186 Brussels Sprouts, Petite
27492 Homestyle Mayonnaise
26853 Complete Wheat 100% Whole Wheat Bread
24217 Golden Oat Breakfast Biscuits
32943 Soybeans in Pods Edamame
13035 Fun Size Candy Snack 6 Count


In [11]:
get_recommendations(collaborative_df, model, user_items, 8)

Recommended items for user 8 are: 

36230 Tofu Ravioli Verde
5421 Peach 0% Fat Oikos Greek Yogurt
43623 Shaved Blend Parmesian Cheese
21471 Salisbury Steak Dinner
36528 Organic Brown Rice Lasagne Pasta
44470 Backyard Variety Hard Strawberry Lemonade Hard Lemonade Hard Cranberry & Passion Fruit Lemonade Hard Tropical Mango Lemonade
44878 Ndimaini
45400 Multi-Surface Sunflower Scent Everyday Cleaner
426 2nd Foods Bananas
38288 Pizza Sauce


## Bayesian Personalized Ranking

In [12]:
model2 = implicit.bpr.BayesianPersonalizedRanking(factors=100, regularization=0.01, iterations=30)
alpha_val = 80
data_conf2 = (collaborative_sparse * alpha_val).astype('double')
model2.fit(data_conf2)
user_items2 = data_conf2.T.tocsr()





In [13]:
get_recommendations(collaborative_df, model2, user_items2, 1)

Recommended items for user 1 are: 

18441 Organic Ketchup
1685 Clean Care 1-Ply Double Rolls Toilet Paper
1427 Berries GoMega Smoothie Blend
17276 Honey Chipotle Salmon
12921 Cream Style Corn Golden Sweet
32533 Original BBQ Baby Back Pork Ribs
45888 Homestyle Thick & Hearty Traditional Pasta Sauce
35945 L.A. Natural Styling Gel
27745 Radiant Super Unscented Tampons
10571 Seasoning Adobo Seasoning


In [14]:
get_recommendations(collaborative_df, model2, user_items2, 8)

Recommended items for user 8 are: 

32004 Pasta Sauce, Marinara
15925 Macaroni and Cheese
921 Black Salt Caramel Dark Chocolate Bar
45799 Real Bacon Pieces 50% Less Fat
10663 Advanced Enzyme System Rapid Release Formula
17706 Organic Whole Grain Wheat English Muffins
41993 Cough+Chest Congestion Dm Non Drowsy Syrup
38567 Hearty French Country Vegetable Soup
40729 Apple Slice Scented Soy Candle
32833 Plain Nonfat Yogurt


## Logistic Matrix Factorization

In [15]:
model3 = implicit.lmf.LogisticMatrixFactorization(factors=100, learning_rate = 0.1, regularization=0.01, iterations=30)
alpha_val = 80
data_conf3 = (collaborative_sparse * alpha_val).astype('double')
model3.fit(data_conf3)
user_items3 = data_conf3.T.tocsr()





In [16]:
get_recommendations(collaborative_df, model3, user_items3, 1)

Recommended items for user 1 are: 

42991 Kat Kit Disposable Tray
16894 Full Bodied Flavor Extra Virgin Organic Olive Oil
34709 Concentrated Deep Reach Fogger Insecticide
44644 Lotion with Aloe & Lanolin Hair Remover
46717 40 Flavors Jelly Beans
17412 Mini Lemon Bites
10061 Meatloaf with Portobello Mushroom Gravy
8747 Bathroom Cleaner with Lemon Scent
43121 Gourmet Natural Original Magic Seasoning
18837 24 hour Allergy Relief Gelcaps for Adults


In [17]:
get_recommendations(collaborative_df, model3, user_items3, 8)

Recommended items for user 8 are: 

39732 Pitted Kalamata Olives
39042 None
8293 Lindor Milk Chocolate Truffles Pieces
46337 Complete Iron System Mini Tabs
16139 Clara
35319 Paloma De Colores Mixed Popcorn
25654 Mint Dark Chocolate Candy
44015 Classic Citrus Kombucha
2321 Orange Halo
42569 Cinnamon Bun Sandwich Cookies


## Raw

# test our recommender

## Testing our recommender system

This function will take in the original user-item matrix and "mask" a percentage of the original ratings where a
user-item interaction has taken place for use as a test set. The test set will contain all of the original ratings, 
while the training set replaces the specified percentage of them with a zero in the original ratings matrix. 

parameters: 

ratings - the original ratings matrix from which you want to generate a train/test set. Test is just a complete
copy of the original set. This is in the form of a sparse csr_matrix. 

pct_test - The percentage of user-item interactions where an interaction took place that you want to mask in the 
training set for later comparison to the test set, which contains all of the original ratings. 

returns:

training_set - The altered version of the original data with a certain percentage of the user-item pairs 
that originally had interaction set back to zero.

test_set - A copy of the original ratings matrix, unaltered, so it can be used to see how the rank order 
compares with the actual interactions.

user_inds - From the randomly selected user-item indices, which user rows were altered in the training data.
This will be necessary later when evaluating the performance via AUC.
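
The function itself is not shown here, so the following is only a minimal sketch of one that matches the description above. The name `make_train`, the default `pct_test=0.2` and the `seed` argument are assumptions, and the matrix is assumed to have one row per user and one column per item.

```python
import random
import numpy as np
import scipy.sparse as sparse

def make_train(ratings, pct_test=0.2, seed=0):
    """Sketch: hide a fraction of the nonzero entries of `ratings`."""
    test_set = ratings.copy().tocsr()          # unaltered copy of the input
    training_set = ratings.copy().tolil()      # LIL makes element assignment cheap

    # All (user, item) pairs where an interaction took place
    rows, cols = ratings.nonzero()
    pairs = list(zip(rows, cols))

    # Randomly pick the pairs to mask, without replacement
    rng = random.Random(seed)
    n_masked = int(np.ceil(pct_test * len(pairs)))
    masked = rng.sample(pairs, n_masked)
    m_rows = [r for r, _ in masked]
    m_cols = [c for _, c in masked]

    # Reset the chosen interactions to zero in the training copy only
    training_set[m_rows, m_cols] = 0
    training_set = training_set.tocsr()
    training_set.eliminate_zeros()

    user_inds = sorted(set(int(r) for r in m_rows))
    return training_set, test_set, user_inds

# Toy usage on a made-up 3 x 4 matrix
ratings = sparse.csr_matrix(np.array([[1, 0, 2, 0],
                                      [0, 3, 0, 1],
                                      [4, 0, 0, 5]], dtype=float))
train, test, user_inds = make_train(ratings, pct_test=0.5)
```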

This simple function will output the area under the curve using sklearn's metrics. 

parameters:

- predictions: your prediction output

- test: the actual target result you are comparing to

returns:

- AUC (area under the Receiver Operating Characteristic curve)
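
A helper matching this description could look like the sketch below. The name `auc_score` is an assumption; `roc_curve` and `auc` are the functions from `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics

def auc_score(predictions, test):
    """Sketch: area under the ROC curve for scores vs. binary labels."""
    fpr, tpr, _ = metrics.roc_curve(test, predictions)
    return metrics.auc(fpr, tpr)

# Toy usage: binary labels and made-up prediction scores
labels = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
print(auc_score(scores, labels))
```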

This function will calculate the mean AUC by user for any user that had their user-item matrix altered. 

parameters:

training_set - The training set resulting from make_train, where a certain percentage of the original
user/item interactions are reset to zero to hide them from the model 

predictions - The matrix of your predicted ratings for each user/item pair as output from the implicit MF.
These should be stored in a list, with user vectors as item zero and item vectors as item one. 

altered_users - The indices of the users where at least one user/item pair was altered from make_train function

test_set - The test set constructed earlier from make_train function

returns:

The mean AUC (area under the Receiver Operating Characteristic curve) of the test set only on user-item interactions
that were originally zero to test ranking ability, in addition to the most popular items as a benchmark.
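
As a rough illustration of the description above, here is a sketch under a few assumptions: the name `calc_mean_auc` and the argument order are guesses, the user and item vectors are dense NumPy arrays, matrices have one row per user, and users whose held-out labels contain only one class are skipped because AUC is undefined for them.

```python
import numpy as np
import scipy.sparse as sparse
from sklearn import metrics

def calc_mean_auc(training_set, altered_users, predictions, test_set):
    """Sketch: mean per-user AUC for the model and a popularity baseline."""
    user_vecs, item_vecs = predictions
    # Popularity benchmark: total interactions per item in the training data
    pop_items = np.asarray(training_set.sum(axis=0)).ravel()

    model_auc, pop_auc = [], []
    for user in altered_users:
        train_row = training_set[user].toarray().ravel()
        unseen = np.where(train_row == 0)[0]        # hidden or never purchased
        actual = (test_set[user].toarray().ravel()[unseen] > 0).astype(int)
        if actual.min() == actual.max():
            continue                                # only one class, AUC undefined
        pred = user_vecs[user] @ item_vecs.T        # predicted score per item
        model_auc.append(metrics.roc_auc_score(actual, pred[unseen]))
        pop_auc.append(metrics.roc_auc_score(actual, pop_items[unseen]))
    return float(np.mean(model_auc)), float(np.mean(pop_auc))

# Toy usage with made-up data: two users, four items, one hidden purchase each
test = sparse.csr_matrix(np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float))
train = sparse.csr_matrix(np.array([[1, 0, 0, 0], [0, 0, 1, 0]], dtype=float))
user_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
item_vecs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
model_auc, pop_auc = calc_mean_auc(train, [0, 1], [user_vecs, item_vecs], test)
```

Comparing the two returned values indicates whether the factorization ranks hidden purchases better than simply recommending the most popular items.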


## Backup

Implicit weighted ALS taken from Hu, Koren, and Volinsky 2008. Designed for alternating least squares and implicit
feedback based collaborative filtering. 

parameters:

training_set - Our matrix of ratings with shape m x n, where m is the number of users and n is the number of items.
Should be a sparse csr matrix to save space. 

lambda_val - Used for regularization during alternating least squares. Increasing this value may increase bias
but decrease variance. Default is 0.1. 

alpha - The parameter associated with the confidence matrix discussed in the paper, where Cui = 1 + alpha*Rui. 
The paper found a default of 40 most effective. Decreasing this will decrease the variability in confidence between
various ratings.

iterations - The number of times to alternate between both user feature vector and item feature vector in
alternating least squares. More iterations will allow better convergence at the cost of increased computation. 
The authors found 10 iterations was sufficient, but more may be required to converge. 

rank_size - The number of latent features in the user/item feature vectors. The paper recommends varying this 
between 20-200. Increasing the number of features may overfit but could reduce bias. 

seed - Set the seed for reproducible results

returns:

The feature vectors for users and items. The dot product of these feature vectors should give you the expected 
"rating" at each point in your original matrix. 
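
The implementation is not included here, so the following is only a compact sketch of the alternating updates described above. The function name `implicit_weighted_als` is an assumption, and the matrix is converted to a dense array for readability, which is only practical for toy sizes, not for the full purchase matrix.

```python
import numpy as np
import scipy.sparse as sparse

def implicit_weighted_als(training_set, lambda_val=0.1, alpha=40,
                          iterations=10, rank_size=20, seed=0):
    """Sketch of implicit-feedback ALS with confidence Cui = 1 + alpha*Rui."""
    rng = np.random.RandomState(seed)
    R = training_set.toarray().astype(float)     # dense: toy sizes only
    n_users, n_items = R.shape
    C = 1.0 + alpha * R                          # confidence matrix
    P = (R > 0).astype(float)                    # binary preference matrix
    X = rng.normal(scale=0.01, size=(n_users, rank_size))   # user vectors
    Y = rng.normal(scale=0.01, size=(n_items, rank_size))   # item vectors
    reg = lambda_val * np.eye(rank_size)

    for _ in range(iterations):
        # Fix item vectors, solve a regularized least squares per user
        for u in range(n_users):
            Yc = Y * C[u][:, None]               # rows of Y weighted by confidence
            X[u] = np.linalg.solve(Y.T @ Yc + reg, Yc.T @ P[u])
        # Fix user vectors, solve a regularized least squares per item
        for i in range(n_items):
            Xc = X * C[:, i][:, None]
            Y[i] = np.linalg.solve(X.T @ Xc + reg, Xc.T @ P[:, i])
    return X, Y

# Toy usage on a made-up 4 x 4 matrix with two groups of users and items
toy = sparse.csr_matrix(np.array([[5, 3, 0, 0],
                                  [4, 0, 0, 0],
                                  [0, 0, 2, 4],
                                  [0, 0, 5, 1]], dtype=float))
user_vecs, item_vecs = implicit_weighted_als(toy, rank_size=2, iterations=20)
scores = user_vecs @ item_vecs.T                 # predicted preference per pair
```

For real data, a sparse solver or the `implicit` library used earlier in this notebook is the practical choice; the dense loops here are only meant to show the structure of the two alternating steps.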