## Comparing Recommendation Systems

**Developer: Mayana Mohsin Khan**

**Packages used:**
- pandas
- numpy
- implicit
- surprise
- torch
- sklearn
- tqdm
- scipy

### In this notebook, we compare the following approaches for building a user-item recommender system:
- surprise
- neural network
- implicit

#### Before we begin, let's start with some theory on collaborative filtering.

#### What is Collaborative Filtering? 
In collaborative filtering, the recommender matches users with similar interests and predicts recommendations based on this matching.
For example, consider two users, user 1 and user 2. User 1 has rated item 300 as 1, item 24 as 0, and item 98 as 1. Now assume that user 2 has rated item 300 as 1. Because the two users agree on item 300, the recommender treats them as similar; it does not recommend item 24 to user 2, since user 1 did not interact with it. This is an example of a user-item recommender system based on user similarity. Collaborative filtering can also be applied to item-user interactions, using item-item similarity.

**User-based filtering**: In user-based filtering, as in the example above, users 1 and 2 both interact with item 300, so they have a similarity of 1. A third user who rated only item 24 shows no similarity to user 1.

|user_id / item_id| 300 | 24 | 98 | similarity |
| --- | --- | --- | --- |  --- |
| 1 | 1 |    | 1 |   |
| 2 | 1 |    | 1 | 1 | 
| 3 |    |  1  |    | NA |

**Item-based filtering**: In item-based filtering, the recommender compares items by the users who have interacted with them and recommends to a user the items most similar to those the user has already interacted with.

|user_id / item_id| 300 | 24 | 98 | 
| --- | --- | --- | --- |
| 1 | 1 |    | 1 | 
| 2 | 1 |    |  |  
| 3 |    |  1  |    |
| **similarity** | 1 | NA  |   |


The similarity is typically calculated using `pearson` or `cosine` similarity, depending on the implementation.
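
As a minimal sketch of the cosine variant, the snippet below computes the similarity between users on a small interaction matrix that mirrors the user-based table above. The matrix values and the helper name `cosine_similarity` are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

# Rows are users 1, 2, 3; columns are items 300, 24, 98.
# 1 = interacted, 0 = no interaction (illustrative values only).
interactions = np.array([
    [1, 0, 1],  # user 1
    [1, 0, 1],  # user 2
    [0, 1, 0],  # user 3
], dtype=float)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0

sim_1_2 = cosine_similarity(interactions[0], interactions[1])  # identical rows
sim_1_3 = cosine_similarity(interactions[0], interactions[2])  # no shared items
print(sim_1_2, sim_1_3)
```

Users with identical interaction rows get a similarity close to 1, while users with no items in common get 0.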


### Loading the packages

In [1]:
import numpy as np # Manipulating the data
import pandas as pd # Manipulating the data
import implicit # using the ALS model
from sklearn.preprocessing import MinMaxScaler # scaler to normalize the values
from sklearn import metrics # to obtain accuracy
import scipy.sparse as sparse # create sparse matrices
from tqdm import tqdm # to prettify the wait time and reduce anxiety :)

In [2]:
# using the Surprise library for the SVD, SVD++ and NMF models
from surprise import Reader, Dataset # get the reader and dataset builder
from surprise import SVD, NMF, SVDpp
from surprise.model_selection import cross_validate, train_test_split # Perform cross-validation and train/test split
from surprise.model_selection import GridSearchCV

### Loading the Dataset

In [3]:
train_df = pd.read_csv("train_data.csv") # Loading training data
test_df = pd.read_csv("test_data.csv") # Loading testing Data
valid_df = pd.read_csv("validation_data.csv") # Loading Validation Data
user_fea_df = pd.read_csv("user_fea.csv") # Loading user features data
item_fea_df = pd.read_csv("item_fea.csv") # Loading item Features

# Recommender system with Surprise

### Building the recommender systems using Surprise algorithms

We compare the RMSE and MAE values of different models to select the best model for making recommendations.

## Building Classifiers

Using Training data to build classifier models.

**Models used:**
- SVD
- SVD++
- NMF

In [6]:
# Creating a Surprise reader object
reader = Reader(rating_scale=(0,1)) 

In [7]:
# Loading training data into surprise dataset with surprise reader
data = Dataset.load_from_df(train_df[['user_id', 'item_id', 'rating']], reader)

In [8]:
# Dictionary to store the models
models = {'SVD':SVD(), 
          'SVD++':SVDpp(),
          'NMF':NMF()}

# Create and evaluate the models
for model_name, model in models.items():
    print(model_name) # print model name
    algo = model # select the model
    cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True) # perform 5-fold CV to get the measures
    print('-'*80)

SVD
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.0676  0.0678  0.0691  0.0672  0.0675  0.0678  0.0007  
MAE (testset)     0.0380  0.0386  0.0392  0.0381  0.0377  0.0383  0.0005  
Fit time          3.88    5.10    4.63    4.18    3.37    4.23    0.60    
Test time         0.15    0.11    0.17    0.19    0.13    0.15    0.03    
--------------------------------------------------------------------------------
SVD++
Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.0339  0.0332  0.0339  0.0336  0.0343  0.0338  0.0004  
MAE (testset)     0.0190  0.0186  0.0188  0.0187  0.0191  0.0188  0.0002  
Fit time          28.00   16.99   17.63   21.38   30.50   22.90   5.45    
Test time         0.61    0.40    0.35    0.39    0.74    0.50    0.15    
-------------------------------------------------

**Comparing RMSE and MAE:**
- SVD++ has the lowest RMSE.
- SVD++ has the lowest MAE.

We choose the model with the lowest RMSE, SVD++, to make predictions.
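
For reference, the two error measures can be sketched in a few lines of NumPy. The function names and the toy ratings below are assumptions for illustration; they are not the values produced by the models above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more strongly."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the error, ignoring its sign."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy ratings on the same 0-1 scale used by the reader above.
actual = [1.0, 1.0, 0.0, 1.0]
predicted = [0.9, 1.0, 0.2, 0.7]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(rmse_value, mae_value)
```

Lower is better for both measures, which is why the model with the lowest RMSE is selected.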

### Build the model

By default, a recommender system built with the SVD++ algorithm uses the library's default parameters, so the best parameters for the task should be searched for.

The SVD++ model can be tuned with GridSearchCV to obtain the best parameters.
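
The idea behind a grid search can be sketched without any library: try every combination of candidate values and keep the one with the lowest validation error. In the sketch below the parameter names, the candidate values, and the `validation_error` function are all hypothetical stand-ins; in practice the score would come from cross-validating the real model.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only.
param_grid = {"n_factors": [10, 20, 50], "lr": [0.005, 0.01], "reg": [0.02, 0.1]}

def validation_error(params):
    """Toy stand-in for a cross-validated RMSE; replace with a real evaluation."""
    return (abs(params["n_factors"] - 20) / 100
            + abs(params["lr"] - 0.01)
            + abs(params["reg"] - 0.02))

def grid_search(grid, score_fn):
    """Evaluate every combination in the grid and return the lowest-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = grid_search(param_grid, validation_error)
print(best_params, best_score)
```

The cost grows with the product of the list lengths (here 3 x 2 x 2 = 12 evaluations), so the grid is usually kept small for slow models such as SVD++.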


In [8]:
algo = SVDpp() # Create the SVD++ model

<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x2c55a93c550>

In [None]:
training = data.build_full_trainset() # Build the full training set from all the training data
algo.fit(training) # Fitting the model

### Predict using test dataset

In [9]:
# Prediction on testing data
test_pred_df = test_df.apply(lambda x: algo.predict(uid=x.user_id,iid = x.item_id,r_ui=1), axis=1)

0           (0, 2158, 1, 1, {'was_impossible': False})
1           (0, 2113, 1, 1, {'was_impossible': False})
2    (0, 2070, 1, 0.9840006061147656, {'was_impossi...
3    (0, 2026, 1, 0.923516912994206, {'was_impossib...
4           (0, 1948, 1, 1, {'was_impossible': False})
dtype: object

In [45]:
# Creating a dataframe of the predicted ratings on the testing dataset
predicted_df = pd.DataFrame({'user_id':test_pred_df.apply(lambda tup: tup[0]), # Extracting users
                             'item_id':test_pred_df.apply(lambda tup: tup[1]), # Extracting items
                             'rating':test_pred_df.apply(lambda tup: tup[3])}) # Extracting ratings
predicted_df.head()

Unnamed: 0,user_id,item_id,rating
0,0,2158,1.0
1,0,2113,1.0
2,0,2070,0.984001
3,0,2026,0.923517
4,0,1948,1.0


- Group the predictions by user_id.
- Sort the values in descending order of predicted rating.
- Extract the top 10 items for every user based on the predicted rating.

In [46]:
# Group by user and keep the 10 highest-rated items per user
grouped_df = predicted_df.groupby(['user_id']).apply(lambda x: x.sort_values('rating', ascending=False).nlargest(10,'rating'))
grouped_df.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,user_id,item_id,rating
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,0,0,2158,1.0
0,18,0,1740,1.0
0,76,0,558,1.0
0,75,0,569,1.0
0,67,0,893,1.0
0,63,0,950,1.0
0,59,0,1025,1.0
0,31,0,1455,1.0
0,55,0,1072,1.0
0,35,0,1407,1.0
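
A possible alternative for the top-N step, shown here on a toy frame, is to sort once and take the first rows of each group. The helper name `top_n_per_user` and the data are assumptions for illustration; this variant returns a flat frame without the extra index level seen in the output above.

```python
import pandas as pd

# Toy predictions: two users, three candidate items each (illustrative values).
predicted = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1, 1],
    "item_id": [10, 11, 12, 10, 13, 14],
    "rating":  [0.2, 0.9, 0.5, 0.7, 0.1, 0.8],
})

def top_n_per_user(df, n):
    """Return the n highest-rated rows per user, best first."""
    ranked = df.sort_values(["user_id", "rating"], ascending=[True, False])
    return ranked.groupby("user_id").head(n).reset_index(drop=True)

top2 = top_n_per_user(predicted, n=2)
print(top2)
```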


In [None]:
grouped_df[['user_id','item_id']].to_csv('test_SVDpp.csv', index=False)

On submission to Kaggle, this model gives a very low score on the NDCG evaluation.
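
NDCG (normalized discounted cumulative gain) rewards placing relevant items near the top of the ranked list. The sketch below uses one common formulation with a log2 discount and assumes the ideal ranking is a reordering of the same list; the exact variant used by the Kaggle evaluation is not stated in this notebook, so treat it as an approximation.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(2), log2(3), ...
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of each recommended item, in the order it was recommended.
ranked_relevance = [0, 1, 1, 0, 0]
score = ndcg_at_k(ranked_relevance, k=5)
perfect = ndcg_at_k([1, 1, 0, 0, 0], k=5)
print(score, perfect)
```

A ranking that puts every relevant item first scores 1.0; pushing relevant items further down the list lowers the score.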

# Recommender System with Neural Networks

- Using PyTorch to build the recommender system

In [None]:
# Using PyTorch
import torch # import torch
import torch.nn as nn # import the neural network module from PyTorch
import torch.nn.functional as F # import the functional API (loss functions, activations)

### Auxiliary functions

In [1]:
# Define a function to perform model training
def train_epocs(model, epochs=10, lr=0.01, wd=0.0, unsqueeze=False):
    # Use the Adam optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
    # put the model in training mode
    model.train()
    # for each epoch (full-batch training on all of train_df)
    for i in range(epochs):
        users = torch.LongTensor(train_df.user_id.values) # .cuda() # convert users to tensors
        items = torch.LongTensor(train_df.item_id.values) #.cuda() # convert items to tensors
        ratings = torch.FloatTensor(train_df.rating.values) #.cuda() # convert ratings to tensors
        # add a trailing dimension to the ratings if the model outputs shape (N, 1)
        if unsqueeze:
            ratings = ratings.unsqueeze(1)
        y_hat = model(users, items) # predict the training ratings
        loss = F.mse_loss(y_hat, ratings) # calculate the MSE loss
        optimizer.zero_grad() # reset the gradients
        loss.backward() # perform backpropagation
        optimizer.step() # update the parameters
        print(loss.item()) # print the loss
    test_loss(model, unsqueeze) # call the test_loss function

In [2]:
# Function to evaluate the model on the validation set
def test_loss(model, unsqueeze=False):
    model.eval() # put the model in evaluation mode
    users = torch.LongTensor(valid_df.user_id.values) #.cuda() # convert users to tensors
    items = torch.LongTensor(valid_df.item_id.values) #.cuda() # convert items to tensors
    ratings = torch.FloatTensor(valid_df.rating.values) #.cuda() # convert ratings to tensors
    # add a trailing dimension to the ratings if the model outputs shape (N, 1)
    if unsqueeze:
        ratings = ratings.unsqueeze(1)
    y_hat = model(users, items) # predict the validation ratings
    loss = F.mse_loss(y_hat, ratings) # calculate the MSE loss
    print("test loss %.3f " % loss.item()) # print the validation loss

### Create the neural network with matrix factorization and bias.

**Neural Network Structure:**
- Embedding layer with input dimension = num_users and embedding size 100 for users
- Embedding layer holding the bias for users
- Embedding layer with input dimension = num_items and embedding size 100 for items
- Embedding layer holding the bias for items

**Forward propagation:**
Use the user and item embeddings to compute the forward propagation:
- U = user embedding
- V = item embedding
- b_u, b_v = user and item biases

$$\hat{y} = \sum_{k} U_k V_k + b_u + b_v$$
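
To make the forward computation concrete, here is a small NumPy sketch of the same arithmetic on toy arrays. The sizes, the random seed, and the `predict` helper are assumptions for illustration only; the notebook's actual model is the PyTorch class in the next cell.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, emb_size = 4, 5, 3  # toy sizes, not the notebook's

# Randomly initialized parameters, using the same ranges as the PyTorch model.
user_emb = rng.uniform(0, 0.05, size=(num_users, emb_size))
item_emb = rng.uniform(0, 0.05, size=(num_items, emb_size))
user_bias = rng.uniform(-0.01, 0.01, size=num_users)
item_bias = rng.uniform(-0.01, 0.01, size=num_items)

def predict(users, items):
    """Dot product of user and item embeddings plus both biases, per pair."""
    U = user_emb[users]  # shape (batch, emb_size)
    V = item_emb[items]  # shape (batch, emb_size)
    return (U * V).sum(axis=1) + user_bias[users] + item_bias[items]

users = np.array([0, 1, 3])
items = np.array([2, 2, 4])
scores = predict(users, items)
print(scores.shape)
```

Each (user, item) pair produces one scalar score, so a batch of three pairs yields three predictions.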

In [None]:
# Create matrix factorization with bias
class MF_bias(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF_bias, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size) # user embedding layer
        self.user_bias = nn.Embedding(num_users, 1) # user bias embedding layer
        self.item_emb = nn.Embedding(num_items, emb_size) # item embedding layer
        self.item_bias = nn.Embedding(num_items, 1) # item bias embedding layer
        # initialize the weights with small uniform values
        self.user_emb.weight.data.uniform_(0,0.05)
        self.item_emb.weight.data.uniform_(0,0.05)
        self.user_bias.weight.data.uniform_(-0.01,0.01)
        self.item_bias.weight.data.uniform_(-0.01,0.01)
    
    # Calculate the forward propagation
    def forward(self, u, v):
        U = self.user_emb(u) # look up the user embeddings
        V = self.item_emb(v) # look up the item embeddings
        b_u = self.user_bias(u).squeeze() # user biases
        b_v = self.item_bias(v).squeeze() # item biases
        return (U*V).sum(1) +  b_u  + b_v # dot product plus biases

In [None]:
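# Calculate the number of users and items
# (assumes the IDs are contiguous integers starting at 0, since they index the embeddings directly)
num_users = len(train_df.user_id.unique()) # number of users 
num_items = len(train_df.item_id.unique()) # number of items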
print(num_users, num_items) 

### Create the model
Set the embedding dimension size to 100

In [None]:
model = MF_bias(num_users, num_items, emb_size=100)

In [None]:
# train_epocs(model, epochs=10, lr=0.1)
train_epocs(model, epochs=25, lr=0.001, wd=1e-5)

### Prediction 

In [None]:
# Function to predict ratings on the testing set
def predictions(test_df):
    users = torch.LongTensor(test_df.user_id.values) # convert users to tensors
    items = torch.LongTensor(test_df.item_id.values)  # convert items to tensors
    rating = model(users, items) # obtain the ratings by calling the model
    users = users.tolist()
    items = items.tolist()
    rating = rating.tolist()
    # Create the dataframe
    df = pd.DataFrame({'user_id':users, # users
                       'item_id':items, # items
                       'rating':rating # ratings
                      })
    return df # return dataframe

In [None]:
# Perform prediction
predicted_df = predictions(test_df) 
predicted_df.head(20) # show the first 20 predictions

- Group the predictions by user_id.
- Sort the values in descending order of predicted rating.
- Extract the top 10 items for every user based on the predicted rating.

In [None]:
# Group by user and keep the 10 highest-rated items per user
grouped_df = predicted_df.groupby(['user_id']).apply(lambda x: x.sort_values('rating', ascending=False).nlargest(10,'rating'))
grouped_df.head(20)

In [None]:
# store the top 10 items for each user
grouped_df[['user_id','item_id']].to_csv('test_MF_Bias.csv', index=False)

On submission to Kaggle, this model gives a very low score on the NDCG evaluation.

# Recommender System with ALS 

- Using the ALS model from Implicit to build the recommender system

Steps:
- Create the user_item and item_user sparse matrices.
- Set the alpha value.
- Create the model with the best parameters.
- Recommend the top 10 items for all users.

In [4]:
# Import Implicit library
import implicit
from sklearn.preprocessing import MinMaxScaler # scaler to normalize the values
from sklearn import metrics
import scipy.sparse as sparse
from tqdm import tqdm

Convert the users and items to categorical codes for creating the sparse matrices

In [5]:
train_df.user_id = train_df.user_id.astype("category")
train_df.item_id = train_df.item_id.astype("category")
train_df.rating = train_df.rating.astype("category")
train_df['user'] = train_df.user_id.cat.codes
train_df['item'] = train_df.item_id.cat.codes
train_df.head()

Unnamed: 0,user_id,item_id,rating,user,item
0,0,0,1,0,0
1,0,1,1,0,1
2,0,2,1,0,2
3,0,3,1,0,3
4,0,4,1,0,4


### Creating Sparse Matrix

Create the sparse matrix using `sparse.csr_matrix` from `scipy`
- user_item
- item_user 

In [6]:
# convert to sparse matrices
sparse_item_user = sparse.csr_matrix((train_df.rating.astype(float), (train_df.item, train_df.user))) # item_user matrix
sparse_user_item = sparse.csr_matrix((train_df.rating.astype(float), (train_df.user, train_df.item))) # user_item matrix

- Set alpha = 15 for our recommender system
- Create the training data by multiplying the item_user sparse matrix by alpha

In [33]:
alpha = 15 # The rate at which our confidence in a preference increases with more interactions.
data = (sparse_item_user * alpha).astype('double') # scale the interactions by alpha to get confidence values
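
One common formulation of implicit-feedback ALS separates a binary preference from a confidence that grows linearly with the interaction strength, `c = 1 + alpha * r`. The dense sketch below illustrates that formulation on toy values; it is an assumption about the underlying idea, since the notebook itself only multiplies the matrix by alpha and leaves the rest to the library.

```python
import numpy as np

alpha = 15

# Toy interaction strengths: rows are users, columns are items.
interactions = np.array([
    [1, 0, 2],
    [0, 0, 1],
], dtype=float)

# Preference: did the user interact with the item at all?
preference = (interactions > 0).astype(float)

# Confidence: 1 for unobserved pairs, growing linearly with interaction strength.
confidence = 1.0 + alpha * interactions
print(preference)
print(confidence)
```

A larger alpha makes observed interactions count more heavily relative to the unobserved pairs.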

### Model Building
- ALS model using implicit package

Set the following parameters:
- alpha = 15
- factors = 8
- regularization = 0.1
- iterations = 30 

In [40]:
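# build the model using the ALS algorithm from the implicit package
# (this notebook passes an item-user matrix; newer implicit releases may expect user-item instead, so check your version)
model = implicit.als.AlternatingLeastSquares(factors=8, regularization=0.1, iterations=30)
model.fit(data)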


In [41]:
user_vecs = model.user_factors
item_vecs = model.item_factors

### Get user recommendations

In [43]:
# Function to get the recommendations
def recommend(person_id, sparse_person_content, person_vecs, content_vecs, num_contents):
    # Get the interaction scores from the sparse person content matrix
    person_interactions = sparse_person_content[person_id,:].toarray()
    # Add 1 to everything, so that articles with no interaction yet become equal to 1
    person_interactions = person_interactions.reshape(-1) + 1
    # Set articles already interacted with to zero
    person_interactions[person_interactions > 1] = 0
    # Get dot product of person vector and all content vectors
    rec_vector = person_vecs[person_id,:].dot(content_vecs.T)
    # Scale the recommendation scores to the range [0, 1]
    min_max = MinMaxScaler()
    rec_vector_scaled = min_max.fit_transform(rec_vector.reshape(-1,1))[:,0]
    # Content already interacted with has its recommendation score multiplied by zero
    recommend_vector = person_interactions * rec_vector_scaled
    # Sort the indices of the content into order of best recommendations
    content_idx = np.argsort(recommend_vector)[::-1][:num_contents]
    # return the recommended items
    return content_idx

# create the list of unique users
user_list = train_df.user_id.unique().tolist()
# list to store recommendations
recomendations_list = []
# for each user in user_list get recommendations
for user in tqdm(user_list):
    recomendations_list.append(recommend(user, sparse_user_item, user_vecs, item_vecs,num_contents = (len(item_vecs)-1)))

100%|████████████████████████████████████████████████████████████████████████████| 2239/2239 [00:02<00:00, 1118.10it/s]
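
The scoring step can be sketched on toy factors as follows. The vectors, the seen-item mask, and the helper name `recommend_top_n` are assumptions for illustration. Unlike the function above, which multiplies the scores of seen items by zero, this variant sets them to negative infinity so they can never tie with an unseen item.

```python
import numpy as np

def recommend_top_n(user_vec, item_vecs, seen_mask, n):
    """Rank unseen items by min-max scaled dot-product score, best first."""
    scores = item_vecs @ user_vec
    span = scores.max() - scores.min()
    scaled = (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)
    # Exclude items the user has already interacted with.
    scaled = np.where(seen_mask, -np.inf, scaled)
    order = np.argsort(scaled)[::-1]
    return [int(i) for i in order[:n] if not seen_mask[i]]

# Toy latent factors: one user, four items, two factors.
user_vec = np.array([1.0, 0.5])
item_vecs = np.array([
    [0.9, 0.1],  # item 0 (already seen)
    [0.2, 0.8],  # item 1
    [0.5, 0.5],  # item 2
    [0.1, 0.1],  # item 3
])
seen_mask = np.array([True, False, False, False])

top_items = recommend_top_n(user_vec, item_vecs, seen_mask, n=2)
print(top_items)
```

Item 0 has the highest raw score but is skipped because the user has already interacted with it.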


### Create the user-item recommendations dataframe

In [44]:
item_user_recomenadations=[] # list to store (user, item) pairs
for user in tqdm(test_df.user_id.unique().tolist()): # for each user in the testing samples
    items_list = [] # store the recommended items
    user_test_items = set(test_df[test_df.user_id == user].item_id.tolist()) # candidate items for this user
    for item in recomendations_list[user].tolist(): # for each recommended item, best first
        if item in user_test_items: # keep only items listed for this user in the test set
            items_list.append((user,item)) # store the (user, item) pair
            if len(items_list) > 9: # stop once 10 items have been collected
                break
    item_user_recomenadations.extend(items_list) # add this user's pairs to the results

100%|██████████████████████████████████████████████████████████████████████████████| 2239/2239 [09:44<00:00,  3.83it/s]


In [45]:
# convert the (user, item) recommendations list to a dataframe
test_res = pd.DataFrame(item_user_recomenadations,columns=['user_id','item_id'])

In [46]:
# store to csv
test_res.to_csv('30487420.csv',index=False)

On submitting this CSV to Kaggle, this model achieves an NDCG score of 0.20211.

##### Conclusion
1. This notebook compares several approaches to building a recommender system: Surprise (SVD, SVD++, NMF), a PyTorch matrix factorization model, and ALS from Implicit.
2. The NDCG score obtained on Kaggle with the ALS algorithm is the highest score achieved by the code in this notebook.
