### Week 8: Collaborative Filtering
```
- Advanced Machine Learning, Innopolis University 
- Professor: Muhammad Fahim 
- Teaching Assistant: Gcinizwe Dlamini
```
<hr>


```
Lab Plan
    1. Content based recommendation Systems 
    2. Matrix Factorisation
    3. Surprise 
    4. Deep Learning based recommendation systems
    5. Lab Task
```

<hr>

## 1. Background

**Recommender Systems** are algorithms aimed at suggesting relevant items to users (items being movies to watch, text to read, products to buy or anything else, depending on the application).

![alt text](https://miro.medium.com/max/1920/1*Y_QG3Kvfk0fSnCirLBHZ7w.jpeg)


### Recommendation paradigms

The distinction between approaches is more academic than practical, but it’s important to understand their differences.
Broadly speaking, recommender systems are of 4 types:

1. **Collaborative filtering** is perhaps the most well-known approach to recommendation, to the point that it’s sometimes seen as synonymous with the field. The main idea is that you’re given a matrix of preferences by users for items, and these are used to predict missing preferences and recommend items with high predictions. All you need to get started is user and item IDs and a notion of preference by users for items (ratings, views, etc.).

2. **Content-based filtering** algorithms are given user preferences for items and recommend similar items based on a domain-specific notion of item content. This approach also extends naturally to cases where item metadata is available (e.g., movie stars, book authors, and music genres).
3. **Social and demographic** recommenders suggest items that are liked by friends, friends of friends, and demographically-similar people. Such recommenders don’t need any preferences from the user to whom recommendations are made, so they still work for new users with no rating history (the cold-start case).
4. **Contextual recommendation** algorithms recommend items that match the user’s current context. This allows them to be more flexible and adaptive to current user needs than methods that ignore context (essentially giving the same weight to all of the user’s history). Hence, contextual algorithms are more likely to elicit a response than approaches that are based only on historical data.

## Collaborative Filtering

Collaborative filtering (CF) systems work by collecting user feedback in the form of ratings for items in a given domain and exploiting similarities in rating behavior among several users in determining how to recommend an item.
CF accumulates customer product ratings, identifies customers with common ratings, and offers recommendations based on inter-customer comparisons. It’s based on the idea that people who agree in their evaluations of certain items in the past are likely to agree again in the future. For example, most people ask their trusted friends for restaurant or movie suggestions.

![alt text](https://miro.medium.com/max/687/1*-Jr1l2rlj9SBcCzlDHtN5g.jpeg)

Collaborative filtering models are based on an assumption that people like things similar to other things they like, and things that are liked by other people with similar taste.

![alt text](https://miro.medium.com/max/1348/1*K5BOY3B93MLn173VVzOW0Q.png)
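
The idea of "people with similar taste" can be made concrete with a small user-based example. This is only a sketch under stated assumptions: the rating matrix is made up, `0` is taken to mean "not rated", and the helper names `cosine` and `predict` are hypothetical, not part of any library used in this lab.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
# All values are made up for illustration.
ratings = np.array([
    [5, 4, 0, 1],   # user 0
    [4, 5, 1, 0],   # user 1 (taste close to user 0)
    [1, 0, 5, 4],   # user 2 (roughly opposite taste)
], dtype=float)


def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def predict(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num, den = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue  # skip the user themself and users who have not rated the item
        sim = cosine(ratings[user], ratings[other])
        num += sim * ratings[other, item]
        den += abs(sim)
    return num / den if den > 0 else 0.0


sim_01 = cosine(ratings[0], ratings[1])
sim_02 = cosine(ratings[0], ratings[2])
pred = predict(ratings, user=0, item=2)
print(sim_01, sim_02, pred)
```

User 0 is much closer to user 1 than to user 2, so the prediction for item 2 is pulled towards user 1's low rating rather than user 2's high one.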

## 2. Content based recommendation Systems

* What are content-based recommendation systems?
* How do content-based recommendation systems differ from other systems you know?


### 2.1 Dataset

What does the dataset look like? <br>
The components of the dataset:

1. Item-ID
2. Item-description

In [14]:
!wget https://raw.githubusercontent.com/quarriedstone/AML-DS-2021/main/data/Recommender/sample-data.csv

--2021-04-22 09:01:41--  https://raw.githubusercontent.com/quarriedstone/AML-DS-2021/main/data/Recommender/sample-data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 566516 (553K) [text/plain]
Saving to: ‘sample-data.csv’


2021-04-22 09:01:41 (6.57 MB/s) - ‘sample-data.csv’ saved [566516/566516]



In [8]:
import os
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

#Load dataset and take a look
ds = pd.read_csv("sample-data.csv")
ds.head()

Unnamed: 0,id,description
0,1,Active classic boxers - There's a reason why o...
1,2,Active sport boxer briefs - Skinning up Glory ...
2,3,Active sport briefs - These superbreathable no...
3,4,"Alpine guide pants - Skin in, climb ice, switc..."
4,5,"Alpine wind jkt - On high ridges, steep ice an..."


## 2.2 Recommendation task

Recommend k items to a user, given that they are interested in item 30, "Cotton board shorts".

Solution:

1. Create a TF-IDF vector for every item.
2. Measure the cosine similarity between items.
3. Propose the k most similar items.
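
Steps 2 and 3 can be illustrated without any text processing. The sketch below uses made-up term-weight vectors and shows that, once every row has unit length (which `TfidfVectorizer` does by default through its `norm='l2'` setting), a plain dot product already gives the cosine similarity. The variable names are illustrative only.

```python
import numpy as np

# Made-up term-weight vectors for three items (rows) over four terms (columns).
weights = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],   # same direction as item 0
    [0.0, 0.0, 3.0, 1.0],   # shares no terms with item 0
])

# L2-normalise each row so that every vector has unit length.
unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)

# For unit vectors, the plain dot product is the cosine similarity.
similarity = unit @ unit.T

# K closest items to item 0, excluding item 0 itself.
k = 2
order = np.argsort(-similarity[0])
closest = [int(i) for i in order if i != 0][:k]
print(similarity.round(3))
print(closest)
```

Item 1 points in the same direction as item 0, so it comes first; item 2 shares no terms with item 0 and gets a similarity of zero.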

In [16]:
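#Step 1: Create TF-IDF of every item
# min_df=1 keeps every term; recent scikit-learn releases reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(ds['description'])

# Step 2: Measure cosine similarity (TF-IDF rows are L2-normalised, so the linear kernel equals the cosine)
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)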

results = {}

for idx, row in ds.iterrows():
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], ds['id'][i]) for i in similar_indices]

    results[row['id']] = similar_items[1:]
    
print('done!')

def item(id):
    return ds.loc[ds['id'] == id]['description'].tolist()[0].split(' - ')[0]

# Just reads the results out of the dictionary.
def recommend(item_id, num):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + "...")
    print("-------")
    recs = results[item_id][:num]
    for rec in recs:
        print("Recommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

recommend(item_id=30, num=10)

done!
Recommending 10 products similar to Cotton board shorts...
-------
Recommended: Wavefarer board shorts-21 in. (score:0.14044018978564404)
Recommended: Minimalist board shorts-19 in. (score:0.13793658801766598)
Recommended: Twenty-three's board shorts (score:0.1336824469134404)
Recommended: Paddler board shorts (score:0.13331764205970856)
Recommended: Duck shorts (score:0.1158622466782414)
Recommended: Custodian pants (score:0.10690842537522768)
Recommended: All-wear shorts (score:0.10526227846780732)
Recommended: Light and variable surf trunks (score:0.10502889619007406)
Recommended: Custodian pants (score:0.10459695983928953)
Recommended: Custodian pants (score:0.10447441847496913)


Discussion:

1. Make changes.
2. Evaluate the model.

## 3. Model Based Recommendation Systems


![alt text](https://datascienceplus.com/wp-content/uploads/2017/09/2017-09-20-2.png)

## 4.1 Dataset
We will use the [**MovieLens 20M Dataset**](https://grouplens.org/datasets/movielens/20m/). <br>
It is an openly available dataset from grouplens.org. The data set has 20000263 ratings and 465564 tag applications across 27278 movies, created by 138493 users between 1995 and 2015.

Download the archive and unzip it:

In [17]:
!wget https://files.grouplens.org/datasets/movielens/ml-20m.zip --no-check-certificate
!unzip 'ml-20m.zip'

--2021-04-22 09:01:55--  https://files.grouplens.org/datasets/movielens/ml-20m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 198702078 (189M) [application/zip]
Saving to: ‘ml-20m.zip’


2021-04-22 09:01:58 (61.7 MB/s) - ‘ml-20m.zip’ saved [198702078/198702078]

Archive:  ml-20m.zip
   creating: ml-20m/
  inflating: ml-20m/genome-scores.csv  
  inflating: ml-20m/genome-tags.csv  
  inflating: ml-20m/links.csv        
  inflating: ml-20m/movies.csv       
  inflating: ml-20m/ratings.csv      
  inflating: ml-20m/README.txt       
  inflating: ml-20m/tags.csv         


## 2.2 Load Dataset 

In [18]:
data_path = './ml-20m/'
movies_filename = 'movies.csv'
ratings_filename = 'ratings.csv'

df_movies = pd.read_csv(
    os.path.join(data_path, movies_filename),
    usecols=['movieId', 'title'],
    dtype={'movieId': 'int32', 'title': 'str'})

df_ratings = pd.read_csv(
    os.path.join(data_path, ratings_filename),
    usecols=['userId', 'movieId', 'rating'],
    dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})


In [19]:
df_movies.head()

Unnamed: 0,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [20]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


## 2.3 Create user-movie dataset

In [21]:
df_ratings=df_ratings[:2000000]
df_movie_features = df_ratings.pivot(
    index='userId',
    columns='movieId',
    values='rating'
).fillna(0)

df_movie_features.head()

movieId,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,128671,128686,128715,128736,128832,128842,128898,128900,128902,128968,128991,129030,129034,129036,129068,129233,129235,129303,129350,129354,129428,129530,129659,129707,129786,129788,129822,129857,130052,130069,130073,130075,130219,130462,130490,130496,130512,130642,130644,130768
userId
1,0.0,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,0.0,0.0,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 2.4 Singular value decomposition (SVD)

* Remember dimensionality reduction
* What other algorithms for dimensionality reduction do you know?

SVD from scratch ??? **NO**<br>

We will use SciPy's `svds` implementation for now
<br>
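
Before applying SVD to the full rating matrix, it may help to see the decomposition on a tiny example. This is a sketch on a made-up 4x3 matrix; it uses `np.linalg.svd` instead of `svds` because the matrix is small and dense, and the helper `rank_k` is a hypothetical name.

```python
import numpy as np

# Made-up 4x3 rating matrix, used only to illustrate the decomposition.
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Thin SVD: R = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)


def rank_k(k):
    """Rank-k approximation of R built from the k largest singular values."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


# Reconstruction error (Frobenius norm) for k = 1, 2, 3.
errors = [float(np.linalg.norm(R - rank_k(k))) for k in (1, 2, 3)]
print(errors)
```

The error shrinks as more singular values are kept and vanishes at full rank. Keeping only a few factors is what turns the decomposition into a compact model of user and item preferences.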


In [22]:
from scipy.sparse.linalg import svds

R = df_movie_features.values
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)

U, sigma, Vt = svds(R_demeaned, k = 50)

sigma = np.diag(sigma)
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)

preds_df = pd.DataFrame(all_user_predicted_ratings, columns = df_movie_features.columns)
preds_df.head()

movieId,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,128671,128686,128715,128736,128832,128842,128898,128900,128902,128968,128991,129030,129034,129036,129068,129233,129235,129303,129350,129354,129428,129530,129659,129707,129786,129788,129822,129857,130052,130069,130073,130075,130219,130462,130490,130496,130512,130642,130644,130768
0,-1.032376,0.816788,0.065366,-0.069286,-0.08983,0.488379,-0.417959,0.045865,-0.13905,-0.341772,-0.513422,0.178584,0.069932,-0.034303,-0.04544,0.260364,0.069652,0.285616,0.471105,-0.199864,-0.03473,0.439173,-0.025721,0.213959,0.094146,0.060702,0.057206,-0.044555,1.233433,-0.004471,0.095022,2.282895,-0.011613,0.45606,0.021102,0.520967,0.004015,-0.014233,0.176869,0.043323,...,-0.00771,-0.007637,-0.001437,-0.001974,-0.001901,-0.002341,-0.00771,-0.010073,-0.009284,-0.003201,-0.010073,-0.002748,-0.001605,-0.001605,-0.009607,-0.000441,-0.004316,-0.003167,0.000509,-0.005256,-0.00729,-0.001605,-0.014006,-0.002858,-0.001877,-0.001741,-0.001728,-0.004781,-0.004179,-0.011247,0.000436,-0.000913,-0.005743,-0.006718,-0.004173,-0.002836,-0.009607,-0.006603,-0.001093,-0.002193
1,0.878418,0.018588,0.350089,0.069685,0.219142,0.379299,0.404332,0.008962,0.102663,-0.336462,0.312137,0.088314,0.029474,0.091948,0.045356,0.03074,0.429898,0.061929,-0.042158,0.062091,-0.037035,0.128742,0.110055,0.276819,0.333761,0.03421,0.057615,-0.010574,0.326948,-0.026605,0.111463,1.481748,0.002342,0.073128,-0.019877,0.147414,-0.003927,0.020103,-0.161433,-0.000244,...,-0.003447,0.000291,-0.001468,-0.000321,0.000662,-0.000419,-0.003447,0.001479,0.001308,0.001028,0.001479,0.000548,-0.000801,-0.000801,-0.004802,0.001637,0.000272,0.00133,-0.000974,-0.003719,-0.003486,-0.000801,0.001901,0.000244,-0.000653,-0.000727,0.000495,-0.001496,0.000349,0.001411,0.001779,-0.000374,-0.001408,0.00251,0.002424,-0.000837,-0.004802,0.001166,-0.000476,0.000287
2,2.00451,0.853212,-0.124892,0.033102,-0.197887,0.725845,-0.070364,-0.043636,-0.004803,0.117873,0.67562,-0.01199,0.023256,0.035115,-0.085638,1.054159,-0.39464,-0.095323,-0.392497,-0.00602,0.61225,0.300114,-0.104179,0.892096,0.300976,-0.106821,0.05214,-0.153646,0.799834,-0.006502,-0.09988,2.957937,0.010231,0.690257,-0.046066,-0.073031,0.025883,0.007296,0.748279,-0.02753,...,0.014029,0.002355,-0.005867,9e-06,-0.003435,-0.000388,0.014029,-0.001359,-0.001174,0.000396,-0.001359,0.000377,0.000973,0.000973,0.019518,0.010235,0.000436,0.002968,-0.000737,0.005804,-0.004255,0.000973,0.003951,0.000342,0.00084,0.000906,0.000638,-0.00346,-0.000256,0.00304,0.003959,-0.004786,-0.001086,0.001811,0.005158,0.001003,0.019518,-0.001382,-0.001391,0.000514
3,-0.730042,0.575875,0.316375,-0.04377,0.175544,0.804843,0.024102,0.053929,0.182834,1.027779,0.215089,0.053206,0.031952,0.014743,0.069392,0.533745,-0.540017,0.039601,0.485803,0.259154,0.650261,0.308804,0.244486,0.137633,-0.052208,-0.045443,-0.014521,-0.067064,0.017106,-0.024689,0.171829,0.44483,-0.000832,0.061157,-0.058791,-0.131904,0.003973,0.000392,0.206131,-0.037648,...,0.001746,0.001691,0.00254,0.001675,0.00282,0.001738,0.001746,0.004465,0.004178,0.002712,0.004465,0.001991,0.00199,0.00199,0.001691,0.00127,0.001783,0.00254,0.001151,0.001958,0.004704,0.00199,0.001713,0.001938,0.001969,0.001979,0.001728,0.002784,0.001792,0.001756,0.00122,0.001608,0.002619,0.001078,0.003182,0.001051,0.001691,0.001607,0.002382,0.001786
4,1.487831,1.324306,1.370973,0.112845,1.282828,0.726827,1.453526,0.217864,0.285134,1.278285,1.579988,0.208238,0.088085,0.47525,0.035512,-0.176552,1.851011,0.099256,0.402609,0.031267,0.674253,0.260641,0.019003,0.071228,0.663682,0.228992,0.160756,0.229392,0.06737,-0.015628,0.448435,0.985231,-0.006345,2.396324,0.040164,1.618643,0.005433,0.037594,0.31042,0.106339,...,-0.002487,-0.002947,0.004797,0.000543,0.002156,0.000243,-0.002487,0.003953,0.003598,0.002265,0.003953,0.001517,-0.001509,-0.001509,-0.003786,0.000288,0.002172,-0.0011,4e-06,0.001105,-0.005823,-0.001509,0.000266,0.001138,-0.001055,-0.001282,0.002379,0.000471,0.001003,0.000389,-0.000932,-0.001432,-4.9e-05,0.002574,-0.002193,-0.00184,-0.003786,0.001492,0.001601,0.001771


In [23]:
def recommend_movies(preds_df, userID, movies_df, original_ratings_df, num_recommendations=5):
  user_row_number = userID - 1 
  sorted_user_predictions = preds_df.iloc[user_row_number].sort_values(ascending=False)
  user_data = original_ratings_df[original_ratings_df.userId == (userID)]
  user_full = (user_data.merge(movies_df, how = 'left', left_on = 'movieId', right_on = 'movieId').
                    sort_values(['rating'], ascending=False)
                )
  recommendations = (movies_df[~movies_df['movieId'].isin(user_full['movieId'])]).merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left', left_on = 'movieId',
              right_on = 'movieId').rename(columns = {user_row_number: 'Predictions'}).sort_values('Predictions', ascending = False).iloc[:num_recommendations, :-1]
                    

  return user_full, recommendations


already_rated, predictions = recommend_movies(preds_df, 330, df_movies, df_ratings, 10)
already_rated.head(10)

Unnamed: 0,userId,movieId,rating,title
11,330,17,5.0,Sense and Sensibility (1995)
28,330,47,5.0,Seven (a.k.a. Se7en) (1995)
30,330,50,5.0,"Usual Suspects, The (1995)"
200,330,588,5.0,Aladdin (1992)
154,330,349,5.0,Clear and Present Danger (1994)
190,330,509,5.0,"Piano, The (1993)"
44,330,110,5.0,Braveheart (1995)
131,330,296,5.0,Pulp Fiction (1994)
0,330,1,4.0,Toy Story (1995)
183,330,457,4.0,"Fugitive, The (1993)"


In [24]:
predictions

Unnamed: 0,movieId,title
159,293,Léon: The Professional (a.k.a. The Professiona...
178,329,Star Trek: Generations (1994)
397,608,Fargo (1996)
187,342,Muriel's Wedding (1994)
304,497,Much Ado About Nothing (1993)
99,163,Desperado (1995)
200,368,Maverick (1994)
280,468,Englishman Who Went Up a Hill But Came Down a ...
130,223,Clerks (1994)
1,5,Father of the Bride Part II (1995)
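
The core of the recommendation step (rank the predicted scores, skip what the user has already rated) can be sketched on a single row. The numbers and the helper name `top_n_unseen` are made up for illustration, and `0` is assumed to mean "not rated", as in the pivoted matrix.

```python
import numpy as np

# Made-up predicted scores and observed ratings for one user over six items.
predicted = np.array([0.9, 2.5, 0.1, 3.2, 1.7, 2.9])
observed = np.array([0.0, 4.0, 0.0, 0.0, 0.0, 5.0])   # 0 means "not rated"


def top_n_unseen(predicted, observed, n):
    """Indices of the n highest-scoring items the user has not rated yet."""
    scores = np.where(observed > 0, -np.inf, predicted)  # mask rated items
    order = np.argsort(-scores)                          # best first
    unseen_count = int((observed == 0).sum())
    return [int(i) for i in order[:min(n, unseen_count)]]


top = top_n_unseen(predicted, observed, n=3)
print(top)
```

Items 1 and 5 have high predicted scores but are excluded because the user has already rated them.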


## 3. [Surprise](https://github.com/NicolasHug/Surprise)

Surprise is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

In [5]:
!pip3 install scikit-surprise



### 3.1 Load data and fit SVD model

In [26]:
from surprise import Reader, SVD, Dataset
from collections import defaultdict

# MovieLens ratings run from 0.5 to 5.0 in half-star steps
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(df_ratings[["userId", "movieId", "rating"]], reader)

In [27]:
# Create a train set and fit the model (Surprise's SVD is trained with SGD)
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fdd43a99e90>
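
To get a feeling for what such a model learns, here is a from-scratch sketch of biased matrix factorisation trained with SGD, where a rating is predicted as global mean + user bias + item bias + dot product of user and item factors. It is not Surprise's code: the data, the hyperparameters and all names are assumptions chosen for illustration.

```python
import numpy as np

# Made-up (user, item, rating) triples with contiguous 0-based ids.
triples = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
n_users, n_items, n_factors = 4, 3, 2

rng = np.random.default_rng(0)
mu = np.mean([r for _, _, r in triples])          # global mean rating
b_u = np.zeros(n_users)                           # user biases
b_i = np.zeros(n_items)                           # item biases
P = rng.normal(0, 0.1, (n_users, n_factors))      # user factors
Q = rng.normal(0, 0.1, (n_items, n_factors))      # item factors


def predict(u, i):
    """Predicted rating of user u for item i."""
    return mu + b_u[u] + b_i[i] + Q[i] @ P[u]


def rmse():
    """Root mean squared error on the training triples."""
    return float(np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in triples])))


lr, reg = 0.01, 0.02   # illustrative learning rate and L2 regularisation
rmse_before = rmse()
for epoch in range(200):
    for u, i, r in triples:
        err = r - predict(u, i)
        # Gradient steps on the regularised squared error.
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        p_old = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])
rmse_after = rmse()
print(rmse_before, rmse_after)
```

The training error drops as the biases and factors adapt to the observed ratings. On real data the error should be measured on held-out ratings instead.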

### 3.2 Recommend Movies

In [28]:
def get_top_n(predictions, n=10):
  """Return the top-N recommendation for each user from a set of predictions.
  Args:
      predictions(list of Prediction objects): The list of predictions, as
          returned by the test method of an algorithm.
      n(int): The number of recommendations to output for each user. Default
          is 10.
  Returns:
  A dict where keys are user (raw) ids and values are lists of tuples:
      [(raw item id, rating estimation), ...] of size n.
  """

  # First map the predictions to each user.
  top_n = defaultdict(list)
  for uid, iid, true_r, est, _ in predictions:
      top_n[uid].append((iid, est))

  # Then sort the predictions for each user and retrieve the n highest ones.
  for uid, user_ratings in top_n.items():
      user_ratings.sort(key=lambda x: x[1], reverse=True)
      top_n[uid] = user_ratings[:n]

  return top_n

In [None]:
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=10)

In [None]:
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

## 4. Deep learning Based approach 

**We will use PyTorch !!**

The deep learning approach is not so different from the SVD approach we have just seen.

**Back to Embeddings!!!**

The neural network is made up of two embedding layers and some hidden layers.

1. Example architecture:
        1.1 Two `Embedding`s for users and movies.
        1.2 One `Dropout` for the output of the embeddings.
        1.3 The hidden layers
        1.4 Output layer

2. Example forward pass: 
        2.1 Get the 2 embeddings tensors, then concatenate both.
        2.2 Run it through the hidden layers then the last fc layer.
        2.3 Apply sigmoid activation.
        2.4 Adjust the range of the estimated rating matrix to be [1, 5].

3. Loss function is MSE
4. Optimizer ??? 
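
Steps 2.3 and 2.4 of the forward pass can be checked in isolation. The sketch below uses NumPy rather than PyTorch, and the function name `scaled_sigmoid` and the raw scores are made up. It shows how a sigmoid output in (0, 1) is stretched to the rating range.

```python
import numpy as np


def scaled_sigmoid(x, min_rating=1.0, max_rating=5.0):
    """Squash raw scores into (min_rating, max_rating) with a sigmoid."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (max_rating - min_rating) + min_rating


raw = np.array([-10.0, 0.0, 10.0])   # made-up raw network outputs
scaled = scaled_sigmoid(raw)
print(scaled)
```

A raw score of 0 maps to the middle of the range, and very large or very small scores approach the upper and lower bounds without ever reaching them.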



In [9]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch
import pandas as pd

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

epochs = 100
batch_sz = 128

## Read Data and Create batches

The MovieLens latest-small dataset contains about 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. It is a smaller dataset intended for education and development.

In [6]:
#Get smaller dataset 
!wget https://files.grouplens.org/datasets/movielens/ml-latest-small.zip --no-check-certificate
!unzip ml-latest-small.zip

--2021-04-22 09:10:22--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2021-04-22 09:10:22 (4.48 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [11]:

df_ratings = pd.read_csv("./ml-latest-small/ratings.csv", usecols=['userId', 'movieId', 'rating'],
                         dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})

users = df_ratings['userId'].values - 1
movies = df_ratings['movieId'].values - 1
rates = df_ratings['rating'].values
n_samples = len(rates)

n_users, n_movies =  max(users)+1, max(movies)+1
batches = []

#Create batches
for i in range(0, n_samples, batch_sz):
  limit =  min(i + batch_sz, n_samples)
  users_batch, movies_batch, rates_batch = users[i: limit], movies[i: limit], rates[i: limit]
  batches.append((torch.tensor(users_batch, dtype=torch.long), torch.tensor(movies_batch, dtype=torch.long),
                  torch.tensor(rates_batch, dtype=torch.float)))
users = None
movies = None 
rates = None 
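
Raw MovieLens movie ids have large gaps, so using `max id` as the table size allocates many embedding rows that are never trained. One possible alternative, sketched here on made-up ids with illustrative variable names, is to map the raw ids to contiguous indices with `np.unique`.

```python
import numpy as np

# Made-up raw movie ids with large gaps, as in MovieLens.
raw_movie_ids = np.array([1, 3, 193609, 3, 50, 1])

# np.unique returns the sorted distinct ids and, with return_inverse=True,
# the position of every original id inside that sorted array.
unique_ids, movie_idx = np.unique(raw_movie_ids, return_inverse=True)

n_movies_compact = len(unique_ids)           # rows needed after remapping
n_movies_naive = int(raw_movie_ids.max())    # rows implied by "max id" indexing
print(movie_idx, n_movies_compact, n_movies_naive)
```

`unique_ids[movie_idx]` recovers the original ids, so predictions can still be reported with the raw MovieLens identifiers.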

## Define Model

**TODO :** implement the hidden layers with the following achitecture: 

* 3 layers (128, 256 and 128 neurones)
* Dropout every after a layer with 20% probability
* Relu as activation function for all 3 hidden layers 

In [12]:
from torch import nn

class RecommenderNet(nn.Module):
  def __init__(self, n_users, n_movies, n_factors=50, embedding_dropout=0.02, dropout_rate=0.2):
    super().__init__()

    self.u = nn.Embedding(n_users, n_factors)
    self.m = nn.Embedding(n_movies, n_factors)
    self.drop = nn.Dropout(embedding_dropout)
    #self.hidden = nn.Sequential(....) #TODO: Implement the hidden layers
    self.fc = nn.Linear(n_factors*2, 1)
    self._init()

  def forward(self, users, movies, minmax=[1,5]):
    features = torch.cat([self.u(users), self.m(movies)], dim=1)
    x = self.drop(features)
    #x = self.hidden(x)
    out = torch.sigmoid(self.fc(x))
    
    if minmax is not None: #Scale the output to [1,5]
      min_rating, max_rating = minmax
      out = out*(max_rating - min_rating) + min_rating
    return out

  def _init(self):
    """
    Initialize embeddings and hidden layers weights with xavier.
    """
    def init(m):
        if type(m) == nn.Linear:
            torch.nn.init.xavier_uniform_(m.weight)
            m.bias.data.fill_(0.01)

    self.u.weight.data.uniform_(-0.05, 0.05)
    self.m.weight.data.uniform_(-0.05, 0.05)
    #self.hidden.apply(init)
    init(self.fc)

In [13]:
net = RecommenderNet(n_users=n_users, n_movies=n_movies).to(device)
net

RecommenderNet(
  (u): Embedding(610, 50)
  (m): Embedding(193609, 50)
  (drop): Dropout(p=0.02, inplace=False)
  (fc): Linear(in_features=100, out_features=1, bias=True)
)

## Define Training parameters

1. Loss function 
2. Optimizer 
3. Learning-rate scheduler: `lr_scheduler.ReduceLROnPlateau` reduces the learning rate when a monitored metric has stopped improving.

In [14]:
criterion = nn.MSELoss(reduction='mean')
optimizer = optim.Adam(net.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', factor=0.3, patience=2)
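
The plateau rule itself is simple enough to simulate in plain Python. The function below is a simplified sketch, not PyTorch's implementation: it ignores options such as the improvement threshold and cooldown, and the name `plateau_schedule` and the loss values are made up.

```python
def plateau_schedule(losses, lr=1e-3, factor=0.3, patience=2):
    """Return the learning rate in effect after each epoch's loss is reported.

    Simplified rule: any strict decrease counts as an improvement.
    """
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:     # no improvement for patience + 1 epochs
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Made-up losses: improvement stalls after the second epoch.
lrs = plateau_schedule([1.0, 0.8, 0.8, 0.9, 0.85, 0.7])
print(lrs)
```

With `patience=2`, the learning rate is multiplied by `factor` only after the third consecutive epoch without improvement.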

## Training Loop

In [15]:
epochs = 10

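for epoch in range(epochs):
  train_loss = 0.0
  for users_batch, movies_batch, rates_batch in batches:
    optimizer.zero_grad()
    out = net(users_batch.to(device), movies_batch.to(device), [1, 5]).squeeze(1)
    # nn.MSELoss expects (prediction, target)
    loss = criterion(out, rates_batch.to(device))

    loss.backward()
    optimizer.step()
    train_loss += loss.item()  # accumulate a plain float, not a tensor
  # Step the scheduler on the mean epoch loss rather than on the last batch only
  scheduler.step(train_loss / len(batches))
  print("Loss at epoch {} = {}".format(epoch, loss.item()))  # loss of the last batch
print("Last Loss = {}".format(loss.item()))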

Loss at epoch 0 = 0.537467360496521
Loss at epoch 1 = 0.40989959239959717
Loss at epoch 2 = 0.3426230549812317
Loss at epoch 3 = 0.31937116384506226
Loss at epoch 4 = 0.2943408191204071
Loss at epoch 5 = 0.2869146168231964
Loss at epoch 6 = 0.2719496488571167
Loss at epoch 7 = 0.2728712856769562
Loss at epoch 8 = 0.2608964741230011
Loss at epoch 9 = 0.26685476303100586
Last Loss = 0.26685476303100586
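
The loss printed above is measured on training data, and the batches follow the order of the ratings file. To estimate how well the model generalises, one option is to shuffle the rows and hold some of them out. This is a sketch with NumPy index arrays only; the sample count, the hold-out fraction and the variable names are assumptions.

```python
import numpy as np

n_samples = 1000            # made-up number of ratings
val_fraction = 0.1          # illustrative hold-out share
batch_size = 128

rng = np.random.default_rng(42)
perm = rng.permutation(n_samples)            # random order of row indices
n_val = int(n_samples * val_fraction)
val_idx, train_idx = perm[:n_val], perm[n_val:]

# Slice the shuffled training indices into mini-batches.
train_batches = [train_idx[i:i + batch_size]
                 for i in range(0, len(train_idx), batch_size)]
print(len(val_idx), len(train_idx), len(train_batches))
```

The index arrays can then be used to select rows of the user, movie and rating arrays before converting them to tensors. The validation loss is also a natural metric to pass to the learning-rate scheduler.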


## Lab Task
```
1. Implement ...

```


<center>Don't forget to make a Git commit</center>

## References
1. [Introduction to recommender systems](https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada)

2. [Recommender system](https://en.wikipedia.org/wiki/Recommender_system)

3. [Recommender Systems with Python — Part I: Content-Based Filtering](https://heartbeat.fritz.ai/recommender-systems-with-python-part-i-content-based-filtering-5df4940bd831)

4. [Build a Recommendation Engine With Collaborative Filtering](https://realpython.com/build-recommendation-engine-collaborative-filtering/)
