### Collaborative Filtering
Collaborative filtering is a recommendation approach based on filtering out items that a user might prefer on the basis of the **reactions of users with similar preferences**.

The primary asumption of the collaborative filtering is that the users who have agreed in the past tend to agree 
in the future.

#### Sub-approaches in Collaborative Filtering
1) Memory based

Memory-based approach is based on **finding similar users** using a selected measure(e,g.,cosine similarity or Pearson correlation) and taking a weighed average of rating.

Pro and Con

It is easy to build and more interpretable, but does not perform well when the data is limited.

2) Model based

The model based approach utilizes **machine learning** to predict expected user ratings of **unrated** items.(The following case used this approach)

Pro and con

It is hard to interprete but it is more effective when the avaiable data is limited.

#### Data Collection in Collaborative Filtering
Collaborative filtering is based on users' historical data, so we need to collect data about the users' feedbacks and preferences. 
1) Explicit Data Collection
Explicit data collection comprises of all data that the user **directly provides** to the system.
- user's rate on an item
- user's item ranking(inside a collection from most favorite to least favorite)
- user's selection between two or more items
- user's list of favorate items


2) Implicit Data Collection
Implicit data collection is based on a user's observable behaviors, which can be within the system as well outside of the systems.
- records of items that user purchased online
- websites that a user visited
- user's viewed items list
- user's social network engagements

### Content-Based Filtering
Content refers to the content or attributes of the items that the user engages with. In the content-based filtering approach, **items are categoried**, and based on the user's limited feedback, the system recommends new items belonging to the categories the user likes.

For content-based filtering, **both items and users are tagged with keywords** to categorize them. Items are tagged **based on their attributions**, whereas to tag users, a dedicated model is designed to create user profiles **based on their interaction with the recommender system**. A vector space representation algorithm(eg. **tf-idf**) is used to abstract the features of the items. 

### Case - Deep Collaborative Filtering with MovieLens Dataset

### 1. Initial Imports

In [4]:
# to build and train out model and make predictions
import tensorflow as tf

# to unzip the zip file
from zipfile import ZipFile

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# to import the embedding layer from Tensorflow
from tensorflow.keras.layers import Embedding

# to download the dataset from the external URL
from tensorflow.keras.utils import get_file

### 2. Loading the Data

In [46]:
# Use the url to open the 'rating.csv' file

url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
movielens_path = get_file('movielens.zip',url,extract = True)

"""extract = True means extract the file as an archive, like zip"""

with ZipFile(movielens_path) as z:
    with z.open("ml-latest-small/ratings.csv") as f:
        df = pd.read_csv(f)

df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


### 3. Processing the Data

#### 3.1 processing userIds

In [47]:
user_ids = df['userId'].unique().tolist()
user2encoder = {x:i for i, x in enumerate(user_ids)}
encoder2user = {i:x for i, x in enumerate(user_ids)}
df['user'] = df['userId'].map(user2encoder)
num_users = len(user2encoder)

#### 3.2 processing Movie IDs

In [48]:
movie_ids = df['movieId'].unique().tolist()
movie2encoder = {x:i for i, x in enumerate(movie_ids)}
encoder2movie = {i:x for i, x in enumerate(movie_ids)}
df['movie'] = df['movieId'].map(movie2encoder)
num_movies = len(movie2encoder)

#### 3.3 Processing the Ratings

For the rating, we normalize them for computational efficiency and the reliability of the model. Herein, we did Minmax normalization.

In [49]:
min, max = df['rating'].min(), df['rating'].max()
df['rating'] = df['rating'].apply(lambda x: (x - min)/ (max - min))

In [50]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp,user,movie
0,1,1,0.777778,964982703,0,0
1,1,3,0.777778,964981247,0,1
2,1,6,0.777778,964982224,0,2
3,1,47,1.0,964983815,0,3
4,1,50,1.0,964982931,0,4


### 4. Splitting the Dataset

In [51]:
# choose features
x = df[['user','movie']].values
y = df['rating'].values

# train and validation split
x_train, x_val, y_train, y_val = train_test_split(x,y,test_size = 0.1, random_state = 42)
print('Shape of the x_train: ', x_train.shape)
print('Shape of the y_train: ', y_train.shape)
print('Shape of the x_val: ', x_val.shape)
print('Shape of the y_val:', y_val.shape)

Shape of the x_train:  (90752, 2)
Shape of the y_train:  (90752,)
Shape of the x_val:  (10084, 2)
Shape of the y_val: (10084,)


### 5. Building the Model

Just a simple recap: In tensorflow, there are three ways to build model: 1) Sequential API; 2) Functional API; and 3) Model Subclassing. The last one is very powerful and fully customerizable. Here we will use the last one.

In [97]:
class RecommenderNet(tf.keras.Model):
    #__init function is to initialize the values of instance, and define the layers    
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        # super function is used to call the parent constructor(tf.keras.Model)
        super(RecommenderNet, self).__init__(**kwargs)
        
        # variable for embedding size
        self.embedding_size = embedding_size
        
        # variable for user count, and related weights and biases
        self.num_users = num_users
        # tf.keras.layers.Embedding() can only be used as the first layer in a model
        self.user_embedding = Embedding(num_users, embedding_size, 
                                        embeddings_initializer='he_normal',
                                        embeddings_regularizer = tf.keras.regularizers.l2(1e-6))
        self.user_bias = Embedding(num_users,1)
        
        # variables for movie count, and realted weights and biases
        self.num_movies = num_movies
        self.movie_embedding = Embedding(num_movies, embedding_size,
                                        embeddings_initializer = 'he_normal',
                                        embeddings_regularizer = tf.keras.regularizers.l2(16-6))
        self.movie_bias = Embedding(num_users, 1)
    
    # call function is where the operations are defined after layers are indifned in the __init__
    # call function accept the inputs, feeds them into the layers and final sigmoid layer
    def call(self,inputs):
        user_vector = self.user_embedding(inputs[:,0])
        user_bias = self.user_bias(inputs[:,0])
        
        movie_vector = self.movie_embedding(inputs[:,0])
        movie_bias = self.movie_bias(inputs[:,0])
        
        dot_user_movie = tf.tensordot(user_vector, movie_vector, 2)
        x = dot_user_movie + user_bias + movie_bias
        
        # The sigmoid activation forces the rating to between 0 and 1
        return tf.nn.sigmoid(x)
        
model = RecommenderNet(num_users, num_movies, embedding_size = 50)      

### 6. Compile and Train the Model

In [98]:
model.compile(loss = 'mse',optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001))

In [99]:
history = model.fit(x=x_train, y=y_train, batch_size = 64, epochs = 5, verbose =1, validation_data = (x_val,y_val))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### 7. Make Recommendations

In [100]:
# Randomly pick a user ID
user_id = df['userId'].sample(2).iloc[0]
print('The selected user ID is: ', user_id)

The selected user ID is:  490


In [101]:
# filter out the movies the user watched before
movies_watched = df[df['userId'] == user_id]

# filter out the movies the user not watched and corresponding encoder
not_watched = df[~df['movieId'].isin(movies_watched['movieId'].values)]['movieId'].unique()

not_watched_encoder = [[movie2encoder.get(x)] for x in not_watched]

print('The number of movies the user has not seen before: ', len(not_watched))

The number of movies the user has not seen before:  9611


In [102]:
# to understand the result of each step
print("not_watched_encoder: ", not_watched_encoder[:5])

print('take one from not_watched_encoder: ', not_watched_encoder[1])

print('not_watched_encoder[1][0]: ', not_watched_encoder[1][0])

not_watched_encoder:  [[1], [2], [3], [4], [5]]
take one from not_watched_encoder:  [2]
not_watched_encoder[1][0]:  2


In [103]:
user_encoder = user2encoder.get(user_id)

user_movie_array = np.hstack(([[user_encoder]]*len(not_watched), not_watched_encoder))

user_movie_array

array([[ 489,    1],
       [ 489,    2],
       [ 489,    3],
       ...,
       [ 489, 9721],
       [ 489, 9722],
       [ 489, 9723]])

In [111]:
# use the model.predict() functions to generate the predicted movie ratings
ratings = model.predict(user_movie_array).flatten()
ratings

array([0.58840495, 0.58840495, 0.58840495, ..., 0.5884053 , 0.5884053 ,
       0.5884053 ], dtype=float32)

In [123]:
# Tope 10 ratings
top10_indices = ratings.argsort()[-10:][::-1]

# convert our assigned movie ID to their original movieId
recommended_movie_ids = [encoder2movie.get(not_watched_encoder[x][0]) for x in top10_indices]
recommended_movie_ids

[163981,
 152372,
 147657,
 147662,
 148166,
 149011,
 163937,
 158721,
 160527,
 160836]

In [117]:
# load another dataset to connect the movieId with its name
with ZipFile(movielens_path) as z:
    with z.open('ml-latest-small/movies.csv') as f:
        movie_df = pd.read_csv(f)
movie_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [118]:
# top 10 movie watched, get the id first
top_watched_movieId = (movies_watched.sort_values(by = 'rating', ascending = False).head(10)['movieId'].values)
top_watched_movieId

array([ 6669,   306,   307,  1251,  4144, 78499,  4226, 79132,  2571,
        7361], dtype=int64)

In [119]:
# connect the top 10 watched movie id to their names
top_watched_movieinfo = movie_df[movie_df['movieId'].isin(top_watched_movieId)]
top_watched_movieinfo

Unnamed: 0,movieId,title,genres
266,306,Three Colors: Red (Trois couleurs: Rouge) (1994),Drama
267,307,Three Colors: Blue (Trois couleurs: Bleu) (1993),Drama
950,1251,8 1/2 (8½) (1963),Drama|Fantasy
1939,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
3087,4144,In the Mood For Love (Fa yeung nin wa) (2000),Drama|Romance
3141,4226,Memento (2000),Mystery|Thriller
4507,6669,Ikiru (1952),Drama
4909,7361,Eternal Sunshine of the Spotless Mind (2004),Drama|Romance|Sci-Fi
7355,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX
7372,79132,Inception (2010),Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX


In [125]:
# top 10 movies the collaborative filtering model would recommend the user 
recommended_movies = movie_df[movie_df['movieId'].isin(recommended_movie_ids)][['title','genres']]
recommended_movies

Unnamed: 0,title,genres
9153,Masked Avengers (1981),Action
9154,Return of the One-Armed Swordsman (1969),Action|Adventure
9156,Hitchcock/Truffaut (2015),Documentary
9175,He Never Died (2015),Comedy|Drama|Horror
9231,Southbound (2016),Horror
9292,Gen-X Cops (1999),Action|Comedy|Thriller
9330,Sympathy for the Underdog (1971),Action|Crime|Drama
9342,Hazard (2005),Action|Drama|Thriller
9389,Blair Witch (2016),Horror|Thriller
9390,31 (2016),Horror


-- Great Job! End! --

### Complentary Reading

### Similarity
You have 2 vectors x and y and want to measure similarity between them. A basic similarity function is the **inner product**---If x tends to be high where y is also high, and low where y is low, **the inner product** will be **high**, then the **vectors** are **more similar**.

However, the inner product is unbounded. One way to make it **bounded between -1 to 1** is to divide by the vectors, giving **cosine similarity**. Cosine similarity is actully bounded between 0 and 1 if x and y are non-negative. It can also be interpreted as the cosine of the angle between two vectors. 

But cosine similarity is not invariant to shifts. If x was shifted to x+1, the cosine similarity would change. So, that's why **Pearson correlation** come out. Pearson Correlation is the cosine similarity between centered versions of x and y, again bounded between -1 and 1. Unlike the cosine, the correlation is invariant to both scale and location changes of x and y.

Summary: Cosine similarity is normalized inner product. Pearson correlation is centered cosine similarity.

ref:https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

### Batch and Epoch
**Batch** is a hyperparameter that defines the number of samples to work through before updating the model parameters.
- Batch Gradient Descent: Batch size = size of training set
- Stochastic Gradient Descent: Batch size = 1
- Mini-Batch Gradient Descent: 1 < Batch Size < Size of training set


**Epoch** is a hyperparameter that defines the number times that the learning algorithm will work through the **entire training dataset**. The number of epochs is traditionally large, often hundreds or thousands, allowing the learning algorithm to run until the error from the model has been sufficiently minimized.

**Example**

Assume you have a dataset with 200 samples (rows of data) and you choose a batch size of 5 and 1,000 epochs.

This means that the dataset will be divided into 40 batches, each with five samples. The model weights will be updated after each batch of five samples.

This also means that one epoch will involve 40 batches or 40 updates to the model.

With 1,000 epochs, the model will be exposed to or pass through the whole dataset 1,000 times. That is a total of 40,000 batches during the entire training process.