## Data Preparation

In [1]:
from google.colab import drive

drive.mount('/content/drive')

ratings_path = '/content/drive/My Drive/movie_rec_system/ratings.csv'
users_path = '/content/drive/My Drive/movie_rec_system/users.csv'
movies_path = '/content/drive/My Drive/movie_rec_system/movies.csv'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading ratings file
# Ignore the timestamp column
ratings = pd.read_csv(ratings_path, sep='\t', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating'])

# Reading users file
users = pd.read_csv(users_path, sep='\t', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])

# Reading movies file
movies = pd.read_csv(movies_path, sep='\t', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])

Mounted at /content/drive


In [2]:
# Check the top 5 rows
print(ratings.head())

   user_id  movie_id  rating
0        1      1193       5
1        1       661       3
2        1       914       3
3        1      3408       4
4        1      2355       5


In [3]:
# Check the file info
print(ratings.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   user_id   1000209 non-null  int64
 1   movie_id  1000209 non-null  int64
 2   rating    1000209 non-null  int64
dtypes: int64(3)
memory usage: 22.9 MB
None


## Collaborative Filtering Recommendation Model
The content-based engine has some severe limitations. It can only suggest movies that are close to a given movie, so it cannot capture a user's tastes or make recommendations across genres.

The engine is also not personal: it doesn't capture the tastes and biases of an individual user. Anyone who queries it for recommendations based on a movie receives the same results for that movie, regardless of who they are.

Therefore, in this section, I will use Memory-Based Collaborative Filtering to make recommendations to movie users. The technique is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.

### Theory
There are 2 main types of memory-based collaborative filtering algorithms:
1. **User-User Collaborative Filtering**: Here we find look-alike users based on similarity and recommend movies that a user’s look-alikes have chosen in the past. This algorithm is very effective but takes a lot of time and resources, because it requires computing similarity information for every pair of users. Therefore, for platforms with a large user base, this algorithm is hard to implement without a strongly parallelizable system.
2. **Item-Item Collaborative Filtering**: This is quite similar to the previous algorithm, but instead of finding a user's look-alikes, we try to find a movie's look-alikes. Once we have the movie look-alike matrix, we can easily recommend similar movies to any user who has rated a movie from the dataset. This algorithm consumes far fewer resources than user-user collaborative filtering. For a new user, it takes far less time than user-user collaborative filtering, as we don’t need all the similarity scores between users. And with a fixed number of movies, the movie-movie look-alike matrix stays fixed over time.

![user_item_cf](images/user_item_cf.jpg)

In either scenario, we build a similarity matrix. For user-user collaborative filtering, the **user-similarity matrix** consists of a similarity (or distance) metric computed between every pair of users. Likewise, the **item-similarity matrix** measures the similarity between every pair of items.
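To make the two matrices concrete, here is a minimal sketch on a made-up 4-user by 3-movie matrix. The data and the name `toy_ratings` are hypothetical, and cosine similarity is used only because it stays well defined on such a tiny example.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Hypothetical toy user-item matrix: 4 users (rows) x 3 movies (columns), 0 = not rated
toy_ratings = np.array([
    [5, 3, 0],
    [4, 0, 1],
    [1, 1, 5],
    [0, 2, 4],
])

# Rows are users, so pairwise distances between rows compare users
user_similarity = 1 - pairwise_distances(toy_ratings, metric='cosine')
# Transposing makes the rows movies, so the same call compares items
item_similarity = 1 - pairwise_distances(toy_ratings.T, metric='cosine')

print(user_similarity.shape)  # one row and one column per user
print(item_similarity.shape)  # one row and one column per movie
```

The user-similarity matrix grows with the square of the number of users, while the item-similarity matrix grows with the square of the number of movies, which is why the item-item variant is cheaper when users far outnumber movies.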

There are 3 similarity metrics that are commonly used in collaborative filtering:
1. **Jaccard Similarity**:
    * Similarity is the number of users who have rated both item A and item B, divided by the number of users who have rated either A or B
    * It is typically used where we don’t have a numeric rating but just a boolean value, such as a product being bought or an ad being clicked

2. **Cosine Similarity**: (as in the Content-Based system)
    * Similarity is the cosine of the angle between the rating vectors of items A and B
    * The closer the vectors, the smaller the angle and the larger the cosine

3. **Pearson Similarity**:
    * Similarity is the Pearson correlation coefficient between the two vectors

For the purpose of diversity, I will use **Pearson Similarity** in this implementation.
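The three metrics can be compared side by side on two made-up item vectors. This is only an illustrative sketch: the helper names and the vectors `item_a` and `item_b` are hypothetical, and a rating of 0 is assumed to mean "not rated" for the Jaccard case.

```python
import numpy as np

def jaccard_similarity(a, b):
    """Users who rated both items divided by users who rated either item."""
    rated_a, rated_b = a > 0, b > 0
    union = np.logical_or(rated_a, rated_b).sum()
    return np.logical_and(rated_a, rated_b).sum() / union if union else 0.0

def cosine_similarity(a, b):
    """Cosine of the angle between the two rating vectors."""
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson_similarity(a, b):
    """Cosine similarity of the mean-centered vectors."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return a_c.dot(b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))

# Hypothetical ratings given to two movies by the same four users
item_a = np.array([5.0, 3.0, 0.0, 4.0])
item_b = np.array([4.0, 0.0, 0.0, 5.0])

print('Jaccard:', round(jaccard_similarity(item_a, item_b), 4))
print('Cosine :', round(cosine_similarity(item_a, item_b), 4))
print('Pearson:', round(pearson_similarity(item_a, item_b), 4))
```

Pearson similarity is cosine similarity applied after subtracting each vector's mean, which is what makes it less sensitive to raters who are systematically generous or harsh.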

### Implementation
I will use the file **ratings.csv** first as it contains User ID, Movie IDs and Ratings. These three elements are all I need for determining the similarity of the users based on their ratings for a particular movie.

First I do some quick data processing:

In [4]:
# Fill NaN values in user_id and movie_id column with 0
ratings['user_id'] = ratings['user_id'].fillna(0)
ratings['movie_id'] = ratings['movie_id'].fillna(0)

# Replace NaN values in rating column with average of all values
ratings['rating'] = ratings['rating'].fillna(ratings['rating'].mean())

Due to the limited computing power in my laptop, I will build the recommender system using only a subset of the ratings. In particular, I will take a random sample of 20,000 ratings (2%) from the 1M ratings.

In [5]:
# Randomly sample 2% of the ratings dataset
small_data = ratings.sample(frac=0.02)
# Check the sample info
print(small_data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 20004 entries, 187444 to 393032
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   user_id   20004 non-null  int64
 1   movie_id  20004 non-null  int64
 2   rating    20004 non-null  int64
dtypes: int64(3)
memory usage: 625.1 KB
None


Now I use the **scikit-learn library** to split the dataset into training and testing sets. **model_selection.train_test_split** shuffles and splits the data into two datasets according to the fraction of test examples, which in this case is 0.2.

In [6]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(small_data, test_size=0.2)

Now I need to create a user-item matrix. Since I have split the data into training and testing sets, I need to create two matrices. The training matrix contains 80% of the ratings and the testing matrix contains 20% of the ratings.

In [7]:
# Convert the training and testing ratings to NumPy arrays
# (one row per rating, with columns user_id, movie_id, rating)
train_data_matrix = train_data[['user_id', 'movie_id', 'rating']].to_numpy()
test_data_matrix = test_data[['user_id', 'movie_id', 'rating']].to_numpy()

# Check their shape
print(train_data_matrix.shape)
print(test_data_matrix.shape)

(16003, 3)
(4001, 3)
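The shapes above show one row per rating with three columns, rather than one row per user and one column per movie. A pivoted user-item matrix could be built along the following lines. This is a sketch on a hypothetical miniature table, not the code that produced the results in this notebook, and `toy` and `toy_user_item` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature of the ratings table, with the same three columns
toy = pd.DataFrame({
    'user_id':  [1, 1, 2, 3, 3],
    'movie_id': [10, 20, 10, 20, 30],
    'rating':   [5, 3, 4, 2, 1],
})

# One row per user, one column per movie; unrated pairs are filled with 0
toy_user_item = toy.pivot_table(index='user_id', columns='movie_id',
                                values='rating', fill_value=0)
print(toy_user_item)
print(toy_user_item.shape)
```

With a layout like this, each row is a user's rating vector and each column is a movie's rating vector, which is the form the similarity metrics described earlier operate on.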


Now I use the **pairwise_distances** function from sklearn with `metric='correlation'`. It returns the correlation distance, which is 1 minus the [Pearson Correlation Coefficient](https://stackoverflow.com/questions/1838806/euclidean-distance-vs-pearson-correlation-vs-cosine-similarity), so subtracting the result from 1 gives the Pearson similarity. This method provides a safe way to take a distance matrix as input, while preserving compatibility with many other algorithms that take a vector array.

In [8]:
from sklearn.metrics.pairwise import pairwise_distances

# User Similarity Matrix
user_correlation = 1 - pairwise_distances(train_data, metric='correlation')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation[:4, :4])

[[1.         0.95267253 0.99997883 0.57228483]
 [0.95267253 1.         0.95067412 0.79449545]
 [0.99997883 0.95067412 1.         0.56693629]
 [0.57228483 0.79449545 0.56693629 1.        ]]


In [9]:
# Item Similarity Matrix
item_correlation = 1 - pairwise_distances(train_data_matrix.T, metric='correlation')
item_correlation[np.isnan(item_correlation)] = 0
print(item_correlation[:4, :4])

[[ 1.         -0.01608875  0.01040977]
 [-0.01608875  1.         -0.0606499 ]
 [ 0.01040977 -0.0606499   1.        ]]


With the similarity matrix in hand, I can now predict the ratings that were not included with the data. Using these predictions, I can then compare them with the test data to attempt to validate the quality of our recommender model.

For the user-user CF case, I will treat the similarity between 2 users (A and B, for example) as a weight that is multiplied by the ratings of the similar user B (corrected for the average rating of that user). I also need to normalize by the sum of the weights so that the ratings stay between 1 and 5 and, as a final step, add back the average rating of the user whose ratings I am trying to predict. The idea here is that some users tend to give consistently high or low ratings to all movies, so the relative difference in the ratings that these users give is more important than the absolute values.
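As a sketch of this weighting, using notation introduced only for illustration ($\hat{r}_{u,i}$ is the predicted rating of user $u$ for movie $i$, $\bar{r}_u$ is the average rating of user $u$, $r_{v,i}$ is the rating user $v$ gave to movie $i$, and $s_{u,v}$ is the similarity between users $u$ and $v$):

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,(r_{v,i} - \bar{r}_v)}{\sum_{v} |s_{u,v}|}$$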

In [10]:
# Function to predict ratings
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        # Use np.newaxis so that mean_user_rating has same format as ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    print(pred)
    return pred
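To see what the user-based branch computes, here is a small worked example on made-up numbers. The matrices `toy_ratings` and `toy_similarity` are hypothetical, and the matrix is assumed to be a dense users-by-movies array.

```python
import numpy as np

# Hypothetical dense toy matrix: 3 users x 3 movies
toy_ratings = np.array([
    [5.0, 4.0, 3.0],
    [3.0, 2.0, 1.0],
    [1.0, 2.0, 3.0],
])
# Hypothetical user-user similarities (symmetric, 1 on the diagonal)
toy_similarity = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

# Each user's average rating, kept as a column so it broadcasts across movies
user_means = toy_ratings.mean(axis=1, keepdims=True)
# Ratings expressed as deviations from each user's own average
centered = toy_ratings - user_means
# Normalizing constant: sum of absolute similarities for each user
weights = np.abs(toy_similarity).sum(axis=1, keepdims=True)

toy_pred = user_means + toy_similarity.dot(centered) / weights
print(toy_pred.round(2))
```

Users 1 and 2 rank the movies in the same order, so their high similarity pulls each other's predictions in the same direction, while user 3, who ranks them in reverse, contributes only a small opposite correction.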

### Evaluation
There are many evaluation metrics, but one of the most popular metrics for evaluating the accuracy of predicted ratings is **Root Mean Squared Error (RMSE)**. I will use the **mean_squared_error (MSE)** function from sklearn; the RMSE is just the square root of the MSE.

$$\mathit{RMSE} =\sqrt{\frac{1}{N} \sum (x_i -\hat{x_i})^2}$$

Comparing user- and item-based collaborative filtering on the test data, it looks like user-based collaborative filtering gives a better result (a lower RMSE).

In [11]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Function to calculate RMSE
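def rmse(pred, actual):
    # Keep only the positions where the actual matrix is nonzero, i.e. ignore zero entries
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    # mean_squared_error expects (y_true, y_pred); the result is symmetric in its arguments
    return sqrt(mean_squared_error(actual, pred))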
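A tiny hand-checkable case shows the effect of the nonzero mask. The arrays below are hypothetical, and 0 is assumed to mean "no rating available".

```python
import numpy as np

# Hypothetical predictions and held-out ratings; 0 marks "no rating available"
toy_pred = np.array([[4.0, 3.0], [2.0, 5.0]])
toy_actual = np.array([[5.0, 0.0], [0.0, 3.0]])

# Only the two positions with an actual rating contribute to the error
mask = toy_actual.nonzero()
errors = toy_pred[mask] - toy_actual[mask]   # 4 - 5 and 5 - 3
toy_rmse = np.sqrt(np.mean(errors ** 2))     # sqrt((1 + 4) / 2)
print(toy_rmse)
```

The two masked-out predictions (3.0 and 2.0) have no influence on the score, so the metric only measures error where a ground-truth rating exists.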

In [12]:
# Predict ratings on the training data with both similarity matrices
user_prediction = predict(train_data_matrix, user_correlation, type='user')
item_prediction = predict(train_data_matrix, item_correlation, type='item')

# RMSE on the test data
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

[[ 2991.60899565   787.83962683  -533.44862248]
 [ 2050.0658315    158.30922864 -1479.37506014]
 [ 3924.40380119  1715.06182557   402.53437324]
 ...
 [ 2926.60172226   751.57406942  -612.17579167]
 [ 3693.52589068  1805.19908616   165.27502315]
 [ 3597.63186312  2367.24448529   463.12365159]]
[[ 3.05042960e+03  5.51777244e+01  2.70221265e+01]
 [ 5.24165208e+02  1.63562471e+02 -2.41681030e+00]
 [ 5.72376952e+03  6.15268737e+01  5.08132049e+01]
 ...
 [ 2.78301932e+03  1.45628856e+02  1.91025797e+01]
 [ 4.08850793e+03  1.27507591e+03 -3.77824823e+01]
 [ 3.36685619e+03  2.65926912e+03 -1.26633748e+02]]
User-based CF RMSE: 1419.8618959946425
Item-based CF RMSE: 1629.9997428294428


In [None]:
user_prediction.shape

(16003, 3)

In [None]:
train_data_matrix.shape

(16003, 3)

In [None]:
train_data.shape

(16003, 3)

In [None]:
train_data.describe()

            user_id      movie_id        rating
count  16003.000000  16003.000000  16003.000000
mean    3026.317753   1866.704368      3.585203
std     1727.261299   1097.541480      1.109677
min        4.000000      1.000000      1.000000
25%     1516.000000   1032.000000      3.000000
50%     3080.000000   1882.000000      4.000000
75%     4470.000000   2770.000000      4.000000
max     6040.000000   3952.000000      5.000000


In [None]:
# A NumPy array has no describe() method, so wrap it in a DataFrame first
pd.DataFrame(user_prediction).describe()

In [None]:
import pickle

# Assuming user_prediction is already computed
# Save user_prediction to a pickle file
user_prediction_path = '/content/drive/My Drive/movie_rec_system/user_prediction.pkl'
with open(user_prediction_path, 'wb') as f:
    pickle.dump(user_prediction, f)

In [None]:
# RMSE on the train data
print('User-based CF RMSE: ' + str(rmse(user_prediction, train_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, train_data_matrix)))

User-based CF RMSE: 693.9016693063587
Item-based CF RMSE: 222.11165678187047


The training RMSE measures how well the model explains the data it was built from, signal and noise alike. I noticed that my RMSE is very large: ratings range from 1 to 5, yet the errors are in the hundreds or thousands. I suppose I might have overfitted the training data.

Overall, Memory-based Collaborative Filtering is easy to implement and can produce reasonable prediction quality. However, there are some drawbacks to this approach:

* It doesn't address the well-known cold-start problem, that is, when a new user or a new item without any ratings enters the system.
* It can't deal with sparse data, meaning it's hard to find users that have rated the same items.
* It tends to recommend popular items.
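The sparsity point can be quantified as the share of user-movie pairs that have no rating. The sketch below uses a made-up matrix (`toy_user_item` is a hypothetical name) and assumes 0 marks an unrated pair.

```python
import numpy as np

# Hypothetical user-item matrix: 3 users x 4 movies, 0 = not rated
toy_user_item = np.array([
    [5, 0, 0, 1],
    [0, 0, 4, 0],
    [0, 3, 0, 0],
])

# Density is the fraction of cells that hold a rating; sparsity is the rest
density = np.count_nonzero(toy_user_item) / toy_user_item.size
sparsity = 1 - density
print('Sparsity: {:.1%}'.format(sparsity))
```

In this toy matrix no two users have rated the same movie, so there is no overlap from which to estimate how similar they are, which is exactly the difficulty sparse data creates.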

In [None]:
def recommend_movies(user_id, prediction_matrix, movies, ratings, num_recommendations=20):
    # Get and sort the user's predictions
    user_row_number = user_id - 1  # User ID starts at 1, not 0
    sorted_user_predictions = prediction_matrix[user_row_number].argsort()[::-1]

    # Get user's data and merge in movie information
    user_data = ratings[ratings.user_id == user_id]
    user_full = user_data.merge(movies, how='left', left_on='movie_id', right_on='movie_id').sort_values(['rating'], ascending=False)

    print('User {0} has already rated {1} movies.'.format(user_id, user_full.shape[0]))
    print('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations))

    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies[~movies['movie_id'].isin(user_full['movie_id'])]
                       .iloc[sorted_user_predictions]
                       .head(num_recommendations))

    return user_full, recommendations

# Example usage for user ID 344
user_full, recommendations = recommend_movies(344, user_prediction, movies, ratings)
print(recommendations)


User 344 has already rated 90 movies.
Recommending the highest 20 predicted ratings movies not already rated.
   movie_id                    title                        genres
0         1         Toy Story (1995)   Animation|Children's|Comedy
1         2           Jumanji (1995)  Adventure|Children's|Fantasy
2         3  Grumpier Old Men (1995)                Comedy|Romance


In [None]:
def recommend_movies(predictions, user_id, movies, ratings, num_recommendations=30):
    # Get and sort the user's predictions
    user_row_number = user_id - 1  # User ID starts at 1
    sorted_user_predictions = predictions[user_row_number].argsort()[::-1]

    # Get the user's data and merge in the movie information
    user_data = ratings[ratings.user_id == user_id]
    user_full = (user_data.merge(movies, how='left', left_on='movie_id', right_on='movie_id').
                 sort_values(['rating'], ascending=False))

    # Keep only the movies that the user hasn't rated yet.
    # Note: the ranking step (.iloc[sorted_user_predictions]) is disabled here,
    # so the movies come back in catalogue order rather than by predicted rating.
    recommendations = movies[~movies['movie_id'].isin(user_full['movie_id'])]

    # Ensure we get the top 'num_recommendations' not already rated by the user
    if len(recommendations) < num_recommendations:
        print(f"Warning: Only {len(recommendations)} recommendations available.")
        top_recommendations = recommendations
    else:
        top_recommendations = recommendations.head(num_recommendations)

    return user_full, top_recommendations

# Example usage:
user_id = 3 # specify the user_id you want to recommend movies to
user_full, recommendations = recommend_movies(user_prediction, user_id, movies, ratings, num_recommendations=20)
print(recommendations)
print(user_full)

    movie_id                                  title  \
0          1                       Toy Story (1995)   
1          2                         Jumanji (1995)   
2          3                Grumpier Old Men (1995)   
3          4               Waiting to Exhale (1995)   
4          5     Father of the Bride Part II (1995)   
5          6                            Heat (1995)   
6          7                         Sabrina (1995)   
7          8                    Tom and Huck (1995)   
8          9                    Sudden Death (1995)   
9         10                       GoldenEye (1995)   
10        11         American President, The (1995)   
11        12     Dracula: Dead and Loving It (1995)   
12        13                           Balto (1995)   
13        14                           Nixon (1995)   
14        15                Cutthroat Island (1995)   
15        16                          Casino (1995)   
16        17           Sense and Sensibility (1995)   
17        

In [42]:
def recommend_movies_for_new_user(user_ratings, movies, ratings, num_recommendations=20):
    # Keep only the rated movie IDs that exist in the movies DataFrame
    valid_movie_ids = set(user_ratings.keys()) & set(movies['movie_id'])

    # Note: despite its name, unrated_movies holds the movies the user has rated,
    # so the scores below rank only those titles
    unrated_movies = movies[movies['movie_id'].isin(valid_movie_ids)].copy()

    # Accumulate a genre-overlap score for each candidate, weighted by the user's ratings
    movie_similarity = np.zeros(len(unrated_movies))
    for i, (rated_movie_id, rated_rating) in enumerate(user_ratings.items()):
        rated_movie_genres = movies[movies['movie_id'] == rated_movie_id]['genres'].iloc[0]
        for j, (_, unrated_movie) in enumerate(unrated_movies.iterrows()):
            unrated_movie_genres = unrated_movie['genres']
            # Calculate similarity (for simplicity, just count number of common genres)
            similarity_score = len(set(rated_movie_genres.split('|')).intersection(unrated_movie_genres.split('|')))
            movie_similarity[j] += similarity_score * rated_rating
    print(movie_similarity)
    # Sort movies based on similarity scores
    unrated_movies['similarity'] = movie_similarity

    recommended_movies = unrated_movies.sort_values(by='similarity', ascending=False).head(num_recommendations)
    print(recommended_movies)
    return recommended_movies

# Example usage:
user_ratings = {1: 3, 334: 2, 3324: 5, 3334: 4}  # New user's ratings
recommended_movies = recommend_movies_for_new_user(user_ratings, movies, ratings, num_recommendations=20)
print(recommended_movies[['movie_id', 'title', 'genres']])


[14.  6.  8. 18.]
      movie_id                        title                          genres  \
3265      3334             Key Largo (1948)  Crime|Drama|Film-Noir|Thriller   
0            1             Toy Story (1995)     Animation|Children's|Comedy   
3255      3324         Drowning Mona (2000)                          Comedy   
330        334  Vanya on 42nd Street (1994)                           Drama   

      similarity  
3265        18.0  
0           14.0  
3255         8.0  
330          6.0  
      movie_id                        title                          genres
3265      3334             Key Largo (1948)  Crime|Drama|Film-Noir|Thriller
0            1             Toy Story (1995)     Animation|Children's|Comedy
3255      3324         Drowning Mona (2000)                          Comedy
330        334  Vanya on 42nd Street (1994)                           Drama


In [14]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

def recommend_movies_for_new_user(user_ratings, movies, ratings, num_recommendations=20):
    # Create a unique ID for the new user
    new_user_id = ratings['user_id'].max() + 1

    # Create a DataFrame for the new user's ratings
    new_ratings = pd.DataFrame({
        'user_id': [new_user_id] * len(user_ratings),  # Assign new_user_id to all new ratings
        'movie_id': list(user_ratings.keys()),         # List of movie IDs
        'rating': list(user_ratings.values())          # Corresponding list of ratings
    })

    # Combine existing ratings with new user's ratings
    updated_ratings = pd.concat([ratings, new_ratings], ignore_index=True)

    # Create a user-item matrix
    user_item_matrix = updated_ratings.pivot(index='user_id', columns='movie_id', values='rating').fillna(0)

    # Calculate user similarity using Pearson correlation
    user_similarity = 1 - pairwise_distances(user_item_matrix, metric='correlation')
    user_similarity[np.isnan(user_similarity)] = 0

    # Function to predict ratings
    def predict_ratings(ratings, similarity, type='user'):
        if type == 'user':
            mean_user_rating = ratings.mean(axis=1)
            ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
            pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
        elif type == 'item':
            pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
        return pred

    # Predict ratings for the new user
    user_prediction = predict_ratings(user_item_matrix.values, user_similarity, type='user')
    # Look up the new user's row by label instead of assuming contiguous user IDs
    new_user_row = user_item_matrix.index.get_loc(new_user_id)
    new_user_predicted_ratings = user_prediction[new_user_row]

    # Get all movies the new user has not rated
    unrated_movie_ids = user_item_matrix.columns[user_item_matrix.loc[new_user_id] == 0]

    # Create a DataFrame for the recommended movies (copy to avoid SettingWithCopyWarning)
    recommended_movies = movies[movies['movie_id'].isin(unrated_movie_ids)].copy()

    # Assign predicted ratings to the recommended movies
    recommended_movies['predicted_rating'] = recommended_movies['movie_id'].map(lambda x: new_user_predicted_ratings[user_item_matrix.columns.get_loc(x)])

    # Sort and return the top N recommendations
    recommended_movies = recommended_movies.sort_values(by='predicted_rating', ascending=False).head(num_recommendations)
    return recommended_movies

# Example usage:
user_ratings = {1: 3, 334: 2, 3324: 5, 3334: 4}  # New user's ratings
recommended_movies = recommend_movies_for_new_user(user_ratings, movies, ratings, num_recommendations=20)
print(recommended_movies[['movie_id', 'title', 'genres', 'predicted_rating']])


      movie_id                                              title  \
2789      2858                             American Beauty (1999)   
257        260          Star Wars: Episode IV - A New Hope (1977)   
1245      1265                               Groundhog Day (1993)   
3045      3114                                 Toy Story 2 (1999)   
1178      1196  Star Wars: Episode V - The Empire Strikes Back...   
2327      2396                         Shakespeare in Love (1998)   
1250      1270                          Back to the Future (1985)   
1959      2028                         Saving Private Ryan (1998)   
2502      2571                                 Matrix, The (1999)   
1180      1198                     Raiders of the Lost Ark (1981)   
604        608                                       Fargo (1996)   
589        593                   Silence of the Lambs, The (1991)   
1179      1197                         Princess Bride, The (1987)   
315        318                   S



In [15]:
# Example usage:
user_ratings = {123: 3, 3: 2, 324: 5, 334: 4}  # New user's ratings
recommended_movies = recommend_movies_for_new_user(user_ratings, movies, ratings, num_recommendations=20)
print(recommended_movies[['movie_id', 'title', 'genres', 'predicted_rating']])


      movie_id                                              title  \
1217      1236                                       Trust (1990)   
303        306                           Three Colors: Red (1994)   
497        501                                       Naked (1993)   
2102      2171                       Next Stop, Wonderland (1998)   
1033      1046                             Beautiful Thing (1996)   
322        326                            To Live (Huozhe) (1994)   
305        308                         Three Colors: White (1994)   
1260      1280                       Raise the Red Lantern (1991)   
260        263                           Ladybird Ladybird (1994)   
2291      2360                   Celebration, The (Festen) (1998)   
1685      1734            My Life in Pink (Ma vie en rose) (1997)   
1160      1176  Double Life of Veronique, The (La Double Vie d...   
169        171                                     Jeffrey (1995)   
2282      2351     Nights of Cabir



In [None]:
unrated_movies = movies[~movies['movie_id'].isin(user_full['movie_id'])]


In [None]:
if len(unrated_movies) < 20:
    print(f"Warning: Only {len(unrated_movies)} unrated movies available for recommendations.")
    num_recommendations = len(unrated_movies)
len(unrated_movies)

3697

## Alternative Approach
As I mentioned above, it looks like my Collaborative Filtering model suffers from an overfitting problem, as I only trained it on a small sample (2% of the 1M ratings). To deal with this, I need to apply dimensionality reduction techniques to capture more signal from the full dataset. This is where **low-dimensional factor models (aka Model-Based Collaborative Filtering)** come in. I won't be able to implement this approach in this notebook due to computing limits; however, I want to introduce it here to give you a general sense of its advantages.

In this approach, CF models are developed using machine learning algorithms to predict a user’s rating of unrated items. Model-based Collaborative Filtering has received considerable exposure in industry research, mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. An example is the [Netflix Prize](https://en.wikipedia.org/wiki/Netflix_Prize), a competition to find the best collaborative filtering algorithm for predicting user ratings of films based on previous ratings, without any other information about the users or films.

Matrix factorization is widely used for recommender systems where it can deal better with scalability and sparsity than Memory-based CF. The goal of MF is to learn the latent preferences of users and the latent attributes of items from known ratings (learn features that describe the characteristics of ratings) to then predict the unknown ratings through the dot product of the latent features of users and items. As per my understanding, the algorithms in this approach can further be broken down into 3 sub-types:

* **Matrix Factorization (MF)**: The idea behind such models is that the attitudes or preferences of a user can be determined by a small number of hidden latent factors. These factors are also called **Embeddings**, which represent different characteristics of users and items. Matrix factorization can be done by various methods, including Singular Value Decomposition (SVD), Probabilistic Matrix Factorization (PMF), and Non-Negative Matrix Factorization (NMF).

* **Clustering based algorithm (KNN)**: The idea of clustering is the same as that of memory-based recommendation systems. In memory-based algorithms, we use the similarities between users and/or items as weights to predict a rating for a user and an item. The difference is that the similarities in this approach are calculated with an unsupervised learning model, rather than Pearson correlation or cosine similarity.

* **Neural Nets / Deep Learning**: The idea of using Neural Nets is similar to that of Model-Based Matrix Factorization. In matrix factorization, we decompose our original sparse matrix into the product of 2 low-rank orthogonal matrices. For a neural net implementation, we don’t need them to be orthogonal; we want our model to learn the values of the embedding matrices itself. The user latent features and movie latent features are looked up from the embedding matrices for a specific movie-user combination. These are the input values for further linear and non-linear layers. We can pass this input to multiple ReLU, linear or sigmoid layers and learn the corresponding weights with any optimization algorithm (Adam, SGD, etc.).
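As a rough illustration of the matrix factorization idea, the sketch below applies a truncated SVD from NumPy to a small, fully observed, made-up matrix. The matrix `toy_ratings` and the choice `k = 2` are assumptions for illustration only; production recommenders typically fit the factors on the observed ratings alone rather than on a dense matrix.

```python
import numpy as np

# Hypothetical dense toy matrix: 4 users x 4 movies
toy_ratings = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

k = 2  # number of latent factors kept (an arbitrary choice for this toy case)

# Remove each user's average so the factors model deviations from it
user_means = toy_ratings.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(toy_ratings - user_means, full_matrices=False)

user_factors = U[:, :k] * s[:k]   # user embeddings, shape (4, k)
item_factors = Vt[:k, :]          # item embeddings, shape (k, 4)

# Predicted rating = user average + dot product of user and item embeddings
approx = user_means + user_factors.dot(item_factors)
print(approx.round(2))
```

Each predicted rating is the dot product of a user's latent vector and a movie's latent vector, plus the user's average, so `k` controls how much of the original matrix the factors are allowed to memorize.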

![memory-model-cf](images/memory-model-cf.jpg)

## Summary
In this post, I introduced the MovieLens dataset for building a movie recommendation system.

Specifically, I covered:

* How to load and review the data.
* How to develop a content-based recommendation model based on movie genres.
* How to develop a memory-based collaborative filtering model based on user ratings.
* A glimpse at model-based collaborative filtering models as alternative options.