# Movie Recommendations Using Collaborative Filtering

## Motivation

#### Overall Objective of the project: 

- Predict users' movie preferences and suggest a list of movies that are likely to be watched. In other words, give personalized suggestions on movies to watch for each user.


### Recommendation systems :


#### What is a Recommendation system?

- A recommendation system suggests a product or service to customers who are likely to consume or purchase it.

- Recommendation systems are utilized in e-commerce to predict the preference a user might give to an item.
  - Netflix : Which movie to watch.
  - Amazon : Which products are similar to the one purchased.


- Other examples of recommendation systems and applications:

  - Spotify recommends music and playlists
  - Facebook recommends friends
  - LinkedIn recommends jobs

- The Netflix Prize (competition, 2009)

  - Competition for the best collaborative filtering algorithm to predict user ratings for films.

  - Data: around 100M ratings from 500K users on 18K movies.

  - Winning team used: an ensemble of models, with regularized matrix factorization as a core component.


#### Recommendation system

Data required for a recommendation system:

- User ratings data
- Variables related to items or users (movie genre, duration of the movie, etc.)

#### Recommendation systems : Collaborative filtering

- **Main idea**: Recommending items to users based on the preferences of similar users.
   - Based on the data, we assume that users who have agreed in the past tend to agree in the future as well.
   
   
- We only have users' ratings of items.
     - Users are consumers.
     - Items are the products or services offered.


- **Approach** : Build a "utility" matrix that captures interactions between users and items.
   - each row is a user
   - each column is an item
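
As a rough illustration of the utility matrix described above, the sketch below pivots a small, made-up ratings table into a users-by-items matrix. The `toy_ratings` table and its values are invented for this example; user-item pairs without a rating show up as `NaN`.

```python
import pandas as pd

# Hypothetical ratings: one row per (user, item) interaction.
toy_ratings = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3, 3],
        "movie_id": [10, 20, 20, 10, 30],
        "rating": [5, 3, 4, 2, 1],
    }
)

# Rows are users, columns are items; unrated pairs become NaN.
utility = toy_ratings.pivot(index="user_id", columns="movie_id", values="rating")
print(utility)
```

Even in this tiny example, 4 of the 9 cells are empty, which is the sparsity issue discussed next.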

#### Collaborative filtering : Challenges


- Sparsity of the utility matrix:
   - Usually users only interact with a few items.
      - Netflix users rate only a few movies.
   
- Objective:
   - Given a utility matrix of $N$ users and $M$ items, fill in the missing entries to complete the utility matrix.
     


#### Problem formulation

- An Unsupervised Learning approach
   - Only uses the user-item utility matrix.

- Goal : learn latent features related to users and items.
   - **Matrix factorization algorithm** on the utility matrix to learn latent features related to typical users and typical items.
   - Use reconstructions to fill in missing entries.

#### Problem formulation - Matrix Factorization


$$\hat{Y} = Z W$$
$$Y \approx Z W$$

- $Z$ ($N \times k$) - Transformed data. Row $Z_i$ maps user $i$ to $k$ latent features.
- $W$ ($k \times M$) - Weights. Column $W_j$ maps item $j$ to the same $k$ latent features.
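
Written entry by entry, and assuming $k$ latent features (a symbol used here for illustration), the predicted rating of user $i$ for item $j$ is the inner product of row $i$ of $Z$ and column $j$ of $W$:

$$\hat{Y}_{i,j} = W_{j}^T Z_{i} = \sum_{l=1}^{k} Z_{i,l} W_{l,j}$$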

Because the data is sparse, the loss function (squared error) is adapted to sum only over the available ratings:


$$\sum_{(i, j) \in R}  (W_{j}^T  Z_{i} - Y_{i,j})^2$$

- Where $R$ is the set of (user, item) pairs $(i, j)$ for which a rating is available.
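
A minimal NumPy sketch of this masked loss, followed by plain gradient descent on it, might look as follows. The toy matrix `Y`, the number of latent features `k`, the learning rate, and the step count are all assumptions chosen for illustration; this is not the optimizer used later in the notebook.

```python
import numpy as np

# Toy utility matrix: 4 users x 3 items, NaN marks a missing rating.
Y = np.array(
    [
        [5.0, 3.0, np.nan],
        [4.0, np.nan, 1.0],
        [np.nan, 2.0, 5.0],
        [1.0, 1.0, np.nan],
    ]
)
observed = ~np.isnan(Y)  # boolean mask of the available ratings R


def masked_loss(Z, W, Y, observed):
    """Sum of squared errors over the observed entries only."""
    residual = np.where(observed, Z @ W - Y, 0.0)
    return float(np.sum(residual ** 2))


k = 2  # assumed number of latent features
rng = np.random.default_rng(0)
Z = 0.1 * rng.standard_normal((Y.shape[0], k))  # user factors
W = 0.1 * rng.standard_normal((k, Y.shape[1]))  # item factors

initial_loss = masked_loss(Z, W, Y, observed)
learning_rate = 0.005
for _ in range(2000):
    # Missing entries contribute zero residual, hence zero gradient.
    residual = np.where(observed, Z @ W - Y, 0.0)
    grad_Z = 2 * residual @ W.T
    grad_W = 2 * Z.T @ residual
    Z -= learning_rate * grad_Z
    W -= learning_rate * grad_W
final_loss = masked_loss(Z, W, Y, observed)

Y_hat = Z @ W  # dense reconstruction, including the missing entries
print(f"loss before: {initial_loss:.2f}, after: {final_loss:.2f}")
```

The reconstruction `Y_hat` has a value in every cell, so the entries that were `NaN` in `Y` can be read off as predicted ratings. No regularization term is included here, so the sketch will overfit on real data.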

#### Collaborative filtering

- Advantage
  - It works well when the data is small.
  - Requires little domain knowledge

- Disadvantage
  - Cold-start problem: cannot make predictions for new items that have no ratings yet.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
import seaborn as sns
import surprise
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import cross_validate
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 2000)



### Dataset

- For this analysis the [MovieLens 100K Dataset](https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset) is utilized.

- The data were collected by the GroupLens Research Project, a research group in the Department of Computer Science and Engineering at the University of Minnesota.

This data set consists of:

- 100,000 ratings (1-5) from 943 users on 1682 movies.
- Each user has rated at least 20 movies.
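
The counts in this list already indicate how sparse the utility matrix will be. A quick back-of-the-envelope check, using only the figures quoted above:

```python
# Counts quoted in the dataset description.
n_ratings = 100_000
n_users = 943
n_movies = 1682

n_cells = n_users * n_movies   # size of the full utility matrix
density = n_ratings / n_cells  # fraction of cells that hold a rating

print(f"cells: {n_cells:,}  density: {density:.1%}")
```

Roughly 6% of the user-movie cells hold a rating; the rest are the missing entries the model has to fill in.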

The data was collected during a seven-month period from September 19th, 1997 through April 22nd, 1998.

In [None]:
# data

cols = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_csv(
    os.path.join("data", "ml-100k", "u.data"),
    sep="\t",
    names=cols,
    encoding="latin-1",
)

# The timestamp is not used in this analysis.
ratings = ratings.drop(columns=["timestamp"])
ratings.head()


In [None]:
# Movies data

cols = [
    "movie_id",
    "movie_title",
    "release_date",
    "video_release_date",
    "IMDb_URL",
    "unknown",
    "Action",
    "Adventure",
    "Animation",
    "Children",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film-Noir",
    "Horror",
    "Musical",
    "Mystery",
    "Romance",
    "Sci-Fi",
    "Thriller",
    "War",
    "Western",
]

movies_data = pd.read_csv(
    os.path.join("data", "ml-100k", "u.item"),
    sep="|",
    names=cols,
    encoding="latin-1",
)
movies_data.head()

movie_titles = movies_data[['movie_id', 'movie_title']]
movie_titles.head(3)

## Exploratory Data Analysis (EDA)

In [None]:
# Description of the data

user_key = "user_id"
item_key = "movie_id"

N = len(np.unique(ratings[user_key]))
M = len(np.unique(ratings[item_key]))
print("Number of users  : %d" % N)
print("Number of movies : %d" % M)

In [None]:
print("Average number of ratings per user : %.0f" % (len(ratings) / N))
print("Average number of ratings per movie: %.0f" % (len(ratings) / M))

In [None]:
# Attach the movie title to every rating
movies_full = pd.merge(ratings, movie_titles, on='movie_id')
movies_full.head()

In [None]:
# Number of ratings and mean rating per movie title
movies_agg = movies_full[['movie_title','rating']].groupby(by='movie_title').agg(['count','mean'])['rating'].reset_index()

# Highest-rated movies across all users, among movies with at least 20 ratings
# (.copy() avoids modifying a slice of movies_agg)
top_rankings = movies_agg[movies_agg['count'] >= 20].sort_values('mean', ascending=False).iloc[:5].copy()
top_rankings['mean'] = top_rankings['mean'].round(decimals=2)
top_rankings

In [None]:
movies_agg = movies_full[['movie_title','rating']].groupby(by='movie_title').agg(['count','mean'])['rating'].reset_index()

# Most frequently rated movies (among movies with at least 20 ratings)

movies_agg[movies_agg['count'] >= 20].sort_values('count', ascending=False).iloc[:5][['movie_title','count']]

In [None]:
# Lowest-rated movies, among movies with at least 20 ratings
low_rankings = movies_agg[movies_agg['count'] >= 20].sort_values('mean', ascending=True).iloc[:5].copy()
low_rankings['mean'] = low_rankings['mean'].round(decimals=2)
low_rankings

In [None]:
movies_agg[movies_agg['count'] >=20].sort_values('count', ascending=True).iloc[:5][['movie_title','count']]

In [None]:
df_grp = movies_full[['movie_id','rating']].groupby(by='movie_id').agg(['count','mean'])['rating'].reset_index()

plt.figure(figsize=(12,6))
plt.hist(df_grp['count'], bins=30)
plt.title('Distribution of the number of ratings given per movie', fontsize=18)
plt.xlabel('Number of times the movie was rated', fontsize=15)
plt.ylabel('Number of films', fontsize=15)
plt.savefig("hist_dist_count_ratings.png",dpi=400, bbox_inches='tight')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.kdeplot(data = df_grp, x = 'mean', bw_method=0.5)
plt.title('Distribution of Mean movie rating', fontsize=13)
plt.xlabel('Mean movie rating', fontsize=11)
plt.show();

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data=df_grp, x='mean', y='count')
plt.title('Scatter plot of movie rating', fontsize=13)
plt.xlabel('Mean movie rating', fontsize=11)
plt.ylabel('Count of movie ratings', fontsize=11)
plt.show();

In [None]:
genres = list(movies_data.columns[6:])
genre_counts = movies_data[genres].sum(axis=0)
genre_counts= pd.DataFrame(genre_counts,  columns=['counts'])
genre_counts_sorted = genre_counts.sort_values(by = 'counts', ascending=False)

In [None]:
sns.set(style="darkgrid")
plt.figure(figsize=(12,8))
plt.title('Most Popular Genres', fontsize=18)
# Take the labels from the sorted frame so that each bar matches its count.
sns.barplot(x='counts', y=genre_counts_sorted.index, data=genre_counts_sorted)
plt.ylabel('Genres', fontsize=15)
plt.xlabel('Count', fontsize=15);

## Modeling

In [None]:
X = ratings.copy()
y = ratings["user_id"]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Map user and movie ids to row and column indices of the utility matrix (and back)

user_mapper = dict(zip(np.unique(ratings[user_key]), list(range(N))))
item_mapper = dict(zip(np.unique(ratings[item_key]), list(range(M))))
user_inverse_mapper = dict(zip(list(range(N)), np.unique(ratings[user_key])))
item_inverse_mapper = dict(zip(list(range(M)), np.unique(ratings[item_key])))


train_mat = None
valid_mat = None

In [None]:
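def create_Y_from_ratings(data, N, M):
    '''
    Build the N x M utility (rating) matrix from a ratings DataFrame.
    Entries without a rating are NaN.
    '''
    Y = np.full((N, M), np.nan)
    # Translate ids to matrix indices, then fill all ratings at once.
    rows = data[user_key].map(user_mapper).to_numpy()
    cols = data[item_key].map(item_mapper).to_numpy()
    Y[rows, cols] = data["rating"].to_numpy()

    return Y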

In [None]:
train_mat = create_Y_from_ratings(X_train, N, M)
valid_mat = create_Y_from_ratings(X_valid, N, M)

print("train_mat shape: ", train_mat.shape)
print("valid_mat shape: ", valid_mat.shape)

In [None]:
train_mat

In [None]:
print("Fraction of non-NaN elements in train_mat: ", np.sum(~np.isnan(train_mat)) / train_mat.size)
print("Number of non-NaN elements in valid_mat: ", np.sum(~np.isnan(valid_mat)))

In [None]:
def error(Y1, Y2):
    """
    Returns the root mean squared error (RMSE).
    """
    return np.sqrt(np.nanmean((Y1 - Y2) ** 2))


def evaluate(pred_Y, train_mat, valid_mat, model_name="Global average"):
    print("%s train RMSE: %0.2f" % (model_name, error(pred_Y, train_mat)))
    print("%s valid RMSE: %0.2f" % (model_name, error(pred_Y, valid_mat)))
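
To see why the same `error` function works for both the training and validation matrices, the toy check below (with made-up values) reproduces its formula: any position that is `NaN` in either matrix yields a `NaN` difference, which `np.nanmean` then skips.

```python
import numpy as np

# Hypothetical predictions and targets; NaN marks a missing rating.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
target = np.array([[2.0, np.nan], [np.nan, 4.0]])

# Same formula as error(): only the two observed targets contribute,
# with squared errors 1 and 0, so the mean is 0.5.
rmse = np.sqrt(np.nanmean((pred - target) ** 2))
print(round(float(rmse), 4))
```

The result is the square root of 0.5, about 0.71, computed from the two observed cells alone.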

In [None]:
# Baseline model: predict the global average rating for every user-movie pair

avg = np.nanmean(train_mat)
pred_g = np.zeros(train_mat.shape) + avg
evaluate(pred_g, train_mat, valid_mat, model_name="Global average")

In [None]:
results = {'Components' : [],
          'RMSE': []}

### Hyperparameter optimization

In [None]:
# Regularized matrix factorization (Surprise SVD): tune the number of latent factors

reader = Reader()
data = Dataset.load_from_df(ratings, reader)

results = {'Components' : [],
          'Mean RMSE': []}

for n in range(5 , 30, 5):
    model_svd = SVD(n_factors=n, random_state=42)
    # 5-fold cross-validation; average the test RMSE over the folds
    cv_results = cross_validate(model_svd, data, measures=['RMSE'], cv=5, verbose=False)
    mean_rmse = round(cv_results['test_rmse'].mean(), 3)
    results['Components'].append(n)
    results['Mean RMSE'].append(mean_rmse)
    print(f"Surprise SVD {n} components Mean RMSE {mean_rmse}")

In [None]:
results_df = pd.DataFrame(results)

plt.figure(figsize=(12,6))
plt.plot(results_df['Components'], results_df['Mean RMSE'])
plt.title('Mean RMSE by component parameter', fontsize=18)
plt.xlabel('Number of components', fontsize=15)
plt.ylabel('Mean RMSE', fontsize=15)
plt.show()

In [None]:
trainset, validset = surprise.model_selection.train_test_split(
    data, test_size=0.2, random_state=42
)

k = 10
algo = SVD(n_factors=k, random_state=42)
algo.fit(trainset)
svd_preds = algo.test(validset)
print(f"RMSE score for k={k} factors: {accuracy.rmse(svd_preds, verbose=False):.2f}")

### Recommendation Outputs

In [None]:
# Code inspired by Nicolas Hug's recommendation system examples

from collections import defaultdict


def top_n_recs(user_id, n=5):
    '''
    Return the top-n movies (title and predicted rating) for one user_id.
    Only (user, movie) pairs present in svd_preds, i.e. the validation set, are ranked.
    '''
    top_n = get_top_n(svd_preds, n=n)
    data_temp = pd.DataFrame(top_n[user_id], columns=["movie_id", "pred"])
    return pd.merge(data_temp, movie_titles, on='movie_id', how='left')[['movie_title', 'pred']]


def get_top_n(predictions, n=10):
    """
    Return the top-n recommendations for each user from a list of SVD model predictions,
    as a dict mapping user id to a list of (movie_id, estimated rating) tuples.
    """
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
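
The grouping-and-sorting logic can be checked without training a model. The sketch below feeds a few hand-made prediction tuples, laid out like Surprise's `Prediction` named tuple (`uid`, `iid`, `r_ui`, `est`, `details`), through a condensed copy of the same logic. `ToyPrediction`, `top_n_per_user`, and all values are hypothetical stand-ins for illustration.

```python
from collections import defaultdict, namedtuple

# Stand-in for surprise.Prediction, with the same field order.
ToyPrediction = namedtuple("ToyPrediction", ["uid", "iid", "r_ui", "est", "details"])

toy_preds = [
    ToyPrediction(1, 10, 4.0, 3.9, {}),
    ToyPrediction(1, 20, 2.0, 4.6, {}),
    ToyPrediction(1, 30, 5.0, 2.1, {}),
    ToyPrediction(2, 10, 3.0, 3.2, {}),
]


def top_n_per_user(predictions, n=2):
    """Group estimates by user, then keep the n highest per user."""
    grouped = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        grouped[uid].append((iid, est))
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in grouped.items()
    }


print(top_n_per_user(toy_preds))
```

User 1 keeps movies 20 and 10 (estimates 4.6 and 3.9) and drops movie 30; user 2 has a single prediction, so the list is simply shorter than `n`.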

In [None]:
size = 3
u_id_sample = ratings["user_id"].sample(size).to_list()
u_id_sample

In [None]:
n = 5
for user_id in u_id_sample:
    print("\nTop %d recommendations for user with id : %d" % (n, user_id))
    df = top_n_recs(user_id, n=n)
    df['pred'] = df['pred'].round(decimals = 2)
    print(df)