<a href="https://colab.research.google.com/github/BandaVallary/Movie_Recommendation_System/blob/kelvin/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommendation System

!["Image of Donald Draper in a Movie Theater"](https://github.com/BandaVallary/Movie_Recommendation_System/blob/kelvin/images/dondraperinmovietheater.jpg?raw=1)

## Overview

This project aims to develop a recommendation system that provides personalized movie recommendations based on user ratings. Utilizing the [MovieLens dataset](https://web.archive.org/web/20240828133414/https://grouplens.org/datasets/movielens/latest/) from the GroupLens research lab at the University of Minnesota, the model will be trained on a subset of the dataset containing 100,000 user ratings. The primary objective is to build a machine learning model that can accurately recommend the top 5 movies to a user, based on their ratings of other movies. This system can be valuable for streaming platforms and movie enthusiasts, offering tailored movie suggestions to enhance user experience and engagement.

The project will involve several steps, including data cleaning, exploratory data analysis, feature engineering, model selection, and evaluation.

Throughout the project, we will also explore relationships between variables in the data and how they affect recommendations, to gain insight into user preferences and identify areas for improvement. A system that suggests movies accurately can help streaming platforms engage their users, improve customer satisfaction, and increase viewership.

## Business Understanding

The entertainment industry, particularly streaming platforms, is highly competitive, with companies constantly striving to enhance user engagement and satisfaction. One of the major challenges faced by these platforms is providing personalized content recommendations that keep users engaged and reduce churn rates.

Personalized recommendations are widely credited with increasing user engagement and satisfaction, which in turn supports retention and viewership. This motivates a robust recommendation system that suggests movies based on each user's preferences. By providing the top 5 movie recommendations to a user based on their ratings of other movies, streaming platforms can offer a more tailored viewing experience.

The business value of this project lies in its ability to help streaming platforms improve their content recommendation strategies, increase user satisfaction, and reduce churn rates. By developing a recommendation system that can accurately suggest movies, platforms can better engage their users, leading to increased viewership and subscription renewals. This can provide an edge in a crowded market, ultimately driving revenue growth and customer loyalty.

In [1]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357280 sha256=4060e577681d1841d42b6e3660c3cd823c66b820d49f0dd1e5125e637653a93c
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a

In [2]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set_style('whitegrid')

from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV
from surprise import Reader, Dataset

## Reading the Data

In [3]:
# Loading the data
movies = pd.read_csv('/content/movies.csv')
ratings = pd.read_csv('/content/ratings.csv')
tags = pd.read_csv('/content/tags.csv')

## Tidying the data (ratings df)

1. Check data types to determine which columns are numerical and which are categorical.
2. Check for null values.
3. Check for duplicate values.

I check for null values in the dataset. There are none.

In [4]:
# checking for null values
ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0


I then check for duplicates. There are none.

In [5]:
# check for duplicates
ratings.duplicated().sum()

0

In [6]:
# check data types
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64


In [7]:
# Drop the timestamp column
ratings.drop('timestamp', axis=1, inplace=True)

Now, I'll add the `title` column from the `movies` dataframe to the `ratings` dataframe based on the common `movieId` column.

In [8]:
# extracting the title from the `movies` df based on the `movieId`
ratings_with_titles = ratings.merge(movies[['movieId', 'title']], on='movieId', how='left')
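A left merge like the one above should keep exactly one row per rating, provided each `movieId` appears only once in `movies`. The sketch below illustrates that on made-up data (the `toy_*` frames are hypothetical stand-ins, not the MovieLens files) and uses pandas' `validate` argument to assert the many-to-one relationship.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration)
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [10, 20, 10],
    'rating': [4.0, 3.5, 5.0],
})
toy_movies = pd.DataFrame({
    'movieId': [10, 20, 30],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})

# validate='many_to_one' raises a MergeError if a movieId is duplicated in toy_movies
toy_merged = toy_ratings.merge(
    toy_movies[['movieId', 'title']],
    on='movieId',
    how='left',
    validate='many_to_one',
)

# A left merge against unique keys does not change the number of rating rows
print(len(toy_merged) == len(toy_ratings))
print(toy_merged['title'].tolist())
```

If the check is useful, the same `validate='many_to_one'` argument could be added to the real merge without changing its result.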

I'm starting with the `ratings_with_titles` df for my basic recommendation model because it contains key user-item interactions, which are essential for collaborative filtering. This dataframe shows which users rated which movies and their ratings, allowing me to identify patterns in preferences.

I'll use collaborative filtering, focusing on either:

- User-Based: Recommending movies based on the preferences of similar users.
- Item-Based: Suggesting movies based on similarities between items that users have liked.

Using `ratings_with_titles` gives me a solid foundation for identifying these relationships, and I can later integrate `movies` and `tags` dataframes for more nuanced recommendations.
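To make the user-based versus item-based distinction concrete, here is a minimal sketch on a made-up rating matrix. It is a simplification: unrated entries are treated as zeros, whereas libraries such as scikit-surprise typically compute similarity over co-rated items only. The matrix `R` and the helper `cosine_similarity_matrix` are illustrative names, not part of the notebook.

```python
import numpy as np

# Toy user-item rating matrix (rows = users, columns = movies); 0 means "not rated".
# Values are made up for illustration.
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 0.0, 5.0],
])

def cosine_similarity_matrix(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / norms
    return unit @ unit.T

# User-based: compare users (rows) to find people with similar taste
user_sim = cosine_similarity_matrix(R)

# Item-based: compare movies (columns) to find movies rated alike
item_sim = cosine_similarity_matrix(R.T)

print(np.round(user_sim, 2))
print(np.round(item_sim, 2))
```

In this toy data, users 0 and 1 come out as much more similar to each other than either is to user 2, and movies 0 and 1 are closer to each other than to movie 2.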

Now, I'll create a scikit-surprise `Dataset` from the `ratings` dataframe.

In [9]:
# reading the values as scikit surprise dataset
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
trainset = data.build_full_trainset()

In [10]:
# parameter grid for SVD
param_grid = {
    'n_factors': [50, 100],
    'n_epochs': [20, 30],
    'lr_all': [0.002, 0.005],
    'reg_all': [0.4, 0.6]
}

In [11]:
# create the grid search object
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3, joblib_verbose=5, n_jobs=-1)
gs.fit(data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:   16.1s
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  1.1min finished
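The "48 out of 48" in the log is consistent with an exhaustive grid search: every combination of hyperparameter values is fitted once per cross-validation fold. The sketch below only reproduces that count with `itertools.product`; it is not how the library enumerates candidates internally.

```python
from itertools import product

# Same grid as the SVD search above
svd_param_grid = {
    'n_factors': [50, 100],
    'n_epochs': [20, 30],
    'lr_all': [0.002, 0.005],
    'reg_all': [0.4, 0.6],
}
cv_folds = 3

# Every combination of one value per hyperparameter
combinations = [
    dict(zip(svd_param_grid, values))
    for values in product(*svd_param_grid.values())
]

# One fit per combination per fold
n_fits = len(combinations) * cv_folds
print(len(combinations), n_fits)  # 16 48
```

Adding one more value to any list multiplies the number of fits, which is worth keeping in mind before widening the grid.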


In [12]:
# Get the best parameters
print(gs.best_score)
print(gs.best_params)

{'rmse': 0.8834726804864169, 'mae': 0.6825169425864511}
{'rmse': {'n_factors': 50, 'n_epochs': 30, 'lr_all': 0.005, 'reg_all': 0.4}, 'mae': {'n_factors': 50, 'n_epochs': 30, 'lr_all': 0.005, 'reg_all': 0.4}}


In [13]:
# Cross validation with KNNBasic
# Parameter grid for KNNBasic
param_grid = {
    'k': [20, 40, 60],
    'min_k': [1, 2, 3],
    'sim_options': {
        'name': ['cosine', 'msd', 'pearson'],
        'user_based': [True, False]
    }
}

# Create the grid search object
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)

# Fit the data
gs.fit(data)

# Get the best parameters and scores
print("Best RMSE:", gs.best_score['rmse'])
print("Best MAE:", gs.best_score['mae'])
print("Best Parameters:", gs.best_params['rmse'])

Best RMSE: 0.9125338724480567
Best MAE: 0.7017337030146399
Best Parameters: {'k': 60, 'min_k': 3, 'sim_options': {'name': 'msd', 'user_based': False}}


In [None]:
# These 2 code cells run faster than the cell above

# Cross validation with KNNBasic
# knn_basic = KNNBasic(sim_options={'name':'msd', 'user_based':True})
# cv_knn_basic = cross_validate(knn_basic, data, n_jobs=-1)

# and this
# Average RMSE
# for i in cv_knn_basic.items():
#     print(i)
# print('-----------------------')
# print(np.mean(cv_knn_basic['test_rmse']))

In [14]:
knn_baseline = KNNBaseline(sim_options={'name':'msd', 'user_based':True})
cv_knn_baseline = cross_validate(knn_baseline,data)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


In [15]:
# print out the average score for the test set
for i in cv_knn_baseline.items():
    print(i)

np.mean(cv_knn_baseline['test_rmse'])

('test_rmse', array([0.86696806, 0.88212196, 0.87137691, 0.87282619, 0.87868555]))
('test_mae', array([0.66206073, 0.67227165, 0.66788844, 0.66811934, 0.6714491 ]))
('fit_time', (0.5258605480194092, 0.5425300598144531, 0.5330810546875, 0.5218255519866943, 0.5999324321746826))
('test_time', (1.921116828918457, 1.7216105461120605, 1.6934607028961182, 1.7754650115966797, 2.6083710193634033))


0.8743957317789242

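Based on these outputs, the tuned SVD model (n_factors = 50, n_epochs = 30, lr_all = 0.005, reg_all = 0.4) outperforms the tuned KNNBasic model, with an RMSE of about 0.883 versus 0.913. KNNBaseline with default settings has the lowest mean RMSE at about 0.874, though it was evaluated with 5 folds rather than the 3 folds used in the grid searches, so the scores are not strictly comparable.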
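For reference, RMSE and MAE, the two metrics reported in the cross-validation outputs, can be computed by hand as in the sketch below. The `actual` and `predicted` lists are made up; the `fold_rmse` values are copied from the KNNBaseline output above to show how the averaged score is obtained.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of the error in rating points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up ratings and predictions
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625

# Per-fold test RMSE values from the KNNBaseline cross-validation output
fold_rmse = [0.86696806, 0.88212196, 0.87137691, 0.87282619, 0.87868555]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(round(mean_rmse, 4))  # 0.8744
```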

Next, I'll create a `get_movie_recommendations` function that is designed to enhance the user experience by allowing users to provide ratings for a selection of movies they've seen. It prompts users to rate five randomly chosen movies on a scale of 0.5 to 5. If a user hasn't watched a particular movie, they can simply enter 'n' to skip it.

After collecting the ratings, the function uses a trained recommendation model to predict ratings for all movies in the dataset. For any movie the user just rated, the predicted rating is averaged with the rating they entered; all other predictions are left unchanged. The function then returns the movies with the highest resulting scores, so the suggestions reflect both the model and the user's own input.
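The input handling described above (a number on the 0.5 to 5 scale, or 'n' to skip) can be isolated in a small helper so that it is testable without calling `input()`. `parse_rating` is a hypothetical helper for illustration and is not part of the notebook.

```python
def parse_rating(raw, low=0.5, high=5.0):
    """Return a float rating, or None if the user skipped or the input is invalid."""
    text = raw.strip().lower()
    if text == 'n':
        return None  # user has not seen the movie
    try:
        value = float(text)
    except ValueError:
        return None  # not a number
    if not (low <= value <= high):
        return None  # outside the rating scale
    return value

# Try a few representative inputs
for raw in ['4.5', 'n', 'abc', '7', ' 3 ']:
    print(repr(raw), '->', parse_rating(raw))
```

Returning `None` for both skipped and invalid input keeps the sketch short; a real prompt loop might want to distinguish the two so it can ask again after a typo.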

In [16]:
def get_movie_recommendations(user_id, ratings_with_titles, model, n_recommendations=5):
    # Prepare a sample of movies to prompt the user
    sample_movies = ratings_with_titles.sample(n=5)
    print("Please rate the following movies on a scale of 0.5 to 5 (or enter 'n' to skip):")

    user_ratings = {}

    for index, row in sample_movies.iterrows():
        title = row['title']
        # Prompt user for rating
        rating_input = input(f"Rate '{title}': ")

        if rating_input.lower() == 'n':
            continue  # Skip this movie if the user hasn't seen it

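        try:
            rating = float(rating_input)
        except ValueError:
            print("Invalid input. Please enter a number between 0.5 and 5 or 'n' to skip.")
            continue

        # Only keep ratings that fall on the 0.5 to 5 scale
        if 0.5 <= rating <= 5:
            user_ratings[row['movieId']] = rating
        else:
            print("Invalid input. Please enter a number between 0.5 and 5 or 'n' to skip.")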

    # Get predictions for all movies
    all_movie_ids = ratings_with_titles['movieId'].unique()
    predictions = []

    for movie_id in all_movie_ids:
        # Use the model to predict the rating
        pred_rating = model.predict(user_id, movie_id).est

        # If the user rated this movie, adjust the prediction based on their rating
        if movie_id in user_ratings:
            adjusted_rating = (pred_rating + user_ratings[movie_id]) / 2
        else:
            adjusted_rating = pred_rating

        predictions.append((movie_id, adjusted_rating))

    # Sort predictions by adjusted rating
    predictions.sort(key=lambda x: x[1], reverse=True)

    # Recommend top N movies
    recommended_movies = predictions[:n_recommendations]

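    # Convert movie IDs to titles, keeping one row per recommended movie
    recommended_ids = [m[0] for m in recommended_movies]
    recommended_titles = ratings_with_titles[ratings_with_titles['movieId'].isin(recommended_ids)]
    recommended_titles = recommended_titles.drop_duplicates(subset='movieId')

    return recommended_titles[['title', 'movieId']]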

# Usage
user_id = 55  # user ID
# Note: `gs` is the most recent grid search (KNNBasic), so this is the tuned KNNBasic model
best_model = gs.best_estimator['rmse']  # or 'mae'
best_model.fit(trainset)
recommended_movies = get_movie_recommendations(user_id, ratings_with_titles, best_model)
print("Recommended Movies:")
print(recommended_movies)

Computing the msd similarity matrix...
Done computing similarity matrix.
Please rate the following movies on a scale of 0.5 to 5 (or enter 'n' to skip):
Rate 'Contact (1997)': 3.9
Rate 'Legend of Bagger Vance, The (2000)': 4.0
Rate 'Doom (2005)': 4.1
Rate 'Mission: Impossible (1996)': 4.2
Rate 'Soul Surfer (2011)': 4.3
Recommended Movies:
                      title  movieId
9016   Way Back, The (2010)    83369
9139       The Mummy (2017)   170827
9147           Black Mirror   176601
9153           Alpha (2018)   185031
90034            61* (2001)    27373
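The blend-then-rank step can also be looked at in isolation. The sketch below uses made-up predictions in place of `model.predict` and returns titles in rank order; `rank_movies` and the toy dictionaries are hypothetical names for illustration only.

```python
def rank_movies(predicted, user_ratings, titles, n=5):
    """Blend predictions with the user's own ratings and return the top-n titles in rank order."""
    blended = {}
    for movie_id, pred in predicted.items():
        if movie_id in user_ratings:
            # Average the model's estimate with the rating the user just gave
            blended[movie_id] = (pred + user_ratings[movie_id]) / 2
        else:
            blended[movie_id] = pred
    top = sorted(blended.items(), key=lambda pair: pair[1], reverse=True)[:n]
    return [(titles[movie_id], score) for movie_id, score in top]

# Made-up predicted ratings keyed by movieId
predicted = {1: 4.2, 2: 3.0, 3: 4.8, 4: 2.5}
user_ratings = {2: 5.0}  # the user rated movie 2 themselves
titles = {1: 'Movie A', 2: 'Movie B', 3: 'Movie C', 4: 'Movie D'}

for title, score in rank_movies(predicted, user_ratings, titles, n=3):
    print(title, score)
```

Here the user's own 5.0 lifts Movie B from a predicted 3.0 to a blended 4.0, which is enough to place it third behind Movie C and Movie A.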


## Neural Networks

To prepare to build a neural network, I'll merge the `movies`, `ratings` and `tags` dataframes into one dataframe. First, I'll merge `movies` and `ratings`.

In [20]:
movies_ratings = pd.merge(movies, ratings, on='movieId', how='left')

Then, we'll merge the results with `tags` dataframe.

In [21]:
movies_ratings_tags = pd.merge(movies_ratings, tags, on='movieId', how='left')

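Because both merged frames contain a `userId` column, the merge produces two columns, `userId_x` (from the ratings) and `userId_y` (from the tags). We'll drop `userId_y` and rename `userId_x` back to `userId`.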

In [23]:
movies_ratings_tags = movies_ratings_tags.drop('userId_y', axis=1)
movies_ratings_tags = movies_ratings_tags.rename(columns={'userId_x': 'userId'})
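The `_x`/`_y` suffixes, and a side effect of merging on `movieId` alone, can be seen on a small made-up example. The `toy_*` frames below are hypothetical; the point is that every rating of a movie gets paired with every tag of that movie, so rating rows are repeated, and `userId_y` is whoever wrote the tag rather than whoever gave the rating.

```python
import pandas as pd

# Two users rated movie 10; three tags were attached to it (made-up data)
toy_ratings = pd.DataFrame({
    'movieId': [10, 10],
    'userId': [1, 2],
    'rating': [4.0, 5.0],
})
toy_tags = pd.DataFrame({
    'movieId': [10, 10, 10],
    'userId': [1, 3, 3],
    'tag': ['funny', 'classic', 'slow'],
})

merged = pd.merge(toy_ratings, toy_tags, on='movieId', how='left')

# Overlapping non-key columns are suffixed with _x (left) and _y (right)
print(merged.columns.tolist())

# 2 ratings x 3 tags = 6 rows after the merge
print(len(toy_ratings), len(merged))

# Dropping the tag columns and de-duplicating recovers the original ratings
deduped = merged[['userId_x', 'movieId', 'rating']].drop_duplicates()
print(len(deduped))
```

If only `userId`, `movieId` and `rating` are needed downstream, de-duplicating after the merge (or skipping the tags merge for that step) avoids training on repeated rating rows.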

Next, we'll drop the `timestamp` column:

In [26]:
movies_ratings_tags = movies_ratings_tags.drop('timestamp', axis=1)

Next, we'll preprocess the data to create user-item pairs suitable for training our model.

In [29]:
# Importing libraries
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Flatten, Dot, Dense, Concatenate
from tensorflow.keras.optimizers import Adam

In [30]:
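# Keep the user-item interaction columns; drop unrated movies (NaN from the left merge)
# and the duplicate rows created by the tags merge, then copy to avoid modifying a slice
ratings = movies_ratings_tags[['userId', 'movieId', 'rating']].dropna().drop_duplicates().copy()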

In [31]:
# Create user and movie mappings
user_ids = ratings['userId'].unique()
movie_ids = ratings['movieId'].unique()

In [32]:
user_to_index = {user_id: idx for idx, user_id in enumerate(user_ids)}
movie_to_index = {movie_id: idx for idx, movie_id in enumerate(movie_ids)}

In [33]:
# Convert userId and movieId to indices
ratings['userIndex'] = ratings['userId'].map(user_to_index)
ratings['movieIndex'] = ratings['movieId'].map(movie_to_index)
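The reason for this mapping is that raw MovieLens IDs are not contiguous, while embedding layers expect indices from 0 up to the number of distinct users or movies minus one. The sketch below repeats the dictionary approach on made-up IDs and shows `pd.factorize` as an equivalent alternative; the `toy` frame is hypothetical.

```python
import pandas as pd

# Made-up, non-contiguous IDs
toy = pd.DataFrame({
    'userId': [7, 42, 7, 99],
    'movieId': [500, 500, 31, 31],
})

# Dictionary-based mapping, as in the cells above
user_to_index = {user_id: idx for idx, user_id in enumerate(toy['userId'].unique())}
toy['userIndex'] = toy['userId'].map(user_to_index)

# pd.factorize assigns codes in order of first appearance, giving the same kind of mapping.
# By default, missing values receive the code -1.
movie_codes, movie_uniques = pd.factorize(toy['movieId'])
toy['movieIndex'] = movie_codes

print(toy)
```

Keeping `user_to_index` (or `movie_uniques`) around is what allows a predicted index to be translated back to the original ID later.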

In [34]:
# Split the data into training and test sets
train, test = train_test_split(ratings, test_size=0.2, random_state=42)

X_train = train[['userIndex', 'movieIndex']]
y_train = train['rating']
X_test = test[['userIndex', 'movieIndex']]
y_test = test['rating']
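The `Embedding` and `Dot` imports suggest a model that scores a (user, movie) pair by the dot product of two learned vectors; that is an assumption about what follows, not something this section states. Under that assumption, the NumPy sketch below shows what such a model computes for a batch of index pairs laid out like `X_train`. The sizes and the random, untrained embeddings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sizes, chosen only for illustration
n_users, n_movies, embedding_dim = 4, 6, 3

# One vector per user and per movie; in a real model these are learned during training
user_embeddings = rng.normal(scale=0.1, size=(n_users, embedding_dim))
movie_embeddings = rng.normal(scale=0.1, size=(n_movies, embedding_dim))

# A batch of (userIndex, movieIndex) pairs, same column layout as X_train
pairs = np.array([[0, 2], [3, 5], [1, 0]])

# Look up the vectors for each pair (what an embedding layer does)
user_vecs = user_embeddings[pairs[:, 0]]
movie_vecs = movie_embeddings[pairs[:, 1]]

# Row-wise dot product gives one score per pair
scores = np.sum(user_vecs * movie_vecs, axis=1)
print(scores.shape)  # (3,)
```

This also shows why the contiguous `userIndex` and `movieIndex` columns matter: they are used directly as row positions in the embedding tables.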