# 1.0 Introduction

This project involves analysis of movies data to build a recommendation system model that provides diverse options and accurate recommendations to customers that improves their shopping experience and increase engagement with shop catalogs, subsequently increasing sales. The research follows cross industry standard procedures (CRISP-DM) methodlogy fo the movies industry.

# 2.0 Business Understanding


# 2.1 Objective

The research mainly aims at developing a movie recommendation system, which would be helpful in recommending other similar movies to customers depending on the preference that a customer may have for a particular movie. A customer interested in a particular movie-he asks questions about it or looks at it in a catalog-the system should suggest other movies similar to the target movie. 

# 3.0 The Data
The dataset for modelling was drawn from https://grouplens.org/datasets/movielens/latest/.
Merged dataset contains 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users.

### Content

* **userId:** Unique identifier for the user.

* **movieId:** Unique identifier  for movie.

* **rating:** Ratings given by the user to the movie.

* **timestamp:** Time at which the rating was given by user.

* **title:** Name of the movie.

* **genres:** The genres for which movies belong.

* **tag:** A glimpse of what the movie is about or like.


# 3.1 Data Understanding 

## Data Preview

This is important  as it provides a snapshot of the type of information contained in the dataset for analysis.

### Import relevant python libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from surprise.prediction_algorithms import knns
from surprise.similarities import cosine, msd, pearson
from surprise.prediction_algorithms import SVD
from surprise.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import coo_matrix
import warnings
warnings.filterwarnings('ignore')

### Loading of the MovieLens datasets for preview

In [2]:
links = pd.read_csv("links.csv")
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")
tags = pd.read_csv("tags.csv")

In [3]:
print(f'Links dataset first 3 records \n {links.head(3)} ' )
print('------------')
print(f'Movies dataset first 3 records \n  {movies.head(3)}' )
print('------------')
print(f'Ratings dataset first 3 records \n  {ratings.head(3)}' )
print('------------')
print(f'Tags dataset first 3 records \n  {tags.head(3)}' )

Links dataset first 3 records 
    movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0 
------------
Movies dataset first 3 records 
     movieId                    title  \
0        1         Toy Story (1995)   
1        2           Jumanji (1995)   
2        3  Grumpier Old Men (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
------------
Ratings dataset first 3 records 
     userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
------------
Tags dataset first 3 records 
     userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756     will ferrell  1445714992


### *Observations*

*  Movies, Ratings and Tags datasets will be merged to form data enriched dataset for analysis. Merging criteria on *movieId* with an *inner joint*.

* Links datasets only contains unique identifies (IDs) and may not be useful for this study, thus will not be utilized.

In [4]:
#Merge movie and ratings datasets on movieId with an inner joint and assign movie_ratings
movie_ratings = pd.merge(ratings,movies, on='movieId', how='inner')

#Merge the resultant movie_ratings with tags on movieId with inner joint and assign movie_rating_tags
movie_rating_tags = pd.merge(movie_ratings, tags, on=['movieId'], how='inner')

#Remove duplicates if any
movie_rating_tags = movie_rating_tags.drop_duplicates()

#Check the first 5 rows of the merged dataset
movie_rating_tags.head()

Unnamed: 0,userId_x,movieId,rating,timestamp_x,title,genres,userId_y,tag,timestamp_y
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,336,pixar,1139045764
1,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,474,pixar,1137206825
2,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,567,fun,1525286013
3,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,336,pixar,1139045764
4,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,474,pixar,1137206825


In [5]:
movie_rating_tags.duplicated().sum()

0

In [6]:
#Check merged dataset info
movie_rating_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233213 entries, 0 to 233212
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   userId_x     233213 non-null  int64  
 1   movieId      233213 non-null  int64  
 2   rating       233213 non-null  float64
 3   timestamp_x  233213 non-null  int64  
 4   title        233213 non-null  object 
 5   genres       233213 non-null  object 
 6   userId_y     233213 non-null  int64  
 7   tag          233213 non-null  object 
 8   timestamp_y  233213 non-null  int64  
dtypes: float64(1), int64(5), object(3)
memory usage: 16.0+ MB


### *Observations*
* The dataset has 233213 rows and 9 columns, although there,s duplication of columns (userId & timestamp).
* It has 6 numerical features and 3 object features.
* Dataset has equal number of non_null counts in all columns, indicates that there are no missing values.
* Contains movieId and userId making the dataset suitable for building recommendation system(user-based and content-based).

# 3.2 Problem Statement

A new movie shop opens a branch in a new town with an aim to invent better interaction with customers by offering personalized movie recommendations. The company aims to recommend movies in which the customers have shown interest, liked, or even inquired about. This customized service will expose the customer to films they might not have considered but will likely enjoy based on the films they browse or inquire about. It would, therefore, be able to provide personalized recommendations through customer data on movie preference, past queries, and behavior to enhance customer experience, thereby commanding high satisfaction, loyalty, and repeat visits.

### General Objective

* To build a model that provides top 5 movie recommendations to a user, based on their ratings of other movies.

### Specific Objectives

* **Personalized Recommendations:** Build a system that will be able to recommend movies based on what customers have done, liked, or searched.
* **Enhanced Discovery:** Help customers discover movies that they may have never considered but might like and thus increase their tastes and knowledge of films.
* **Customer Engagement:** Incentivize customers to spend more time on the website with value-added recommendations relevant to their interests.
* **Increased Sales and Retention:** Personalized suggestions will increase sales and improve customer retention, as they will revisit your site for more and remain longer-term.
* **Enhanced User Experience:** Facilitate an easy and smooth recommendation experience for your customers.


# 3.3 Metrics of success

This project will be deemed successful if the built models will be able to predict top 5 movie recommendations to a user, based on their ratings of other movies.


# 4.0 Data Preparation

## 4.1 Data Cleaning

Involves checking and removal of duplicates,checking for missing values and mitigation, and feature engineering.

Dataset preview revealed duplicated columns and non-uniform feature naming. Therefore, all feature names will be converted to lowercase and remove the duplicated columns(userId_y, timestamp_y). Subsequently, rename 'userId_x' and 'timestamp_x' features to remove the suffixes.

In [7]:
#Check for duplicates if any and print out
print(f'Duplicates: \n......\n{movie_rating_tags.duplicated().sum()}')
#Check for missing values duplicates if any and print out
print(f'Missing values: \n....... \n {movie_rating_tags.isna().sum()}')

Duplicates: 
......
0
Missing values: 
....... 
 userId_x       0
movieId        0
rating         0
timestamp_x    0
title          0
genres         0
userId_y       0
tag            0
timestamp_y    0
dtype: int64


### Observation

There are no duplicate rows and missing values in all columns

In [8]:
#Remove 'userId_y' and 'timestap' features
movie_rating_tags = movie_rating_tags.drop(["userId_y","timestamp_x","timestamp_y"], axis=1)
#Rename 'userId_x' as 'userid' 
movie_rating_tags = movie_rating_tags.rename(columns={"userId_x": "userid", "timestamp_x": "timestamp"})
#Convert feature lowercase for uniformity
movie_rating_tags.columns = movie_rating_tags.columns.str.strip().str.lower()


### Save cleaned dataset to df

In [9]:
#Making a copy of cleaned dataset and save as df
df = movie_rating_tags.copy(deep=True)

In [10]:
df.head()

Unnamed: 0,userid,movieid,rating,title,genres,tag
0,1,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
1,1,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
2,1,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,fun
3,5,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
4,5,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar


In [11]:
df.columns

Index(['userid', 'movieid', 'rating', 'title', 'genres', 'tag'], dtype='object')

# 5.0 Modelling

Build a model that provides top 5 movie recommendations to a user, based on their ratings of other movies. This will be deployed to ....

### Modelling packages

In [12]:
#Modelling packages
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity

## 1. User-based Collaborative Filtering (user-user CF)


This model recommends movies to a user based on the ....

In [13]:
# Create a user-item matrix with users being rows and columns being movies
user_movie_matrix = df.pivot_table(index='userid', columns='movieid', values='rating').fillna(0)

# Compute similarity between users using cosine similarity
user_similarity = cosine_similarity(user_movie_matrix)

# Convert to a DataFrame for easier manipulation and visualization
user_similarity_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)

# Function to get recommendations for a user
def user_user_cf_recommendations(user_id, user_movie_matrix, user_similarity, top_n=5):
    # Ensure the user_id exists in the index
    if user_id not in user_movie_matrix.index:
        print(f"User ID {user_id} not found in the user-item matrix.")
        return None

    # Find the index of the user in the user_movie_matrix (account for 0-indexing)
    user_idx = user_movie_matrix.index.get_loc(user_id)
    
    # Get the similarity scores for the specific user
    similarity_scores = user_similarity[user_idx]

    # Get movies rated by the user
    rated_movies = user_movie_matrix.iloc[user_idx]
    rated_movie_ids = rated_movies[rated_movies > 0].index.tolist()

    # Predict ratings for all movies by multiplying similarity scores with user ratings
    movie_scores = user_similarity[user_idx].dot(user_movie_matrix)
    movie_scores = movie_scores / np.array([np.abs(user_similarity[user_idx]).sum()])  # Normalize the scores

    # Set the scores for already rated movies to 0 (avoid recommending rated movies)
    rated_movie_ids = [int(movie_id) for movie_id in rated_movie_ids]
    movie_scores[rated_movie_ids] = 0  # Set the scores for rated movies to 0

    # Get the top N movie recommendations (excluding already rated movies)
    recommended_movie_indices = movie_scores.argsort()[-top_n:][::-1]
    recommended_movie_ids = user_movie_matrix.columns[recommended_movie_indices].tolist()

    # Ensure movie_id type consistency between recommended_movie_ids and movies_df['movieid']
    recommended_movie_ids = [str(movie_id) for movie_id in recommended_movie_ids]  # Ensure IDs are strings if needed
    df['movieid'] = df['movieid'].astype(str)  # Make sure 'movieid' is string type

    # Remove duplicates from movies_df
    clean_df = df.drop_duplicates(subset='movieid')

    # Map movie IDs to titles
    recommended_movies = clean_df[clean_df['movieid'].isin(recommended_movie_ids)][['movieid', 'title']]
    
    return recommended_movies

# Example: Get top 5 recommendations for user 5
user_id = 5
top_5_movies = user_user_cf_recommendations(user_id, user_movie_matrix, user_similarity, top_n=5)
print(top_5_movies)

       movieid                             title
12388      296               Pulp Fiction (1994)
68205      356               Forrest Gump (1994)
71166      457              Fugitive, The (1993)
74185      593  Silence of the Lambs, The (1991)
114914     318  Shawshank Redemption, The (1994)


## 2. Item-based Collaborative Filtering

The model recommends a movie based on the similarity between movies.

In [14]:
# Compute item-item similarity matrix
item_similarity = cosine_similarity(user_movie_matrix.T)  # Transpose to compare items
item_similarity = pd.DataFrame(item_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)

In [15]:
def item_item_cf_recommendations(user_id, user_movie_matrix, item_similarity, top_n=5):
 
    # Get the index of the user (adjusting for 0-indexing)
    user_idx = user_id - 1

    # Get the movies rated by the user
    rated_movies = user_movie_matrix.iloc[user_idx]
    rated_movie_ids = rated_movies[rated_movies > 0].index.tolist()

    # Predict ratings for movies the user hasn't rated using item-item similarity
    predicted_ratings = item_similarity.dot(user_movie_matrix.iloc[user_idx])  # Predicted ratings for all movies
    predicted_ratings /= np.array(np.abs(item_similarity).sum(axis=1))  # Normalize the ratings by similarity

    # Set the scores for already rated movies to 0 (exclude them from recommendations)
    rated_movie_ids = np.array(rated_movie_ids, dtype=int)
    predicted_ratings[rated_movie_ids] = 0  # Remove already rated movies

    # Get top N movie recommendations
    recommended_movie_indices = predicted_ratings.argsort()[-top_n:][::-1]  # Top N recommendations
    recommended_movie_ids = user_movie_matrix.columns[recommended_movie_indices].tolist()

    # Ensure movie_id type consistency between recommended_movie_ids and movies_df['movieid']
    recommended_movie_ids = [str(movie_id) for movie_id in recommended_movie_ids]  # Ensure IDs are strings if needed
    df['movieid'] = df['movieid'].astype(str)  # Make sure 'movieid' in df is string type

    # Remove duplicates from movies_df
    clean_df = df.drop_duplicates(subset='movieid')

    # Map movie IDs to titles
    recommended_movies = clean_df[clean_df['movieid'].isin(recommended_movie_ids)][['movieid', 'title']]

    return recommended_movies

# Example: Get top 5 recommendations for user 5 using item-item CF
user_id = 5
top_5_movies = item_item_cf_recommendations(user_id, user_movie_matrix, item_similarity, top_n=5)
print(top_5_movies)

       movieid                                 title
148731     161                   Crimson Tide (1995)
149116     248                     Houseguest (1994)
149336     279                      My Family (1995)
155636     540                         Sliver (1993)
233210    7023  Wedding Banquet, The (Xi yan) (1993)


## 3. Matrix Factorization with Singular Value Decomposition (SVD)


In [16]:
# import packages
from surprise import SVD, Reader, Dataset
from surprise.model_selection import train_test_split
from surprise import accuracy,SVD

In [17]:
# Define the svd_recommendations function
def svd_recommendations(user_id, svd_model, all_movie_ids, top_n=5):
    # Predict ratings for all movies for the given user
    predictions = [svd_model.predict(user_id, movie_id) for movie_id in all_movie_ids]
    
    # Sort predictions by estimated rating
    sorted_predictions = sorted(predictions, key=lambda x: x.est, reverse=True)
    
    # Get top N recommended movie IDs
    recommended_movie_ids = [prediction.iid for prediction in sorted_predictions[:top_n]]

    # Remove duplicates from movies_df (optional if you're cleaning the movie list)
    clean_df = df.drop_duplicates(subset='movieid')

    # Map movie IDs to titles
    svd_recommended_movies = clean_df[clean_df['movieid'].isin(recommended_movie_ids)][['movieid', 'title']]
    
    return svd_recommended_movies

# Define the reader and load the data
reader = Reader(rating_scale=(1, 5))  # Assuming ratings are between 1 and 5
data = Dataset.load_from_df(df[['userid', 'movieid', 'rating']], reader)

# Train an SVD model
trainset = data.build_full_trainset()
svd = SVD()
svd.fit(trainset)

# Get the list of all movie IDs
all_movie_ids = df['movieid'].unique()

# Get top 5 movie recommendations for a specific user (e.g., user 5)
user_id = 5
top_5_movies_svd = svd_recommendations(user_id, svd, all_movie_ids, top_n=5)

# Display the top 5 recommendations
print("Top 5 Recommended Movies: \n", top_5_movies_svd)

Top 5 Recommended Movies: 
        movieid                    title
12388      296      Pulp Fiction (1994)
72026      527  Schindler's List (1993)
98727     2959        Fight Club (1999)
125625  109487      Interstellar (2014)
179775     858    Godfather, The (1972)


## 3. Alternating Least Squares (ALS)

In [18]:
import implicit
import numpy as np
import scipy.sparse as sp
from scipy.sparse import csr_matrix

In [19]:
user_item_matrix = df.pivot_table(index='userid', columns='movieid', values='rating').fillna(0)

In [20]:
user_item_sparse = user_item_matrix.astype(np.float32).values
user_item_sparse[user_item_sparse > 0] = 1  # Convert ratings to binary interaction

In [21]:
user_item_sparse = sp.csr_matrix(user_item_sparse)

In [22]:
als_model = implicit.als.AlternatingLeastSquares(factors=50, regularization=0.1, iterations=50)
als_model.fit(user_item_sparse)

100%|██████████| 50/50 [00:00<00:00, 164.83it/s]


In [23]:
recommended_items = als_model.recommend(user_id, user_item_sparse[user_id], N=10)

In [24]:
recommended_movie_indices = [int(item[0]) for item in recommended_items]
movie_ids = user_item_matrix.columns.tolist()
recommended_movie_ids = [movie_ids[idx] for idx in recommended_movie_indices]

In [25]:
recommended_movie_ids = list(set(recommended_movie_ids))

In [26]:
recommended_movie_titles = df[df['movieid'].isin(recommended_movie_ids)][['movieid', 'title']]
print("Top 5 Recommended Movies for User {}:".format(user_id))
print(recommended_movie_titles)

Top 5 Recommended Movies for User 5:
       movieid                                   title
0            1                        Toy Story (1995)
1            1                        Toy Story (1995)
2            1                        Toy Story (1995)
3            1                        Toy Story (1995)
4            1                        Toy Story (1995)
...        ...                                     ...
208696     551  Nightmare Before Christmas, The (1993)
208697     551  Nightmare Before Christmas, The (1993)
208698     551  Nightmare Before Christmas, The (1993)
208699     551  Nightmare Before Christmas, The (1993)
208700     551  Nightmare Before Christmas, The (1993)

[831 rows x 2 columns]


In [27]:
# Convert the DataFrame to a binary interaction matrix (values > 0 set to 1)
user_item_matrix = df.pivot_table(index='userid', columns='movieid', values='rating').fillna(0)
user_item_sparse = user_item_matrix.astype(np.float32).values
user_item_sparse[user_item_sparse > 0] = 1  # Convert ratings to binary interaction

# Convert the dense matrix to a sparse CSR matrix
user_item_sparse = sp.csr_matrix(user_item_sparse)

# Train ALS model using the Implicit library
als_model = implicit.als.AlternatingLeastSquares(factors=50, regularization=0.1, iterations=50)
als_model.fit(user_item_sparse)

# Example: Recommend 5 items for a given user (e.g., user 5)
user_id = 5
recommended_items = als_model.recommend(user_id, user_item_sparse[user_id], N=5)

# Extract movie indices from the recommendations
recommended_movie_indices = [int(item[0]) for item in recommended_items]  # Convert to integers

# Get the movie IDs for the top recommended movies using .tolist() to convert Index to list
movie_ids = user_item_matrix.columns.tolist()

# Map the recommended movie indices to the actual movie IDs
recommended_movie_ids = [movie_ids[idx] for idx in recommended_movie_indices]

#Remove duplicates from movies_df (optional if you're cleaning the movie list)
clean_df = df.drop_duplicates(subset='movieid')

# # Remove duplicates from the recommended movie IDs
# recommended_movie_ids = list(set(recommended_movie_ids))  # Use set to remove duplicates

# Ensure we only return the top 5 recommendations
recommended_movie_ids = recommended_movie_ids[:5]  # Take the top 5 unique movie IDs

# Now, we get the movie titles from the original DataFrame
recommended_movie_titles = clean_df[clean_df['movieid'].isin(recommended_movie_ids)][['movieid', 'title']]

# Display the recommended movies
print("Top 5 Recommended Movies for User {}:".format(user_id))
print(recommended_movie_titles)

100%|██████████| 50/50 [00:00<00:00, 162.25it/s]

Top 5 Recommended Movies for User 5:
       movieid                                   title
0            1                        Toy Story (1995)
208515     551  Nightmare Before Christmas, The (1993)





In [28]:
# Convert the DataFrame to a binary interaction matrix (values > 0 set to 1)
user_item_matrix = df.pivot_table(index='userid', columns='movieid', values='rating').fillna(0)
user_item_sparse = user_item_matrix.astype(np.float32).values
user_item_sparse[user_item_sparse > 0] = 1  # Convert ratings to binary interaction

# Convert the dense matrix to a sparse CSR matrix
user_item_sparse = sp.csr_matrix(user_item_sparse)

# Train ALS model using the Implicit library
als_model = implicit.als.AlternatingLeastSquares(factors=50, regularization=0.1, iterations=50)
als_model.fit(user_item_sparse)

# Example: Recommend 5 items for a given user (e.g., user 5)
user_id = 5
recommended_items = als_model.recommend(user_id, user_item_sparse[user_id], N=5)

# Extract movie indices from the recommendations
recommended_movie_indices = [int(item[0]) for item in recommended_items]  # Convert to integers

# Get the movie IDs for the top recommended movies using .tolist() to convert Index to list
movie_ids = user_item_matrix.columns.tolist()

# Map the recommended movie indices to the actual movie IDs
recommended_movie_ids = [movie_ids[idx] for idx in recommended_movie_indices]

# Remove duplicates from the recommended movie IDs
recommended_movie_ids = list(set(recommended_movie_ids))  # Use set to remove duplicates

# Now, we get the movie titles from the original DataFrame
recommended_movie_titles = df[df['movieid'].isin(recommended_movie_ids)][['movieid', 'title']]

# Display the recommended movies
print(recommended_movie_titles)

100%|██████████| 50/50 [00:00<00:00, 147.56it/s]

       movieid             title
0            1  Toy Story (1995)
1            1  Toy Story (1995)
2            1  Toy Story (1995)
3            1  Toy Story (1995)
4            1  Toy Story (1995)
...        ...               ...
143522     300  Quiz Show (1994)
143523     300  Quiz Show (1994)
143524     300  Quiz Show (1994)
143525     300  Quiz Show (1994)
143526     300  Quiz Show (1994)

[726 rows x 2 columns]





In [29]:
# Convert the DataFrame to a binary interaction matrix (values > 0 set to 1)
user_item_sparse = user_movie_matrix.astype(np.float32).values
user_item_sparse[user_item_sparse > 0] = 1  # Convert ratings to binary interaction

# Convert the dense matrix to a sparse CSR matrix
user_item_sparse = sp.csr_matrix(user_item_sparse)

# Train ALS model using implicit library
als_model = implicit.als.AlternatingLeastSquares(factors=50, regularization=0.1, iterations=50)
als_model.fit(user_item_sparse)

# Example: Recommend 5 items for a given user (e.g., user 5)
user_id = 5
recommended_items = als_model.recommend(user_id, user_item_sparse[user_id], N=5)

# Map the recommended movie IDs to their corresponding titles
recommended_movie_ids = [item[0] for item in recommended_items]
recommended_movie_titles = df[df['movieid'].isin(recommended_movie_ids)][['movieid', 'title']]

# Display recommended movie titles
print(recommended_movie_titles)

100%|██████████| 50/50 [00:00<00:00, 131.55it/s]


Empty DataFrame
Columns: [movieid, title]
Index: []
