# Personalized Movie Recommendations: A Collaborative Filtering Approach

## 1. Business Understanding
### (a) Introduction

CineCollab, established in 2006, has been at the forefront of digital entertainment, captivating audiences worldwide with a rich collection of films and sophisticated recommendation systems. Our success is deeply rooted in our commitment to elevating user experiences through advanced analytics.

This project mirrors CineCollab's dedication, with a focus on refining recommendation systems. We develop complex predictive models, taking into account factors such as individual viewing habits and user behaviors. By mapping these variables, we intend to work closely with CineCollab to boost the precision and effectiveness of its recommendation algorithm.

The MovieLens dataset plays a pivotal role in realizing this objective. Users can rate movies, and our model harnesses these ratings to craft personalized recommendations. The hurdle lies in designing a system that accurately deciphers user ratings and converts them into relevant film suggestions.

The core business problem we tackle is optimizing user engagement by delivering desired movie recommendations. This involves understanding user preferences based on ratings and creating an intuitive and user-friendly platform for users to provide these ratings. We attempt to maintain a balance between simplicity and capturing nuanced user preferences to ensure the recommendations are both accurate and well-received.

In essence, our project seeks to transform the way users rate movies and receive recommendations, aligning with CineCollab's ongoing commitment to personalized content discovery. Through our model, we aim to not only enhance efficiency but also provide a seamless and enriching experience for users globally, delivering top-notch movie recommendations tailored to their unique tastes and preferences.

### (b) Problem Statement

Online streaming platforms are the dominant form of media consumption, but viewers often face a paradox of choice when presented with an entire catalogue. Users struggle to navigate massive catalogues, leading to subscriber frustration and an increased likelihood of churn. By leveraging the MovieLens dataset, which comprises over 100,000 ratings applied to nearly 10,000 films by hundreds of users, we intend to push CineCollab's recommendation algorithm forward. Enhancing recommendations based on the movies streamers prefer has the potential to improve CineCollab's user experience and to help audiences discover unexplored content that matches their interests. This project, if implemented successfully, will promote media discovery, encourage niche, cult-like viewership, and inspire artistry through expanded access to cinema. Building a working, accurate recommendation system will be the key to unlocking the full creative and commercial potential of CineCollab.

### (c) Defining Metrics of Success

The success of a movie recommendation model using collaborative filtering can be assessed with metrics that measure how accurate and relevant its suggestions are. In this project, prediction accuracy is measured with root mean squared error (RMSE) and mean squared error (MSE) between predicted and actual ratings, where lower values indicate more accurate predictions. The metrics chosen should align with the goals of the recommendation system and the preferences of the user base.
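As a minimal sketch of how these two error metrics could be computed, assuming true and predicted ratings are available as equal-length sequences (the function names `mse` and `rmse` and the sample values are illustrative, not part of the project code):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return float(np.sqrt(mse(y_true, y_pred)))

# Hypothetical ratings, for illustration only
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

print(f"MSE:  {mse(true_ratings, predicted_ratings):.4f}")
print(f"RMSE: {rmse(true_ratings, predicted_ratings):.4f}")
```

RMSE is expressed in the same units as the ratings (stars), which makes it easier to interpret than MSE.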

### (d) Research Questions

1. What features contribute most to the accuracy of collaborative filtering in generating top movie recommendations?

2. How does the frequency of user ratings influence the accuracy and stability of the movie recommendation model?

3. What is the correlation between user ratings and various movie features?

4. Which movie features demonstrate the highest correlation with collaborative filtering recommendations, and how do they impact the model's predictions?

5. How successful is the collaborative filtering model in providing accurate and tailored movie recommendations based on user ratings and preferences?

### (e) The Main Objective

To develop and implement a movie recommendation system that leverages collaborative filtering techniques to provide personalized top 5 movie recommendations for users.

### (f) The Specific Objectives

1. To clean and preprocess the MovieLens datasets to ensure they are suitable for building a recommendation system.

2. To understand the distribution of movie ratings, explore user behavior, and identify patterns in the datasets.

3. To investigate and compare collaborative filtering techniques for building the recommendation system, such as Singular Value Decomposition (SVD), user-based filtering, and item-based filtering.

4. To implement and evaluate the performance of the collaborative filtering model using appropriate metrics such as RMSE and MSE.

5. To generate top 5 movie recommendations for a user based on their historical ratings.
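To make the last objective concrete, here is a minimal sketch of item-based collaborative filtering with cosine similarity on a small, hypothetical user-item matrix. It is not the model built in this project: the matrix `R` and the helper names `item_similarity` and `recommend_top_n` are assumptions made for illustration, and a rating of 0 stands for "not rated".

```python
import numpy as np

# Hypothetical toy matrix: rows are users, columns are movies, 0 means unrated
R = np.array([
    [5.0, 4.0, 0.0, 1.0, 0.0],
    [4.0, 5.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 5.0, 4.0, 5.0],
    [0.0, 1.0, 4.0, 5.0, 4.0],
])

def item_similarity(ratings):
    """Cosine similarity between the rating vectors (columns) of each pair of items."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody rated
    normalized = ratings / norms
    return normalized.T @ normalized

def recommend_top_n(ratings, user_index, n=5):
    """Return indices of up to n unrated items, ranked by predicted score."""
    sim = item_similarity(ratings)
    user_ratings = ratings[user_index]
    rated = user_ratings > 0
    # Predicted score: similarity-weighted average of the user's own ratings
    weights = sim[:, rated]
    denom = np.abs(weights).sum(axis=1)
    denom[denom == 0] = 1.0
    scores = weights @ user_ratings[rated] / denom
    # Only items the user has not rated yet are candidates
    candidates = np.where(~rated)[0]
    ranked = candidates[np.argsort(-scores[candidates])]
    return ranked[:n].tolist()

print(recommend_top_n(R, user_index=0, n=5))
```

With only two unrated movies for user 0, the list holds two indices. On the full dataset the same call would return five.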

### (g) Data Understanding

The MovieLens dataset (ml-latest-small) provides a comprehensive snapshot of user preferences and interactions within the MovieLens movie recommendation service. This dataset, created by 610 users over a period spanning from March 29, 1996, to September 24, 2018, is a valuable resource for gaining insights into user behavior, preferences, and movie metadata.
It has 9 attributes.

**Attributes:**

1. **movieId:** Identifier for movies used by MovieLens.
2. **title:** Name of the movie, with the release year in parentheses. The `movieId` column, not the title, is the key used to join the tables.
3. **genres:** Classifies films based on overarching themes, narrative structures, and intended emotional impact.
4. **imdbId:** Identifier associated with a movie on the IMDb (Internet Movie Database) platform.
5. **tmdbId:** Identifier associated with movies on TMDb (The Movie Database).
6. **rating:** Numerical evaluation given by users on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
7. **timestamp:** Time at which the rating or tag was recorded, in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
8. **tags:** User-generated metadata about movies. Each tag is typically a single word or short phrase, with meaning, value, and purpose determined by each user.
9. **userId:** Unique identifier assigned to each user who participated in movie rating and tagging activities.
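The `timestamp` values are Unix epoch seconds, so they need converting before they can be read as dates. A minimal sketch using only the standard library (the helper name `to_utc_datetime` is illustrative). On a whole column, `pd.to_datetime(ratings['timestamp'], unit='s')` would do the same job.

```python
from datetime import datetime, timezone

def to_utc_datetime(unix_seconds):
    """Convert seconds since 1970-01-01 00:00:00 UTC into a timezone-aware datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

print(to_utc_datetime(0).isoformat())       # 1970-01-01T00:00:00+00:00
print(to_utc_datetime(86_400).isoformat())  # 1970-01-02T00:00:00+00:00
```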


**Dataset Statistics:**

- 100,836 ratings and 3,683 tag applications.
- 9,742 movies encompassing a diverse array of genres.
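These figures imply a very sparse user-item matrix, which matters for collaborative filtering. A quick back-of-the-envelope check, using only the counts quoted in this section (610 users, 9,742 movies, 100,836 ratings):

```python
# Counts taken from the dataset description above
n_ratings = 100_836
n_users = 610
n_movies = 9_742

# Number of cells in a full user x movie rating matrix
possible_ratings = n_users * n_movies

density = n_ratings / possible_ratings
sparsity = 1 - density

print(f"Possible user-movie pairs: {possible_ratings:,}")
print(f"Density:  {density:.2%}")
print(f"Sparsity: {sparsity:.2%}")
```

Only about 1.7% of all possible user-movie pairs carry a rating, so any model has to infer preferences from a small fraction of the matrix.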



In [None]:
#importing relevant packages
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#create data frames 
movies = pd.read_csv("data/ml-latest-small/movies.csv")
links = pd.read_csv("data/ml-latest-small/links.csv")
ratings = pd.read_csv("data/ml-latest-small/ratings.csv")
tags = pd.read_csv("data/ml-latest-small/tags.csv")

In [None]:
#reading the first 3 rows
links.head(3)

In [None]:
#reading the first 3 rows
ratings.head(3)

In [None]:
#reading the first 3 rows
tags.head(3)

In [None]:
#merge movies and links on 'movieId'
df = pd.merge(movies, links, on='movieId')

#merge with ratings on 'movieId'
df = pd.merge(df, ratings, on='movieId')

#merge with tags on 'movieId'
df = pd.merge(df, tags, on='movieId')

df.head()

In [None]:
df.shape

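This signifies that the DataFrame consists of 233,213 rows and 9 columns. Because the ratings and tags tables are both merged on `movieId` alone, each row pairs one rating of a movie with one tag applied to that same movie. A single user-movie rating is therefore repeated once for every tag the movie has received, and movies without tags are dropped by the inner join. The columns represent the attributes of these rating-tag pairs.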
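The effect of joining on `movieId` alone can be seen on a small scale. The sketch below uses two hypothetical toy frames (`ratings_toy` and `tags_toy`, which are not part of the dataset) to show how an inner merge on a non-unique key multiplies rows.

```python
import pandas as pd

# Hypothetical toy data: movie 1 has two ratings, movie 2 has one
ratings_toy = pd.DataFrame({
    "movieId": [1, 1, 2],
    "userId": [10, 11, 10],
    "rating": [4.0, 5.0, 3.0],
})

# Movie 1 has three tags, movie 2 has none
tags_toy = pd.DataFrame({
    "movieId": [1, 1, 1],
    "userId": [20, 21, 22],
    "tag": ["pixar", "fun", "classic"],
})

# Inner merge on movieId only: every rating of a movie is paired with every tag of that movie
merged = pd.merge(ratings_toy, tags_toy, on="movieId")

print(f"ratings rows: {len(ratings_toy)}, tags rows: {len(tags_toy)}, merged rows: {len(merged)}")
print(merged)
```

Movie 1 contributes 2 × 3 = 6 rows and movie 2 disappears, and the two `userId` columns are renamed `userId_x` and `userId_y`. Statistics that should count each rating once are safer to compute from `ratings` directly.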

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
columns_to_drop = ['timestamp_x', 'timestamp_y']
df.drop(columns=columns_to_drop, inplace=True)

# Display the first few rows of the DataFrame
df.head()

In [None]:
# Number of movies and users
num_movies = movies['movieId'].nunique()
num_users = ratings['userId'].nunique()

print(f"Number of movies: {num_movies}")
print(f"Number of users: {num_users}")

In [None]:
df.isnull().sum()

## 2. Exploratory Data Analysis (EDA)

In [None]:
# Distribution of ratings (use `ratings` directly: `df` repeats each rating once per tag)
plt.figure(figsize=(10, 6))
sns.countplot(x='rating', data=ratings)
plt.title('Distribution of Ratings')
plt.show()

In [None]:
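# Most rated movies (join ratings to titles so each rating is counted once,
# rather than once per tag as in the merged df)
most_rated_movies = (
    ratings.merge(movies[['movieId', 'title']], on='movieId')
    .groupby('title')['rating'].count()
    .sort_values(ascending=False)
)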

# Display the top 10 most rated movies
print("Top 10 most rated movies:")
print(most_rated_movies.head(10))

In [None]:
# Average ratings per user
# Calculate the average rating per user
average_rating_per_user = ratings.groupby('userId')['rating'].mean()

# Display the average rating per user
print("Average rating per user:")
print(average_rating_per_user.head(10))

In [None]:
# Average rating per movie
avg_rating_per_movie = ratings.groupby('movieId')['rating'].mean()

print("Average rating per movie:")
print(avg_rating_per_movie.head(10))

In [None]:
# Split genres and count how many movies fall in each genre
# (use `movies` so each film is counted once, not once per rating-tag row)
genre_counts = movies['genres'].str.split('|').explode().value_counts()

# Plot the bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=genre_counts.index, y=genre_counts.values, color='skyblue')
plt.title('Number of Movies in Each Genre')
plt.xlabel('Genre')
plt.ylabel('Number of Movies')
plt.xticks(rotation=45, ha='right')  
plt.show()

In [None]:
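# Calculate the top 20 most rated movies (counting each rating once)
top_rated_movies = (
    ratings.groupby('movieId')['rating'].count()
    .sort_values(ascending=False).head(20).reset_index()
)
# Attach titles from `movies`, which has one row per movieId
top_rated_movies = pd.merge(top_rated_movies, movies[['movieId', 'title']], on='movieId', how='left')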

# Plot the bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x='rating', y='title', data=top_rated_movies, palette='muted', orient='h')
plt.title('Top 20 Most Rated Movies')
plt.xlabel('Number of Ratings')
plt.ylabel('Movie Title')
plt.show()

In [None]:
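# Calculate the correlations between numeric columns only
# (text columns such as title and genres would raise an error in recent pandas versions)
correlations = df.corr(numeric_only=True)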

# Create a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(round(correlations, 2), annot=True, linewidths=.7)
plt.title('Correlation Heatmap')
plt.show()

### Univariate Data Analysis

In [None]:
# Join movies and ratings only, so that each row carries a single userId
def load_and_join_csv(movies_path, ratings_path):
    # Load CSV files into pandas DataFrames
    movies = pd.read_csv(movies_path)
    ratings = pd.read_csv(ratings_path)

    # Inner join on the common column movieId
    # (tags are left out to avoid a second userId column)
    movies_ratings_df = pd.merge(movies, ratings, on='movieId', how='inner')

    return movies_ratings_df

# Adjust the file paths to match the location of your data
movie_rating_df = load_and_join_csv("data/ml-latest-small/movies.csv", "data/ml-latest-small/ratings.csv")

# Display the resulting dataset
movie_rating_df

In [None]:
# A function to create plots

def create_plots(df, plot_type, columns_to_plot = None, y = None):
    if plot_type == 'count_plot':
        plt.figure(figsize=(12,8))
        sns.countplot(data=df, x=columns_to_plot)
        plt.title(f'Distribution of movie {columns_to_plot}')
        plt.xticks(rotation=90)
        plt.show()
    elif plot_type == 'bar_plot1':
        plt.figure(figsize=(12, 7))
        # Pass x and y by keyword: recent seaborn versions reject positional data arguments
        sns.barplot(x=columns_to_plot.index, y=columns_to_plot.values)
        plt.title('Top 10 most frequently rated movies')
        plt.xlabel('Movie title')
        plt.xticks(rotation=90)
        plt.ylabel('Count')
        plt.show()
    elif plot_type == 'bar_plot2':
        plt.figure(figsize=(12, 7))
        sns.barplot(x=columns_to_plot.index, y=columns_to_plot.values)
        plt.title('Top 10 most frequently rated movies and their average ratings')
        plt.xlabel('Movie title')
        plt.xticks(rotation=90)
        plt.ylabel('Average Rating')
        plt.show()

In [None]:
create_plots(movie_rating_df, 'count_plot', 'rating')

In [None]:
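from collections import Counter

# Split the pipe-separated genres string into a list and count genre frequency.
# Note: this modifies the 'genres' column in place, so run it only once per DataFrame.
def splitting_string(movies):
    movies['genres'] = movies['genres'].apply(lambda x: x.split('|'))
    genre_frequency = Counter(g for genres in movies['genres'] for g in genres)

    return genre_frequency

splitting_string(movie_rating_df)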

In [None]:
new_df = movie_rating_df.copy()
new_df = new_df.explode('genres')
new_df

In [None]:
create_plots(new_df, 'count_plot', 'genres')

In [None]:
top_10_views = movie_rating_df['title'].value_counts().nlargest(10)
average_ratings = movie_rating_df.groupby('title')['rating'].mean().loc[top_10_views.index]
create_plots(movie_rating_df, 'bar_plot2', average_ratings)

In [None]:
def bar_plot(x, y, data):
    
    plt.figure(figsize=(12,6))
    
    sns.barplot(x=x, y=y, data=data)
    plt.title('Genres and their average ratings')
    plt.xlabel(f'{x}')
    plt.xticks(rotation=90)
    plt.ylabel(f'{y}')
    plt.show()

bar_plot('genres', 'rating', new_df)

In [None]:
plt.figure(figsize=(23,10))
sns.countplot(data=new_df, x='genres', hue='rating')
plt.title('Distribution of rating per genre')
plt.show()