![logo.png](attachment:logo.png)

# BUSINESS UNDERSTANDING

## Overview

StreamFlix aims to implement a movie recommendation system to enhance user experience, increase watch time and improve overall customer retention. The project will use the MovieLens dataset to build a collaborative filtering model that provides personalized top-5 movie recommendations for each user based on their previous ratings. StreamFlix has also observed that users often struggle to find movies they enjoy, leading to decreased engagement and potential churn.
By implementing a personalized recommendation system, StreamFlix hopes to:

* Increase user satisfaction and engagement
* Boost average watch time per user
* Improve customer retention rates
* Differentiate itself from competitors in the streaming market

The following approaches were proposed to enable new users to provide their ratings:

+ Initial Survey:
When a new user signs up for StreamFlix, they're presented with a quick survey of 20 popular movies across various genres. Users rate at least 10 of these movies on a scale of 1-5 stars.
+ Continuous Rating:
After watching a movie, users are prompted to rate it. This can be done through:

    + A pop-up immediately after the movie ends.
    + A "Rate this movie" button on the movie's page.
    + A dedicated "My Ratings" section in the user's profile.

+ Rating Import:
Offer users the option to import their ratings from other platforms (e.g., IMDb, Rotten Tomatoes) to quickly build their profile.
+ Gamification:
Implement a "Movie Critic" badge system where users earn badges for rating a certain number of movies, encouraging more ratings.




## Business Problem



StreamFlix is facing challenges with user retention and engagement. Users are also overwhelmed by the vast library of movies available and often spend a considerable amount of time searching for movies they would enjoy.  StreamFlix is, therefore, looking for a way to provide personalized movie recommendations to its users to improve their viewing experience and increase platform usage. 



## Objectives

### Main Objective

To develop and deploy a collaborative filtering-based recommendation system that accurately predicts user preferences and provides relevant movie suggestions.


### Specific Objectives

1. To build a collaborative filtering model that uses user ratings to generate top 5 movie recommendations.
2. To address the cold start problem using content-based filtering for new users.
3. To evaluate the recommendation system using appropriate metrics like RMSE and MAP.
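
For the second objective, one simple content-based fallback is to match a new user's survey picks against movie genres. The sketch below is an assumption about how this could look, not the project's final method: the toy catalogue, the function name `recommend_for_new_user` and the choice of cosine similarity on one-hot genre vectors are all illustrative.

```python
import numpy as np
import pandas as pd

# Toy catalogue in the MovieLens movies.csv layout (pipe-separated genres).
# Titles and genres here are made up for illustration.
movies = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['A', 'B', 'C', 'D'],
    'genres': ['Adventure|Comedy', 'Comedy|Romance',
               'Action|Thriller', 'Adventure|Comedy|Fantasy'],
})

# One-hot encode genres: one row per movie, one column per genre.
genre_matrix = movies['genres'].str.get_dummies(sep='|').astype(float)
genre_matrix.index = movies['movieId']

def recommend_for_new_user(liked_movie_ids, genre_matrix, top_n=5):
    """Rank unseen movies by cosine similarity to the user's genre profile."""
    # Profile = average genre vector of the movies the user liked in the survey.
    profile = genre_matrix.loc[liked_movie_ids].mean(axis=0).to_numpy()
    candidates = genre_matrix.drop(index=liked_movie_ids)
    cand = candidates.to_numpy()
    norms = np.linalg.norm(cand, axis=1) * np.linalg.norm(profile)
    # Guard against division by zero for movies (or profiles) with no genres.
    sims = cand @ profile / np.where(norms == 0, 1.0, norms)
    scores = pd.Series(sims, index=candidates.index)
    return scores.sort_values(ascending=False).head(top_n)

recs = recommend_for_new_user([1], genre_matrix, top_n=5)
print(recs)
```

Once a user has accumulated enough ratings, the collaborative filtering model from the first objective would take over from this fallback.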



## Success Metrics

1. Root Mean Square Error (RMSE) < 0.9 for rating predictions
2. Mean Average Precision @5 (MAP@5) > 0.3 for recommended movies
3. User engagement increase: 15% boost in average watch time within 3 months of deployment
4. Hit rate: Percentage of times that a recommended movie was actually watched by the user
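
As a rough illustration of how the offline metrics above (RMSE, MAP@5 and hit rate) could be computed, the sketch below uses plain NumPy. The function names and the toy data are hypothetical. The MAP@k variant shown divides by `min(k, number of relevant items)`, which is one common convention among several, and it assumes each recommendation list contains no repeated items.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def average_precision_at_k(recommended, relevant, k=5):
    """AP@k for one user: sum of precision at each hit rank, normalized."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(k, len(relevant))

def map_at_k(recs_by_user, relevant_by_user, k=5):
    """Mean of AP@k over all users that received recommendations."""
    return float(np.mean([
        average_precision_at_k(recs, relevant_by_user.get(user, ()), k)
        for user, recs in recs_by_user.items()
    ]))

def hit_rate_at_k(recs_by_user, watched_by_user, k=5):
    """Share of users with at least one recommended movie actually watched."""
    hits = [
        any(item in set(watched_by_user.get(user, ())) for item in recs[:k])
        for user, recs in recs_by_user.items()
    ]
    return float(np.mean(hits))

# Toy example: two users, five recommendations each.
recs = {1: [10, 20, 30, 40, 50], 2: [11, 21, 31, 41, 51]}
relevant = {1: [10, 30], 2: [99]}

print(rmse([4, 3, 5], [3.5, 3, 4]))
print(map_at_k(recs, relevant, k=5))
print(hit_rate_at_k(recs, relevant, k=5))
```

The engagement metric (watch time) cannot be computed from ratings alone and would have to be measured after deployment.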

# DATA UNDERSTANDING

In this section, we load and explore the datasets for our movie recommendation system. The datasets include `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. These datasets are then merged on the `movieId` column to form a unified DataFrame for further analysis.

In [48]:
# Import relevant libraries

# Data manipulation 
import pandas as pd 
import numpy as np 

# Data visualization
import seaborn as sns 
import matplotlib.pyplot as plt 
%matplotlib inline

# Data modelling
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

# Filter future warnings only (other warning categories stay visible)
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)


In [49]:
class DataExplorer:
    '''
    A class to handle and explore multiple pandas DataFrames.

    Attributes:
        file_paths (dict): A dictionary of file paths for the csv files to load.
        data (dict): A dictionary of the loaded DataFrames.
        merged_data (pandas.DataFrame): The merged DataFrame.

    Methods:
        load_data(): Load the CSV files into DataFrames.
        merge_data(): Merge the DataFrames on movieId.
        get_shape(): Print the number of rows and columns in the DataFrame.
        summarize_info(): Print a summary of the DataFrame columns.
        describe_data(): Print descriptive statistics of the DataFrame.
        display_column_types(): Display numerical and categorical columns.
    '''

    def __init__(self, file_paths):
        '''
        Initialize the DataExplorer object.

        Args:
            file_paths (dict): A dictionary of file paths for the csv files to load.
        '''
        self.file_paths = file_paths
        self.data = {}
        self.merged_data = None

    def load_data(self):
        '''
        Load the csv files into DataFrames.

        Returns:
            None
        '''
        for name, path in self.file_paths.items():
            print(f'Loading {name} data csv file...')
            try:
                self.data[name] = pd.read_csv(path)
                print(f'{name} dataset loaded successfully from {path}\n')
            except FileNotFoundError:
                print(f'Error: The file \'{path}\' was not found.')
            except Exception as e:
                print(f'Error: An unexpected error occurred: {e}')

    def merge_data(self):
        '''
        Merge the DataFrames on movieId.

        Returns:
            None
        '''
        try:
            print('Merging data on movieId...')
            self.merged_data = pd.merge(self.data['movies'], self.data['ratings'], on='movieId', how='left')
            self.merged_data = pd.merge(self.merged_data, self.data['tags'], on='movieId', how='left')
            self.merged_data = pd.merge(self.merged_data, self.data['links'], on='movieId', how='left')
            print('Data merged successfully\n')
        except KeyError as e:
            print(f'Error: A KeyError occurred: {e}. Please ensure all DataFrames contain the column \'movieId\'.')

    def get_shape(self):
        '''
        Print the number of rows and columns in the merged DataFrame.

        Returns:
            None
        '''
        if self.merged_data is not None:
            rows, columns = self.merged_data.shape
            print(f'The merged DataFrame has {rows} rows and {columns} columns.\n')
        else:
            print('Error: No merged data available. Please call the merge_data() method first.')

    def summarize_info(self):
        '''
        Print a summary of the merged DataFrame columns.

        Returns:
            None
        '''
        print('Summarizing the merged DataFrame info')
        print('-------------------------------')
        if self.merged_data is not None:
            # info() prints its summary directly and returns None, so it is not wrapped in print()
            self.merged_data.info()
        else:
            print('Error: No merged data available. Please call the merge_data() method first.')

    def describe_data(self):
        '''
        Print descriptive statistics of the merged DataFrame.

        Returns:
            None
        '''
        print('\nDescribing the merged DataFrame data')
        print('--------------------------------')
        if self.merged_data is not None:
            display(self.merged_data.describe())
        else:
            print('Error: No merged data available. Please call the merge_data() method first.')

    def display_column_types(self):
        '''
        Display numerical and categorical columns of the merged DataFrame.

        Returns:
            None
        '''
        print('\nDisplaying numerical and categorical columns of the merged DataFrame')
        print('-----------------------------------------------')
        if self.merged_data is not None:
            numerical_columns = self.merged_data.select_dtypes(include='number').columns
            categorical_columns = self.merged_data.select_dtypes(include='object').columns
            print(f'Numerical Columns: {numerical_columns}\n')
            print(f'Categorical Columns: {categorical_columns}\n')
        else:
            print('Error: No merged data available. Please call the merge_data() method first.')



In [50]:
# Instantiate 
file_paths = {
    'links': 'movies_data/links.csv',
    'movies': 'movies_data/movies.csv',
    'ratings': 'movies_data/ratings.csv',
    'tags': 'movies_data/tags.csv'
}
data_explorer = DataExplorer(file_paths)

# Load data
data_explorer.load_data()

# Merge data
data_explorer.merge_data()

# Get dimensions
data_explorer.get_shape()

# Summarize info
data_explorer.summarize_info()

# Describe data
data_explorer.describe_data()

# Display numerical and categorical columns
data_explorer.display_column_types()


Loading links data csv file...
links dataset loaded successfully from movies_data/links.csv

Loading movies data csv file...
movies dataset loaded successfully from movies_data/movies.csv

Loading ratings data csv file...
ratings dataset loaded successfully from movies_data/ratings.csv

Loading tags data csv file...
tags dataset loaded successfully from movies_data/tags.csv

Merging data on movieId...
Data merged successfully

The merged DataFrame has 285783 rows and 11 columns.

Summarizing the merged DataFrame info
-------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 285783 entries, 0 to 285782
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   movieId      285783 non-null  int64  
 1   title        285783 non-null  object 
 2   genres       285783 non-null  object 
 3   userId_x     285762 non-null  float64
 4   rating       285762 non-null  float64
 5   timestamp_x  285762 non-null

Unnamed: 0,movieId,userId_x,rating,timestamp_x,userId_y,timestamp_y,imdbId,tmdbId
count,285783.0,285762.0,285762.0,285762.0,233234.0,233234.0,285783.0,285770.0
mean,14927.663741,313.894279,3.84127,1214707000.0,470.681354,1384754000.0,295605.0,12797.31532
std,31402.673519,179.451387,1.020798,223373000.0,153.324249,153470500.0,515015.6,43479.255523
min,1.0,1.0,0.5,828124600.0,2.0,1137179000.0,417.0,2.0
25%,296.0,160.0,3.0,1019133000.0,424.0,1242494000.0,109830.0,489.0
50%,1721.0,314.0,4.0,1211377000.0,477.0,1457901000.0,112573.0,680.0
75%,5673.0,465.0,4.5,1445346000.0,599.0,1498457000.0,241527.0,8963.0
max,193609.0,610.0,5.0,1537799000.0,610.0,1537099000.0,8391976.0,525662.0



Displaying numerical and categorical columns of the merged DataFrame
-----------------------------------------------
Numerical Columns: Index(['movieId', 'userId_x', 'rating', 'timestamp_x', 'userId_y',
       'timestamp_y', 'imdbId', 'tmdbId'],
      dtype='object')

Categorical Columns: Index(['title', 'genres', 'tag'], dtype='object')



### Loading the Data

We begin by loading each dataset into a separate pandas DataFrame and then merge them on the `movieId` column. The following datasets were loaded:

- **links.csv**: Contains identifiers that link MovieLens IDs with IMDb and TMDb IDs.
- **movies.csv**: Contains movie information including titles and genres.
- **ratings.csv**: Contains user ratings for different movies.
- **tags.csv**: Contains user-generated tags for different movies.

### Merged DataFrame

After loading and merging the datasets, the resulting DataFrame has **285,783 rows and 11 columns**. Here is a brief summary of the columns in the merged DataFrame:

- **movieId**: Unique identifier for each movie.
- **title**: Movie title.
- **genres**: Movie genres.
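- **userId_x**: ID of the user who gave the rating (from `ratings.csv`; pandas adds the `_x` suffix because `userId` also appears in `tags.csv`).
- **rating**: Rating given by the user.
- **timestamp_x**: Timestamp of the rating.
- **userId_y**: ID of the user who applied the tag (from `tags.csv`).
- **tag**: User-generated tag.
- **timestamp_y**: Timestamp of the tag.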
- **imdbId**: IMDb identifier for the movie.
- **tmdbId**: TMDb identifier for the movie.
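
One thing to keep in mind when reading the row count: `ratings` and `tags` are both joined on `movieId` alone, so every rating of a movie is paired with every tag of that movie. The sketch below reproduces this on toy frames modeled on the Toy Story rows of the merged data (the frames themselves are made up for illustration). Statistics computed on the merged frame, such as the mean rating, therefore count each rating once per tag.

```python
import pandas as pd

# Toy frames: one movie, two ratings, three tags.
movies = pd.DataFrame({'movieId': [1], 'title': ['Toy Story (1995)']})
ratings = pd.DataFrame({'movieId': [1, 1], 'userId': [1, 5], 'rating': [4.0, 4.0]})
tags = pd.DataFrame({'movieId': [1, 1, 1],
                     'userId': [336, 474, 567],
                     'tag': ['pixar', 'pixar', 'fun']})

# Same merge order as DataExplorer.merge_data(): movies -> ratings -> tags.
merged = (movies
          .merge(ratings, on='movieId', how='left')
          .merge(tags, on='movieId', how='left'))
print(len(merged))  # 2 ratings x 3 tags = 6 rows

# Dropping the tag columns and deduplicating recovers one row per rating.
ratings_only = merged.drop(columns=['userId_y', 'tag']).drop_duplicates()
print(len(ratings_only))  # 2 rows
```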


### Numerical and Categorical Columns

The DataFrame contains the following numerical and categorical columns:

- **Numerical Columns**: `movieId`, `userId_x`, `rating`, `timestamp_x`, `userId_y`, `timestamp_y`, `imdbId`, `tmdbId`.
- **Categorical Columns**: `title`, `genres`, `tag`.


# DATA PREPROCESSING

In this section, we prepare the data for exploratory data analysis (EDA) and modeling. The preprocessing involves several key steps to clean and transform the dataset.

In [51]:
# Display first few rows of the merged data
display(data_explorer.merged_data.head())

Unnamed: 0,movieId,title,genres,userId_x,rating,timestamp_x,userId_y,tag,timestamp_y,imdbId,tmdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,964982703.0,336.0,pixar,1139046000.0,114709,862.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,964982703.0,474.0,pixar,1137207000.0,114709,862.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,964982703.0,567.0,fun,1525286000.0,114709,862.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,847434962.0,336.0,pixar,1139046000.0,114709,862.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,847434962.0,474.0,pixar,1137207000.0,114709,862.0


### Drop Less Relevant Columns
To streamline the dataset, we drop `timestamp_x`, `timestamp_y`, `userId_y`, `tag`, `imdbId` and `tmdbId`, which are less relevant for our analysis:

In [52]:
# Drop less relevant columns
data_explorer.merged_data.drop(columns=['timestamp_x', 'timestamp_y', 'userId_y', 'tag', 'imdbId', 'tmdbId'], inplace=True)

# Display the first few rows of the cleaned data
display(data_explorer.merged_data.head())


Unnamed: 0,movieId,title,genres,userId_x,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0


### Check for Missing Values

In [53]:
# Check for missing values
print(data_explorer.merged_data.isnull().sum())

movieId      0
title        0
genres       0
userId_x    21
rating      21
dtype: int64


This revealed 21 missing values in each of `userId_x` and `rating`. Since only 21 rows were affected, we dropped the rows with a missing rating:

In [54]:
# Drop null values since they are just 21.
data_explorer.merged_data.dropna(subset=['rating'], inplace=True)
print(data_explorer.merged_data.isnull().sum())


movieId     0
title       0
genres      0
userId_x    0
rating      0
dtype: int64
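
Because `movies` was the left table in the merge, the incomplete rows are consistent with movies that simply have no ratings. A sketch of one way to confirm this is shown below, using the `indicator` parameter of `DataFrame.merge` on toy frames (the data and variable names are hypothetical).

```python
import pandas as pd

# Toy frames: movie 2 has no ratings.
movies = pd.DataFrame({'movieId': [1, 2, 3],
                       'title': ['A (1995)', 'B (2001)', 'C (2010)']})
ratings = pd.DataFrame({'movieId': [1, 1, 3],
                        'userId': [1, 5, 7],
                        'rating': [4.0, 3.5, 5.0]})

# indicator=True adds a '_merge' column telling where each row came from.
checked = movies.merge(ratings, on='movieId', how='left', indicator=True)

# 'left_only' rows are movies that found no match in the ratings table.
unrated = checked.loc[checked['_merge'] == 'left_only', ['movieId', 'title']]
print(unrated)
```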


### Remove Duplicate Rows

In [55]:
# Drop duplicates
data_explorer.merged_data.drop_duplicates(inplace=True)
print(data_explorer.merged_data.duplicated().sum())

0


### Rename Columns
To improve clarity, we renamed the column `userId_x` to `user_id` using the following function:

In [56]:
# A function for renaming columns
def rename_column(df, current_name, new_name):
    """
    Renames a column in the DataFrame.

    Args:
        df (pandas.DataFrame): The DataFrame containing the column to rename.
        current_name (str): The current name of the column.
        new_name (str): The new name for the column.

    Returns:
        pandas.DataFrame: DataFrame with the renamed column.
    """
    if current_name in df.columns:
        df.rename(columns={current_name: new_name}, inplace=True)
        print(f"Column '{current_name}' has been renamed to '{new_name}'.")
    else:
        print(f"Column '{current_name}' does not exist in the DataFrame.")
    return df

# Example usage:
# df = pd.DataFrame({
#     'old_name': [1, 2, 3],
#     'other_column': [4, 5, 6]
# })
# df = rename_column(df, 'old_name', 'new_name')


In [57]:
rename_column(data_explorer.merged_data, 'userId_x', 'user_id')

Column 'userId_x' has been renamed to 'user_id'.


Unnamed: 0,movieId,title,genres,user_id,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0
6,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5
9,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5
12,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5
...,...,...,...,...,...
285778,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184.0,4.0
285779,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184.0,3.5
285780,193585,Flint (2017),Drama,184.0,3.5
285781,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184.0,3.5


### Extract Release Year from Title

In [58]:
# Define a regular expression pattern matching a trailing '(YYYY)' release year
year_pattern = r'\s*\((\d{4})\)\s*$'

# Extract the release year first; titles without a year yield <NA>
data_explorer.merged_data['release_year'] = pd.to_numeric(
    data_explorer.merged_data['title'].str.extract(year_pattern, expand=False),
    errors='coerce'
).astype('Int64')

# Remove the year suffix from the title and trim surrounding whitespace.
# Titles without a year are kept as they are instead of becoming NaN.
data_explorer.merged_data['title'] = (
    data_explorer.merged_data['title']
    .str.replace(year_pattern, '', regex=True)
    .str.strip()
)

In [59]:
data_explorer.merged_data.head()

Unnamed: 0,movieId,title,genres,user_id,rating,release_year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,1995
3,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,1995
6,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,1995
9,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,1995
12,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,1995
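
A quick sanity check after this step is to count the titles that did not yield a release year. The sketch below runs a year extraction on a few toy titles (chosen for illustration, including one without a year and one with trailing whitespace) and reports how many remain unparsed. The pattern used here is anchored to the end of the title, which is an assumption about where the year appears.

```python
import pandas as pd

# Toy titles covering a few edge cases.
titles = pd.Series([
    'Toy Story (1995)',
    'Seven (a.k.a. Se7en) (1995)',  # extra parentheses before the year
    'Babylon 5',                    # no year at all
    'Flint (2017) ',                # trailing whitespace
])

# Capture a four-digit year in parentheses at the end of the title.
years = pd.to_numeric(
    titles.str.extract(r'\((\d{4})\)\s*$', expand=False),
    errors='coerce'
).astype('Int64')

# Strip the year suffix; titles without one are left untouched.
clean_titles = titles.str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True).str.strip()

result = pd.DataFrame({'title': clean_titles, 'release_year': years})
print(result)
print('Titles without a parsed year:', int(result['release_year'].isna().sum()))
```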
