![logo.png](attachment:logo.png)

# BUSINESS UNDERSTANDING

## Overview

StreamFlix aims to implement a movie recommendation system to enhance user experience, increase watch time, and improve overall customer retention. The project will use the MovieLens dataset to build a collaborative filtering model that provides personalized top 5 movie recommendations for each user based on their previous ratings. StreamFlix has also observed that users often struggle to find movies they enjoy, which leads to decreased engagement and potential churn.
By implementing a personalized recommendation system, StreamFlix hopes to:

* Increase user satisfaction and engagement
* Boost average watch time per user
* Improve customer retention rates
* Differentiate itself from competitors in the streaming market

The following approaches were proposed to enable new users to provide their ratings:

+ Initial Survey: When a new user signs up for StreamFlix, they're presented with a quick survey of 20 popular movies across various genres. Users rate at least 10 of these movies on a scale of 1-5 stars.
+ Continuous Rating: After watching a movie, users are prompted to rate it. This can be done through:
    + A pop-up immediately after the movie ends
    + A "Rate this movie" button on the movie's page
    + A dedicated "My Ratings" section in the user's profile
+ Rating Import: Offer users the option to import their ratings from other platforms (e.g., IMDb, Rotten Tomatoes) to quickly build their profile.
+ Gamification: Implement a "Movie Critic" badge system where users earn badges for rating a certain number of movies, encouraging more ratings.




## Business Problem



StreamFlix is facing challenges with user retention and engagement. Users are overwhelmed by the vast library of movies available and often spend considerable time searching for movies they would enjoy. StreamFlix is therefore looking for a way to provide personalized movie recommendations to its users to improve their viewing experience and increase platform usage.



## Objectives

### Main Objective

To develop and deploy a collaborative filtering-based recommendation system that accurately predicts user preferences and provides relevant movie suggestions.


### Specific Objectives

1. To build a collaborative filtering model that uses user ratings to generate top 5 movie recommendations.
2. To address the cold start problem using content-based filtering for new users.
3. To evaluate the recommendation system using appropriate metrics like RMSE and MAP.
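
To make the first objective concrete, here is a minimal sketch of how top 5 recommendations could be generated from ratings alone. It is an illustration, not the project's chosen model: it assumes item-based collaborative filtering with cosine similarity, a long-format ratings table with `userId`, `movieId` and `rating` columns (the layout of the MovieLens `ratings.csv`), and invented toy values. The function name `recommend_top_n` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy ratings in long format; the values are invented for illustration.
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 20, 30, 40, 10, 40],
    "rating":  [5.0, 4.0, 1.0, 4.0, 5.0, 2.0, 5.0, 4.0, 5.0, 2.0],
})


def recommend_top_n(ratings, user_id, n=5):
    """Item-based collaborative filtering with cosine similarity (sketch)."""
    # Build the user-item matrix; unrated cells are filled with 0.
    matrix = ratings.pivot_table(
        index="userId", columns="movieId", values="rating"
    ).fillna(0.0)
    items = matrix.to_numpy()

    # Cosine similarity between item columns.
    norms = np.linalg.norm(items, axis=0)
    norms[norms == 0] = 1.0
    sim = (items.T @ items) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)

    user_vec = matrix.loc[user_id].to_numpy()
    rated = user_vec > 0

    # Predicted score: similarity-weighted average of the user's own ratings.
    weights = sim[:, rated]
    denom = weights.sum(axis=1)
    denom[denom == 0] = 1.0
    scores = (weights @ user_vec[rated]) / denom
    preds = pd.Series(scores, index=matrix.columns)

    # Exclude movies the user has already rated, then keep the n best.
    return preds[~rated].sort_values(ascending=False).head(n)


top = recommend_top_n(ratings, user_id=2, n=5)
print(top)
```

With only four movies in the toy data, user 2 receives fewer than five suggestions. On the full dataset the same call would return up to five. A matrix factorization model could be swapped in behind the same function signature.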



## Success Metrics

1. Root Mean Square Error (RMSE) < 0.9 for rating predictions
2. Mean Average Precision @5 (MAP@5) > 0.3 for recommended movies
3. User engagement increase: 15% boost in average watch time within 3 months of deployment
4. Hit rate: Percentage of times that a recommended movie was actually watched by the user
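
The offline metrics above can be sketched as follows. This is one possible formulation under stated assumptions: average precision is normalized by `min(len(relevant), k)`, and hit rate is read as the share of users for whom at least one recommended movie was watched. The source does not fix either convention. All function names and toy inputs are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def average_precision_at_k(recommended, relevant, k=5):
    """Average precision over the first k recommendations for one user."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    # Assumed normalization: best achievable number of hits within k.
    return score / min(len(relevant), k)


def map_at_k(recs_by_user, relevant_by_user, k=5):
    """Mean of the per-user average precision at k."""
    return float(np.mean([
        average_precision_at_k(recs, relevant_by_user.get(user, ()), k)
        for user, recs in recs_by_user.items()
    ]))


def hit_rate(recs_by_user, watched_by_user, k=5):
    """Share of users with at least one watched movie among their top k."""
    hits = [
        any(movie in set(watched_by_user.get(user, ())) for movie in recs[:k])
        for user, recs in recs_by_user.items()
    ]
    return float(np.mean(hits))


# Invented toy inputs: movie IDs recommended to, and liked or watched by, two users.
recs = {1: [10, 20, 30, 40, 50], 2: [60, 70, 80, 90, 100]}
relevant = {1: [10, 30], 2: [999]}

print(rmse([4, 3, 5], [3.5, 3, 4]))
print(map_at_k(recs, relevant, k=5))
print(hit_rate(recs, relevant, k=5))
```

RMSE and MAP@5 can be computed on a held-out test split. The engagement target and a production hit rate depend on viewing logs collected after deployment.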

# DATA UNDERSTANDING

# DATA PREPROCESSING

This section prepares the data for exploratory data analysis (EDA) and modeling. The first step is to import the modules relevant to this project, then load the dataset into a pandas DataFrame, preview the data, and check for any missing or duplicate values.


In [3]:
# Import relevant libraries

# Data manipulation 
import pandas as pd 
import numpy as np 

# Data visualization
import seaborn as sns 
import matplotlib.pyplot as plt 
%matplotlib inline

# Data modelling
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

# Filter future warnings
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)


In [5]:
class DataExplorer:
    '''
    A class to handle and explore pandas DataFrames.

    Attributes:
        file_path (str): The file path of the csv file to load.
        data (pandas.DataFrame): The loaded DataFrame.

    Methods:
        load_data(): Load the CSV file into a DataFrame.
        get_shape(): Print the number of rows and columns in the DataFrame.
        summarize_info(): Print a summary of the DataFrame columns.
        describe_data(): Print descriptive statistics of the DataFrame.
        display_column_types(): Display numerical and categorical columns.
    '''

    def __init__(self, file_path):
        '''
        Initialize the DataExplorer object.

        Args:
            file_path (str): The file path of the csv file to load.
        '''
        self.file_path = file_path
        self.data = None

    def load_data(self):
        '''
        Load the csv file into a DataFrame.

        Returns:
            None
        '''
        print('Loading movie data csv file...')
        try:
            self.data = pd.read_csv(self.file_path)
            print(f'Dataset loaded successfully from {self.file_path}\n')
        except FileNotFoundError:
            print(f'Error: The file \'{self.file_path}\' was not found.')
        except Exception as e:
            print(f'Error: An unexpected error occurred: {e}')

    def get_shape(self):
        '''
        Print the number of rows and columns in the DataFrame.

        Returns:
            None
        '''
        if self.data is not None:
            rows, columns = self.data.shape
            print(f'The DataFrame has {rows} rows and {columns} columns.\n')
        else:
            print('Error: No data loaded yet. Please call the load_data() method first.')
            

    def summarize_info(self):
        '''
        Print a summary of the DataFrame columns.

        Returns:
            None
        '''
        print('Summarizing the DataFrame info')
        print('-------------------------------')
        if self.data is not None:
            # info() prints its own summary and returns None, so it is not wrapped in print()
            self.data.info()
        else:
            print('Error: No data loaded yet. Please call the load_data() method first.')

    def describe_data(self):
        '''
        Print descriptive statistics of the DataFrame.

        Returns:
            None
        '''
        print('\nDescribing the DataFrame data')
        print('--------------------------------')

        if self.data is not None:
            print(self.data.describe())
        else:
            print('Error: No data loaded yet. Please call the load_data() method first.')
            
    def display_column_types(self):
        '''
        Display numerical and categorical columns.

        Returns:
            None
        '''
        print('\nDisplaying numerical and categorical columns')
        print('-----------------------------------------------')
        if self.data is not None:
            # Convert the column Index objects to plain lists for cleaner printing
            numerical_columns = self.data.select_dtypes(include='number').columns.tolist()
            categorical_columns = self.data.select_dtypes(include='object').columns.tolist()
            print(f'Numerical Columns: {numerical_columns}\n')
            print(f'Categorical Columns: {categorical_columns}\n')
        else:
            print('Error: No data loaded yet. Please call the load_data() method first.')

            

# Instantiate 
file_path = 'movies.csv'
data_explorer = DataExplorer(file_path)

# Load data
data_explorer.load_data()

# Get dimensions
data_explorer.get_shape()

# Summarize info
data_explorer.summarize_info()

# Describe data
data_explorer.describe_data()

# Display numerical and categorical columns
data_explorer.display_column_types()

Loading movie data csv file...
Error: The file 'movies.csv' was not found.
Error: No data loaded yet. Please call the load_data() method first.
Summarizing the DataFrame info
-------------------------------
Error: No data loaded yet. Please call the load_data() method first.

Describing the DataFrame data
--------------------------------
Error: No data loaded yet. Please call the load_data() method first.

Displaying numerical and categorical columns
-----------------------------------------------
Error: No data loaded yet. Please call the load_data() method first.
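
The checks for missing and duplicate values described at the start of this section are not covered by `DataExplorer`. A minimal sketch of how they might look is given here. It assumes the three-column layout of the MovieLens `movies.csv` (`movieId`, `title`, `genres`) and uses an invented toy DataFrame rather than the real file. The helper `quality_report` is hypothetical, and dropping affected rows is only one possible way to handle them.

```python
import pandas as pd

# Toy stand-in for the movies table; the rows are invented for illustration.
movies = pd.DataFrame({
    "movieId": [1, 2, 2, 3],
    "title": ["Movie A (1995)", "Movie B (1995)", "Movie B (1995)", None],
    "genres": ["Comedy", "Drama", "Drama", "Action|Thriller"],
})


def quality_report(df):
    """Count missing values per column and fully duplicated rows."""
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


report = quality_report(movies)
print(report)

# One possible cleanup: drop exact duplicates, then rows with missing values.
cleaned = movies.drop_duplicates().dropna().reset_index(drop=True)
print(cleaned)
```

Whether rows should be dropped or repaired depends on what the real data shows, so the cleanup step is best decided after inspecting the report.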
