
# Business Understanding:
In a data-driven world, businesses aim to personalize user experiences to boost engagement, satisfaction, and revenue. For a movie streaming platform, a recommendation system can provide tailored suggestions based on user preferences and behavior. This project focuses on building a movie recommender system using collaborative filtering, matrix factorization, and a hybrid model to enhance recommendation quality and precision, ultimately improving user experience and platform engagement.

`Business Question:
How can we build a recommendation system to provide personalized movie suggestions to users, improving engagement and satisfaction?`

## Problem Statement:
The movie streaming platform needs a recommendation system that will:

-Use user-item interactions (ratings) and metadata (genres) to predict user preferences and generate personalized movie recommendations.

-Enhance customer satisfaction and retention by providing personalized movie suggestions.

-Increase user engagement, conversion rates, and platform revenue.

-Address the "cold start problem" for new users and movies with limited interaction data.

-Overcome the limitations of current recommendation methods in capturing the dynamics between users and movies, ensuring recommendations are relevant and diverse.

The system must balance accuracy, scalability, and real-time processing to deliver seamless recommendations.

## Objectives:

-Collaborative Filtering: Develop a model that recommends movies based on user-item interactions, such as ratings and viewing history.

-Matrix Factorization: Apply techniques like Singular Value Decomposition (SVD) to uncover latent factors in user-movie relationships, improving predictions.

-Hybrid Model: Combine collaborative filtering and matrix factorization to leverage the strengths of both approaches, improving recommendation accuracy and coverage.

-Cold Start Problem: Address challenges of recommending movies to new users and suggesting newly added movies with limited data.

-Performance Optimization: Ensure the system is scalable and capable of providing real-time, personalized recommendations.

-Model Evaluation: Measure and compare model effectiveness using precision, recall, and F1 score to evaluate the accuracy of the recommendations.
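
As a rough illustration of how these metrics could be computed for a top-k recommender, the sketch below works on plain `(user, item, true rating, estimated rating)` tuples. The function names, the cutoff `k`, and the relevance `threshold` of 3.5 are assumptions chosen for illustration, not values fixed by this project.

```python
from collections import defaultdict


def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return per-user precision@k and recall@k.

    predictions: iterable of (user_id, item_id, true_rating, estimated_rating).
    An item is "relevant" when its true rating is >= threshold, and
    "recommended" when it is among the user's top k estimates and its
    estimate is >= threshold.
    """
    per_user = defaultdict(list)
    for uid, _iid, true_r, est in predictions:
        per_user[uid].append((est, true_r))

    precisions, recalls = {}, {}
    for uid, user_ratings in per_user.items():
        # Rank this user's items by estimated rating, best first
        user_ratings.sort(key=lambda pair: pair[0], reverse=True)
        top_k = user_ratings[:k]
        n_relevant = sum(true_r >= threshold for _, true_r in user_ratings)
        n_recommended = sum(est >= threshold for est, _ in top_k)
        n_hits = sum(est >= threshold and true_r >= threshold
                     for est, true_r in top_k)
        precisions[uid] = n_hits / n_recommended if n_recommended else 0.0
        recalls[uid] = n_hits / n_relevant if n_relevant else 0.0
    return precisions, recalls


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical predictions for a single user
toy = [
    (1, 10, 4.0, 4.2),
    (1, 11, 2.0, 3.9),
    (1, 12, 5.0, 3.0),
    (1, 13, 4.5, 4.8),
]
precisions, recalls = precision_recall_at_k(toy, k=2, threshold=3.5)
f1 = f1_score(precisions[1], recalls[1])
```

In this toy case both of the top two items are relevant, so precision@2 is 1.0, while one relevant item is ranked too low to be recommended, so recall@2 is 2/3.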

## Success Criteria:
-Improved Accuracy: The hybrid model should outperform collaborative filtering and matrix factorization methods in terms of precision, recall, and user satisfaction.

-Cold Start Handling: The system should offer meaningful recommendations for new users and movies with minimal performance drop.

-Scalability: The recommendation system must scale efficiently as the platform grows, ensuring responsiveness even as the number of users and movies increases.

-Increased Engagement: There should be a measurable increase in user interactions with recommended movies (e.g., views, ratings, or watch time).

-Business Impact: The system should contribute to key business metrics such as increased subscriptions, user retention, and overall revenue.

## Reasons and Importance of Models:

#### Collaborative Filtering:

-Reason: Collaborative filtering leverages historical user-item interactions (e.g., ratings, watch history) to make personalized recommendations.

-Importance: It helps identify similar users and suggests movies based on what others with similar tastes have liked, improving user satisfaction and engagement.
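
To make the idea of "users with similar tastes" concrete, here is a minimal user-based sketch in plain Python. It is a simplified stand-in, not the `KNNBaseline` algorithm from Surprise; the toy ratings and the helper names `cosine_similarity` and `predict` are hypothetical.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating}
ratings = {
    "alice": {"Toy Story": 5.0, "Jumanji": 3.0, "Heat": 4.0},
    "bob":   {"Toy Story": 4.5, "Jumanji": 3.5, "Heat": 4.0, "Casino": 4.5},
    "carol": {"Toy Story": 1.0, "Jumanji": 5.0, "Casino": 2.0},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(a[m] ** 2 for m in common))
    norm_b = sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)


def predict(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    numerator = denominator = 0.0
    for other, their_ratings in ratings.items():
        if other == user or movie not in their_ratings:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        numerator += sim * their_ratings[movie]
        denominator += sim
    # None signals that no other user has rated this movie
    return numerator / denominator if denominator else None


estimate = predict("alice", "Casino")
```

Alice's ratings resemble Bob's more than Carol's, so the estimate for "Casino" lands closer to Bob's 4.5 than to Carol's 2.0. Practical implementations usually mean-center ratings or use baseline estimates first, since raw cosine similarity on all-positive ratings tends to be high for every pair.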

#### Matrix Factorization:

-Reason: Matrix factorization techniques like Singular Value Decomposition (SVD) decompose the user-movie interaction matrix into latent factors, revealing hidden relationships between users and movies.

-Importance: It enables the system to make better predictions, especially in sparse datasets (where many user-movie interactions are missing), enhancing recommendation accuracy.
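
The latent-factor idea can be sketched with a plain truncated SVD in NumPy. This is only an illustration under simplifying assumptions: the matrix is a hypothetical toy, unrated cells are filled with 0, and `k = 2` is an arbitrary choice. Surprise's `SVD` class works differently in practice, learning factors and bias terms from the observed ratings only via stochastic gradient descent.

```python
import numpy as np

# Hypothetical 4 users x 5 movies matrix; 0 stands in for "not rated"
R = np.array([
    [5.0, 4.0, 0.0, 1.0, 0.0],
    [4.0, 5.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 5.0, 4.0, 4.0],
    [0.0, 1.0, 4.0, 5.0, 0.0],
])

k = 2  # number of latent factors kept (an assumption for this toy example)

# Thin SVD: R = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(R, full_matrices=False)

user_factors = U[:, :k] * s[:k]      # one k-dimensional vector per user
item_factors = Vt[:k, :]             # one k-dimensional vector per movie
R_hat = user_factors @ item_factors  # rank-k approximation of R

print(np.round(R_hat, 2))
```

Each predicted score in `R_hat` is the dot product of a user vector and a movie vector, which is the sense in which the factors capture hidden relationships between users and movies.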

#### Hybrid Model:

-Reason: A hybrid model combines collaborative filtering and matrix factorization, supplemented by genre metadata, to offset their individual weaknesses. Both techniques depend on rating history, so on their own they struggle with the cold start problem.

-Importance: Hybrid models improve recommendation robustness by integrating multiple techniques, ensuring more diverse and accurate movie recommendations across different user profiles.
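
One simple way such a combination could work is a weighted blend of two rating estimates. The sketch below assumes this blending strategy purely for illustration; the function name `hybrid_estimate` and the default `weight` of 0.5 are hypothetical, and the clipping bounds follow the 0.5 to 5 rating range of the dataset.

```python
def hybrid_estimate(cf_estimate, mf_estimate, weight=0.5, low=0.5, high=5.0):
    """Blend two rating estimates.

    `weight` is the share given to the collaborative filtering estimate;
    the remainder goes to the matrix factorization estimate. The result is
    clipped to the valid rating range [low, high].
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    blended = weight * cf_estimate + (1.0 - weight) * mf_estimate
    return min(high, max(low, blended))


print(hybrid_estimate(3.0, 4.0, weight=0.25))  # 3.75
```

In a real system the weight would normally be tuned on held-out data rather than fixed by hand.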

`By utilizing these models, the movie recommender system will be more effective, engaging, and scalable, ultimately leading to a better user experience and helping the platform achieve its key business goals.`

###### Surprise

- Surprise is a Python library designed for building and evaluating recommender systems. It supports collaborative filtering, matrix factorization, and other algorithms that predict user-item interactions.

In [1]:
!pip install scikit-surprise



In [2]:

import pandas as pd  # data manipulation and analysis
import numpy as np  # multi-dimensional arrays and matrices
import matplotlib.pyplot as plt  # visualization
import seaborn as sns  # enhanced data visualization
from sklearn.metrics import mean_squared_error, mean_absolute_error  # assess model accuracy
from surprise import Reader, Dataset, KNNBaseline, SVD  # recommender system data loading and algorithms
from surprise.model_selection import train_test_split  # split datasets for training/testing
from surprise import accuracy  # evaluate recommender system predictions
from collections import defaultdict  # dictionary with default values for missing keys
from sklearn.feature_extraction.text import TfidfVectorizer  # text vectorization (TF-IDF)
from sklearn.metrics.pairwise import cosine_similarity  # cosine-based similarity computation


###### Loading of the datasets

In [3]:
df2 = pd.read_csv("movies.csv")
df3 = pd.read_csv("ratings.csv")

### Data Understanding
###### MOVIE-dataset
-`Data Overview:` The dataset contains 9,742 rows and 3 columns: movieId (integer), title (string), and genres (string). All columns have non-null values, with movieId being unique for each entry.

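-`Summary Statistics:` The movieId values range from 1 to 193,609 (the mean of about 42,200 carries little meaning for an identifier), and the most frequent genres value is "Drama" (appearing 1,053 times). The dataset has no missing values. No title appears more than twice: with 9,737 unique titles across 9,742 rows, five titles appear twice, for example "Confessions of a Dangerous Mind (2002)" and "Saturn 3 (1980)", the one that `describe()` reports as the top value.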

-`Shape and Uniqueness:` The dataset has a shape of (9742, 3), with movieId containing 9742 unique values, title having 9737 unique values (indicating some duplicate titles), and genres having 951 unique values, each a pipe-separated combination of genres rather than a single genre.

-`No Duplicates:` There are no duplicate movieId values in the dataset, but some movie titles are repeated, suggesting potential cases where the same movie appears with slightly different versions or formats.

###### RATINGS-dataset
-`Data Overview:` The ratings dataset contains 100,836 rows and 4 columns: userId (integer), movieId (integer), rating (float), and timestamp (integer). All columns have non-null values, and the data is structured to track movie ratings by users.

-`Summary Statistics:` The userId ranges from 1 to 610, with a mean value of 326, and the movieId spans from 1 to 193,609, covering a wide range of movies. The average rating is about 3.5 on a 0.5-5 scale, and the mean timestamp is about 1.2 billion seconds since January 1, 1970 (the Unix epoch), which falls in 2008.

-`Shape and Uniqueness:` The dataset has a shape of (100836, 4), with userId containing 610 unique values, movieId containing 9,724 unique movie identifiers, and rating having 10 possible unique rating values. The timestamp column has 85,043 unique values, showing a diverse set of rating times.

-`No Duplicates:` There are no duplicate rows in the dataset, which is consistent with each userId and movieId combination being rated once. The same user can still appear in many rows, one for each movie rated.
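
The figures above also show how sparse the user-movie matrix is. The short calculation below uses only the counts quoted in this section (610 users, 9,724 rated movies, 100,836 ratings) and assumes each user rates a given movie at most once; the variable names are illustrative.

```python
n_users = 610
n_rated_movies = 9_724
n_ratings = 100_836

# Every user could in principle rate every movie
possible_pairs = n_users * n_rated_movies

density = n_ratings / possible_pairs   # share of pairs that have a rating
sparsity = 1.0 - density               # share of pairs with no rating

print(f"possible pairs: {possible_pairs:,}")
print(f"density: {density:.2%}, sparsity: {sparsity:.2%}")
```

Roughly 1.7% of the possible user-movie pairs carry a rating, so about 98.3% of the matrix is empty.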

In [4]:
datasets = {'movies': df2, 'ratings': df3}

for name, df in datasets.items():
    print(f"Dataset: {name}")
    print("\nInfo:")
    print(df.info())
    print("\nDescription:")
    print(df.describe(include='all'))  # Include all types of columns
    print("\nShape:")
    print(df.shape)
    print("\nColumns:")
    print(df.columns.tolist())
    print("\nUnique Values per Column:")
    for col in df.columns:
        print(f"{col}: {df[col].nunique()} unique values")
    print("\nNumber of Duplicates:")
    print(df.duplicated().sum())
    print("\n" + "="*50 + "\n")

Dataset: movies

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
None

Description:
              movieId            title genres
count     9742.000000             9742   9742
unique            NaN             9737    951
top               NaN  Saturn 3 (1980)  Drama
freq              NaN                2   1053
mean     42200.353623              NaN    NaN
std      52160.494854              NaN    NaN
min          1.000000              NaN    NaN
25%       3248.250000              NaN    NaN
50%       7300.000000              NaN    NaN
75%      76232.000000              NaN    NaN
max     193609.000000              NaN    NaN

Shape:
(9742, 3)

Columns:
['movieId', 'title', 'genres']

Uni

In [5]:
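# Preview the movies with duplicate (title, genres) rows dropped.
# drop_duplicates returns a new DataFrame; df2 itself is left unchanged here.
df2.drop_duplicates(subset=['title', 'genres'])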

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


#### Importance of the Datasets
-`Movies Dataset:` The movies dataset contains essential metadata like movieId, title, and genres, enabling the system to categorize and recommend films based on genres or other attributes. This helps provide meaningful movie suggestions for users.

-`Ratings Dataset:` The ratings dataset holds critical user data, including userId, movieId, rating, and timestamp, enabling the system to track user preferences and understand movie perceptions. It forms the core of personalized recommendations by evaluating user ratings.

-`Collaborative Filtering:` The ratings dataset is crucial for collaborative filtering, which identifies patterns and similarities between users or items to generate personalized recommendations. This method leverages user-item interaction data for better movie suggestions.

-`Cold Start Problem:` The movies dataset helps mitigate the cold start problem by offering genre information for new movies or users with limited interaction data, facilitating more accurate recommendations even with sparse data.

-`Hybrid Model:` By combining the genres from the movies dataset with collaborative filtering from the ratings dataset, a hybrid model can generate more precise and diverse recommendations, enhancing the system’s overall performance.

-`System Foundation:` Together, the movies and ratings datasets form the backbone of a robust recommendation system, enabling both content-based and collaborative filtering approaches to deliver accurate, personalized movie recommendations to users.
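
As a small illustration of how genre metadata can help with a movie that has no ratings yet, the sketch below ranks existing titles by genre overlap with a new one. It uses Jaccard similarity on genre sets as a deliberately simple stand-in; the notebook's imports (`TfidfVectorizer`, `cosine_similarity`) point to a TF-IDF approach instead. The helper names are hypothetical, and the catalog entries are taken from the movies preview shown earlier.

```python
def genre_set(genres):
    """Split a pipe-separated genres string into a set of genre names."""
    return set(genres.split("|"))


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


catalog = {
    "Toy Story (1995)": "Adventure|Animation|Children|Comedy|Fantasy",
    "Jumanji (1995)": "Adventure|Children|Fantasy",
    "Grumpier Old Men (1995)": "Comedy|Romance",
    "Waiting to Exhale (1995)": "Comedy|Drama|Romance",
}


def most_similar(new_genres, catalog, top_n=2):
    """Return the top_n (score, title) pairs by genre overlap."""
    target = genre_set(new_genres)
    scored = [(jaccard(target, genre_set(g)), title)
              for title, g in catalog.items()]
    scored.sort(reverse=True)
    return scored[:top_n]


# Genres of a hypothetical newly added, unrated movie
ranked = most_similar("Animation|Comedy|Fantasy", catalog)
print(ranked)
```

Here "Toy Story (1995)" ranks first because it shares three of the new movie's genres. Users who liked the top matches would be natural first candidates for recommending the new title.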

### Merging ratings and movies
-The ratings and movies DataFrames are merged on the movieId column using an inner join, so only movies that have both ratings and metadata (title/genres) are kept.

-The merged DataFrame keeps all 100,836 rating rows, so every rated movie has metadata. Only the 18 movies (9,742 in movies minus 9,724 rated) that have no ratings drop out.

-`Shape:` The resulting DataFrame has 100,836 entries and 6 columns.

-`Columns`: contains columns: `userId, movieId, rating, timestamp, title, and genres`.

-`Data Types:` The columns contain data types: int64 for identifiers, float64 for ratings, and object for movie title and genres.

-`Summary Statistics:` Basic descriptive statistics show a mean rating of about 3.5, with ratings ranging from 0.5 to 5.

-`Value Counts:` Calling `value_counts()` on the full DataFrame counts identical rows. Every row occurs exactly once, giving 100,836 entries that each represent a unique rating event, with a mix of ratings across various movies and genres.

-`Unique Entries:` The data includes various combinations of userId, movieId, rating, and timestamp for different movies, showing no duplicates or repeated entries.

-`Duplicated Entries:` There are no duplicated rows.
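
To see what an inner join keeps and drops, the sketch below runs the same kind of merge on two tiny, hypothetical DataFrames (not the real CSV contents). The outer join with `indicator=True` is used here only as an audit step to show which rows have no match.

```python
import pandas as pd

# Hypothetical toy frames for illustration
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 99],  # movieId 99 has no metadata row
    "rating": [4.0, 3.5, 5.0],
})

# Outer join with indicator=True labels each row as
# "both", "left_only" (rating without metadata) or
# "right_only" (movie without ratings)
audit = pd.merge(toy_ratings, toy_movies, on="movieId",
                 how="outer", indicator=True)
print(audit["_merge"].value_counts())

# Only rows present in both frames survive the inner join
inner = pd.merge(toy_ratings, toy_movies, on="movieId", how="inner")
print(inner.shape)  # (2, 4)
```

Running the same audit on the real ratings and movies frames would be one way to confirm how many rows, if any, the inner join discards.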

In [6]:
# merging of the two dataframes
movies = df2
ratings = df3
merged_df = pd.merge(ratings, movies, on='movieId', how='inner')


In [7]:
# Export the merged data to Excel (pandas needs an Excel writer such as openpyxl installed)
merged_df.to_excel('film_analysis.xlsx', index=False)

In [8]:
# dataframe Summary 
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100836 entries, 0 to 100835
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
 4   title      100836 non-null  object 
 5   genres     100836 non-null  object 
dtypes: float64(1), int64(3), object(2)
memory usage: 5.4+ MB


In [9]:
# Descriptive Statistics
merged_df.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


In [10]:
#dataFrame Dimensions
merged_df.shape

(100836, 6)

In [11]:
# DataFrame columns
merged_df.columns

Index(['userId', 'movieId', 'rating', 'timestamp', 'title', 'genres'], dtype='object')

`userId`: Unique identifier for users.

`movieId`: Unique identifier for movies.

`rating`: Ratings given by users (float).

`timestamp`: Time when the rating was recorded.

`title`: Title of the movie, including release year.

`genres`: Pipe-separated genres for the movie.
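
Three of these columns usually need light parsing before they are useful as features. The sketch below shows one possible way to do it on a small sample built from titles shown earlier; the new column names (`year`, `genre_list`) are hypothetical, the year extraction assumes the title ends with a four-digit year in parentheses, and the two timestamps are the rounded minimum and maximum from the `describe()` output.

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Flint (2017)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Drama"],
})

# Release year: four digits in parentheses at the end of the title.
# Titles that do not match become NaN instead of raising an error.
year_text = sample["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
sample["year"] = pd.to_numeric(year_text, errors="coerce")

# Genres: split the pipe-separated string, then one row per (title, genre)
sample["genre_list"] = sample["genres"].str.split("|")
exploded = sample.explode("genre_list")
genre_counts = exploded["genre_list"].value_counts()

# Timestamps: seconds since the Unix epoch converted to datetimes
stamps = pd.to_datetime(pd.Series([828124600, 1537799000]), unit="s")

print(sample[["title", "year"]])
print(genre_counts)
print(stamps)
```

Exploding the genres makes per-genre counts and averages straightforward, at the cost of repeating each movie once per genre.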

In [12]:
# Count how many times each distinct row (combination of all column values) occurs
merged_df.value_counts()

userId  movieId  rating  timestamp   title                                                                                           genres                                     
610     170875   3.0     1493846415  The Fate of the Furious (2017)                                                                  Action|Crime|Drama|Thriller                    1
227     54259    2.5     1447210634  Stardust (2007)                                                                                 Adventure|Comedy|Fantasy|Romance               1
        55721    5.0     1447210041  Elite Squad (Tropa de Elite) (2007)                                                             Action|Crime|Drama|Thriller                    1
        55820    4.0     1447209881  No Country for Old Men (2007)                                                                   Crime|Drama                                    1
        56367    4.5     1447210824  Juno (2007)                                               

In [13]:
#Duplicated rows
merged_df.duplicated().sum()

0