🎯 Problem Statement:

In the era of digital streaming, users are overwhelmed by an abundance of content options. Netflix, one of the largest streaming platforms, hosts thousands of titles, making it challenging for users to discover movies or shows that align with their preferences. The goal of this project is to build a movie recommendation system using Netflix’s user-rating data. By analyzing past user behaviors and preferences, the system aims to suggest movies that individual users are likely to enjoy.

This project explores collaborative filtering techniques to:

Analyze user-item interactions from a large dataset of ratings.

Identify similar users or items using similarity metrics (e.g., cosine similarity).

Generate personalized recommendations to enhance user experience.
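As a small illustration of the similarity step listed above, the sketch below computes cosine similarity between users on a toy rating matrix. The matrix values and the helper name `cosine_similarity` are assumptions made for this example; they are not part of the Netflix data or of the notebook's pipeline.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: movies); 0 means "not rated".
# The values are invented for illustration only.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

sim_01 = cosine_similarity(ratings[0], ratings[1])
sim_02 = cosine_similarity(ratings[0], ratings[2])
print(f"user 0 vs user 1: {sim_01:.3f}")
print(f"user 0 vs user 2: {sim_02:.3f}")
```

Users 0 and 1 favour the same movies and therefore score much closer to 1 than users 0 and 2, whose tastes are nearly opposite.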

In [None]:
# Importing essential libraries for data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Datasets : https://drive.google.com/drive/folders/1H6BPO2-xBTkCCAjv_QutG-9ph4i3EgYB?usp=sharing

In [None]:
# Loading the Netflix dataset from a zipped file.
# The dataset is read without headers and two columns are named: 'Cust_id' and 'Ratings'.

netflix_dataset = pd.read_csv('/content/drive/MyDrive/Netflix DataSet/Copy of combined_data_1.txt.zip', header=None, names=['Cust_id', 'Ratings'], usecols=[0,1])

# **Brief Explanation of Project**

**SVD (Singular Value Decomposition) powers a recommendation system by finding hidden patterns in user preferences and item similarities. The basic idea, without going deep into the math, is as follows.**

**1) What the system has: a large table (matrix) with users along one axis and items (such as movies) along the other. Users rate items, but no user has rated everything, so most cells are empty.**

**2) What SVD does: it looks at the ratings that are available and learns a small set of hidden (latent) factors that connect users and items, capturing what kind of movies each user tends to like.**

**3) How it helps: once these patterns are learned, the model can predict how a user would rate a movie they haven’t seen yet. The system then recommends the movies with the highest predicted ratings.**

**4) Step-by-Step Implementation of SVD in a Recommendation System**
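Before the full implementation, points 1 to 3 can be sketched on a toy matrix with plain NumPy. Everything here is an assumption made for illustration: the ratings are invented, the gaps are filled with each user's mean, and `k = 2` latent factors is an arbitrary choice. Surprise's `SVD` class learns its factors by stochastic gradient descent on the observed ratings only, so this is an analogy for the idea, not the library's procedure.

```python
import numpy as np

# Toy ratings (rows: users, columns: movies); np.nan marks an unrated movie.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, 5.0, np.nan],
    [np.nan, 1.0, 4.0, 5.0],
])

# Fill the gaps with each user's mean rating so the matrix can be decomposed.
user_means = np.nanmean(R, axis=1, keepdims=True)
filled = np.where(np.isnan(R), user_means, R)

# Decompose the mean-centred matrix and keep the k strongest latent factors.
U, s, Vt = np.linalg.svd(filled - user_means, full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + user_means

# Predicted rating for user 0 on movie 2, which that user has not rated.
print(round(float(approx[0, 2]), 2))
```

The low-rank reconstruction `approx` has a value in every cell, including the ones that were empty, and those values serve as the predicted ratings.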

In [None]:
netflix_dataset

Unnamed: 0,Cust_id,Ratings
0,1:,
1,1488844,3.0
2,822109,5.0
3,885013,4.0
4,30878,4.0
...,...,...
24058258,2591364,2.0
24058259,1791000,2.0
24058260,512536,5.0
24058261,988963,3.0


In [None]:
# Checking for missing values (NaNs) in the dataset.
netflix_dataset.isnull().sum()

Unnamed: 0,0
Cust_id,0
Ratings,4499


In [None]:
# Counting the rows with a missing value in the 'Ratings' column.
# Each such row is a movie-ID header line (e.g. '1:'), so this is the number of movies, not of ratings.
movie_count = netflix_dataset.isnull().sum()
movie_count = movie_count['Ratings']
movie_count

4499

In [None]:
# Counting the unique values in 'Cust_id'. This still includes the movie-ID header rows (e.g. '1:').
total_count = netflix_dataset['Cust_id'].nunique()
total_count

475257

In [None]:
# Estimating the number of actual customers (by subtracting NaNs treated as movie IDs).
customer_count = total_count - movie_count
customer_count

470758

In [None]:
# Estimating the number of ratings (excluding nulls assumed as movie IDs).
rating_count = netflix_dataset['Cust_id'].count() - movie_count
rating_count

24053764

In [None]:
# Displaying the frequency of each rating value to understand rating distribution.
netflix_dataset['Ratings'].value_counts()

Unnamed: 0_level_0,count
Ratings,Unnamed: 1_level_1
4.0,8085741
3.0,6904181
5.0,5506583
2.0,2439073
1.0,1118186


In [None]:
# Build a Movie_id value for every row so that each rating can be tied to its movie

movie_id = None
movie_np = []  # One movie ID per row of the dataset

# Iterate over the raw 'Cust_id' column (still strings at this point)
for i in netflix_dataset["Cust_id"]:
    if ":" in i:
        # A header row such as '1:' starts a new movie: strip the colon and remember its ID
        movie_id = int(i.replace(":", ""))
    movie_np.append(movie_id)  # Every row, header or rating, gets the current movie ID
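The same movie-ID assignment can also be done without a Python loop, which matters on roughly 24 million rows. The sketch below is a possible alternative, not the notebook's method: it runs on a tiny invented frame that mimics the raw layout, and the names `toy` and `is_header` are hypothetical.

```python
import pandas as pd

# Tiny stand-in for the raw file layout: a "movie_id:" header row followed by
# that movie's (customer, rating) rows. The values are invented.
toy = pd.DataFrame({
    "Cust_id": ["1:", "1488844", "822109", "2:", "885013"],
    "Ratings": [None, 3.0, 5.0, None, 4.0],
})

# Header rows are the ones whose ID ends with a colon.
is_header = toy["Cust_id"].str.endswith(":")

# Keep the movie ID on header rows only, then carry it forward to the rating rows below.
toy["Movie_id"] = (
    toy["Cust_id"].where(is_header).str.rstrip(":").ffill().astype(int)
)

# Drop the header rows now that every rating row knows its movie.
toy = toy[~is_header].reset_index(drop=True)
print(toy)
```

The forward fill (`ffill`) plays the role of the `movie_id` variable in the loop: it repeats the most recent header value until the next header appears.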

In [None]:
movie_np

[1,
 1,
 1,
 1,
 1,
 ...]


In [None]:
netflix_dataset["Movie_id"] = movie_np

In [None]:
netflix_dataset

Unnamed: 0,Cust_id,Ratings,Movie_id
0,1:,,1
1,1488844,3.0,1
2,822109,5.0,1
3,885013,4.0,1
4,30878,4.0,1
...,...,...,...
24058258,2591364,2.0,4499
24058259,1791000,2.0,4499
24058260,512536,5.0,4499
24058261,988963,3.0,4499


In [None]:
# Dropping rows with missing values from the dataset.
netflix_dataset.dropna(inplace=True)

In [None]:
netflix_dataset

Unnamed: 0,Cust_id,Ratings,Movie_id
1,1488844,3.0,1
2,822109,5.0,1
3,885013,4.0,1
4,30878,4.0,1
5,823519,3.0,1
...,...,...,...
24058258,2591364,2.0,4499
24058259,1791000,2.0,4499
24058260,512536,5.0,4499
24058261,988963,3.0,4499


In [None]:
netflix_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24053764 entries, 1 to 24058262
Data columns (total 3 columns):
 #   Column    Dtype  
---  ------    -----  
 0   Cust_id   object 
 1   Ratings   float64
 2   Movie_id  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 734.1+ MB


In [None]:
# Change the datatype of cust id from object to int
netflix_dataset['Cust_id'] = netflix_dataset['Cust_id'].astype(int)

In [None]:
netflix_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24053764 entries, 1 to 24058262
Data columns (total 3 columns):
 #   Column    Dtype  
---  ------    -----  
 0   Cust_id   int64  
 1   Ratings   float64
 2   Movie_id  int64  
dtypes: float64(1), int64(2)
memory usage: 734.1 MB


In [None]:
movie_rew_count = netflix_dataset['Movie_id'].value_counts()

In [None]:
movie_rew_count

Unnamed: 0_level_0,count
Movie_id,Unnamed: 1_level_1
1905,193941
2152,162597
3860,160454
4432,156183
571,154832
...,...
4294,44
915,43
3656,42
4338,39


In [None]:
# Pre-filtering
# Remove the users who have rated too few movies
# Remove the movies that have received too few ratings

In [None]:
# Use the 60th percentile of ratings-per-movie as the cutoff (benchmark)
bench_mark = round(movie_rew_count.quantile(0.6), 0)  # Rounded to 0 decimal places; the result is still a float (e.g. 908.0)

In [None]:
bench_mark

908.0

In [None]:
# Exclude every movie that has fewer than 908 ratings
drop_movie_index = movie_rew_count[movie_rew_count < bench_mark].index
drop_movie_index  # Index of the movie IDs that have fewer than 908 ratings

Index([1598, 1733, 1647, 4099, 1616, 1446,  263, 4259,  160, 1988,
       ...
       1858, 4035, 3693, 2805,  820, 4294,  915, 3656, 4338, 4362],
      dtype='int64', name='Movie_id', length=2699)

In [None]:
# How many movies are we going to remove?
len(drop_movie_index)

2699

In [None]:
cust_rew_count = netflix_dataset['Cust_id'].value_counts()

In [None]:
cust_rew_count

Unnamed: 0_level_0,count
Cust_id,Unnamed: 1_level_1
305344,4467
387418,4422
2439493,4195
1664010,4019
2118461,3769
...,...
1300341,1
2550360,1
11848,1
930788,1


In [None]:
bench_mark_cus = round(cust_rew_count.quantile(0.6), 0)

In [None]:
bench_mark_cus

36.0

In [None]:
# Remove every user who has rated fewer than 36 movies
drop_cust_index = cust_rew_count[cust_rew_count < bench_mark_cus].index

In [None]:
drop_movie_index

Index([1598, 1733, 1647, 4099, 1616, 1446,  263, 4259,  160, 1988,
       ...
       1858, 4035, 3693, 2805,  820, 4294,  915, 3656, 4338, 4362],
      dtype='int64', name='Movie_id', length=2699)

In [None]:
drop_cust_index

Index([2194851,  600295, 1739398, 1157368,  532108, 2157249,  256134,  640441,
       1272324, 1346990,
       ...
       1969065,  899932,  611596, 2147176,  811650, 1300341, 2550360,   11848,
        930788,  594210],
      dtype='int64', name='Cust_id', length=282042)

In [None]:
netflix_dataset = netflix_dataset[~netflix_dataset['Movie_id'].isin(drop_movie_index)]
netflix_dataset = netflix_dataset[~netflix_dataset['Cust_id'].isin(drop_cust_index)]
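The percentile-based pre-filtering above can be seen end to end on a toy frame. This is a minimal sketch under stated assumptions: the data is invented, the helper `frequent_values` is hypothetical, and unlike the cells above it does not round the threshold. As in the notebook, both thresholds are computed on the unfiltered data before either filter is applied.

```python
import pandas as pd

# Invented ratings: movie 10 and user "a" are the most active in this toy frame.
toy = pd.DataFrame({
    "Cust_id":  ["a", "a", "a", "b", "b", "c", "d", "a"],
    "Movie_id": [10, 20, 30, 10, 20, 10, 10, 40],
    "Ratings":  [4.0, 3.0, 5.0, 2.0, 4.0, 5.0, 3.0, 1.0],
})

def frequent_values(df, column, q):
    """Return the values of `column` whose row count reaches the q-th quantile of all counts."""
    counts = df[column].value_counts()
    return counts[counts >= counts.quantile(q)].index

# Decide what to keep from the unfiltered data, then apply both filters together.
keep_movies = frequent_values(toy, "Movie_id", 0.6)
keep_users = frequent_values(toy, "Cust_id", 0.6)
filtered = toy[toy["Movie_id"].isin(keep_movies) & toy["Cust_id"].isin(keep_users)]
print(filtered)
```

Here only movies 10 and 20 and users "a" and "b" clear the 60th-percentile cutoff, so four of the eight rows survive.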

In [None]:
netflix_dataset  # Final data after removing the low-activity users and the rarely rated movies

Unnamed: 0,Cust_id,Ratings,Movie_id
696,712664,5.0,3
697,1331154,4.0,3
698,2632461,3.0,3
699,44937,5.0,3
700,656399,4.0,3
...,...,...,...
24056842,1055714,5.0,4496
24056843,2643029,4.0,4496
24056844,267802,4.0,4496
24056845,1559566,3.0,4496


In [None]:
# The data that remains holds 19,695,836 ratings (about 19.7 million)

# **Model Building**

In [None]:
# Load the second dataset, which maps movie IDs to titles (the ratings data only contains movie IDs)
movie_titles = pd.read_csv('/content/drive/MyDrive/Netflix DataSet/Copy of movie_titles.csv', encoding='ISO-8859-1', header = None, names = ['Movie_Id', 'Year', 'Name'], usecols=[0, 1, 2])

In [None]:
movie_titles

Unnamed: 0,Movie_Id,Year,Name
0,1,2003.0,Dinosaur Planet
1,2,2004.0,Isle of Man TT 2004 Review
2,3,1997.0,Character
3,4,1994.0,Paula Abdul's Get Up & Dance
4,5,2004.0,The Rise and Fall of ECW
...,...,...,...
17765,17766,2002.0,Where the Wild Things Are and Other Maurice Se...
17766,17767,2004.0,Fidel Castro: American Experience
17767,17768,2000.0,Epoch
17768,17769,2003.0,The Company


In [None]:
# !pip install numpy==1.23.5

In [None]:
!pip install scikit-surprise



In [None]:
from surprise import SVD, Reader, Dataset
from surprise.model_selection import cross_validate

In [None]:
reader = Reader()

In [None]:
# Work with only the first 100k records to keep the runtime short
# Surprise expects the columns in this order: user ID, item ID, rating
data = Dataset.load_from_df(netflix_dataset[['Cust_id', 'Movie_id', 'Ratings']][:100000], reader)

In [None]:
data

<surprise.dataset.DatasetAutoFolds at 0x7931aca88a90>

In [None]:
model = SVD() # Creating a SVD model

In [None]:
cross_validate(model, data, measures = ['RMSE'], cv=3) # 3-fold cross-validation of the SVD model on the 100k-row sample

{'test_rmse': array([1.01490714, 1.01968288, 1.02276303]),
 'fit_time': (1.5137643814086914, 1.7642238140106201, 2.242418050765991),
 'test_time': (0.26822757720947266, 0.27857065200805664, 0.2857801914215088)}

# **Recommendations**

In [None]:
# Filter the data for one specific user, 1331154, for whom we are going to suggest movies
user_rating = netflix_dataset[netflix_dataset['Cust_id']==1331154]

In [None]:
user_rating

Unnamed: 0,Cust_id,Ratings,Movie_id
697,1331154,4.0,3
5178,1331154,4.0,8
31460,1331154,3.0,18
92840,1331154,4.0,30
224761,1331154,3.0,44
...,...,...,...
23439584,1331154,4.0,4389
23546489,1331154,2.0,4402
23649431,1331154,4.0,4432
23844441,1331154,3.0,4472


# User 1331154 has rated 253 movies


In [None]:
# Make a copy of the movie titles to hold this customer's recommendations
user_1331154 = movie_titles.copy()
user_1331154

Unnamed: 0,Movie_Id,Year,Name
0,1,2003.0,Dinosaur Planet
1,2,2004.0,Isle of Man TT 2004 Review
2,3,1997.0,Character
3,4,1994.0,Paula Abdul's Get Up & Dance
4,5,2004.0,The Rise and Fall of ECW
...,...,...,...
17765,17766,2002.0,Where the Wild Things Are and Other Maurice Se...
17766,17767,2004.0,Fidel Castro: American Experience
17767,17768,2000.0,Epoch
17768,17769,2003.0,The Company


In [None]:
# Remove the rarely rated movies from the second dataset as well
user_1331154 = user_1331154[~user_1331154['Movie_Id'].isin(drop_movie_index)]

In [None]:
user_1331154

Unnamed: 0,Movie_Id,Year,Name
2,3,1997.0,Character
4,5,2004.0,The Rise and Fall of ECW
5,6,1997.0,Sick
7,8,2004.0,What the #$*! Do We Know!?
15,16,1996.0,Screamers
...,...,...,...
17765,17766,2002.0,Where the Wild Things Are and Other Maurice Se...
17766,17767,2004.0,Fidel Castro: American Experience
17767,17768,2000.0,Epoch
17768,17769,2003.0,The Company


In [None]:
# Predict a rating for every remaining movie.
# Surprise falls back to the global mean rating when it saw neither the user nor the movie during training.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
user_1331154 = user_1331154.assign(Estimated=user_1331154['Movie_Id'].apply(lambda x: model.predict(1331154, x).est))


In [None]:
user_1331154

Unnamed: 0,Movie_Id,Year,Name,Estimated
2,3,1997.0,Character,3.585282
4,5,2004.0,The Rise and Fall of ECW,3.585282
5,6,1997.0,Sick,3.585282
7,8,2004.0,What the #$*! Do We Know!?,3.585282
15,16,1996.0,Screamers,3.585282
...,...,...,...,...
17765,17766,2002.0,Where the Wild Things Are and Other Maurice Se...,3.585282
17766,17767,2004.0,Fidel Castro: American Experience,3.585282
17767,17768,2000.0,Epoch,3.585282
17768,17769,2003.0,The Company,3.585282


In [None]:
# Sort by estimated rating, highest first, to see which movies user 1331154 is most likely to enjoy
user_1331154.sort_values('Estimated', ascending = False)

Unnamed: 0,Movie_Id,Year,Name,Estimated
17450,17451,2000.0,Along for the Ride,3.863183
10373,10374,1997.0,Goosebumps: Scary House,3.854691
13510,13511,1993.0,Much Ado About Nothing,3.825953
5748,5749,1997.0,The Woodlanders,3.747195
15217,15218,1998.0,Bear in the Big Blue House: Shapes,3.734392
...,...,...,...,...
6177,6178,2001.0,Underwaterworld Trilogy: Deep Encouters / Ocea...,3.319239
14388,14389,2002.0,The Skulls 2,3.298003
15795,15796,1968.0,The One and Only,3.283021
3320,3321,1999.0,In Dreams,3.277550


In [None]:
user_1331154.sort_values('Estimated', ascending = False).head() # Top 5 recommended movies for user 1331154

Unnamed: 0,Movie_Id,Year,Name,Estimated
17450,17451,2000.0,Along for the Ride,3.863183
10373,10374,1997.0,Goosebumps: Scary House,3.854691
13510,13511,1993.0,Much Ado About Nothing,3.825953
5748,5749,1997.0,The Woodlanders,3.747195
15217,15218,1998.0,Bear in the Big Blue House: Shapes,3.734392
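One refinement the cells above do not perform is dropping titles the user has already rated before picking the top N. The sketch below shows one way this might look. The `Estimated` scores are invented, the helper `top_n_unseen` is hypothetical, and the frames are small stand-ins for `user_1331154` and `user_rating`.

```python
import pandas as pd

# Stand-in for the candidate table; the Estimated values are invented.
candidates = pd.DataFrame({
    "Movie_Id": [3, 5, 6, 8, 16],
    "Name": ["Character", "The Rise and Fall of ECW", "Sick",
             "What the #$*! Do We Know!?", "Screamers"],
    "Estimated": [4.1, 3.2, 3.9, 4.4, 3.6],
})

# Stand-in for the user's rating history (movies 3 and 8 appear in user_rating above).
already_rated = pd.Series([3, 8], name="Movie_id")

def top_n_unseen(candidates, already_rated, n=5):
    """Return the n highest-scored movies that the user has not rated yet."""
    unseen = candidates[~candidates["Movie_Id"].isin(already_rated)]
    return unseen.nlargest(n, "Estimated")

top = top_n_unseen(candidates, already_rated, n=2)
print(top)
```

With the notebook's own frames, the call would presumably be `top_n_unseen(user_1331154, user_rating['Movie_id'])`, since the ratings table spells the column `Movie_id` while the titles table uses `Movie_Id`.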
