# Netflix Movie Recommendation System

**Project Goal:** To build a movie recommendation system using the Netflix dataset. This project will utilize a collaborative filtering approach, specifically the Singular Value Decomposition (SVD) algorithm, to predict movie ratings for a specific user.

**Author:** Shashank S Murthy
**Date:** September 2025

---
## 1. Data Loading and Initial Exploration

First, we load the necessary libraries and the dataset. The Netflix dataset is composed of multiple text files. For this project, we'll work with `combined_data_1.txt`, which contains over 24 million ratings.

The data is in a peculiar format:
- A row with a colon (e.g., `1:`) indicates the Movie ID.
- Subsequent rows are the customer ID and their rating for that movie.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import numpy as np
import pandas as pd

In [3]:
netflix_dataset = pd.read_csv('/content/drive/MyDrive/Netflix/Copy of combined_data_1.txt.zip',header=None,names=['Cust_id','Ratings'],usecols=[0,1])

In [4]:
netflix_dataset

Unnamed: 0,Cust_id,Ratings
0,1:,
1,1488844,3.0
2,822109,5.0
3,885013,4.0
4,30878,4.0
...,...,...
24058258,2591364,2.0
24058259,1791000,2.0
24058260,512536,5.0
24058261,988963,3.0


Let's get a summary of our dataset's size.

In [5]:
# Get a summary of the dataset's characteristics
movie_count = netflix_dataset['Ratings'].isnull().sum()
print(f"Number of movies in the dataset: {movie_count}")

customer_count = netflix_dataset['Cust_id'].nunique() - movie_count
print(f"Number of unique customers: {customer_count}")

rating_count = netflix_dataset['Cust_id'].count() - movie_count
print(f"Total number of ratings: {rating_count}")

Number of movies in the dataset: 4499
Number of unique customers: 470758
Total number of ratings: 24053764


---
## 2. Data Preprocessing and Cleaning

The raw data needs to be transformed into a structured format (`Customer_id`, `Movie_id`, `Rating`).

**Note on Efficiency:** The following cell uses a `for` loop to iterate through all 24 million rows to create the `movie_id` column. While this approach is clear and easy to understand, but it can be very slow.

In [6]:
movie_id = None
movie = []

for customer in netflix_dataset['Cust_id']:
  if ":" in customer:
    movie_id = int(customer.replace(":",""))
  movie.append(movie_id)

In [7]:
netflix_dataset['movie_id'] = movie

In [8]:
netflix_dataset

Unnamed: 0,Cust_id,Ratings,movie_id
0,1:,,1
1,1488844,3.0,1
2,822109,5.0,1
3,885013,4.0,1
4,30878,4.0,1
...,...,...,...
24058258,2591364,2.0,4499
24058259,1791000,2.0,4499
24058260,512536,5.0,4499
24058261,988963,3.0,4499


Now, we remove the rows that contained the movie IDs and convert our data to the correct numerical types.

In [9]:
netflix_dataset.dropna(inplace = True)

In [10]:
netflix_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24053764 entries, 1 to 24058262
Data columns (total 3 columns):
 #   Column    Dtype  
---  ------    -----  
 0   Cust_id   object 
 1   Ratings   float64
 2   movie_id  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 734.1+ MB


In [11]:
netflix_dataset['Cust_id'] = netflix_dataset['Cust_id'].astype(int)

In [12]:
netflix_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24053764 entries, 1 to 24058262
Data columns (total 3 columns):
 #   Column    Dtype  
---  ------    -----  
 0   Cust_id   int64  
 1   Ratings   float64
 2   movie_id  int64  
dtypes: float64(1), int64(2)
memory usage: 734.1 MB


### Filtering Sparse Data
To improve model performance and reduce noise, we will filter out movies with very few ratings and users who have provided very few ratings. We'll set a benchmark at the 60th percentile for both.

In [13]:
#Based on movie review count
movie_review_count = netflix_dataset['movie_id'].value_counts()

In [14]:
movie_review_count

Unnamed: 0_level_0,count
movie_id,Unnamed: 1_level_1
1905,193941
2152,162597
3860,160454
4432,156183
571,154832
...,...
4294,44
915,43
3656,42
4338,39


In [15]:
bench_mark = round(movie_review_count.quantile(0.6))
print(f"Benchmark for the movie review counts : {bench_mark}")

Benchmark for the movie review counts : 908


In [16]:
#grouping the movie ids which has reviews less than benchmark
drop_movie_index = movie_review_count[movie_review_count<bench_mark].index

In [17]:
drop_movie_index

Index([1598, 1733, 1647, 4099, 1616, 1446,  263, 4259,  160, 1988,
       ...
       1858, 4035, 3693, 2805,  820, 4294,  915, 3656, 4338, 4362],
      dtype='int64', name='movie_id', length=2699)

In [18]:
#Based on customer review count
cust_review_count = netflix_dataset['Cust_id'].value_counts()

In [19]:
cust_review_count

Unnamed: 0_level_0,count
Cust_id,Unnamed: 1_level_1
305344,4467
387418,4422
2439493,4195
1664010,4019
2118461,3769
...,...
1300341,1
2550360,1
11848,1
930788,1


In [20]:
bench_mark_cust = round(cust_review_count.quantile(0.6))

In [21]:
#grouping the customer ids who has reviewed less than benchmark
drop_cust_index = cust_review_count[cust_review_count<bench_mark_cust].index
drop_cust_index

Index([2194851,  600295, 1739398, 1157368,  532108, 2157249,  256134,  640441,
       1272324, 1346990,
       ...
       1969065,  899932,  611596, 2147176,  811650, 1300341, 2550360,   11848,
        930788,  594210],
      dtype='int64', name='Cust_id', length=282042)

In [22]:
#Removing movie ids which has less reviews than benchmark from the dataset
netflix_dataset = netflix_dataset[~netflix_dataset['movie_id'].isin(drop_movie_index)]

In [23]:
#Removing customer ids who has reviewed less than benchmark from the dataset
netflix_dataset = netflix_dataset[~netflix_dataset['Cust_id'].isin(drop_cust_index)]

In [24]:
#final dataset for model building
netflix_dataset

Unnamed: 0,Cust_id,Ratings,movie_id
696,712664,5.0,3
697,1331154,4.0,3
698,2632461,3.0,3
699,44937,5.0,3
700,656399,4.0,3
...,...,...,...
24056842,1055714,5.0,4496
24056843,2643029,4.0,4496
24056844,267802,4.0,4496
24056845,1559566,3.0,4496


---
## 3. Building the Recommendation Model

Now that the data is clean, we'll use the `scikit-surprise` library to build our collaborative filtering model. We first need to address a version incompatibility with NumPy.

**Note:** To keep training time manageable for this demonstration, we will train the model on the first 100,000 rows of the filtered data.

In [25]:
# The surprise library requires NumPy < 2.0. We will install the correct version.
# You must restart the runtime after this step for it to take effect.
!pip install "numpy<2"



In [26]:
!pip install scikit-surprise

Collecting scikit-surprise
  Using cached scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... [?25l[?25hdone
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp312-cp312-linux_x86_64.whl size=2611307 sha256=c70769b97c464c1c1d604ba50f33a51544720c2d9ce95ba951cf1e22aac3c00c
  Stored in directory: /root/.cache/pip/wheels/75/fa/bc/739bc2cb1fbaab6061854e6cfbb81a0ae52c92a502a7fa454b
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.4


In [27]:

# After restarting the runtime, we can import the necessary modules.
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate

In [28]:

# Prepare the data for the Surprise library
reader = Reader()
data = Dataset.load_from_df(netflix_dataset[['movie_id','Cust_id','Ratings']][:100000],reader)

In [29]:
# Use the SVD algorithm
model = SVD()

In [30]:
# Perform 3-fold cross-validation to evaluate the model
cross_validate(model, data, measures=['RMSE'], cv=3)

{'test_rmse': array([1.02713917, 1.0157789 , 1.01415319]),
 'fit_time': (1.358821153640747, 1.3894798755645752, 1.5876071453094482),
 'test_time': (0.1491377353668213, 0.34366822242736816, 0.23902249336242676)}

---
## 4. Generating Recommendations for a Specific User

With a trained model, we can now predict ratings for movies that a specific user has not yet seen. This allows us to generate a personalized list of top recommendations.

In [31]:
# First, load the movie titles data to map IDs to names.
movie_title = pd.read_csv('/content/drive/MyDrive/Netflix/Copy of movie_titles.csv',encoding = 'ISO-8859-1',header=None,names=['Movie_id','Year','Name'],usecols=[0,1,2])

In [32]:
# Select an example user and find all movies they have rated.
user_rating = netflix_dataset[netflix_dataset['Cust_id']==1331154]

In [33]:
user_rating

Unnamed: 0,Cust_id,Ratings,movie_id
697,1331154,4.0,3
5178,1331154,4.0,8
31460,1331154,3.0,18
92840,1331154,4.0,30
224761,1331154,3.0,44
...,...,...,...
23439584,1331154,4.0,4389
23546489,1331154,2.0,4402
23649431,1331154,4.0,4432
23844441,1331154,3.0,4472


In [34]:
# Create a DataFrame of movies that this user has NOT rated.
user_1331154 = movie_title.copy()
user_1331154 = user_1331154[~user_1331154['Movie_id'].isin(drop_movie_index)]

In [35]:
# Predict the rating for each unrated movie
est = []
for x in user_1331154['Movie_id']:
  temp = model.predict(1331154,x).est
  est.append(temp)

In [36]:
user_1331154['Estimated'] = est

In [37]:
user_1331154

Unnamed: 0,Movie_id,Year,Name,Estimated
2,3,1997.0,Character,3.582822
4,5,2004.0,The Rise and Fall of ECW,3.582822
5,6,1997.0,Sick,3.582822
7,8,2004.0,What the #$*! Do We Know!?,3.582822
15,16,1996.0,Screamers,3.582822
...,...,...,...,...
17765,17766,2002.0,Where the Wild Things Are and Other Maurice Se...,3.582822
17766,17767,2004.0,Fidel Castro: American Experience,3.582822
17767,17768,2000.0,Epoch,3.582822
17768,17769,2003.0,The Company,3.582822


### Top Movie Recommendations
Finally, we sort the DataFrame by the predicted rating to see the top recommendations for this user.

In [38]:
user_1331154 = user_1331154.sort_values('Estimated',ascending=False)

In [39]:
user_1331154.head()

Unnamed: 0,Movie_id,Year,Name,Estimated
13874,13875,1982.0,Gilbert and Sullivan: The Mikado,3.867331
15673,15674,1999.0,Arlington Road,3.825053
12875,12876,2000.0,Rat,3.793301
10373,10374,1997.0,Goosebumps: Scary House,3.783407
7132,7133,2002.0,Baby Shakespeare: World of Poetry,3.782759


---
## 6. Conclusion & Future Work

This project successfully demonstrated a complete, albeit basic, recommendation system pipeline. We preprocessed a large, raw dataset and used the SVD algorithm to generate personalized movie recommendations for a specific user.

**Potential Improvements:**
- **Use More Data:** Train the model on the full filtered dataset (~20 million ratings) instead of a small sample to drastically improve accuracy.
- **Hyperparameter Tuning:** Use `GridSearchCV` to find the optimal parameters for the SVD model.
- **Optimize Code:** Replace the slow Python loops with faster, vectorized functions for both data parsing and prediction generation.
- **Explore Other Algorithms:** Test other collaborative filtering methods (e.g., SVD++, KNN) or explore content-based and hybrid models.
- **Deployment:** For a real-world application, this model would be best deployed as a web API using a framework like Flask or FastAPI.

In [40]:
from surprise import dump

# Save the trained model to a file
dump.dump('recommendation_model.pkl', algo=model)