<a href="https://colab.research.google.com/github/Shrutika-Prabhulkar/Python-notebooks/blob/master/Netflix_Movie_Recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Netflix Movie Recommendation System**

### **Introduction**

In today's digital age, streaming platforms like Netflix have revolutionized the way we consume media. With an extensive library of movies and TV shows, Netflix offers an unparalleled variety of content. However, this abundance of choices can be overwhelming for users, making it challenging to decide what to watch next. To enhance user experience and keep viewers engaged, Netflix employs sophisticated recommendation systems that suggest content tailored to individual preferences.

The core objective of a movie recommendation system is to predict and recommend movies that a user is likely to enjoy based on their past behavior and the behavior of similar users. This personalized approach not only improves user satisfaction but also increases the time users spend on the platform, thus boosting the platform's overall success.

### **How the Recommendation System Works**

Our Netflix Movie Recommendation System uses user-based collaborative filtering, a widely used technique for building recommendation systems. Collaborative filtering operates on the principle that users who have agreed in the past are likely to agree in the future. By analyzing the preferences of a large number of users, we can identify patterns and similarities that help predict what an individual user might like.
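To make "users who have agreed in the past" concrete, here is a minimal pure-Python sketch of a Pearson similarity computed over the movies two users have both rated. The function name `pearson`, the users, and their ratings are hypothetical and are not part of the notebook; the notebook itself relies on the pandas corr() method for this step.

```python
from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson correlation over co-rated movies; None when it is undefined."""
    shared = sorted(set(ratings_a) & set(ratings_b))
    if len(shared) < 2:
        return None  # need at least two co-rated movies
    xs = [ratings_a[m] for m in shared]
    ys = [ratings_b[m] for m in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a user who gives every movie the same rating
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings: movie title -> rating
alice = {"Heat": 5.0, "Jumanji": 2.0, "Sabrina": 1.0, "Toy Story": 4.0}
bob = {"Heat": 4.5, "Jumanji": 2.5, "Sabrina": 1.5, "GoldenEye": 3.0}
carol = {"Heat": 1.0, "Jumanji": 4.0, "Sabrina": 5.0}

sim_ab = pearson(alice, bob)    # similar tastes: close to +1
sim_ac = pearson(alice, carol)  # opposite tastes: close to -1
print(sim_ab, sim_ac)
```

Only co-rated movies enter the calculation, so a high score based on very few shared movies should be treated with caution.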

### **Steps Involved in Building the Recommendation System**

1. **Data Collection and Preparation**:
   - We start by collecting data from two primary sources: a list of movies with their details and user ratings for these movies. This data is typically stored in CSV files.
   - The datasets are then merged to create a comprehensive view of each movie's ratings and the users who rated them.

2. **Data Cleaning**:
   - Data often contains missing values and inconsistencies that need to be addressed. We clean the data by removing any rows with missing values to ensure our analysis is accurate.

3. **Filtering Movies**:
   - Not all movies have enough ratings to provide meaningful recommendations. We filter out movies with too few ratings (50 or fewer in this notebook) to focus on those that have sufficient data.

4. **Creating User-Movie Matrix**:
   - We create a matrix where rows represent users, columns represent movies, and the values are the ratings given by the users. This matrix is essential for calculating similarities between users.

5. **Identifying Similar Users**:
   - By analyzing the user-movie matrix, we identify users with similar movie-watching patterns. This is done by calculating the correlation between users' ratings.

6. **Generating Recommendations**:
   - Once we have identified the users most similar to the target user, we aggregate their ratings to recommend movies that the target user has not yet watched but is likely to enjoy based on similar users' preferences.

### **Benefits of the Recommendation System**

- **Personalized Experience**: Users receive recommendations tailored to their unique tastes, increasing their satisfaction and engagement.
- **Discover New Content**: Users can discover movies and TV shows they might not have found otherwise, enhancing their viewing experience.
- **User Retention**: By consistently providing content that users enjoy, the platform can maintain and grow its user base.



### **Code Block 1: Import Libraries and Upload Data**

In this step, we start by importing the pandas library, which is essential for data manipulation and analysis. We also use the files.upload() function from Google Colab to upload our datasets. This function opens a dialog box that allows us to upload files directly from our local machine. Once the files are uploaded, we read them into pandas DataFrames using pd.read_csv(). By displaying the first 10 rows of each dataset with head(10), we get an initial look at the structure and contents of our data. Printing the column names helps us verify that the data has been loaded correctly and that the necessary columns are present.



In [None]:
import pandas as pd
from google.colab import files

# Upload the datasets: the dialog opens twice, once for movies.csv
# and once for ratings.csv
uploaded = files.upload()
uploaded = files.upload()

# Load the datasets
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

# Display the first 10 rows of each dataset
print("Movies DataFrame:")
print(movies.head(10))
print("\nRatings DataFrame:")
print(ratings.head(10))

# Display columns of each DataFrame to confirm structure
print("\nMovies DataFrame Columns:", movies.columns)
print("\nRatings DataFrame Columns:", ratings.columns)


Saving movies.csv to movies (2).csv


Saving ratings.csv to ratings (1).csv
Movies DataFrame:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   
5        6                         Heat (1995)   
6        7                      Sabrina (1995)   
7        8                 Tom and Huck (1995)   
8        9                 Sudden Death (1995)   
9       10                    GoldenEye (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
5                        Action|Crime|Thriller  
6                               Comedy|Romance  
7

### **Code Block 2: Merge Datasets and Display Information**

Here, we perform a merge operation on the movies and ratings DataFrames. This is done using the merge() function, specifying movieId as the key and using a left join so that all movies are included even if they have no ratings. The combined DataFrame df now contains both movie details and user ratings; movies without ratings get NaN in the userId, rating, and timestamp columns, which is why userId and timestamp appear as floats. We then print the first few rows of the merged DataFrame to inspect the result. The shape attribute tells us the number of rows and columns, while info() summarizes the DataFrame, including the data type of each column and its non-null count. This information helps us understand the structure and completeness of our merged dataset.
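The effect of the left join can be seen on a toy example first. This is a sketch under stated assumptions: the two small DataFrames are hypothetical stand-ins for movies.csv and ratings.csv, "Unrated Film (2000)" is an invented title, and the `indicator=True` option is an addition that the notebook does not use.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (hypothetical rows)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Unrated Film (2000)"],
})
toy_ratings = pd.DataFrame({
    "userId": [1, 2, 1],
    "movieId": [1, 1, 2],
    "rating": [4.0, 3.5, 5.0],
})

# indicator=True adds a `_merge` column recording where each row came from
merged = toy_movies.merge(toy_ratings, how="left", on="movieId", indicator=True)
print(merged)

# Movies with no ratings survive the left join with NaN in the rating columns
unrated = merged.loc[merged["_merge"] == "left_only", "title"].tolist()
print("Unrated movies:", unrated)

# The NaN values force the integer userId column to become float
print(merged["userId"].dtype)
```

With an inner join the unrated movie would simply disappear, which is also what the later dropna() step achieves.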

In [None]:
# Merge the datasets
df = movies.merge(ratings, how='left', on='movieId')

# Display the merged DataFrame
print("\nMerged DataFrame:")
print(df.head(5))

# Display the shape and info of the merged DataFrame
print("\nShape of Merged DataFrame:")
print(df.shape)
print("\nInfo of Merged DataFrame:")
df.info()  # info() prints its summary directly and returns None

# Display columns of the merged DataFrame to confirm structure
print("\nMerged DataFrame Columns:", df.columns)



Merged DataFrame:
   movieId             title                                       genres  \
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   

   userId  rating     timestamp  
0     1.0     4.0  9.649827e+08  
1     5.0     4.0  8.474350e+08  
2     7.0     4.5  1.106636e+09  
3    15.0     2.5  1.510578e+09  
4    17.0     4.5  1.305696e+09  

Shape of Merged DataFrame:
(100854, 6)

Info of Merged DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100854 entries, 0 to 100853
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   movieId    100854 non-null  int64  
 1   title    

### **Code Block 3: Drop Missing Values and Filter Common Movies**

Data cleaning is a crucial step in any data analysis process. Here, we use dropna() to remove any rows with missing values, which also removes the movies that received no ratings in the left join. We then count how many times each movie has been rated using value_counts() and store this in a new DataFrame rating_counts. To make the data easier to work with, we reset the index and rename the columns to 'title' and 'count'. Movies that have been rated 50 times or fewer are considered rare; we identify these movies and filter them out of our dataset. The remaining rows are stored in common_movies. We pivot this data with pivot_table() to create a user-movie matrix user_movie_df. This matrix has users as rows, movies as columns, and ratings as values (NaN where a user has not rated a movie), which is a useful format for similarity calculations.
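The count-filter-pivot sequence described above can be sketched on a few hypothetical rows. The titles "A", "B", "C" and the threshold `min_ratings = 2` are assumptions chosen so the toy data stays small; the notebook keeps movies with more than 50 ratings.

```python
import pandas as pd

# Hypothetical long-format ratings (one row per user/movie pair)
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3, 3],
    "title":  ["A", "B", "A", "B", "A", "B", "C"],
    "rating": [4.0, 3.0, 5.0, 2.0, 3.5, 4.5, 1.0],
})

min_ratings = 2  # assumed toy threshold, not the notebook's value

# Count ratings per title and keep the titles that reach the threshold
counts = toy["title"].value_counts()
popular_titles = counts[counts >= min_ratings].index
popular = toy[toy["title"].isin(popular_titles)]

# Rows = users, columns = movies, cells = ratings (NaN where unrated)
matrix = popular.pivot_table(values="rating", index="userId", columns="title")
print(matrix)
```

Note that pivot_table() aggregates with the mean by default, so a user who rated the same title more than once would end up with the average of those ratings in a single cell.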

In [None]:
# Drop missing values
df.dropna(inplace=True)

# Confirm the presence of 'title' column
print("\nColumns after dropping missing values:", df.columns)

# Calculate rating counts
rating_counts = pd.DataFrame(df['title'].value_counts())
rating_counts.reset_index(inplace=True)
rating_counts.columns = ['title', 'count']
print("\nRating Counts DataFrame:")
print(rating_counts.head())

# Select movies which have 50 or fewer ratings
rare_movies = rating_counts[rating_counts['count'] <= 50]['title']
print("\nRare Movies:")
print(rare_movies)

# Find common movies
common_movies = df[~df['title'].isin(rare_movies)]

# Confirm the presence of 'title' column in common_movies DataFrame
print("\nColumns in common_movies DataFrame:", common_movies.columns)

# Pivot table to form the user-movie matrix for similarity calculation
user_movie_df = common_movies.pivot_table(values='rating', index='userId', columns='title')
print("\nUser-Movie DataFrame:")
print(user_movie_df.head())



Columns after dropping missing values: Index(['movieId', 'title', 'genres', 'userId', 'rating', 'timestamp'], dtype='object')

Rating Counts DataFrame:
                              title  count
0               Forrest Gump (1994)    329
1  Shawshank Redemption, The (1994)    317
2               Pulp Fiction (1994)    307
3  Silence of the Lambs, The (1991)    279
4                Matrix, The (1999)    278

Rare Movies:
437                                          Crash (2004)
438     Harry Potter and the Deathly Hallows: Part 2 (...
439                                    Scary Movie (2000)
440                                You've Got Mail (1998)
441                             The Imitation Game (2014)
                              ...                        
9714                We're Back! A Dinosaur's Story (1993)
9715                             American Hardcore (2006)
9716                             Shanghai Surprise (1986)
9717                               Let's Get Harry (1

### **Code Block 4: Select a Random User and Get Watched Movies**

To personalize the movie recommendations, we start by selecting a random user from our dataset. We achieve this using the sample() method, which randomly picks one user ID from the matrix index; random_state=20 makes the choice reproducible. We then filter the user-movie matrix to get the row for this user, storing it in random_user_df. The columns that hold a rating in this row are the movies the selected user has watched, and we collect their titles in the list movies_watched. We then extract a subset of the user-movie matrix (movies_watched_df) that includes only those movies. Finally, we count, for each user, how many of these movies they have also rated, which helps in identifying users with similar viewing habits.
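The per-user overlap count is easier to follow on a tiny matrix. The sketch below uses hypothetical users 1 to 3 and movies "A" to "D"; the variable names `matrix`, `target`, `watched`, and `overlap` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical user-movie matrix (NaN = not rated)
matrix = pd.DataFrame(
    {
        "A": [4.0, 5.0, np.nan],
        "B": [3.0, np.nan, 4.5],
        "C": [np.nan, 2.0, 1.0],
        "D": [5.0, 4.0, np.nan],
    },
    index=pd.Index([1, 2, 3], name="userId"),
)
target = 1

# Movies the target user has rated
watched = matrix.loc[target].dropna().index.tolist()

# For every user: how many of the target's movies they have also rated
overlap = matrix[watched].notna().sum(axis=1)
print(watched)
print(overlap)
```

Summing along `axis=1` gives one count per user, which matches what the transposed sum in the notebook cell computes.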

In [None]:
# Randomly select a user (random_state fixes the choice for reproducibility)
random_user = int(pd.Series(user_movie_df.index).sample(1, random_state=20).iloc[0])

# Filter the user's data
random_user_df = user_movie_df[user_movie_df.index == random_user]
movies_watched = random_user_df.columns[random_user_df.notna().any()].tolist()
print("\nNumber of movies watched by the random user:", len(movies_watched))

# Create a DataFrame for movies watched by the random user
movies_watched_df = user_movie_df[movies_watched]

# Count, for each user, how many of the random user's movies they have rated
user_movie_count = movies_watched_df.T.notnull().sum()
user_movie_count = user_movie_count.reset_index()
user_movie_count.columns = ['userId', 'movie_count']
print("\nUser-Movie Count DataFrame:")
print(user_movie_count.head())



Number of movies watched by the random user: 47

User-Movie Count DataFrame:
   userId  movie_count
0     1.0           14
1     2.0            1
2     3.0            1
3     4.0            5
4     5.0           20




### **Code Block 5: Find Similar Users and Create Final DataFrame**

In this step, we aim to find users who have a significant overlap in their movie-watching history with our selected user. We filter for users who have rated more than 20 of the same movies as our selected user, storing these user IDs in users_same_movies. We then create a final DataFrame final_df that contains the ratings of the common movies by these similar users together with the selected user's own ratings. Because the selected user has rated all of these movies, they already pass the overlap filter, so concatenating random_user_df adds their row a second time. This DataFrame serves as the basis for calculating user similarity, as it contains the ratings needed to compare users' preferences.
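One thing to watch for in this step is a repeated row for the selected user when their ratings are concatenated onto a filtered matrix that already contains them. The following sketch shows how such a repeat can be detected and removed; the toy data and the names `matrix`, `overlap`, `stacked`, and `deduped` are hypothetical.

```python
import pandas as pd

# Hypothetical ratings restricted to the target user's movies
matrix = pd.DataFrame(
    {"A": [4.0, 5.0, 3.0], "B": [3.0, 4.0, 2.0]},
    index=pd.Index([1, 2, 3], name="userId"),
)
# Hypothetical overlap counts with the target user (user 1)
overlap = pd.Series([2, 2, 1], index=matrix.index, name="movie_count")
target = 1

# The target user always passes their own overlap filter
similar_ids = overlap[overlap > 1].index

# Appending the target's row again therefore repeats it
stacked = pd.concat([matrix[matrix.index.isin(similar_ids)], matrix.loc[[target]]])
has_duplicates = bool(stacked.index.duplicated().any())
print("Repeated user rows:", has_duplicates)

# Keep the first occurrence of every userId
deduped = stacked[~stacked.index.duplicated(keep="first")]
print(deduped)
```

An alternative design is to skip the concatenation altogether and use the filtered matrix directly, since it already includes the selected user.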

In [None]:
# Select users who rated more than 20 of the same movies as the selected user
users_same_movies = user_movie_count[user_movie_count['movie_count'] > 20]['userId']

# Create the final DataFrame
final_df = pd.concat([movies_watched_df[movies_watched_df.index.isin(users_same_movies)],
                      random_user_df[movies_watched]])
print("\nFinal DataFrame:")
print(final_df.head())
len(final_df)




Final DataFrame:
title   Ace Ventura: Pet Detective (1994)  \
userId                                      
6.0                                   3.0   
8.0                                   NaN   
14.0                                  2.0   
18.0                                  2.5   
19.0                                  2.0   

title   Ace Ventura: When Nature Calls (1995)  American President, The (1995)  \
userId                                                                          
6.0                                       2.0                             4.0   
8.0                                       NaN                             4.0   
14.0                                      1.0                             NaN   
18.0                                      NaN                             NaN   
19.0                                      2.0                             NaN   

title   Apollo 13 (1995)  Batman (1989)  Beauty and the Beast (1991)  \
userId                    

107

### **Code Block 6: Calculate Correlations**

We calculate the correlation between users based on their movie ratings to identify users with similar tastes. final_df is transposed so that users become columns, and the corr() method then computes the Pearson correlation coefficient for every pair of users. The result is unstacked into a Series indexed by user pairs and sorted, and drop_duplicates() is applied to remove repeated entries. Note that drop_duplicates() compares correlation values rather than user pairs, so it keeps only one of the symmetric entries (a, b) and (b, a) and can also drop unrelated pairs that happen to share the same value. We convert this Series into a DataFrame and reset the index for easier manipulation. We then keep only the rows whose first user is our selected user, excluding the correlation of the user with themselves. Sorting these correlations in descending order gives up to 50 users with the highest similarity scores.
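As an alternative to unstacking every pair, the selected user's correlations can be read directly from one column of the correlation matrix. This is a sketch of that variant, not the notebook's method: the toy matrix, the user IDs, and the choice `min_periods=3` (a minimum number of co-rated movies) are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows = users, columns = movies
ratings_matrix = pd.DataFrame(
    {
        "A": [5.0, 4.5, 1.0, 5.0],
        "B": [2.0, 2.5, 4.0, np.nan],
        "C": [1.0, 1.5, 5.0, np.nan],
        "D": [4.0, np.nan, 2.0, 3.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)
target = 1

# After the transpose users are columns; pairs with fewer than
# min_periods co-rated movies get NaN instead of a correlation
user_corr = ratings_matrix.T.corr(min_periods=3)

# One column holds every correlation involving the target user
similar = (
    user_corr[target]
    .drop(index=target)   # remove the user's correlation with themselves
    .dropna()             # remove users with too little overlap
    .sort_values(ascending=False)
)
print(similar)
```

Reading a single column avoids depending on which of the two symmetric entries survives a deduplication step, and `similar.head(50)` would give the top 50 neighbours.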

In [None]:
# Calculate correlations
corr_df = final_df.T.corr().unstack().sort_values().drop_duplicates()
print("\nCorrelation DataFrame:")
print(corr_df.head())

# Convert the Series to a DataFrame with named user columns
corr_df = pd.DataFrame(corr_df, columns=["correlation"])
corr_df.index.names = ['userId1', 'userId2']
corr_df = corr_df.reset_index()

# Keep rows where userId1 is the selected user, then sort by correlation
sorted_corr_df = corr_df[(corr_df['userId1'] == random_user) & (corr_df['userId2'] != random_user)].sort_values(by='correlation', ascending=False)

# Find top 50 similar users
top_50_users = sorted_corr_df.head(50)
print("\nTop 50 Similar Users:")
print(top_50_users)



Correlation DataFrame:
userId  userId
56.0    604.0    -0.819092
580.0   43.0     -0.810163
524.0   483.0    -0.749797
483.0   468.0    -0.670774
133.0   604.0    -0.666667
dtype: float64

Top 50 Similar Users:
      userId1  userId2  correlation
5537    121.0     40.0     0.725983
5236    121.0    239.0     0.600556
5230    121.0    425.0     0.597800
5022    121.0    182.0     0.552149
4978    121.0    489.0     0.542107
4977    121.0    489.0     0.542107
4963    121.0    404.0     0.538894
4749    121.0    179.0     0.497709
4748    121.0    179.0     0.497709
4580    121.0    288.0     0.469035
4520    121.0    373.0     0.458635
4335    121.0    448.0     0.432394
4334    121.0    448.0     0.432394
4286    121.0    480.0     0.426274
4256    121.0    455.0     0.422398
4133    121.0    330.0     0.408101
4132    121.0    330.0     0.408101
4070    121.0     94.0     0.401256
4018    121.0    305.0     0.394278
4013    121.0     28.0     0.394055
3981    121.0    411.0     0.390

### **Code Block 7: Recommend Movies**

Finally, we recommend movies to our selected user based on the preferences of the top 50 similar users. We merge the top 50 users' data with the original ratings DataFrame to get the ratings given by these users. By calculating the average rating for each movie, we identify which movies are highly rated by the similar users. We merge this data with the movies DataFrame to include movie titles in our recommendations. Sorting the movies by their average ratings in descending order allows us to recommend the best-rated movies to our selected user. Because this is an unweighted mean, a movie rated 5.0 by a single similar user ranks alongside one rated highly by many, and movies the selected user has already rated are not yet excluded. We print the top 10 recommended movies, providing personalized suggestions based on the collective preferences of similar users.
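A common refinement of the plain average is to weight each neighbour's rating by their correlation with the selected user and to require a minimum number of raters. The sketch below illustrates that idea on hypothetical data; it is not what the notebook computes, and the names `neighbours`, `min_raters`, and the column names are assumptions. It also assumes all correlations used as weights are positive.

```python
import pandas as pd

# Hypothetical neighbours with their correlation to the target user
neighbours = pd.DataFrame({"userId": [2, 3, 4], "correlation": [0.9, 0.6, 0.3]})
toy_ratings = pd.DataFrame({
    "userId":  [2, 2, 3, 3, 4],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [5.0, 3.0, 4.0, 2.0, 5.0],
})

# Attach each neighbour's correlation to their ratings
scored = neighbours.merge(toy_ratings, on="userId")
scored["weighted"] = scored["correlation"] * scored["rating"]

# Per movie: sum of weights, sum of weighted ratings, number of raters
summary = scored.groupby("movieId").agg(
    weight_sum=("correlation", "sum"),
    weighted_sum=("weighted", "sum"),
    n_raters=("rating", "count"),
)
summary["score"] = summary["weighted_sum"] / summary["weight_sum"]

min_raters = 2  # assumed threshold to avoid single-rating winners
ranked = summary[summary["n_raters"] >= min_raters].sort_values("score", ascending=False)
print(ranked[["score", "n_raters"]])
```

Here movie 30 has a perfect rating from one weakly correlated neighbour and is dropped by the `min_raters` filter, while movie 10 ranks first with a weighted score of 4.6.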







In [None]:
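# Scratch example: predicting a rating by averaging similar users' ratings
#
# 1 2 3 4 5 6 7
#
# Rows read as (user, movie, rating):
#   1  5  4.5
#   2  5  3.2
#   3  5  5
#   4  5  3
#   6
#   7
#
# Movie 5: (4.5 + 3.2 + 5 + 3) / 4 = 15.7 / 4 ≈ 3.9 -> 4
#
# Predicted rating per movie:
#   5 -> 4
#   6 -> 3
#   7 -> 5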


In [None]:
# Merge with ratings DataFrame to get ratings of these users
top_50_ratings = top_50_users.merge(ratings, left_on='userId2', right_on='userId')

# Calculate the average rating of every movie by all the top users
movie_recommendation = top_50_ratings.groupby('movieId').agg({'rating': 'mean'}).reset_index()

# Merge with movies DataFrame to get movie titles
movie_recommendation = movie_recommendation.merge(movies, on='movieId')

# Recommend the movies sorted by average rating
recommended_movies = movie_recommendation.sort_values(by='rating', ascending=False)
print("\nRecommended Movies:")
print(recommended_movies.head(10))



Recommended Movies:
      movieId  rating                        title                genres
4523   187593     5.0            Deadpool 2 (2018)  Action|Comedy|Sci-Fi
3956    93838     5.0  The Raid: Redemption (2011)          Action|Crime
3342    47200     5.0                 Crank (2006)       Action|Thriller
798      1411     5.0                Hamlet (1996)   Crime|Drama|Romance
201       299     5.0                Priest (1994)                 Drama
2095     4334     5.0                 Yi Yi (2000)                 Drama
3897    89864     5.0                 50/50 (2011)          Comedy|Drama
342       514     5.0              Ref, The (1994)                Comedy
2100     4349     5.0              Catch-22 (1970)            Comedy|War
2118     4389     5.0    Lost and Delirious (2001)                 Drama


In [None]:
# TODO: filter these recommendations by removing movies already watched by user 121

# Notes on metrics (unfinished):
#   1 -> rating
#   X(j) ->
#   theta(i)
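The filtering step noted above could look roughly like the following. This is a minimal sketch on hypothetical data: `toy_ratings`, `candidates`, and `target_user` are illustrative stand-ins for the notebook's ratings DataFrame, recommended_movies, and random_user.

```python
import pandas as pd

# Hypothetical ratings, including those of the target user (user 1)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 40],
    "rating":  [4.0, 3.5, 5.0, 4.5, 5.0],
})
# Hypothetical candidate recommendations with their average ratings
candidates = pd.DataFrame({
    "movieId": [10, 30, 40],
    "rating":  [5.0, 4.5, 5.0],
})
target_user = 1

# Every movie the target user has already rated
seen_ids = toy_ratings.loc[toy_ratings["userId"] == target_user, "movieId"]

# Keep only candidates the target user has not rated yet
unseen = candidates[~candidates["movieId"].isin(seen_ids)]
print(unseen)
```

Filtering against the full ratings table rather than the user-movie matrix also catches movies that were removed earlier as rare.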