# Movie Recommendation System

#### ***Context***

Over the past two decades, there has been a monumental shift in how people access 
and consume video content. With universal access to broadband internet, numerous 
platforms like YouTube, Netflix, and HBO Go emerged and steadily grew to prominence.

Although not a household name in itself, OTT is the exact technology that made the 
streaming revolution possible.

OTT stands for Over The Top and refers to any video streaming service that delivers 
content to users over the internet. Platforms such as PrimeVideo, Netflix, HotStar, 
Zee5, and Sony Liv usually charge a subscription fee. Even with access to all of these 
platforms, choosing the next movie to watch can still be a daunting task.

The data for this exercise is open-source data which has been collected and made 
available from the MovieLens website (http://movielens.org), a part of GroupLens 
Research. The data sets were collected over various periods of time, depending on the 
size of the set.

### ***Data Description:***

The data consists of 105339 ratings applied over 10329 movies. The average rating is 
3.5, and the minimum and maximum ratings are 0.5 and 5, respectively. The ratings were 
given by 668 users, and movie IDs run up to 149532 (the largest movieId, not a count of movies).

Two data files are provided:

**movies.csv**

➢ movieId: ID assigned to a movie.

➢ title: Title of a movie.

➢ genres: pipe separated list of movie genres.

**ratings.csv**

➢ userId: ID assigned to a user.

➢ movieId: ID assigned to a movie.

➢ rating: Rating given by a user to a movie.

➢ timestamp: Time at which the rating was provided.
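
The timestamp column holds large integers such as 1217897793. A minimal sketch of turning them into readable dates follows, assuming the values are seconds since the Unix epoch (the usual MovieLens convention, but not stated in this notebook). The `sample` frame and the `rated_at` column name are illustrative only.

```python
import pandas as pd

# Toy frame holding the first two rows of the ratings table.
sample = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [16, 24],
    "rating": [4.0, 1.5],
    "timestamp": [1217897793, 1217895807],
})

# Assumption: the integers are seconds since 1970-01-01 UTC.
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s")
print(sample[["timestamp", "rated_at"]])
```

Under that assumption both sample ratings fall in 2008, which fits the period over which the data set was collected.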

### ***Steps and Tasks:***
  
  ➢ Import libraries and load dataset
  
  ➢ Exploratory Data Analysis including:
  
     • Understanding of distribution of the features available
     
     • Finding unique users and movies
     
     • Average rating and Total movies at genre level
     
     • Unique genres considered
     
  ➢ Design the 3 different types of recommendation modules as mentioned in the 
     objectives.

In [1]:
# import libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
from scipy.sparse import csr_matrix 

In [29]:
#read csv file
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')


In [30]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
10324,146684,Cosmic Scrat-tastrophe (2015),Animation|Children|Comedy
10325,146878,Le Grand Restaurant (1966),Comedy
10326,148238,A Very Murray Christmas (2015),Comedy
10327,148626,The Big Short (2015),Drama


In [31]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523
...,...,...,...,...
105334,668,142488,4.0,1451535844
105335,668,142507,3.5,1451535889
105336,668,143385,4.0,1446388585
105337,668,144976,2.5,1448656898


In [32]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10329 entries, 0 to 10328
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  10329 non-null  int64 
 1   title    10329 non-null  object
 2   genres   10329 non-null  object
dtypes: int64(1), object(2)
memory usage: 242.2+ KB


In [33]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105339 entries, 0 to 105338
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     105339 non-null  int64  
 1   movieId    105339 non-null  int64  
 2   rating     105339 non-null  float64
 3   timestamp  105339 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.2 MB


In [34]:
movies.describe()

Unnamed: 0,movieId
count,10329.0
mean,31924.282893
std,37734.741149
min,1.0
25%,3240.0
50%,7088.0
75%,59900.0
max,149532.0


In [35]:
ratings.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,105339.0,105339.0,105339.0,105339.0
mean,364.924539,13381.312477,3.51685,1130424000.0
std,197.486905,26170.456869,1.044872,180266000.0
min,1.0,1.0,0.5,828565000.0
25%,192.0,1073.0,3.0,971100800.0
50%,383.0,2497.0,3.5,1115154000.0
75%,557.0,5991.0,4.0,1275496000.0
max,668.0,149532.0,5.0,1452405000.0


In [36]:
movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [37]:
ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [38]:
# Find the number of unique users in the ratings dataset
num_users = ratings['userId'].nunique()
print("Number of unique users: ", num_users)

# Find the number of unique movies in the ratings dataset
num_movies = ratings['movieId'].nunique()
print("Number of unique movies: ", num_movies)



Number of unique users:  668
Number of unique movies:  10325
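
These counts also show how sparse the user-movie rating matrix is. The short calculation below is a sketch using the figures printed in this notebook; the variable names are chosen here for illustration.

```python
# Figures taken from the outputs above.
num_users = 668
num_movies = 10325
num_ratings = 105339

# Share of all possible (user, movie) pairs that actually carry a rating.
density = num_ratings / (num_users * num_movies)
print(f"Matrix density: {density:.2%}")
```

Roughly 1.5% of the possible user-movie pairs are rated, which is why a sparse matrix representation is used later for collaborative filtering.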


In [39]:
# Split genres and create a new dataframe with split genres
max_genres = movies['genres'].str.split('|').apply(len).max()
genres_df = movies['genres'].str.split('|', expand=True)
genres_df.columns = [f'genre{i+1}' for i in range(max_genres)]
    
# Add the movieId column to the new dataframe
genres_df['movieId'] = movies['movieId']

In [40]:
# Find the unique genres considered
unique_genres = set()
for genre in movies['genres']:
    unique_genres.update(genre.split('|'))
print("Unique genres considered: ", unique_genres)


Unique genres considered:  {'Animation', 'Horror', 'Thriller', 'Film-Noir', 'Action', 'Musical', 'Fantasy', 'Crime', 'Sci-Fi', 'Romance', 'Comedy', 'Western', 'IMAX', 'War', 'Drama', 'Children', 'Documentary', 'Mystery', 'Adventure', '(no genres listed)'}
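
The task list asks for the average rating and the total number of movies at the genre level. One possible way to compute both is sketched below. The helper name `genre_level_summary` and the toy rating values are assumptions for illustration; on the real data the same function could be called with `movies` and `ratings`.

```python
import pandas as pd

# Toy data: titles and genres follow the movies table, rating values are made up.
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy|Romance"],
})
ratings_toy = pd.DataFrame({
    "userId": [1, 2, 1, 2],
    "movieId": [1, 1, 2, 3],
    "rating": [4.0, 5.0, 3.0, 2.0],
})

def genre_level_summary(movies, ratings):
    """Return total movies and average rating per genre (hypothetical helper)."""
    # One row per (movie, genre) pair.
    long = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")
    # Attach every rating of a movie to each of its genres.
    merged = long.merge(ratings, on="movieId", how="left")
    summary = merged.groupby("genre").agg(
        total_movies=("movieId", "nunique"),
        average_rating=("rating", "mean"),
    )
    return summary.sort_values("total_movies", ascending=False)

genre_summary = genre_level_summary(movies_toy, ratings_toy)
print(genre_summary)
```

A movie with several genres contributes its ratings to each of them, so the per-genre averages are not independent of one another.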


## 1. Popularity Based Movie Recommendation
***Create a popularity-based recommender system at the genre level. The user inputs a genre (g), a minimum review-count threshold (t), and the number of recommendations (N). The system returns the top N most popular movies within genre g, ordered by average rating in descending order, where each movie has at least t reviews.***

In [41]:
#import libraries
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

In [42]:
# read in the ratings and movies data
ratings = pd.read_csv("ratings.csv")
movies = pd.read_csv("movies.csv")

In [43]:

# define a function for popularity-based movie recommendation
def popularity_based_recommendation():

    # merge the ratings and movies dataframes
    df = movies.merge(ratings, on='movieId', how='inner')

    # group by movie title to calculate average rating and number of reviews
    title_rating = df.groupby(['title'])['rating'].mean().reset_index()
    title_review = df.groupby(['title'])['userId'].count().reset_index()

    # group by movie title and keep one copy of the genre string
    # (summing would concatenate the same string once per rating)
    title_genres = df.groupby(['title'])['genres'].first().reset_index()

    # merge the rating, review, and genre dataframes
    df1 = pd.merge(title_rating, title_review, on='title')
    df1 = pd.merge(df1, title_genres, on='title')

    # rename columns for readability
    df1 = df1.rename(columns={'title': 'Movie Title', 'rating': 'Average Movie Rating', 'userId': 'Reviews'})

    # get user input for genre, minimum average rating, minimum number of reviews, and number of recommended movies
    g = input('Genre: ')
    t = float(input('Minimum average rating: '))
    N = int(input('Minimum number of reviews: '))
    n_movies = int(input('Number of recommendations: '))

    # filter the dataframe to movies with the specified genre, minimum number of reviews, and minimum average rating
    # (regex=False treats the genre as plain text, so '(no genres listed)' also works)
    popularity_recommendation = df1[(df1['genres'].str.contains(g, regex=False)) & (df1['Reviews'] >= N) & (df1['Average Movie Rating'] >= t)].reset_index(drop=True)

    # sort the filtered dataframe by average rating in descending order and return the top n_movies recommended movies
    return popularity_recommendation.sort_values('Average Movie Rating', ascending=False).head(n_movies)


In [45]:
sol=popularity_based_recommendation()
sol

Genre: Action
Minimum average rating: 4
Minimum number of reviews: 100
Number of recommendations: 10


Unnamed: 0,Movie Title,Average Movie Rating,Reviews,genres
13,"Matrix, The (1999)",4.264368,261,Action|Sci-Fi|Thriller
18,Star Wars: Episode V - The Empire Strikes Back...,4.22807,228,Action|Adventure|Sci-Fi
15,Raiders of the Lost Ark (Indiana Jones and the...,4.212054,224,Action|Adventure
9,Inception (2010),4.18932,103,Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX
17,Star Wars: Episode IV - A New Hope (1977),4.188645,273,Action|Adventure|Sci-Fi
6,Fight Club (1999),4.188406,207,Action|Crime|Drama|Thriller
3,Blade Runner (1982),4.169872,156,Action|Sci-Fi|Thriller
14,"Princess Bride, The (1987)",4.163743,171,Action|Adventure|Comedy|Fantasy|Romance
0,Aliens (1986),4.146497,157,Action|Adventure|Horror|Sci-Fi
5,"Dark Knight, The (2008)",4.141732,127,Action|Crime|Drama|IMAX
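
The task statement describes three inputs: a genre g, a minimum number of reviews t, and the number of recommendations N. A non-interactive sketch of that specification follows. The function name `top_n_by_genre`, its parameter names, and the toy data are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

def top_n_by_genre(movies, ratings, genre, min_reviews, n):
    """Top-n movies of a genre by average rating, given a review-count floor."""
    # Average rating and review count per movie.
    stats = (ratings.groupby("movieId")["rating"]
             .agg(average_rating="mean", reviews="count")
             .reset_index())
    table = movies.merge(stats, on="movieId", how="inner")
    # Exact genre match on the pipe-separated list (no substring matching).
    has_genre = table["genres"].str.split("|").apply(lambda gs: genre in gs)
    picked = table[has_genre & (table["reviews"] >= min_reviews)]
    # Break ties in average rating by the number of reviews.
    return (picked.sort_values(["average_rating", "reviews"], ascending=False)
            .head(n)[["title", "average_rating", "reviews"]]
            .reset_index(drop=True))

# Made-up toy data.
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A (2000)", "B (2001)", "C (2002)"],
    "genres": ["Action|Sci-Fi", "Action", "Comedy"],
})
ratings_toy = pd.DataFrame({
    "userId": [1, 2, 1, 1, 2],
    "movieId": [1, 1, 2, 3, 3],
    "rating": [5.0, 4.0, 5.0, 3.0, 3.0],
})

strict = top_n_by_genre(movies_toy, ratings_toy, "Action", min_reviews=2, n=5)
loose = top_n_by_genre(movies_toy, ratings_toy, "Action", min_reviews=1, n=5)
print(strict)
print(loose)
```

Taking the inputs as arguments rather than through `input()` makes the function easier to test and to call repeatedly with different thresholds.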


## 2. Content Based Movie Recommendation
***Create a content-based recommender system that recommends the top N movies whose genres are most similar to those of a given movie (m).***

In [46]:
from sklearn.metrics.pairwise import cosine_similarity

In [48]:
movies_df = pd.read_csv('movies.csv')
movies_df = movies_df[['title', 'genres']]
movies_df = movies_df.drop_duplicates()
movies_df

Unnamed: 0,title,genres
0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,Jumanji (1995),Adventure|Children|Fantasy
2,Grumpier Old Men (1995),Comedy|Romance
3,Waiting to Exhale (1995),Comedy|Drama|Romance
4,Father of the Bride Part II (1995),Comedy
...,...,...
10324,Cosmic Scrat-tastrophe (2015),Animation|Children|Comedy
10325,Le Grand Restaurant (1966),Comedy
10326,A Very Murray Christmas (2015),Comedy
10327,The Big Short (2015),Drama


In [49]:
genres_df = movies_df['genres'].str.get_dummies(sep='|')
movies_df = pd.concat([movies_df, genres_df], axis=1)
movies_df = movies_df.drop('genres', axis=1)

In [50]:
genres_df

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10324,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10325,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10326,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10327,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0


In [51]:
movies_df

Unnamed: 0,title,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,Toy Story (1995),0,0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Jumanji (1995),0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Grumpier Old Men (1995),0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,Waiting to Exhale (1995),0,0,0,0,0,1,0,0,1,...,0,0,0,0,0,1,0,0,0,0
4,Father of the Bride Part II (1995),0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10324,Cosmic Scrat-tastrophe (2015),0,0,0,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10325,Le Grand Restaurant (1966),0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10326,A Very Murray Christmas (2015),0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10327,The Big Short (2015),0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [52]:
similarity_matrix = cosine_similarity(genres_df, genres_df)
similarity_matrix

array([[1.        , 0.77459667, 0.31622777, ..., 0.4472136 , 0.        ,
        0.        ],
       [0.77459667, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.31622777, 0.        , 1.        , ..., 0.70710678, 0.        ,
        0.        ],
       ...,
       [0.4472136 , 0.        , 0.70710678, ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])
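
The first row of the matrix can be checked by hand. Cosine similarity of two 0/1 genre vectors is the number of shared genres divided by the square root of the product of their genre counts. The sketch below recomputes two entries with NumPy; the vectors are written out manually from the genre table above and only cover the six genres these three movies use.

```python
import numpy as np

# Columns: Adventure, Animation, Children, Comedy, Fantasy, Romance
toy_story = np.array([1, 1, 1, 1, 1, 0])
jumanji = np.array([1, 0, 1, 0, 1, 0])
grumpier_old_men = np.array([0, 0, 0, 1, 0, 1])

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 3 shared genres / sqrt(5 * 3)
sim_jumanji = cosine(toy_story, jumanji)
# 1 shared genre / sqrt(5 * 2)
sim_grumpier = cosine(toy_story, grumpier_old_men)
print(round(sim_jumanji, 8), round(sim_grumpier, 8))
```

The two values agree with the entries 0.77459667 and 0.31622777 in the first row of `similarity_matrix`.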

In [68]:
# Define a function to implement content-based recommendation
def content_based():

    # Ask for the number of recommended movies to return
    top_n = int(input('Enter number of recommendations: '))

    # Get the index of the input movie from the movies dataframe
    movie_index = movies_df[movies_df['title'] == input("Enter the movie: ")].index[0]

    # Pair every movie index with its similarity score to the input movie,
    # leaving out the input movie itself (when scores tie at 1.0 it is not
    # guaranteed to be sorted first, so slicing off the first entry is unsafe)
    similar_movies = [(i, score) for i, score in enumerate(similarity_matrix[movie_index]) if i != movie_index]

    # Sort the similar movies by their similarity score in descending order
    sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)

    # Get the top N recommended movies with their similarity scores
    recommended_movies = [(movies_df.iloc[i[0]]['title'], i[1]) for i in sorted_similar_movies[:top_n]]

    # Create a pandas dataframe to display the recommended movies with their similarity scores
    df = pd.DataFrame(recommended_movies, columns=['Movie Title', 'Similarity Score'])

    # Return the dataframe of recommended movies
    return df

# Call the content-based recommendation function and store the output in the "output" variable
output = content_based()

# Print the output
print(output)


Enter number of recommendations: 10
Enter the movie: Toy Story (1995)
                                         Movie Title  Similarity Score
0                                        Antz (1998)               1.0
1                                 Toy Story 2 (1999)               1.0
2     Adventures of Rocky and Bullwinkle, The (2000)               1.0
3                   Emperor's New Groove, The (2000)               1.0
4                              Monsters, Inc. (2001)               1.0
5  DuckTales: The Movie - Treasure of the Lost La...               1.0
6                                   Wild, The (2006)               1.0
7                             Shrek the Third (2007)               1.0
8                     Tale of Despereaux, The (2008)               1.0
9  Asterix and the Vikings (Astérix et les Viking...               1.0


## 3. Collaborative Based Movie Recommendation
***Create a collaborative-filtering recommender system that recommends the top N movies based on the “K” most similar users for a target user “u”.***

**Recommends top N movies based on K similar users for a target user**
    
parameter user_id: target user ID

parameter n_recommendations: number of recommendations to be made

parameter k_similar_users: number of most similar users to consider
    
return: top N movie recommendations for the target user

In [55]:
from scipy import sparse
movies_df1 = pd.read_csv('movies.csv')
ratings_df1 = pd.read_csv('ratings.csv')

In [56]:
def collaborative_filtering(user_id, n_recommendations, k_similar_users):
   
    # Create a sparse user x movie matrix; raw userId and movieId values are used directly as row and column indices
    sparse_matrix = sparse.csr_matrix((ratings_df1['rating'], (ratings_df1['userId'], ratings_df1['movieId'])))
    
    # Calculate cosine similarity between users
    similarity_matrix = cosine_similarity(sparse_matrix)
    
    # Find similar users for the target user
    similar_users = similarity_matrix[user_id]
    similar_users_indices = similar_users.argsort()[-k_similar_users-1:-1][::-1]
    
    # Find movies watched by similar users but not by the target user
    similar_users_movies = set(ratings_df1[ratings_df1['userId'].isin(similar_users_indices)]['movieId'])
    target_user_movies = set(ratings_df1[ratings_df1['userId'] == user_id]['movieId'])
    recommendations = similar_users_movies - target_user_movies
    
    # Calculate each candidate movie's average rating across all users
    recommendations_df = pd.DataFrame(recommendations, columns=['movieId'])
    recommendations_df['rating'] = recommendations_df['movieId'].apply(lambda x: ratings_df1[(ratings_df1['movieId']==x)]['rating'].mean())
    
    # Sort recommended movies based on average rating
    recommendations_df = recommendations_df.sort_values(by='rating', ascending=False)
    
    # Get movie titles
    recommendations_df = pd.merge(recommendations_df, movies_df1, on='movieId', how='inner')
    
    # Return top N recommendations
    return recommendations_df.head(n_recommendations)['title']

In [57]:
user_id = int(input('Target user ID: '))
n_recommendations = int(input('Number of recommendations: '))
k_similar_users = int(input('Number of similar users: '))

collaborative_filtering(user_id, n_recommendations, k_similar_users)

Target user ID: 1
Number of recommendations: 10
Number of similar users: 100


0     Gentlemen of Fortune (Dzhentlmeny udachi) (1972)
1                        Waiting for 'Superman' (2010)
2              Fallen Angels (Duo luo tian shi) (1995)
3                                         Earth (2007)
4                 Star Wreck: In the Pirkinning (2005)
5    Werckmeister Harmonies (Werckmeister harmóniák...
6                         School For Scoundrels (1960)
7               Nobody Knows (Dare mo shiranai) (2004)
8    Irony of Fate, or Enjoy Your Bath! (Ironiya su...
9                    Batman: Under the Red Hood (2010)
Name: title, dtype: object
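
The function above ranks candidate movies by their average rating across all users, so titles with only a few very high ratings can rise to the top. One possible variant, sketched here under assumptions, ranks candidates by the mean rating given by the K most similar users only. The function name `recommend_from_neighbours`, the column name `neighbour_mean_rating`, and the toy data are hypothetical. Cosine similarity is computed with NumPy to keep the example self-contained.

```python
import numpy as np
import pandas as pd

def recommend_from_neighbours(ratings, movies, user_id, n_recommendations, k_similar_users):
    """Rank unseen movies by the mean rating of the k most similar users."""
    # Dense user x movie table; unrated cells count as 0 for the similarity step.
    user_item = ratings.pivot_table(index="userId", columns="movieId", values="rating").fillna(0)
    matrix = user_item.to_numpy()
    target = user_item.loc[user_id].to_numpy()
    # Cosine similarity of every user to the target user.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target)
    sims = pd.Series(matrix @ target / np.where(norms == 0, 1, norms), index=user_item.index)
    # The k nearest neighbours, excluding the target user.
    neighbours = sims.drop(user_id).nlargest(k_similar_users).index
    seen = set(ratings.loc[ratings["userId"] == user_id, "movieId"])
    candidates = ratings[ratings["userId"].isin(neighbours) & ~ratings["movieId"].isin(seen)]
    # Score each candidate by the neighbours' mean rating.
    scores = candidates.groupby("movieId")["rating"].mean().sort_values(ascending=False)
    top = scores.head(n_recommendations).rename("neighbour_mean_rating").reset_index()
    return top.merge(movies[["movieId", "title"]], on="movieId", how="left")

# Made-up toy data: user 2 rates like user 1, user 3 does not.
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30, 10, 40],
    "rating": [5.0, 4.0, 5.0, 4.0, 5.0, 1.0, 5.0],
})
movies_toy = pd.DataFrame({
    "movieId": [10, 20, 30, 40],
    "title": ["A (2000)", "B (2001)", "C (2002)", "D (2003)"],
})

result = recommend_from_neighbours(ratings_toy, movies_toy, user_id=1,
                                   n_recommendations=5, k_similar_users=1)
print(result)
```

With one neighbour, only user 2 is consulted, so the single recommendation is the movie that user 2 rated and user 1 has not seen. The dense pivot table is fine for a data set of this size; a sparse matrix would be preferable for much larger ones.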