## Movie Recommendation Systems

In the dynamic landscape of entertainment, the sheer volume of available movies can be overwhelming for audiences seeking their next cinematic experience. To navigate this vast array of options, a personalized movie recommendation system emerges as a beacon, guiding viewers to films that align with their unique preferences and tastes. 

Recommendation systems analyze user preferences and historical viewing patterns to curate a tailored list of films that matches the user's interests. They improve the quality of search results and surface more relevant items, saving users time and effort.

There are three main types of recommendation systems:

* **Popularity-based Filtering:** This is the most basic type of filtering which recommends an item based on its popularity and/or genre. It recommends the same top items to users with similar demographic features. The underlying concept in this filtering method is that items enjoying higher popularity are more likely to be appreciated by the average audience. 

* **Content-based Filtering:** This recommends items related to a specific item, using item metadata such as actors, directors, genres and descriptions. The fundamental idea is that someone who enjoyed a specific item is likely to appreciate another item that resembles it.

* **Collaborative Filtering:** This finds users who share interests and makes recommendations based on what those like-minded users enjoyed. Unlike content-based filters, collaborative filters do not require item metadata.

### Loading the Dataset: 

In [1]:
import pandas as pd
import numpy as np

df1 = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_credits.csv')
df2 = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_movies.csv')

The first dataset contains the following features:

* movie_id - A unique identifier for each movie.
* cast - The names of the lead and supporting actors.
* crew - The names of the director, editor, composer, writer etc.


The second dataset has the following features:

* budget - The budget in which the movie was made.
* genres - The genres of the movie: Action, Comedy, Thriller etc.
* homepage - A link to the homepage of the movie.
* id - The same identifier as movie_id in the first dataset.
* keywords - The keywords or tags related to the movie.
* original_language - The language in which the movie was made.
* original_title - The title of the movie before translation or adaptation.
* overview - A brief description of the movie.
* popularity - A numeric quantity specifying the movie popularity.
* production_companies - The production house of the movie.
* production_countries - The country in which it was produced.
* release_date - The date on which it was released.
* revenue - The worldwide revenue generated by the movie.
* runtime - The running time of the movie in minutes.
* status - "Released" or "Rumored".
* tagline - Movie's tagline.
* title - Title of the movie.
* vote_average - The average rating the movie received.
* vote_count - The number of votes received.

In [2]:
# joining the two datasets on the id column
df1.columns = ['id', 'title1', 'cast', 'crew']
df2 = df2.merge(df1, on='id')
df2.head()

index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,title1,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


### 1. Popularity-based Filtering:

Using a movie's average rating alone as the ranking criterion would not be fair, because the number of votes varies widely from movie to movie: a very high average from only a handful of votes should not outrank a slightly lower average from thousands of votes.

Therefore, we need a score metric that takes into account the average rating as well as the number of votes for each movie, to calculate its popularity.

The weighted rating used here follows the formula used by IMDb:

**Weighted Rating** = (v/(v+m) * R) + (m/(v+m) * C)

where, 
* v = no. of votes for the movie
* m = minimum no. of votes required to be listed in the chart
* R = average rating of the movie
* C = mean of vote_average

We have, **v = vote_count** and **R = vote_average** in the dataset
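To see how the formula behaves, the sketch below applies it to two hypothetical movies with the same average rating but very different vote counts. The values of `m_demo` and `C_demo` are illustrative assumptions, not the ones computed from the dataset.

```python
def weighted_rating_example(v, R, m, C):
    """Weighted rating: blend the movie's own average R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical chart settings (assumptions for illustration only)
m_demo = 1000   # minimum number of votes required to be listed
C_demo = 6.0    # mean rating across all movies

# Same average rating, very different numbers of votes
well_voted = weighted_rating_example(v=10000, R=8.0, m=m_demo, C=C_demo)
barely_voted = weighted_rating_example(v=50, R=8.0, m=m_demo, C=C_demo)

print(round(well_voted, 2), round(barely_voted, 2))
```

With many votes the score stays close to the movie's own average; with few votes it is pulled towards the global mean `C`.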

In [3]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [4]:
C = df2['vote_average'].mean()
C

6.092171559442016

In [5]:
#taking the 90th percentile as the minimum threshold of votes required
m = df2['vote_count'].quantile(0.9).round()
m

1838.0

Movies with at least 1838 votes (at or above the 90th percentile) qualify for the top chart. Filtering the dataset accordingly:

In [6]:
top_movies = df2.copy().loc[df2['vote_count'] >= m]
print(top_movies.shape)

(481, 23)


Now we can calculate the Popularity Score for these top movies:

In [7]:
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    
    return (v/(v+m) * R) + (m/(m+v) * C)

top_movies['score'] = top_movies.apply(weighted_rating, axis=1)

In [8]:
top_movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 481 entries, 0 to 4602
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                481 non-null    int64  
 1   genres                481 non-null    object 
 2   homepage              349 non-null    object 
 3   id                    481 non-null    int64  
 4   keywords              481 non-null    object 
 5   original_language     481 non-null    object 
 6   original_title        481 non-null    object 
 7   overview              481 non-null    object 
 8   popularity            481 non-null    float64
 9   production_companies  481 non-null    object 
 10  production_countries  481 non-null    object 
 11  release_date          481 non-null    object 
 12  revenue               481 non-null    int64  
 13  runtime               481 non-null    float64
 14  spoken_languages      481 non-null    object 
 15  status                481 n

In [9]:
top_movies = top_movies.sort_values('score', ascending=False)
top_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)

index,title,vote_count,vote_average,score
1881,The Shawshank Redemption,8205,8.5,8.059336
662,Fight Club,9413,8.3,7.939322
65,The Dark Knight,12002,8.2,7.920073
3232,Pulp Fiction,8428,8.3,7.904716
96,Inception,13752,8.1,7.863285
3337,The Godfather,5893,8.4,7.851327
95,Interstellar,10867,8.1,7.809533
809,Forrest Gump,7927,8.2,7.803258
329,The Lord of the Rings: The Return of the King,8064,8.1,7.727309
1990,The Empire Strikes Back,5879,8.2,7.697967


This list gives the top-rated movies in the dataset irrespective of genre. The same model can be adapted to list the top-rated movies within a particular genre.
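A genre-specific chart could look roughly like the sketch below. It is written in plain Python over a list of dictionaries so that it runs on its own; the record layout, the `genre_chart` helper and the choice to compute `C` within the genre rather than over all movies are assumptions made for illustration.

```python
from statistics import mean

def genre_chart(movies, genre, min_votes, top_n=10):
    """Rank the movies of one genre by weighted rating.

    movies: list of dicts with 'title', 'genres' (list of names),
            'vote_count' and 'vote_average' keys (hypothetical layout).
    """
    in_genre = [mv for mv in movies if genre in mv['genres']]
    if not in_genre:
        return []
    C = mean(mv['vote_average'] for mv in in_genre)  # mean rating within the genre
    qualified = [mv for mv in in_genre if mv['vote_count'] >= min_votes]

    def score(mv):
        v, R = mv['vote_count'], mv['vote_average']
        return v / (v + min_votes) * R + min_votes / (v + min_votes) * C

    ranked = sorted(qualified, key=score, reverse=True)
    return [(mv['title'], round(score(mv), 3)) for mv in ranked[:top_n]]

# Made-up records, not rows from the dataset
toy_movies = [
    {'title': 'A', 'genres': ['Action', 'Drama'], 'vote_count': 5000, 'vote_average': 7.8},
    {'title': 'B', 'genres': ['Action'], 'vote_count': 300, 'vote_average': 9.1},
    {'title': 'C', 'genres': ['Comedy'], 'vote_count': 4000, 'vote_average': 6.5},
    {'title': 'D', 'genres': ['Action'], 'vote_count': 2500, 'vote_average': 8.1},
]
action_chart = genre_chart(toy_movies, 'Action', min_votes=2500)
print(action_chart)
```

Movie B has the highest average but too few votes to qualify, so it does not appear in the chart.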

This type of filtering provides the same general chart of recommendations to every user. It is not sensitive to the interests and tastes of a particular user.

### 2. Content-based Filtering

This approach recommends items similar to those a user has already liked. It draws on the user's past actions, detected patterns, or explicit feedback to decide what to recommend.
For example:
![image.png](attachment:15bceafe-4d97-4903-a00c-8b41bd3fb0bf.png)

Here we will compute pairwise similarity scores for all movies based on their metadata (cast, crew, keywords and genres) and recommend movies on the basis of that similarity score.

In [10]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(literal_eval)
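To see what this parsing step does to a single cell, the sketch below runs `literal_eval` on a shortened, hypothetical value in the same shape as the `genres` column.

```python
from ast import literal_eval

# A shortened, hypothetical cell value in the same shape as the 'genres' column
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)                   # str -> list of dicts
names = [entry['name'] for entry in parsed]  # pull out the human-readable names

print(type(raw).__name__, '->', type(parsed).__name__)
print(names)
```

Once the column holds real Python lists, the helper functions that follow can iterate over the entries and pick out the `name` fields.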

Next, we'll write functions that will help us to extract the required information from each feature. We are going to build a recommender based on the following metadata: the 3 top actors, the director, related genres and the movie plot keywords.

In [11]:
# Get the director's name from the crew feature.
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [12]:
# Return the first three names in the list (or the whole list if it is shorter)
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        if len(names) > 3:
            names = names[:3]
        return names

    return []

In [13]:
df2['director'] = df2['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(get_list)

In [14]:
df2[['title', 'cast', 'director', 'keywords', 'genres']].head(5)

index,title,cast,director,keywords,genres
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",James Cameron,"[culture clash, future, space war]","[Action, Adventure, Fantasy]"
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley]",Gore Verbinski,"[ocean, drug abuse, exotic island]","[Adventure, Fantasy, Action]"
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",Sam Mendes,"[spy, based on novel, secret agent]","[Action, Adventure, Crime]"
3,The Dark Knight Rises,"[Christian Bale, Michael Caine, Gary Oldman]",Christopher Nolan,"[dc comics, crime fighter, terrorist]","[Action, Crime, Drama]"
4,John Carter,"[Taylor Kitsch, Lynn Collins, Samantha Morton]",Andrew Stanton,"[based on novel, mars, medallion]","[Action, Adventure, Science Fiction]"


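Next, we convert the names and keywords to lowercase and strip the spaces within them, so that the vectorizer treats each name as a single token (for example, "Johnny Depp" and "Johnny Galecki" no longer share the token "johnny").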

In [15]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [16]:
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    df2[feature] = df2[feature].apply(clean_data)


Now we join all these columns together into a single string (the "soup") that serves as the metadata input for the vectorizer. The vectorizer produces a matrix in which each row represents a movie and each column represents a token from the soup vocabulary (every token that appears in at least one movie's soup). This matrix is used to calculate the similarity between a pair of movies.

In [17]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
df2['soup'] = df2.apply(create_soup, axis=1)

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df2['soup'])

We use cosine similarity to measure the similarity between two movies, since it is independent of vector magnitude and relatively fast to compute.
![image.png](attachment:58745c8b-0a2f-49c0-a1e6-16aee0c64a7d.png)
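In symbols, for two movies with count vectors $A$ and $B$ over the same vocabulary of $n$ tokens, the standard definition of cosine similarity is:

```latex
\[
\operatorname{sim}(A, B) = \cos\theta
= \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
\]
```

Because count vectors are non-negative, the score lies between 0 (no shared tokens) and 1 (vectors pointing in the same direction).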

In [19]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(count_matrix, count_matrix)
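The arithmetic behind this step can be checked by hand on a tiny example. The sketch below is a plain-Python stand-in for the vectorizer and the similarity call, not the scikit-learn implementation; the three soups are made up.

```python
from collections import Counter
from math import sqrt

def cosine(counts_a, counts_b):
    """Cosine similarity between two token-count mappings."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[tok] * counts_b[tok] for tok in shared)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical soups in the same style as the 'soup' column
soups = {
    'movie_1': 'space future jamescameron action',
    'movie_2': 'space alien jamescameron action',
    'movie_3': 'romance paris comedy',
}
counts = {title: Counter(text.split()) for title, text in soups.items()}

sim_12 = cosine(counts['movie_1'], counts['movie_2'])  # three of four tokens shared
sim_13 = cosine(counts['movie_1'], counts['movie_3'])  # no tokens shared
print(sim_12, sim_13)
```

Movies 1 and 2 share three of their four tokens and score 0.75, while movies 1 and 3 share none and score 0.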

We will now define a function that, given a movie title as input, returns a list of the ten most comparable films.

In [20]:
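# Construct a reverse map from movie title to row index.
# Series.drop_duplicates() would compare the index values (all unique), not the titles,
# so duplicate titles are dropped explicitly to keep indices[title] a scalar.
indices = pd.Series(df2.index, index=df2['title'])
indices = indices[~indices.index.duplicated(keep='first')]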

The recommendation function follows these steps:
* Take the title of the movie as input
* Calculate the cosine similarity scores of that particular movie with all the movies 
* Then return the top 10 movies with the highest similarity score

In [21]:
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]

    # Calculate pairwise similarity scores with all movies
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the list according to the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the top 10 movies leaving the first one (first one will be the movie itself)
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]

    return df2['title'].iloc[movie_indices]

In [22]:
get_recommendations('Avatar', cosine_sim)

206                         Clash of the Titans
71        The Mummy: Tomb of the Dragon Emperor
786                           The Monkey King 2
103                   The Sorcerer's Apprentice
131                                     G-Force
215      Fantastic 4: Rise of the Silver Surfer
466                            The Time Machine
715                           The Scorpion King
1      Pirates of the Caribbean: At World's End
5                                  Spider-Man 3
Name: title, dtype: object

### 3. Collaborative Filtering

The content-based filtering model has some limitations. It can only suggest items that are similar to a given item; it cannot capture a user's personal tastes and biases or recommend across genres. Anyone querying our engine about a given movie receives the same list, regardless of who they are. Hence, we use a collaborative filtering model.

This approach uses similarities between users and items simultaneously to provide recommendations. The idea is to recommend an item, or predict a rating, based on the behaviour of other like-minded users.
For example:
![image.png](attachment:a9d4dcf0-0bb9-4541-984a-f2626d2e8baf.png)

We use the **Singular Value Decomposition (SVD)** algorithm to capture the similarity between users and items and generate the recommendations.

SVD is a matrix factorisation technique that reduces the number of features of a dataset by reducing the space dimension from N dimensions to K dimensions (where K<N).
* It starts from a user-item matrix in which each row represents a user and each column represents an item.
* The elements of this matrix are the ratings that users have given to items.
* Factorising this matrix yields a low-dimensional representation of its rows and columns.
* The K dimensions that remain are latent factors: hidden characteristics of the items and of the users' affinity for them.
* Each user and each item is mapped into this K-dimensional latent space, which gives a clear representation of the relationships between users and items.
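The idea of learning latent factors from observed ratings can be sketched in a few lines of plain Python. This is a toy illustration of biased matrix factorisation trained with stochastic gradient descent, in the spirit of the model the library's `SVD` class fits; it is not the library's code, and the function name, hyperparameters and ratings below are assumptions.

```python
import random
from math import sqrt

def fit_biased_mf(ratings, n_factors=2, n_epochs=300, lr=0.01, reg=0.02, seed=0):
    """Fit user/item biases and latent factors to (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean rating
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}

    def predict(u, i):
        # Estimate = global mean + user bias + item bias + factor interaction
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict, mu

# Made-up ratings: (user, item, rating)
toy_ratings = [
    (1, 'A', 5), (1, 'B', 4), (1, 'C', 1),
    (2, 'A', 4), (2, 'B', 5), (2, 'D', 2),
    (3, 'C', 5), (3, 'D', 4), (3, 'A', 1),
]
predict_toy, mu_toy = fit_biased_mf(toy_ratings)

def rmse(estimate):
    """Root mean square error of an estimator on the toy ratings."""
    return sqrt(sum((r - estimate(u, i)) ** 2 for u, i, r in toy_ratings) / len(toy_ratings))

baseline_rmse = rmse(lambda u, i: mu_toy)   # always predict the global mean
fitted_rmse = rmse(predict_toy)
print(baseline_rmse, fitted_rmse)
```

The fitted model should reproduce the training ratings more closely than always predicting the global mean, because the biases and factors absorb the per-user and per-item structure.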


In [23]:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
reader = Reader()
ratings = pd.read_csv('../input/the-movies-dataset/ratings_small.csv')
ratings.head()

index,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [24]:
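data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
from surprise.model_selection import KFold

# Define a 5-fold splitter. split() only returns a lazy generator here;
# cross_validate in the next cell performs its own 5-fold split via cv=5.
kf = KFold(n_splits=5)
kf.split(data)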

<generator object KFold.split at 0x7c1d3768d4d0>

In [25]:
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9015  0.8932  0.8948  0.8954  0.9034  0.8977  0.0040  
MAE (testset)     0.6919  0.6850  0.6898  0.6942  0.6945  0.6911  0.0035  
Fit time          1.41    1.48    1.47    1.45    1.43    1.45    0.03    
Test time         0.34    0.20    0.34    0.20    0.20    0.26    0.07    


{'test_rmse': array([0.90148219, 0.89316062, 0.89482899, 0.89540596, 0.90344311]),
 'test_mae': array([0.69187451, 0.68497292, 0.68975451, 0.69424098, 0.69448954]),
 'fit_time': (1.4050288200378418,
  1.4843063354492188,
  1.4655203819274902,
  1.454526424407959,
  1.433316946029663),
 'test_time': (0.33728623390197754,
  0.2016284465789795,
  0.3381190299987793,
  0.20484113693237305,
  0.20073986053466797)}

We get a mean Root Mean Square Error of approximately 0.90, which is good enough for our case. Let us now train on the full dataset and make predictions.

In [26]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7c1d3761abc0>

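Let's use the model to predict the rating that user 1 would give to movie 1172. The third argument is the true rating `r_ui`; it is stored with the result but does not influence the estimate.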

In [27]:
svd.predict(1, 1172, 3)

Prediction(uid=1, iid=1172, r_ui=3, est=3.6400475696480754, details={'was_impossible': False})

The model predicts that this user would rate the movie at about 3.6.
Using these predicted ratings, we can find the movies the user is likely to enjoy and recommend them.
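One way to turn predicted ratings into recommendations is to score every movie the user has not rated yet and keep the highest estimates. The sketch below shows that logic with a stand-in predictor and made-up estimates; `recommend_unseen` and `fake_estimates` are hypothetical names. In the notebook, the predictor role could be played by `lambda uid, mid: svd.predict(uid, mid).est`.

```python
def recommend_unseen(predict_rating, user_id, all_movie_ids, seen_movie_ids, n=10):
    """Return the n unseen movies with the highest predicted rating for one user."""
    seen = set(seen_movie_ids)
    scored = [(mid, predict_rating(user_id, mid))
              for mid in all_movie_ids if mid not in seen]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Made-up estimates keyed by movie id (assumption for illustration)
fake_estimates = {31: 2.4, 1029: 3.1, 1061: 2.9, 1129: 2.2, 1172: 3.6, 2000: 4.1, 3000: 3.3}

def fake_predict(uid, mid):
    """Stand-in for a trained model's rating estimate."""
    return fake_estimates[mid]

top_unseen = recommend_unseen(
    fake_predict,
    user_id=1,
    all_movie_ids=list(fake_estimates),
    seen_movie_ids=[31, 1029, 1061, 1129, 1172],
    n=2,
)
print(top_unseen)
```

Movies the user has already rated are excluded, so only the two unseen ids are ranked.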

### 4. Hybrid Recommender

This model combines the techniques used in the content-based and collaborative models to produce recommendations.

* It takes a user ID and a movie title as input
* It returns similar movies, sorted by the ratings that user is expected to give them
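The two-stage idea can be sketched independently of the datasets: shortlist by content similarity, then re-rank the shortlist by the rating predicted for the user. The sketch below keys everything by movie id instead of row position; the `hybrid_rank` helper, the similarity table and the estimates are hypothetical.

```python
def hybrid_rank(query, similarity, predict_rating, user_id, shortlist_size=25, n=10):
    """Shortlist by content similarity, then re-rank by predicted rating."""
    # Stage 1: the movies most similar to the query, excluding the query itself
    neighbours = sorted(
        (mid for mid in similarity[query] if mid != query),
        key=lambda mid: similarity[query][mid],
        reverse=True,
    )[:shortlist_size]
    # Stage 2: order the shortlist by the rating predicted for this user
    return sorted(neighbours, key=lambda mid: predict_rating(user_id, mid), reverse=True)[:n]

# Made-up similarity scores of every movie to the query 'Q'
toy_similarity = {'Q': {'Q': 1.0, 'A': 0.9, 'B': 0.8, 'C': 0.7, 'D': 0.1}}

# Made-up predicted ratings per user
toy_estimates = {
    1: {'A': 2.5, 'B': 4.0, 'C': 3.0, 'D': 5.0},
    2: {'A': 4.5, 'B': 2.0, 'C': 3.5, 'D': 1.0},
}

def toy_predict(uid, mid):
    """Stand-in for a trained collaborative model."""
    return toy_estimates[uid][mid]

for_user_1 = hybrid_rank('Q', toy_similarity, toy_predict, user_id=1, shortlist_size=3, n=3)
for_user_2 = hybrid_rank('Q', toy_similarity, toy_predict, user_id=2, shortlist_size=3, n=3)
print(for_user_1, for_user_2)
```

Movie D never appears for user 1 despite its high predicted rating, because it is not similar enough to the query to make the shortlist; the same shortlist is ordered differently for each user.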

In [28]:
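def convert_int(x):
    # Return x as an int, or NaN when it cannot be converted (e.g. a missing value)
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan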

In [29]:
links_small = pd.read_csv('/kaggle/input/the-movies-dataset/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
md = pd.read_csv('/kaggle/input/the-movies-dataset/movies_metadata.csv')
# Drop three rows by index label, then cast id to int
md = md.drop([19730, 29503, 35587])
md['id'] = md['id'].astype('int')
# Keep only the movies that appear in the small links file
smd = md[md['id'].isin(links_small)]



In [30]:
id_map = pd.read_csv('/kaggle/input/the-movies-dataset/links_small.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(smd[['title', 'id']], on='id').set_index('title')
id_map.head()

title,movieId,id
Toy Story,1,862.0
Jumanji,2,8844.0
Grumpier Old Men,3,15602.0
Waiting to Exhale,4,31357.0
Father of the Bride Part II,5,11862.0


In [31]:
indices_map = id_map.set_index('id')

In [32]:
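def hybrid(userId, title):
    # Row position of the query title in the content-based similarity matrix
    idx = indices[title]
    # Looked up for reference only; neither value is used below
    tmdbId = id_map.loc[title]['id']
    movie_id = id_map.loc[title]['movieId']

    # Content-based step: the 25 most similar rows, skipping the query itself
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]

    # NOTE: movie_indices are row positions in df2 (the frame cosine_sim was built from)
    # but are applied to smd here, so the candidates are the genuinely similar titles
    # only if both frames share the same row order.
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'id']]
    # Collaborative step: re-rank the candidates by the rating SVD predicts for this user
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)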

In [33]:
hybrid(20, 'The Postman')

index,title,vote_count,vote_average,id,est
1169,The Third Man,431.0,7.9,1092,3.74644
927,Mr. Smith Goes to Washington,245.0,7.9,3083,3.618888
1407,Lost Highway,572.0,7.5,638,3.51108
892,The Wizard of Oz,1689.0,7.4,630,3.47002
820,The Story of Xinghua,0.0,0.0,297645,3.389704
2188,History of the World: Part I,205.0,6.5,10156,3.327273
80,Things to Do in Denver When You're Dead,87.0,6.7,400,3.315966
2293,Romancing the Stone,477.0,6.6,9326,3.310977
3443,Where the Heart Is,159.0,6.9,10564,3.272625
6249,The Crazies,67.0,6.0,29425,3.255151


In [34]:
hybrid(1, 'The Postman')

index,title,vote_count,vote_average,id,est
927,Mr. Smith Goes to Washington,245.0,7.9,3083,3.441493
1169,The Third Man,431.0,7.9,1092,3.440676
892,The Wizard of Oz,1689.0,7.4,630,3.057074
1407,Lost Highway,572.0,7.5,638,3.036414
2188,History of the World: Part I,205.0,6.5,10156,2.985069
3689,Breaker Morant,37.0,7.2,13783,2.952526
2068,Family Plot,86.0,6.7,5854,2.896877
2160,Rush Hour,1254.0,6.8,2109,2.840741
1092,The Abyss,822.0,7.1,2756,2.825104
2293,Romancing the Stone,477.0,6.6,9326,2.790549


We see that for our hybrid recommender, we get different recommendations for different users although the movie is the same. Hence, our recommendations are more personalized and tailored towards particular users.
