In [None]:
# -*- coding: utf-8 -*-
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Hybrid Recommender system

We will build a hybrid recommender system that leverages both collaborative and content-based filtering methods. The goal of a hybrid system is to combine the strengths of both approaches to provide more accurate recommendations.

**Inspired by:** [Movie Recommendation Engine](https://github.com/jalajthanaki/Movie_recommendation_engine/blob/master/Movie_recommendation_engine.ipynb).

## Datasets
In this notebook, we will once again work with movies. You can download the public dataset from the following link: **[The Movies Dataset](https://drive.google.com/drive/folders/1JnQXDCsGAb75I4PRRMDHUO0WxmXT-usv)**

You will also need to download one more file, **`movies_metadata.csv`**, from the following link: **[movies_metadata.csv (Kaggle)](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset)**

Here, we can see the files we are going to work with:
- `credits.csv`
- `keywords.csv`
- `links_small.csv`
- `movies_metadata.csv`
- `ratings_small.csv`

In [None]:
# !pip install surprise

In [None]:
import pandas as pd
import numpy as np
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

import warnings

warnings.filterwarnings('ignore', category=pd.errors.SettingWithCopyWarning)

In [None]:
credits = pd.read_csv('hybrid_data/credits.csv')
keywords = pd.read_csv('hybrid_data/keywords.csv')
links_small = pd.read_csv('hybrid_data/links_small.csv')
md = pd.read_csv('hybrid_data/movies_metadata.csv')
ratings = pd.read_csv('hybrid_data/ratings_small.csv')

### Credits

In [None]:
credits.head()

In [None]:
credits.columns

- `cast`: Information about the cast: the name of each actor, their gender and the name of their character in the movie.
- `crew`: Information about the crew members, such as the director, the editor and so on.
- `id`: The movie ID assigned by TMDb.

In [None]:
credits.shape

In [None]:
credits.info()

### Keywords

In [None]:
keywords.head()

In [None]:
keywords.columns

- `id`: The movie ID assigned by TMDb.
- `keywords`: A list of tags/keywords for the movie.

In [None]:
keywords.shape

In [None]:
keywords.info()

### Links

In [None]:
links_small.head()

In [None]:
links_small.columns

- `movieId`: ID of the movie as used in the ratings file
- `imdbId`: ID of the movie on IMDb
- `tmdbId`: ID of the movie on TMDb

In [None]:
links_small.shape

In [None]:
links_small.info()

### Metadata

In [None]:
md.iloc[0:3].transpose()

In [None]:
md.columns

- `adult`: Indicates if the movie is X-Rated or Adult.
- `belongs_to_collection`: A stringified dictionary that gives information on the movie series the particular film belongs to.
- `budget`: The budget of the movie in dollars.
- `genres`: A stringified list of dictionaries that list out all the genres associated with the movie.
- `homepage`: The Official Homepage of the movie.
- `id`: The ID of the movie.
- `imdb_id`: The IMDB ID of the movie.
- `original_language`: The language in which the movie was originally shot in.
- `original_title`: The original title of the movie.
- `overview`: A brief blurb of the movie.
- `popularity`: The Popularity Score assigned by TMDB.
- `poster_path`: The URL of the poster image.
- `production_companies`: A stringified list of production companies involved with the making of the movie.
- `production_countries`: A stringified list of countries where the movie was shot/produced in.
- `release_date`: Theatrical Release Date of the movie.
- `revenue`: The total revenue of the movie in dollars.
- `runtime`: The runtime of the movie in minutes.
- `spoken_languages`: A stringified list of spoken languages in the film.
- `status`: The status of the movie (Released, To Be Released, Announced, etc.)
- `tagline`: The tagline of the movie.
- `title`: The Official Title of the movie.
- `video`: Indicates if there is a video present of the movie with TMDB.
- `vote_average`: The average rating of the movie.
- `vote_count`: The number of votes by users, as counted by TMDB.

In [None]:
md.shape

In [None]:
md.info()

### Ratings

In [None]:
ratings.head()

In [None]:
ratings.columns

- `userId`: ID of the user
- `movieId`: MovieLens ID of the movie (not the TMDb ID; `links_small.csv` maps between the two)
- `rating`: Rating given to the movie by the user
- `timestamp`: Time at which the user gave the rating

In [None]:
ratings.shape

In [None]:
ratings.info()

## Build Recommendation System

### Simple Recommendation System

__Approach__:

- The Simple Recommender offers __generalized__ recommendations to every user __based on movie popularity and (sometimes) genre__.

- The __basic idea__ behind this recommender is that __movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience.__

- This model __does not give personalized recommendations based on the user__.

__What we are actually doing:__

- The implementation of this model is straightforward.
- All we have to do is __sort our movies based on ratings and popularity__ and display the top movies of our list.
- As an added step, we can __pass in a genre argument to get the top movies of a particular genre.__


In [None]:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i[
    'name'] for i in x] if isinstance(x, list) else [])

- We use the TMDB Ratings to come up with our Top Movies Chart.
- We will use IMDB's weighted rating formula to construct our chart.
- Mathematically, it is represented as follows:

$$
\text{Weighted Rating (WR)} = \left(\frac{v}{v+m} \cdot R\right) + \left(\frac{m}{v+m} \cdot C\right)
$$

where:

- $v$ is the number of votes for the movie
- $m$ is the minimum number of votes required to be listed in the chart
- $R$ is the average rating of the movie
- $C$ is the mean vote across all movies in the dataset
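
To make the formula concrete, here is a small sketch with made-up numbers. The vote counts, ratings, `m` and `C` below are illustrative assumptions, not values taken from the dataset. It shows how a movie with few votes is pulled towards the global mean, while a movie with many votes keeps a score close to its own average.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: blend a movie's own mean R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values chosen only for illustration
m = 400    # minimum number of votes required
C = 5.0    # mean vote across all movies

few_votes = weighted_rating(v=100, R=9.0, m=m, C=C)      # high average, few votes
many_votes = weighted_rating(v=10000, R=8.0, m=m, C=C)   # lower average, many votes

print(round(few_votes, 2))    # 5.8
print(round(many_votes, 2))   # 7.88
```

Even though the first movie has the higher raw average, its score ends up lower because only a fifth of the weight comes from its own rating.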


In [None]:
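# v: vote counts of all movies that have one
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

# R: average ratings (note: the cast to int truncates, e.g. 7.8 becomes 7)
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')

# C: mean vote across all movies
C = vote_averages.mean()
C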

In [None]:
m = vote_counts.quantile(0.95)
m

In [None]:
# `x != np.nan` is always True, so missing dates must be detected with pd.notnull
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(
    lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)

In [None]:
qualified = md[(md['vote_count'] >= m) & 
               (md['vote_count'].notnull()) & 
               (md['vote_average'].notnull())][['title', 
                                                'year', 
                                                'vote_count', 
                                                'vote_average', 
                                                'popularity', 
                                                'genres']]

qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

- Therefore, to qualify for the chart, a movie must have at least __434 votes__ on TMDB.
- We also see that the __average rating__ for __a movie on TMDB__ is __5.244 on a scale of 10__.
- Here, only __2274 movies__ qualify to be on our chart.

In [None]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [None]:
qualified['wr'] = qualified.apply(weighted_rating, axis=1)

In [None]:
qualified = qualified.sort_values('wr', ascending=False).head(250)

In [None]:
qualified.head(15)

- We see that three Christopher Nolan films (Inception, The Dark Knight and Interstellar) appear at the very top of our chart.
- The chart also indicates a strong bias of TMDB Users towards particular genres and directors.
- Let us now construct our function that builds charts for particular genres.

- For this, we relax our default conditions to the 85th percentile instead of 95.
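
Building a per-genre chart needs one row per (movie, genre) pair rather than one row per movie. The sketch below illustrates that reshaping on a toy frame with invented titles. It uses `DataFrame.explode` as an alternative to the `apply`/`stack` approach in the next cell, so treat it as an illustration of the idea rather than the notebook's exact method.

```python
import pandas as pd

# Toy frame standing in for `md`; the titles and genres are invented
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'genres': [['Drama', 'Romance'], ['Action'], []],
})

# One row per (movie, genre); a movie with an empty list gets NaN and is dropped
gen_toy = toy.explode('genres').rename(columns={'genres': 'genre'})
gen_toy = gen_toy.dropna(subset=['genre'])

# Selecting a genre is now a plain boolean filter
romance = gen_toy[gen_toy['genre'] == 'Romance']
print(gen_toy)
```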

In [None]:
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)
gen_md.head(3).transpose()

In [None]:
def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & 
                   (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: 
                        (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C),
                        axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified

Let us see our method in action by displaying the __Top 15 Romance Movies__ (Romance almost didn't feature at all in our Generic Top Chart despite being one of the most popular movie genres).

#### Top 15 Romantic Movies

In [None]:
build_chart('Romance').head(15)

## Content-based Recommender system

In [None]:
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [None]:
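def convert_int(x):
    # Return NaN for ids that cannot be parsed as integers (e.g. malformed rows)
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan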

In [None]:
md['id'] = md['id'].apply(convert_int)
md[md['id'].isnull()]

In [None]:
md = md.drop([19730, 29503, 35587])

In [None]:
md['id'] = md['id'].astype('int')

In [None]:
smd = md[md['id'].isin(links_small)]
smd.shape

We have __9099 movies__ available in our small movies metadata dataset, which is about one fifth the size of our original dataset of 45000 movies.

#### Theoretical note:
#### CountVectorizer vs TfidfVectorizer

In text processing and machine learning, `CountVectorizer` and `TfidfVectorizer` are two techniques used to convert text data into numerical vectors, suitable for use with machine learning algorithms. Here's an overview of the differences:

__CountVectorizer__

- **Purpose**: `CountVectorizer` transforms a list of text documents into a matrix of token (word) counts, commonly known as the "bag of words" model.
- **Output**: Each word in a document is represented by the count of its occurrence within that document.
- **Usage**: Useful when the frequency information is important, but it does not account for the relative importance of a word across the document set.
- **Advantages**: Simplicity and good performance on some tasks.
- **Disadvantages**: It doesn't consider the relative importance of words. Common words that appear in many documents, like "is" or "the", will have the same weight as rarer but potentially more informative words.

__TfidfVectorizer__

- **Purpose**: `TfidfVectorizer` operates similarly to `CountVectorizer` in creating features from text but also weights the words by their importance using the term frequency-inverse document frequency (TF-IDF) metric.
- **Output**: A word receives a higher weight in a document if it occurs frequently in that document but in few other documents across the set, which reduces the influence of common words.
- **Usage**: TF-IDF is useful when you need to assess the importance of words in documents and a collection of documents (e.g., document retrieval and recommendation).
- **Advantages**: It takes into account not only the frequency of words in a document but also how unique these words are with respect to the whole collection of documents.
- **Disadvantages**: It's a bit more complex and computationally intensive than `CountVectorizer`.

In general, if only raw word frequency matters, `CountVectorizer` may suffice. However, if you want to consider word importance relative to the entire dataset, `TfidfVectorizer` is the better option.
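
As a rough illustration of the difference, the sketch below runs both vectorizers with their default settings on a tiny invented corpus. The word "movie" appears in every document and "heist" in only one, so the two words get equal raw counts but different TF-IDF weights.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny invented corpus: "movie" is in every document, "heist" in only one
docs = [
    "a movie about a heist",
    "a movie about a wedding",
    "a movie about a war",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs).toarray()

# Column positions of the two words in each fitted vocabulary
c_movie, c_heist = count_vec.vocabulary_['movie'], count_vec.vocabulary_['heist']
t_movie, t_heist = tfidf_vec.vocabulary_['movie'], tfidf_vec.vocabulary_['heist']

print(counts[0, c_movie], counts[0, c_heist])   # raw counts in document 0 are equal
print(tfidf[0, t_movie] < tfidf[0, t_heist])    # the rarer word gets the larger weight
```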


### Content-based Recommender System: Using movie description and taglines

- Let us first try to build a recommender using movie descriptions and taglines.
- We do not have a quantitative metric to judge our machine's performance so this will have to be done qualitatively.

In [None]:
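# Join with a space so the last word of the overview does not fuse with the first word of the tagline
smd.loc[:, 'description'] = smd['overview'].fillna('') + ' ' + smd['tagline'].fillna('')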

In [None]:
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2), stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])

In [None]:
tfidf_matrix.shape

- `TfidfVectorizer` L2-normalises every row by default (`norm='l2'`), so the dot product of two rows is already their cosine similarity.

- Therefore, we will use sklearn's `linear_kernel` instead of `cosine_similarity`, since it skips the redundant normalisation step and is faster.
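
The reasoning above can be checked by hand. The sketch below uses two invented vectors as stand-ins for TF-IDF rows and confirms that, once both are scaled to unit length, a plain dot product gives the same number as the cosine similarity formula.

```python
import numpy as np

# Two invented non-negative feature vectors (stand-ins for TF-IDF rows)
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

# Cosine similarity computed from its definition
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# L2-normalise first, then take a plain dot product
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = a_unit @ b_unit

print(np.isclose(cosine, dot_of_units))   # True
```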

In [None]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
cosine_sim[0]

- We now have a pairwise cosine similarity matrix for all the movies in our dataset.
- The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.

In [None]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [None]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

- We're all set...!
- Let us now try and get the top recommendations for a few movies and see how good the recommendations are.

In [None]:
get_recommendations('The Godfather').head(10)

In [None]:
get_recommendations('The Dark Knight').head(10)

We see that for __The Dark Knight__, our system is able to identify it as a __Batman film and subsequently recommend other Batman films__ as its top recommendations.

But unfortunately, that is all this system can do at the moment.

This is not of much use to most people, as it doesn't take into consideration very important features such as cast, crew, director and genre, which determine the rating and the popularity of a movie.

Someone who liked The Dark Knight probably liked it mostly because of Nolan and would hate Batman Forever and every other substandard movie in the Batman franchise.

Therefore, we are going to use much more suggestive metadata than Overview and Tagline.
In the next subsection, we will build a more sophisticated recommender that takes __genre, keywords, cast and crew__ into consideration.

### Content-based Recommender System: Using movie description, taglines, keywords, cast, director and genres
- To build our standard metadata-based content recommender, we will need to __merge our current dataset with the credits and keywords datasets.__
- Let us prepare this data as our first step.

In [None]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')

In [None]:
md.shape

In [None]:
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')

In [None]:
smd = md[md['id'].isin(links_small)]
smd.shape

We now have our cast, crew, genres and keywords, all in one dataframe. Let us wrangle this a little more using the following intuitions:

__1. Crew:__ From the crew, we will only pick the director as our feature, since the other crew members contribute less to the feel of the movie.

__2. Cast:__ Choosing the cast is a little trickier. Lesser-known actors and minor roles do not really affect people's opinion of a movie. Therefore, we select only the major characters and their respective actors. Somewhat arbitrarily, we choose the first 3 actors that appear in the credits list.
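
The two rules above can be sketched on a single record. The strings below are invented and only mimic the stringified lists of dictionaries stored in `credits.csv`; real rows carry more fields per entry.

```python
from ast import literal_eval

# Invented strings mimicking the stringified lists in credits.csv
raw_crew = "[{'job': 'Editor', 'name': 'Person E'}, {'job': 'Director', 'name': 'Person D'}]"
raw_cast = "[{'name': 'Actor 1'}, {'name': 'Actor 2'}, {'name': 'Actor 3'}, {'name': 'Actor 4'}]"

# literal_eval safely turns the text back into Python lists of dicts
crew = literal_eval(raw_crew)
cast = literal_eval(raw_cast)

# First crew member whose job is 'Director', or None if there is none
director = next((c['name'] for c in crew if c['job'] == 'Director'), None)

# Keep only the first three billed actors
top_cast = [c['name'] for c in cast][:3]

print(director, top_cast)
```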

In [None]:
smd['cast'] = smd['cast'].apply(literal_eval)
smd['crew'] = smd['crew'].apply(literal_eval)
smd['keywords'] = smd['keywords'].apply(literal_eval)
smd['cast_size'] = smd['cast'].apply(lambda x: len(x))
smd['crew_size'] = smd['crew'].apply(lambda x: len(x))

In [None]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [None]:
smd['director'] = smd['crew'].apply(get_director)
smd['cast'] = smd['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
# Keep at most the first three billed actors (slicing is safe for shorter lists)
smd['cast'] = smd['cast'].apply(lambda x: x[:3])
smd['keywords'] = smd['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

- Our approach to building the recommender is deliberately simple.

- We create a metadata dump for every movie, consisting of its genres, director, main actors and keywords.

- We then use a __Count Vectorizer__ to create our __count matrix__.

- The remaining steps are similar to what we did earlier: we calculate the cosine similarities and return the movies that are most similar.

These are the steps we follow in preparing the genres and credits data:

1. __Strip Spaces and Convert to Lowercase__ for all our features. This way, our engine will not confuse Johnny Depp with Johnny Galecki.
2. __Mention Director 3 times__ to give it __more weight relative to the entire cast.__
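
Here is a minimal sketch of these steps for a single movie, using invented metadata. The helper `normalise` and the variable `director_weight` are hypothetical names introduced for this illustration; the weight is set to 3 because the code cell below repeats the director token three times.

```python
def normalise(name):
    """Lowercase a name and remove spaces so it becomes a single token."""
    return name.lower().replace(' ', '')

# Invented metadata for one movie
keywords = ['dream', 'heist']
cast = ['Actor One', 'Actor Two']
director = 'Some Director'
genres = ['Action', 'Thriller']

director_weight = 3  # number of times the director token is repeated (adjustable)

tokens = (
    keywords
    + [normalise(a) for a in cast]
    + [normalise(director)] * director_weight
    + genres
)
soup = ' '.join(tokens)
print(soup)
```

Because "Actor One" becomes the single token `actorone`, the vectorizer can no longer match two different people who merely share a first name.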

In [None]:
smd['cast'] = smd['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
smd['director'] = smd['director'].apply(lambda x: [x,x, x])

__Keywords__

- We will do a small amount of pre-processing of our keywords before putting them to any use.
- We __calculate the frequency counts of every keyword__ that appears in the dataset.

In [None]:
s = smd.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'
s = s.value_counts()
s[:5]

- Keywords occur in frequencies ranging from 1 to 610.
- We do not have any use for keywords that occur only once.
- Therefore, these can be safely removed.
- Finally, we will convert every word to its stem so that words such as Dogs and Dog are considered the same.
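
The frequency filter can be sketched without pandas on a few invented keyword lists. Stemming is left out here to keep the example dependency-free; the notebook handles it with `SnowballStemmer`.

```python
from collections import Counter

# Invented keyword lists, one per movie
movie_keywords = [
    ['heist', 'dream', 'paris'],
    ['heist', 'bank'],
    ['dream', 'sleep'],
]

# Count how many times each keyword occurs across all movies
freq = Counter(k for kws in movie_keywords for k in kws)

# Keep only keywords that occur more than once
filtered = [[k for k in kws if freq[k] > 1] for kws in movie_keywords]
print(filtered)   # [['heist', 'dream'], ['heist'], ['dream']]
```

A keyword that occurs only once can never be shared between two movies, so dropping it does not change any pairwise similarity based on shared keywords.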

In [None]:
s = s[s > 1]

In [None]:
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

In [None]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [None]:
smd['keywords'] = smd['keywords'].apply(filter_keywords)
smd['keywords'] = smd['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
smd['keywords'] = smd['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [None]:
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres']
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))

In [None]:
count = CountVectorizer(analyzer='word',ngram_range=(1, 2), stop_words='english')
count_matrix = count.fit_transform(smd['soup'])

In [None]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [None]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

- We will reuse the get_recommendations function that we had written earlier.
- Since our cosine similarity scores have changed, we expect it to give us different (and probably better) results.
- Let us check The Dark Knight again and see what recommendations we get this time around.

In [None]:
get_recommendations('The Dark Knight').head(10)

- The recommendations seem to have recognized other Christopher Nolan movies (due to the high weightage given to director) and put them as top recommendations.
- I enjoyed watching The Dark Knight as well as some of the other ones in the list including Batman Begins, The Prestige and The Dark Knight Rises.

__Improvement__

- We can of course experiment on this engine by trying out different weights for our features (directors, actors, genres), limiting the number of keywords that can be used in the soup, weighing genres based on their frequency, only showing movies with the same languages, etc.

In [None]:
get_recommendations('Inception').head(10)

In [None]:
get_recommendations('Mean Girls').head(10)

In [None]:
get_recommendations('Pulp Fiction').head(10)

__Add Popularity and Ratings__
- One thing that we notice about our recommendation system is that it recommends movies regardless of ratings and popularity. It is true that Batman and Robin shares many characters with The Dark Knight, but it was a terrible movie that shouldn't be recommended to anyone.

- Therefore, we will add a mechanism to remove bad movies and return movies which are popular and have had a good critical response.

- We will take the top 25 movies based on similarity scores and calculate the vote count of the 60th percentile movie. Then, using this as the value of $m$, we will calculate the weighted rating of each movie using IMDB's formula like we did in the Simple Recommender section.
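
The filtering step can be sketched on five invented candidates that are assumed to have already been selected by similarity. The cutoff `m` and the mean `C` are computed from the candidates themselves, and the candidates that survive the cutoff are ranked by weighted rating.

```python
import pandas as pd

# Invented candidates, assumed to be already chosen by similarity
movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'vote_count': [50, 200, 1000, 3000, 8000],
    'vote_average': [9.0, 4.0, 7.0, 6.0, 8.0],
})

C = movies['vote_average'].mean()          # mean rating of the candidates
m = movies['vote_count'].quantile(0.60)    # vote count at the 60th percentile

# Drop candidates with too few votes, then score the rest
qualified = movies[movies['vote_count'] >= m].copy()
v, R = qualified['vote_count'], qualified['vote_average']
qualified['wr'] = v / (v + m) * R + m / (v + m) * C

ranked = qualified.sort_values('wr', ascending=False)
print(ranked[['title', 'wr']])
```

Movie A has the highest raw average but only 50 votes, so it falls below the cutoff and is not recommended.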

In [None]:
def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]  # top 25, skipping the movie itself
    movie_indices = [i[0] for i in sim_scores]
    
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    # C and m are computed from the 25 candidates, not from the whole dataset
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & 
                       (movies['vote_average'].notnull())].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    # Use the local m and C; weighted_rating() would read the global values instead
    qualified['wr'] = qualified.apply(
        lambda x: (x['vote_count'] / (x['vote_count'] + m) * x['vote_average'])
                  + (m / (m + x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

In [None]:
improved_recommendations('The Dark Knight')

In [None]:
improved_recommendations('Pulp Fiction')

## Collaborative filtering based Recommender System
__Our content based engine suffers from some severe limitations.__

- It is only capable of suggesting movies which are close to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.
- Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who (s)he is.
- Therefore, in this section, we will use Collaborative Filtering to make recommendations to movie watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used/experienced but I have not.
- We will not be implementing Collaborative Filtering from scratch. Instead, we will use the Surprise library, which provides powerful algorithms such as __Singular Value Decomposition (SVD) to minimise RMSE (Root Mean Square Error) and give great recommendations.__
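
To give a feel for what an SVD-style recommender learns, here is a rough sketch of biased matrix factorisation trained with stochastic gradient descent on a handful of invented ratings. It is not Surprise's implementation. The hyperparameters and data are assumptions chosen for illustration, and the prediction rule (global mean plus user bias plus item bias plus a dot product of latent factors) follows the general form used by this family of models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented (user, item, rating) triples
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2

mu = np.mean([r for _, _, r in ratings])        # global mean rating
bu = np.zeros(n_users)                          # user biases
bi = np.zeros(n_items)                          # item biases
P = rng.normal(0, 0.1, (n_users, n_factors))    # user latent factors
Q = rng.normal(0, 0.1, (n_items, n_factors))    # item latent factors

def predict(u, i):
    return mu + bu[u] + bi[i] + P[u] @ Q[i]

def rmse():
    return np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in ratings]))

rmse_before = rmse()
lr, reg = 0.05, 0.02  # learning rate and regularisation (illustrative values)
for _ in range(200):
    for u, i, r in ratings:
        err = r - predict(u, i)
        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])
        # Both right-hand sides are evaluated before either row is overwritten
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))
rmse_after = rmse()

print(rmse_before, rmse_after)
```

The model never looks at what a movie is about. It only adjusts numbers attached to user and item IDs until the known ratings are reproduced more closely.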

In [None]:
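# The ratings in ratings_small.csv range from 0.5 to 5, so set the scale explicitly
reader = Reader(rating_scale=(0.5, 5))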

In [None]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

In [None]:
svd = SVD()

# Run 5-fold cross-validation and print results
results = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

In [None]:
trainset = data.build_full_trainset()
svd.fit(trainset)

In [None]:
ratings[ratings['userId'] == 1]

In [None]:
svd.predict(1, 302)

- For the movie with ID 302, we get an estimated rating of 2.691 in this run; the exact value changes between runs because SVD starts from randomly initialised factors. One striking feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of the assigned movie ID and predicts ratings from how other users have rated the movie.

## Hybrid recommendation system
- In this section, we will try to build a simple hybrid recommender that brings together the techniques we have implemented in the content-based and collaborative filtering engines. This is how it will work:

- __Input:__ User ID and the Title of a Movie

- __Output:__ Similar movies sorted on the basis of expected ratings by that particular user.
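
The two-stage idea can be sketched with plain dictionaries. The similarity scores and per-user predictions below are invented stand-ins for the cosine similarity matrix and the trained collaborative model, and `hybrid_sketch` is a hypothetical name used only for this illustration.

```python
# Invented similarity scores to the query movie (content-based step)
similarity = {'Movie A': 0.9, 'Movie B': 0.8, 'Movie C': 0.7, 'Movie D': 0.1}

# Invented per-user rating predictions (stand-in for a trained collaborative model)
predicted = {
    1: {'Movie A': 2.5, 'Movie B': 4.5, 'Movie C': 3.5, 'Movie D': 5.0},
    2: {'Movie A': 4.8, 'Movie B': 3.0, 'Movie C': 4.0, 'Movie D': 1.0},
}

def hybrid_sketch(user_id, top_n_similar=3):
    # Step 1: shortlist the most similar movies, regardless of user
    shortlist = sorted(similarity, key=similarity.get, reverse=True)[:top_n_similar]
    # Step 2: order the shortlist by this user's predicted rating
    return sorted(shortlist, key=lambda t: predicted[user_id][t], reverse=True)

print(hybrid_sketch(1))   # ['Movie B', 'Movie C', 'Movie A']
print(hybrid_sketch(2))   # ['Movie A', 'Movie C', 'Movie B']
```

Both users get the same shortlist, but in a different order. Movie D never appears for user 1, despite its high predicted rating, because it is not similar to the query movie.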

In [None]:
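def convert_int(x):
    # Return NaN for ids that cannot be parsed as integers (e.g. missing tmdbId)
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan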

In [None]:
id_map = pd.read_csv('hybrid_data/links_small.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(smd[['title', 'id']], on='id').set_index('title')

In [None]:
indices_map = id_map.set_index('id')

In [None]:
def hybrid(userId, title):
    idx = indices[title]
    # Content-based step: the 25 movies most similar to the given title
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'release_date', 'id']].copy()
    # Collaborative step: map each TMDb id to its MovieLens movieId and predict this user's rating
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

In [None]:
hybrid(1, 'Avatar')

In [None]:
hybrid(5000, 'Avatar')