# Introduction

As the number of choices available on the Internet grows daily, filtering, personalizing and prioritizing relevant information according to each user's needs has become a necessity. Recommendation systems address this problem by building personalized content for each individual from large amounts of data. This technique alleviates data overload and helps customers make better choices.

The main aim of this notebook is to build a movie recommendation system using the TMDB 5000 Movie Dataset.

First, we will clean the data to prepare the dataset for further manipulation. This step matters because it detects inaccurate, unreasonable and invalid information, which improves the quality of the data and makes the recommendation system easier to build.

Second, we will analyse the dataset to get a closer look at it. This is how we get to know the data and the relations between its columns. Understanding the evolution, variation and structure of the data will guide the reasoning behind our recommendation system.

Finally, we will build several recommendation engines, from the simplest and most intuitive to more sophisticated ones. For each, we will discuss its strengths and limitations.

# Data Cleaning

## Import base

In [None]:
import pandas as pd

In [None]:
credits = pd.read_csv('tmdb_5000_credits.csv')
movies = pd.read_csv('tmdb_5000_movies.csv')

In [None]:
credits.head()

In [None]:
movies.head()

In [None]:
movies.columns

In [None]:
credits.dtypes
credits.columns

In [None]:
movies.shape

In [None]:
credits.shape

In [None]:
# Let's just transform the release date to a datetime type
movies['release_date'] = pd.to_datetime(movies['release_date'])

## Merge credits and movie

In [None]:
# credits and movies have a common column, title, that we need to drop
movies.drop(columns='title', inplace=True)

In [None]:
data = credits.merge(movies, left_on='movie_id', right_on='id')
data.drop(columns='id', inplace=True)

In [None]:
data

In [None]:
data.set_index('movie_id', inplace=True)

## Removing missing values

In [None]:
data.isna().sum()

We see that almost all homepages are missing, and as they are not particularly useful, we can simply drop the column. As for the release date and the runtime, we should be able to substitute reasonable values. The missing taglines will be the most difficult to handle, but we might end up not using them.

In [None]:
# Let's take care of the missing release date.
# This particular movie was released too recently and all of its fields are missing, so we drop it.
# Note: empty lists and zero values are not counted as missing here; they are handled further down.
data.drop(index=data[data['release_date'].isna()].index, inplace=True)

In [None]:
# Two movies have no runtime; we fill in their values manually
chimameti = data[data['runtime'].isna()].index[0]
frank_sinatra = data[data['runtime'].isna()].index[1]
data.loc[chimameti, 'runtime'] = 113.0
data.loc[frank_sinatra, 'runtime'] = 81.0

In addition to the existing data types, we are going to distinguish between list columns and string columns. This matters because the lists have sub-attributes that can be extracted later, and also because a missing value in a list column shows up as an empty list, which pandas does not consider NaN.
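
To illustrate the last point, here is a minimal sketch on a toy column (the values are hypothetical, not taken from the dataset). It shows that `isna()` only flags a true missing value, so empty lists have to be detected by their length.

```python
import pandas as pd

# Hypothetical toy column: one populated list, one empty list, one true missing value
toy = pd.Series([["Action", "Drama"], [], None], name="genres")

# isna() only flags the real missing value, not the empty list
nan_mask = toy.isna()

# Empty lists have to be detected by their length instead
empty_mask = toy.map(lambda value: isinstance(value, list) and len(value) == 0)

print(nan_mask.tolist())    # [False, False, True]
print(empty_mask.tolist())  # [False, True, False]
```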

In [None]:
data.dtypes

In [None]:
# Let's split the database by column type so that we can search for missing values properly
(data.select_dtypes('number') == 0).sum()

In [None]:
# Here we split the columns between strings and json 
str_cols, json_cols = [], []
for column in data.select_dtypes('object'):
    if data[column].iloc[0][0] == '[':
        json_cols.append(column)
    else:
        str_cols.append(column)

In [None]:
import ast
# ast.literal_eval only accepts Python literals, which makes it safer than eval
for column in json_cols:
    data[column] = data[column].map(ast.literal_eval)

In [None]:
(data[json_cols].applymap(len) == 0).sum()

In [None]:
data[str_cols].isna().sum()

In [None]:
chimameti_desc = 'A biopic of the rise of father Jorge Mario Bergoglio SJ from a teacher in a Jesuit High School in Argentina to archbishop and cardinal of Buenos Aires to Pope of the Roman Catholic Church. The story touches on his relation with his fellow Jesuits in Argentina and Europe, to his relation with laureate writer Jorge Luis Borges, Argentine dictator Jorge Rafael Videla, and archbishops Laghi (nuncio to Argentina) and Quarracino (cardinal of Buenos Aires), up to the moment where he is elected Pope in 2013'
frank_sinatra_desc = 'The life of Frank Sinatra, as an actor and singer and the steps along the way that led him to become such an icon.'
food_chains_desc = 'There is so much interest in food these days yet there is almost no interest in the hands that pick that food. In the US, farm labor has always been one of the most difficult and poorly paid jobs and has relied on some of the nation\'s most vulnerable people. While the legal restrictions which kept people bound to farms, like slavery, have been abolished, exploitation still exists, ranging from wage theft to modern-day slavery. These days, this exploitation is perpetuated by the corporations at the top of the food chain: supermarkets. Their buying power has kept wages pitifully low and has created a scenario where desperately poor people are willing to put up with anything to keep their jobs.'

In [None]:
data[data['overview'].isna()]

In [None]:
# Let's fill in manually the missing overviews
data.loc[chimameti, 'overview'] = chimameti_desc
data.loc[frank_sinatra, 'overview'] = frank_sinatra_desc
data.loc[data['overview'].isna(), 'overview'] = food_chains_desc

In [None]:
data.drop(columns='homepage', inplace=True)

We delete the rows where cast or crew is missing, because recovering this information would take much more work and only a few rows are affected.

In [None]:
data.drop(index=data[data['crew'].map(len) == 0].index, inplace=True)
data.drop(index=data[data['cast'].map(len) == 0].index, inplace=True)

Now we need to consider the remaining missing values in the list columns. Those are more difficult to handle because they involve ids and conventions from the database. It's relatively easy to scrape IMDb, but harder to match the information found there with what we already have. We will leave this part until we have a specific idea of our needs in terms of missing values.

## Transforming jsons into lists

We have multiple JSON columns, which will be easier to manipulate once transformed into lists of names.

In [None]:
json_cols

In [None]:
def json_to_list(records):
    # Keep only the 'name' field of each parsed record
    return [element['name'] for element in records]
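
As a rough illustration of the whole transformation on a single cell, the sketch below parses a raw string and keeps only the names. The string is a hypothetical example shaped like the entries of the JSON columns, and `json.loads` is used here as one possible parser, which is an assumption rather than what the notebook itself does.

```python
import json

# Hypothetical raw cell, shaped like the strings stored in the JSON columns
raw_cell = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

records = json.loads(raw_cell)                   # list of dicts
names = [record["name"] for record in records]   # keep only the names

print(names)  # ['Action', 'Adventure']
```

An empty JSON array (`'[]'`) simply becomes an empty list, which is how missing values appear in these columns.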

In [None]:
data[json_cols] = data[json_cols].applymap(json_to_list)

In [None]:
data[json_cols].head()

# Initial analysis

In [None]:
import ast

In [None]:
import matplotlib.pyplot as plt
plt.style.use('seaborn')

In [None]:
import datetime

In [None]:
import numpy as np

In [None]:
import seaborn as sns

## Descriptive statistics

In [None]:
# statistical description of the data base
data.describe()

For further information, we can explore boxplots of individual columns, which give a better picture of how the values are distributed.

In [None]:
#the budget boxplot
bud = data['budget']
plt.boxplot(bud)
plt.show()

In [None]:
#the revenue boxplot
rev = data['revenue']
plt.boxplot(rev)
plt.show()

In [None]:
#the vote-average boxplot
vote = data['vote_average']
plt.boxplot(vote)
plt.show()

In [None]:
# Correlations are only defined for the numeric columns
data.select_dtypes('number').corr()

In [None]:
# Correlation map visualisation (numeric columns only)
f, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(data.select_dtypes('number').corr(), annot=True, linewidths=.5, fmt='.1f', ax=ax)
plt.title('Matrix of Correlation between the variables', weight='bold')
plt.show()


In [None]:
#showing the level of correlation between budget and vote_average using Scatter Plot
data.plot(kind='scatter', x='vote_average', y='budget', alpha=.5, color='r')
plt.xlabel('vote_average')
plt.ylabel('budget')
plt.title('level of correlation between budget and vote average', weight='bold')
plt.show()

In [None]:
#showing the level of correlation between budget and revenue using Scatter Plot
data.plot(kind = 'scatter', x = 'revenue', y = 'budget')
plt.title('level of correlation between budget and revenue',weight='bold')
plt.show()

In [None]:
# We add a new column 'profit_rate', the ratio of revenue to budget
# (movies with a budget of 0 get an infinite or undefined rate)
data['profit_rate'] = data.revenue / data.budget
data.head()

In [None]:
high_revenue = data[['title', 'budget', 'revenue','profit_rate', 'release_date']].sort_values('revenue', ascending=False)
x = high_revenue.head(10)
x

Here we can see that having the highest budget or revenue does not mean having the highest profit rate. For instance, Minions reached a profit rate of 15.6 with a relatively low budget and revenue, whereas Avatar, with the highest revenue and budget, reached a lower profit rate of 11.7.
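
One caveat with this ratio is that a budget of 0 produces an infinite or undefined value. The sketch below shows one possible way to handle it, by treating a zero budget as unknown. The figures are hypothetical, and replacing zeros with NaN is an assumption, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical figures, not taken from the dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [100, 0, 50],
    "revenue": [250, 40, 25],
})

# Treat a zero budget as unknown so the ratio becomes NaN instead of inf
toy["profit_rate"] = toy["revenue"] / toy["budget"].replace(0, np.nan)

print(toy["profit_rate"].tolist())  # [2.5, nan, 0.5]
```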

## Genres analysis

In [None]:
from collections import Counter

In [None]:
genres = pd.DataFrame.from_dict(dict(Counter(data['genres'].sum())), orient='index').reset_index()
genres.columns=['genre', 'movies']
genres.sort_values(by='movies', inplace=True, ascending=False)
genres

In [None]:
plt.figure(figsize=(18,8))
ax = sns.barplot(x='movies', y='genre', data=genres)
ax.set_xlabel('number of movies')
ax.set_ylabel('movie\'s genre')
plt.title('Distribution of movies by genre',weight='bold')
plt.show()

To see how genres are distributed over time, it is more helpful to work with the year of release than with release_date, which gives the exact date. We can then group the films by year and look at the distribution of genres.

### Genres in history

In [None]:
# To have a better insight on the release_date
data['release_date'].head()

In [None]:
# Extracting the year of release
data['year'] = data['release_date'].dt.strftime('%Y')

In [None]:
data.head()

Now that we have the column 'year', the visualisation is easier to code
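
As a small sketch of where this is heading, the example below goes from a release date to a year and then to a decade by flooring the year to a multiple of 10. The dates are hypothetical, and `dt.year` is used to get integers directly, which differs slightly from the string-based 'year' column above.

```python
import pandas as pd

# Hypothetical release dates
dates = pd.Series(pd.to_datetime(["1994-09-23", "2009-12-10", "2012-07-16"]))

years = dates.dt.year           # integer years
decades = years - years % 10    # floor each year to the start of its decade

print(years.tolist())    # [1994, 2009, 2012]
print(decades.tolist())  # [1990, 2000, 2010]
```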

In [None]:
# we can start by obtaining the range of years in the dataset
# minimum range value
min_year = data['year'].min()
# maximum range value
max_year = data['year'].max()
# print the range
print(min_year, max_year)

In [None]:
# Grouping films by decade: floor each year to the start of its decade
data['decade'] = data['year'].apply(lambda x: int(x) - int(x) % 10)
data

In [None]:
dec_genre = pd.DataFrame(data['decade'].value_counts()).reset_index()
dec_genre.columns = ['decades', 'movies']
dec_genre

In [None]:
plt.pie(dec_genre.movies, labels=dec_genre.decades,autopct='%2.1f%%',  startangle=180, shadow=True)
plt.axis('equal')
plt.title('proportion of movies in each decade', weight='bold')
plt.show()

In [None]:
decaded_genre = data.groupby(by='decade').agg({'genres':"sum"})['genres'].apply(lambda x : dict(Counter(x)))
decaded_genre

In [None]:
genre_decaded = pd.DataFrame(decaded_genre.tolist(),index=decaded_genre.index).fillna(0).applymap(int)
genre_decaded

In [None]:
ax = genre_decaded.applymap(lambda x : 0 if x < 10 else x).plot(kind='barh', stacked=True, figsize=(10, 6))
ax.set_ylabel('Decades')
ax.set_xlabel('Number of movies')
plt.legend(title='Genre', bbox_to_anchor=(1.0, 1), loc='upper left')
plt.title ('Distribution of movies\' genres by decades',weight='bold')
plt.show()

## Actors and Director

In [None]:
actors = pd.DataFrame.from_dict(dict(Counter(data['cast'].sum())), orient='index').reset_index()
actors.columns=['actor', 'movies']
actors.sort_values(by='movies', inplace=True, ascending=False)
Top_actors = actors.head(10)
Top_actors

In [None]:
plt.figure(figsize=(18,8))
ax = sns.barplot(x='movies', y='actor', data=Top_actors)
ax.set_xlabel('number of movies')
ax.set_ylabel('the actor')
plt.title('Actors with the most appearances', weight='bold')
plt.show()

## Visualizing text fields

In [None]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

In [None]:
#converting the type of title and overview
data['title'] = data['title'].astype('str')
data['overview'] = data['overview'].astype('str')

In [None]:
title_corpus = ' '.join(data['title'])
overview_corpus = ' '.join(data['overview'])

In [None]:
# Creating and generating a word cloud image based on the title:
wordcloud = WordCloud().generate(title_corpus)
plt.figure(figsize=(12,12))# Display the generated image
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('title wordcloud',weight='bold')
plt.show()

In [None]:
# Creating and generating a word cloud image based on the overview:
wordcloud = WordCloud().generate(overview_corpus)
plt.figure(figsize=(12,12))# Display the generated image
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('overview wordcloud',weight='bold')
plt.show()

# Recommendation systems

## Recommendation by rating

We will start with a simple recommendation system which simply recommends the best movies in the database.

In [None]:
# Let's choose the minimum amount of ratings for us to consider the movie
# Here we only take into account movies that have more votes than 80% of the list
min_votes = data['vote_count'].quantile(0.8)
min_votes

In [None]:
v = data['vote_count']
m = min_votes
R = data['vote_average']
C = data['vote_average'].mean()

In [None]:
data['weighted_rating'] = (v / (v + m) * R) + (m / (v+m) * C)
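
The weighted rating above is a shrinkage average: with few votes the score stays close to the global mean C, and with many votes it approaches the movie's own average R. Here is a minimal worked sketch with hypothetical values for m and C (the function name `weighted_rating` is introduced only for illustration).

```python
def weighted_rating(v, R, m, C):
    """Pull the average rating R towards the global mean C when the vote count v is small."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical vote threshold and global mean
m, C = 1000, 6.0

few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)     # stays close to C
many_votes = weighted_rating(v=9000, R=8.0, m=m, C=C)  # stays close to R

print(round(few_votes, 2))   # 6.03
print(round(many_votes, 2))  # 7.8
```

A movie rated 9.0 by only 10 people therefore ranks below a movie rated 8.0 by 9000 people, which is the behaviour we want from a "best movies" list.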

In [None]:
# The best movies of all time ...
data.sort_values('weighted_rating', ascending=False).head(10)

In [None]:
weight_average=data.sort_values('weighted_rating',ascending=False)
plt.figure(figsize=(10,6))
axis1=sns.barplot(x=weight_average['weighted_rating'].head(10), y=weight_average['original_title'].head(10), data=weight_average)
plt.xlim(5, 10)
plt.title('Best Movies of all time', weight='bold')
plt.xlabel('Weighted Score', weight='bold')
plt.ylabel('Movie', weight='bold');

In [None]:
# Which is not the same as if we sort by popularity
data.sort_values('popularity', ascending=False).head(10)

In [None]:
weight_popularity=data.sort_values('popularity',ascending=False)
plt.figure(figsize=(14,6))
axis1=sns.barplot(x=weight_popularity['popularity'].head(10), y=weight_popularity['original_title'].head(10), data=weight_popularity);
plt.title('Best Movies based on Popularity', weight='bold')
plt.xlabel('Popularity', weight='bold')
plt.ylabel('Movie', weight='bold');

A recommendation system based on votes or popularity alone is neither very effective nor very interesting, because it is not personalised at all. So let's see what makes movies close to each other.

## Recommendation based on genre and quality 

We can make another simple recommendation system which is more personalised. Given a movie, this system scores all other movies in terms of genre proximity and rating, and returns the movies with the best scores. It is very simple, but it can be extended quite easily to the keywords column to build a more elaborate recommendation system.

In [None]:
data.head(1)['genres']

In [None]:
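def genre_proximity(movie1, movie2):
    # movie1: one-row DataFrame (reference movie); movie2: a row of the dataset
    genres1, genres2 = movie1.genres.iloc[0], movie2.genres
    total = len(genres1) + len(genres2)
    if total == 0:  # neither movie lists a genre
        return 0.0
    common_genres = [genre for genre in genres1 if genre in genres2]
    return 2 * len(common_genres) / total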

We now have a measure of genre proximity. Because it is coarse and many different movies share the same genres, we refine it by ranking movies with equal proximity by their weighted rating.
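
The proximity measure has the form of the Sørensen-Dice coefficient: twice the number of shared genres, divided by the total number of genres in both lists. The sketch below works through a few hypothetical genre lists. The helper `dice_similarity` is introduced only for illustration and returns 0.0 when both lists are empty, which is an assumption about how that case should be treated.

```python
def dice_similarity(list1, list2):
    """Return 2 * |common| / (|list1| + |list2|), or 0.0 when both lists are empty."""
    total = len(list1) + len(list2)
    if total == 0:
        return 0.0
    common = [item for item in list1 if item in list2]
    return 2 * len(common) / total

# Hypothetical genre lists
a = ["Action", "Crime", "Drama", "Thriller"]
b = ["Action", "Thriller"]
c = ["Comedy"]

print(dice_similarity(a, b))  # 2 * 2 / (4 + 2), about 0.667
print(dice_similarity(a, c))  # 0.0
print(dice_similarity(a, a))  # 1.0
```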

In [None]:
def recommend_by_genre(movie_name):
    movie = data[data['title'] == movie_name]
    sub_data = data.drop(index=movie.index) 
    sub_data['proximity'] = sub_data.apply(lambda x: genre_proximity(movie, x), axis=1)
    sub_data.sort_values(['proximity', 'weighted_rating'], ascending=False, inplace=True)
    return sub_data.head(5)['title']

In [None]:
recommend_by_genre('The Dark Knight Rises')

## Recommendation by description proximity

We are going to vectorize the 'overview' text in order to compute the similarity between different overviews. This recommendation system will recommend movies that are closest based on overview similarity.

Since we will be working with ordered numpy matrices here, it will be easier to reinstate a normal index for the dataframe.

In [None]:
indexed_data = data.reset_index()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
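# TfidfVectorizer L2-normalises each row by default, so the product of the matrix
# with its transpose gives the cosine similarity between every pair of overviews
vect = TfidfVectorizer(min_df=1, stop_words='english')
tfidf = vect.fit_transform(indexed_data.overview)
pairwise_similarity = tfidf @ tfidf.T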
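
To make the similarity matrix more concrete, here is a small sketch on three hypothetical overviews. Because `TfidfVectorizer` normalises each row to unit length by default, the matrix product yields cosine similarities: 1 on the diagonal, a positive value for texts that share words, and 0 for texts with no word in common.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical overviews, not taken from the dataset
docs = [
    "A masked vigilante protects the city from criminals",
    "The vigilante returns to save the city",
    "Two friends open a bakery in Paris",
]

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
toy_similarity = (toy_tfidf @ toy_tfidf.T).toarray()

print(toy_similarity.round(2))
```

The first two overviews share "vigilante" and "city", so their similarity is positive, while the third has no word in common with the first and scores 0.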

In [None]:
def overview_proximity(index1, index2):
    # index1 and index2 are integer row positions in indexed_data
    return pairwise_similarity[index1, index2]

In [None]:
def overview_recommend(movie_name):
    movie_index = indexed_data[indexed_data.title == movie_name].index
    # Only the similarity row of the requested movie needs to be made dense
    indexed_data['overview_prox'] = pairwise_similarity[movie_index[0]].toarray().ravel()
    return indexed_data.sort_values(by='overview_prox', ascending=False).drop(index=movie_index).head(10).title

In [None]:
overview_recommend('The Dark Knight Rises')

We can see that this method is more sophisticated but not necessarily better than the simpler genre recommendation, because it tends to recommend movies that share the same specific vocabulary (here, the same hero). Shared vocabulary is not necessarily an indicator of similarity.

## Clustering-based recommendation system

### Clustering-specific data cleaning

Let's encode the multi-label columns as binary vectors.

In [None]:
all_genres = ['Drama','Comedy','Thriller','Action','Romance','Adventure','Crime','Science Fiction','Horror','Family','Fantasy','Mystery','Animation','History','Music','War','Documentary','Western','Foreign','TV Movie']

In [None]:
def binarize_genres(genre_list):
    binary_list = [int(genre in genre_list) for genre in all_genres]
    return binary_list
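
As a quick sketch of what this encoding produces, the example below applies the same idea to a shortened, hypothetical genre vocabulary. The helper `binarize` and the list `toy_genres` are introduced only for illustration.

```python
# Hypothetical, shortened genre vocabulary (the notebook uses the full all_genres list)
toy_genres = ["Drama", "Comedy", "Thriller", "Action"]

def binarize(labels, vocabulary):
    """Return one 0/1 flag per vocabulary entry, in vocabulary order."""
    return [int(label in labels) for label in vocabulary]

print(binarize(["Action", "Drama"], toy_genres))  # [1, 0, 0, 1]
print(binarize([], toy_genres))                   # [0, 0, 0, 0]
```

Every movie ends up with a vector of the same length, which is what a clustering algorithm expects as input.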