# Predicting movie success

## Table of contents

    1 Introduction & problem statement
    2 Data check & preparation
    3 Descriptives
        3.1 Basic descriptives
        3.2 Genres
        3.3 Keywords
        3.4 Actors
        3.5 Directors
    4 Predictives
        4.1 K-Nearest-Neighbours
        4.2 Regression
        4.3 Comparison of models
    5 Conclusion

## 1. Introduction & problem statement

For this research we used the TMDB 5000 Movie Dataset. This database contains two datasets: tmdb_5000_movies and tmdb_5000_credits. The first dataset mostly contains basic information about the movies, such as title, budget and vote count. The second dataset lists the cast and crew for each movie title. During our analysis these two datasets are combined to generate a complete overview of the available data.


After screening the datasets and analyzing the movie industry, a few interesting questions arose. The most relevant one, in our view, is:

**What are the factors of success for a movie based on runtime and budget?**

During our research we took a few variables into account which could influence the vote average of a movie. The factors that we analysed and elaborated in this notebook are:
- Budget
- Runtime
- Revenue
- Genres
- Keywords
- Actors
- Directors

Moreover, we explored the relations between the variables mentioned above. In the descriptives we deal with some of these relations and elaborate where necessary. One example is the relationship between revenue and the actors that played in a movie. We also discuss the relationship between genre and the overall vote average. With these variables we aim to gain better insight into how the movie industry can make the most efficient use of its resources.

## 2. Data check & preparation

We start by importing all the packages we will be needing for our analyses.

In [None]:
# Package import
from sklearn.preprocessing import Imputer
from sklearn.decomposition import PCA # Principal Component Analysis module
from sklearn.cluster import KMeans # KMeans clustering 
from sklearn.neighbors import KNeighborsClassifier
import nltk
from nltk.corpus import wordnet
PS = nltk.stem.PorterStemmer()
import matplotlib.pyplot as plt
import plotly.offline as pyo
pyo.init_notebook_mode()
from plotly.graph_objs import *
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode,iplot
init_notebook_mode(connected=True)
from pandas.plotting import scatter_matrix
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
%matplotlib inline
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output

import json

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

We import the two datasets (movies, credits) that we use for our research. It is important to note that we work with two versions of the database, both provided by TMDB. In the Genres and Keywords descriptives we use the new version; in the Actors, Directors and Country descriptives we use the old version, for the simple reason that the two versions differ in some additional columns. The primary columns, like title and vote count, are the same in both versions. For an exact overview, see the homepage of the TMDB 5000 Movie Dataset.

We start with the new version of the dataset. The new version consists of two dataframes, one with all the movie data and the other one with all the cast & crew data. We would like to concatenate both these dataframes into a single dataframe.

In [None]:
def load_tmdb_movies(path):
    df = pd.read_csv(path)
    df['release_date'] = pd.to_datetime(df['release_date']).apply(lambda x: x.date())
    json_columns = ['genres', 'keywords', 'production_countries', 'production_companies', 'spoken_languages']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df

def load_tmdb_credits(path):
    df = pd.read_csv(path)
    json_columns = ['cast', 'crew']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df

def pipe_flatten_names(keywords):
    return '|'.join([x['name'] for x in keywords])

credits = load_tmdb_credits("../input/tmdb_5000_credits.csv")
movies = load_tmdb_movies("../input/tmdb_5000_movies.csv")

del credits['title']
df_new = pd.concat([movies, credits], axis=1)

df_new['genres'] = df_new['genres'].apply(pipe_flatten_names)
df_new['keywords'] = df_new['keywords'].apply(pipe_flatten_names)
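
Note that `pd.concat(..., axis=1)` pairs the rows of both tables by position, which is only correct if both files list the movies in the same order. As a hedged alternative, the sketch below joins on the movie identifier instead. The toy frames and the `_demo` names are hypothetical, and the column names `id` and `movie_id` are assumed to match the real files.

```python
import pandas as pd

# Hypothetical toy stand-ins for the two tables (column names are assumptions)
movies_demo = pd.DataFrame({"id": [1, 2, 3],
                            "title": ["Movie A", "Movie B", "Movie C"]})
credits_demo = pd.DataFrame({"movie_id": [3, 1, 2],  # deliberately in a different order
                             "cast": [["Actor C"], ["Actor A"], ["Actor B"]]})

# Join on the identifier; validate="one_to_one" raises if an id occurs twice
merged_demo = movies_demo.merge(credits_demo, left_on="id", right_on="movie_id",
                                how="inner", validate="one_to_one")
print(merged_demo[["title", "cast"]])
```

With a key-based join, a change in row order in either file no longer affects which cast belongs to which movie.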

Next, we convert the data to the layout of the older version of the dataset.

In [None]:
# Columns that existed in the IMDB version of the dataset and are gone.
LOST_COLUMNS = [
    'actor_1_facebook_likes',
    'actor_2_facebook_likes',
    'actor_3_facebook_likes',
    'aspect_ratio',
    'cast_total_facebook_likes',
    'color',
    'content_rating',
    'director_facebook_likes',
    'facenumber_in_poster',
    'movie_facebook_likes',
    'movie_imdb_link',
    'num_critic_for_reviews',
    'num_user_for_reviews'
                ]

# Columns in TMDb that had direct equivalents in the IMDB version. 
# These columns can be used with old kernels just by changing the names
TMDB_TO_IMDB_SIMPLE_EQUIVALENCIES = {
    'budget': 'budget',
    'genres': 'genres',
    'revenue': 'gross',
    'title': 'movie_title',
    'runtime': 'duration',
    'original_language': 'language',  # it's possible that spoken_languages would be a better match
    'keywords': 'plot_keywords',
    'vote_count': 'num_voted_users',
                                         }

IMDB_COLUMNS_TO_REMAP = {'imdb_score': 'vote_average'}


def safe_access(container, index_values):
    # return a missing value rather than an error upon indexing/key failure
    result = container
    try:
        for idx in index_values:
            result = result[idx]
        return result
    except (IndexError, KeyError):
        return np.nan


def get_director(crew_data):
    directors = [x['name'] for x in crew_data if x['job'] == 'Director']
    return safe_access(directors, [0])


def pipe_flatten_names(keywords):
    return '|'.join([x['name'] for x in keywords])


def convert_to_original_format(movies, credits):
    # Converts TMDb data to make it as compatible as possible with kernels built on the original version of the data.
    tmdb_movies = movies.copy()
    tmdb_movies.rename(columns=TMDB_TO_IMDB_SIMPLE_EQUIVALENCIES, inplace=True)
    tmdb_movies['title_year'] = pd.to_datetime(tmdb_movies['release_date']).apply(lambda x: x.year)
    # I'm assuming that the first production country is equivalent, but have not been able to validate this
    tmdb_movies['country'] = tmdb_movies['production_countries'].apply(lambda x: safe_access(x, [0, 'name']))
    tmdb_movies['language'] = tmdb_movies['spoken_languages'].apply(lambda x: safe_access(x, [0, 'name']))
    tmdb_movies['director_name'] = credits['crew'].apply(get_director)
    tmdb_movies['actor_1_name'] = credits['cast'].apply(lambda x: safe_access(x, [1, 'name']))
    tmdb_movies['actor_2_name'] = credits['cast'].apply(lambda x: safe_access(x, [2, 'name']))
    tmdb_movies['actor_3_name'] = credits['cast'].apply(lambda x: safe_access(x, [3, 'name']))
    tmdb_movies['genres'] = tmdb_movies['genres'].apply(pipe_flatten_names)
    tmdb_movies['plot_keywords'] = tmdb_movies['plot_keywords'].apply(pipe_flatten_names)
    return tmdb_movies

credits = load_tmdb_credits("../input/tmdb_5000_credits.csv")
movies = load_tmdb_movies("../input/tmdb_5000_movies.csv")
df_old = convert_to_original_format(movies, credits)
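
A helper like `safe_access` is meant to return a missing value whenever a list position or a dictionary key does not exist. In Python, catching several exception types requires a tuple: the expression `IndexError or KeyError` evaluates to `IndexError` alone. The sketch below illustrates the intended behaviour on hypothetical toy data; `safe_access_demo` and `cast_demo` are illustrative names, not part of the dataset.

```python
import numpy as np

def safe_access_demo(container, index_values):
    """Follow a chain of indexes/keys; return NaN if any step is missing."""
    result = container
    try:
        for idx in index_values:
            result = result[idx]
        return result
    except (IndexError, KeyError):  # a tuple is needed to catch both types
        return np.nan

# Hypothetical cast list: the second entry has no 'name' key
cast_demo = [{"name": "Actor A"}, {"character": "no name key here"}]
print(safe_access_demo(cast_demo, [0, "name"]))  # existing entry
print(safe_access_demo(cast_demo, [5, "name"]))  # list too short, IndexError is caught
print(safe_access_demo(cast_demo, [1, "name"]))  # key missing, KeyError is caught
```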

In [None]:
df_new.head()

In [None]:
df_old.head()

Note the difference in structure between the old and the new dataframe.

## 3. Descriptives

### 3.1 Basic descriptives

To start off we would like to answer the following question:

**How are the numerical values distributed?**

In [None]:
df = df_new.copy()
df['vote_classes'] = pd.cut(df['vote_average'],4, labels=["low", "medium-low","medium-high","high"])

In [None]:
df['log_budget'] = np.log(df['budget'])
df['log_popularity'] = np.log(df['popularity'])
df['log_vote_average'] = np.log(df['vote_average'])
df['log_vote_count'] = np.log(df['vote_count'])
df['log_revenue']= np.log(df['revenue'])
df['log_runtime']= np.log(df['runtime'])
df_copy = df
df=df[df.columns[-6:]]
df_copy2 = df

df=df[df.replace([np.inf, -np.inf], np.nan).notnull().all(axis=1)]
df=df.dropna(axis=1)

We will also use this dataframe in our numerical analysis, so we saved a copy of it.
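
Where a value is 0, `np.log` returns `-inf`, which is why the cell above replaces infinite values by NaN before filtering. The sketch below shows this step in isolation on a hypothetical toy frame (`demo`, `logs` and `finite` are illustrative names).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with zeros, as they can occur for budget and revenue
demo = pd.DataFrame({"budget": [0, 1_000_000, 50_000_000],
                     "revenue": [2_500_000, 0, 150_000_000]})

with np.errstate(divide="ignore"):  # silence the log(0) warning
    logs = np.log(demo).add_prefix("log_")

# log(0) gives -inf; turn it into NaN and drop the affected rows
finite = logs.replace([np.inf, -np.inf], np.nan).dropna()
print(finite)
```

Only rows in which every log value is finite remain, so a single zero in any column removes the whole movie from the scatter matrix.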

In [None]:
df_new.head()

In [None]:
scatter_matrix(df,alpha=0.2, figsize=(20, 20), diagonal='kde')

The diagonal shows the kernel density estimate of each variable; the off-diagonal panels show scatter plots for each pair of variables.

Next, we are interested in the following question:

**What countries produce the most movies?**

We use the old dataframe for this.

In [None]:
df = df_old

We first answer the question:

**How many unique countries are represented in our dataframe?**

In [None]:
df['country'].unique()

Next, we extract the number of films per country. Then we answer the main question of this section by means of a pie chart.

In [None]:
df_countries = df['title_year'].groupby(df['country']).count()
df_countries = df_countries.reset_index()
df_countries.rename(columns ={'title_year':'count'}, inplace = True)
df_countries = df_countries.sort_values('count', ascending = False)
df_countries.reset_index(drop=True, inplace = True)
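
The same kind of frequency table can also be sketched with `value_counts`, which counts and sorts in one step. This is an alternative illustration on hypothetical toy data, not the method used above; note that it counts every non-missing country, whereas the `groupby` above only counts rows that also have a release year.

```python
import pandas as pd

# Hypothetical toy column of production countries, including a missing value
country_demo = pd.Series(["United States of America", "France", "United States of America",
                          None, "France", "United States of America"], name="country")

counts_demo = (country_demo.value_counts()   # missing values are excluded by default
                           .rename_axis("country")
                           .reset_index(name="count"))
print(counts_demo)
```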

In [None]:
sns.set_context("poster", font_scale=0.6)
plt.rc('font', weight='bold')
f, ax = plt.subplots(figsize=(11, 6))
labels = [s['country'] if s['count'] > 80 else ' ' 
          for index, s in  df_countries[['country', 'count']].iterrows()]
sizes  = df_countries['count'].values
explode = [0.0] * len(df_countries)  # no wedge is pulled out of the pie
ax.pie(sizes, explode = explode, labels = labels,
       autopct = lambda x:'{:1.0f}%'.format(x) if x > 1 else '',
       shadow=False, startangle=45)
ax.axis('equal')
ax.set_title('% of films per country',
             bbox={'facecolor':'k', 'pad':5},color='w', fontsize=16);

Not surprisingly, most of the movies come from the USA.

We can also show these counts geographically by using a so-called choropleth map.

In [None]:
data = dict(type='choropleth',
locations = df_countries['country'],
locationmode = 'country names', z = df_countries['count'],
text = df_countries['country'], colorbar = {'title':'Films nb.'},
colorscale=[[0, 'rgb(224,255,255)'],
            [0.01, 'rgb(166,206,227)'], [0.02, 'rgb(31,120,180)'],
            [0.03, 'rgb(178,223,138)'], [0.05, 'rgb(51,160,44)'],
            [0.10, 'rgb(251,154,153)'], [0.20, 'rgb(255,255,0)'],
            [1, 'rgb(227,26,28)']],    
reversescale = False)

layout = dict(title='Number of films in the TMDB database',
geo = dict(showframe = True, projection={'type':'mercator'}))

choromap = go.Figure(data = [data], layout = layout)
iplot(choromap, validate=False)

### 3.2 Genres

Now that we've got a good overview of the distribution of our numerical variables, let's take a closer look at our non-numerical variables. We choose to start with the genres: since this variable has the least variability, it should be the easiest target for analysis.

We start off by copying our new dataframe to simply df; then we create a list of all the unique genres.

In [None]:
df = df_new.copy()

In [None]:
liste_genres = set()
for s in df['genres'].str.split('|'):
    liste_genres = set().union(s, liste_genres)
liste_genres = list(liste_genres)
liste_genres.remove('')

liste_genres

Now, let's reduce our data frame. To gain more insight into the influence of a movie's genre, the most important variables are title, vote_average, release_date, runtime, budget and revenue. We also add a column for every genre, containing a 1 if the movie belongs to that genre and a 0 otherwise.

In [None]:
df_reduced = df[['title','vote_average','release_date','runtime','budget','revenue']].reset_index(drop=True)

for genre in liste_genres:
    df_reduced[genre] = df['genres'].str.contains(genre).apply(lambda x:1 if x else 0)

df_reduced.head()
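
As a compact alternative to the loop above, pandas can build the 0/1 genre columns directly from a pipe-separated string column. The sketch below uses hypothetical toy data (`genres_demo`, `flags_demo`) and is only meant to illustrate the encoding.

```python
import pandas as pd

# Hypothetical pipe-separated genre strings, including a movie without genres
genres_demo = pd.Series(["Action|Adventure", "Drama", "", "Action|Science Fiction"])

# One 0/1 column per distinct genre; the empty string yields an all-zero row
flags_demo = genres_demo.str.get_dummies(sep="|")
print(flags_demo)
```

Because `get_dummies` splits on the separator, it matches whole genre names, whereas `str.contains` matches substrings.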

The first question we want to answer is:

**How are the genres distributed and how do we visualize this?**

By creating a pie chart we can get a good overview of the distribution of the genres in the dataset.

In [None]:
plt.rc('font', weight='bold')
f, ax = plt.subplots(figsize=(5,5))
genre_count = []
for genre in liste_genres:
    genre_count.append([genre, df_reduced[genre].values.sum()])
genre_count.sort(key = lambda x:x[1], reverse = True)
labels, sizes = zip(*genre_count)
labels_selected = [n if v > sum(sizes) * 0.01 else '' for n, v in genre_count]
ax.pie(sizes, labels=labels_selected,
      autopct = lambda x:'{:2.0f}%'.format(x) if x>1 else '',
      shadow = False, startangle=0)
ax.axis('equal')
plt.tight_layout()

This pie chart shows which genres are most common in the movies dataset. Drama is the most common genre, followed by comedy, thriller and action. Interestingly, the top 5 genres together account for about half (51%) of the chart. This suggests that the main genre of most movies is drama, comedy, thriller or action. However, the most common genres can also be seen as more general descriptions: for example, a war movie might also be tagged as an action or drama movie.

Now let's try to get a more in-depth view of the genres. In this cell we calculate the average vote, budget and revenue for each genre. We create a new data frame consisting of every genre and the calculated averages. Lastly, we add an extra column, profit, which is simply mean revenue minus mean budget.

In [None]:
mean_per_genre = pd.DataFrame(liste_genres)

# For every genre, average each variable over the movies tagged with that genre (flag == 1)
for source, target in [('vote_average', 'mean_votes_average'),
                       ('budget', 'mean_budget'),
                       ('revenue', 'mean_revenue')]:
    mean_per_genre[target] = [df_reduced.loc[df_reduced[genre] == 1, source].mean()
                              for genre in liste_genres]

mean_per_genre['profit'] = mean_per_genre['mean_revenue']-mean_per_genre['mean_budget']

mean_per_genre

The next question we want to ask is:

**Which genres score best in the categories mean votes average, mean budget, mean revenue and profit?**

We sort the table on the category of interest in descending order and show the top five genres in that category.

In [None]:
mean_per_genre.sort_values('mean_votes_average', ascending=False).head()

In [None]:
mean_per_genre.sort_values('mean_budget', ascending=False).head()

In [None]:
mean_per_genre.sort_values('mean_revenue', ascending=False).head()

In [None]:
mean_per_genre.sort_values('profit', ascending=False).head()

It's very interesting to see that the top 5 for vote average consists of *History, War, Drama, Music* and *Foreign*, while none of these genres appears in the top 5 of any of the other three categories, which all share the same top 3: *Animation, Adventure, Fantasy*. The overlap between those three categories is easily explained, since budget and revenue should be closely related and profit is derived directly from budget and revenue. However, we would have expected a stronger relation between the budget and the quality of a movie.

The next question we want to answer is:

**What are the averages in the categories per year?**

We first extend the dataframe with the year of release of each movie. Afterwards, we create a new dataframe which contains the average votes, average revenue and average budget per release year and per genre.

In the last step in the cell below, only the rows that contain a 1 for genre are kept, so we create a data frame with only the specific genres. 

In [None]:
from datetime import datetime

t = df_reduced['release_date']
t = pd.to_datetime(t)
t = t.dt.year
df_reduced['release_year'] = t

# Mean of every numeric column per (genre flag, release year)
df_list = []
for genre in liste_genres:
    df_list.append(df_reduced.groupby([genre,'release_year']).mean(numeric_only=True).reset_index())

# Keep only the rows where the genre flag (the first column) equals 1
df_per_genre = []
for i in range(len(df_list)):
    df_per_genre.append(df_list[i][df_list[i].iloc[:,0] == 1])

Now we create tables which contain the average budget, average revenue and average votes per year per genre. We start by creating a new table with the columns 1988 to 2017. Afterwards, the table is filled with the data for each variable.

In [None]:
# Budget
columns = range(1988,2018)
budget_genre = pd.DataFrame( columns = columns)

for genre in liste_genres:
    temp=(df_per_genre[liste_genres.index(genre)].pivot_table(index = genre, columns = 'release_year', values = 'budget', aggfunc = np.mean))
    temp = temp[temp.columns[-30:]].loc[1]
    budget_genre.loc[liste_genres.index(genre)]=temp
budget_genre['genre']=liste_genres

# Revenue 

columns = range(1988,2018)
revenue_genre = pd.DataFrame( columns = columns)

for genre in liste_genres:
    temp=(df_per_genre[liste_genres.index(genre)].pivot_table(index = genre, columns = 'release_year', values = 'revenue', aggfunc = np.mean))
    temp = temp[temp.columns[-30:]].loc[1]
    revenue_genre.loc[liste_genres.index(genre)]=temp
revenue_genre['genre']=liste_genres

# Vote average 
columns = range(1988,2018)
vote_avg_genre = pd.DataFrame( columns = columns)

for genre in liste_genres:
    temp=(df_per_genre[liste_genres.index(genre)].pivot_table(index = genre, columns = 'release_year', values = 'vote_average', aggfunc = np.mean))
    temp = temp[temp.columns[-30:]].loc[1]
    vote_avg_genre.loc[liste_genres.index(genre)]=temp
vote_avg_genre['genre']=liste_genres
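
The genre-by-year tables can also be sketched in a single step by first giving every (movie, genre) pair its own row. This is an illustrative alternative on hypothetical toy data (`movies_demo`, `long_demo`, `budget_demo`), not the construction used above.

```python
import pandas as pd

# Hypothetical toy movies with pipe-separated genres
movies_demo = pd.DataFrame({
    "genres": ["Action|Adventure", "Drama", "Action", "Drama|Action"],
    "release_year": [2015, 2015, 2016, 2016],
    "budget": [100.0, 20.0, 60.0, 40.0],
})

# One row per (movie, genre) pair, then the mean budget per genre and year
long_demo = movies_demo.assign(genre=movies_demo["genres"].str.split("|")).explode("genre")
budget_demo = long_demo.pivot_table(index="genre", columns="release_year",
                                    values="budget", aggfunc="mean")
print(budget_demo)
```

Genre and year combinations without any movie stay NaN in the resulting table.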

Let us break our question down into four subquestions:

**What is the average budget per genre per year?**

In [None]:
budget_genre.index = budget_genre['genre']
budget_genre

**What is the average revenue per genre per year?**

In [None]:
revenue_genre.index = revenue_genre['genre']
revenue_genre

**What is the average vote average per genre per year?**

In [None]:
vote_avg_genre.index = vote_avg_genre['genre']
vote_avg_genre

**What is the average profit per genre per year?**

In [None]:
profit_genre = revenue_genre[revenue_genre.columns[0:30]]-budget_genre[budget_genre.columns[0:30]]  # all 30 year columns, 1988 to 2017
profit_genre

Another question we want to answer is:

**How have the different categories per genre evolved over the years?**

We can answer this question visually using heatmaps.

Again we split up this question in four sub-questions:

**How has the relationship between budget and genres evolved?**

In [None]:
fig, ax = plt.subplots(figsize=(9,9))
cmap = sns.cubehelix_palette(start = 1.5, rot = 1.5, as_cmap = True)
sns.heatmap(budget_genre.iloc[:,0:30], xticklabels=3, cmap=cmap, linewidths=0.05)

The heatmap shows that, in general, movie budgets increased over the years, especially for the genres Fantasy, Adventure, Family, Action, Science Fiction and Animation. The heatmap also shows that Western movies had an extremely high average budget in 2013. This could mean that a single costly movie was released in 2013, which has a great influence on the average.

We would like to remove this very high value from the Western genre to make the heatmap less skewed. We want to answer the following question:

**How are the genres and budget related over the years if we remove the outlier?**

In [None]:
temp = budget_genre.copy()  # work on a copy so that budget_genre keeps the original value
temp[2013]=temp[2013].replace(2.550000e+08, 0)

fig, ax = plt.subplots(figsize=(9,9))
cmap = sns.cubehelix_palette(start = 1.5, rot = 1.5, as_cmap = True)
sns.heatmap(temp.iloc[:,0:30], xticklabels=3, cmap=cmap, linewidths=0.05)

This heatmap clearly shows that Fantasy, Adventure, Science Fiction and Animation have on average the highest budgets. It is also clear that budgets increased over the years. However, there are a few exceptions. For example, Western movies had an above-average budget in 2004, as did History movies in 2000. This might be an effect of individual movies with a high budget.
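
Replacing an outlier by 0 makes the cell look like a genuine zero budget. A hedged alternative is to mark the value as missing on a copy of the table; to our knowledge seaborn's `heatmap` leaves cells with missing values blank. The sketch below uses a hypothetical toy table (`table_demo`, `cleaned_demo`).

```python
import numpy as np
import pandas as pd

# Hypothetical genre-by-year table with one extreme value
table_demo = pd.DataFrame({2012: [30.0, 45.0], 2013: [255.0, 50.0]},
                          index=["Western", "Action"])

cleaned_demo = table_demo.copy()             # the original table stays untouched
cleaned_demo.loc["Western", 2013] = np.nan   # missing, rather than "zero budget"
print(cleaned_demo)
```

Selecting the cell by genre and year also avoids matching on an exact floating-point value.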

**How has the relationship between revenue and genres evolved?**

In [None]:
fig, ax = plt.subplots(figsize=(9,9))
cmap = sns.cubehelix_palette(start = 1.5, rot = 1.5, as_cmap = True)
sns.heatmap(revenue_genre.iloc[:,0:30], xticklabels=3, cmap=cmap, linewidths=0.05)

This heatmap shows the average revenue per genre from 1988 to 2017. The clearest increase in average revenue is in the genres Fantasy, Adventure, Family, Action and Science Fiction. Interestingly, the cell for the genre Animation in 1994 is colored black. This is surprising because no other cell in the graph is black and revenues in 1994 are in general lower than those of movies released in later years. A reason for this could be that there are only a few Animation movies in 1994 and that those movies did extremely well. The previous heatmap does not show an above-average budget for Animation movies in 1994.

We would also like to analyze revenue when removing the outliers. The values for both 1992 and 1994 are removed from the dataframe. We want to answer the following question:

**How are the genres and revenue related over the years if we remove the outliers?**

In [None]:
temp2 = revenue_genre.copy()  # work on a copy so that revenue_genre keeps the original values
temp2[1994] = temp2[1994].replace(788241776.0, 0)
temp2[1992] = temp2[1992].replace(504050219.0, 0)

fig, ax = plt.subplots(figsize=(9,9))
cmap = sns.cubehelix_palette(start = 1.5, rot = 1.5, as_cmap = True)
sns.heatmap(temp2.iloc[:,0:30], xticklabels=3, cmap=cmap, linewidths=0.05)

We now get a much clearer idea of how the revenues per genre have evolved over the years. We once again see the same genres having the highest revenue, and we also see that revenue has greatly increased in recent years.

**How has the relationship between profit and genres evolved?**

In [None]:
fig, ax = plt.subplots(figsize=(9,9))
cmap = sns.cubehelix_palette(start = 1.5, rot = 1.5, as_cmap = True)
sns.heatmap(profit_genre, xticklabels=3, cmap=cmap, linewidths=0.05)

**How has the relationship between vote average and genres evolved?**

In [None]:
fig, ax = plt.subplots(figsize=(9,9))
cmap = sns.cubehelix_palette(start = 1.5, rot = 1.5, as_cmap = True)
sns.heatmap(vote_avg_genre.iloc[:,0:30], xticklabels=3, cmap=cmap, linewidths=0.05)

This heatmap is much darker than the revenue and budget heatmaps, which suggests that the values are relatively higher than in the other two categories. Most of the genres seem to score somewhere around a 6 out of 10. Especially notable is the fact that there are very few green or orange cells, which should mean that most movies are on average just a nice watch.

In [None]:
from datetime import datetime

# Empty frame that will hold one row per (movie, genre) pair
df_genre = pd.DataFrame(columns = ['genre', 'cgenres', 'budget', 'gross', 'year'])

colnames = ['budget', 'genres', 'revenue']
df_clean = df[colnames].copy()  # copy to avoid writing into a slice of df
df_clean['release_year'] = pd.to_datetime(df['release_date']).dt.year
df_clean = df_clean.dropna()
df_clean.head()

Lastly, we would like to look at connections between genres, we would like to answer the following question:

**How often are genres used in combination with each other?**

We can visualize this.

In [None]:
def genreRemap(row):
    # Build one row per genre of the movie; 'cgenres' holds the genres it co-occurs with
    d = {}
    genres = np.array(row['genres'].split('|'))
    n = genres.size
    d['budget'] = [row['budget']]*n
    d['gross'] = [row['revenue']]*n  # revenue is stored under the old column name 'gross'
    d['year'] = [row['release_year']]*n
    d['genre'], d['cgenres'] = [], []
    for genre in genres:
        d['genre'].append(genre)
        d['cgenres'].append(genres[genres != genre])
    return pd.DataFrame(d)

# DataFrame.append was removed in pandas 2.0, so the pieces are concatenated in one go
df_genre = pd.concat([genreRemap(row) for _, row in df_clean.iterrows()], ignore_index = True)
df_genre['year'] = df_genre['year'].astype(np.int16)
df_genre = df_genre[['genre', 'budget', 'gross', 'year', 'cgenres']]

In [None]:
####################
# make connections #
####################
d_genre = {}
def connect(row):
    global d_genre
    genre = row['genre']
    cgenres = row['cgenres']
    if genre not in d_genre:
        d_cgenres = dict(zip(cgenres, [1]*len(cgenres)))
        d_genre[genre] = d_cgenres
    else:
        for cgenre in cgenres:
            if cgenre not in d_genre[genre]:
                d_genre[genre][cgenre] = 1
            else:
                d_genre[genre][cgenre] += 1
                
df_genre.apply(connect, axis = 1)
l_genre = list(d_genre.keys())
l_genre.sort()
###########################
# find largest connection #
###########################
cmax = 0
for key in d_genre:
    for e in d_genre[key]:
        if d_genre[key][e] > cmax:
            cmax = d_genre[key][e]
#########################
# visualize connections #
#########################
from matplotlib.path import Path
import matplotlib.patches as patches
from matplotlib import cm
color = plt.get_cmap('rainbow')  # cm.get_cmap was removed in recent matplotlib versions
f, ax = plt.subplots(figsize = (7, 9))

codes = [Path.MOVETO, Path.CURVE4, Path.CURVE4, Path.CURVE4]

X, Y = 1, 1
wmin, wmax = 1, 32
amin, amax = 0.1, 0.25
getPy = lambda x: Y*(1 - x/len(l_genre))
for i, genre in enumerate(l_genre):
    yo = getPy(i)
    ax.text(0, yo, genre, ha = 'right')
    ax.text(X, yo, genre, ha = 'left')
    for cgenre in d_genre[genre]:
        yi = getPy(l_genre.index(cgenre))
        verts = [(0.0, yo), (X/4, yo), (2*X/4, yi), (X, yi)]
        path = Path(verts, codes)
        r, g, b, a = color(i/len(l_genre))
        width = wmin + wmax*d_genre[genre][cgenre]/cmax
        alpha = amin + amax*(1 - d_genre[genre][cgenre]/cmax)
        patch = patches.PathPatch(path, facecolor = 'none', edgecolor = (r, g, b), lw = width, alpha = alpha)
        ax.add_patch(patch)

ax.grid(False)
ax.set_xlim(0.0, X)
ax.set_ylim(0.0, Y + 1/len(l_genre))
ax.set_yticklabels([])
ax.set_xticklabels([])
plt.show()

This graph shows how often genres are used in combination with one another. A line connects each pair of genres that occur together in at least one movie; the more often two genres co-occur, the thicker the line.
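
The counts behind such a graph can be sketched with the standard library alone. The example below counts unordered genre pairs on hypothetical toy data (`genres_demo`, `pair_counts`); it illustrates the idea of co-occurrence counting and is not the code that produced the figure.

```python
from collections import Counter
from itertools import combinations

# Hypothetical pipe-separated genre strings, one per movie
genres_demo = ["Action|Adventure|Fantasy", "Action|Adventure", "Drama", "Drama|Action"]

pair_counts = Counter()
for entry in genres_demo:
    names = sorted(set(entry.split("|")))
    pair_counts.update(combinations(names, 2))  # every unordered pair within this movie

print(pair_counts.most_common(3))
```

Sorting the names first ensures that a pair is always stored in the same order, so `('Action', 'Drama')` and `('Drama', 'Action')` are counted together.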

### 3.3 Keywords

Now that we have some more insight into the different genres, let's take a look at the keywords. Are there keywords which influence a movie's rating in one way or another? What about the revenue? We use the new dataframe for this section as well.

In [None]:
df = df_new.copy()

In [None]:
liste_keywords = set()
for s in df['keywords'].str.split('|'):
    liste_keywords = set().union(s, liste_keywords)
liste_keywords = list(liste_keywords)
liste_keywords.remove('')

The first question we want to answer is as follows:

**Which keywords occur the most in the dataset?**

We use the following function to count them.

In [None]:
def count_word(df, ref_col, liste):
    keyword_count = dict()
    for s in liste: keyword_count[s] = 0
    for liste_keywords in df[ref_col].str.split('|'):        
        if type(liste_keywords) == float and pd.isnull(liste_keywords): continue        
        for s in [s for s in liste_keywords if s in liste]: 
            if pd.notnull(s): keyword_count[s] += 1
    #______________________________________________________________________
    # convert the dictionary in a list to sort the keywords by frequency
    keyword_occurences = []
    for k,v in keyword_count.items():
        keyword_occurences.append([k,v])
    keyword_occurences.sort(key = lambda x:x[1], reverse = True)
    return keyword_occurences, keyword_count

keyword_occurences, dum = count_word(df, 'keywords', liste_keywords)
keyword_occurences[:5]

We now have a sorted array that contains all the keywords and how often they occur.
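
The same kind of frequency list can be sketched with `collections.Counter`, which counts and sorts in a few lines. The keyword strings below are hypothetical toy data and the names `keywords_demo` and `keyword_counter` are illustrative.

```python
from collections import Counter

# Hypothetical pipe-separated keyword strings, one per movie
keywords_demo = ["woman director|independent film",
                 "independent film",
                 "",
                 "based on novel|woman director|independent film"]

keyword_counter = Counter(word
                          for entry in keywords_demo
                          for word in entry.split("|")
                          if word)  # skip the empty string of movies without keywords
print(keyword_counter.most_common(2))
```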

Before we can further analyse the keywords, we need to clean the data a bit.

Different movies are of course tagged with different keywords. A problem is that many of those keywords mean the same thing but appear in a different form, for example as singular and plural. The function below inventories the keywords using nltk's Porter stemmer: it identifies the root (stem) of each keyword and groups the keywords by root. We can then replace all keywords that share a root with one representative, the shortest keyword in the group. In this way, similar words that are phrased differently are mapped to a common form.

When executed, the function also prints the number of distinct keywords after grouping (one per root), 9474 in our case.
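
The grouping idea can be sketched without nltk. In the example below, `naive_stem` is a deliberately crude, hypothetical stand-in for a real stemmer such as the Porter stemmer, and `group_by_root` is an illustrative name; the point is only to show how variants are grouped by root and represented by their shortest member.

```python
def naive_stem(word):
    """Crude stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def group_by_root(words, stem=naive_stem):
    """Map each root to the shortest keyword that shares it."""
    roots = {}
    for word in words:
        roots.setdefault(stem(word.lower()), set()).add(word.lower())
    # represent each group by its shortest member
    return {root: min(variants, key=len) for root, variants in roots.items()}

print(group_by_root(["alien", "aliens", "Zombie", "zombies", "fight", "fighting"]))
```

Passing the stemmer as a parameter makes it easy to swap the toy function for a real one.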

In [None]:
#We collect all the keywords:
def keywords_inventory(dataframe, colonne = 'keywords'):
    PS = nltk.stem.PorterStemmer()
    keywords_roots  = dict()  # collect the words / root
    keywords_select = dict()  # association: root <-> keyword
    category_keys = []
    icount = 0
    for s in dataframe[colonne]:
        if pd.isnull(s): continue
        for t in s.split('|'):
            t = t.lower() ; racine = PS.stem(t)
            if racine in keywords_roots:                
                keywords_roots[racine].add(t)
            else:
                keywords_roots[racine] = {t}
    
    for s in keywords_roots.keys():
        if len(keywords_roots[s]) > 1:  
            min_length = 1000
            for k in keywords_roots[s]:
                if len(k) < min_length:
                    clef = k ; min_length = len(k)            
            category_keys.append(clef)
            keywords_select[s] = clef
        else:
            category_keys.append(list(keywords_roots[s])[0])
            keywords_select[s] = list(keywords_roots[s])[0]
                   
    print("Nb of keywords in variable '{}': {}".format(colonne,len(category_keys)))
    return category_keys, keywords_roots, keywords_select

keywords, keywords_roots, keywords_select = keywords_inventory(df, colonne = 'keywords')

The cell below displays a sample of 14 groups of keywords that share a root.

In [None]:
# Plot of a sample of keywords that appear in close varieties 
#------------------------------------------------------------
icount = 0
for s in keywords_roots.keys():
    if len(keywords_roots[s]) > 1: 
        icount += 1
        if icount < 15: print(icount, keywords_roots[s], len(keywords_roots[s]))

The function below replaces the different forms of a keyword by the representative chosen for their root. Then, we store the cleaned keywords in a new dataframe.

In [None]:
def remplacement_df_keywords(df, dico_remplacement, roots = False):
    df_new = df.copy(deep = True)
    for index, row in df_new.iterrows():
        chaine = row['keywords']
        if pd.isnull(chaine): continue
        nouvelle_liste = []
        for s in chaine.split('|'): 
            clef = PS.stem(s) if roots else s
            if clef in dico_remplacement.keys():
                nouvelle_liste.append(dico_remplacement[clef])
            else:
                nouvelle_liste.append(s)       
        df_new.at[index, 'keywords'] = '|'.join(nouvelle_liste)  # set_value was removed from pandas
    return df_new

df_keywords_cleaned = remplacement_df_keywords(df, keywords_select,roots = True)
df_keywords_cleaned.head()

Next, we will use the nltk package to get rid of synonyms. The function `get_synonymes` below takes a word as a parameter and returns all noun synonyms of that word according to nltk's WordNet corpus. Then we take inventory of all the different keywords and check what percentage of the keywords is affected by the replacements. Lastly, we show a few examples of replacements.

In [None]:
def get_synonymes(word):
    lemma = set()
    for ss in wordnet.synsets(word):
        for w in ss.lemma_names():
            #_______________________________
            # We just get the 'nouns':
            index = ss.name().find('.')+1
            if ss.name()[index] == 'n': lemma.add(w.lower().replace('_',' '))
    return lemma

def test_keyword(mot, key_count, threshold):
    return (False , True)[key_count.get(mot, 0) >= threshold]

keyword_occurences.sort(key = lambda x:x[1], reverse = False)
key_count = dict()
for s in keyword_occurences:
    key_count[s[0]] = s[1]
#__________________________________________________________________________
# Creation of a dictionary to replace keywords by higher frequency keywords
remplacement_mot = dict()
icount = 0
for index, [mot, nb_apparitions] in enumerate(keyword_occurences):
    if nb_apparitions > 5: continue  # only the keywords that appear at most 5 times
    lemma = get_synonymes(mot)
    if len(lemma) == 0: continue     # case of the plurals
    #_________________________________________________________________
    liste_mots = [(s, key_count[s]) for s in lemma 
                  if test_keyword(s, key_count, key_count[mot])]
    liste_mots.sort(key = lambda x:(x[1],x[0]), reverse = True)    
    if len(liste_mots) <= 1: continue       # no replacement
    if mot == liste_mots[0][0]: continue    # replacement by itself
    icount += 1
    if  icount < 8:
        print('{:<12} -> {:<12} (init: {})'.format(mot, liste_mots[0][0], liste_mots))    
    remplacement_mot[mot] = liste_mots[0][0]

print(90*'_'+'\n'+'The replacement concerns {}% of the keywords.'
      .format(round(len(remplacement_mot)/len(keywords)*100,2)))

# 2 successive replacements
#---------------------------
print('Keywords that appear both in keys and values:'.upper()+'\n'+45*'-')
icount = 0
for s in remplacement_mot.values():
    if s in remplacement_mot.keys():
        icount += 1
        if icount < 10: print('{:<20} -> {:<20}'.format(s, remplacement_mot[s]))

for key, value in remplacement_mot.items():
    if value in remplacement_mot.keys():
        remplacement_mot[key] = remplacement_mot[value]
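
The last loop above resolves replacements whose target is itself replaced. It makes a single pass, so chains longer than two steps may only be partly resolved. As a sketch under that observation, the hypothetical helper `resolve_chains` below follows every chain to its end and guards against cycles; it is an illustration, not part of the original pipeline.

```python
def resolve_chains(mapping):
    # Follow each replacement until it reaches a word that is not itself replaced.
    resolved = {}
    for key, target in mapping.items():
        seen = {key}  # words already visited, to stop on cycles
        while target in mapping and target not in seen:
            seen.add(target)
            target = mapping[target]
        resolved[key] = target
    return resolved

example = {'a': 'b', 'b': 'c', 'c': 'd', 'x': 'y'}
print(resolve_chains(example))
```

In the example, 'a', 'b' and 'c' all end up mapped to 'd', while 'x' keeps its direct replacement 'y'.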

We now replace all synonyms by the main keyword, after which we show the unique number of keywords once again.

In [None]:
# replacement of keyword varieties by the main keyword
#----------------------------------------------------------
df_keywords_synonyms = remplacement_df_keywords(df_keywords_cleaned, remplacement_mot, roots = False)   
keywords, keywords_roots, keywords_select = keywords_inventory(df_keywords_synonyms, colonne = 'keywords')

# New count of keyword occurences
#-------------------------------------
keywords.remove('')
new_keyword_occurences, keywords_count = count_word(df_keywords_synonyms,
                                                    'keywords',keywords)

Lastly, we delete all keywords that occur fewer than 4 times (the code keeps a keyword only if its count is greater than 3). Note how the number of unique keywords has decreased significantly.

In [None]:
# deletion of keywords with low frequencies
#-------------------------------------------
def remplacement_df_low_frequency_keywords(df, keyword_occurences):
    df_new = df.copy(deep = True)
    key_count = dict()
    for s in keyword_occurences: 
        key_count[s[0]] = s[1]    
    for index, row in df_new.iterrows():
        chaine = row['keywords']
        if pd.isnull(chaine): continue
        nouvelle_liste = []
        for s in chaine.split('|'): 
            if key_count.get(s, 4) > 3: nouvelle_liste.append(s)
        df_new.at[index, 'keywords'] = '|'.join(nouvelle_liste)  # set_value was removed from pandas
    return df_new

# Creation of a dataframe where keywords of low frequencies are suppressed
#-------------------------------------------------------------------------
df_keywords_occurence = remplacement_df_low_frequency_keywords(df_keywords_synonyms, new_keyword_occurences)
keywords, keywords_roots, keywords_select = keywords_inventory(df_keywords_occurence, colonne = 'keywords')   
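
To make the low-frequency filter concrete, here is a minimal sketch on made-up data. The dataframe `toy`, the helper `drop_rare` and the threshold of 2 are assumptions chosen for illustration; the notebook itself keeps keywords with a count greater than 3.

```python
import pandas as pd
from collections import Counter

toy = pd.DataFrame({'keywords': ['alien|robot', 'alien|hero', 'alien|robot|rare', None]})

# Count how often each keyword occurs over all movies.
counts = Counter(kw for cell in toy['keywords'].dropna() for kw in cell.split('|'))

def drop_rare(cell, min_count=2):
    # Leave missing values untouched; otherwise keep only frequent keywords.
    if pd.isnull(cell):
        return cell
    return '|'.join(kw for kw in cell.split('|') if counts[kw] >= min_count)

toy['keywords'] = toy['keywords'].apply(drop_rare)
print(toy)
```

The keywords 'hero' and 'rare' occur only once and are removed, while 'alien' and 'robot' are kept.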

Now that we have cleaned the data, we can start analysing. Let's try to analyse the influence of keywords on movie ratings and revenue. We apply the same methods as we did in the genre analysis.

In [None]:
df_keywords= df_keywords_occurence
keyword_list = set()
for s in df_keywords['keywords'].str.split('|'):
    keyword_list = set().union(s, keyword_list)
keyword_list = list(keyword_list)
keyword_list.remove('')

df_reduced = df_keywords[['title','vote_average','release_date','runtime','budget','revenue']].reset_index(drop=True)

for keyword in keyword_list:
    # use the cleaned keywords (df_keywords) and match the keyword literally
    df_reduced[keyword] = df_keywords['keywords'].str.contains(keyword, regex=False).apply(lambda x:1 if x else 0)

df_reduced.head()
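
One thing to be aware of is that `str.contains` matches substrings, so a short keyword also flags every longer keyword that contains it. The sketch below, on made-up data, contrasts this with `Series.str.get_dummies`, which splits on the separator and matches whole keywords only. It is offered as a possible alternative, not as the method used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({'title': ['A', 'B', 'C'],
                    'keywords': ['war|hero', 'world war ii', 'hero']})

# Substring matching: 'war' is also found inside 'world war ii'.
substring_flag = toy['keywords'].str.contains('war', regex=False).astype(int)

# Exact matching: split on '|' and create one indicator column per keyword.
dummies = toy['keywords'].str.get_dummies(sep='|')

print(substring_flag.tolist())
print(dummies['war'].tolist())
```

With substring matching, movie B is counted as a 'war' movie even though its only keyword is 'world war ii'; with the indicator columns it is not.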

In [None]:
mean_per_keyword = pd.DataFrame(keyword_list)

# Mean vote average of the movies with (1) and without (0) each keyword
newArray1 = []
for keyword in keyword_list:
    newArray1.append(df_reduced.groupby(keyword, as_index=True)['vote_average'].mean())
    
# Mean budget
newArray2 = []
for keyword in keyword_list:
    newArray2.append(df_reduced.groupby(keyword, as_index=True)['budget'].mean())
    
# Mean revenue
newArray3 = []
for keyword in keyword_list:
    newArray3.append(df_reduced.groupby(keyword, as_index=True)['revenue'].mean())

mean_per_keyword['mean_vote_average']=list(pd.DataFrame(newArray1)[1])
mean_per_keyword['mean_budget']=list(pd.DataFrame(newArray2)[1])
mean_per_keyword['mean_revenue']=list(pd.DataFrame(newArray3)[1])
mean_per_keyword['profit']= (mean_per_keyword['mean_revenue'] - mean_per_keyword['mean_budget'])
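
The same per-keyword averages can also be obtained without one indicator column per keyword. The sketch below, on made-up numbers, turns the dataframe into one row per (movie, keyword) pair with `DataFrame.explode` and then groups by keyword. The names `toy`, `long_df` and `mean_per_kw` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'keywords': ['alien|robot', 'alien', 'robot|hero'],
    'vote_average': [6.0, 8.0, 7.0],
    'budget': [100, 50, 30],
    'revenue': [300, 60, 90],
})

# One row per (movie, keyword) pair.
long_df = toy.assign(keyword=toy['keywords'].str.split('|')).explode('keyword')

# Mean of each measure over the movies that carry the keyword.
mean_per_kw = long_df.groupby('keyword')[['vote_average', 'budget', 'revenue']].mean()
mean_per_kw['profit'] = mean_per_kw['revenue'] - mean_per_kw['budget']
print(mean_per_kw)
```

As in the cell above, 'profit' here is the difference between mean revenue and mean budget per keyword, not the mean of per-movie profits (although the two coincide when both means are taken over the same movies).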

The first question we want to answer is:

**Which keywords score best in the categories mean vote average, mean budget, mean revenue and profit?**

In [None]:
mean_per_keyword.sort_values('mean_vote_average', ascending=False).head()

In [None]:
mean_per_keyword.sort_values('mean_budget', ascending=False).head()

In [None]:
mean_per_keyword.sort_values('mean_revenue', ascending=False).head()

In [None]:
mean_per_keyword.sort_values('profit', ascending=False).head()

We want to create the tables above again, but this time for only the 50 most occurring keywords. Of course it's cool to see the keywords 'hobbit' and 'school of witchcraft' as high-revenue keywords, but they probably don't occur outside of the Lord of the Rings and Harry Potter movies, so they're actually not that interesting. We start by answering the following question:

**What are the most occurring keywords?**

In [None]:
fig = plt.figure(1, figsize=(18,13))
trunc_occurences = new_keyword_occurences[0:50]
# LOWER PANEL: HISTOGRAMS
ax2 = fig.add_subplot(2,1,2)
y_axis = [i[1] for i in trunc_occurences]
x_axis = [k for k,i in enumerate(trunc_occurences)]
x_label = [i[0] for i in trunc_occurences]
plt.xticks(rotation=85, fontsize = 15)
plt.yticks(fontsize = 15)
plt.xticks(x_axis, x_label)
plt.ylabel("Nb. of occurences", fontsize = 18, labelpad = 10)
ax2.bar(x_axis, y_axis, align = 'center', color='g')
#_______________________
plt.title("Keywords popularity",bbox={'facecolor':'k', 'pad':5},color='w',fontsize = 25)
plt.show()

We now create a dataframe with average movie scores, budget and revenue for the most occurring keywords.

In [None]:
Df1 = pd.DataFrame(trunc_occurences)
Df2 = mean_per_keyword
result = Df1.merge(Df2, left_on=0, right_on=0, how='inner')

result = result.rename(columns ={0:'keyword', 1:'occurences'})

result.sort_values('mean_vote_average', ascending= False)

Let's try to visualize this table a bit more clearly. We want to answer the following 2 questions:

**What is the distribution of average votes among the 50 most occurring keywords?**

**What is the distribution of budget among the 50 most occurring keywords?**

In [None]:
import matplotlib.pyplot as plt

ax = result.plot.bar(x = 'keyword', y='mean_vote_average', title="mean vote average",
                     figsize=(15,4), legend=True, fontsize=12, color='green', label = "mean vote average")
ax.set_ylim(5, 8)
ax.axhline(y=result['mean_vote_average'].mean(),c="blue",linewidth=0.5, label='mean')
ax.legend()
plt.show()

ax = result.plot.bar(x = 'keyword', y='mean_budget', title="mean budget",
                     figsize=(15,4), legend=True, fontsize=12, color='blue', label="mean budget")
ax.axhline(y=result['mean_budget'].mean(),c="blue",linewidth=0.5, label='mean')
ax.legend()
plt.show()

Now this is interesting. The keyword with by far the highest average rating - Serial Killer - also has by far the lowest average budget. Also note that superhero movies score below average, but have a highly above average budget. Now we answer the following question:

**What is the distribution of revenue among the 50 most occurring keywords?**

In [None]:
ax = result.plot.bar(x = 'keyword', y='mean_revenue', title="mean revenue",
                     figsize=(15,4), legend=True, fontsize=12, color='yellow', label="mean revenue")
ax.axhline(y=result['mean_revenue'].mean(),c="blue",linewidth=0.5, label='mean')
ax.legend()
plt.show()

So superhero movies do have a high revenue and serial killer movies do not. Next we answer:

**What is the distribution of profit among the 50 most occurring keywords?**

In [None]:
result['profit'] = result['mean_revenue'] - result['mean_budget']

ax = result.plot.bar(x = 'keyword', y='profit', title="profit",
                     figsize=(15,4), legend=True, fontsize=12, color='red', label="profit")
ax.axhline(y=result['profit'].mean(),c="blue",linewidth=0.5, label='mean')
ax.legend()
plt.show()

What's notable is that, although the keywords for female directors and independent movies occur very frequently, these movies do not score highly on either revenue or vote_average.

### 3.4 Actors

Section 3.4 is about the actors of the movies. We analyse which actors played in movies with the highest revenue, the highest IMDB score and the highest budgets. We also made a graph that shows the average budget and average revenue per actor. Lastly, we made a clear overview of all the movies Morgan Freeman played in and what their IMDB scores were. The main purpose of this part is to gain more insight into the data and to answer the following question:

**Which actors played in movies with the highest revenue/highest budget/ highest IMDB score?**

For this section, we use the old version of the dataframe.

In [None]:
df = df_old

Next, we delete all the columns we won't be needing for this analysis.

In [None]:
columns = ['homepage', 'plot_keywords', 'language', 'overview', 'popularity', 'tagline',
           'original_title', 'num_voted_users', 'country', 'spoken_languages', 'duration',
          'production_companies', 'production_countries', 'status']

df = df.drop(columns, axis=1)

We are interested in the same descriptives for the actors, as we were for keywords and the genres. To do that, we first have to, once again, restructure the dataframe.

We first create a separate dataframe for each of the three actor columns (actor_1_name, actor_2_name and actor_3_name), after which we combine them into one dataframe that contains all three.

In [None]:
liste_genres = set()
for s in df['genres'].str.split('|'):
    liste_genres = set().union(s, liste_genres)
liste_genres = list(liste_genres)
liste_genres.remove('')

df_reduced = df[['actor_1_name', 'vote_average',
                 'title_year', 'movie_title', 'gross', 'budget']].reset_index(drop = True)
for genre in liste_genres:
    df_reduced[genre] = df['genres'].str.contains(genre).apply(lambda x:1 if x else 0)

df_reduced2 = df[['actor_2_name', 'vote_average',
                 'title_year', 'movie_title', 'gross', 'budget']].reset_index(drop = True)
for genre in liste_genres:
    df_reduced2[genre] = df['genres'].str.contains(genre).apply(lambda x:1 if x else 0)

df_reduced3 = df[['actor_3_name', 'vote_average',
                 'title_year', 'movie_title', 'gross', 'budget']].reset_index(drop = True)
for genre in liste_genres:
    df_reduced3[genre] = df['genres'].str.contains(genre).apply(lambda x:1 if x else 0)

In [None]:
df_reduced = df_reduced.rename(columns={'actor_1_name': 'actor'})
df_reduced2 = df_reduced2.rename(columns={'actor_2_name': 'actor'})
df_reduced3 = df_reduced3.rename(columns={'actor_3_name': 'actor'})

total = [df_reduced, df_reduced2, df_reduced3]
df_total = pd.concat(total)
df_total.head()

We compute per-actor averages of the numeric columns (vote_average, title_year, gross and budget). We also determine each actor's favored genre, i.e. the genre in which the actor appears most often.

In [None]:
df_actors = df_total.groupby('actor').mean(numeric_only=True)  # skip text columns such as movie_title
df_actors.loc[:, 'favored_genre'] = df_actors[liste_genres].idxmax(axis = 1)
df_actors.drop(liste_genres, axis = 1, inplace = True)
df_actors = df_actors.reset_index()

We expect the dataframe to contain a lot of actors that have only a single observation. Such observations are likely to cause outliers if they are very extreme. We delete all actors that are linked to fewer than 10 movies in our dataframe.

In [None]:
df_appearance = df_total[['actor', 'title_year']].groupby('actor').count()
df_appearance = df_appearance.reset_index(drop = True)
selection = df_appearance['title_year'] > 9
selection = selection.reset_index(drop = True)
most_prolific = df_actors[selection]
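
The selection above works because `df_appearance` and `df_actors` are both sorted by actor name, so their rows line up position by position. A variant that does not depend on row order is sketched below on made-up data: the counts and the averages are both indexed by actor, so the boolean mask is aligned on names. The threshold of 2 is a toy value.

```python
import pandas as pd

toy = pd.DataFrame({
    'actor': ['Ann', 'Ann', 'Ann', 'Bob', 'Cy', 'Cy'],
    'vote_average': [7.0, 8.0, 6.0, 9.0, 5.0, 7.0],
})

# Number of movies per actor, indexed by actor name.
n_movies = toy.groupby('actor').size()

# Per-actor averages, also indexed by actor name.
actor_means = toy.groupby('actor')[['vote_average']].mean()

# Boolean mask aligned on the actor index.
prolific = actor_means[n_movies >= 2].reset_index()
print(prolific)
```

Bob, who has a single (and very high) observation, is dropped, which is exactly the kind of outlier the filter is meant to remove.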

Now that we have a clear dataframe, let us answer the following three questions:

**Which actors have the highest average score?**

**Which actors have the highest average revenue?**

**Which actors have the highest average budget?**

In [None]:
most_prolific.sort_values('vote_average', ascending=False).head()

In [None]:
most_prolific.sort_values('gross', ascending=False).head()

In [None]:
most_prolific.sort_values('budget', ascending=False).head()

Looks like Sir Ian McKellen has had quite a career. He comes out on top on all three of our attributes. He plays in the movies with the highest budgets, and these movies return the highest average revenues. In his case the enormous budgets also go together with well-received movies: he has the highest average score on IMDB.

We can now develop several plots to visually answer the questions. Let us start by plotting the average revenue and budget per actor.

In [None]:
genre_count = []
for genre in liste_genres:
    genre_count.append([genre, df_reduced[genre].values.sum()])
genre_count.sort(key = lambda x:x[1], reverse = True)
labels, sizes = zip(*genre_count)
labels_selected = [n if v > sum(sizes) * 0.01 else '' for n, v in genre_count]
reduced_genre_list = labels[:19]
trace=[]
for genre in reduced_genre_list:
    trace.append({'type':'scatter',
                  'mode':'markers',
                  'y':most_prolific.loc[most_prolific['favored_genre']==genre,'gross'],
                  'x':most_prolific.loc[most_prolific['favored_genre']==genre,'budget'],
                  'name':genre,
                  'text': most_prolific.loc[most_prolific['favored_genre']==genre,'actor'],
                  'marker':{'size':10,'opacity':0.7,
                            'line':{'width':1.25,'color':'black'}}})
layout={'title':'Actors favored genres',
       'xaxis':{'title':'mean budget'},
       'yaxis':{'title':'mean revenue'}}
fig=Figure(data=trace,layout=layout)
pyo.iplot(fig)

We can also use this data to highlight single actors. Let us take a look at actors for whom we have data on more than 20 movies.

In [None]:
selection = df_appearance['title_year'] > 20
most_prolific = df_actors[selection]
most_prolific

So let's have a look at Morgan Freeman. We would like to have a clear overview of all the movies he played in and what his movies scored on IMDB. We can do this using a polar chart. We want to answer the following question:

**What is the distribution of average scores made by Morgan Freeman's movies?**

In [None]:
class Trace():
    #____________________
    def __init__(self, color):
        self.mode = 'markers'
        self.name = 'default'
        self.title = 'default title'
        self.marker = dict(color=color, size=110,
                           line=dict(color='white'), opacity=0.7)
        self.r = []
        self.t = []
    #______________________________
    def set_color(self, color):
        self.marker = dict(color = color, size=110,
                           line=dict(color='white'), opacity=0.7)
    #____________________________
    def set_name(self, name):
        self.name = name
    #____________________________
    def set_title(self, title):
        self.title = title
    #___________________________
    def set_actor(self, actor):
        self.actor = actor
    
    #__________________________
    def set_values(self, r, t):
        self.r = np.array(r)
        self.t = np.array(t)

In [None]:
names =['Morgan Freeman']
df2 = df_total[df_total['actor'] == 'Morgan Freeman']  # df_total covers all three actor columns
total_count  = 0
years = []
imdb_score = []
genre = []
titles = []
actor = []
for s in liste_genres:
    icount = df2[s].sum()
    #__________________________________________________________________
    # Here, we set the limit to 3 because of a bug in plotly's package
    if icount > 3: 
        total_count += 1
        genre.append(s)
        actor.append(list(df2[df2[s] ==1 ]['actor']))
        years.append(list(df2[df2[s] == 1]['title_year']))
        imdb_score.append(list(df2[df2[s] == 1]['vote_average'])) 
        titles.append(list(df2[df2[s] == 1]['movie_title']))
max_y = max([max(s) for s in years])
min_y = min([min(s) for s in years])
year_range = max_y - min_y

years_normed = []
for i in range(total_count):
    years_normed.append( [360/total_count*((an-min_y)/year_range+i) for an in years[i]])
    
color = ('royalblue', 'grey', 'wheat', 'c', 'firebrick', 'seagreen', 'lightskyblue',
          'lightcoral', 'yellowgreen', 'gold', 'tomato', 'violet', 'aquamarine', 'chartreuse', 'red')

trace = [Trace(color[i]) for i in range(total_count)]
tr    = []
for i in range(total_count):
    trace[i].set_name(genre[i])
    trace[i].set_title(titles[i])
    trace[i].set_values(np.array(imdb_score[i]),
                        np.array(years_normed[i]))
    tr.append(go.Scatter(r      = trace[i].r,
                         t      = trace[i].t,
                         mode   = trace[i].mode,
                         name   = trace[i].name,
                         marker = trace[i].marker,
#                         text   = ['default title' for j in range(len(trace[i].r))], 
                         hoverinfo = 'all'
                        ))        
layout = go.Layout(
    title='Morgan Freeman movies',
    font=dict(
        size=15
    ),
    plot_bgcolor='rgb(223, 223, 223)',
    angularaxis=dict(        
        tickcolor='rgb(253,253,253)'
    ),
    hovermode='Closest',
)
fig = go.Figure(data = tr, layout=layout)
pyo.iplot(fig)

Unfortunately, plotly doesn't allow us to put hover text on the different nodes. This means we can't add the movie name and year of release to the nodes.
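
The angular positions in the polar chart come from the expression `360/total_count*((an-min_y)/year_range+i)`: every genre `i` gets its own sector of width `360/total_count` degrees, and within that sector the movies are placed linearly by release year. The sketch below reproduces this mapping on made-up years; `sector_angles` is an illustrative name, and it assumes that not all years are identical (otherwise `year_range` would be zero).

```python
def sector_angles(years_per_genre):
    # years_per_genre: one list of release years per genre (toy input).
    all_years = [y for years in years_per_genre for y in years]
    min_y, max_y = min(all_years), max(all_years)
    year_range = max_y - min_y
    sector = 360 / len(years_per_genre)  # angular width reserved for each genre
    return [[sector * ((y - min_y) / year_range + i) for y in years]
            for i, years in enumerate(years_per_genre)]

angles = sector_angles([[1990, 2000], [1995, 2010]])
print(angles)
```

With two genres, the first genre occupies the angles from 0 to 180 degrees and the second from 180 to 360 degrees; the earliest year sits at the start of a sector and the latest year at its end.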

### 3.5 Directors

Directors are known to have an important influence on the quality of a movie. This part analyses the IMDB score per director. In this analysis we try to discover whether the IMDB scores of movies from high-scoring directors are influenced by the budget. This brings us to the following questions:

We ask these questions twice: first for directors with at least 4 movies, and then for a smaller group of more prolific directors:

   **Which directors have the highest revenue per movie?**
   
   **Which directors have the highest budget per movie?**
   
   **Do directors with high budgets also have high revenues?**

We start by retrieving the copy of the database we created earlier. After that, we analyse the directors by computing their average revenue per movie and their total revenue. We only take into account directors for whom we have at least 4 movies as observations, to exclude extreme outliers. Not surprisingly, the top rated directors are probably directors you have heard about.

In [None]:
df = df_old

In [None]:
def create_comparison_database(name, value, x, no_films):
    
    grouped = df.groupby(name, as_index=False)
    
    # Aggregate the numeric columns per director (or actor)
    if x == 'mean':
        comparison_df = grouped.mean(numeric_only=True)
    elif x == 'median':
        comparison_df = grouped.median(numeric_only=True)
    elif x == 'sum':
        comparison_df = grouped.sum(numeric_only=True)
    else:
        raise ValueError("x must be 'mean', 'median' or 'sum'")
    
    # Create database with either name of directors or actors, the value being compared i.e. 'revenue',
    # and number of films they're listed with. Then sort by value being compared.
    name_count_key = df[name].value_counts().to_dict()
    comparison_df['films'] = comparison_df[name].map(name_count_key)
    comparison_df.sort_values(value, ascending=False, inplace=True)
    comparison_df[name] = comparison_df[name].map(str) + " (" + comparison_df['films'].astype(str) + ")"
    # Keep the top 10 names with at least `no_films` films, reversed so that the highest value is
    # drawn at the top of a horizontal bar chart, and return them as a Series indexed by name
    comp_series = comparison_df[comparison_df['films'] >= no_films][[name, value]].head(10)[::-1].set_index(name).iloc[:, 0]
    
    return comp_series
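
The core of the function above is a group-and-filter step: aggregate a value per director, count the films per director, and keep only directors with enough films. A compact sketch of that step on made-up data is shown below, using pandas named aggregation. The column names `total_gross`, `mean_gross` and `films`, and the threshold `min_films`, are assumptions made for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    'director_name': ['A', 'A', 'A', 'B', 'B', 'C'],
    'gross': [10, 20, 30, 100, 200, 500],
})

# Aggregate revenue and count films per director in one pass.
summary = toy.groupby('director_name')['gross'].agg(
    total_gross='sum', mean_gross='mean', films='count')

# Keep directors with at least `min_films` films, then rank by mean revenue.
min_films = 2  # toy threshold
top = summary[summary['films'] >= min_films].sort_values('mean_gross', ascending=False)
print(top)
```

Director C has the highest revenue but only one film, so C is excluded from the ranking; this is the effect the `no_films` argument has in the charts below.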

Two questions we want to answer are:

**Which directors generated the most total revenue, given that they directed at least 4 movies?**

**Which directors generated the most average revenue, given that they directed at least 4 movies?**

A bar chart gives a good overview of the directors with 4+ movies and their revenue.

In [None]:
fig = plt.figure(figsize=(18,6))

# Director_name
plt.subplot2grid((2,3),(0,0), rowspan = 2)
create_comparison_database('director_name','gross','sum', 4).plot(kind='barh', color='#006600')
plt.legend().set_visible(False)
plt.title("Total revenue for Directors with 4+ Films")
plt.ylabel("Director (no. films)")
plt.xlabel("Revenue")

plt.subplot2grid((2,3),(0,1), rowspan = 2)
create_comparison_database('director_name','gross','mean', 4).plot(kind='barh', color='#ffff00')
plt.legend().set_visible(False)
plt.title('Average revenue for Directors with 4+ Films')
plt.ylabel("Director (no. films)")
plt.xlabel("Revenue")

plt.tight_layout()

It is not very remarkable that Steven Spielberg has gained the most revenue from the movies he has directed, since he directed the most movies. Moreover, he is well known as a good director. However, on average he is not in the top 10, probably because his revenue is spread over 27 movies. If we compare the two bar charts, we can conclude that Peter Jackson and James Cameron score well in both charts.

We again use bar charts to gain better insight into the directors and to answer the following questions:

**Which directors with at least 4 movies had access to the highest average budget?**

**Which directors with at least 4 movies have the highest average IMDB score?**

In [None]:
fig = plt.figure(figsize=(18,6))

# Director_name
plt.subplot2grid((2,3),(0,0), rowspan = 2)
create_comparison_database('director_name','budget','mean', 4).plot(kind='barh', color='#006600')
plt.legend().set_visible(False)
plt.title("Average budget for Directors with 4+ Films")
plt.ylabel("Director (no. films)")
plt.xlabel("Budget")

plt.subplot2grid((2,3),(0,1), rowspan = 2)
create_comparison_database('director_name','vote_average','mean', 4).plot(kind='barh', color='#ffff00')
plt.legend().set_visible(False)
plt.title('Mean IMDB Score for Directors with 4+ Films')
plt.ylabel("Director (no. films)")
plt.xlabel("IMDB Score")
plt.xlim(0,10)

plt.tight_layout()

Notice how many of the directors that have a very high average budget per movie were nowhere to be seen in the revenue plot. This implies that, although they make expensive movies, they don't make the highest-grossing movies. Also note that a lot of high scoring directors are not found in the top ten highest budgeted directors. This implies that a big budget doesn't necessarily lead to a good, or well-received, movie. On the other hand, it shows that some directors, for instance Hayao Miyazaki, are capable of creating excellent movies without needing a very high budget.

However, it is always possible that directors with few movies were lucky, because we took into account every director with 4+ movies. Therefore, for the next part we only look at directors with 10+ movies, and we would like to know:

**Are there any directors that are capable of consistently creating well-received movies, without the need for big budgets?**

To answer this question we plot the average budget next to the average score per director, for directors with at least 10 movies.

In [None]:
fig = plt.figure(figsize=(18,6))

# Director_name
plt.subplot2grid((2,3),(0,0), rowspan = 2)
create_comparison_database('director_name','budget','mean', 10).plot(kind='barh', color='#006600')
plt.legend().set_visible(False)
plt.title("Average budget for Directors with 10+ Films")
plt.ylabel("Director (no. films)")
plt.xlabel("Budget")

plt.subplot2grid((2,3),(0,1), rowspan = 2)
create_comparison_database('director_name','vote_average','mean', 10).plot(kind='barh', color='#ffff00')
plt.legend().set_visible(False)
plt.title('Mean IMDB Score for Directors with 10+ Films')
plt.ylabel("Director (no. films)")
plt.xlabel("IMDB Score")
plt.xlim(0,10)

plt.tight_layout()

Now, we easily see that the two bar plots have more directors in common. Still, there are some directors who manage to create excellent movies without the need for a big budget. A funny observation is Michael Bay. While he is easily the king of budget, he is nowhere to be found in the top ten highest scoring directors.

## 4. Predictives

In this section, we attempt to predict the success of a movie, expressed in its revenue and its average score on IMDB, using two independent variables: budget and runtime. We use two different methods: K-Nearest-Neighbours and regression. We aim to answer the following 2 questions:

**Can we predict a movie's revenue?**

**Can we predict a movie's IMDB score?**

### 4.1 K-Nearest-Neighbours

In this section we train a predictive model using the K-Nearest-Neighbours method. We start with predicting the average score of a movie. We will use the new format of the dataframe for this prediction. Since a KNN classifier needs discrete class labels rather than a continuous target, we start by binning the average scores into 100 equal-width classes, labelled 0 to 99. We also use Imputer once again to fill the missing values in the runtime variable.

In [None]:
df = df_new.copy()

df['vote_classes'] = pd.cut(df['vote_average'],100, labels=range(100))

my_imputer = Imputer()

X2 = my_imputer.fit_transform(df[['runtime']])
df['runtime'] = X2

col = ['budget', 'runtime']
X = df[col]
y = df['vote_classes']
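
To show what the binning with `pd.cut` does, here is a minimal sketch with four bins instead of 100, on made-up scores. Passing `retbins=True` additionally returns the bin edges, which makes it possible to map a predicted class back to an approximate score (the midpoint of its bin). The variable names are illustrative.

```python
import pandas as pd

scores = pd.Series([0.0, 2.5, 5.0, 7.5, 10.0])

# Four equal-width bins over the observed range, labelled 0..3.
# retbins=True also returns the bin edges, so a label can be mapped back.
classes, edges = pd.cut(scores, 4, labels=range(4), retbins=True)

# Approximate score represented by each class: the midpoint of its bin.
midpoints = (edges[:-1] + edges[1:]) / 2
print(classes.tolist())
print(midpoints)
```

The bins are closed on the right, so a score that lies exactly on an edge (2.5 in the example) falls into the lower class.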

Next, we train the KNN model.

In [None]:
# instantiate the model (with the default parameters)
knn_score = KNeighborsClassifier()

# fit the model with data (occurs in-place)
knn_score.fit(X, y)

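We follow the same process for revenue. Note that, because we again use a classifier, every distinct revenue value in the data is treated as a separate class.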

In [None]:
y = df['revenue']

# instantiate the model (with the default parameters)
knn_rev = KNeighborsClassifier()

# fit the model with data (occurs in-place)
knn_rev.fit(X, y)
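
For a continuous target such as revenue, scikit-learn also offers `KNeighborsRegressor`, which averages the target values of the nearest neighbours instead of voting on a class. The sketch below uses made-up numbers and is meant as a possible alternative, not as the model used in the rest of this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: [budget, runtime] -> revenue (made-up numbers).
X_toy = np.array([[10, 90], [20, 100], [30, 110], [40, 120]], dtype=float)
y_toy = np.array([15.0, 35.0, 50.0, 90.0])

# A regressor averages the targets of the k nearest neighbours
# instead of voting on a class label.
knn_reg = KNeighborsRegressor(n_neighbors=2)
knn_reg.fit(X_toy, y_toy)

pred = knn_reg.predict([[22, 101]])
print(pred)
```

The two nearest neighbours of the query point are the second and third rows, so the prediction is the average of their revenues. On the real data, budget (in dollars) is many orders of magnitude larger than runtime (in minutes), so the distances are dominated by budget unless the features are scaled first.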

### 4.2 Regression

In this section we will use two different types of regressions to answer our two main questions: linear regression and random forest regression. We start with linear regression. We again use the new format for the database and use imputer to fill the missing values in runtime. We first create a fit for the average scores.

In [None]:
df = df_new.copy()

my_imputer = Imputer()

X2 = my_imputer.fit_transform(df[['runtime']])
df['runtime'] = X2

budget = ['budget','runtime']
training = df[budget]

target= ['vote_average']
target = df[target]

X = training.values
y = target.values

X_train1, X_test1, y_train1, y_test1 = train_test_split(
    X, y, test_size=0.3)

# Create linear regression object
regr1 = linear_model.LinearRegression()

# Train the model using the training sets
regr1.fit(X_train1, y_train1)

# Make predictions using the testing set
y_pred_lr1 = regr1.predict(X_test1)

print(regr1.coef_, regr1.intercept_)
print(r2_score(y_test1, y_pred_lr1))

f = plt.figure(figsize=(10,5))
plt.scatter(X_test1[:,1], y_test1, label="Real score");
plt.scatter(X_test1[:,1], y_pred_lr1, c='r',label="Predicted score");
plt.xlabel("Runtime");  # column 1 of X holds the runtime
plt.ylabel('Score');
plt.legend(loc=2);

Note that the R-squared is fairly low: budget and runtime explain only a small part of the variation in the average score.
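
As a reminder of what this number measures, R-squared compares the squared errors of the model with those of a baseline that always predicts the mean: R² = 1 - SS_res / SS_tot. The small sketch below implements this definition with NumPy on made-up values; `r_squared` is an illustrative helper, and in the notebook itself we use `r2_score` from scikit-learn.

```python
import numpy as np

def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

y_true = [6.0, 7.0, 8.0]
print(r_squared(y_true, [6.0, 7.0, 8.0]))   # perfect fit
print(r_squared(y_true, [7.0, 7.0, 7.0]))   # always predicting the mean
```

A perfect fit gives 1, always predicting the mean gives 0, and a model that does worse than the mean gives a negative value.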

Second, we create a linear fit for the revenue.

In [None]:
target = ['revenue']
target = df[target]

X = training.values
y = target.values

X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.3)

# Create linear regression object
regr2 = linear_model.LinearRegression()

# Train the model using the training sets
regr2.fit(X_train2, y_train2)

# Make predictions using the testing set
y_pred_lr2 = regr2.predict(X_test2)

print(regr2.coef_, regr2.intercept_)
print(r2_score(y_test2, y_pred_lr2))

f = plt.figure(figsize=(10,5))
plt.scatter(X_test2[:,1], y_test2, label="Real revenue");
plt.scatter(X_test2[:,1], y_pred_lr2, c='r',label="Predicted revenue");
plt.ylabel('Revenue');
plt.legend(loc=2);

Note that the R-squared is a lot higher this time: budget and runtime explain revenue much better than they explain the average score.

Now, we will create two random forest regressions. For this regression we will use the exact same training and testing data as we did with the linear regressions. We again start off with the average scores.

In [None]:
# Create random forest regression object (n_estimators=1, i.e. a single tree)
rf1 = RandomForestRegressor(n_estimators=1)

# Train the model using the training sets
rf1.fit(X_train1, y_train1)

# Make predictions using the testing set
y_pred_rf1 = rf1.predict(X_test1)

In [None]:
f = plt.figure(figsize=(10,5))
plt.scatter(X_test1[:,1], y_test1, s=50,label="Real Score");
plt.scatter(X_test1[:,1], y_pred_rf1,s=100, c='r',label="Predicted Score");
plt.ylabel("Score");
plt.legend(loc=2);

And secondly we fit the revenue.

In [None]:
# Create random forest regression object (n_estimators=1, i.e. a single tree)
rf2 = RandomForestRegressor(n_estimators=1)

# Train the model using the training sets
rf2.fit(X_train2, y_train2)

# Make predictions using the testing set
y_pred_rf2 = rf2.predict(X_test2)

In [None]:
f = plt.figure(figsize=(10,5))
plt.scatter(X_test2[:,1], y_test2, s=50,label="Real revenue");
plt.scatter(X_test2[:,1], y_pred_rf2,s=100, c='r',label="Predicted revenue");
plt.ylabel("Revenue");
plt.legend(loc=2);

### 4.3 Comparison of models

Now that we have created both linear and random forest regressions, we can compare both models. We will do this by using the mean squared error of both regressions. We visualize them using a bar chart.

In [None]:
error_lr = mean_squared_error(y_test1,y_pred_lr1)
error_rf = mean_squared_error(y_test1,y_pred_rf1)
print(error_lr)
print(error_rf)

f = plt.figure(figsize=(10,5))
plt.bar(range(2),[error_lr,error_rf])
plt.xlabel("Models");
plt.ylabel("Mean Squared Error of the Score");
plt.xticks(range(2),['Linear Regression','Random Forest']);

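The mean squared error of the score is almost twice as high for the random forest regression as it is for the linear regression. On this test split, the linear regression therefore predicts the score better. Keep in mind that the random forest used here consists of a single tree (n_estimators is set to 1), so this comparison does not show what a larger forest could achieve.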
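
For reference, the mean squared error is the average of the squared differences between the real and the predicted values. The sketch below computes it with NumPy on made-up values; `mse` and `rmse` are illustrative helpers, while the notebook itself uses `mean_squared_error` from scikit-learn. Both arrays are flattened first, because subtracting an array of shape (n, 1) from one of shape (n,) would otherwise broadcast to an (n, n) matrix.

```python
import numpy as np

def mse(y_true, y_pred):
    # Flatten both inputs so that (n, 1) and (n,) arrays are compared element-wise.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    # Root of the MSE: an error in the same unit as the target.
    return np.sqrt(mse(y_true, y_pred))

y_true = [[6.0], [7.0], [8.0]]   # column vector, like y_test1
y_pred = [6.5, 7.0, 7.0]         # flat vector, like y_pred_rf1
print(mse(y_true, y_pred), rmse(y_true, y_pred))
```

Because the errors are squared, the MSE of the revenue is expressed in squared dollars and is dominated by a few very large misses; the root of the MSE is often easier to interpret.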

We now do the same for the revenue regressions.

In [None]:
error_lr = mean_squared_error(y_test2,y_pred_lr2)
error_rf = mean_squared_error(y_test2,y_pred_rf2)
print(error_lr)
print(error_rf)

f = plt.figure(figsize=(10,5))
plt.bar(range(2),[error_lr,error_rf])
plt.xlabel("Models");
plt.ylabel("Mean Squared Error of the Revenue");
plt.xticks(range(2),['Linear Regression','Random Forest']);

This time we see that the mean squared error for the random forest regression  is even higher compared to the linear regression than it was last time. So again, the linear regression is a much better representation for the data.

We can now finally answer the two main questions we formulated at the beginning of this chapter. We can predict both the revenue and the score using the three models described above. The results are shown below. We chose a budget of 250 million and a runtime of 165 minutes as our input variables. The predictions are shown in the following order: KNN, Linear Regression and Random Forest Regression.

In [None]:
print("Score predictions:")

pred = knn_score.predict([[250000000, 165]])
print(pred)

pred = regr1.predict([[250000000, 165]])
print(pred)

pred = rf1.predict([[250000000, 165]])
print(pred)

print("\nRevenue predictions:")

pred = knn_rev.predict([[250000000, 165]])
print(pred)

pred = regr2.predict([[250000000, 165]])
print(pred)

pred = rf2.predict([[250000000, 165]])
print(pred)

We see that the predictions for the average score are all fairly close together. That three different models arrive at similar values may indicate that together they form a reasonable prediction for the average score of a movie. Note, however, that we found the linear regression to be more accurate than the random forest regression, and that we were not able to make any claims about the quality of the KNN model.
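One way to get a handle on the quality of the KNN model would be to score it on held-out data with the same mean squared error. The sketch below illustrates the idea with a tiny hand-written KNN regressor on made-up data. It is not the `knn_score` model from this notebook, and all names and numbers in it are hypothetical.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Made-up training data: (budget in millions, runtime in minutes) -> score
train_X = [(10.0, 90.0), (20.0, 100.0), (30.0, 110.0), (100.0, 150.0)]
train_y = [5.0, 6.0, 7.0, 8.0]

# Made-up held-out movies with known scores
test_X = [(20.0, 100.0), (90.0, 140.0)]
test_y = [6.5, 7.0]

# Score the KNN on the held-out movies with the mean squared error
predictions = [knn_predict(train_X, train_y, x) for x in test_X]
knn_mse = sum((t - p) ** 2 for t, p in zip(test_y, predictions)) / len(test_y)
print(predictions, knn_mse)
```

An error computed this way could be placed next to the errors of the linear and random forest regressions in the same bar chart. Note that a KNN on raw budget and runtime measures distance almost entirely by budget, since budgets are numerically far larger than runtimes.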

The revenue predictions are further apart. Since the mean squared error of the random forest regression was much higher than that of the linear regression, it makes sense that these two predictions are fairly far apart. In this case we should trust the linear regression more. If we combine this with the KNN, we can predict that the revenue for a movie with a budget of 250 million and a runtime of 165 minutes lies somewhere between 750 and 875 million.

## 5. Conclusion

**Basic Descriptive:**<br>
Firstly, the vote average was divided into classes to make it easier to work with. A log transformation was applied to each of the categories in order to reduce the skew of its distribution. This transformation preserves the ordering of the values, but spreads out the distribution so that it is easier to inspect.
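To illustrate the effect of such a log transformation, the sketch below computes the skewness of a small made-up sample before and after the transform. We use `log1p` (the log of 1 + x) here so that zero values stay defined. That choice and the sample values are assumptions for this illustration, not necessarily the exact transform used in the descriptives.

```python
import math

def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up, strongly right-skewed sample (think of budgets or vote counts)
raw = [1, 10, 100, 1000, 10000]
logged = [math.log1p(v) for v in raw]

print(skewness(raw))     # clearly positive: long right tail
print(skewness(logged))  # much closer to zero

# The transform is monotonic, so the ordering of the values is unchanged
print(logged == sorted(logged))
```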

The scatter plots show the correlation between each pair of categories. We can conclude that there is a strong positive relationship between vote count and popularity. The same goes for the association between vote count and revenue. In both cases the points cluster along a straight line (linear relationship). Between revenue and budget, and between revenue and popularity, there is a slightly positive linear relationship. Lastly, we see no clear relationship between the budget of a movie and its vote count. For the other categories and their relationships we cannot really draw inferences on the basis of the scatter plots.

Via a choropleth map we created an overview of the number of movies produced in each country. Remarkably, India does not appear among the top countries of the movie industry, despite the fact that Bollywood is located there.

**Genres:**<br>
We started our analysis with the examination of genres. What we can clearly see in the distribution of the genres is that the four most common genres cover 51% of all the movies. The five most common genres are, in order, Drama, Comedy, Thriller, Action, and Romance.

We also examined the relationship between genres and revenue, budget, and vote average. To do so, we created a new top five, first ranked on vote average. This top five shows a completely different ranking from the five most common genres: only Drama appears in both, ranked third in the new table. Remarkably, History, War, and Music rank highly in the vote average top five while they are at the bottom of the list for distribution.

We also did this for budget and revenue, and those graphs are rather similar. The genres Animation, Adventure, and Fantasy are the top three in both tables. This is what we would expect, given that the Pearson correlation between budget and revenue is 0.73.
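As a reminder of what the Pearson correlation measures, the sketch below computes it by hand: the covariance of the two variables divided by the product of their standard deviations. The budgets and revenues are made up for illustration and the function name `pearson` is our own, so the resulting value is not the 0.73 from our data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up budgets and revenues in millions (illustrative only)
budget = [10, 50, 100, 150, 200]
revenue = [30, 60, 350, 300, 700]
print(pearson(budget, revenue))
```

A value close to 1 means that the two variables rise together almost linearly, which is why genres with high budgets tend to show up in the revenue ranking as well.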

We then looked at how budget, revenue, and vote average evolved over the years. We can clearly see that the budget for Animation, Adventure, Action, Fantasy, Family, and Science Fiction noticeably increases over time. For the revenues we can clearly see an increase over the years per genre. Furthermore, the genres that increase in budget also increase in revenue, which can be explained by the correlation between those variables. For vote average there is no obvious trend over the last 29 years. However, we noticed that there are more outliers in the first 15 years than in the last 14 years.

**Keywords:**<br>
Further research on the keywords shows that the keywords woman director, independent film, duringcreditsstinger, based on novel, and murder are used 324, 318, 307, 197, and 197 times respectively. To get further insight into these keywords we took their roots and synonyms into account, which resulted in a decrease in the number of keywords from 9474 to 2110.

Now we can order the list of keywords by average vote (rating), budget and revenue. When we order by average vote we can see that there is a new top 5 (Brazilian, jedi, bittersweet, loss of sense of reality and fascism), which does not include any of the most frequently used keywords.
 
When we rank on budget we see yet another top 5, containing 'swashbuckler', 'based on fairy tale', 'hobbit', 'marvel cinematic universe', and 'east india trading company'. This in turn differs from the top 5 keywords when ranked by revenue, although hobbit is represented in both graphs.

**Actors:**<br>
A list of all actors was made. We added the constraint that an actor must be linked to at least 10 movies. The list makes it possible to rank actors by the average vote, budget and gross revenue of the movies they played in. We can clearly see that a few actors stand out. Ian McKellen has the highest average vote and played in movies with really high budgets and revenues, which suggests that his acting skills are quite impressive. Emma Watson and Anne Hathaway are also in this high-performing segment.
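The filtering and ranking step described above can be sketched as follows. This is a minimal pure-Python illustration on made-up appearances, not the code used for the actual ranking. The function name `rank_actors`, the placeholder actor names and the threshold of 2 (standing in for the 10 used in the text) are all hypothetical.

```python
from collections import defaultdict

def rank_actors(credits, min_movies):
    """Mean vote per actor, keeping only actors with at least `min_movies` films.

    `credits` is a list of (actor, vote_average) pairs, one per appearance.
    """
    votes = defaultdict(list)
    for actor, vote in credits:
        votes[actor].append(vote)
    means = {
        actor: sum(v) / len(v)
        for actor, v in votes.items()
        if len(v) >= min_movies
    }
    # Highest average vote first
    return sorted(means.items(), key=lambda item: item[1], reverse=True)

# Made-up appearances (placeholder names, illustrative votes)
credits = [
    ("Actor A", 8.0), ("Actor A", 7.0),
    ("Actor B", 9.0),
    ("Actor C", 6.0), ("Actor C", 6.5), ("Actor C", 7.0),
]
ranking = rank_actors(credits, min_movies=2)
print(ranking)
```

The minimum number of movies keeps actors with a single highly rated film, like "Actor B" here, from dominating the ranking.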

Furthermore, we looked at actors that played in more than 20 movies. Here Christopher Plummer, Morgan Freeman and Samuel L. Jackson show the highest rankings. Lastly, we added an interactive polar chart of the average ratings Morgan Freeman received for his roles, divided by genre. The distribution is pretty much the same for drama, thriller and crime, as is the distribution for comedy and action, because a number of movies are assigned to multiple genres. Unfortunately, the plot is not able to show the title of a movie for confirmation.

**Directors:**<br>
It is noticeable that Steven Spielberg has the highest total revenue over the 27 movies he directed. This is probably because he directed the most movies of all directors with 4+ movies. However, when we look at the average revenue of directors we notice that Steven Spielberg does not appear in the top 10. This might be caused by some of his movies having a relatively low income. Not every movie can be a blockbuster.

Other directors like Peter Jackson and James Cameron appear really high in both lists. This means that the overall quality (measured by revenue) of their movies is higher than that of the movies of Steven Spielberg. When we compare this to the average budget a director gets for his movies, we see that Steven Spielberg closes the top 10, which means that his average budget is not that high in comparison to, for example, Michael Bay.

If we compare the revenue, average revenue and average budget of the top 10 directors with 4+ movies with the top 10 directors with 4+ movies by mean IMDB rating, only Christopher Nolan appears in both lists. This implies that a big budget doesn't necessarily lead to a good, or well-received, movie. On the other hand, it shows that some directors, for instance Hayao Miyazaki, are capable of creating excellent movies without needing a very high budget.

**Prediction:**<br>
Using the independent variables budget and runtime, we predicted the revenue of a movie and its average score on IMDB with the KNN, random forest regression and linear regression methods. After comparing the Mean Squared Errors, we concluded that the linear regression represents the data better than the random forest regression for both target variables: the average score and the revenue. Unfortunately, we were not able to make any claims about the quality of the K-Nearest-Neighbours model.

Lastly, we made the predictions. The predictions for the average score are all fairly close together. This may indicate that the three models together form a reasonable prediction for the average score of a movie. The revenue predictions are further apart, which fits with the mean squared error of the random forest regression being much higher than that of the linear regression. Therefore, we should trust the linear regression more.


In [None]:
df = df_new.copy()

# Extract the release year from the release date
df['release_year'] = pd.to_datetime(df['release_date']).dt.year

# Replace missing values in these columns by the column mean
my_imputer = Imputer()
for col in ['runtime', 'vote_count', 'release_year', 'popularity']:
    df[col] = my_imputer.fit_transform(df[[col]]).ravel()

predictors = ['budget', 'vote_count', 'vote_average','runtime', 'release_year', 'popularity']
training = df[predictors]

target= ['revenue']
target = df[target]

X = training.values
y = target.values

X_train1, X_test1, y_train1, y_test1 = train_test_split(
    X, y, test_size=0.3)

# Create linear regression object
regr1 = linear_model.LinearRegression()

# Train the model using the training sets
regr1.fit(X_train1, y_train1)

# Make predictions using the testing set
y_pred_lr1 = regr1.predict(X_test1)

print(r2_score(y_test1, y_pred_lr1))

f = plt.figure(figsize=(10,5))
plt.scatter(X_test1[:,1], y_test1, label="Real Revenue");
plt.scatter(X_test1[:,1], y_pred_lr1, c='r',label="Predicted Revenue");
plt.xlabel('Vote count');
plt.ylabel('Revenue');
plt.legend(loc=2);

In the regression above, we included all the numerical variables we have available as predictors. Even though it doesn't make sense to predict the revenue of a movie with variables that only become available once the revenue itself is known (such as vote count and popularity), we did so here for the sole purpose of arriving at a better predictive model for the revenue.

We see that this linear model is indeed a better prediction model. We arrive at an R-squared of 75%, meaning that 75% of the variation in the revenue is explained by the variables we used. This is a lot higher than the 55% we had beforehand.
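For reference, the R-squared reported above compares the squared errors of the model with the squared errors of always predicting the mean. The sketch below works this out by hand on made-up revenues. The numbers are illustrative only and the function name `r_squared` is our own.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up revenues in millions (illustrative only)
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [150.0, 150.0, 350.0, 350.0]
print(r_squared(y_true, y_pred))
```

A value of 1 corresponds to perfect predictions and a value of 0 to a model that does no better than predicting the mean revenue for every movie.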