
# Project: Investigate the IMDB dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

I have selected the IMDB dataset to review the impact that genres and cast members have on the profitability of a movie. 

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [None]:
df = pd.read_csv('tmdb-movies.csv')

In [None]:
df.head(1)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

### Data Cleaning of the genre and cast columns, as well as data type correction

The main problem I have discovered with the data is that the genres and cast columns contain multiple pipe-separated entries per field. In addition, the budget column is an int while the budget_adj and revenue_adj columns are floats, so the data types need to be made consistent.

In [None]:
df['genres'].value_counts()

Below we are splitting the genre column into multiple rows, so that we are better able to analyze budget and revenue across genres.

In [None]:
# Split each pipe-separated string into columns, then stack them into one genre per row
genre_df = df['genres'].str.split('|').apply(pd.Series).stack()
genre_df.index = genre_df.index.droplevel(-1)
genre_df.name = 'genres'
del df['genres']
df = df.join(genre_df)
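
The split-and-stack step above can also be written with `DataFrame.explode`, which is available in pandas 0.25 and later. The sketch below is only an illustration on a made-up two-row frame (`movies` and its titles are assumptions, not rows from the dataset), and it is an alternative to the notebook's approach rather than a replacement for it.

```python
import pandas as pd

# Toy stand-in for the movie table; the titles and genres are made up.
movies = pd.DataFrame({
    'original_title': ['Film A', 'Film B'],
    'genres': ['Action|Adventure', 'Drama'],
})

# Turn each pipe-separated string into a list, then emit one row per list element.
# explode() repeats the original index label for every element it produces.
one_genre_per_row = movies.assign(
    genres=movies['genres'].str.split('|')
).explode('genres')

print(one_genre_per_row)
```

Because the index labels are repeated, a film with two genres appears twice in the result, which matters later when rows are counted or averaged.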

Below we are changing the budget_adj and revenue_adj columns from float to int.

In [None]:
df['budget_adj'] = df['budget_adj'].astype(int)

In [None]:
df['revenue_adj'] = df['revenue_adj'].astype(int)

In [None]:
df.info()

In [None]:
df['genres'].value_counts()

We will now be correcting the cast column so that it contains one actor per row, instead of multiple actors in each row. 

In [None]:
df['cast'].value_counts()

In [None]:
# Split each pipe-separated string into columns, then stack them into one actor per row
cast_df = df['cast'].str.split('|').apply(pd.Series).stack()
cast_df.index = cast_df.index.droplevel(-1)
cast_df.name = 'cast'
del df['cast']
df = df.join(cast_df)

In [None]:
df.head(2)

We will now also be looking at the '0' values in revenue and budget, as these influence our profit column and therefore our analysis.

In [None]:
df.info()

In [None]:
#replace 0 values in budget with mean for budget (calculated by df.describe())
df['budget'] = df['budget'].replace(to_replace=0, value=1.462570e+07)

In [None]:
#replace 0 values in revenue with mean for revenue (calculated by df.describe())
df['revenue'] = df['revenue'].replace(to_replace=0, value=3.982332e+07)
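
The two fill values above are typed in by hand from the `df.describe()` output. As a hedged sketch, the same idea can be expressed by computing the mean in code; `budgets` below is a made-up Series, not the dataset. Note that in this notebook the replacement happens after the frame has been expanded to one row per genre and cast member, so a mean recomputed at that point would weight films with many genres and cast members more heavily and would not match the hand-typed figure.

```python
import pandas as pd

# Made-up budget values; zeros stand for missing budgets.
budgets = pd.Series([0, 10_000_000, 20_000_000, 0], name='budget')

# Mean over every row, zeros included, which is what df.describe() reports.
mean_budget = budgets.mean()

# Swap the placeholder zeros for the computed mean.
filled = budgets.replace(0, mean_budget)

print(filled)
```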

In [None]:
df.head()

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1: How does genre influence the gross revenue for a film?

In [None]:
df['profit'] = df['revenue'] - df['budget']

In [None]:
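# Average only the numeric columns; text columns such as titles cannot be averaged
genre_rev = df.groupby('genres').mean(numeric_only=True)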
genre_rev

In [None]:
# What are the highest and lowest mean profits across genres?
max_profit = genre_rev['profit'].max()
min_profit = genre_rev['profit'].min()
print(max_profit, min_profit)

In [None]:
# The range of mean profit across genres
range_profit = max_profit - min_profit
range_profit
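
To make the arithmetic in the cells above concrete, here is a small worked sketch on made-up numbers (the `films` frame is an assumption for illustration only). It computes profit per row, averages it per genre, and then takes the spread between the best and worst genre means.

```python
import pandas as pd

# Made-up revenue and budget figures, one row per film and genre.
films = pd.DataFrame({
    'genres': ['Adventure', 'Adventure', 'Documentary', 'Drama'],
    'revenue': [300, 500, 20, 90],
    'budget': [100, 200, 10, 60],
})
films['profit'] = films['revenue'] - films['budget']

# Mean profit per genre: Adventure 250, Documentary 10, Drama 30.
mean_profit_by_genre = films.groupby('genres')['profit'].mean()

highest_mean = mean_profit_by_genre.max()
lowest_mean = mean_profit_by_genre.min()
spread = highest_mean - lowest_mean

print(highest_mean, lowest_mean, spread)
```

The spread here is a difference between genre averages, not between individual films, which is also what `range_profit` measures in the notebook.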

In [None]:
plt.bar(["Maximum profit", "Minimum profit"],[max_profit, min_profit])
plt.title("Highest and lowest mean profit by genre in the IMDB dataset")
plt.xlabel("Extreme")
plt.ylabel("Profit in US dollars");

In [None]:
df['profit'].hist()
plt.title('Histogram of film profit')
plt.xlabel('Profit in US dollars')
plt.ylabel('Number of rows');

In [None]:
genre_rev.dropna(inplace=True)
plt.subplots(figsize=(12, 5))
plt.bar(genre_rev.index, genre_rev.profit)
plt.xticks(rotation=90)
plt.xlabel('Genre')
plt.ylabel('Profit')
plt.title('An illustration of how profitable various genres in the movie industry are');

The genres that generate the highest amount of profit are adventure, animation and fantasy. The genres that seem to have the lowest profits are documentaries, foreign films, TV movies and Western movies. This is not exactly surprising as adventure films like the Marvel franchise and Jurassic Park are enormous blockbusters that shaped culture. However, documentaries and foreign films are more niche films that appeal to specific target audiences and do not always have mass market appeal.

### Research Question 2: Are certain cast members associated with higher profits?

Below is a list of cast members and the number of times they show up in the IMDB movie dataset.

In [None]:
df['cast'].value_counts()

In [None]:
cast_rev = df.groupby('cast').mean(numeric_only=True)
cast_rev

In [None]:
cast_and_profit = df[['cast', 'profit']]
cast_and_profit.head()

By "highest earning actor" I mean the actor whose films earned the most for the production house, not the actor who was paid the most. This is why I am looking at film profit rather than at what the actors personally received.

In [None]:
highest_earning_actor = cast_and_profit['profit'].max()
lowest_earning_actor = cast_and_profit['profit'].min()
highest_earning_actor, lowest_earning_actor

In [None]:
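# Highest value of each numeric column per actor; only the profit column is used below
max_profit_by_actor = df.groupby('cast').max(numeric_only=True)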

The actors selected below are the ten who appear most often in the `df['cast'].value_counts()` output above.

In [None]:
topten_actors = max_profit_by_actor.query('cast in ["Samuel L. Jackson", "Bruce Willis", "Nicolas Cage", "Eddie Murphy", "Robert De Niro", "Clint Eastwood", "Mel Gibson", "Antonio Banderas", "Michael Caine", "Donald Sutherland"]')

In [None]:
topten_actors
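
The list of names above was copied by hand from the `value_counts` output. A minimal sketch of selecting the most frequent actors in code follows, on a made-up frame (`rows`, the film titles and the actor names are all hypothetical). It counts distinct films per actor, since in a frame expanded to one row per genre and cast member a raw row count would count a multi-genre film more than once.

```python
import pandas as pd

# Made-up rows shaped like the expanded frame: one row per film, genre and cast member.
rows = pd.DataFrame(
    [
        ('Film A', 'Action', 'Actor X', 50),
        ('Film A', 'Drama', 'Actor X', 50),
        ('Film A', 'Action', 'Actor Y', 50),
        ('Film A', 'Drama', 'Actor Y', 50),
        ('Film B', 'Drama', 'Actor Y', -10),
        ('Film B', 'Drama', 'Actor Z', -10),
        ('Film C', 'Comedy', 'Actor X', 120),
        ('Film C', 'Comedy', 'Actor Y', 120),
    ],
    columns=['original_title', 'genres', 'cast', 'profit'],
)

# Count distinct films per actor so multi-genre films are not counted twice.
films_per_actor = (
    rows.groupby('cast')['original_title']
    .nunique()
    .sort_values(ascending=False)
)

# Keep the two most frequent actors (the notebook keeps ten).
top_n = 2
top_actors = films_per_actor.head(top_n).index.tolist()

# Highest single-film profit for each of the selected actors.
max_profit_top = rows[rows['cast'].isin(top_actors)].groupby('cast')['profit'].max()

print(top_actors)
print(max_profit_top)
```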

In [None]:
plt.bar(topten_actors.index, topten_actors.profit)
plt.xlabel('Actor')
plt.ylabel('Profit')
plt.title('Maximum film profit for the ten most frequently cast actors')
plt.xticks(rotation=90);

Michael Caine seems to have brought in the highest profit of these ten actors, so let's list the movies he has been in.

In [None]:
michael_c = df.query('cast == "Michael Caine"')
michael_c_updated = michael_c[['original_title', 'profit']]

In [None]:
michael_c_updated.original_title.nunique()

In [None]:
michael_films = michael_c_updated.groupby('original_title').mean()

Below are the films Michael Caine has starred in, together with the profit recorded for each film.

In [None]:
michael_films = michael_films.loc[(michael_films!=0).any(axis=1)]
michael_films
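
The filter in the cell above keeps a row only if at least one of its values is non-zero. As an illustrative sketch on made-up data (`film_profit` and its titles are assumptions), the boolean mask works like this:

```python
import pandas as pd

# Made-up per-film profit table indexed by title.
film_profit = pd.DataFrame(
    {'profit': [120.0, 0.0, -15.0]},
    index=pd.Index(['Film A', 'Film B', 'Film C'], name='original_title'),
)

# (film_profit != 0) marks non-zero cells; any(axis=1) is True for a row
# that has at least one non-zero cell, so all-zero rows are dropped.
keep = (film_profit != 0).any(axis=1)
nonzero_films = film_profit.loc[keep]

print(nonzero_films)
```

Negative profits are kept, because the test is for values different from zero rather than for positive values.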

In [None]:
plt.bar(michael_films.index, michael_films.profit)
plt.xlabel('Films')
plt.ylabel('Profit')
plt.title("How much profit Michael Caine's films have returned")
plt.xticks(rotation=90);

In [None]:
michael_genre = michael_c.groupby('genres').mean(numeric_only=True)
michael_genre

In [None]:
# idxmin on a Series returns the index label (here the genre) of the smallest value
min_profit = michael_genre['profit'].idxmin()
print("Michael Caine's least profitable genre is", min_profit)

In [None]:
# idxmax on a Series returns the index label (here the genre) of the largest value
max_profit = michael_genre['profit'].idxmax()
print("Michael Caine's most profitable genre is", max_profit)
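
For readers unfamiliar with `idxmin` and `idxmax`, here is a short sketch on a made-up Series (`profit_by_genre` is hypothetical). Both methods return the index label of the extreme value rather than the value itself, and on a Series there is only one axis to search.

```python
import pandas as pd

# Made-up mean profit per genre, indexed by genre name.
profit_by_genre = pd.Series(
    [40.0, -5.0, 75.0],
    index=['Comedy', 'Drama', 'Thriller'],
    name='profit',
)

least_profitable = profit_by_genre.idxmin()  # label of the smallest value
most_profitable = profit_by_genre.idxmax()   # label of the largest value

print(least_profitable, most_profitable)
```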

In [None]:
ave_profit = michael_genre['profit'].mean()
print("The average profit obtained by Michael Caine's films across genres is US$", ave_profit)

### Research Question 3: Does profitability correlate with popularity?

In [None]:
pop_genre = df.groupby('genres').mean(numeric_only=True)
pop_genre

Let's discover what the distribution of popularity looks like. 

In [None]:
pop_genre['popularity'].hist()
plt.title('A histogram of mean popularity scores by genre')
plt.xlabel('Mean popularity score')
plt.ylabel('Number of genres');

The distribution of mean popularity scores appears to be skewed to the right rather than following a normal distribution.
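
A visual impression of skew can be checked numerically. The sketch below uses made-up scores (`scores` is an assumption, not the dataset) and relies on `Series.skew()`, where a positive value indicates a longer tail on the right; a mean larger than the median points the same way.

```python
import pandas as pd

# Made-up popularity scores with one large value pulling the tail to the right.
scores = pd.Series([0.2, 0.3, 0.3, 0.4, 0.5, 2.0])

skewness = scores.skew()          # sample skewness; positive means right-skewed
mean_score = scores.mean()
median_score = scores.median()

print(skewness, mean_score, median_score)
```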

In [None]:
x_var = pop_genre['popularity']
y_var = pop_genre['profit']

plt.plot(x_var, y_var, 'o', color='red')
plt.title('A comparison between popularity and profitability')
plt.xlabel('Popularity score')
plt.ylabel('Profit');

There does seem to be a positive correlation: genres with a higher mean popularity score also tend to have a higher mean profit. However, the points do not fall on a straight line, so further investigation and more data would be needed to provide more meaningful insights.
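
The strength of a linear relationship like the one in the scatter plot can be summarised with a Pearson correlation coefficient. This is a minimal sketch on made-up per-genre averages (`by_genre` and its values are assumptions); it uses `Series.corr`, which computes Pearson's r by default.

```python
import pandas as pd

# Made-up mean popularity and mean profit for four hypothetical genres.
by_genre = pd.DataFrame(
    {
        'popularity': [0.2, 0.5, 0.9, 1.4],
        'profit': [5, 20, 30, 70],
    },
    index=['Genre A', 'Genre B', 'Genre C', 'Genre D'],
)

# Pearson's r lies between -1 and 1; values near 1 mean a strong positive linear relationship.
r = by_genre['popularity'].corr(by_genre['profit'])

print(round(r, 3))
```

A coefficient close to 1 would support the visual impression of a positive relationship, but it would still say nothing about causation.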

<a id='conclusions'></a>
## Conclusions

The most profitable genres seem to be the adventure, animation and fantasy genres, while the least profitable genres are documentaries, foreign films and TV movies.

The range in potential profit is enormous, with a difference of US$74 746 929 between the highest and the lowest profit figures.

The difference between actors' profits is even more substantial, with some actors associated with negative profits.

Of the ten actors who occurred the most, Michael Caine was the one whose films had the highest profit. However, this does not imply that he was solely responsible for those profits, as multiple other factors go into creating a successful film, for example the marketing of the film, its genre, the director's decisions and the writing quality.

Michael Caine's two most profitable films are The Dark Knight and The Dark Knight Rises. This isn't exactly surprising to me, as Batman is my favourite superhero of all time.

Popularity does seem to be related to profitability. However, there isn't enough information to establish causation.

The main limitations of the dataset are not knowing how many people actually viewed each movie, how much individual actors earned for each movie and which awards the movie may have received.

References:
Chhibber, A., 2017. IMDB - Analysis by Genres. [online] Kaggle.com. Available at: <https://www.kaggle.com/abhishekchhibber/imdb-analysis-by-genres> [Accessed 22 February 2021].
https://stackoverflow.com/questions/22649693/drop-rows-with-all-zeros-in-pandas-data-frame
https://pandas.pydata.org/pandas-docs/version/0.25.2/reference/api/pandas.DataFrame.query.html
https://thispointer.com/pandas-dataframe-get-minimum-values-in-rows-or-columns-their-index-position/
https://jakevdp.github.io/PythonDataScienceHandbook/04.02-simple-scatter-plots.html
Udacity notes

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])