# An Analysis of the TMDb Movie Database - Profitability Trends

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this project I analyze the TMDb movie data set. I explore the profitability of movies: whether it is correlated with a movie's budget or genre, and how revenues and budgets have changed over the years.

The following are the packages that I will use:


In [None]:
# The import statements for all of the packages that we
# plan to use.

import numpy as np                  # numpy version 1.16.2
import pandas as pd                 # pandas version 0.24.2
import matplotlib.pyplot as plt     # matplotlib version 3.0.3
import seaborn as sns               # seaborn version 0.9.0
%matplotlib inline


<a id='wrangling'></a>
## Data Wrangling

According to the description that introduces the dataset on kaggle.com, this dataset has already been cleaned. However, I still wanted to check for anomalies, so I explored the data as shown below.

I looked at the data in general: its shape, summary statistics, and general info.

I discovered that there are several movies with a budget that is equal to 0 dollars. I removed those from the data set as my analysis depends on seeing trends between the budget and revenue of movies.

I also used the adjusted budget and revenue columns (`budget_adj` and `revenue_adj`), because inflation-adjusted figures make movies from different years comparable.

### General Properties

In [None]:
# Load the data and print the first few rows. The cells that follow inspect the data
#    types and look for missing or possibly errant values. The data is stored in the file:
#    "tmdb-movies.csv"

df = pd.read_csv('tmdb-movies.csv')
df.head()


In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

### Data Cleaning I: Removing Columns of no Interest

I removed several columns from the dataset because they are not relevant to my main analysis.

In [None]:
df.drop(['id','imdb_id'],axis = 1, inplace=True)

In [None]:
df.drop(['original_title','cast','director','tagline','keywords','overview'],axis = 1, inplace=True)

### Data Cleaning II: Removing Movies with a Budget = 0 Dollars
I then removed all the entries with a reported budget of 0 dollars, as I am interested in comparing budget to revenue.
I then recomputed the summary statistics of the dataset to make sure that every remaining movie has a budget.

In [None]:
df = df[df.budget != 0]

In [None]:
df.describe()
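
Rather than reading the minimum budget off `describe()`, the check can also be made explicit. The following is only a minimal sketch on a made-up toy frame (the `toy` data and its values are assumptions for illustration, not rows from the dataset); it counts the zero-budget rows, applies the same filter as above, and asserts that none remain.

```python
import pandas as pd

# Toy stand-in for the movie table; the values are made up for illustration.
toy = pd.DataFrame({
    "budget": [0, 1_000_000, 0, 25_000_000],
    "revenue": [500_000, 3_000_000, 0, 10_000_000],
})

# Count the zero-budget rows before filtering.
n_zero_budget = int((toy["budget"] == 0).sum())

# Same kind of filter as in the notebook cell, applied to the toy frame.
toy_clean = toy[toy["budget"] != 0]

# Explicit check that every remaining row has a positive budget.
assert (toy_clean["budget"] > 0).all()

print(n_zero_budget, len(toy_clean))
```

The same three steps should work on `df` directly, assuming the `budget` column is numeric.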

<a id='eda'></a>
## Exploratory Data Analysis

### Is the budget correlated with the profitability of a movie?
In this section I analyze whether the budget affects the profitability of a movie.

I divided the movies into 4 categories:

1. Highly profitable, where the revenue is at least 5 times the budget.
2. Reasonably profitable, where the revenue is between 2 times and 5 times the budget.
3. Barely profitable, where the revenue is more than the budget but less than twice the budget.
4. Lost, where the movie did not turn any profit.

I then calculated the mean of the budget of the movies in these 4 categories.
I also plotted the histograms of the 4 categories over the budget to see if there is a correlation between profitability and the budget.

In [None]:
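# Boolean masks for the four profitability categories (inflation-adjusted figures)
hihgly_proft = df.revenue_adj >= 5*df.budget_adj
reasonably_profit = (df.revenue_adj >= 2*df.budget_adj) & (df.revenue_adj < 5*df.budget_adj)
barely_profit = (df.revenue_adj > df.budget_adj) & (df.revenue_adj < 2*df.budget_adj)
# Break-even movies made no profit, so they count as lost
lost = df.revenue_adj <= df.budget_adj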

In [None]:
"{:,}".format(int(df.budget_adj[hihgly_proft].mean()))

In [None]:
"{:,}".format(int(df.budget_adj[reasonably_profit].mean()))

In [None]:
"{:,}".format(int(df.budget_adj[barely_profit].mean()))

In [None]:
"{:,}".format(int(df.budget_adj[lost].mean()))

In [None]:
df.budget_adj[hihgly_proft].plot.hist(xlim=(10**4,2*10**8),figsize=(20,10),alpha = 0.5,bins = 50, label='Highly Profitable')
df.budget_adj[reasonably_profit].plot.hist(xlim=(10**4,2*10**8),figsize=(20,10),alpha = 0.5, bins = 50, label='Reasonably Profitable')
df.budget_adj[barely_profit].plot.hist(xlim=(10**4,2*10**8),figsize=(20,10), alpha = 0.5,bins = 50, label='Barely Profitable')
df.budget_adj[lost].plot.hist(xlim=(10**4,2*10**8),figsize=(20,10), alpha = 0.5,bins = 50, label='Lost')
plt.legend();
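
The four category means computed above can also be obtained in one pass by labelling each row and grouping. This is a sketch under stated assumptions: the helper `profit_category` and the toy numbers are hypothetical, the cut-offs default to the 5x and 2x multiples used in the masks, break-even movies are treated as lost, and budgets are assumed to be non-zero (as they are after the cleaning step).

```python
import numpy as np
import pandas as pd

def profit_category(revenue, budget, high=5, mid=2):
    """Label each movie by its revenue-to-budget ratio (hypothetical helper)."""
    ratio = revenue / budget  # assumes budget is never zero
    # np.select picks the first condition that is true for each row.
    conditions = [ratio >= high, ratio >= mid, ratio > 1]
    labels = ["highly", "reasonably", "barely"]
    return pd.Series(np.select(conditions, labels, default="lost"),
                     index=revenue.index)

# Made-up numbers for illustration only.
toy = pd.DataFrame({
    "budget_adj": [10.0, 10.0, 10.0, 10.0, 20.0],
    "revenue_adj": [60.0, 30.0, 15.0, 5.0, 200.0],
})

category = profit_category(toy["revenue_adj"], toy["budget_adj"])

# Mean budget per category in a single groupby.
means = toy.groupby(category)["budget_adj"].mean()
print(means)
```

Keeping the cut-offs as parameters makes it easy to try other thresholds without rewriting the masks.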

Then I analyzed the profit (revenue minus budget) not as a multiple of the budget but as an absolute amount, with the following categories:
1) Huge profit, where a movie's profit is 100 Million Dollars or more.
2) Big profit, where a movie's profit is at least 10 Million Dollars but less than 100 Million.
3) Small profit, where a movie's profit is at least 1 Million Dollars but less than 10 Million.
4) A losing movie, turning no profit.

For this analysis I focused on movies with a commercial budget of 1 Million Dollars or above.

In [None]:
huge_proft = (df.revenue_adj - df.budget_adj) >= 10**8
big_profit = ((df.revenue_adj - df.budget_adj) >= 10**7)  & ((df.revenue_adj - df.budget_adj) < 10**8)
small_profit = ((df.revenue_adj - df.budget_adj) >= 10**6) & ((df.revenue_adj - df.budget_adj) < 10**7)
lost_2 = (df.revenue_adj - df.budget_adj) <= 0

In [None]:
df.budget_adj[huge_proft].plot.hist(xlim=(10**6,2*10**8),figsize=(20,10),alpha = 0.5,bins = 100, label='Profit > 100 Million')
df.budget_adj[big_profit].plot.hist(xlim=(10**6,2*10**8),figsize=(20,10),alpha = 0.5,bins = 100, label='100 Million > Profit > 10 Million')
df.budget_adj[small_profit].plot.hist(ylim=(0,250),xlim=(10**6,2*10**8),figsize=(20,10), alpha = 0.5,bins = 100, label='Profit between 1 & 10 Million')
df.budget_adj[lost_2].plot.hist(ylim=(0,250),xlim=(10**6,2*10**8),figsize=(20,10), alpha = 0.5,bins = 100, label='Lost money')
plt.legend();

### Is the profitability correlated with the genre

I analyzed the number of movies per main genre (the first genre listed in the `genres` column). I plotted the count of movies that secured a profit of 100 Million Dollars or more and the count of movies that lost money. In this analysis, I focused on movies with a budget of at least one million dollars (highly commercial movies).

In [None]:
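# Take an explicit copy so that modifying the genres column below does not act on a view of df
df_commercial = df[df.budget_adj >= 10**6].copy()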

In [None]:
df_commercial.genres = df_commercial.genres.str.split('|').str[0].str.strip()

In [None]:
df_commercial.head()

In [None]:
df_commercial_huge_profit = df_commercial[(df_commercial.revenue_adj - df_commercial.budget_adj) >= 10**8]
df_commercial_huge_profit.groupby('genres').budget_adj.count().plot(figsize=(20,10),kind = 'bar',label='The Number of Movies',grid=True)
plt.legend();

In [None]:
df_commercial_lost= df_commercial[(df_commercial.revenue_adj - df_commercial.budget_adj) < 0]
df_commercial_lost.groupby('genres').budget_adj.count().plot(figsize=(20,10),kind = 'bar',label='The Number of Movies',grid=True)
plt.legend();

In this last analysis, I wanted to see the counts of movies in the profit categories on the same bar plot, to see whether profitability is correlated with genre.

In [None]:
huge_proft2 = (df_commercial.revenue_adj - df_commercial.budget_adj) >= 10**8
big_profit2 = ((df_commercial.revenue_adj - df_commercial.budget_adj) >= 10**7)  & ((df_commercial.revenue_adj - df_commercial.budget_adj) < 10**8)
small_profit2 = ((df_commercial.revenue_adj - df_commercial.budget_adj) >= 10**6) & ((df_commercial.revenue_adj - df_commercial.budget_adj) < 10**7)
lost_3 = (df_commercial.revenue_adj - df_commercial.budget_adj) <= 0

In [None]:
df_commercial.genres[lost_3].value_counts().plot(figsize=(20,10),kind = 'bar',label='Lost',grid=True,color='red')
df_commercial.genres[big_profit2].value_counts().plot(figsize=(20,10),kind = 'bar',label='100 Million > Profit > 10 Million',grid=True,color='green')
df_commercial.genres[huge_proft2].value_counts().plot(figsize=(20,10),kind = 'bar',label='Profit > 100 Million',grid=True,color='blue')
plt.legend();
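
One caveat with overlaying several `value_counts()` bar plots is that each series is sorted by its own counts, so the genre order on the x-axis can differ between the overlaid series. A possible alternative, sketched here on made-up data (the `toy` frame and the `masks` dictionary are assumptions for illustration), is to collect the counts into one table whose rows are aligned by genre before plotting.

```python
import pandas as pd

# Made-up genres and profits for illustration only.
toy = pd.DataFrame({
    "genres": ["Action", "Action", "Drama", "Drama", "Drama", "Comedy"],
    "profit": [2e8, -5e6, 5e7, -1e6, 3e8, 2e7],
})

# Same category definitions as the masks used in the notebook.
masks = {
    "Lost": toy["profit"] <= 0,
    "10M to 100M": (toy["profit"] >= 1e7) & (toy["profit"] < 1e8),
    "Over 100M": toy["profit"] >= 1e8,
}

# Building a DataFrame from a dict of Series aligns them on the genre index;
# genres missing from a category become NaN and are filled with 0.
counts = pd.DataFrame(
    {name: toy.loc[mask, "genres"].value_counts() for name, mask in masks.items()}
).fillna(0).astype(int)

print(counts)
```

Calling `counts.plot(kind='bar')` on such a table draws grouped bars that share a single genre order on the x-axis.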

<a id='conclusions'></a>
## Conclusions

### Profit Vs. Budget

In the analysis above, there is no clear correlation between the budget of a movie and how profitable it will be. However, once the budget exceeds 100 Million dollars, the histograms suggest that more than 50% of movies secure a profit of more than 100 Million dollars.
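
The 50% figure is read off the histograms; it could be checked numerically with something like the sketch below. The function name `share_with_huge_profit` and the toy numbers are hypothetical, and the column names are assumed to match the dataset's `budget_adj` and `revenue_adj`.

```python
import pandas as pd

def share_with_huge_profit(frame, min_budget=10**8, min_profit=10**8):
    """Fraction of big-budget movies whose profit reaches min_profit."""
    big = frame[frame["budget_adj"] >= min_budget]
    if big.empty:
        return float("nan")  # no big-budget movies to evaluate
    profit = big["revenue_adj"] - big["budget_adj"]
    # The mean of a boolean series is the share of True values.
    return float((profit >= min_profit).mean())

# Made-up numbers for illustration only.
toy = pd.DataFrame({
    "budget_adj": [1.5e8, 2.0e8, 1.2e8, 5.0e7],
    "revenue_adj": [4.0e8, 2.5e8, 3.0e8, 9.0e8],
})

share = share_with_huge_profit(toy)
print(round(share, 3))
```

Applied to `df`, the returned share would either support or contradict the reading of the plot.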

### Profit Vs. Genre

We can also see from the analysis of profitability vs. genre that for the most common genres the chance of profitability is not that high compared to some less common genres such as Thriller, where the chance of profitability far exceeds 50%.

### Revenue & Budget Over The Years

The plot below shows an interesting observation: the average amount of money spent on movies (adjusted for inflation) has been relatively flat for the past 30 or so years, whereas average revenues fluctuated more sharply over the same period.

In [None]:
df_commercial = df[df.budget_adj >= 10**6]

df_commercial.groupby('release_year').revenue_adj.mean().plot(xlim=(1960,2015),figsize=(20,10),kind = 'line',label='The Average Revenue',grid=True)
df_commercial.groupby('release_year').budget_adj.mean().plot(xlim=(1960,2015),figsize=(20,10),kind = 'line',label='The Average Budget',grid=True)
plt.legend();