# Project: Investigating the TMDb Movies Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

The dataset at hand contains various data about movies, such as cast, genre, budget, and revenue. The main question, I believe, is what makes a movie successful in terms of its revenue-to-budget ratio. A related question is whether budget correlates with revenue. Specifically:
* What factors contribute more to the financial success of a movie?
* Does budget directly correlate with revenue? Or is there a point where increasing the budget becomes pointless?


In [None]:
#importing libraries to be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling


### General Properties

In [None]:
#loading data into a pandas dataframe
df = pd.read_csv('../input/dataset/tmdb-movies.csv')
df.info()

In [None]:
df.head()

We can notice from the previous cell that the dataset has too many columns, some of which will not be useful to our analysis, like id, original_title, homepage, etc.
We can also notice that some fields (e.g. cast) have long entries (lists of names) divided by '|'.
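If the '|'-delimited fields were needed later, one possible way to make them usable is to split each entry into a list and expand it to one row per name. The sketch below is only an illustration: `toy` is a made-up frame with invented titles and names, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real data: two movies with '|'-separated cast entries
toy = pd.DataFrame({
    'original_title': ['Movie A', 'Movie B'],
    'cast': ['Actor 1|Actor 2|Actor 3', 'Actor 2|Actor 4'],
})

# Split each string on '|' into a list, then expand to one row per cast member
cast_long = (
    toy.assign(cast=toy['cast'].str.split('|'))
       .explode('cast')
       .reset_index(drop=True)
)

# Count how many movies each name appears in
appearances = cast_long['cast'].value_counts()
print(appearances)
```

The same pattern would apply to any other field that uses the same separator.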

In [None]:
df.describe()

From the data produced by describe(), we can notice some absurdities. For example, over half of the movies have a budget and revenue of zero, which indicates missing values. Also, some movies have a runtime of zero, which also indicates missing values.
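A quick way to quantify zeros that stand in for missing values is to count them per column. The snippet below is a minimal sketch on a small invented frame (`toy`); on the real data the same expressions could be applied to the `budget`, `revenue` and `runtime` columns of `df`.

```python
import pandas as pd

# Hypothetical sample mimicking the zero-as-missing pattern
toy = pd.DataFrame({
    'budget':  [0, 150000000, 0, 20000000],
    'revenue': [0, 500000000, 1000000, 0],
    'runtime': [90, 120, 0, 105],
})

# Number of zero entries in each column
zero_counts = (toy == 0).sum()

# Fraction of zero entries in each column
zero_share = (toy == 0).mean()

print(zero_counts)
print(zero_share)
```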

In [None]:
df[df['budget']!=0].count()

The cell above checks the zero values further by counting the entries in each column for movies with a nonzero budget.

### Data Cleaning:

1-We will start by removing unneeded columns from the dataset to ease our way through cleaning the rest.

In [None]:
#dropping unneeded columns from the data
unneeded = ['id', 'imdb_id', 'homepage', 'tagline', 'keywords', 'overview', 'production_companies', 'release_date']
df.drop(unneeded, axis=1, inplace=True)
df.info()

We can notice that there are missing values in some fields (cast, director and genres). At this point, we are not sure whether we will include those fields in our final analysis, so we will keep them and only take note of the missing values.

Before moving further, we will check for duplicates.

In [None]:
sum(df.duplicated())

We discovered that there is one duplicate item, so we will remove it.

In [None]:
df.drop_duplicates(inplace = True)
sum(df.duplicated())

2-In this part we will try to address the problem of missing budget and revenue values. We will begin by checking which characteristics (release year and popularity) are common among those movies.

In [None]:
df['release_year'].hist()
plt.title('Movies missing budget data from all movies by release year')
plt.xlabel('Release Year')
plt.ylabel('Movie Count')
a = df[df['budget'] == 0].release_year.hist()
a.legend(['All Movies','Missing Budget Data']);

In [None]:
df['popularity'].hist(bins = 50)
plt.title('Movies missing budget data from all movies by popularity')
plt.xlabel('Popularity')
plt.ylabel('Movie Count')
a = df[df['budget'] == 0].popularity.hist(bins = 50)
a.legend(['All Movies','Missing Budget Data']);

From the above graph, we can observe that less popular movies are the ones that lack budget information. Since we are analysing movie success based on revenue and budget, we have no option but to drop those rows.
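The visual impression from the histograms could also be checked numerically by comparing the median popularity of movies with and without budget data. This is a sketch on invented numbers (`toy`), so the printed values say nothing about the real dataset; the labels `'missing'` and `'reported'` are names chosen here for illustration.

```python
import pandas as pd

# Hypothetical sample: popularity scores with some zero (missing) budgets
toy = pd.DataFrame({
    'popularity': [0.2, 0.3, 0.4, 1.5, 2.5, 3.0],
    'budget':     [0, 0, 0, 10000000, 50000000, 0],
})

# Label each row by whether its budget is recorded as zero
budget_status = toy['budget'].eq(0).map({True: 'missing', False: 'reported'})

# Median popularity within each group
median_popularity = toy.groupby(budget_status)['popularity'].median()
print(median_popularity)
```

The median is used instead of the mean because popularity scores tend to have a few very large values.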

In [None]:
df = df[df['revenue'] != 0]
df = df[df['budget'] != 0]
df.describe()

In [None]:
df[df['budget_adj'] < 1000000]

After looking up some of the movies that have absurdly low values for revenue and budget, it seems that those values were recorded in millions instead of dollars (e.g. 11 is actually 11 million dollars). However, there are many other issues with these entries (e.g. incorrect values), and they are not a significant portion of the dataset, so they will be dropped.

In [None]:
#removing movies with unrealistically low budget or revenue
df = df[df['budget_adj'] > 1000000]
df = df[df['revenue_adj'] > 1000000]
df.info()

In [None]:
df.describe()

To keep the analysis simple, we will drop the cast, director and genres columns.

In [None]:
#Further dropping unneeded columns
df.drop(['cast', 'director', 'genres'], axis=1, inplace=True)
df.info()

<a id='eda'></a>
## Exploratory Data Analysis


### What factors contribute more to the financial success of a movie?

We will begin by adding a column to the data with a metric named success, which is the ratio of revenue to budget.

In [None]:
df['success'] = df['revenue']/df['budget']

We will first define a function for scatterplots of success against a variable, since we will be making a lot of those.

In [None]:
def scatter_success(aspect, data):
    """Scatter plot of the 'success' column of data against the column named aspect."""
    data.plot(x=aspect, y='success', kind='scatter')
    plt.title("Success Vs {}".format(aspect));

We will look at scatterplots of movie success against various aspects.

In [None]:
scatter_success('budget_adj',df)
scatter_success('runtime',df)
scatter_success('release_year',df)
scatter_success('popularity',df)
scatter_success('vote_average',df)

We can try excluding overly successful movies (those with a success ratio of 10 or more) in the hope of getting clearer graphs or correlations.

In [None]:
df_normal = df[df['success'] < 10]
scatter_success('budget_adj',df_normal)
scatter_success('runtime',df_normal)
scatter_success('release_year',df_normal)
scatter_success('popularity',df_normal)
scatter_success('vote_average',df_normal)
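Scatterplots can be complemented by a single number per aspect. One option, sketched here on invented data (`toy`), is the Pearson correlation of each numeric column with `success`. Pearson only captures linear relationships, so a value near zero does not rule out other kinds of dependence.

```python
import pandas as pd

# Hypothetical sample with the same column names as the cleaned dataset
toy = pd.DataFrame({
    'budget_adj':   [5e6, 2e7, 5e7, 1e8, 2e8],
    'runtime':      [95, 110, 102, 130, 145],
    'popularity':   [0.8, 1.2, 2.0, 3.5, 6.0],
    'vote_average': [6.1, 5.8, 7.0, 6.5, 7.4],
    'success':      [4.0, 1.5, 2.2, 0.9, 3.1],
})

# Pearson correlation of every column with 'success', excluding 'success' itself
corr_with_success = toy.corr()['success'].drop('success').sort_values()
print(corr_with_success)
```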

### Does budget directly correlate with revenue? Or is there a point where increasing the budget becomes pointless?

* We can begin by looking at histograms of movie budgets and revenues (adjusted for inflation) to see how they are distributed

In [None]:
df['budget_adj'].hist()
plt.xlabel('Budget')
plt.ylabel('Movie Count')
plt.title('Budget Distribution');

In [None]:
df['revenue_adj'].hist()
plt.xlabel('Revenue')
plt.ylabel('Movie Count')
plt.title('Revenue Distribution');

We can see that the two histograms have similar shapes: there are far more low-budget movies than high-budget ones. Movies with higher budgets are generally more popular and earn higher revenue, but that does not necessarily mean they are successful in terms of the revenue-to-budget ratio.

We then look for a correlation between budget and revenue.

In [None]:
df.plot(x='budget_adj',y='revenue_adj',kind='scatter')
plt.title('Revenue Vs Budget');
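Both parts of the question could also be approached numerically. The sketch below, on invented numbers (`toy`), computes the Pearson correlation between budget and revenue, then groups movies into budget quartiles with `pd.qcut` and compares the median revenue-to-budget ratio per quartile. The bin labels and the `ratio` column are names introduced here for illustration, and the pattern in the output reflects the toy data only.

```python
import pandas as pd

# Hypothetical sample: inflation-adjusted budgets and revenues in dollars
toy = pd.DataFrame({
    'budget_adj':  [2e6, 5e6, 1e7, 2e7, 4e7, 8e7, 1.5e8, 2.5e8],
    'revenue_adj': [1e7, 1.5e7, 4e7, 5e7, 9e7, 1.6e8, 2.4e8, 3.0e8],
})

# Linear (Pearson) correlation between budget and revenue
pearson_r = toy['budget_adj'].corr(toy['revenue_adj'])

# Revenue earned per dollar of budget
toy['ratio'] = toy['revenue_adj'] / toy['budget_adj']

# Split movies into four equally sized groups by budget
toy['budget_bin'] = pd.qcut(
    toy['budget_adj'], q=4,
    labels=['low', 'mid-low', 'mid-high', 'high'],
)

# Median ratio within each budget quartile
median_ratio = toy.groupby('budget_bin', observed=True)['ratio'].median()

print(pearson_r)
print(median_ratio)
```

A median ratio that shrinks from the lowest to the highest quartile would suggest diminishing returns on budget, even when revenue itself keeps growing.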

<a id='conclusions'></a>
## Conclusions

From the above graphs and analysis, we can conclude that movie success is very hard to predict before the movie is released. The problem is too complicated to be solved with mere numbers. Even after excluding overly successful movies, no clear correlation can be found. Most probably, this is due to the unquantifiable nature of some of the aspects of movies that control their success.

**Limitations:**
As mentioned in the previous paragraph, many immeasurable factors contribute to movie success. For example, events occurring at the time of a movie's release can greatly affect its success, as we have seen this year with the COVID-19 pandemic affecting many movies. Thus, it is almost impossible to figure out the factors that affect movie success based on simple statistical analysis alone, because some of those factors cannot be measured.