# Investigate the TMDb Movie Data

## Data Wrangling

In this section, we first load the data, check it for cleanliness, and then trim and clean the data set for analysis.



### General Properties

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pylab import *
plt.style.use('seaborn-white')  # note: matplotlib >= 3.6 names this style 'seaborn-v0_8-white'
%matplotlib inline

Now, load the data and take a brief look at the data set.

In [None]:
# load the data
filepath = 'tmdb-movies.csv'
df = pd.read_csv(filepath)

# check the number of records, the columns, their data types, and missing values
df.info()

From the above, we get a brief summary of the data structure: how many records and columns there are, which data type each column holds, and which columns have null values.

Then let's output a few rows to see what the data actually looks like.

In [None]:
# output the first five rows to see how the data is presented in the data set.
df.head()

From the above, we can see there are 10866 records. There are three data types in total: ***int***, ***float*** and ***string*** (shown by pandas as `object`). Some columns, like `cast`, `homepage`, `tagline` and `production_companies`, have many null values. Certain columns, like `cast`, `genres` and `production_companies`, contain multiple values separated by pipe ('|') characters. We will need to split these columns to answer our questions later on.

Next, let's take a quick look at the descriptive statistics of the numeric columns in this data set.
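
As a minimal sketch of how the pipe-separated columns could be handled later, the cell below splits a `genres` column into one row per genre using `str.split` and `explode`. The `toy` frame and its titles are made up for illustration and are not rows from the data set.

```python
import pandas as pd

# Toy frame that mimics the pipe-separated layout; the values are made up.
toy = pd.DataFrame({
    'original_title': ['Movie A', 'Movie B'],
    'genres': ['Action|Adventure', 'Drama'],
})

# Split each string on '|' into a list, then expand to one row per genre.
genres_long = (
    toy.assign(genres=toy['genres'].str.split('|'))
       .explode('genres')
       .reset_index(drop=True)
)
print(genres_long)
```

The long format makes it possible to group or count by a single genre, at the cost of repeating the other columns for every genre a movie has.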

In [None]:
# make a brief descriptive statistics summary on the data
df.describe()

The summary above tells us some important numeric facts about the movies, such as:
- The data set covers movies released from 1960 to 2015.
- The average runtime of a movie is 102 mins, but the maximum is 900 mins, which needs a closer look.
- The minimum values of some columns, such as `budget` and `revenue`, are 0, which is unrealistic. According to the explanation on [Kaggle](https://www.kaggle.com/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv), *'it was necessary to treat values of zero in the budget field as missing.'*, so we should keep this in mind when answering budget/revenue related questions.
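
One possible way to treat zeros as missing, sketched here on a made-up `toy` frame rather than on the real data, is to replace them with `NaN` so that pandas skips them in summary statistics. The column names match the data set, but the numbers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy budget/revenue values; zeros stand for "not recorded".
toy = pd.DataFrame({'budget': [0, 1000000, 0], 'revenue': [500000, 0, 0]})

# Replace zeros with NaN so that mean(), describe(), etc. ignore them.
toy_nan = toy.replace(0, np.nan)

print(toy['budget'].mean())      # zeros pull the mean down
print(toy_nan['budget'].mean())  # mean over the non-missing values only
```

Working on a copy like `toy_nan` keeps the original frame intact for questions that do not involve budget or revenue.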

### Data Cleaning
In this section, we wrangle the original data set, for example by dropping redundant information, grouping and computing, to make the exploratory data analysis and answering our questions easier.

#### Drop the irrelevant columns

As some columns are not useful for answering our questions, we need to delete them from the data set. Let's decide which columns to drop.

In [None]:
# general information about the data in the table.
df.info()

Columns like `id`, `imdb_id`, `cast`, `homepage`, `tagline`, `keywords` and `overview` are all irrelevant to our analysis in this project, so we drop them.

In [None]:
# drop the irrelevant columns.
df.drop(['id', 'imdb_id', 'cast', 'homepage', 'tagline', 'keywords', 'overview'], axis=1, inplace=True)

In [None]:
# check the dataset after dropping.
df.head()

#### Drop the missing values
Now let's look at the missing values in the dataset.

In [None]:
# check which columns have missing values and how many there are
df.isnull().sum()

From the result above, we can see that the columns `director`, `genres` and `production_companies` contain a few null values. As they make up only a small proportion of the data, we simply drop those rows.
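
To back up the "small proportion" judgement with numbers, one option is to take the mean of the boolean null mask. The sketch below uses a made-up `toy` frame; the names `missing_share` and `rows_lost` are illustrative.

```python
import pandas as pd

# Toy frame with one missing director; the values are made up.
toy = pd.DataFrame({
    'director': ['A', None, 'C', 'D'],
    'genres': ['Drama', 'Action', 'Comedy', 'Drama'],
})

# Fraction of missing values per column (mean of a boolean mask).
missing_share = toy.isnull().mean()

# Fraction of rows that dropna() would remove (rows with at least one null).
rows_lost = toy.isnull().any(axis=1).mean()

print(missing_share)
print(rows_lost)
```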

In [None]:
# drop every row that has at least one missing value
df.dropna(inplace=True)

In [None]:
# re-check whether the missing values are successfully dropped or not.
df.isnull().any()

In [None]:
# reset the index
df.reset_index(inplace=True, drop=True)

#### Alter the data type
Next, let's convert the data types of `budget` and `revenue` from ***int*** to ***float***.

In [None]:
# change the data type of budget, revenue to float
df['budget'] = df['budget'].astype(float)
df['revenue'] = df['revenue'].astype(float)

Now let's check the descriptive statistics of the data set again after cleaning.

In [None]:
df.describe()

As we can tell from the summary above, the longest runtime is 877 mins. To check whether this value is an error, I searched for it on [Google](https://www.google.com/search?sxsrf=ALeKk03lXkiNzFw9YwQZ8k338MqIEdPV6A%3A1601458065265&ei=kU90X9_RD4-_0PEP4cCM-A0&q=taken+dreamworks+runtime&oq=taken+dreamworks+runtime&gs_lcp=CgZwc3ktYWIQAzIFCCEQoAEyBQghEKABOgQIIxAnOgQIABAeUP-TB1i1mgdgspsHaABwAHgAgAHYAogBvAmSAQUyLTEuM5gBAKABAaoBB2d3cy13aXrAAQE&sclient=psy-ab&ved=0ahUKEwif3uiYyJDsAhWPHzQIHWEgA98Q4dUDCA0&uact=5) and found that it is a TV mini series, so let's leave it as it is for the moment.
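
A record like this can be located directly in the frame with `idxmax`. The sketch below shows the idea on a made-up `toy` frame; the titles and runtimes are illustrative assumptions, not values from the data set.

```python
import pandas as pd

# Toy frame; the titles and runtimes are made up.
toy = pd.DataFrame({
    'original_title': ['Short Film', 'Feature', 'Mini Series'],
    'runtime': [15, 110, 600],
})

# idxmax() returns the index label of the largest runtime; loc fetches that row.
longest = toy.loc[toy['runtime'].idxmax()]
print(longest['original_title'], longest['runtime'])
```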

As for the records where budget or revenue equals 0, there are a large number of them (***see the counts below***). We cannot delete them right now, because removing such a large volume of data would affect the reliability and precision of the analysis of other indicators that do not depend on budget or revenue. So we will keep them for the time being.

In [None]:
# number of records where budget or revenue is 0
df[(df.budget==0) | (df.revenue==0)]['budget'].count()

In [None]:
# number of records where both budget and revenue are 0
df[(df.budget==0) & (df.revenue==0)]['budget'].count()