In [None]:
# Import libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
# read datasets
df_credits = pd.read_csv("dataset/tmdb_5000_credits.csv")
df_movies = pd.read_csv("dataset/tmdb_5000_movies.csv")

Load the datasets and inspect the first five rows of each.

In [None]:
df_credits.head()

In [None]:
df_movies.head()

In [None]:
# Check the shape of the first dataset
df_credits.shape

In [None]:
# Check the shape of the second dataset
df_movies.shape

In [None]:
# Check for the duplicates in the first dataset
sum(df_credits.duplicated())

In [None]:
# Check for the duplicates in the second dataset
sum(df_movies.duplicated())

In [None]:
# Check for the datatypes of each variable in the first dataset
df_credits.dtypes

In [None]:
# Check for the datatypes of each variable in the second dataset
df_movies.dtypes

In [None]:
# Check for unique values in each variable in the first dataset
df_credits.nunique()

In [None]:
# Check for unique values in each variable in the second dataset
df_movies.nunique()

**Issues**
1. There are 4803 movies in total but only 4800 unique titles, which suggests that some titles are repeated and may be duplicates.
2. There are 4800 unique titles but 4801 unique original titles, so the two variables need further investigation.
3. There are 4802 unique popularity values instead of 4803, which means two movies share a value and may be duplicates.
4. `status` has only 3 unique values, so we need to check whether all observations are needed or whether some can be removed.
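
The comparison behind these issues (row count versus number of distinct values) can be sketched on a toy frame. The data below is made up for illustration and only stands in for `df_movies`; note that a repeated value in one column is not the same thing as a fully duplicated row.

```python
import pandas as pd

# Toy stand-in for df_movies; the values are invented for illustration.
toy = pd.DataFrame({
    "title": ["Batman", "Batman", "The Host", "Up"],
    "popularity": [10.5, 3.2, 7.1, 7.1],
})

# Rows minus distinct values = number of surplus repeats in each column.
repeats = len(toy) - toy.nunique()

# Fully duplicated rows are a separate, stricter check.
full_row_duplicates = int(toy.duplicated().sum())

print(repeats)
print(full_row_duplicates)
```

Here both columns contain one repeated value, yet no row is duplicated as a whole, which is why a lower unique count alone does not prove that records are duplicated.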

In [None]:
df_credits.info()

In [None]:
df_movies.info()

**Null values are present in several variables: `homepage`, `overview`, `release_date`, `runtime`, and `tagline`.**

In [None]:
df_movies.status.value_counts()

**Almost all observations have the status `Released`. We will therefore limit the dataset to this status, since the few movies with other statuses add little to the analysis.**
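
To put a number on "almost all", the counts can be turned into shares with `value_counts(normalize=True)`. This is a minimal sketch on an invented series, not the actual `status` column.

```python
import pandas as pd

# Invented status values standing in for df_movies["status"].
status = pd.Series(["Released", "Released", "Released", "Rumored"], name="status")

# normalize=True returns the fraction of rows per category instead of counts.
share = status.value_counts(normalize=True)
print(share)
```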

In [None]:
df_movies.isnull().sum()

**`homepage` has by far the most null values, followed by `tagline`. We may not need either variable in the rest of the analysis.**
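
Raw null counts are easier to compare as a share of all rows. The sketch below uses a made-up frame with the same column names; the values and the placeholder URL are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data that mimics the pattern of missing values described above.
toy = pd.DataFrame({
    "homepage": [np.nan, "http://example.com", np.nan, np.nan],
    "tagline": ["a", np.nan, "c", "d"],
    "runtime": [90.0, 120.0, 101.0, 95.0],
})

# The mean of a boolean mask is the fraction of missing values per column.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Columns with a high share of missing values are candidates for dropping, while columns with only a few nulls are usually better handled by dropping the affected rows.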

In [None]:
# Check the instances where original title do not match with title column
df_movies[['original_title', 'title']].query('original_title != title')

**There are 261 instances where `original_title` doesn't match `title`, so the number of unique values can legitimately differ between the two columns. This does not explain the repeated titles, however, so we still need to explore the `title` column further, which we will do in the _Further Exploration and Cleaning_ section.**

### Further Exploration and Cleaning 

In [None]:
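# Keep only the movies with status 'Released'.
# .copy() gives an independent DataFrame, so the in-place edits below
# do not raise SettingWithCopyWarning or depend on df_movies.
cl_mov = df_movies[df_movies['status'] == "Released"].copy()
cl_mov.head(2)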

In [None]:
# confirm only status is "Released"
cl_mov.status.unique()

In [None]:
# Drop columns not required further for analysis
cl_mov.drop(['homepage', 'keywords', 'original_language', 'original_title', 'tagline', 'overview', 'spoken_languages', 'status'], axis=1, inplace = True)
cl_mov.head(2)
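
The filter-then-drop step can be sketched on a small frame. This is an illustrative version under the assumption that we want the cleaned frame to be independent of the original; the toy data and the name `released` are invented here.

```python
import pandas as pd

# Invented rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "status": ["Released", "Rumored", "Released"],
    "homepage": [None, None, "http://example.com"],
})

# .copy() makes the filtered frame independent of `toy`.
released = toy[toy["status"] == "Released"].copy()

# Reassigning the result of drop() is an alternative to inplace=True.
released = released.drop(columns=["homepage", "status"])
print(released)
```

The original frame keeps all of its columns, so it can still be consulted later if a dropped variable turns out to be useful.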

In [None]:
# Check the shape again of cl_mov
cl_mov.shape

In [None]:
# Check the null counts again of cl_mov
cl_mov.isnull().sum()

In [None]:
# Drop the remaining rows that contain null values
cl_mov.dropna(inplace = True)

In [None]:
# Check the shape again of cl_mov
cl_mov.shape

In [None]:
# Check for duplicates in the title variable in cl_mov
cl_mov[cl_mov['title'].duplicated()]

In [None]:
cl_mov[cl_mov['title'].str.contains('Out of the Blue')]

In [None]:
cl_mov[cl_mov['title'].str.contains('The Host')]

In [None]:
cl_mov[cl_mov['title'].str.contains('Batman')]

**Our initial assumption was wrong: two or more different movies can share the same title, so repeated titles are not necessarily duplicate records.**
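
Two details of the checks above are worth a small sketch: `duplicated(keep=False)` lists every row of a repeated title rather than only the later ones, and `str.contains` matches substrings, so it returns more than exact title matches. The frame and the `movie_id` values below are made up.

```python
import pandas as pd

# Invented frame; movie_id values are arbitrary.
toy = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "title": ["The Host", "The Host", "Batman", "Batman Begins"],
})

# keep=False marks every member of a repeated-title group.
same_title = toy[toy["title"].duplicated(keep=False)]

# Exact comparison matches only "Batman"; a substring search also
# matches "Batman Begins".
exact = toy[toy["title"] == "Batman"]
substring = toy[toy["title"].str.contains("Batman", regex=False)]

print(same_title)
print(len(exact), len(substring))
```

Viewing all rows of a repeated title side by side makes it easier to judge whether they are different movies or true duplicates.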

In [None]:
# Work on a copy so that df_credits itself is left unchanged
cl_cr = df_credits.copy()

In [None]:
cl_cr.drop(['crew'], axis=1, inplace=True)
cl_cr.head(2)

In [None]:
cl_mov['release_date'] = pd.to_datetime(cl_mov['release_date'])
cl_mov.dtypes
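
If a date column might still contain missing or malformed entries, `pd.to_datetime` accepts `errors="coerce"`, which turns anything it cannot parse into `NaT` instead of raising. A minimal sketch on invented strings:

```python
import pandas as pd

# Invented values: one valid date, one malformed string, one missing entry.
toy = pd.DataFrame({"release_date": ["2009-12-10", "not a date", None]})

# errors="coerce" converts unparseable entries to NaT.
parsed = pd.to_datetime(toy["release_date"], errors="coerce")

# Once the column is datetime, parts such as the year are available via .dt.
release_year = parsed.dt.year

print(parsed)
print(release_year)
```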

<a id='eda'></a>
## Exploratory Data Analysis

We are now ready for **Exploratory Data Analysis**. Exploring the data more deeply and looking at specific areas will help us form the questions for further research and analysis. As before, we will proceed step by step.


In [None]:
cl_mov.head(2)

Look at the descriptive statistics of `cl_mov`.

In [None]:
# Look at the descriptive statistics of the data
cl_mov.describe()

In [None]:
cl_cr.head(2)

In [None]:
mean_rev = cl_mov['revenue'].mean()
mean_rev

In [None]:
mean_bud = cl_mov['budget'].mean()
mean_bud

Although the values look reasonable, let's check whether there are any zero values in `revenue` and in the related variable `budget`.

In [None]:
cl_mov.query('revenue == 0 or budget == 0')

In [None]:
cl_mov.query('revenue == 0 or budget == 0').count()

Replace the zero values in both columns with their respective means. Note that `mean_rev` and `mean_bud` were computed above with the zeros still included.

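In [None]:
# Replace zero revenue with the mean revenue, then recount the zero rows
cl_mov.replace({'revenue': {0: mean_rev}}, inplace = True)
cl_mov.query('revenue == 0 or budget == 0').count()

In [None]:
# Replace zero budget with the mean budget; the count should now be zero
cl_mov.replace({'budget': {0: mean_bud}}, inplace = True)
cl_mov.query('revenue == 0 or budget == 0').count()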
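
Because zeros pull the mean down, one alternative is to compute the replacement value from the non-zero rows only. The sketch below shows that variant on invented numbers; it is an assumption about what might be preferable, not the approach taken in the cells above.

```python
import pandas as pd

# Invented figures; zeros stand for missing revenue or budget.
toy = pd.DataFrame({"revenue": [0, 100, 300], "budget": [50, 0, 150]})

for col in ["revenue", "budget"]:
    # Work in float so that a non-integer mean fits the column.
    toy[col] = toy[col].astype(float)
    # Mean over the non-zero rows only, so the zeros do not bias it.
    nonzero_mean = toy.loc[toy[col] != 0, col].mean()
    toy[col] = toy[col].replace(0.0, nonzero_mean)

print(toy)
```

Either way, imputed values flatten the distribution around the mean, so it is worth keeping in mind how many rows were filled in when interpreting later results.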