# Data Cleaning

We will explore the following features of the movie industry, listed here
with the datasets needed to analyze each one:

- Genre: `imdb.title.basics.csv.bz2`
- Runtime: `imdb.title.basics.csv.bz2`
- Budget Allocation: `tn.movie_budgets.csv.bz2`
- Release Window: `tn.movie_budgets.csv.bz2`, `tmdb.movies.csv.bz2`
- Director: `imdb.title.crew.csv.bz2`, `imdb.name.basics.csv.bz2`

Consolidated, the list of datasets is:

- `tn.movie_budgets.csv.bz2`
- `tmdb.movies.csv.bz2`
- `imdb.title.crew.csv.bz2`
- `imdb.name.basics.csv.bz2`
- `imdb.title.basics.csv.bz2`

Our methodology will be as follows for each dataset:

1. Identify wrongly encoded data types
2. Impute/Drop missing values
3. Drop columns which aren't required

Finally, we will try to unify our datasets into one CSV file for simple loading.
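As a rough illustration of the first two steps, the sketch below reports each column's dtype and missing-value count in one table. The helper name `summarize_columns` and the toy frame are assumptions made for this example; they are not part of the project's code.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and missing-value counts."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
    })


# Toy frame standing in for a raw dataset (illustrative values only)
example = pd.DataFrame({
    "id": [1, 2, 3],
    "movie": ["A", "B", "C"],
    "production_budget": ["$1,000", "$2,500", None],
})

summary = summarize_columns(example)
print(summary)
```

A column that should be numeric but is reported as a string or `object` dtype is a candidate for step 1, and a non-zero `n_missing` is a candidate for step 2.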

In [None]:
# filter out warnings
from warnings import filterwarnings
filterwarnings("ignore")

In [2]:
# import our required libraries
import pandas as pd
import numpy as np
from src.tools import currency_string_to_float

## tn.movie_budgets.csv.bz2

In [3]:
# import our dataset, display info and head
tn_movie_budgets = pd.read_csv("../data/raw/tn.movie_budgets.csv.bz2")
tn_movie_budgets.info()
display(tn_movie_budgets.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


The dataset has no missing values, but several columns are stored as `object` (string) dtype when they should be typed: `release_date` should be a datetime, and `production_budget`, `domestic_gross`, and `worldwide_gross` should be numeric.

In [4]:
# convert release_date to a datetime instance
tn_movie_budgets.release_date = pd.to_datetime(tn_movie_budgets.release_date)
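The call above lets pandas infer the date format. As an optional variation, the sketch below passes an explicit format string matching values such as `Dec 18, 2009`, so that unexpected formats raise an error instead of being guessed. It assumes an English locale for the abbreviated month names, and the sample values are copied from the head of the dataset.

```python
import pandas as pd

# Sample values copied from the head of the dataset
sample_dates = pd.Series(["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019"])

# %b = abbreviated month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(sample_dates, format="%b %d, %Y")

# Once the column is a datetime, date parts are available via the .dt accessor
release_month = parsed.dt.month
release_year = parsed.dt.year
print(parsed.dtype, release_month.tolist(), release_year.tolist())
```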

In [5]:
# convert our budget and gross columns to float
# (applymap is deprecated as of pandas 2.1, where DataFrame.map replaces it)
cols_to_convert = ['production_budget', 'domestic_gross', 'worldwide_gross']
result = tn_movie_budgets[cols_to_convert].applymap(currency_string_to_float)
tn_movie_budgets[cols_to_convert] = result
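The `currency_string_to_float` helper is imported from `src.tools` and its source is not shown here. The sketch below is a hypothetical stand-in, named `parse_currency` to make clear it is not the project's implementation, and it assumes values formatted like `$425,000,000`. It also shows a column-wise alternative that uses pandas string methods.

```python
import pandas as pd


def parse_currency(value: str) -> float:
    """Hypothetical stand-in for src.tools.currency_string_to_float."""
    # Strip the dollar sign and thousands separators, then convert
    cleaned = value.strip().replace("$", "").replace(",", "")
    return float(cleaned)


# Sample values copied from the head of the dataset
sample = pd.DataFrame({
    "production_budget": ["$425,000,000", "$410,600,000"],
    "domestic_gross": ["$760,507,625", "$241,063,875"],
})

# Column-wise alternative: remove "$" and "," with a regex, then cast to float
converted = sample.apply(
    lambda col: col.str.replace(r"[$,]", "", regex=True).astype(float)
)
print(converted.dtypes)
```

The column-wise version avoids calling a Python function once per cell, which may matter for larger frames; for a dataset of this size either approach should be adequate.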

In [6]:
# finally drop the id column
tn_movie_budgets.drop(columns="id", errors="ignore", inplace=True)
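After these steps it can be useful to confirm that the frame has the expected shape. The sketch below is one possible check, run here on a small hand-built frame; the function name `check_cleaned` and the specific checks are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_float_dtype

MONEY_COLS = ["production_budget", "domestic_gross", "worldwide_gross"]


def check_cleaned(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if "id" in df.columns:
        problems.append("id column still present")
    if not is_datetime64_any_dtype(df["release_date"]):
        problems.append("release_date is not a datetime")
    for col in MONEY_COLS:
        if not is_float_dtype(df[col]):
            problems.append(f"{col} is not float")
        elif df[col].isna().any():
            problems.append(f"{col} has missing values after conversion")
    return problems


# Hand-built frame mimicking the cleaned result (values from the dataset head)
cleaned_sample = pd.DataFrame({
    "release_date": pd.to_datetime(["2009-12-18", "2011-05-20"]),
    "movie": ["Avatar", "Pirates of the Caribbean: On Stranger Tides"],
    "production_budget": [425000000.0, 410600000.0],
    "domestic_gross": [760507625.0, 241063875.0],
    "worldwide_gross": [2776345279.0, 1045663875.0],
})
print(check_cleaned(cleaned_sample))
```

In the notebook itself, the same function could be called on `tn_movie_budgets` once the cells above have run.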