# Project: Investigate TMDB Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> The dataset I selected contains information about roughly 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

## Questions

These are the questions I want to analyze using this dataset:
- Which genres receive higher user ratings?
- What kinds of properties are associated with movies that have high revenues?
- Do higher-budget movies have higher user ratings?

In [173]:
# import packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
sns.set_style('darkgrid')

<a id='wrangling'></a>
## Data Wrangling

> In this section of the report, I will load in the data, check for cleanliness, and then trim and clean my dataset for analysis.

### General Properties

In [174]:
# load data
file_path = 'tmdb_movies.csv'
movie_df = pd.read_csv(file_path)

In [176]:
# view data
movie_df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


In [177]:
# shape of the data
print('Number of Movies: {}'.format(movie_df.shape[0]))
print('Number of Movie Attributes: {}'.format(movie_df.shape[1]))

Number of Movies: 10866
Number of Movie Attributes: 21


In [178]:
# number of nulls in each column
movie_df.isnull().sum()

id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

### Observation 1
After checking each column for nulls, this is what I found:
- **homepage** has `7930` nulls
- **tagline** has `2824` nulls
- **keywords** has `1493` nulls
- **overview** has `4` nulls
- **imdb_id** has `10` nulls
- **release_date** has `0` nulls

I decided to drop all six of these columns from my dataframe. `homepage`, `tagline`, and `keywords` contain a lot of nulls, while `overview`, `imdb_id`, and `release_date` have few or none but are not useful for my analysis.

Other columns (`cast`, `director`, `genres`, and `production_companies`) also contain nulls; I handle those in the next steps.
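
One way to make the "lot of nulls" judgment concrete is to look at the share of missing values per column rather than the raw counts. The sketch below is only an illustration: `toy_df` is a made-up stand-in for `movie_df`, and the 50% cutoff is an assumed threshold, not one used in this analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movie_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'original_title': ['A', 'B', 'C', 'D'],
    'homepage': [np.nan, np.nan, 'http://example.com', np.nan],
    'genres': ['Drama', np.nan, 'Comedy', 'Action'],
})

# isnull().mean() gives the fraction of missing values in each column.
null_share = toy_df.isnull().mean().sort_values(ascending=False)

# Assumed cutoff: flag columns that are missing in more than half of the rows.
mostly_null = null_share[null_share > 0.5].index.tolist()

print(null_share)
print(mostly_null)
```

Running the same two lines on `movie_df` before dropping anything would show each column's null share next to the counts listed above.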

In [179]:
# drop imdb_id, homepage, tagline, keywords, overview, and release_date columns
drop_columns = ['imdb_id', 'homepage', 'tagline', 'keywords', 'overview', 'release_date']
movie_df.drop(columns = drop_columns, inplace = True)

In [180]:
# verify with shape
print('Number of Movies: {}'.format(movie_df.shape[0]))
print('Number of Movie Attributes: {}'.format(movie_df.shape[1]))

Number of Movies: 10866
Number of Movie Attributes: 15


In [181]:
# finding duplicates
movie_df[movie_df.duplicated()]

Unnamed: 0,id,popularity,budget,revenue,original_title,cast,director,runtime,genres,production_companies,vote_count,vote_average,release_year,budget_adj,revenue_adj
2090,42194,0.59643,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,Dwight H. Little,92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,110,5.0,2010,30000000.0,967000.0


In [182]:
movie_df.drop_duplicates(inplace = True)

In [183]:
# verify duplicates
print('Your data contains {} duplicates'.format(movie_df.duplicated().sum()))

Your data contains 0 duplicates


### Observation 2
After inspecting my dataframe for duplicates, I found only one duplicated row (`TEKKEN`) and dropped it.
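
By default, `duplicated()` marks only the second and later copies of a row, which is why a single row shows up in the check. A minimal sketch on a toy frame (a stand-in for `movie_df`, with only two illustrative columns) shows how `keep=False` displays every copy side by side before anything is dropped:

```python
import pandas as pd

# Toy stand-in for movie_df with one fully duplicated row.
toy_df = pd.DataFrame({
    'id': [42194, 42194, 135397],
    'original_title': ['TEKKEN', 'TEKKEN', 'Jurassic World'],
})

# keep=False marks every member of a duplicate group, not just the repeats.
both_copies = toy_df[toy_df.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy_df.drop_duplicates()

print(both_copies)
print(deduped)
```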

In [184]:
# finding rows with missing values
movie_df[movie_df.isnull().any(axis = 1)]

Unnamed: 0,id,popularity,budget,revenue,original_title,cast,director,runtime,genres,production_companies,vote_count,vote_average,release_year,budget_adj,revenue_adj
228,300792,0.584363,0,0,Racing Extinction,Elon Musk|Jane Goodall|Louie Psihoyos|Leilani ...,Louie Psihoyos,90,Adventure|Documentary,,36,7.8,2015,0.0,0.0
259,360603,0.476341,0,0,Crown for Christmas,Danica McKellar|Rupert Penry-Jones|Ellie Botte...,Alex Zamm,84,TV Movie,,10,7.6,2015,0.0,0.0
295,363483,0.417191,0,0,12 Gifts of Christmas,Katrina Law|Donna Mills|Aaron O'Connell|Melani...,Peter Sullivan,84,Family|TV Movie,,12,6.3,2015,0.0,0.0
298,354220,0.370258,0,0,The Girl in the Photographs,Kal Penn|Claudia Lee|Kenny Wormald|Toby Heming...,Nick Simon,95,Crime|Horror|Thriller,,10,4.7,2015,0.0,0.0
328,308457,0.367617,0,0,Advantageous,Jacqueline Kim|James Urbaniak|Freya Adams|Ken ...,Jennifer Phang,92,Science Fiction|Drama|Family,,29,6.4,2015,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10804,15867,0.149259,0,0,Interiors,Diane Keaton|Kristin Griffith|Mary Beth Hurt|R...,Woody Allen,93,Drama,,35,6.3,1978,0.0,0.0
10806,24998,0.138635,0,0,Gates of Heaven,Lucille Billingsley|Zella Graham|Cal Harberts|...,Errol Morris,85,Documentary,,12,5.9,1978,0.0,0.0
10816,16378,0.064602,0,0,The Rutles: All You Need Is Cash,Eric Idle|John Halsey|Ricky Fataar|Neil Innes|...,Eric Idle|Gary Weis,76,Comedy,,14,6.0,1978,0.0,0.0
10842,36540,0.253437,0,0,Winnie the Pooh and the Honey Tree,Sterling Holloway|Junius Matthews|Sebastian Ca...,Wolfgang Reitherman,25,Animation|Family,,12,7.9,1966,0.0,0.0


### Observation 3
After inspecting the rows with missing values, I decided to drop them from my dataframe.
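
Before dropping, it can help to check how many rows the drop will remove, and how that compares with dropping only on the columns a given question needs. The sketch below uses a toy frame with made-up values; restricting the drop to `genres` is a hypothetical alternative, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movie_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'original_title': ['A', 'B', 'C', 'D'],
    'cast': ['x', np.nan, 'y', 'z'],
    'genres': ['Drama', 'Comedy', np.nan, 'Action'],
    'production_companies': [np.nan, 'P', 'Q', 'R'],
})

# Number of rows with at least one missing value.
rows_with_nulls = toy_df.isnull().any(axis=1).sum()

# Drop rows with a null in any column (the approach taken in this notebook).
all_cols = toy_df.dropna()

# Hypothetical alternative: drop only rows where genres is missing.
genres_only = toy_df.dropna(subset=['genres'])

print(rows_with_nulls, len(all_cols), len(genres_only))
```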

In [185]:
# drop rows with missing values in any column
movie_df.dropna(inplace = True)

In [186]:
# verify that no missing values remain
print('Your data contains {} missing values'.format(movie_df.isnull().sum().sum()))

Your data contains 0 missing values


In [187]:
# inspect data types
movie_df.dtypes

id                        int64
popularity              float64
budget                    int64
revenue                   int64
original_title           object
cast                     object
director                 object
runtime                   int64
genres                   object
production_companies     object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object

### Observation 4
After inspecting the data types, I decided to do the following:
- Convert the `budget` and `revenue` columns to floats

In [188]:
# Convert budget and revenue columns to floats
movie_df[['budget', 'revenue']] = movie_df[['budget', 'revenue']].astype(float)

In [189]:
# verify data types
movie_df.dtypes

id                        int64
popularity              float64
budget                  float64
revenue                 float64
original_title           object
cast                     object
director                 object
runtime                   int64
genres                   object
production_companies     object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object

In [190]:
# Checking for consistency
movie_df.describe()

Unnamed: 0,id,popularity,budget,revenue,runtime,vote_count,vote_average,release_year,budget_adj,revenue_adj
count,9772.0,9772.0,9772.0,9772.0,9772.0,9772.0,9772.0,9772.0,9772.0,9772.0
mean,63189.64081,0.694721,16179670.0,44231210.0,102.926627,239.312014,5.963528,2000.878428,19415990.0,57053090.0
std,90718.059987,1.036931,32210740.0,122588900.0,27.877432,603.011504,0.913174,13.036794,35666340.0,151449900.0
min,5.0,0.000188,0.0,0.0,0.0,10.0,1.5,1960.0,0.0,0.0
25%,10221.5,0.23271,0.0,0.0,90.0,18.0,5.4,1994.0,0.0,0.0
50%,18677.5,0.419762,200000.0,0.0,100.0,46.0,6.0,2005.0,306134.2,0.0
75%,70577.25,0.776408,19287500.0,31047290.0,112.0,173.0,6.6,2011.0,24642680.0,43118480.0
max,417859.0,32.985763,425000000.0,2781506000.0,877.0,9767.0,8.7,2015.0,425000000.0,2827124000.0
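
The summary above shows a minimum of 0 for `budget`, `revenue`, and `runtime`, and a median `revenue` of 0. It may be that zeros stand in for unreported values; that is an assumption, not something the dataset states. Under that assumption, a rough sketch for counting the zeros and marking them as missing could look like this (`toy_df` is a made-up stand-in for `movie_df`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for movie_df; the values are for illustration only.
toy_df = pd.DataFrame({
    'budget': [150000000.0, 0.0, 0.0, 30000000.0],
    'revenue': [1513528810.0, 0.0, 0.0, 967000.0],
    'runtime': [124, 90, 0, 92],
})

check_cols = ['budget', 'revenue', 'runtime']

# Count how many entries in each column are exactly zero.
zero_counts = (toy_df[check_cols] == 0).sum()

# Assumption: zeros mean "not reported", so mark them as NaN
# so that means and medians skip them.
with_nan = toy_df[check_cols].replace(0, np.nan)

print(zero_counts)
print(with_nan.isnull().sum())
```

Whether to treat zeros this way depends on the question being asked; for revenue-related questions it would shrink the usable sample considerably.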


In [191]:
movie_df.head()

Unnamed: 0,id,popularity,budget,revenue,original_title,cast,director,runtime,genres,production_companies,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,32.985763,150000000.0,1513529000.0,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,5562,6.5,2015,137999900.0,1392446000.0
1,76341,28.419936,150000000.0,378436400.0,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,6185,7.1,2015,137999900.0,348161300.0
2,262500,13.112507,110000000.0,295238200.0,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,Robert Schwentke,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,2480,6.3,2015,101200000.0,271619000.0
3,140607,11.173104,200000000.0,2068178000.0,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,J.J. Abrams,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,5292,7.5,2015,183999900.0,1902723000.0
4,168259,9.335014,190000000.0,1506249000.0,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,James Wan,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,2947,7.3,2015,174799900.0,1385749000.0
