# Project: Investigate a Dataset (TMDb Movie Data)

###### By Karim El-Dweky

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Data-Wrangling" data-toc-modified-id="Data-Wrangling-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Wrangling</a></span><ul class="toc-item"><li><span><a href="#Data-Gathering" data-toc-modified-id="Data-Gathering-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Data Gathering</a></span></li><li><span><a href="#Data-Assessment" data-toc-modified-id="Data-Assessment-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Data Assessment</a></span><ul class="toc-item"><li><span><a href="#Visual-Assessment" data-toc-modified-id="Visual-Assessment-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Visual Assessment</a></span></li><li><span><a href="#Programmatic-Assessment" data-toc-modified-id="Programmatic-Assessment-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Programmatic Assessment</a></span></li><li><span><a href="#Data-Assessment-Report" data-toc-modified-id="Data-Assessment-Report-2.2.3"><span class="toc-item-num">2.2.3&nbsp;&nbsp;</span>Data Assessment Report</a></span><ul class="toc-item"><li><span><a href="#Quality-Issues:" data-toc-modified-id="Quality-Issues:-2.2.3.1"><span class="toc-item-num">2.2.3.1&nbsp;&nbsp;</span>Quality Issues:</a></span></li><li><span><a href="#Tideness-Issues:" data-toc-modified-id="Tideness-Issues:-2.2.3.2"><span class="toc-item-num">2.2.3.2&nbsp;&nbsp;</span>Tideness Issues:</a></span></li></ul></li></ul></li><li><span><a href="#Data-Cleaning" data-toc-modified-id="Data-Cleaning-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Data Cleaning</a></span><ul class="toc-item"><li><span><a href="#Taking-a-copy-for-cleaning-process:" data-toc-modified-id="Taking-a-copy-for-cleaning-process:-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>Taking a copy for cleaning 
process:</a></span></li><li><span><a href="#Cleaning-Quality-Issues:" data-toc-modified-id="Cleaning-Quality-Issues:-2.3.2"><span class="toc-item-num">2.3.2&nbsp;&nbsp;</span>Cleaning Quality Issues:</a></span></li><li><span><a href="#Cleaning-Tideness-Issues:" data-toc-modified-id="Cleaning-Tideness-Issues:-2.3.3"><span class="toc-item-num">2.3.3&nbsp;&nbsp;</span>Cleaning Tideness Issues:</a></span></li></ul></li><li><span><a href="#Data-Storing" data-toc-modified-id="Data-Storing-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Data Storing</a></span></li></ul></li><li><span><a href="#Exploratory-Data-Analysis" data-toc-modified-id="Exploratory-Data-Analysis-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Exploratory Data Analysis</a></span><ul class="toc-item"><li><span><a href="#Research-Question-1-(Replace-this-header-name!)" data-toc-modified-id="Research-Question-1-(Replace-this-header-name!)-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Research Question 1 (Replace this header name!)</a></span></li><li><span><a href="#Research-Question-2--(Replace-this-header-name!)" data-toc-modified-id="Research-Question-2--(Replace-this-header-name!)-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Research Question 2  (Replace this header name!)</a></span></li></ul></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

## Introduction

The purpose of this project is to put into practice what I learned in the data wrangling section of the Udacity Data Analyst Nanodegree program. The dataset being wrangled contains information about roughly 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

In [1]:
# importing required libraries 
import pandas as pd
import numpy as np
import requests
import re
import matplotlib.pyplot as plt
import datetime
import os
import seaborn as sns
from scipy import stats
from functools import reduce
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

## Data Wrangling

### Data Gathering
Kaggle removed the original version of this dataset following a DMCA takedown request from IMDb. To minimize the impact, they replaced it with a similar set of films and data fields from The Movie Database (TMDb), in accordance with TMDb's terms of use.
- **TMDb Movie Data**
    - This file (tmdb-movies.csv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csv.
    - The file is a cleaned version of the original data at https://www.kaggle.com/tmdb/tmdb-movie-metadata.


In [2]:
# Download the CSV programmatically, but only if it is not already on disk
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csv"
file_name = os.path.basename(url)

if not os.path.isfile(file_name):
    response = requests.get(url)
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(file_name, 'wb') as f:
        f.write(response.content)

# Read CSV file
tmdb_df = pd.read_csv(file_name)
tmdb_df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


### Data Assessment 
Data assessment consists of two main steps, visual and programmatic, which help in exploring the gathered data and finding the anomalies that need to be cleaned.

#### Visual Assessment

In [3]:
tmdb_df

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,1.379999e+08,3.481613e+08
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,1.012000e+08,2.716190e+08
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,1.839999e+08,1.902723e+09
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,1.747999e+08,1.385749e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10861,21,tt0060371,0.080598,0,0,The Endless Summer,Michael Hynson|Robert August|Lord 'Tally Ho' B...,,Bruce Brown,,...,"The Endless Summer, by Bruce Brown, is one of ...",95,Documentary,Bruce Brown Films,6/15/66,11,7.4,1966,0.000000e+00,0.000000e+00
10862,20379,tt0060472,0.065543,0,0,Grand Prix,James Garner|Eva Marie Saint|Yves Montand|Tosh...,,John Frankenheimer,Cinerama sweeps YOU into a drama of speed and ...,...,Grand Prix driver Pete Aron is fired by his te...,176,Action|Adventure|Drama,Cherokee Productions|Joel Productions|Douglas ...,12/21/66,20,5.7,1966,0.000000e+00,0.000000e+00
10863,39768,tt0060161,0.065141,0,0,Beregis Avtomobilya,Innokentiy Smoktunovskiy|Oleg Efremov|Georgi Z...,,Eldar Ryazanov,,...,An insurance agent who moonlights as a carthie...,94,Mystery|Comedy,Mosfilm,1/1/66,11,6.5,1966,0.000000e+00,0.000000e+00
10864,21449,tt0061177,0.064317,0,0,"What's Up, Tiger Lily?",Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh...,,Woody Allen,WOODY ALLEN STRIKES BACK!,...,"In comic Woody Allen's film debut, he took the...",80,Action|Comedy,Benedict Pictures Corp.,11/2/66,22,5.4,1966,0.000000e+00,0.000000e+00


#### Programmatic Assessment

In [4]:
tmdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date       

In [5]:
tmdb_df.describe()

Unnamed: 0,id,popularity,budget,revenue,runtime,vote_count,vote_average,release_year,budget_adj,revenue_adj
count,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0
mean,66064.177434,0.646441,14625700.0,39823320.0,102.070863,217.389748,5.974922,2001.322658,17551040.0,51364360.0
std,92130.136561,1.000185,30913210.0,117003500.0,31.381405,575.619058,0.935142,12.812941,34306160.0,144632500.0
min,5.0,6.5e-05,0.0,0.0,0.0,10.0,1.5,1960.0,0.0,0.0
25%,10596.25,0.207583,0.0,0.0,90.0,17.0,5.4,1995.0,0.0,0.0
50%,20669.0,0.383856,0.0,0.0,99.0,38.0,6.0,2006.0,0.0,0.0
75%,75610.0,0.713817,15000000.0,24000000.0,111.0,145.75,6.6,2011.0,20853250.0,33697100.0
max,417859.0,32.985763,425000000.0,2781506000.0,900.0,9767.0,9.2,2015.0,425000000.0,2827124000.0


In [6]:
sum(tmdb_df['id'].duplicated())

1

In [7]:
tmdb_df[tmdb_df['id'].duplicated()]

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
2090,42194,tt0411951,0.59643,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,,Dwight H. Little,Survival is no game,...,"In the year of 2039, after World Wars destroy ...",92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,3/20/10,110,5.0,2010,30000000.0,967000.0


In [8]:
tmdb_df.popularity.value_counts()

0.109305    2
0.114027    2
0.126182    2
0.247926    2
0.410235    2
           ..
0.645437    1
0.088796    1
0.155075    1
0.596755    1
0.234375    1
Name: popularity, Length: 10814, dtype: int64

In [9]:
len(tmdb_df[tmdb_df['popularity'] <= 1])

9110

In [10]:
len(tmdb_df[tmdb_df['popularity'] > 1])

1756

In [11]:
tmdb_df.popularity.sum()

7024.227383999999

In [12]:
tmdb_df.vote_average.value_counts()

6.1    496
6.0    495
5.8    486
5.9    473
6.2    464
      ... 
8.9      1
8.6      1
9.2      1
8.7      1
2.0      1
Name: vote_average, Length: 72, dtype: int64

In [13]:
len(tmdb_df[tmdb_df['vote_average'] <= 5.97])

5056

In [14]:
len(tmdb_df[tmdb_df['vote_average'] > 5.97])

5810

In [15]:
len(tmdb_df[tmdb_df['vote_average'] == 0])

0

In [16]:
tmdb_df.sample(5)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
7513,11172,tt0758766,0.606428,40000000,145896422,Music and Lyrics,Drew Barrymore|Hugh Grant|Scott Porter|Brad Ga...,,Marc Lawrence,Share the music with someone you love.,...,A washed up singer is given a couple days to c...,96,Comedy|Music|Romance,Village Roadshow Pictures|Castle Rock Entertai...,2/9/07,372,6.1,2007,42066740.0,153434700.0
7884,620,tt0087332,2.484654,30000000,295212467,Ghostbusters,Bill Murray|Dan Aykroyd|Sigourney Weaver|Harol...,http://www.ghostbusters.com/,Ivan Reitman,They ain't afraid of no ghost.,...,After losing their academic posts at a prestig...,107,Fantasy|Action|Comedy|Science Fiction|Family,Columbia Pictures Corporation|Delphi Films|Bla...,6/7/84,1383,7.2,1984,62971260.0,619663400.0
4954,604,tt0234215,4.02924,150000000,738599701,The Matrix Reloaded,Keanu Reeves|Carrie-Anne Moss|Laurence Fishbur...,,Lilly Wachowski|Lana Wachowski,Free your mind.,...,Six months after the events depicted in The Ma...,138,Adventure|Action|Thriller|Science Fiction,Village Roadshow Pictures|NPV Entertainment|He...,5/15/03,2376,6.6,2003,177802900.0,875501100.0
94,309809,tt1754656,1.865007,64000000,97571250,The Little Prince,Jeff Bridges|Rachel McAdams|Paul Rudd|Marion C...,http://www.thelittleprincemovie.com/,Mark Osborne,Growing up isn't the problem... forgetting is.,...,Based on the best-seller book 'The Little Prin...,92,Adventure|Animation|Fantasy,Onyx Films|Orange Studios|CityMation|On Entert...,7/29/15,423,7.5,2015,58879970.0,89765510.0
5863,171581,tt2334841,0.215019,0,0,The Marine 3: Homefront,Mike Mizanin|Neal McDonough|Michael Eklund|Ash...,,Scott Wiper,,...,A Marine must do whatever it takes to save his...,86,Action,WWE Studios,3/5/13,16,4.7,2013,0.0,0.0


In [17]:
tmdb_df.isna().sum()

id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

In [18]:
len(tmdb_df[tmdb_df['budget_adj'] == 0])

5696

In [19]:
len(tmdb_df[tmdb_df['revenue_adj'] == 0])

6016

In [20]:
len(tmdb_df[tmdb_df['budget'] == 0])

5696

In [21]:
len(tmdb_df[tmdb_df['revenue'] == 0])

6016

In [22]:
tmdb_df.isin([0]).sum()

id                         0
imdb_id                    0
popularity                 0
budget                  5696
revenue                 6016
original_title             0
cast                       0
homepage                   0
director                   0
tagline                    0
keywords                   0
overview                   0
runtime                   31
genres                     0
production_companies       0
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj              5696
revenue_adj             6016
dtype: int64
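
The separate checks above (dtypes, missing values, zero counts) can also be collected into one table per column. The following is a minimal sketch under stated assumptions: `assessment_summary` is a hypothetical helper that is not part of the original notebook, and the toy frame only stands in for `tmdb_df` with made-up values.

```python
import pandas as pd


def assessment_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count and zero count for every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_zero": df.isin([0]).sum(),  # same zero check as in the cell above
    })


# Toy frame standing in for tmdb_df (values are invented for illustration).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "budget_adj": [0.0, 1.5e6, 0.0],
    "director": ["A", None, "C"],
})

summary = assessment_summary(toy)
print(summary)
```

Calling `assessment_summary(tmdb_df)` should give the same counts as the `isna()` and `isin([0])` cells, side by side.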

#### Data Assessment Report

##### Quality Issues:

   1. Dropping duplicated **id** rows.
   2. Dropping unneeded columns.
   3. Converting zeros in **budget_adj** and **revenue_adj** to NaN (`np.nan`).
   4. Dropping all rows with NaN values in any column except **revenue_adj**.
   5. Converting **release_date** from object to datetime.

##### Tidiness Issues:

   6. Separating **release_date** into **release_day**, **release_month** and **release_year** (3 columns) and dropping **release_date**.

### Data Cleaning

#### Taking a copy for cleaning process:

In [23]:
tmdb_df_cleaned = tmdb_df.copy()

#### Cleaning Quality Issues:

###### 1. Define

Dropping duplicated **id** rows.

###### Code

In [24]:
# Sort by id
tmdb_df_cleaned.sort_values('id', inplace=True)

# Drop every row whose id is duplicated.
# keep=False removes all copies, including the first occurrence.
tmdb_df_cleaned.drop_duplicates(subset='id', keep=False, inplace=True)

###### Test

In [25]:
sum(tmdb_df_cleaned['id'].duplicated())

0
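
The `keep` argument decides how many copies of a duplicated `id` survive. The row count in this notebook goes from 10866 to 10864, which is consistent with `keep=False` removing both copies of the one duplicated `id`. A small sketch on invented toy data shows the difference from `keep='first'`; whether dropping both copies is the intended outcome is not stated in the notebook.

```python
import pandas as pd

# Toy data: id 2 appears twice (values are invented for illustration).
toy = pd.DataFrame({"id": [1, 2, 2, 3], "title": ["a", "b", "b", "c"]})

# keep=False drops every row that shares an id with another row.
dropped_all = toy.drop_duplicates(subset="id", keep=False)

# keep="first" keeps one representative of each duplicated id.
kept_first = toy.drop_duplicates(subset="id", keep="first")

print(dropped_all["id"].tolist())
print(kept_first["id"].tolist())
```

With `keep='first'` the duplicated movie would stay in the dataset once, so the cleaned frame would have one more row than the one produced here.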

###### 2. Define

Dropping unneeded columns.

###### Code

In [26]:
# Drop the (imdb_id, budget, revenue, homepage, tagline, keywords, overview, release_year) columns

unneeded_columns = ['imdb_id',
                    'budget',
                    'revenue',
                    'homepage',
                    'tagline',
                    'keywords',
                    'overview',
                    'release_year']

tmdb_df_cleaned.drop(unneeded_columns, axis=1, inplace=True)

###### Test

In [27]:
tmdb_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10864 entries, 8088 to 3460
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10864 non-null  int64  
 1   popularity            10864 non-null  float64
 2   original_title        10864 non-null  object 
 3   cast                  10788 non-null  object 
 4   director              10820 non-null  object 
 5   runtime               10864 non-null  int64  
 6   genres                10841 non-null  object 
 7   production_companies  9834 non-null   object 
 8   release_date          10864 non-null  object 
 9   vote_count            10864 non-null  int64  
 10  vote_average          10864 non-null  float64
 11  budget_adj            10864 non-null  float64
 12  revenue_adj           10864 non-null  float64
dtypes: float64(4), int64(3), object(6)
memory usage: 1.2+ MB


###### 3. Define

Converting zeros in **budget_adj** and **revenue_adj** to NaN (`np.nan`).

###### Code

In [28]:
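# Convert zeros in budget_adj to NaN.
# Assign the result back: calling replace(..., inplace=True) on a selected
# column may not update the DataFrame in recent pandas versions.
tmdb_df_cleaned['budget_adj'] = tmdb_df_cleaned['budget_adj'].replace(0, np.nan)

# Convert zeros in revenue_adj to NaN
tmdb_df_cleaned['revenue_adj'] = tmdb_df_cleaned['revenue_adj'].replace(0, np.nan)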

###### Test

In [29]:
tmdb_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10864 entries, 8088 to 3460
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10864 non-null  int64  
 1   popularity            10864 non-null  float64
 2   original_title        10864 non-null  object 
 3   cast                  10788 non-null  object 
 4   director              10820 non-null  object 
 5   runtime               10864 non-null  int64  
 6   genres                10841 non-null  object 
 7   production_companies  9834 non-null   object 
 8   release_date          10864 non-null  object 
 9   vote_count            10864 non-null  int64  
 10  vote_average          10864 non-null  float64
 11  budget_adj            5168 non-null   float64
 12  revenue_adj           4848 non-null   float64
dtypes: float64(4), int64(3), object(6)
memory usage: 1.2+ MB
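
One reason to store unknown budgets and revenues as NaN rather than 0 is that pandas aggregations skip NaN by default, while zeros are treated as real values and pull statistics down. A minimal sketch with invented numbers:

```python
import numpy as np
import pandas as pd

# Two unknown budgets recorded as 0 and two known budgets (invented values).
budget = pd.Series([0.0, 0.0, 10.0, 30.0])

# Zeros count as real observations: (0 + 0 + 10 + 30) / 4
mean_with_zeros = budget.mean()

# After masking, NaN is skipped: (10 + 30) / 2
mean_without_zeros = budget.replace(0, np.nan).mean()

print(mean_with_zeros, mean_without_zeros)
```

This assumes that a zero in **budget_adj** or **revenue_adj** means "not recorded" rather than a true value of zero, which is how the notebook treats it.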


###### 4. Define

Dropping all rows with NaN values in any column except **revenue_adj**.

###### Code

In [30]:
# Drop rows that have NaN in any column other than revenue_adj
tmdb_df_cleaned.dropna(subset=[n for n in tmdb_df_cleaned if n != 'revenue_adj'], inplace = True)

###### Test

In [31]:
tmdb_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5020 entries, 8088 to 3460
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    5020 non-null   int64  
 1   popularity            5020 non-null   float64
 2   original_title        5020 non-null   object 
 3   cast                  5020 non-null   object 
 4   director              5020 non-null   object 
 5   runtime               5020 non-null   int64  
 6   genres                5020 non-null   object 
 7   production_companies  5020 non-null   object 
 8   release_date          5020 non-null   object 
 9   vote_count            5020 non-null   int64  
 10  vote_average          5020 non-null   float64
 11  budget_adj            5020 non-null   float64
 12  revenue_adj           3804 non-null   float64
dtypes: float64(4), int64(3), object(6)
memory usage: 549.1+ KB
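
The `subset` argument is what lets **revenue_adj** keep its missing values: a row is dropped only if one of the listed columns is NaN. A short sketch on a toy frame (invented values) illustrates this:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "director": ["A", None, "C"],
    "budget_adj": [1.0, 2.0, 3.0],
    "revenue_adj": [np.nan, 5.0, np.nan],
})

# Every column except revenue_adj must be present for a row to be kept.
required = [col for col in toy.columns if col != "revenue_adj"]
kept = toy.dropna(subset=required)

print(kept)
```

Only the row with the missing director is removed; the two rows with a missing **revenue_adj** stay, which matches the 3804 non-null **revenue_adj** values out of 5020 rows shown above.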


###### 5. Define

Converting the **release_date** column from object to datetime.

###### Code

In [32]:
# Convert from object to datetime.
# An explicit format is used because infer_datetime_format is deprecated in pandas 2.x.
tmdb_df_cleaned['release_date'] = pd.to_datetime(tmdb_df_cleaned['release_date'], format='%m/%d/%y')

###### Test

In [33]:
tmdb_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5020 entries, 8088 to 3460
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   id                    5020 non-null   int64         
 1   popularity            5020 non-null   float64       
 2   original_title        5020 non-null   object        
 3   cast                  5020 non-null   object        
 4   director              5020 non-null   object        
 5   runtime               5020 non-null   int64         
 6   genres                5020 non-null   object        
 7   production_companies  5020 non-null   object        
 8   release_date          5020 non-null   datetime64[ns]
 9   vote_count            5020 non-null   int64         
 10  vote_average          5020 non-null   float64       
 11  budget_adj            5020 non-null   float64       
 12  revenue_adj           3804 non-null   float64       
dtypes: datetime64[n

In [34]:
tmdb_df_cleaned.head()

Unnamed: 0,id,popularity,original_title,cast,director,runtime,genres,production_companies,release_date,vote_count,vote_average,budget_adj,revenue_adj
8088,5,1.23489,Four Rooms,Tim Roth|Antonio Banderas|Jennifer Beals|Madon...,Allison Anders|Alexandre Rockwell|Robert Rodri...,98,Comedy,Miramax Films|A Band Apart,1995-12-25,293,6.4,5723867.0,6153158.0
1329,11,12.037933,Star Wars,Mark Hamill|Harrison Ford|Carrie Fisher|Peter ...,George Lucas,121,Adventure|Action|Science Fiction,Lucasfilm|Twentieth Century Fox Film Corporation,1977-03-20,4428,7.9,39575590.0,2789712000.0
4955,12,3.440519,Finding Nemo,Albert Brooks|Ellen DeGeneres|Alexander Gould|...,Andrew Stanton|Lee Unkrich,100,Animation|Family,Walt Disney Pictures|Pixar Animation Studios|D...,2003-05-30,3692,7.4,111423100.0,1024887000.0
4179,13,6.715966,Forrest Gump,Tom Hanks|Robin Wright|Gary Sinise|Mykelti Wil...,Robert Zemeckis,142,Comedy|Drama|Romance,Paramount Pictures,1994-07-06,4856,8.1,80911140.0,997333300.0
2411,14,3.55572,American Beauty,Kevin Spacey|Annette Bening|Thora Birch|Wes Be...,Sam Mendes,122,Drama,DreamWorks SKG|Jinks/Cohen Company,1999-09-15,1756,7.7,19635790.0,466411100.0
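
One caveat with this conversion: the raw **release_date** strings use two-digit years (for example `6/15/66` next to a **release_year** of 1966 in the raw data), and `%y` maps 00-68 to 2000-2068. Films from the 1960s may therefore end up with dates in the 2060s. The sketch below shows one possible correction on invented sample strings. It assumes that no film in the dataset was released after 2015, the maximum **release_year** reported by `describe()`; `last_year` is a name introduced here for illustration.

```python
import pandas as pd

# Sample strings in the dataset's month/day/two-digit-year layout.
raw = pd.Series(["6/15/66", "12/25/95", "6/9/15"])

parsed = pd.to_datetime(raw, format="%m/%d/%y")

# Assumption: nothing in the dataset is newer than 2015, so any later
# year must belong to the previous century.
last_year = 2015
fixed = parsed.where(parsed.dt.year <= last_year,
                     parsed - pd.DateOffset(years=100))

print(parsed.dt.year.tolist())
print(fixed.dt.year.tolist())
```

Comparing the recomputed year against the original **release_year** column before that column is dropped would be another way to detect the problem.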


#### Cleaning Tidiness Issues:

###### 6. Define

Separating **release_date** into **release_day**, **release_month** and **release_year** (3 columns) and dropping **release_date**.

###### Code

In [35]:
# extract year, month and day to new columns
tmdb_df_cleaned['release_day'] = tmdb_df_cleaned['release_date'].dt.day
tmdb_df_cleaned['release_month'] = tmdb_df_cleaned['release_date'].dt.month
tmdb_df_cleaned['release_year'] = tmdb_df_cleaned['release_date'].dt.year

# Finally drop timestamp column
tmdb_df_cleaned = tmdb_df_cleaned.drop('release_date', axis=1)

###### Test

In [36]:
tmdb_df_cleaned

Unnamed: 0,id,popularity,original_title,cast,director,runtime,genres,production_companies,vote_count,vote_average,budget_adj,revenue_adj,release_day,release_month,release_year
8088,5,1.234890,Four Rooms,Tim Roth|Antonio Banderas|Jennifer Beals|Madon...,Allison Anders|Alexandre Rockwell|Robert Rodri...,98,Comedy,Miramax Films|A Band Apart,293,6.4,5.723867e+06,6.153158e+06,25,12,1995
1329,11,12.037933,Star Wars,Mark Hamill|Harrison Ford|Carrie Fisher|Peter ...,George Lucas,121,Adventure|Action|Science Fiction,Lucasfilm|Twentieth Century Fox Film Corporation,4428,7.9,3.957559e+07,2.789712e+09,20,3,1977
4955,12,3.440519,Finding Nemo,Albert Brooks|Ellen DeGeneres|Alexander Gould|...,Andrew Stanton|Lee Unkrich,100,Animation|Family,Walt Disney Pictures|Pixar Animation Studios|D...,3692,7.4,1.114231e+08,1.024887e+09,30,5,2003
4179,13,6.715966,Forrest Gump,Tom Hanks|Robin Wright|Gary Sinise|Mykelti Wil...,Robert Zemeckis,142,Comedy|Drama|Romance,Paramount Pictures,4856,8.1,8.091114e+07,9.973333e+08,6,7,1994
2411,14,3.555720,American Beauty,Kevin Spacey|Annette Bening|Thora Birch|Wes Be...,Sam Mendes,122,Drama,DreamWorks SKG|Jinks/Cohen Company,1756,7.7,1.963579e+07,4.664111e+08,15,9,1999
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
311,360387,0.393566,Blunt Force Trauma,Mickey Rourke|Freida Pinto|Ryan Kwanten|Maruia...,Ken Sanzel,92,Adventure|Action,ETA films,23,4.4,9.199996e+05,,5,10,2015
330,362105,0.366030,R.L. Stine's Monsterville: The Cabinet of Souls,Dove Cameron|Katherine McNamara|Ryan McCartan|...,Peter DeLuise,85,Comedy|Horror,EveryWhere Studios,22,7.2,4.047998e+06,,29,9,2015
515,395560,0.142759,Capsule,Edmund Kingsley|David Wayman|Nigel Barber|Lisa...,Andrew Martin,91,Drama|History|Thriller|Science Fiction,Ecaveo Capital Partners|Hermes Space Industries,11,5.3,1.195999e+06,,23,12,2015
3826,414419,0.146477,Kill Bill: The Whole Bloody Affair,Uma Thurman|Lucy Liu|Vivica A. Fox|Daryl Hanna...,Quentin Tarantino,247,Crime|Action,Miramax Films|A Band Apart|Super Cool ManChu,28,8.1,2.908194e+07,,28,3,2011


### Data Storing

In [37]:
#Store the clean DataFrame in a CSV file
tmdb_df_cleaned.to_csv('Tmdb-Movies-Cleaned.csv', 
                 index=False, encoding = 'utf-8')
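
A quick way to gain confidence in the stored file is to read it back and compare it with the frame in memory. The sketch below does this on a toy frame with invented values and an in-memory buffer instead of a file on disk, so it can be run without touching `Tmdb-Movies-Cleaned.csv`.

```python
import io

import numpy as np
import pandas as pd

# Toy frame standing in for tmdb_df_cleaned (values are invented).
toy = pd.DataFrame({
    "id": [5, 11],
    "revenue_adj": [6153158.0, np.nan],
    "release_year": [1995, 1977],
})

# Write to an in-memory buffer with the same index=False setting.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# Raises an AssertionError if columns, dtypes or values differ.
pd.testing.assert_frame_equal(toy, reloaded)
print(reloaded)
```

Because `index=False` is used, the shuffled row labels seen earlier (8088, 1329, ...) are not written, and the reloaded frame gets a fresh 0-based index.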

## Exploratory Data Analysis

### Research Question 1 (Replace this header name!)

In [38]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [39]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


## Conclusions
