# Investigating a Data Set - TMDB Movies

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
# Introduction

> This analysis covers a data set containing information about roughly 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. The original data set is published on Kaggle and can be viewed in CSV format [here](https://www.kaggle.com/tmdb/tmdb-movie-metadata). The full report, including all the files used and the final data set, can be viewed in [this GitHub repository](https://github.com/TrikerDev/Investigating-a-Data-Set---TMDB-Movies). This analysis is not meant to prove anything. Its aim is to find possible correlations between variables in the data. All findings are tentative, and **correlation does not equal causation**.

## Questions to answer

> * How has the number of movies produced changed over time?
> * How long is the average movie?
> * Which genres are associated with the largest number of movies?
> * What years were the overall highest rated films released?
> * How has the profit of movies changed over the years?
> * How do popularity and profit correlate?

In [296]:
# Importing the packages that will be used for this analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<a id='wrangling'></a>
# Data Wrangling

## General Properties

In [297]:
# Loading our dataset into a pandas DataFrame. We name the variable 'df' (for DataFrame) as it's short and sweet
df = pd.read_csv('tmdb-movies.csv')

# Displaying the first few rows of df to make sure it imported correctly
df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.99,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999939.28,1392445892.52
1,76341,tt1392190,28.42,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999939.28,348161292.49
2,262500,tt2908446,13.11,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101199955.47,271619025.41
3,140607,tt2488496,11.17,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999919.04,1902723129.8
4,168259,tt2820852,9.34,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799923.09,1385748801.47


> Now, some of these numbers are pretty strange, specifically the **budget_adj** and **revenue_adj** columns. These are displayed in scientific notation. We don't want that.

In [298]:
# Display floats in fixed-point notation rather than scientific notation, rounded to
# two decimal places. This only changes how values are shown, not the stored data
pd.options.display.float_format = '{:.2f}'.format

# Displaying the first few rows again to see if the numbers have changed
df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.99,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999939.28,1392445892.52
1,76341,tt1392190,28.42,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999939.28,348161292.49
2,262500,tt2908446,13.11,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101199955.47,271619025.41
3,140607,tt2488496,11.17,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999919.04,1902723129.8
4,168259,tt2820852,9.34,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799923.09,1385748801.47
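
> The display setting above is global and affects only how numbers are rendered. As a minimal sketch, assuming a small made-up frame named `toy` (not part of the real data), the same format can be applied temporarily with `pd.option_context`, and the stored values can be checked to confirm that they keep their full precision:

```python
import pandas as pd

# Hypothetical stand-in for two adjusted-dollar values; not the real data set.
toy = pd.DataFrame({"revenue_adj": [1392445892.5238, 348161292.489]})

# option_context applies the format only inside the `with` block,
# so the global display setting is left untouched afterwards.
with pd.option_context("display.float_format", "{:.2f}".format):
    shown = toy.to_string()

print(shown)

# The stored value is unchanged; only the rendering was rounded.
unchanged = toy["revenue_adj"].iloc[0] == 1392445892.5238
```

> Note that `'{:.2f}'` rounds to the nearest hundredth rather than always rounding up.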


> Next, we display the table in a few different views to get some more information.

In [299]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date       

In [25]:
df.describe()

Unnamed: 0,id,popularity,budget,revenue,runtime,vote_count,vote_average,release_year,budget_adj,revenue_adj
count,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0
mean,66064.18,0.65,14625701.09,39823319.79,102.07,217.39,5.97,2001.32,17551039.82,51364363.25
std,92130.14,1.0,30913213.83,117003486.58,31.38,575.62,0.94,12.81,34306155.72,144632485.04
min,5.0,0.0,0.0,0.0,0.0,10.0,1.5,1960.0,0.0,0.0
25%,10596.25,0.21,0.0,0.0,90.0,17.0,5.4,1995.0,0.0,0.0
50%,20669.0,0.38,0.0,0.0,99.0,38.0,6.0,2006.0,0.0,0.0
75%,75610.0,0.71,15000000.0,24000000.0,111.0,145.75,6.6,2011.0,20853251.08,33697095.72
max,417859.0,32.99,425000000.0,2781505847.0,900.0,9767.0,9.2,2015.0,425000000.0,2827123750.41


In [27]:
df.dtypes

id                        int64
imdb_id                  object
popularity              float64
budget                    int64
revenue                   int64
original_title           object
cast                     object
homepage                 object
director                 object
tagline                  object
keywords                 object
overview                 object
runtime                   int64
genres                   object
production_companies     object
release_date             object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object

## Data Cleaning

> First, we are going to delete the columns that we don't need for this analysis. This will make the table more manageable and less cluttered.

> The columns that are not needed are:
> * imdb_id
> * homepage
> * tagline
> * overview
> * release_date (we don't need the specific date, only the release year, which is in a separate column)

In [130]:
# Getting rid of columns that we don't want
new_df = df.drop(['imdb_id', 'homepage', 'tagline', 'overview', 'release_date'], axis = 1)
new_df.head()

Unnamed: 0,id,popularity,budget,revenue,original_title,cast,director,keywords,runtime,genres,production_companies,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,32.99,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,monster|dna|tyrannosaurus rex|velociraptor|island,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,5562,6.5,2015,137999939.28,1392445892.52
1,76341,28.42,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,future|chase|post-apocalyptic|dystopia|australia,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,6185,7.1,2015,137999939.28,348161292.49
2,262500,13.11,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,Robert Schwentke,based on novel|revolution|dystopia|sequel|dyst...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,2480,6.3,2015,101199955.47,271619025.41
3,140607,11.17,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,J.J. Abrams,android|spaceship|jedi|space opera|3d,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,5292,7.5,2015,183999919.04,1902723129.8
4,168259,9.34,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,James Wan,car race|speed|revenge|suspense|car,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,2947,7.3,2015,174799923.09,1385748801.47


> Another issue is how some data is presented in this table. Certain columns, such as 'cast', hold several values separated by a '|' sign. We will split each of these into its own column, one value per column. The columns containing these values are:
> * cast
> * director
> * genres
> * keywords
> * production_companies

In [131]:
# Separating the values on the '|' symbol and storing each set of split columns in a new dataframe
cast_df = (new_df['cast'].str.split('|', expand=True).rename(columns=lambda x: f"cast_{x+1}"))
director_df = (new_df['director'].str.split('|', expand = True).rename(columns=lambda x: f"director_{x+1}"))
genre_df = (new_df['genres'].str.split('|', expand=True).rename(columns=lambda x: f"genres_{x+1}"))
keywords_df = (new_df['keywords'].str.split('|', expand=True).rename(columns=lambda x: f"keywords_{x+1}"))
production_df = (new_df['production_companies'].str.split('|', expand=True).rename(columns=lambda x: f"production_companies_{x+1}"))

> Displaying these new dataframes to make sure the splitting worked

In [132]:
# Cast df
cast_df.head()

Unnamed: 0,cast_1,cast_2,cast_3,cast_4,cast_5
0,Chris Pratt,Bryce Dallas Howard,Irrfan Khan,Vincent D'Onofrio,Nick Robinson
1,Tom Hardy,Charlize Theron,Hugh Keays-Byrne,Nicholas Hoult,Josh Helman
2,Shailene Woodley,Theo James,Kate Winslet,Ansel Elgort,Miles Teller
3,Harrison Ford,Mark Hamill,Carrie Fisher,Adam Driver,Daisy Ridley
4,Vin Diesel,Paul Walker,Jason Statham,Michelle Rodriguez,Dwayne Johnson


In [133]:
# Director df
director_df.head()

Unnamed: 0,director_1,director_2,director_3,director_4,director_5,director_6,director_7,director_8,director_9,director_10,...,director_27,director_28,director_29,director_30,director_31,director_32,director_33,director_34,director_35,director_36
0,Colin Trevorrow,,,,,,,,,,...,,,,,,,,,,
1,George Miller,,,,,,,,,,...,,,,,,,,,,
2,Robert Schwentke,,,,,,,,,,...,,,,,,,,,,
3,J.J. Abrams,,,,,,,,,,...,,,,,,,,,,
4,James Wan,,,,,,,,,,...,,,,,,,,,,


> There are so many missing ('None') values here because the split creates as many columns as the movie with the most entries needs, and every other row is padded to the same number of columns. This means that one movie had as many as 36 directors!

> Found it:

In [134]:
director_df[director_df.director_36.notnull()]

# Wow. Truly incredible.

Unnamed: 0,director_1,director_2,director_3,director_4,director_5,director_6,director_7,director_8,director_9,director_10,...,director_27,director_28,director_29,director_30,director_31,director_32,director_33,director_34,director_35,director_36
7751,Theo Angelopoulos,Olivier Assayas,Bille August,Jane Campion,Youssef Chahine,Chen Kaige,Michael Cimino,Ethan Coen,Joel Coen,David Cronenberg,...,Roman Polanski,Raúl Ruiz,Walter Salles,Elia Suleiman,Tsai Ming-Liang,Gus Van Sant,Lars von Trier,Wim Wenders,Wong Kar-wai,Zhang Yimou
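
> The padding seen in the director table can be reproduced on a tiny example. This is a sketch only: the frame `toy` and the names in it are made up for illustration, while the split itself uses the same `str.split('|', expand=True)` call as the cell above.

```python
import pandas as pd

# Made-up rows: one film with a single director, one with three.
toy = pd.DataFrame({"director": ["A. Solo", "B. One|C. Two|D. Three"]})

# expand=True returns one column per split position; the widest row sets the width.
wide = (toy["director"]
        .str.split("|", expand=True)
        .rename(columns=lambda x: f"director_{x+1}"))

# Row 0 has only one name, so its remaining two cells are filled with missing values.
n_missing_row0 = wide.loc[0].isna().sum()
print(wide)
print(n_missing_row0)
```

> A single row with many entries is therefore enough to widen the table for every other row.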


In [111]:
# Genres df
genre_df.head()

Unnamed: 0,genres_1,genres_2,genres_3,genres_4,genres_5
0,Action,Adventure,Science Fiction,Thriller,
1,Action,Adventure,Science Fiction,Thriller,
2,Adventure,Science Fiction,Thriller,,
3,Action,Adventure,Science Fiction,Fantasy,
4,Action,Crime,Thriller,,


In [112]:
# Production Companies df
production_df.head()

Unnamed: 0,production_companies_1,production_companies_2,production_companies_3,production_companies_4,production_companies_5
0,Universal Studios,Amblin Entertainment,Legendary Pictures,Fuji Television Network,Dentsu
1,Village Roadshow Pictures,Kennedy Miller Productions,,,
2,Summit Entertainment,Mandeville Films,Red Wagon Entertainment,NeoReel,
3,Lucasfilm,Truenorth Productions,Bad Robot,,
4,Universal Pictures,Original Film,Media Rights Capital,Dentsu,One Race Films


In [113]:
# Keywords df
keywords_df.head()

Unnamed: 0,keywords_1,keywords_2,keywords_3,keywords_4,keywords_5
0,monster,dna,tyrannosaurus rex,velociraptor,island
1,future,chase,post-apocalyptic,dystopia,australia
2,based on novel,revolution,dystopia,sequel,dystopic future
3,android,spaceship,jedi,space opera,3d
4,car race,speed,revenge,suspense,car


> The next thing we need to do is join these separate tables back into our main table.

In [135]:
# Start by dropping the columns we split
dropped_df = new_df.drop(['cast', 'director', 'genres', 'keywords', 'production_companies'], axis = 1)
dropped_df.head()

Unnamed: 0,id,popularity,budget,revenue,original_title,runtime,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,32.99,150000000,1513528810,Jurassic World,124,5562,6.5,2015,137999939.28,1392445892.52
1,76341,28.42,150000000,378436354,Mad Max: Fury Road,120,6185,7.1,2015,137999939.28,348161292.49
2,262500,13.11,110000000,295238201,Insurgent,119,2480,6.3,2015,101199955.47,271619025.41
3,140607,11.17,200000000,2068178225,Star Wars: The Force Awakens,136,5292,7.5,2015,183999919.04,1902723129.8
4,168259,9.34,190000000,1506249360,Furious 7,137,2947,7.3,2015,174799923.09,1385748801.47


In [136]:
# Now, adding back in our newly created tables in place of the ones we dropped
final_df = dropped_df.join([cast_df, genre_df, keywords_df, production_df, director_df])
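
> `DataFrame.join` with a list of frames aligns them on the row index, which is why the split tables line up with the right movies. A minimal sketch, assuming three made-up frames (`base`, `cast`, `genre`) that share the default index:

```python
import pandas as pd

# Hypothetical miniature versions of the main table and two split tables.
base = pd.DataFrame({"original_title": ["Film A", "Film B"]})
cast = pd.DataFrame({"cast_1": ["Actor X", "Actor Y"]})
genre = pd.DataFrame({"genres_1": ["Drama", "Action"]})

# Passing a list joins every frame on the shared row index.
joined = base.join([cast, genre])
print(joined)
```

> This relies on the index being untouched between the split and the join. If rows had been filtered or re-indexed in between, the alignment would need to be checked.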

In [137]:
# Checking for duplicate rows
final_df.duplicated().sum()
# Found 1 duplicate row

1

In [138]:
# Removing duplicate row
final_df.drop_duplicates(keep ='first', inplace=True)

In [139]:
# Checking again for duplicate rows
final_df.duplicated().sum()

0

In [349]:
# Checking the types of the columns
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 67 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      10865 non-null  int64  
 1   popularity              10865 non-null  float64
 2   budget                  10865 non-null  int64  
 3   revenue                 10865 non-null  int64  
 4   original_title          10865 non-null  object 
 5   runtime                 10865 non-null  int64  
 6   vote_count              10865 non-null  int64  
 7   vote_average            10865 non-null  float64
 8   release_year            10865 non-null  int64  
 9   budget_adj              10865 non-null  float64
 10  revenue_adj             10865 non-null  float64
 11  cast_1                  10789 non-null  object 
 12  cast_2                  10645 non-null  object 
 13  cast_3                  10555 non-null  object 
 14  cast_4                  10446 non-null

> Now the data is cleaned and presented nicely, and the duplicate row is gone. We are ready to begin our analysis.

<a id='eda'></a>
# Exploratory Data Analysis

## How has the number of movies produced changed over time?
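
> One possible way to approach this question is to count rows per `release_year`. The sketch below runs on a made-up frame called `toy`, so it shows the technique only and says nothing about the real data:

```python
import pandas as pd

# Made-up release years standing in for final_df['release_year'].
toy = pd.DataFrame({"release_year": [2014, 2015, 2015, 2013, 2015]})

# Count movies per year, then sort by year so the series reads chronologically.
movies_per_year = toy["release_year"].value_counts().sort_index()
print(movies_per_year)
```

> The resulting series can be passed straight to `movies_per_year.plot()` for a line chart.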

## How long is the average movie?
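
> The `describe()` output earlier shows a minimum `runtime` of 0 and a maximum of 900, so a plain mean may be distorted. A hedged sketch, assuming that a runtime of 0 means the value is missing (an assumption, not something the data set documents), and using a made-up frame `toy`:

```python
import pandas as pd

# Made-up runtimes in minutes; the 0 stands in for an unknown runtime.
toy = pd.DataFrame({"runtime": [0, 90, 100, 110]})

# Assumption: a runtime of 0 is a placeholder for a missing value, so exclude it.
valid = toy.loc[toy["runtime"] > 0, "runtime"]

mean_runtime = valid.mean()
median_runtime = valid.median()  # less sensitive to extreme values than the mean
print(mean_runtime, median_runtime)
```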

## Which genres are associated with the largest number of movies?
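
> Because the genres were split into `genres_1` to `genres_5`, counting them means gathering those columns back into a single one first. A sketch of one way to do this, on a made-up two-column frame `toy`:

```python
import pandas as pd

# Made-up stand-in for the genres_N columns of final_df.
toy = pd.DataFrame({
    "genres_1": ["Action", "Action", "Drama"],
    "genres_2": ["Adventure", "Crime", None],
})

genre_cols = [c for c in toy.columns if c.startswith("genres_")]

# melt() stacks the genre columns into one long column; empty cells are dropped
# before counting, so a movie with several genres is counted once per genre.
genre_counts = (toy[genre_cols]
                .melt(value_name="genre")["genre"]
                .dropna()
                .value_counts())
print(genre_counts)
```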

## What years were the overall highest rated films released?
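
> A simple starting point is the mean `vote_average` per `release_year`, sorted from highest to lowest. The frame `toy` below is made up, and the unweighted mean is a choice made for this sketch, not a method taken from the report:

```python
import pandas as pd

# Made-up ratings standing in for final_df.
toy = pd.DataFrame({
    "release_year": [2014, 2014, 2015, 2015],
    "vote_average": [6.0, 7.0, 7.5, 8.5],
})

# Mean rating per year, best year first.
rating_by_year = (toy.groupby("release_year")["vote_average"]
                  .mean()
                  .sort_values(ascending=False))
print(rating_by_year)
```

> An unweighted mean treats a film with 10 votes the same as one with several thousand, so `vote_count` may be worth taking into account.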

## How has the profit of movies changed over the years?
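
> Profit is not a column in the data set, so it has to be derived. The sketch below defines a hypothetical `profit_adj` as `revenue_adj - budget_adj`. It also assumes that a 0 in either column means the figure is unknown, which the `describe()` output suggests (the median of both columns is 0) but does not confirm. The frame `toy` is made up.

```python
import pandas as pd

# Made-up inflation-adjusted figures; the 0.0 budget stands in for an unknown value.
toy = pd.DataFrame({
    "release_year": [2014, 2014, 2015],
    "budget_adj": [10.0, 0.0, 20.0],
    "revenue_adj": [30.0, 50.0, 15.0],
})

# Assumption: rows with a zero budget or revenue have missing data, so drop them.
known = toy[(toy["budget_adj"] > 0) & (toy["revenue_adj"] > 0)].copy()

# Hypothetical derived column: adjusted revenue minus adjusted budget.
known["profit_adj"] = known["revenue_adj"] - known["budget_adj"]

profit_by_year = known.groupby("release_year")["profit_adj"].mean()
print(profit_by_year)
```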

## How do popularity and profit correlate?
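
> A correlation coefficient gives a first numeric summary of how two columns move together. This sketch uses `Series.corr`, which computes the Pearson coefficient by default, on a made-up frame `toy` whose `profit_adj` column is hypothetical (revenue minus budget):

```python
import pandas as pd

# Made-up values chosen to be perfectly linear, so the coefficient is 1.
toy = pd.DataFrame({
    "popularity": [1.0, 2.0, 3.0, 4.0],
    "profit_adj": [10.0, 20.0, 30.0, 40.0],
})

# Pearson correlation between the two columns (ranges from -1 to 1).
r = toy["popularity"].corr(toy["profit_adj"])
print(r)
```

> A scatter plot of the two columns is a useful companion, since a single coefficient can hide outliers and non-linear patterns.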

<a id='conclusions'></a>
# Conclusions
