# Microsoft Movie Analysis

**Author:** [Gustavo Villagrana](mailto:gusvilla303@gmail.com)
***


## Overview
***
This project analyzes which types of films are currently performing best at the box office, to help the head of Microsoft's new movie studio decide what to produce. Because this is Microsoft's first venture into filmmaking, the strongest option needs to be clearly identified to make the most of the investment.



## Business Problem
***
Microsoft wants to launch a movie studio that produces original content, but it needs help identifying which types of films perform best at the box office. Knowing this will let Microsoft direct its investment toward the films most likely to be profitable.



## Data Understanding


## Data Preparation


## Data Modeling


## Evaluation


## Conclusions

In [42]:
# Import standard packages
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Cleaning the Data

### Title Basics

In [43]:
# Title Basics data

title_basics_df = pd.read_csv('data/imdb.title.basics.csv.gz')
title_basics_df.rename(columns={'tconst': 'movie_id'}, inplace=True)
title_basics_df

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.00,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.00,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.00,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.00,"Comedy,Drama,Fantasy"
...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.00,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.00,


In [46]:
title_basics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


### Title Ratings

In [47]:
# Title Ratings data (imdb.title.ratings)

title_ratings_df = pd.read_csv('data/imdb.title.ratings.csv.gz')
title_ratings_df.rename(columns={'tconst': 'movie_id'}, inplace=True)
title_ratings_df

Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.30,31
1,tt10384606,8.90,559
2,tt1042974,6.40,20
3,tt1043726,4.20,50352
4,tt1060240,6.50,21
...,...,...,...
73851,tt9805820,8.10,25
73852,tt9844256,7.50,24
73853,tt9851050,4.70,14
73854,tt9886934,7.00,5


In [48]:
title_ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [49]:
# Left-join Title Ratings onto Title Basics on movie_id (titles without ratings get NaN)

basics_with_ratings_df = title_basics_df.join(title_ratings_df.set_index('movie_id'), on='movie_id')
basics_with_ratings_df

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,tt0063540,Sunghursh,Sunghursh,2013,175.00,"Action,Crime,Drama",7.00,77.00
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.00,"Biography,Drama",7.20,43.00
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.00,Drama,6.90,4517.00
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.10,13.00
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.00,"Comedy,Drama,Fantasy",6.50,119.00
...,...,...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.00,Drama,,
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary,,
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy,,
146142,tt9916730,6 Gunn,6 Gunn,2017,116.00,,,
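
The left join above keeps every row of `title_basics_df`, so titles with no rating end up with NaN in `averagerating` and `numvotes`. The following is a minimal sketch, using small made-up samples rather than the real files, of how `pd.merge` with `validate` and `indicator` could confirm that `movie_id` is unique on both sides and count the unmatched rows. The sample frames and their names are hypothetical.

```python
import pandas as pd

# Tiny hypothetical samples standing in for title_basics_df and title_ratings_df
basics_sample = pd.DataFrame({
    'movie_id': ['tt001', 'tt002', 'tt003'],
    'primary_title': ['Film A', 'Film B', 'Film C'],
})
ratings_sample = pd.DataFrame({
    'movie_id': ['tt001', 'tt003'],
    'averagerating': [7.0, 6.5],
    'numvotes': [77, 119],
})

# validate='one_to_one' raises MergeError if movie_id repeats on either side;
# indicator=True adds a _merge column showing where each row came from
merged_sample = pd.merge(
    basics_sample, ratings_sample,
    on='movie_id', how='left',
    validate='one_to_one', indicator=True,
)

# Count rows that had no matching rating
unrated_count = (merged_sample['_merge'] == 'left_only').sum()
print(unrated_count)
```

On the real data, a count like this would show how many titles carry no rating before any rows are dropped.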


In [50]:
basics_with_ratings_df.rename(columns={'primary_title': 'title'}, inplace=True)
basics_with_ratings_df

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,tt0063540,Sunghursh,Sunghursh,2013,175.00,"Action,Crime,Drama",7.00,77.00
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.00,"Biography,Drama",7.20,43.00
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.00,Drama,6.90,4517.00
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.10,13.00
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.00,"Comedy,Drama,Fantasy",6.50,119.00
...,...,...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.00,Drama,,
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary,,
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy,,
146142,tt9916730,6 Gunn,6 Gunn,2017,116.00,,,


### Movie Gross data

In [51]:
# Movie Gross data
# Note: domestic_gross loads as float64, but foreign_gross loads as strings (object dtype)

movie_gross_df = pd.read_csv('data/bom.movie_gross.csv.gz')
movie_gross_df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.00,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.00,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.00,664300000,2010
3,Inception,WB,292600000.00,535700000,2010
4,Shrek Forever After,P/DW,238700000.00,513900000,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.00,,2018
3383,Edward II (2018 re-release),FM,4800.00,,2018
3384,El Pacto,Sony,2500.00,,2018
3385,The Swan,Synergetic,2400.00,,2018


In [52]:
movie_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [53]:
movie_gross_sorted_df = movie_gross_df.sort_values(by=['domestic_gross'], ascending=False)
movie_gross_sorted_df  

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1872,Star Wars: The Force Awakens,BV,936700000.00,1131.6,2015
3080,Black Panther,BV,700100000.00,646900000,2018
3079,Avengers: Infinity War,BV,678800000.00,1369.5,2018
1873,Jurassic World,Uni.,652300000.00,1019.4,2015
727,Marvel's The Avengers,BV,623400000.00,895500000,2012
...,...,...,...,...,...
1975,Surprise - Journey To The West,AR,,49600000,2015
2392,Finding Mr. Right 2,CL,,114700000,2016
2468,Solace,LGP,,22400000,2016
2595,Viral,W/Dim.,,552000,2016


In [54]:
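# Remove thousands separators from the foreign_gross strings.
# Assigning the result back avoids relying on an in-place change to a column selection.

movie_gross_sorted_df['foreign_gross'] = movie_gross_sorted_df['foreign_gross'].str.replace(',', '', regex=False)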
movie_gross_sorted_df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1872,Star Wars: The Force Awakens,BV,936700000.00,1131.6,2015
3080,Black Panther,BV,700100000.00,646900000,2018
3079,Avengers: Infinity War,BV,678800000.00,1369.5,2018
1873,Jurassic World,Uni.,652300000.00,1019.4,2015
727,Marvel's The Avengers,BV,623400000.00,895500000,2012
...,...,...,...,...,...
1975,Surprise - Journey To The West,AR,,49600000,2015
2392,Finding Mr. Right 2,CL,,114700000,2016
2468,Solace,LGP,,22400000,2016
2595,Viral,W/Dim.,,552000,2016


In [55]:
movie_gross_sorted_df['foreign_gross'] = movie_gross_sorted_df['foreign_gross'].astype(float)
movie_gross_sorted_df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1872,Star Wars: The Force Awakens,BV,936700000.00,1131.60,2015
3080,Black Panther,BV,700100000.00,646900000.00,2018
3079,Avengers: Infinity War,BV,678800000.00,1369.50,2018
1873,Jurassic World,Uni.,652300000.00,1019.40,2015
727,Marvel's The Avengers,BV,623400000.00,895500000.00,2012
...,...,...,...,...,...
1975,Surprise - Journey To The West,AR,,49600000.00,2015
2392,Finding Mr. Right 2,CL,,114700000.00,2016
2468,Solace,LGP,,22400000.00,2016
2595,Viral,W/Dim.,,552000.00,2016


In [14]:
#movie_gross_sorted_df['foreign_to_fix'] = [len(str(row)) <= 6 for row in movie_gross_sorted_df['foreign_gross']]
#movie_gross_sorted_df.loc[movie_gross_sorted_df['foreign_to_fix'], 'foreign_gross'] 


1872   1,131.60
3079   1,369.50
1873   1,019.40
1874   1,163.00
2760   1,010.00
         ...   
2321        nan
2757        nan
2756        nan
1476        nan
327    3,800.00
Name: foreign_gross, Length: 1372, dtype: float64

In [15]:
#movie_gross_sorted_df.loc[movie_gross_sorted_df['foreign_to_fix'], 
 #                         'foreign_gross'] = movie_gross_sorted_df['foreign_gross'] * 1000000
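
The two commented-out cells above point at a real issue: a handful of `foreign_gross` values (for example 1,131.60 for Star Wars: The Force Awakens) look as if they were recorded in millions of dollars. A length test on the converted float also catches NaN and small but real values such as 3,800.00. The sketch below shows one possible alternative on made-up rows. It assumes that any raw string containing a comma is expressed in millions, and that assumption should be checked against the source data before use.

```python
import pandas as pd

# Hypothetical sample mimicking the raw foreign_gross strings
gross_sample = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C', 'Film D'],
    'foreign_gross': ['652000000', '1,131.6', None, '3800'],
})

raw = gross_sample['foreign_gross']

# Assumption: a raw value containing a comma (e.g. '1,131.6') is in millions
in_millions = raw.str.contains(',', regex=False, na=False)

# Strip commas, then convert; errors='coerce' turns unparseable text into NaN
cleaned = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')

# Rescale only the flagged rows, leaving NaN and small real values alone
cleaned = cleaned.where(~in_millions, cleaned * 1_000_000)
gross_sample['foreign_gross'] = cleaned
print(gross_sample)
```

Flagging rows from the raw text rather than from the converted number keeps missing values and low-grossing films out of the rescaling step.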

In [56]:
movie_gross_sorted_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3387 entries, 1872 to 2825
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   float64
 4   year            3387 non-null   int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 158.8+ KB


In [57]:
movie_gross_sorted_df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1872,Star Wars: The Force Awakens,BV,936700000.00,1131.60,2015
3080,Black Panther,BV,700100000.00,646900000.00,2018
3079,Avengers: Infinity War,BV,678800000.00,1369.50,2018
1873,Jurassic World,Uni.,652300000.00,1019.40,2015
727,Marvel's The Avengers,BV,623400000.00,895500000.00,2012
...,...,...,...,...,...
1975,Surprise - Journey To The West,AR,,49600000.00,2015
2392,Finding Mr. Right 2,CL,,114700000.00,2016
2468,Solace,LGP,,22400000.00,2016
2595,Viral,W/Dim.,,552000.00,2016


In [65]:
#pd.merge(basics_with_ratings_df, movie_gross_sorted_df, left_on='original_title', right_on='title' )

## Joining the Data

In [58]:
basics_ratings_gross_df = basics_with_ratings_df.join(movie_gross_sorted_df.set_index('title'), how='inner', on='title')
basics_ratings_gross_df

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year
38,tt0315642,Wazir,Wazir,2016,103.00,"Action,Crime,Drama",7.10,15378.00,Relbig.,1100000.00,,2016
48,tt0337692,On the Road,On the Road,2012,124.00,"Adventure,Drama,Romance",6.10,37886.00,IFC,744000.00,8000000.00,2012
39490,tt2404548,On the Road,On the Road,2011,90.00,Drama,,,IFC,744000.00,8000000.00,2012
68078,tt3872966,On the Road,On the Road,2013,87.00,Documentary,,,IFC,744000.00,8000000.00,2012
76007,tt4339118,On the Road,On the Road,2014,89.00,Drama,6.00,6.00,IFC,744000.00,8000000.00,2012
...,...,...,...,...,...,...,...,...,...,...,...,...
133797,tt8404272,How Long Will I Love U,Chao shi kong tong ju,2018,101.00,Romance,6.50,607.00,WGUSA,747000.00,82100000.00,2018
134045,tt8427036,Helicopter Eela,Helicopter Eela,2018,135.00,Drama,5.40,673.00,Eros,72000.00,,2018
137854,tt8851262,Spring Fever,Spring Fever,2019,,"Comedy,Horror",,,Strand,10800.00,150000.00,2010
140171,tt9078374,Last Letter,"Ni hao, Zhihua",2018,114.00,"Drama,Romance",6.40,322.00,CL,181000.00,,2018
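
Joining on `title` alone pairs every IMDb entry named "On the Road" with the single Box Office Mojo row, whatever its release year. One possible tightening, sketched here on made-up rows, is to match on both the title and the year (`start_year` on the IMDb side, `year` on the gross side). The sample frames are hypothetical, and because release years can differ between sources, this stricter key would also drop some genuine matches.

```python
import pandas as pd

# Hypothetical rows: three IMDb films share a title, one gross record exists
imdb_sample = pd.DataFrame({
    'movie_id': ['tt1', 'tt2', 'tt3'],
    'title': ['On the Road', 'On the Road', 'On the Road'],
    'start_year': [2012, 2011, 2014],
})
gross_sample = pd.DataFrame({
    'title': ['On the Road'],
    'year': [2012],
    'domestic_gross': [744000.0],
})

# Title-only join: every same-named film inherits the same gross figures
title_only = imdb_sample.merge(gross_sample, on='title', how='inner')

# Title + year join: only the film released in the matching year is kept
title_and_year = imdb_sample.merge(
    gross_sample,
    left_on=['title', 'start_year'],
    right_on=['title', 'year'],
    how='inner',
)
print(len(title_only), len(title_and_year))
```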


In [59]:
basics_ratings_gross_df.loc[135258]

movie_id                      tt8553606
title                            Aurora
original_title                   Aurora
start_year                         2019
runtime_minutes                  106.00
genres             Comedy,Drama,Romance
averagerating                      7.50
numvotes                         200.00
studio                             CGld
domestic_gross                 5,700.00
foreign_gross                  5,100.00
year                               2011
Name: 135258, dtype: object

In [60]:
basics_ratings_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3366 entries, 38 to 140826
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         3366 non-null   object 
 1   title            3366 non-null   object 
 2   original_title   3366 non-null   object 
 3   start_year       3366 non-null   int64  
 4   runtime_minutes  3198 non-null   float64
 5   genres           3326 non-null   object 
 6   averagerating    3027 non-null   float64
 7   numvotes         3027 non-null   float64
 8   studio           3363 non-null   object 
 9   domestic_gross   3342 non-null   float64
 10  foreign_gross    2043 non-null   float64
 11  year             3366 non-null   int64  
dtypes: float64(5), int64(2), object(5)
memory usage: 501.9+ KB


In [61]:
df_without_nan = basics_ratings_gross_df.dropna()
df_without_nan

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year
48,tt0337692,On the Road,On the Road,2012,124.00,"Adventure,Drama,Romance",6.10,37886.00,IFC,744000.00,8000000.00,2012
76007,tt4339118,On the Road,On the Road,2014,89.00,Drama,6.00,6.00,IFC,744000.00,8000000.00,2012
96791,tt5647250,On the Road,On the Road,2016,121.00,Drama,5.70,127.00,IFC,744000.00,8000000.00,2012
54,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.00,"Adventure,Comedy,Drama",7.30,275300.00,Fox,58200000.00,129900000.00,2013
58,tt0365907,A Walk Among the Tombstones,A Walk Among the Tombstones,2014,114.00,"Action,Crime,Drama",6.50,105116.00,Uni.,26300000.00,26900000.00,2014
...,...,...,...,...,...,...,...,...,...,...,...,...
126784,tt7752454,Detective Chinatown 2,Tang ren jie tan an 2,2018,121.00,"Action,Comedy,Mystery",6.10,1250.00,WB,2000000.00,542100000.00,2018
127205,tt7784604,Hereditary,Hereditary,2018,127.00,"Drama,Horror,Mystery",7.30,151571.00,A24,44100000.00,35300000.00,2018
130621,tt8097306,Nobody's Fool,Nobody's Fool,2018,110.00,"Comedy,Drama,Romance",4.60,3618.00,Par.,31700000.00,1800000.00,2018
133797,tt8404272,How Long Will I Love U,Chao shi kong tong ju,2018,101.00,Romance,6.50,607.00,WGUSA,747000.00,82100000.00,2018


In [62]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_without_nan = df_without_nan.assign(
    total_gross=df_without_nan['domestic_gross'] + df_without_nan['foreign_gross']
)
df_without_nan


Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year,total_gross
48,tt0337692,On the Road,On the Road,2012,124.00,"Adventure,Drama,Romance",6.10,37886.00,IFC,744000.00,8000000.00,2012,8744000.00
76007,tt4339118,On the Road,On the Road,2014,89.00,Drama,6.00,6.00,IFC,744000.00,8000000.00,2012,8744000.00
96791,tt5647250,On the Road,On the Road,2016,121.00,Drama,5.70,127.00,IFC,744000.00,8000000.00,2012,8744000.00
54,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.00,"Adventure,Comedy,Drama",7.30,275300.00,Fox,58200000.00,129900000.00,2013,188100000.00
58,tt0365907,A Walk Among the Tombstones,A Walk Among the Tombstones,2014,114.00,"Action,Crime,Drama",6.50,105116.00,Uni.,26300000.00,26900000.00,2014,53200000.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
126784,tt7752454,Detective Chinatown 2,Tang ren jie tan an 2,2018,121.00,"Action,Comedy,Mystery",6.10,1250.00,WB,2000000.00,542100000.00,2018,544100000.00
127205,tt7784604,Hereditary,Hereditary,2018,127.00,"Drama,Horror,Mystery",7.30,151571.00,A24,44100000.00,35300000.00,2018,79400000.00
130621,tt8097306,Nobody's Fool,Nobody's Fool,2018,110.00,"Comedy,Drama,Romance",4.60,3618.00,Par.,31700000.00,1800000.00,2018,33500000.00
133797,tt8404272,How Long Will I Love U,Chao shi kong tong ju,2018,101.00,Romance,6.50,607.00,WGUSA,747000.00,82100000.00,2018,82847000.00


In [63]:
df_without_nan.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1767 entries, 48 to 140826
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         1767 non-null   object 
 1   title            1767 non-null   object 
 2   original_title   1767 non-null   object 
 3   start_year       1767 non-null   int64  
 4   runtime_minutes  1767 non-null   float64
 5   genres           1767 non-null   object 
 6   averagerating    1767 non-null   float64
 7   numvotes         1767 non-null   float64
 8   studio           1767 non-null   object 
 9   domestic_gross   1767 non-null   float64
 10  foreign_gross    1767 non-null   float64
 11  year             1767 non-null   int64  
 12  total_gross      1767 non-null   float64
dtypes: float64(6), int64(2), object(5)
memory usage: 193.3+ KB


In [64]:
# Sort by total_gross in descending order

df_without_nan = df_without_nan.sort_values(by='total_gross', ascending=False)
df_without_nan.head(10)

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year,total_gross
39010,tt2395427,Avengers: Age of Ultron,Avengers: Age of Ultron,2015,141.0,"Action,Adventure,Sci-Fi",7.3,665594.0,BV,459000000.0,946400000.0,2015,1405400000.0
19050,tt1825683,Black Panther,Black Panther,2018,134.0,"Action,Adventure,Sci-Fi",7.3,516148.0,BV,700100000.0,646900000.0,2018,1347000000.0
42223,tt2527336,Star Wars: The Last Jedi,Star Wars: Episode VIII - The Last Jedi,2017,152.0,"Action,Adventure,Fantasy",7.1,462903.0,BV,620200000.0,712400000.0,2017,1332600000.0
84414,tt4881806,Jurassic World: Fallen Kingdom,Jurassic World: Fallen Kingdom,2018,128.0,"Action,Adventure,Sci-Fi",6.2,219125.0,Uni.,417700000.0,891800000.0,2018,1309500000.0
6647,tt1323045,Frozen,Frozen,2010,93.0,"Adventure,Drama,Sport",6.2,62311.0,BV,400700000.0,875700000.0,2013,1276400000.0
10824,tt1611845,Frozen,Wai nei chung ching,2010,92.0,"Fantasy,Romance",5.4,75.0,BV,400700000.0,875700000.0,2013,1276400000.0
35107,tt2294629,Frozen,Frozen,2013,102.0,"Adventure,Animation,Comedy",7.5,516998.0,BV,400700000.0,875700000.0,2013,1276400000.0
62741,tt3606756,Incredibles 2,Incredibles 2,2018,118.0,"Action,Adventure,Animation",7.7,203510.0,BV,608600000.0,634200000.0,2018,1242800000.0
6453,tt1300854,Iron Man 3,Iron Man Three,2013,130.0,"Action,Adventure,Sci-Fi",7.2,692794.0,BV,409000000.0,805800000.0,2013,1214800000.0
35077,tt2293640,Minions,Minions,2015,91.0,"Adventure,Animation,Comedy",6.4,193917.0,Uni.,336000000.0,823400000.0,2015,1159400000.0


### Duplicate Titles in Data

In [65]:
df_without_nan.original_title.duplicated().sum()

156

In [66]:
# Set keep=False to view all duplicate rows to determine which record to keep

df_without_nan.loc[df_without_nan.original_title.duplicated(keep=False), :]


Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year,total_gross
6647,tt1323045,Frozen,Frozen,2010,93.00,"Adventure,Drama,Sport",6.20,62311.00,BV,400700000.00,875700000.00,2013,1276400000.00
35107,tt2294629,Frozen,Frozen,2013,102.00,"Adventure,Animation,Comedy",7.50,516998.00,BV,400700000.00,875700000.00,2013,1276400000.00
43781,tt2608638,Inside Out,Inside Out,2013,75.00,"Biography,Documentary,History",7.50,60.00,BV,356500000.00,501100000.00,2015,857600000.00
27024,tt2071483,Inside Out,Inside Out,2011,59.00,Family,7.30,15.00,BV,356500000.00,501100000.00,2015,857600000.00
28269,tt2096673,Inside Out,Inside Out,2015,95.00,"Adventure,Animation,Comedy",8.20,536181.00,BV,356500000.00,501100000.00,2015,857600000.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
133711,tt8396182,Aurora,Aurora,2018,98.00,Drama,6.30,7.00,CGld,5700.00,5100.00,2011,10800.00
62665,tt3603470,Aurora,Aurora,2014,83.00,Drama,7.00,87.00,CGld,5700.00,5100.00,2011,10800.00
135258,tt8553606,Aurora,Aurora,2019,106.00,"Comedy,Drama,Romance",7.50,200.00,CGld,5700.00,5100.00,2011,10800.00
7292,tt1403047,Aurora,Aurora,2010,181.00,Drama,6.70,1398.00,CGld,5700.00,5100.00,2011,10800.00
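
Because the gross figures are identical within each group of duplicates, only one record per title can be the film that actually earned them. The sketch below shows one possible rule for picking that record. The rule is an assumption, not the notebook's method: prefer the record whose `start_year` equals the gross `year`, and break ties by the highest `numvotes`. The sample reuses a few rows from the output above and deduplicates on `title`.

```python
import pandas as pd

# Sample duplicates modeled on the rows above
dupes_sample = pd.DataFrame({
    'movie_id': ['tt1323045', 'tt2294629', 'tt2608638', 'tt2096673'],
    'title': ['Frozen', 'Frozen', 'Inside Out', 'Inside Out'],
    'start_year': [2010, 2013, 2013, 2015],
    'numvotes': [62311.0, 516998.0, 60.0, 536181.0],
    'year': [2013, 2013, 2015, 2015],
})

# Flag records whose IMDb start_year agrees with the box-office year
dupes_sample['year_match'] = dupes_sample['start_year'] == dupes_sample['year']

# Sort so the preferred record comes first within each title,
# then keep only that first record
deduped = (
    dupes_sample
    .sort_values(['year_match', 'numvotes'], ascending=False)
    .drop_duplicates(subset='title', keep='first')
    .drop(columns='year_match')
)
print(deduped)
```

A rule like this should be spot-checked by hand, since a re-release or a remake can share both a title and a year with an unrelated film.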


In [67]:
movie_budgets_df = pd.read_csv('data/tn.movie_budgets.csv.gz')

In [68]:
# Attempt to sort by worldwide_gross
# Note: the values are still strings such as "$997,921", so this sort is alphabetical, not numeric

budgets_sorted_df = movie_budgets_df.sort_values(by='worldwide_gross', ascending=False)
budgets_sorted_df

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
3737,38,"Aug 21, 2009",Fifty Dead Men Walking,"$10,000,000",$0,"$997,921"
3432,33,"Sep 30, 2005",Duma,"$12,000,000","$870,067","$994,790"
5062,63,"Apr 1, 2011",Insidious,"$1,500,000","$54,009,150","$99,870,886"
883,84,"Apr 2, 2004",Hellboy,"$60,000,000","$59,623,958","$99,823,958"
5613,14,"Mar 21, 1980",Mad Max,"$200,000","$8,750,000","$99,750,000"
...,...,...,...,...,...,...
5488,89,"Dec 31, 2014",The Sound and the Shadow,"$500,000",$0,$0
5487,88,"Dec 1, 2015",Brooklyn Bizarre,"$500,000",$0,$0
5486,87,"Aug 11, 2015",Alleluia! The Devil's Carnival,"$500,000",$0,$0
5485,86,"Jun 23, 2015",Crossroads,"$500,000",$0,$0


In [69]:
budgets_sorted_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5782 entries, 3737 to 4765
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 316.2+ KB
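
The monetary columns are stored as strings such as `"$997,921"`, so the sort above is alphabetical rather than numeric, which is why a film grossing under a million dollars appears first. The following is a minimal sketch, on a hand-built sample in the same string format, of stripping the currency formatting before sorting. The sample frame and the `money_cols` list are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample using the same string format as tn.movie_budgets
budgets_sample = pd.DataFrame({
    'movie': ['Duma', 'Insidious', 'Hellboy'],
    'production_budget': ['$12,000,000', '$1,500,000', '$60,000,000'],
    'worldwide_gross': ['$994,790', '$99,870,886', '$99,823,958'],
})

money_cols = ['production_budget', 'worldwide_gross']
for col in money_cols:
    # Remove '$' and ',' and convert the remaining digits to 64-bit integers
    budgets_sample[col] = (
        budgets_sample[col]
        .str.replace(r'[$,]', '', regex=True)
        .astype('int64')
    )

# With numeric values, sorting reflects the actual amounts
budgets_numeric_sorted = budgets_sample.sort_values('worldwide_gross', ascending=False)
print(budgets_numeric_sorted)
```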


## Working with Genres

In [70]:
df_without_nan

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year,total_gross
39010,tt2395427,Avengers: Age of Ultron,Avengers: Age of Ultron,2015,141.00,"Action,Adventure,Sci-Fi",7.30,665594.00,BV,459000000.00,946400000.00,2015,1405400000.00
19050,tt1825683,Black Panther,Black Panther,2018,134.00,"Action,Adventure,Sci-Fi",7.30,516148.00,BV,700100000.00,646900000.00,2018,1347000000.00
42223,tt2527336,Star Wars: The Last Jedi,Star Wars: Episode VIII - The Last Jedi,2017,152.00,"Action,Adventure,Fantasy",7.10,462903.00,BV,620200000.00,712400000.00,2017,1332600000.00
84414,tt4881806,Jurassic World: Fallen Kingdom,Jurassic World: Fallen Kingdom,2018,128.00,"Action,Adventure,Sci-Fi",6.20,219125.00,Uni.,417700000.00,891800000.00,2018,1309500000.00
6647,tt1323045,Frozen,Frozen,2010,93.00,"Adventure,Drama,Sport",6.20,62311.00,BV,400700000.00,875700000.00,2013,1276400000.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
133711,tt8396182,Aurora,Aurora,2018,98.00,Drama,6.30,7.00,CGld,5700.00,5100.00,2011,10800.00
62665,tt3603470,Aurora,Aurora,2014,83.00,Drama,7.00,87.00,CGld,5700.00,5100.00,2011,10800.00
135258,tt8553606,Aurora,Aurora,2019,106.00,"Comedy,Drama,Romance",7.50,200.00,CGld,5700.00,5100.00,2011,10800.00
7292,tt1403047,Aurora,Aurora,2010,181.00,Drama,6.70,1398.00,CGld,5700.00,5100.00,2011,10800.00


In [71]:
df_without_nan['list_of_genres'] = df_without_nan['genres'].str.split(',')
df_without_nan

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year,total_gross,list_of_genres
39010,tt2395427,Avengers: Age of Ultron,Avengers: Age of Ultron,2015,141.00,"Action,Adventure,Sci-Fi",7.30,665594.00,BV,459000000.00,946400000.00,2015,1405400000.00,"[Action, Adventure, Sci-Fi]"
19050,tt1825683,Black Panther,Black Panther,2018,134.00,"Action,Adventure,Sci-Fi",7.30,516148.00,BV,700100000.00,646900000.00,2018,1347000000.00,"[Action, Adventure, Sci-Fi]"
42223,tt2527336,Star Wars: The Last Jedi,Star Wars: Episode VIII - The Last Jedi,2017,152.00,"Action,Adventure,Fantasy",7.10,462903.00,BV,620200000.00,712400000.00,2017,1332600000.00,"[Action, Adventure, Fantasy]"
84414,tt4881806,Jurassic World: Fallen Kingdom,Jurassic World: Fallen Kingdom,2018,128.00,"Action,Adventure,Sci-Fi",6.20,219125.00,Uni.,417700000.00,891800000.00,2018,1309500000.00,"[Action, Adventure, Sci-Fi]"
6647,tt1323045,Frozen,Frozen,2010,93.00,"Adventure,Drama,Sport",6.20,62311.00,BV,400700000.00,875700000.00,2013,1276400000.00,"[Adventure, Drama, Sport]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133711,tt8396182,Aurora,Aurora,2018,98.00,Drama,6.30,7.00,CGld,5700.00,5100.00,2011,10800.00,[Drama]
62665,tt3603470,Aurora,Aurora,2014,83.00,Drama,7.00,87.00,CGld,5700.00,5100.00,2011,10800.00,[Drama]
135258,tt8553606,Aurora,Aurora,2019,106.00,"Comedy,Drama,Romance",7.50,200.00,CGld,5700.00,5100.00,2011,10800.00,"[Comedy, Drama, Romance]"
7292,tt1403047,Aurora,Aurora,2010,181.00,Drama,6.70,1398.00,CGld,5700.00,5100.00,2011,10800.00,[Drama]


In [72]:
genres_df = df_without_nan.explode('list_of_genres')
genres_df

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year,total_gross,list_of_genres
39010,tt2395427,Avengers: Age of Ultron,Avengers: Age of Ultron,2015,141.00,"Action,Adventure,Sci-Fi",7.30,665594.00,BV,459000000.00,946400000.00,2015,1405400000.00,Action
39010,tt2395427,Avengers: Age of Ultron,Avengers: Age of Ultron,2015,141.00,"Action,Adventure,Sci-Fi",7.30,665594.00,BV,459000000.00,946400000.00,2015,1405400000.00,Adventure
39010,tt2395427,Avengers: Age of Ultron,Avengers: Age of Ultron,2015,141.00,"Action,Adventure,Sci-Fi",7.30,665594.00,BV,459000000.00,946400000.00,2015,1405400000.00,Sci-Fi
19050,tt1825683,Black Panther,Black Panther,2018,134.00,"Action,Adventure,Sci-Fi",7.30,516148.00,BV,700100000.00,646900000.00,2018,1347000000.00,Action
19050,tt1825683,Black Panther,Black Panther,2018,134.00,"Action,Adventure,Sci-Fi",7.30,516148.00,BV,700100000.00,646900000.00,2018,1347000000.00,Adventure
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135258,tt8553606,Aurora,Aurora,2019,106.00,"Comedy,Drama,Romance",7.50,200.00,CGld,5700.00,5100.00,2011,10800.00,Drama
135258,tt8553606,Aurora,Aurora,2019,106.00,"Comedy,Drama,Romance",7.50,200.00,CGld,5700.00,5100.00,2011,10800.00,Romance
7292,tt1403047,Aurora,Aurora,2010,181.00,Drama,6.70,1398.00,CGld,5700.00,5100.00,2011,10800.00,Drama
137581,tt8821182,Aurora,Aurora,2018,110.00,"Horror,Thriller",4.30,298.00,CGld,5700.00,5100.00,2011,10800.00,Horror


### Groupby List of Genres

In [73]:
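# numeric_only=True restricts the sum to numeric columns, matching the output below
genres_grouped = genres_df.groupby('list_of_genres').sum(numeric_only=True)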
genres_grouped

list_of_genres,start_year,runtime_minutes,averagerating,numvotes,domestic_gross,foreign_gross,year,total_gross
Action,922381,52678.0,2935.4,76340878.0,37674384196.0,67760212558.9,922408,105434596754.9
Adventure,741176,40980.0,2399.5,68114087.0,41532854795.0,77785294086.9,741217,119318148881.9
Animation,245687,11687.0,813.7,12740525.0,13214310498.0,25163917999.0,245725,38378228497.0
Biography,322286,18567.0,1122.9,15478966.0,5538161799.0,7251920400.0,322315,12790082199.0
Comedy,1206035,62449.0,3761.8,49283963.0,30759544795.0,45497069797.0,1206125,76256614592.0
Crime,487224,26938.0,1573.7,25895312.0,8908928700.0,9866646072.0,487263,18775574772.0
Documentary,165104,6836.0,580.2,592530.0,2659038899.0,3161341500.0,165083,5820380399.0
Drama,1874503,104288.0,6189.4,71723014.0,26720567297.0,39279903592.0,1874686,66000470889.0
Family,169102,8590.0,516.8,5612613.0,5446805300.0,8284801300.0,169095,13731606600.0
Fantasy,257743,14210.0,795.2,17202422.0,9010959899.0,18440349700.0,257763,27451309599.0


In [74]:
genres_grouped.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22 entries, Action to Western
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   start_year       22 non-null     int64  
 1   runtime_minutes  22 non-null     float64
 2   averagerating    22 non-null     float64
 3   numvotes         22 non-null     float64
 4   domestic_gross   22 non-null     float64
 5   foreign_gross    22 non-null     float64
 6   year             22 non-null     int64  
 7   total_gross      22 non-null     float64
dtypes: float64(6), int64(2)
memory usage: 1.5+ KB
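
Summing every numeric column produces totals such as the sum of `start_year` or `averagerating`, which have no useful interpretation. The sketch below shows an alternative on made-up exploded rows that aggregates each column with a statistic suited to it. The choice of count, mean, sum and median is an assumption about what would be useful here, and the sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical exploded rows: one row per (film, genre) pair
genres_sample = pd.DataFrame({
    'movie_id': ['tt1', 'tt1', 'tt2', 'tt3', 'tt3'],
    'list_of_genres': ['Action', 'Adventure', 'Action', 'Drama', 'Adventure'],
    'averagerating': [7.3, 7.3, 6.2, 6.0, 6.0],
    'total_gross': [1400.0, 1400.0, 1300.0, 10.0, 10.0],
})

# Named aggregation: pick a meaningful statistic for each column
genre_summary = (
    genres_sample
    .groupby('list_of_genres')
    .agg(
        film_count=('movie_id', 'nunique'),
        mean_rating=('averagerating', 'mean'),
        total_gross=('total_gross', 'sum'),
        median_gross=('total_gross', 'median'),
    )
    .sort_values('total_gross', ascending=False)
)
print(genre_summary)
```

Any per-genre totals computed this way still inherit the duplicate-title rows noted earlier, so the same gross can be counted under several unrelated films until those duplicates are resolved.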
