# Microsoft Movie Analysis

**Author:** [Gustavo Villagrana](mailto:gusvilla303@gmail.com)
***

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Microsoft-Movie-Analysis" data-toc-modified-id="Microsoft-Movie-Analysis-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Microsoft Movie Analysis</a></span><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Overview</a></span></li><li><span><a href="#Business-Problem" data-toc-modified-id="Business-Problem-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Business Problem</a></span></li><li><span><a href="#Data-Understanding" data-toc-modified-id="Data-Understanding-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Data Understanding</a></span></li><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Data Preparation</a></span></li><li><span><a href="#Data-Modeling" data-toc-modified-id="Data-Modeling-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Data Modeling</a></span></li><li><span><a href="#Evaluation" data-toc-modified-id="Evaluation-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Evaluation</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Cleaning-the-Data" data-toc-modified-id="Cleaning-the-Data-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Cleaning the Data</a></span><ul class="toc-item"><li><span><a href="#Title-Basics" data-toc-modified-id="Title-Basics-1.8.1"><span class="toc-item-num">1.8.1&nbsp;&nbsp;</span>Title Basics</a></span></li><li><span><a href="#Movie-Gross-data" data-toc-modified-id="Movie-Gross-data-1.8.2"><span class="toc-item-num">1.8.2&nbsp;&nbsp;</span>Movie Gross data</a></span></li></ul></li><li><span><a href="#Joining-the-Data" data-toc-modified-id="Joining-the-Data-1.9"><span class="toc-item-num">1.9&nbsp;&nbsp;</span>Joining the Data</a></span><ul 
class="toc-item"><li><span><a href="#Working-with-Genres" data-toc-modified-id="Working-with-Genres-1.9.1"><span class="toc-item-num">1.9.1&nbsp;&nbsp;</span>Working with Genres</a></span></li></ul></li></ul></li></ul></div>

## Overview
***
This project analyzes which types of films currently perform best at the box office, to help the head of Microsoft's new movie studio decide what kind of films to produce. Because this is Microsoft's first venture into filmmaking, the strongest option needs to be clearly identified to make the most of the investment.



## Business Problem
***
Microsoft wants to create a new movie studio to produce original content, but it needs help identifying which types of films perform best at the box office. Knowing which films perform best will let Microsoft direct its investment toward them and maximize profitability.



## Data Understanding


## Data Preparation


## Data Modeling


## Evaluation


## Conclusions

In [2]:
# Import standard packages
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Cleaning the Data

### Title Basics

In [3]:
# Title Basics data

title_basics_df = pd.read_csv('data/imdb.title.basics.csv.gz')
title_basics_df.rename(columns={'tconst': 'movie_id'}, inplace=True)
title_basics_df

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.00,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.00,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.00,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.00,"Comedy,Drama,Fantasy"
...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.00,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy
146142,tt9916730,6 Gunn,6 Gunn,2017,116.00,


In [4]:
title_basics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [5]:
# Understanding the data:
# imdb.title.ratings

title_ratings_df = pd.read_csv('data/imdb.title.ratings.csv.gz')
title_ratings_df.rename(columns={'tconst': 'movie_id'}, inplace=True)
title_ratings_df

Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.30,31
1,tt10384606,8.90,559
2,tt1042974,6.40,20
3,tt1043726,4.20,50352
4,tt1060240,6.50,21
...,...,...,...
73851,tt9805820,8.10,25
73852,tt9844256,7.50,24
73853,tt9851050,4.70,14
73854,tt9886934,7.00,5


In [6]:
title_ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [7]:
# Join Title Ratings onto Title Basics on movie_id (left join: titles without ratings are kept with NaN)

basics_with_ratings_df = title_basics_df.join(title_ratings_df.set_index('movie_id'), on='movie_id')
basics_with_ratings_df

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,tt0063540,Sunghursh,Sunghursh,2013,175.00,"Action,Crime,Drama",7.00,77.00
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.00,"Biography,Drama",7.20,43.00
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.00,Drama,6.90,4517.00
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.10,13.00
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.00,"Comedy,Drama,Fantasy",6.50,119.00
...,...,...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.00,Drama,,
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary,,
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy,,
146142,tt9916730,6 Gunn,6 Gunn,2017,116.00,,,
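
The join above keeps every title and leaves `averagerating` and `numvotes` empty where no rating exists. As a minimal sketch of the same step with built-in checks, `DataFrame.merge` can assert that `movie_id` is unique on both sides and record which rows found a match. The two small frames below are illustrative stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for title_basics_df and title_ratings_df (illustrative values only)
basics = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "primary_title": ["A", "B", "C"],
})
ratings = pd.DataFrame({
    "movie_id": ["tt1", "tt2"],
    "averagerating": [7.0, 6.1],
    "numvotes": [77, 13],
})

# A left merge keeps every row of basics; validate raises if movie_id repeats on either side
merged = basics.merge(
    ratings, on="movie_id", how="left", validate="one_to_one", indicator=True
)

# The _merge column shows which titles had no matching rating row
print(merged["_merge"].value_counts())
```

Because unmatched rows receive NaN, `numvotes` becomes a float column after the merge, which matches the `77.00`-style values in the output above.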


In [8]:
basics_with_ratings_df.rename(columns={'primary_title': 'title'}, inplace=True)
basics_with_ratings_df

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,tt0063540,Sunghursh,Sunghursh,2013,175.00,"Action,Crime,Drama",7.00,77.00
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.00,"Biography,Drama",7.20,43.00
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.00,Drama,6.90,4517.00
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.10,13.00
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.00,"Comedy,Drama,Fantasy",6.50,119.00
...,...,...,...,...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,2019,123.00,Drama,,
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,,Documentary,,
146141,tt9916706,Dankyavar Danka,Dankyavar Danka,2013,,Comedy,,
146142,tt9916730,6 Gunn,6 Gunn,2017,116.00,,,


### Movie Gross data

In [9]:
# Movie Gross data
# Note: domestic_gross is read as float, but foreign_gross is read as strings (object dtype)

movie_gross_df = pd.read_csv('data/bom.movie_gross.csv.gz')
movie_gross_df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.00,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.00,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.00,664300000,2010
3,Inception,WB,292600000.00,535700000,2010
4,Shrek Forever After,P/DW,238700000.00,513900000,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.00,,2018
3383,Edward II (2018 re-release),FM,4800.00,,2018
3384,El Pacto,Sony,2500.00,,2018
3385,The Swan,Synergetic,2400.00,,2018


In [10]:
movie_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [11]:
movie_gross_sorted_df = movie_gross_df.sort_values(by=['domestic_gross'], ascending=False)
movie_gross_sorted_df  

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1872,Star Wars: The Force Awakens,BV,936700000.00,1131.6,2015
3080,Black Panther,BV,700100000.00,646900000,2018
3079,Avengers: Infinity War,BV,678800000.00,1369.5,2018
1873,Jurassic World,Uni.,652300000.00,1019.4,2015
727,Marvel's The Avengers,BV,623400000.00,895500000,2012
...,...,...,...,...,...
1975,Surprise - Journey To The West,AR,,49600000,2015
2392,Finding Mr. Right 2,CL,,114700000,2016
2468,Solace,LGP,,22400000,2016
2595,Viral,W/Dim.,,552000,2016


In [12]:
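# Remove commas from the foreign_gross column
# (assign the result back instead of calling inplace=True on a column selection)
movie_gross_sorted_df['foreign_gross'] = movie_gross_sorted_df['foreign_gross'].replace(',', '', regex=True)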
movie_gross_sorted_df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1872,Star Wars: The Force Awakens,BV,936700000.00,1131.6,2015
3080,Black Panther,BV,700100000.00,646900000,2018
3079,Avengers: Infinity War,BV,678800000.00,1369.5,2018
1873,Jurassic World,Uni.,652300000.00,1019.4,2015
727,Marvel's The Avengers,BV,623400000.00,895500000,2012
...,...,...,...,...,...
1975,Surprise - Journey To The West,AR,,49600000,2015
2392,Finding Mr. Right 2,CL,,114700000,2016
2468,Solace,LGP,,22400000,2016
2595,Viral,W/Dim.,,552000,2016


In [13]:
movie_gross_sorted_df['foreign_gross'] = movie_gross_sorted_df['foreign_gross'].astype(float)
movie_gross_sorted_df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1872,Star Wars: The Force Awakens,BV,936700000.00,1131.60,2015
3080,Black Panther,BV,700100000.00,646900000.00,2018
3079,Avengers: Infinity War,BV,678800000.00,1369.50,2018
1873,Jurassic World,Uni.,652300000.00,1019.40,2015
727,Marvel's The Avengers,BV,623400000.00,895500000.00,2012
...,...,...,...,...,...
1975,Surprise - Journey To The West,AR,,49600000.00,2015
2392,Finding Mr. Right 2,CL,,114700000.00,2016
2468,Solace,LGP,,22400000.00,2016
2595,Viral,W/Dim.,,552000.00,2016


In [14]:
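# Heuristic: flag values whose string form has 6 or fewer characters, assuming they are expressed in millions.
# Caveat: this also flags NaN (the string 'nan') and any amounts that are genuinely small.
movie_gross_sorted_df['foreign_to_fix'] = [len(str(row)) <= 6 for row in movie_gross_sorted_df['foreign_gross']]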
movie_gross_sorted_df.loc[movie_gross_sorted_df['foreign_to_fix'], 'foreign_gross'] 


1872   1,131.60
3079   1,369.50
1873   1,019.40
1874   1,163.00
2760   1,010.00
         ...   
2321        nan
2757        nan
2756        nan
1476        nan
327    3,800.00
Name: foreign_gross, Length: 1372, dtype: float64

In [15]:
# Rescale the flagged values from millions of dollars to dollars
movie_gross_sorted_df.loc[movie_gross_sorted_df['foreign_to_fix'], 
                          'foreign_gross'] = movie_gross_sorted_df['foreign_gross'] * 1000000
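
The length-based flag also selects missing values and short amounts: the output of `In [14]` lists `nan` rows and `3,800.00`. One possible alternative is sketched below. It assumes that only the raw strings containing a thousands separator (the commas removed in `In [12]`) are expressed in millions; that assumption is not confirmed by the data source. The series `raw` and its values are illustrative.

```python
import pandas as pd

# Toy stand-in for the raw foreign_gross strings (illustrative values only)
raw = pd.Series(["652000000", "1,131.6", "9500", None, "1,369.5"])

# Assumption: values written with a thousands separator are in millions of dollars
in_millions = raw.str.contains(",", na=False)

# Strip commas and convert to numbers; missing entries become NaN
foreign_gross = pd.to_numeric(raw.str.replace(",", "", regex=False))

# Rescale only the flagged rows and leave every other value untouched
foreign_gross = foreign_gross.where(~in_millions, foreign_gross * 1_000_000)
print(foreign_gross)
```

Under this rule a small amount such as `9500` stays at 9,500 dollars instead of being scaled to billions.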

In [38]:
movie_gross_sorted_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3387 entries, 1872 to 2825
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   float64
 4   year            3387 non-null   int64  
 5   foreign_to_fix  3387 non-null   bool   
dtypes: bool(1), float64(2), int64(1), object(2)
memory usage: 322.1+ KB


In [16]:
movie_gross_sorted_df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,foreign_to_fix
1872,Star Wars: The Force Awakens,BV,936700000.00,1131600000.00,2015,True
3080,Black Panther,BV,700100000.00,646900000.00,2018,False
3079,Avengers: Infinity War,BV,678800000.00,1369500000.00,2018,True
1873,Jurassic World,Uni.,652300000.00,1019400000.00,2015,True
727,Marvel's The Avengers,BV,623400000.00,895500000.00,2012,False
...,...,...,...,...,...,...
1975,Surprise - Journey To The West,AR,,49600000.00,2015,False
2392,Finding Mr. Right 2,CL,,114700000.00,2016,False
2468,Solace,LGP,,22400000.00,2016,False
2595,Viral,W/Dim.,,552000.00,2016,False


In [65]:
#pd.merge(basics_with_ratings_df, movie_gross_sorted_df, left_on='original_title', right_on='title' )

## Joining the Data

In [63]:
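# Inner-join the box office data on title; a title shared by several films matches the same gross row more than once
basics_ratings_gross_df = basics_with_ratings_df.join(movie_gross_sorted_df.set_index('title'), how='inner', on='title')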
basics_ratings_gross_df

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year,foreign_to_fix
38,tt0315642,Wazir,Wazir,2016,103.00,"Action,Crime,Drama",7.10,15378.00,Relbig.,1100000.00,,2016,True
48,tt0337692,On the Road,On the Road,2012,124.00,"Adventure,Drama,Romance",6.10,37886.00,IFC,744000.00,8000000.00,2012,False
39490,tt2404548,On the Road,On the Road,2011,90.00,Drama,,,IFC,744000.00,8000000.00,2012,False
68078,tt3872966,On the Road,On the Road,2013,87.00,Documentary,,,IFC,744000.00,8000000.00,2012,False
76007,tt4339118,On the Road,On the Road,2014,89.00,Drama,6.00,6.00,IFC,744000.00,8000000.00,2012,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
133797,tt8404272,How Long Will I Love U,Chao shi kong tong ju,2018,101.00,Romance,6.50,607.00,WGUSA,747000.00,82100000.00,2018,False
134045,tt8427036,Helicopter Eela,Helicopter Eela,2018,135.00,Drama,5.40,673.00,Eros,72000.00,,2018,True
137854,tt8851262,Spring Fever,Spring Fever,2019,,"Comedy,Horror",,,Strand,10800.00,150000.00,2010,False
140171,tt9078374,Last Letter,"Ni hao, Zhihua",2018,114.00,"Drama,Romance",6.40,322.00,CL,181000.00,,2018,True
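
In the output above, several different IMDb entries named "On the Road" all receive the same IFC gross figures, because the join key is the title alone. A minimal sketch of one way to narrow the match is to join on title and release year together. The frames below are illustrative stand-ins, and exact year matching is only one possible rule.

```python
import pandas as pd

# Toy stand-ins for the IMDb and box office frames (illustrative values only)
imdb = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "title": ["On the Road", "On the Road", "Wazir"],
    "start_year": [2012, 2014, 2016],
})
gross = pd.DataFrame({
    "title": ["On the Road", "Wazir"],
    "year": [2012, 2016],
    "domestic_gross": [744000.0, 1100000.0],
})

# Matching on title alone repeats the gross row for every film with that name
by_title = imdb.merge(gross, on="title", how="inner")

# Matching on title and year keeps only the film released in that year
by_title_year = imdb.merge(
    gross, left_on=["title", "start_year"], right_on=["title", "year"], how="inner"
)

print(len(by_title), len(by_title_year))
```

The trade-off is that the two sources do not always agree on the year (later in this notebook, Cirkus Columbia appears with `start_year` 2010 and `year` 2012), so an exact match on year would also drop some genuine pairs.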


In [18]:
basics_ratings_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 146146 entries, 0 to 146143
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146146 non-null  object 
 1   title            146146 non-null  object 
 2   original_title   146125 non-null  object 
 3   start_year       146146 non-null  int64  
 4   runtime_minutes  114407 non-null  float64
 5   genres           140738 non-null  object 
 6   averagerating    73858 non-null   float64
 7   numvotes         73858 non-null   float64
 8   studio           3363 non-null    object 
 9   domestic_gross   3342 non-null    float64
 10  foreign_gross    2043 non-null    float64
 11  year             3366 non-null    float64
 12  foreign_to_fix   3366 non-null    object 
dtypes: float64(6), int64(1), object(6)
memory usage: 15.6+ MB


In [31]:
df_without_nan = basics_ratings_gross_df.dropna()
df_without_nan

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year,foreign_to_fix
48,tt0337692,On the Road,On the Road,2012,124.00,"Adventure,Drama,Romance",6.10,37886.00,IFC,744000.00,8000000.00,2012.00,False
54,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.00,"Adventure,Comedy,Drama",7.30,275300.00,Fox,58200000.00,129900000.00,2013.00,False
58,tt0365907,A Walk Among the Tombstones,A Walk Among the Tombstones,2014,114.00,"Action,Crime,Drama",6.50,105116.00,Uni.,26300000.00,26900000.00,2014.00,False
60,tt0369610,Jurassic World,Jurassic World,2015,124.00,"Action,Adventure,Sci-Fi",7.00,539338.00,Uni.,652300000.00,1019400000.00,2015.00,True
61,tt0372538,Spy,Spy,2011,110.00,"Action,Crime,Drama",6.60,78.00,Fox,110800000.00,124800000.00,2015.00,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
140826,tt9151704,Burn the Stage: The Movie,Burn the Stage: The Movie,2018,84.00,"Documentary,Music",8.80,2067.00,Trafalgar,4200000.00,16100000.00,2018.00,False
141374,tt9225192,Unstoppable,Seongnan hwangso,2018,116.00,"Action,Crime",6.50,576.00,Fox,81600000.00,86200000.00,2010.00,False
142617,tt9392532,Neighbors,Neighbors,2018,90.00,"Comedy,Drama",7.60,18.00,Uni.,150200000.00,120500000.00,2014.00,False
142938,tt9447594,The Gambler,The Gambler,2019,121.00,"Action,Sci-Fi,Thriller",6.10,10.00,Par.,33700000.00,5600000.00,2014.00,False


In [33]:
df_without_nan['total_gross'] = df_without_nan['domestic_gross'] + df_without_nan['foreign_gross']
df_without_nan

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_without_nan['total_gross'] = df_without_nan['domestic_gross'] + df_without_nan['foreign_gross']


Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year,foreign_to_fix,total_gross
48,tt0337692,On the Road,On the Road,2012,124.00,"Adventure,Drama,Romance",6.10,37886.00,IFC,744000.00,8000000.00,2012.00,False,8744000.00
54,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.00,"Adventure,Comedy,Drama",7.30,275300.00,Fox,58200000.00,129900000.00,2013.00,False,188100000.00
58,tt0365907,A Walk Among the Tombstones,A Walk Among the Tombstones,2014,114.00,"Action,Crime,Drama",6.50,105116.00,Uni.,26300000.00,26900000.00,2014.00,False,53200000.00
60,tt0369610,Jurassic World,Jurassic World,2015,124.00,"Action,Adventure,Sci-Fi",7.00,539338.00,Uni.,652300000.00,1019400000.00,2015.00,True,1671700000.00
61,tt0372538,Spy,Spy,2011,110.00,"Action,Crime,Drama",6.60,78.00,Fox,110800000.00,124800000.00,2015.00,False,235600000.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140826,tt9151704,Burn the Stage: The Movie,Burn the Stage: The Movie,2018,84.00,"Documentary,Music",8.80,2067.00,Trafalgar,4200000.00,16100000.00,2018.00,False,20300000.00
141374,tt9225192,Unstoppable,Seongnan hwangso,2018,116.00,"Action,Crime",6.50,576.00,Fox,81600000.00,86200000.00,2010.00,False,167800000.00
142617,tt9392532,Neighbors,Neighbors,2018,90.00,"Comedy,Drama",7.60,18.00,Uni.,150200000.00,120500000.00,2014.00,False,270700000.00
142938,tt9447594,The Gambler,The Gambler,2019,121.00,"Action,Sci-Fi,Thriller",6.10,10.00,Par.,33700000.00,5600000.00,2014.00,False,39300000.00
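
The warning printed above appears because `df_without_nan` is derived from another DataFrame, and pandas cannot tell whether the new column is meant for the original or for the derived frame. A minimal sketch of the usual remedy is to take an explicit copy before adding columns. The frame `joined` is an illustrative stand-in for `basics_ratings_gross_df`.

```python
import pandas as pd

# Toy stand-in for basics_ratings_gross_df (illustrative values only)
joined = pd.DataFrame({
    "title": ["A", "B", "C"],
    "domestic_gross": [100.0, None, 50.0],
    "foreign_gross": [200.0, 300.0, 25.0],
})

# .copy() makes the filtered frame independent, so adding a column is unambiguous
clean = joined.dropna().copy()
clean["total_gross"] = clean["domestic_gross"] + clean["foreign_gross"]
print(clean)
```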


In [37]:
df_without_nan.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1767 entries, 48 to 146078
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         1767 non-null   object 
 1   title            1767 non-null   object 
 2   original_title   1767 non-null   object 
 3   start_year       1767 non-null   int64  
 4   runtime_minutes  1767 non-null   float64
 5   genres           1767 non-null   object 
 6   averagerating    1767 non-null   float64
 7   numvotes         1767 non-null   float64
 8   studio           1767 non-null   object 
 9   domestic_gross   1767 non-null   float64
 10  foreign_gross    1767 non-null   float64
 11  year             1767 non-null   float64
 12  foreign_to_fix   1767 non-null   object 
 13  total_gross      1767 non-null   float64
dtypes: float64(7), int64(1), object(6)
memory usage: 207.1+ KB


In [61]:
# Sort by total_gross in descending order

df_without_nan = df_without_nan.sort_values(by='total_gross', ascending=False)
df_without_nan.head(10)

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year,foreign_to_fix,total_gross
7416,tt1417067,Cirkus Columbia,Cirkus Columbia,2010,113.0,"Comedy,Drama,Romance",7.3,2336.0,Strand,3500.0,9500000000.0,2012.0,True,9500003500.0
71753,tt4096620,Troublemakers: The Story of Land Art,Troublemakers: The Story of Land Art,2015,72.0,"Biography,Documentary,History",6.5,108.0,FRun,29500.0,9100000000.0,2016.0,True,9100029500.0
7053,tt1373156,Karthik Calling Karthik,Karthik Calling Karthik,2010,135.0,"Drama,Mystery,Thriller",7.0,9257.0,Eros,286000.0,7100000000.0,2010.0,True,7100286000.0
40404,tt2442772,Bluebeard,Barbazul,2012,98.0,Horror,6.1,19.0,Strand,33500.0,5200000000.0,2010.0,True,5200033500.0
112563,tt6599340,Bluebeard,Haebing,2017,117.0,Thriller,6.4,1269.0,Strand,33500.0,5200000000.0,2010.0,True,5200033500.0
62665,tt3603470,Aurora,Aurora,2014,83.0,Drama,7.0,87.0,CGld,5700.0,5100000000.0,2011.0,True,5100005700.0
137581,tt8821182,Aurora,Aurora,2018,110.0,"Horror,Thriller",4.3,298.0,CGld,5700.0,5100000000.0,2011.0,True,5100005700.0
135258,tt8553606,Aurora,Aurora,2019,106.0,"Comedy,Drama,Romance",7.5,200.0,CGld,5700.0,5100000000.0,2011.0,True,5100005700.0
133711,tt8396182,Aurora,Aurora,2018,98.0,Drama,6.3,7.0,CGld,5700.0,5100000000.0,2011.0,True,5100005700.0
7292,tt1403047,Aurora,Aurora,2010,181.0,Drama,6.7,1398.0,CGld,5700.0,5100000000.0,2011.0,True,5100005700.0


In [58]:
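# groupby alone returns a lazy GroupBy object; an aggregation function is still needed.
# TODO: check for duplicated titles first (e.g. with .duplicated()).
grouped_df = df_without_nan.groupby('title')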
grouped_df

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe6b2fe17f0>
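
A `groupby` without an aggregation only prepares the groups, which is why the cell above displays an object rather than a table. The sketch below counts rows per title and then removes the repeats. Keeping the entry with the most votes is an assumption made for illustration, not a rule taken from this analysis, and `films` is a toy stand-in for `df_without_nan`.

```python
import pandas as pd

# Toy stand-in for df_without_nan (illustrative values only)
films = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "title": ["Aurora", "Aurora", "Inhale"],
    "numvotes": [87.0, 1398.0, 6006.0],
    "total_gross": [5100.0, 5100.0, 55100.0],
})

# Number of joined rows that share each title
rows_per_title = films.groupby("title").size()

# Assumed rule: keep the IMDb entry with the most votes for each title
deduped = (
    films.sort_values("numvotes", ascending=False)
         .drop_duplicates(subset="title", keep="first")
)

print(rows_per_title)
print(deduped)
```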

In [43]:
movie_budgets_df = pd.read_csv('data/tn.movie_budgets.csv.gz')

In [59]:
# Attempt to sort by worldwide_gross
# (the column holds '$'-formatted strings, so the resulting order is lexicographic, not numeric)

budgets_sorted_df = movie_budgets_df.sort_values(by='worldwide_gross', ascending=False)
budgets_sorted_df

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
3737,38,"Aug 21, 2009",Fifty Dead Men Walking,"$10,000,000",$0,"$997,921"
3432,33,"Sep 30, 2005",Duma,"$12,000,000","$870,067","$994,790"
5062,63,"Apr 1, 2011",Insidious,"$1,500,000","$54,009,150","$99,870,886"
883,84,"Apr 2, 2004",Hellboy,"$60,000,000","$59,623,958","$99,823,958"
5613,14,"Mar 21, 1980",Mad Max,"$200,000","$8,750,000","$99,750,000"
...,...,...,...,...,...,...
5488,89,"Dec 31, 2014",The Sound and the Shadow,"$500,000",$0,$0
5487,88,"Dec 1, 2015",Brooklyn Bizarre,"$500,000",$0,$0
5486,87,"Aug 11, 2015",Alleluia! The Devil's Carnival,"$500,000",$0,$0
5485,86,"Jun 23, 2015",Crossroads,"$500,000",$0,$0
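
The sorted output above starts with `$997,921` and `$994,790` ahead of `$99,870,886`, which shows that the column is being compared as text. A minimal sketch of a numeric sort follows, using three rows copied from that output. The column name `worldwide_gross_usd` is a hypothetical name introduced here.

```python
import pandas as pd

# Three rows taken from the output above
budgets = pd.DataFrame({
    "movie": ["Duma", "Insidious", "Mad Max"],
    "worldwide_gross": ["$994,790", "$99,870,886", "$99,750,000"],
})

# Strip the "$" and "," characters, then convert the strings to numbers
budgets["worldwide_gross_usd"] = pd.to_numeric(
    budgets["worldwide_gross"].str.replace(r"[$,]", "", regex=True)
)

# Sorting the numeric column now orders films by the actual amount
ranked = budgets.sort_values("worldwide_gross_usd", ascending=False)
print(ranked)
```

The same cleaning would apply to `production_budget` and `domestic_gross`, which use the same string format.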


### Working with Genres

In [66]:
df_without_nan

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year,foreign_to_fix,total_gross
7416,tt1417067,Cirkus Columbia,Cirkus Columbia,2010,113.00,"Comedy,Drama,Romance",7.30,2336.00,Strand,3500.00,9500000000.00,2012.00,True,9500003500.00
71753,tt4096620,Troublemakers: The Story of Land Art,Troublemakers: The Story of Land Art,2015,72.00,"Biography,Documentary,History",6.50,108.00,FRun,29500.00,9100000000.00,2016.00,True,9100029500.00
7053,tt1373156,Karthik Calling Karthik,Karthik Calling Karthik,2010,135.00,"Drama,Mystery,Thriller",7.00,9257.00,Eros,286000.00,7100000000.00,2010.00,True,7100286000.00
40404,tt2442772,Bluebeard,Barbazul,2012,98.00,Horror,6.10,19.00,Strand,33500.00,5200000000.00,2010.00,True,5200033500.00
112563,tt6599340,Bluebeard,Haebing,2017,117.00,Thriller,6.40,1269.00,Strand,33500.00,5200000000.00,2010.00,True,5200033500.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5836,tt1196340,Inhale,Inhale,2010,83.00,"Drama,Thriller",6.60,6006.00,IFC,4100.00,51000.00,2010.00,False,55100.00
55378,tt3244466,Love Thy Nature,Love Thy Nature,2014,76.00,Documentary,6.90,112.00,ITL,41100.00,11800.00,2015.00,False,52900.00
9424,tt1558250,GasLand,GasLand,2010,107.00,Documentary,7.70,9940.00,WOW,30800.00,18600.00,2010.00,False,49400.00
11261,tt1625150,North Sea Texas,"Noordzee, Texas",2011,98.00,"Drama,Romance",7.20,7476.00,Strand,28300.00,16700.00,2012.00,False,45000.00


In [67]:
x = df_without_nan.loc[7416, 'genres']
x

'Comedy,Drama,Romance'

In [68]:
y = x.split(',')
y

['Comedy', 'Drama', 'Romance']

In [72]:
# Split each comma-separated genres string into a list of genres
df_without_nan['list_of_genres'] = df_without_nan['genres'].str.split(',')
df_without_nan

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year,foreign_to_fix,total_gross,list_of_genres
7416,tt1417067,Cirkus Columbia,Cirkus Columbia,2010,113.00,"Comedy,Drama,Romance",7.30,2336.00,Strand,3500.00,9500000000.00,2012.00,True,9500003500.00,"[Comedy, Drama, Romance]"
71753,tt4096620,Troublemakers: The Story of Land Art,Troublemakers: The Story of Land Art,2015,72.00,"Biography,Documentary,History",6.50,108.00,FRun,29500.00,9100000000.00,2016.00,True,9100029500.00,"[Biography, Documentary, History]"
7053,tt1373156,Karthik Calling Karthik,Karthik Calling Karthik,2010,135.00,"Drama,Mystery,Thriller",7.00,9257.00,Eros,286000.00,7100000000.00,2010.00,True,7100286000.00,"[Drama, Mystery, Thriller]"
40404,tt2442772,Bluebeard,Barbazul,2012,98.00,Horror,6.10,19.00,Strand,33500.00,5200000000.00,2010.00,True,5200033500.00,[Horror]
112563,tt6599340,Bluebeard,Haebing,2017,117.00,Thriller,6.40,1269.00,Strand,33500.00,5200000000.00,2010.00,True,5200033500.00,[Thriller]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5836,tt1196340,Inhale,Inhale,2010,83.00,"Drama,Thriller",6.60,6006.00,IFC,4100.00,51000.00,2010.00,False,55100.00,"[Drama, Thriller]"
55378,tt3244466,Love Thy Nature,Love Thy Nature,2014,76.00,Documentary,6.90,112.00,ITL,41100.00,11800.00,2015.00,False,52900.00,[Documentary]
9424,tt1558250,GasLand,GasLand,2010,107.00,Documentary,7.70,9940.00,WOW,30800.00,18600.00,2010.00,False,49400.00,[Documentary]
11261,tt1625150,North Sea Texas,"Noordzee, Texas",2011,98.00,"Drama,Romance",7.20,7476.00,Strand,28300.00,16700.00,2012.00,False,45000.00,"[Drama, Romance]"


In [74]:
genre_df = df_without_nan.explode('list_of_genres')
genre_df

Unnamed: 0,movie_id,title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,studio,domestic_gross,foreign_gross,year,foreign_to_fix,total_gross,list_of_genres
7416,tt1417067,Cirkus Columbia,Cirkus Columbia,2010,113.00,"Comedy,Drama,Romance",7.30,2336.00,Strand,3500.00,9500000000.00,2012.00,True,9500003500.00,Comedy
7416,tt1417067,Cirkus Columbia,Cirkus Columbia,2010,113.00,"Comedy,Drama,Romance",7.30,2336.00,Strand,3500.00,9500000000.00,2012.00,True,9500003500.00,Drama
7416,tt1417067,Cirkus Columbia,Cirkus Columbia,2010,113.00,"Comedy,Drama,Romance",7.30,2336.00,Strand,3500.00,9500000000.00,2012.00,True,9500003500.00,Romance
71753,tt4096620,Troublemakers: The Story of Land Art,Troublemakers: The Story of Land Art,2015,72.00,"Biography,Documentary,History",6.50,108.00,FRun,29500.00,9100000000.00,2016.00,True,9100029500.00,Biography
71753,tt4096620,Troublemakers: The Story of Land Art,Troublemakers: The Story of Land Art,2015,72.00,"Biography,Documentary,History",6.50,108.00,FRun,29500.00,9100000000.00,2016.00,True,9100029500.00,Documentary
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55378,tt3244466,Love Thy Nature,Love Thy Nature,2014,76.00,Documentary,6.90,112.00,ITL,41100.00,11800.00,2015.00,False,52900.00,Documentary
9424,tt1558250,GasLand,GasLand,2010,107.00,Documentary,7.70,9940.00,WOW,30800.00,18600.00,2010.00,False,49400.00,Documentary
11261,tt1625150,North Sea Texas,"Noordzee, Texas",2011,98.00,"Drama,Romance",7.20,7476.00,Strand,28300.00,16700.00,2012.00,False,45000.00,Drama
11261,tt1625150,North Sea Texas,"Noordzee, Texas",2011,98.00,"Drama,Romance",7.20,7476.00,Strand,28300.00,16700.00,2012.00,False,45000.00,Romance
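
With one row per film and genre, the exploded frame can be summarized by genre. The sketch below shows one possible summary on three rows copied from the output above. Using the median of `total_gross` is an assumption chosen because it is less sensitive to extreme values than the mean, and the column name `genre` is hypothetical.

```python
import pandas as pd

# Three rows taken from the output above
films = pd.DataFrame({
    "title": ["Inhale", "GasLand", "North Sea Texas"],
    "genres": ["Drama,Thriller", "Documentary", "Drama,Romance"],
    "total_gross": [55100.0, 49400.0, 45000.0],
})

# Split and explode so each film contributes one row per genre,
# then count films and take the median gross within each genre
by_genre = (
    films.assign(genre=films["genres"].str.split(","))
         .explode("genre")
         .groupby("genre")["total_gross"]
         .agg(["count", "median"])
         .sort_values("median", ascending=False)
)

print(by_genre)
```

A film with several genres is counted once in each of them, so the per-genre counts add up to more than the number of films.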
