# Genre as a Predictor of Profit
## A Chi-Squared Analysis of Various Movie Genres as Potential Predictors of Profitability

### The Business Question
Does the genre of a movie have any association with the movie's profitability?

### The Datasets

### The Methods

#### Import Pandas

In [1]:
import pandas as pd

#### Import Relevant Datasets

The relevant datasets for our analysis were the tn.movie_budgets.csv and tmdb.movies.csv files. 

In [2]:
budgets = pd.read_csv("data/tn.movie_budgets.csv")
tmdb = pd.read_csv("data/tmdb.movies.csv", index_col = 0)

#### Review the Contents of the Datasets and Areas that Require Cleaning

Before running our analysis, we needed to review the contents of the datasets, isolate relevant columns, and clean data as needed. 

##### Budgets Dataframe

In [3]:
budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


From this dataframe, we needed the following columns:
- Movie (for joining with other dataframes)
- Production Budget and Worldwide Gross (for calculating profit)

In [4]:
cols_to_keep = ['movie','production_budget','worldwide_gross']
budgets_relevant = budgets[cols_to_keep]
budgets_relevant.head()

Unnamed: 0,movie,production_budget,worldwide_gross
0,Avatar,"$425,000,000","$2,776,345,279"
1,Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$1,045,663,875"
2,Dark Phoenix,"$350,000,000","$149,762,350"
3,Avengers: Age of Ultron,"$330,600,000","$1,403,013,963"
4,Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$1,316,721,747"
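Selecting a list of columns produces a new dataframe, but pandas may still track it as derived from budgets, so assigning to its columns later can emit a SettingWithCopyWarning. One way to avoid this is to take an explicit copy at selection time. A minimal sketch, assuming made-up toy data (the names raw and subset are hypothetical and not part of the analysis):

```python
import pandas as pd

# Hypothetical toy data standing in for the budgets dataframe.
raw = pd.DataFrame({
    "id": [1, 2],
    "movie": ["Alpha", "Beta"],
    "production_budget": ["$10", "$20"],
})

# .copy() makes the selection an independent dataframe.
subset = raw[["movie", "production_budget"]].copy()

# Assigning to the copy leaves the original dataframe untouched.
subset["production_budget"] = pd.to_numeric(
    subset["production_budget"].str.replace("$", "", regex=False)
)

print(subset)
print(raw["production_budget"].tolist())
```

The original frame keeps its string values, while the copy holds the cleaned numbers.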


We also noted that the production budget and worldwide gross columns contained strings (as evidenced by the dollar signs and commas alongside the digits), so they needed to be cleaned and cast as integers before they could be used to calculate profit. Before doing any further cleaning, however, we looked for null values and duplicates so that we would not perform unnecessary calculations.

In [5]:
budgets_relevant.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   movie              5782 non-null   object
 1   production_budget  5782 non-null   object
 2   worldwide_gross    5782 non-null   object
dtypes: object(3)
memory usage: 135.6+ KB


The dataframe's summary confirmed that the production budget and worldwide gross columns were stored as strings (the object dtype) and required cleaning. Furthermore, none of the three columns contained explicit null values.

In [6]:
budgets_relevant.duplicated().value_counts()

False    5782
dtype: int64

We also saw no evidence of duplicated rows in the dataframe.
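Note that duplicated() with no arguments compares entire rows. Since the movie column is later used as a join key, it can also be worth checking that column on its own. The sketch below uses made-up toy data (the frame toy and its values are hypothetical); whether the real dataset contains repeated titles is not shown here.

```python
import pandas as pd

# Hypothetical toy data: two different rows that share a title.
toy = pd.DataFrame({
    "movie": ["Alpha", "Alpha", "Beta"],
    "production_budget": [10, 20, 30],
})

# Full-row comparison: no two rows are identical.
full_row_dupes = int(toy.duplicated().sum())

# Key-only comparison: the second "Alpha" repeats an earlier title.
title_dupes = int(toy.duplicated(subset=["movie"]).sum())

print(full_row_dupes, title_dupes)
```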

After checking for nulls and duplicates, we began cleaning the budget and gross revenue columns so that we could eventually use them to calculate profit.

In [7]:
def dollar_to_numeric(column):
    # remove the literal "," and "$" characters; regex=False stops "$"
    # from being read as the end-of-string anchor
    column = column.str.replace(",", "", regex=False)
    column = column.str.replace("$", "", regex=False)

    # cast the cleaned strings to a numeric dtype (integers here)
    column = pd.to_numeric(column)

    return column

# work on an explicit copy so the assignments below do not trigger
# SettingWithCopyWarning
budgets_relevant = budgets_relevant.copy()
budgets_relevant['worldwide_gross'] = dollar_to_numeric(budgets_relevant['worldwide_gross'])
budgets_relevant['production_budget'] = dollar_to_numeric(budgets_relevant['production_budget'])

budgets_relevant.head()


Unnamed: 0,movie,production_budget,worldwide_gross
0,Avatar,425000000,2776345279
1,Pirates of the Caribbean: On Stranger Tides,410600000,1045663875
2,Dark Phoenix,350000000,149762350
3,Avengers: Age of Ultron,330600000,1403013963
4,Star Wars Ep. VIII: The Last Jedi,317000000,1316721747
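The cleaning step can be sanity-checked on a single string with plain Python. The sketch below also shows why treating "$" as a regular expression is risky: in a regex, "$" is the end-of-string anchor rather than a literal dollar sign. How the pandas string replacement treats a lone "$" has varied across versions, so passing regex=False there makes the intent explicit. The helper name parse_dollars is hypothetical.

```python
import re

def parse_dollars(text):
    """Convert a string such as "$425,000,000" to the integer 425000000."""
    return int(text.replace("$", "").replace(",", ""))

# As a regex, "$" matches the empty position at the end of the string,
# so the substitution removes nothing.
regex_result = re.sub("$", "", "$425,000,000")

# A literal replacement removes the dollar sign as intended.
literal_result = "$425,000,000".replace("$", "")

print(parse_dollars("$425,000,000"))
print(regex_result)
print(literal_result)
```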


After successfully casting the data as integers, we looked at the descriptive statistics for any obvious issues.

In [8]:
budgets_relevant['production_budget'].describe()

count    5.782000e+03
mean     3.158776e+07
std      4.181208e+07
min      1.100000e+03
25%      5.000000e+06
50%      1.700000e+07
75%      4.000000e+07
max      4.250000e+08
Name: production_budget, dtype: float64

In [9]:
budgets_relevant['worldwide_gross'].describe()

count    5.782000e+03
mean     9.148746e+07
std      1.747200e+08
min      0.000000e+00
25%      4.125415e+06
50%      2.798445e+07
75%      9.764584e+07
max      2.776345e+09
Name: worldwide_gross, dtype: float64

The worldwide gross column contained some zeroes. Since a true gross of zero is unlikely and these values more likely represent missing data, we checked how many rows were affected.

In [10]:
budgets_relevant['worldwide_gross'].value_counts()

0            367
8000000        9
7000000        6
2000000        6
4000000        4
            ... 
166000000      1
42843521       1
101173038      1
478595         1
12996          1
Name: worldwide_gross, Length: 5356, dtype: int64

Since only 367 of the 5,782 movies had a worldwide gross of zero, these rows were dropped.

In [11]:
budgets_relevant = budgets_relevant.loc[budgets_relevant['worldwide_gross'] > 0]
budgets_relevant['worldwide_gross'].describe()

count    5.415000e+03
mean     9.768800e+07
std      1.788591e+08
min      2.600000e+01
25%      7.004834e+06
50%      3.333987e+07
75%      1.044590e+08
max      2.776345e+09
Name: worldwide_gross, dtype: float64
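The effect of the filter can be checked with a little arithmetic on the counts reported in the outputs above (367 zero-gross rows out of 5,782). The variable names below are illustrative only.

```python
# Counts taken from the value_counts() and info() outputs above.
zero_rows = 367
total_rows = 5782

# Share of rows removed and number of rows that should remain.
share = zero_rows / total_rows
remaining_rows = total_rows - zero_rows

print(f"{share:.1%} of rows dropped, {remaining_rows} rows remain")
```

The remaining row count agrees with the count of 5.415000e+03 in the describe() output after filtering.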

In [12]:
# calculating total profit
budgets_relevant['total_profit'] =  budgets_relevant['worldwide_gross'] - budgets_relevant['production_budget']

# confirmation
budgets_relevant.head()

Unnamed: 0,movie,production_budget,worldwide_gross,total_profit
0,Avatar,425000000,2776345279,2351345279
1,Pirates of the Caribbean: On Stranger Tides,410600000,1045663875,635063875
2,Dark Phoenix,350000000,149762350,-200237650
3,Avengers: Age of Ultron,330600000,1403013963,1072413963
4,Star Wars Ep. VIII: The Last Jedi,317000000,1316721747,999721747


Knowing that we would eventually merge this dataframe with the TMDB dataframe, we also set the index to the column on which we wanted to merge.

In [13]:
budgets_relevant.set_index('movie', inplace = True)
budgets_relevant.head()

movie,production_budget,worldwide_gross,total_profit
Avatar,425000000,2776345279,2351345279
Pirates of the Caribbean: On Stranger Tides,410600000,1045663875,635063875
Dark Phoenix,350000000,149762350,-200237650
Avengers: Age of Ultron,330600000,1403013963,1072413963
Star Wars Ep. VIII: The Last Jedi,317000000,1316721747,999721747


##### TMDB Dataframe

In [14]:
tmdb.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


From this dataframe, we needed the following columns:
- Title
- Genre_ids

We started by dropping the irrelevant columns.

In [15]:
cols_to_keep = ['title','genre_ids']
tmdb_relevant = tmdb[cols_to_keep]
tmdb_relevant.head()

Unnamed: 0,title,genre_ids
0,Harry Potter and the Deathly Hallows: Part 1,"[12, 14, 10751]"
1,How to Train Your Dragon,"[14, 12, 16, 10751]"
2,Iron Man 2,"[12, 28, 878]"
3,Toy Story,"[16, 35, 10751]"
4,Inception,"[28, 878, 12]"


We noted that the genre_ids column appeared to contain lists of ids, each associated with a specific genre. We needed to clean this column and replace the ids with their genre names. However, we decided to wait until after the dummy columns were created, because renaming columns would be easier than replacing multiple numbers in every cell with their associated genres.

Instead, we moved on to locating null values and duplicates.

In [16]:
tmdb_relevant.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26517 entries, 0 to 26516
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      26517 non-null  object
 1   genre_ids  26517 non-null  object
dtypes: object(2)
memory usage: 621.5+ KB


There didn't appear to be any null values in the dataset.

In [17]:
tmdb_relevant.duplicated().value_counts()

False    25429
True      1088
dtype: int64

There were, however, 1,088 duplicate rows, which we needed to drop.

In [18]:
tmdb_relevant = tmdb_relevant.drop_duplicates()

In [19]:
tmdb_relevant.duplicated().value_counts()

False    25429
dtype: int64

After dropping these duplicate values, we set the movie titles as the index in preparation for merging the two dataframes.

In [20]:
tmdb_relevant = tmdb_relevant.set_index('title')
tmdb_relevant.head()

title,genre_ids
Harry Potter and the Deathly Hallows: Part 1,"[12, 14, 10751]"
How to Train Your Dragon,"[14, 12, 16, 10751]"
Iron Man 2,"[12, 28, 878]"
Toy Story,"[16, 35, 10751]"
Inception,"[28, 878, 12]"


#### Merging the Dataframes

In [21]:
budgets_and_tmdb = budgets_relevant.join(tmdb_relevant, how='inner')
budgets_and_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1992 entries, 10 Cloverfield Lane to xXx: Return of Xander Cage
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   production_budget  1992 non-null   int64 
 1   worldwide_gross    1992 non-null   int64 
 2   total_profit       1992 non-null   int64 
 3   genre_ids          1992 non-null   object
dtypes: int64(3), object(1)
memory usage: 77.8+ KB
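An inner join on the index keeps only titles present in both dataframes, which is why the merged frame is much smaller than either input. It also repeats a row once for every matching key on the other side. A minimal sketch on made-up toy data (the frames budgets_toy and genres_toy are hypothetical) illustrates both effects:

```python
import pandas as pd

# Hypothetical toy frames indexed by title, mirroring the real layout.
budgets_toy = pd.DataFrame(
    {"total_profit": [5, -3, 7]},
    index=pd.Index(["Alpha", "Beta", "Gamma"], name="movie"),
)
genres_toy = pd.DataFrame(
    {"genre_ids": ["[18]", "[35]", "[27]"]},
    index=pd.Index(["Alpha", "Alpha", "Delta"], name="title"),
)

# Only "Alpha" appears in both frames; it appears twice on the right,
# so the joined frame has two "Alpha" rows.
joined = budgets_toy.join(genres_toy, how="inner")

print(joined)
```

If repeated titles exist in the real data, the same row multiplication would apply to the merged dataframe, so the row count after a join is worth checking against expectations.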


#### Creating Dummy Columns for the Appropriate Genre

The genre_ids column is made up of strings. This meant that we needed to strip the non-numeric characters and isolate each genre id before making dummy columns. We needed to accomplish the following:
1. Remove brackets and whitespace
2. Split by commas
3. Create a new dataframe with genres as columns and cells containing binary values, with 1 indicating a relevant genre for that movie. 

For this section, we utilized this resource to create our dummy columns: https://stackoverflow.com/questions/29034928/pandas-convert-a-column-of-list-to-dummies
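The three steps listed above can be sketched in plain Python on two of the titles shown earlier. This is an illustration of the logic only, not the pandas implementation used in the analysis, and the helper name parse_genre_ids is hypothetical. The sketch returns an empty list for "[]" so that no blank genre is created; whether the real data contains such rows is an assumption.

```python
def parse_genre_ids(raw):
    """Turn a string such as "[16, 35, 10751]" into ["16", "35", "10751"]."""
    # Step 1: remove brackets and whitespace.
    cleaned = raw.strip("[]").replace(" ", "")
    # Step 2: split by commas (an empty string yields no ids).
    return cleaned.split(",") if cleaned else []

movies = {
    "Toy Story": "[16, 35, 10751]",
    "Inception": "[28, 878, 12]",
}
parsed = {title: parse_genre_ids(raw) for title, raw in movies.items()}

# Step 3: one binary indicator per genre id seen anywhere in the data.
all_ids = sorted({g for ids in parsed.values() for g in ids}, key=int)
dummies = {
    title: {g: int(g in ids) for g in all_ids}
    for title, ids in parsed.items()
}

print(all_ids)
print(dummies["Toy Story"])
```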

In [23]:
def create_dummy_cols(data, col):
    
    # remove [, ], and whitespace
    data[col] = data[col].str.strip("]")
    data[col] = data[col].str.strip("[")
    data[col] = data[col].str.replace(" ", "")
    
    # split genre ids by commas
    genre_ids = data[col].str.split(",")
    
    # create the binary dummy columns (groupby(level=0).sum() replaces the
    # deprecated sum(level=0))
    bin_genre_df = pd.get_dummies(genre_ids.apply(pd.Series).stack()).groupby(level=0).sum()
    budgets_and_genre_dummys = data.join(bin_genre_df, how='inner')
    
    # rename columns for genres
    budgets_and_genre_dummys.rename(columns = {'28' : 'Action', 
                                           '12' : 'Adventure',
                                          '16' : 'Animation',
                                          '35' : 'Comedy',
                                          '80' : 'Crime',
                                          '99' : 'Documentary',
                                          '18' : 'Drama',
                                          '10751' : 'Family',
                                          '14' : 'Fantasy',
                                          '36' : 'History',
                                          '27' : 'Horror',
                                          '10402' : 'Music',
                                          '9648' : 'Mystery',
                                          '10749' : 'Romance',
                                          '878' : 'SciFi',
                                          '10770' : 'TV',
                                          '53' : 'Thriller',
                                          '10752' : 'War',
                                          '37' : 'Western'}, inplace = True)
    return budgets_and_genre_dummys

budgets_and_genre_dummys = create_dummy_cols(budgets_and_tmdb,'genre_ids')
budgets_and_genre_dummys

Unnamed: 0,production_budget,worldwide_gross,total_profit,genre_ids,Unnamed: 5,Music,Romance,Family,War,TV,...,Horror,Action,Comedy,History,Western,Thriller,Crime,SciFi,Mystery,Documentary
10 Cloverfield Lane,5000000,108286422,103286422,"53,878,18",0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
10 Days in a Madhouse,12000000,14616,-11985384,18,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12 Strong,35000000,71118378,36118378,"10752,18,36,28",0,0,0,0,1,0,...,0,1,0,1,0,0,0,0,0,0
12 Years a Slave,20000000,181025343,161025343,"18,36",0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
127 Hours,18000000,60217171,42217171,"12,18,53",0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Zoolander 2,50000000,55348693,5348693,35,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
Zoom,35000000,12506188,-22493812,"16,35,18",0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
Zootopia,150000000,1019429616,869429616,"16,12,10751,35",0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,0,0
mother!,30000000,42531076,12531076,"18,27,9648",0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
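
As a possible alternative to apply(pd.Series).stack(), Series.explode can produce one row per movie and genre id before the dummy columns are built. This is a sketch on made-up toy data, not the method used in the analysis above, and it assumes each title appears only once in the index; repeated titles would have their genres summed together by the groupby.

```python
import pandas as pd

# Hypothetical toy data: already-split genre id lists, indexed by title.
genre_lists = pd.Series(
    [["16", "35", "10751"], ["28", "878", "12"], ["35"]],
    index=["Toy Story", "Inception", "Zoolander 2"],
)

# One row per (movie, genre id) pair; the title index is repeated.
exploded = genre_lists.explode()

# One indicator column per id, then collapse back to one row per movie.
dummies = pd.get_dummies(exploded).groupby(level=0).sum()

print(dummies)
```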
