#### Overview

1. How does the MPAA rating of a movie (G/PG/PG-13/R) affect how much revenue the movie generates?
2. Is there a difference in revenue between 2018 and 2020?
3. What is the difference in revenue between short and long movies?

# Create project

In [1]:
# basic imports
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# SQL
from sqlalchemy import create_engine
import pymysql
pymysql.install_as_MySQLdb()

# Stats
import scipy.stats as stats
import statsmodels.api as sm

# settings
import warnings
warnings.filterwarnings("ignore")
pd.options.display.float_format = '{:,.2f}'.format

# Load Data and Process

#### Several steps need to be taken to clean and prepare the data

- For the first hypothesis, drop movies in certification categories with low counts.
- For the second hypothesis, the movies need to be grouped by year.

## First Hypothesis prep

In [53]:
# load data of years 2010-2020
year_2010 = pd.read_csv('API_Data/final_tmdb_data_2010.csv.gz', low_memory = False)
year_2011 = pd.read_csv('API_Data/final_tmdb_data_2011.csv', low_memory = False, lineterminator='\n')
year_2012 = pd.read_csv('API_Data/final_tmdb_data_2012.csv', low_memory = False, lineterminator='\n')
year_2013 = pd.read_csv('API_Data/final_tmdb_data_2013.csv', low_memory = False)
year_2014 = pd.read_csv('API_Data/final_tmdb_data_2014.csv', low_memory = False, lineterminator='\n')
year_2015 = pd.read_csv('API_Data/final_tmdb_data_2015.csv', low_memory = False, lineterminator='\n')
year_2016 = pd.read_csv('API_Data/final_tmdb_data_2016.csv', low_memory = False)
year_2017 = pd.read_csv('API_Data/final_tmdb_data_2017.csv', low_memory = False, lineterminator='\n')
year_2018 = pd.read_csv('API_Data/final_tmdb_data_2018.csv', low_memory = False, lineterminator='\n')
year_2019 = pd.read_csv('API_Data/final_tmdb_data_2019.csv', low_memory = False, lineterminator='\n')
year_2020 = pd.read_csv('API_Data/final_tmdb_data_2020.csv', low_memory = False, lineterminator='\n')


FileNotFoundError: [Errno 2] No such file or directory: 'API_Data/final_tmdb_data_2010.csv'
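
The `FileNotFoundError` above is what pandas raises when one of the hard-coded paths does not exist. A minimal sketch of a more defensive loader follows. `load_yearly_files`, its `pattern` argument, and the `source_year` column are hypothetical names introduced for illustration. The sketch also assumes every file can be read with default `read_csv` settings, which the `lineterminator` arguments above suggest may not hold for all years.

```python
from pathlib import Path

import pandas as pd


def load_yearly_files(folder, years, pattern="final_tmdb_data_{year}.csv.gz"):
    """Read one CSV per year and report the files that were not found.

    Hypothetical helper: the file-name pattern is an assumption and may
    need adjusting to match the real contents of the data folder.
    """
    frames, missing = [], []
    for year in years:
        path = Path(folder) / pattern.format(year=year)
        if not path.exists():
            # Record the gap instead of stopping at the first missing file
            missing.append(path.name)
            continue
        frame = pd.read_csv(path, low_memory=False)
        frame["source_year"] = year  # hypothetical column: which file a row came from
        frames.append(frame)
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, missing
```

Calling `load_yearly_files('API_Data', range(2010, 2021))` would return the combined frame together with a list of file names that could not be found, so the gaps are visible before any analysis starts.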

In [39]:
# import the data
filename = 'Data/tmdb_results_combined_df.csv.gz'
firsthypo_df=pd.read_csv(filename)
#check that the data loaded
firsthypo_df.head(2)

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,


In [45]:
#check info
firsthypo_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2580 entries, 0 to 2579
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                2580 non-null   object 
 1   adult                  2578 non-null   float64
 2   backdrop_path          1412 non-null   object 
 3   belongs_to_collection  208 non-null    object 
 4   budget                 2578 non-null   float64
 5   genres                 2578 non-null   object 
 6   homepage               171 non-null    object 
 7   id                     2578 non-null   float64
 8   original_language      2578 non-null   object 
 9   original_title         2578 non-null   object 
 10  overview               2529 non-null   object 
 11  popularity             2578 non-null   float64
 12  poster_path            2322 non-null   object 
 13  production_companies   2578 non-null   object 
 14  production_countries   2578 non-null   object 
 15  rele

In [46]:
# check the value counts on certification column for first hypothesis
firsthypo_df['certification'].value_counts()

R          467
PG-13      182
NR          71
PG          63
G           25
NC-17        6
Unrated      1
Name: certification, dtype: int64

In [47]:
# explore the missing data
num_missing = firsthypo_df['certification'].isna().sum()

total_rows = firsthypo_df.shape[0]

# num_missing / total_rows is a fraction, so format it with % to print a true percentage
percent_missing = num_missing / total_rows
print(f'{percent_missing:.0%} of the data in the certification column is missing')

68% of the data in the certification column is missing


In [49]:
# drop the rows with an 'Unrated' certification
firsthypo_df = firsthypo_df.loc[firsthypo_df['certification'] != 'Unrated']

In [35]:
# drop the rows with an 'NC-17' certification
firsthypo_df = firsthypo_df.loc[firsthypo_df['certification'] != 'NC-17']
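
The two filters above can also be written as a single step. The sketch below uses a toy frame rather than the TMDB data; `toy_df`, `rare_ratings`, and `filtered` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Toy stand-in for firsthypo_df; the values are illustrative only
toy_df = pd.DataFrame({
    "certification": ["R", "PG-13", "Unrated", "NC-17", "PG", None, "G"],
    "revenue": [10.0, 20.0, 0.0, 5.0, 15.0, 0.0, 8.0],
})

rare_ratings = ["Unrated", "NC-17"]  # categories judged too small to test

# ~isin(...) keeps every row whose rating is not in the list.
# Rows with a missing rating are kept here and must be dropped separately.
filtered = toy_df.loc[~toy_df["certification"].isin(rare_ratings)].copy()
print(filtered["certification"].value_counts(dropna=False))
```

Keeping the list of rare categories in one variable makes it easier to see which ratings were excluded when the counts are checked again later.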

In [50]:
# re-check the certification value counts
firsthypo_df['certification'].value_counts()

R        467
PG-13    182
NR        71
PG        63
G         25
NC-17      6
Name: certification, dtype: int64

In [31]:
#drop null values in certification column
firsthypo_df = firsthypo_df.dropna(subset=['certification'])

In [32]:
firsthypo_df['genres'].describe()


count                               814
unique                              332
top       [{'id': 18, 'name': 'Drama'}]
freq                                 70
Name: genres, dtype: object

In [7]:
# re-check the certification value counts after dropping the null values
firsthypo_df['certification'].value_counts()

R        205
PG-13    125
PG        34
G         13
NR        12
Name: certification, dtype: int64

## Second Hypothesis prep

In [8]:
#read the data
df_2018 = pd.read_json('API_Data/tmdb_api_results_2018.json')
df_2018.head()


Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0069049,0.0,/zjG95oDnBcFKMPgBEmmuNVOMC90.jpg,,12000000.0,"[{'id': 18, 'name': 'Drama'}]",https://www.netflix.com/title/80085566,299782.0,en,The Other Side of the Wind,...,0.0,122.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,40 years in the making,The Other Side of the Wind,0.0,6.7,155.0,R
2,tt0192528,0.0,/kOxAfSyHZEDEhOCic8TxXprUg4T.jpg,,5000000.0,"[{'id': 18, 'name': 'Drama'}]",,567662.0,en,Reverse Heaven,...,0.0,104.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Heaven's in trouble and there's one man who ca...,Heaven & Hell,0.0,7.2,5.0,
3,tt0360556,0.0,/7oy4miyq4WYYy0xtX6lbNVPrEsr.jpg,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 878, 'nam...",https://www.hbo.com/movies/fahrenheit-451,401905.0,en,Fahrenheit 451,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Knowledge is a dangerous thing,Fahrenheit 451,0.0,5.4,686.0,PG-13
4,tt0365545,0.0,/ljTYcQ3pkzYF52Z8ev1Z1UThnPy.jpg,,0.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",https://www.netflix.com/title/80189630,519035.0,en,Nappily Ever After,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Let yourself grow,Nappily Ever After,0.0,7.2,752.0,


In [9]:
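# load the second comparison year; the 2020 file raised errors on read
# NOTE: the variable is named df_2021, but the file read here is final_tmdb_data_2001.csv.gz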
df_2021 = pd.read_csv('API_Data/final_tmdb_data_2001.csv.gz', low_memory=False)
df_2021.head(2)

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0035423,0.0,/hfeiSfWYujh6MKhtGTXyK3DD4nN.jpg,,48000000.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 14, ...",,11232.0,en,Kate & Leopold,...,76019048.0,118.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,"If they lived in the same century, they'd be p...",Kate & Leopold,0.0,6.33,1195.0,PG-13


In [10]:
# create one dataframe
combinedyears = pd.concat([df_2018, df_2021])

In [11]:
# convert release_date to datetime so the year can be extracted
combinedyears['release_date'] = pd.to_datetime(combinedyears['release_date'])

In [12]:
#create year column
combinedyears['year'] = combinedyears['release_date'].dt.year
combinedyears.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5857 entries, 0 to 1336
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   imdb_id                5857 non-null   object        
 1   adult                  5855 non-null   float64       
 2   backdrop_path          4170 non-null   object        
 3   belongs_to_collection  319 non-null    object        
 4   budget                 5855 non-null   float64       
 5   genres                 5855 non-null   object        
 6   homepage               4182 non-null   object        
 7   id                     5855 non-null   float64       
 8   original_language      5855 non-null   object        
 9   original_title         5855 non-null   object        
 10  overview               5827 non-null   object        
 11  popularity             5855 non-null   float64       
 12  poster_path            5612 non-null   object        
 13  pro

In [13]:
# drop null values
combinedyears = combinedyears.dropna(subset=['year','revenue'])

In [14]:
# flag the rows released in 2021
combinedyears['year_2021'] = combinedyears['year'] == 2021

In [15]:
#check values
combinedyears['year_2021'].value_counts()

False    5736
True       23
Name: year_2021, dtype: int64

In [16]:
## save list of columns needed for each group
needed_cols = ['year_2021', 'revenue']

In [17]:
## save year_2021 in separate variable
year_2021_df = combinedyears.loc[combinedyears['year_2021']==True, needed_cols]
year_2021_df

Unnamed: 0,year_2021,revenue
123,True,0.0
126,True,0.0
127,True,0.0
248,True,0.0
600,True,0.0
764,True,0.0
980,True,0.0
1022,True,0.0
1260,True,0.0
1306,True,0.0


In [18]:
## save all non-2021 rows in a separate variable (named year_2018 here)
year_2018_df = combinedyears.loc[combinedyears['year_2021']==False, needed_cols]
year_2018_df

Unnamed: 0,year_2021,revenue
1,False,0.00
2,False,0.00
3,False,0.00
4,False,0.00
5,False,0.00
...,...,...
1332,False,0.00
1333,False,0.00
1334,False,0.00
1335,False,0.00


In [19]:
## Saving JUST the numeric col as final group variables
year_2021_group = year_2021_df['revenue']
year_2018_group = year_2018_df['revenue']
year_2021_group

123    0.00
126    0.00
127    0.00
248    0.00
600    0.00
764    0.00
980    0.00
1022   0.00
1260   0.00
1306   0.00
1314   0.00
1887   0.00
1972   0.00
2049   0.00
2106   0.00
2381   0.00
2689   0.00
2838   0.00
3528   0.00
4018   0.00
4116   0.00
4495   0.00
712    0.00
Name: revenue, dtype: float64
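
The notebook does not show which test is applied to these two groups. As a hedged sketch, two revenue groups could be compared with Welch's t-test and the Mann-Whitney U test. The data below are synthetic, `compare_two_groups` is a hypothetical helper, and treating a revenue of 0 as "not reported" is an assumption rather than something the notebook states.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two revenue groups (not the TMDB data)
group_a = rng.lognormal(mean=16.0, sigma=1.0, size=60)
group_b = rng.lognormal(mean=16.5, sigma=1.0, size=40)


def compare_two_groups(a, b):
    """Compare two revenue samples; return None if either is too small."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Assumption: zero revenue means "not reported", so those rows are dropped
    a, b = a[a > 0], b[b > 0]
    if min(len(a), len(b)) < 2:
        return None  # not enough usable observations to test
    # Welch's t-test does not assume equal variances
    t_stat, t_p = stats.ttest_ind(a, b, equal_var=False)
    # Mann-Whitney U is a rank-based alternative for skewed data
    u_stat, u_p = stats.mannwhitneyu(a, b, alternative="two-sided")
    return {"n_a": len(a), "n_b": len(b), "welch_p": t_p, "mannwhitney_p": u_p}


result = compare_two_groups(group_a, group_b)
print(result)
```

Every value displayed for `year_2021_group` above is 0.00, so under the zero-revenue assumption this helper would return `None` for that comparison instead of a p-value.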

## Third Hypothesis prep

In [20]:
#explore runtime
firsthypo_df['runtime'].describe()

count   546.00
mean    104.60
std      22.40
min       0.00
25%      91.00
50%     100.00
75%     114.00
max     224.00
Name: runtime, dtype: float64

In [21]:
# look at runtime
firsthypo_df['runtime'].value_counts()

90.00     28
99.00     18
95.00     18
108.00    18
98.00     18
          ..
199.00     1
77.00      1
165.00     1
216.00     1
70.00      1
Name: runtime, Length: 87, dtype: int64

In [22]:
# create groups
long_film_df = firsthypo_df.loc[firsthypo_df['runtime'] > 150].copy()
short_film_df = firsthypo_df.loc[firsthypo_df['runtime'] < 90].copy()

In [23]:
display(long_film_df.info(), short_film_df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22 entries, 61 to 2527
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                22 non-null     object 
 1   adult                  22 non-null     float64
 2   backdrop_path          20 non-null     object 
 3   belongs_to_collection  3 non-null      object 
 4   budget                 22 non-null     float64
 5   genres                 22 non-null     object 
 6   homepage               3 non-null      object 
 7   id                     22 non-null     float64
 8   original_language      22 non-null     object 
 9   original_title         22 non-null     object 
 10  overview               22 non-null     object 
 11  popularity             22 non-null     float64
 12  poster_path            22 non-null     object 
 13  production_companies   22 non-null     object 
 14  production_countries   22 non-null     object 
 15  relea

None

None

In [24]:
longfilm_runtime = long_film_df['runtime']
shortfilm_runtime = short_film_df['runtime']
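
The two variables above hold runtime values, while the third question asks about revenue. A minimal sketch of pulling revenue for each runtime group is shown below on a toy frame. The cutoffs of 90 and 150 minutes mirror the cell above; `toy_df`, `SHORT_MAX`, `LONG_MIN`, and `summary` are illustrative names.

```python
import pandas as pd

# Toy stand-in for firsthypo_df; the values are illustrative only
toy_df = pd.DataFrame({
    "runtime": [85, 88, 95, 120, 155, 160, 89, 170],
    "revenue": [1.0e6, 2.5e6, 4.0e6, 9.0e6, 5.0e7, 7.5e7, 0.0, 1.2e8],
})

SHORT_MAX = 90   # minutes; films below this count as short
LONG_MIN = 150   # minutes; films above this count as long

# Select the revenue column (not runtime) for each runtime group
short_revenue = toy_df.loc[toy_df["runtime"] < SHORT_MAX, "revenue"]
long_revenue = toy_df.loc[toy_df["runtime"] > LONG_MIN, "revenue"]

summary = pd.DataFrame(
    {
        "n": [len(short_revenue), len(long_revenue)],
        "median_revenue": [short_revenue.median(), long_revenue.median()],
    },
    index=["short", "long"],
)
print(summary)
```

Films between the two cutoffs fall into neither group, which matches the grouping in the cell above.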

# First Hypothesis Testing
#### What are the differences in revenue across movie ratings?


- Null: Movies have the same revenue across all MPAA ratings.

- Alternate: Movies with different MPAA ratings have different revenue.

Possible follow-up question: How large are the differences, estimated with linear regression?
  - For this question the ratings will need to be grouped.

## Test types: ANOVA and Kruskal-Wallis

In [51]:
## Create groups dictionary. 
groups = {}
## Loop through all unique categories
for i in firsthypo_df['certification'].unique():
    ## Get the revenue series for this group
    data = firsthypo_df.loc[firsthypo_df['certification']==i,'revenue'].copy()
    
    # save into the dictionary
    groups[i] = data
groups.keys()

dict_keys([nan, 'PG', 'R', 'G', 'NR', 'PG-13', 'NC-17'])
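
With a dictionary of groups like the one above, both tests can be called by unpacking the dictionary values. The sketch below uses synthetic groups rather than the TMDB revenue data; `toy_groups` and the group means are assumptions chosen only to make the example self-contained.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
# Synthetic groups keyed by rating (illustrative values, not real revenue)
toy_groups = {
    "G": rng.normal(50, 10, 30),
    "PG": rng.normal(55, 10, 30),
    "R": rng.normal(70, 10, 30),
}

# One-way ANOVA: assumes normality and equal variances across groups
f_stat, anova_p = stats.f_oneway(*toy_groups.values())

# Kruskal-Wallis: rank-based alternative when those assumptions fail
h_stat, kruskal_p = stats.kruskal(*toy_groups.values())

print(f"ANOVA p-value: {anova_p:.4g}")
print(f"Kruskal-Wallis p-value: {kruskal_p:.4g}")
```

Both functions take each group as a separate positional argument, which is why the dictionary values are unpacked with `*`.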

### Check Assumptions for ANOVA
- normality
- equal variance
- outliers
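
The equal-variance and outlier checks listed above could be sketched as follows. The data are synthetic, `drop_outliers` is a hypothetical helper, and the cutoff of 3 standard deviations is a common convention assumed here rather than a value taken from the notebook.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
# Synthetic groups (illustrative values, not real revenue)
toy_groups = {
    "PG": rng.normal(55, 10, 40),
    "PG-13": rng.normal(60, 12, 40),
    "R": rng.normal(65, 30, 40),
}


def drop_outliers(values, z_max=3.0):
    """Keep observations whose absolute z-score is at most z_max."""
    values = np.asarray(values, dtype=float)
    z = np.abs(stats.zscore(values))
    return values[z <= z_max]


# Remove outliers within each group before testing the variances
cleaned = {name: drop_outliers(vals) for name, vals in toy_groups.items()}

# Levene's test: the null hypothesis is that all groups have equal variance
levene_stat, levene_p = stats.levene(*cleaned.values())
print(f"Levene p-value: {levene_p:.4g}")
```

A small Levene p-value would suggest the equal-variance assumption does not hold, which is one reason to fall back on the Kruskal-Wallis test.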

### Normality

In [52]:
## Run the normality test on each group and record the group size (ideally n > 20)
norm_results = {}
for i, data in groups.items():
    ## stats.normaltest raises a ValueError for fewer than 8 samples;
    ## the nan key is one such case, because `== nan` matches no rows
    if len(data) < 8:
        continue
    stat, p = stats.normaltest(data)
    ## save the p val, test statistic, and the size of the group
    norm_results[i] = {'n': len(data),
                       'p': p,
                       'test stat': stat}
## convert to a dataframe
norm_results_df = pd.DataFrame(norm_results).T
norm_results_df