#### Overview

1. How does the MPAA rating of a movie (G/PG/PG-13/R) affect how much revenue the movie generates?
2. Is there a difference in revenue between 2018 and 2020?
3. What is the difference in revenue between short and long movies?

# Create project

In [1]:
# basic imports
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# SQL
from sqlalchemy import create_engine
import pymysql
pymysql.install_as_MySQLdb()

# Stats
import scipy.stats as stats
import statsmodels.api as sm

# settings
import warnings
warnings.filterwarnings("ignore")
pd.options.display.float_format = '{:,.2f}'.format

# Load Data and Process

## First Hypothesis prep

In [11]:
# import the data
filename = 'Data/tmdb_results_combined_df.csv.gz'
firsthypo_df = pd.read_csv(filename)
# check that the data loaded
firsthypo_df.head(2)

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,


In [3]:
#check info
firsthypo_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2580 entries, 0 to 2579
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                2580 non-null   object 
 1   adult                  2578 non-null   float64
 2   backdrop_path          1412 non-null   object 
 3   belongs_to_collection  208 non-null    object 
 4   budget                 2578 non-null   float64
 5   genres                 2578 non-null   object 
 6   homepage               171 non-null    object 
 7   id                     2578 non-null   float64
 8   original_language      2578 non-null   object 
 9   original_title         2578 non-null   object 
 10  overview               2529 non-null   object 
 11  popularity             2578 non-null   float64
 12  poster_path            2322 non-null   object 
 13  production_companies   2578 non-null   object 
 14  production_countries   2578 non-null   object 
 15  rele

In [4]:
# check the value counts on certification column for first hypothesis
firsthypo_df['certification'].value_counts()

R          467
PG-13      182
NR          71
PG          63
G           25
NC-17        6
Unrated      1
Name: certification, dtype: int64

In [5]:
# explore the missing data
num_missing = firsthypo_df['certification'].isna().sum()

total_rows = firsthypo_df.shape[0]

# fraction of rows with no certification (a value between 0 and 1)
percent_missing = num_missing / total_rows
# the % format spec multiplies by 100, so the printed figure is a true percentage
print(f'{percent_missing:.0%} of the data in the certification column is missing')

68% of the data in the certification column is missing
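As a side note, the share of missing values can also be computed in one step. The sketch below uses a small made-up frame (`toy_df` is hypothetical, not the notebook's data): `Series.isna().mean()` returns the fraction of missing values, and the `%` format specifier turns that fraction into a percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the certification column
toy_df = pd.DataFrame({'certification': ['R', np.nan, 'PG', np.nan, np.nan]})

# isna() yields booleans; their mean is the fraction of missing values
fraction_missing = toy_df['certification'].isna().mean()

# The % format spec multiplies by 100 and appends the percent sign
message = f'{fraction_missing:.0%} of the certification values are missing'
print(message)
```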


#### Several steps need to be taken to clean and prepare the data
- drop movies without a positive budget
- drop movies with low counts in certification categories
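
One possible way to drop the low-count certification categories is to filter on `value_counts()`. This is a minimal sketch on made-up data: `toy_df`, `min_count`, and the threshold of 3 are assumptions for illustration, not values taken from this notebook.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself works on firsthypo_df
toy_df = pd.DataFrame({
    'certification': ['R'] * 5 + ['PG-13'] * 4 + ['PG'] * 3 + ['NC-17'],
    'revenue': range(13),
})

min_count = 3  # assumed threshold, chosen only for this example

# Count rows per category and keep the labels that meet the threshold
counts = toy_df['certification'].value_counts()
keep = counts[counts >= min_count].index

# Retain only rows whose certification is in the kept set
filtered_df = toy_df[toy_df['certification'].isin(keep)].copy()
print(filtered_df['certification'].value_counts())
```

Rows with a missing certification are also removed by this filter, because `isin` returns `False` for `NaN` here.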

In [12]:
# viewing shape before filtering
print(firsthypo_df.shape)
# keeping movies with a positive 'budget' ('revenue' may be zero or positive);
# rows with a zero or missing budget are dropped
firsthypo_df = firsthypo_df[((firsthypo_df['revenue'] > 0) & (firsthypo_df['budget'] > 0)) |
                     ((firsthypo_df['revenue'] == 0) & (firsthypo_df['budget'] > 0))].copy()

print(firsthypo_df.shape)

(2580, 26)
(546, 26)


In [14]:
# check the certification value counts after filtering
firsthypo_df['certification'].value_counts()

R        205
PG-13    125
PG        34
G         13
NR        12
Name: certification, dtype: int64
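
With revenue grouped by certification, the first question could be examined with a one-way ANOVA. The sketch below is an assumption about how such a test might look, not the notebook's actual analysis: it runs `scipy.stats.f_oneway` on synthetic data (`toy_df` and its group means are made up). ANOVA assumes roughly normal groups with similar variances, so those assumptions would need checking on the real data first.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic revenue figures for four ratings (made up for illustration)
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    'certification': np.repeat(['G', 'PG', 'PG-13', 'R'], 30),
    'revenue': np.concatenate([
        rng.normal(loc=mean, scale=10.0, size=30)
        for mean in (50.0, 60.0, 80.0, 55.0)
    ]),
})

# Split revenue into one array per certification
groups = {
    cert: grp['revenue'].to_numpy()
    for cert, grp in toy_df.groupby('certification')
}

# One-way ANOVA: the null hypothesis is that all group means are equal
result = stats.f_oneway(*groups.values())
print(f'F = {result.statistic:.2f}, p = {result.pvalue:.4g}')
```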

## Second Hypothesis prep

In [None]:
# open 2018 data json
import json  # needed for json.load; not among the imports in the first cell

with open('Data4/tmdb_api_results_2018.json') as f:
    tmbd_2018 = json.load(f)