IMDB_Movies Hypothesis Testing 

1. The stakeholder's first question is: does the MPAA rating of a movie (G/PG/PG-13/R) affect how much revenue the movie generates?

* They want you to perform a statistical test to get a mathematically-supported answer.

* They want you to report if you found a significant difference between ratings.

    - If so, what was the p-value of your analysis?
    - And which rating earns the most revenue?

In [1]:
# import libraries
import json
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
import scipy
scipy.__version__

'1.8.1'

In [2]:
# load the combined TMDB dataframe
file = 'In_progress/DataClean/combined_tmdb_d.csv.gz'
df = pd.read_csv(file,lineterminator='\n')
df

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.4,21.0,
1,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,4.4,7.0,
2,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,0.0,152.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,0.0,0.0,
3,tt0118694,0.0,/n4GJFGzsc7NinI1VeGDXIcQjtU2.jpg,,150000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",http://www.wkw-inthemoodforlove.com/,843.0,cn,花樣年華,...,12854953.0,99.0,"[{'english_name': 'Cantonese', 'iso_639_1': 'c...",Released,"Feel the heat, keep the feeling burning, let t...",In the Mood for Love,0.0,8.1,1868.0,PG
4,tt0118852,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}]",,49511.0,en,Chinese Coffee,...,0.0,99.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,There's a fine line between friendship and bet...,Chinese Coffee,0.0,6.9,46.0,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60125,tt9895024,0.0,/S1kutYyoyuBJKGS0mXxv2fZNbr.jpg,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",,834443.0,de,Heikos Welt,...,0.0,118.0,"[{'english_name': 'German', 'iso_639_1': 'de',...",Released,,Heiko's World,0.0,0.0,0.0,
60126,tt9896876,0.0,/hMvRbT6HOqERhh3K8kXbaLz9LlZ.jpg,,0.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,821493.0,en,India Sweets and Spices,...,0.0,101.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Let the aunties talk.,India Sweets and Spices,0.0,6.0,2.0,PG-13
60127,tt9898844,0.0,/q2KFBGyUSzHDhNqXEYv2LqTWVSz.jpg,,0.0,"[{'id': 27, 'name': 'Horror'}]",,870671.0,en,The Hunting,...,0.0,91.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Hunger runs deep.,The Hunting,0.0,4.9,20.0,
60128,tt9900940,0.0,/3jjWLg5bevWqReyVroYqxwVrH0k.jpg,,0.0,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",,861294.0,en,The Scrapper,...,0.0,87.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,You can't escape your past.,The Scrapper,0.0,6.0,2.0,


In [3]:
# dataframe info 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60130 entries, 0 to 60129
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                60130 non-null  object 
 1   adult                  60130 non-null  float64
 2   backdrop_path          36994 non-null  object 
 3   belongs_to_collection  3738 non-null   object 
 4   budget                 60130 non-null  float64
 5   genres                 60130 non-null  object 
 6   homepage               14776 non-null  object 
 7   id                     60130 non-null  float64
 8   original_language      60130 non-null  object 
 9   original_title         60130 non-null  object 
 10  overview               58761 non-null  object 
 11  popularity             60130 non-null  float64
 12  poster_path            54382 non-null  object 
 13  production_companies   60130 non-null  object 
 14  production_countries   60130 non-null  object 
 15  re

In [4]:
# Preprocessing
## Columns to exclude
drop_cols = ['backdrop_path','original_title','overview',
             'poster_path','status','tagline','id','homepage',
             'production_countries','video','production_companies',
             'spoken_languages','original_language']
df = df.drop(columns=drop_cols)
df.head(3)

Unnamed: 0,imdb_id,adult,belongs_to_collection,budget,genres,popularity,release_date,revenue,runtime,title,vote_average,vote_count,certification
0,tt0113026,0.0,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",2.491,2000-09-22,0.0,86.0,The Fantasticks,5.4,21.0,
1,tt0113092,0.0,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",1.974,2000-11-15,0.0,100.0,For the Cause,4.4,7.0,
2,tt0116391,0.0,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",1.341,2000-04-14,0.0,152.0,Gang,0.0,0.0,


**1. State the Hypothesis & Null Hypothesis**

- $H_0$ (Null Hypothesis): There is no difference in revenue between movies over 2.5 hours long and movies that are 1.5 hours long (or less).
- $H_A$ (Alternative Hypothesis): Movies over 2.5 hours long earn more revenue than movies that are 1.5 hours long (or less).

**2. Determine the correct test to perform.**
- Type of Data?
    - Numeric target (revenue), compared across categorical groups
- How many groups/samples?
    - 2-4 groups
- Therefore, which test is appropriate?
    - One-way ANOVA
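
As a minimal sketch of what the ANOVA step could look like, the cell below runs `scipy.stats.f_oneway` on a small toy frame. The `certification` and `revenue` column names come from the dataframe above, but the values are made up for illustration, and the `alpha = 0.05` threshold is an assumption, not something stated by the stakeholder.

```python
import pandas as pd
from scipy import stats

# Toy data for illustration only (values are made up, not from the TMDB file)
toy = pd.DataFrame({
    "certification": ["G", "G", "G", "PG", "PG", "PG",
                      "PG-13", "PG-13", "PG-13", "R", "R", "R"],
    "revenue": [12e6, 15e6, 9e6, 40e6, 55e6, 38e6,
                120e6, 90e6, 150e6, 30e6, 25e6, 45e6],
})

# Build one revenue sample per rating group
samples = [grp["revenue"].to_numpy() for _, grp in toy.groupby("certification")]

# One-way ANOVA: the null hypothesis is that all group means are equal
f_stat, p_value = stats.f_oneway(*samples)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"F = {f_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

ANOVA also assumes roughly normal groups with similar variances, so those checks would normally come before trusting the p-value.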

**Visualize and separate data for hypothesis**

- What column is our target?
- What column determines our groups?
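
One possible way to answer these two questions in code is sketched below, assuming `revenue` is the target and `certification` defines the groups (both columns exist in the dataframe above). The toy values and the names `toy`, `subset`, `groups`, and `summary` are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's df (values are made up for illustration)
toy = pd.DataFrame({
    "certification": ["G", "PG", "PG", "PG-13", "R", "R", None, "NR"],
    "revenue": [12e6, 40e6, 0.0, 120e6, 30e6, 45e6, 5e6, 1e6],
})

target_col = "revenue"               # numeric target
group_col = "certification"          # column that defines the groups
ratings = ["G", "PG", "PG-13", "R"]  # ratings named in the stakeholder question

# Keep only rows with one of the four ratings (drops NaN and other labels)
subset = toy[toy[group_col].isin(ratings)]

# Map each rating to its revenue Series
groups = {
    rating: subset.loc[subset[group_col] == rating, target_col]
    for rating in ratings
}

# Per-group row count and mean revenue
summary = subset.groupby(group_col)[target_col].agg(["count", "mean"])
print(summary)
```

Many rows in the real data show a revenue of 0.0; whether to exclude them before testing is a separate decision that this sketch does not make.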


In [5]:
# check for col 'runtime'
df['runtime']

0         86.0
1        100.0
2        152.0
3         99.0
4         99.0
         ...  
60125    118.0
60126    101.0
60127     91.0
60128     87.0
60129    111.0
Name: runtime, Length: 60130, dtype: float64
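
The runtime groups named in the hypothesis could be separated along these lines. This is a sketch on made-up values; it assumes `runtime` is recorded in minutes, so 2.5 hours is 150 and 1.5 hours is 90, and the names `long_movies` and `short_movies` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the notebook's df (values are made up for illustration)
toy = pd.DataFrame({
    "runtime": [86.0, 100.0, 152.0, 99.0, 165.0, 90.0, 151.0],
    "revenue": [1e6, 5e6, 80e6, 12e6, 200e6, 3e6, 60e6],
})

long_cutoff = 150   # 2.5 hours, assuming runtime is in minutes
short_cutoff = 90   # 1.5 hours, assuming runtime is in minutes

# Revenue for movies longer than 2.5 hours
long_movies = toy.loc[toy["runtime"] > long_cutoff, "revenue"]

# Revenue for movies that are 1.5 hours long or less
short_movies = toy.loc[toy["runtime"] <= short_cutoff, "revenue"]

print(len(long_movies), len(short_movies))
```

Movies between the two cutoffs fall into neither group, which matches how the hypothesis is worded.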