# Project 2 Part 4
**Apply Hypothesis Testing**


*Christina Brockway*

## Business Problem

- Need a MySQL database on movies from a subset of IMDB's publicly available dataset.
- Use this database to analyze what makes a movie successful.
- Provide recommendations to the stakeholder on how to make a movie successful.
- Create 3 scenarios with the dataset:
  - Perform statistical testing to get mathematically supported answers
  - Report whether there is a significant difference between features
    - If yes, what was the p-value?
    - Which feature earns the most revenue?
  - Prepare a visualization that supports the findings

## Import/Load Data

In [1]:
import os, time, json
import tmdbsimple as tmdb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import missingno as msno
from tqdm.notebook import tqdm_notebook

import scipy.stats as stats

In [2]:
## Load API Key
with open('/Users/csbro/.secret/tmdb_api.json', 'r') as f:
    login = json.load(f)
login.keys()

dict_keys(['api_key'])

In [3]:
tmdb.API_KEY = login['api_key']

In [4]:
FOLDER = 'MovieData/'


In [5]:
# Load in data from IMDB to compare to TMDB info
basics = pd.read_csv("data/basics-filtered.csv")
basics.head(2)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama


In [6]:
## Years to retrieve from the API: 2019 and 2020 (range() excludes the end value)
GET_YEARS = list(range(2019, 2021))

#Create an empty list for errors
errors = []

In [7]:
#Define API function


def get_movie_with_rating(movie_id):
    #Get movie object using movie_id
    movie= tmdb.Movies(movie_id)
    #Save the dictionaries 
    movie_info = movie.info()
    releases = movie.releases()
    #Loop through countries for only US
    for c in releases['countries']:
        if c['iso_3166_1'] == 'US':
            movie_info['certification']= c['certification']
    return movie_info



def write_json(new_data, filename):
    """Appends a list of records (new_data) into a json file (filename).
    Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/"""

    with open(filename, 'r+') as file:
        #Load existing data into dictionary
        file_data = json.load(file)
        #choose to extend or append
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        #set file's current position at offset
        file.seek(0)
        #convert back to json
        json.dump(file_data, file)
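
The append pattern above can be tried out without calling the API. The following is a minimal sketch, assuming a throwaway file in a temporary directory; `append_records` and `demo_path` are hypothetical names used only for illustration, and the `truncate()` call is an added safeguard that the original function does not include.

```python
import json
import os
import tempfile


def append_records(new_data, filename):
    """Append one record (dict) or a list of records to a JSON list on disk.

    Hypothetical stand-in for write_json, kept separate so it can be run alone.
    """
    with open(filename, "r+") as file:
        # Load the existing list of records
        file_data = json.load(file)
        # Extend with a list of records, or append a single record
        if isinstance(new_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Rewind and overwrite the file with the updated list
        file.seek(0)
        json.dump(file_data, file)
        # Assumption: drop any leftover bytes if the new content were shorter
        file.truncate()


# Throwaway demo file, seeded the same way the notebook seeds a new year file
demo_path = os.path.join(tempfile.mkdtemp(), "demo_results.json")
with open(demo_path, "w") as f:
    json.dump([{"imdb_id": 0}], f)

append_records({"imdb_id": "tt0000001"}, demo_path)
append_records([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], demo_path)

with open(demo_path) as f:
    saved = json.load(f)
print(len(saved))
```

Because the whole file is re-read and rewritten on every call, this approach is simple but becomes slower as the file grows.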

In [8]:
## Confirm API works
test= ["tt0848228", "tt0332280"]
results= []
for movie_id in test:
    movie_info = get_movie_with_rating(movie_id)
    results.append(movie_info)
pd.DataFrame(results)

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,False,/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg,"{'id': 86311, 'name': 'The Avengers Collection...",220000000,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",https://www.marvel.com/movies/the-avengers,24428,tt0848228,en,The Avengers,...,1518815515,143,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Some assembly required.,The Avengers,False,7.711,29299,PG-13
1,False,/qom1SZSENdmHFNZBXbtJAU0WTlC.jpg,,29000000,"[{'id': 10749, 'name': 'Romance'}, {'id': 18, ...",http://www.newline.com/properties/notebookthe....,11036,tt0332280,en,The Notebook,...,115603229,123,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Behind every great love is a great story.,The Notebook,False,7.881,10702,PG-13


In [10]:
##OUTER LOOP
for YEAR in tqdm_notebook(GET_YEARS, desc='YEARS', position=0):
  
    #Prepare DF for json file
    JSON_MOVIE= f'{FOLDER}tmdb_api_results {YEAR}.json'
    #Check if file exists
    file_exists = os.path.isfile(JSON_MOVIE)
    
    if not file_exists:
        print(f'Creating json file for API results for {YEAR}')
        with open(JSON_MOVIE, 'w') as f:
            json.dump([{'imdb_id':0}], f)
    else: 
        print(f'{JSON_MOVIE} already exists.')
    
    #Save dataframe
    df = basics.loc[basics['startYear'] == YEAR].copy()
    #saving movie_id to separate variable
    movie_ids = df['tconst'].copy() #.to_list()

    #Load existing data from json into DF called previous_df
    previous_df = pd.read_json(JSON_MOVIE)

    #filter out any ids that are already in the file
    needed_mids = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

    #INNER LOOP
    for movie_id in tqdm_notebook(needed_mids,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):
        try:
            temp = get_movie_with_rating(movie_id)
            #Append/Extend results to json file
            write_json(temp, JSON_MOVIE)
            time.sleep(0.02)
        except Exception as e:
            errors.append([movie_id, e])

    print(f' - Total Errors: {len(errors)}')    


    final_year_df = pd.read_json(JSON_MOVIE)
    final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz", compression= 'gzip', index=False)

YEARS:   0%|          | 0/2 [00:00<?, ?it/s]

MovieData/tmdb_api_results 2019.json already exists.


Movies from 2019:   0%|          | 0/1867 [00:00<?, ?it/s]

 - Total Errors: 955
Creating json file for API results for 2020


Movies from 2020:   0%|          | 0/5010 [00:00<?, ?it/s]

 - Total Errors: 1997


In [None]:
#Combine files with glob

import glob
q= "MovieData/tmdb_api*.json"
tmdb_glob = sorted(glob.glob(q))
tmdb_glob

In [None]:
#Loading all files into dataframe
df_glob = []
for file in tmdb_glob:
    #pd.read_json has no index_col argument, so read each file as-is
    temp_df = pd.read_json(file)
    df_glob.append(temp_df)
#concat files
df_tmdb = pd.concat(df_glob)
df_tmdb.head(2)

In [None]:
## Inspect the data
df_tmdb.info()

In [None]:
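#Columns such as genres hold lists/dicts, which are unhashable,
#so check for duplicates on the movie id instead of the whole row
df_tmdb.duplicated(subset='imdb_id').sum()

In [None]:
df_tmdb.drop_duplicates(subset='imdb_id', inplace=True)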

### First Scenario:

##### Does the MPAA rating of a movie affect how much revenue the movie generates?

**Null Hypothesis:**  There is no significant association between the MPAA rating of a movie and the revenue it generates.

**Alternative Hypothesis:**  There is a significant association between the MPAA rating of a movie and the revenue it generates.

In [None]:
sns.barplot(data=df_tmdb, x='certification', y='revenue');

- The following features are needed to test this hypothesis: certification and revenue
- The target (revenue) is numeric
- There are more than two groups (one per certification)
- Use a one-way ANOVA, which assumes:
  - normality
  - equal variance
  - no significant outliers
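
As an illustration of the test named above, here is a minimal sketch of a one-way ANOVA with `scipy.stats.f_oneway`. The three samples are synthetic values drawn from a random generator, not TMDB revenue, and `demo_groups` is a hypothetical name chosen for this example.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)

# Synthetic, made-up samples (NOT from the TMDB data):
# two groups share a mean and the third is shifted upward
demo_groups = {
    "G": rng.normal(loc=100, scale=10, size=30),
    "PG": rng.normal(loc=100, scale=10, size=30),
    "R": rng.normal(loc=130, scale=10, size=30),
}

# f_oneway takes each group as a separate positional argument
anova_result = stats.f_oneway(*demo_groups.values())
print(anova_result.statistic, anova_result.pvalue)

# Reject the null hypothesis of equal group means when p < alpha
reject_null = anova_result.pvalue < 0.05
print(reject_null)
```

The ANOVA only reports whether at least one group mean differs; identifying which group differs requires a follow-up pairwise comparison.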

In [None]:
df_tmdb['revenue'].value_counts()

In [None]:
#Create groups dictionary
groups = {}

#Loop through all unique categories (skip missing certifications)
for certification in df_tmdb['certification'].dropna().unique():
    data = df_tmdb.loc[df_tmdb['certification']==certification,'revenue'].copy()

    #save into dictionary
    groups[certification]=data
groups.keys()

In [None]:
#Loop through the groups to get rid of outliers
groups_clean={}

for group, data in groups.items():
    #Flag values more than 3 standard deviations from the group mean
    outliers=np.abs(stats.zscore(data))>3
    n_outliers=np.sum(outliers)

    print(f" - For {group}, there were {n_outliers} outliers removed.")
    #Use ~ (not -) to invert the boolean mask
    clean_data = data[~outliers]

    #Save into clean dictionary
    groups_clean[group] = clean_data
groups_clean.keys()

In [None]:
#Test for Normality

#Run normal test on each group and confirm there are >20 in each group
norm_results = []

for group, data in groups_clean.items():
    stat, p = stats.normaltest(data)
    norm_results.append({'group':group, "n": len(data),
                         'p':p, "test stat": stat, 'significance?': p<0.05})

#convert to dataframe
results_df = pd.DataFrame(norm_results)
results_df

-  None of the groups is normally distributed, BUT every group has more than 15 observations, so the assumption of normality can be safely disregarded.

In [None]:
## Test for Equal Variance

res= stats.levene(*groups_clean.values())
res

In [None]:
res.pvalue<0.05

-  The null hypothesis of Levene's test is that the groups have equal variances.
-The p-value indicates 

***The p-value is greater than 0.05, so we fail to reject the null hypothesis: MPAA rating has no significant effect on revenue.***
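
The outcome of Levene's test usually decides which group comparison to run next. The sketch below shows one possible way to encode that choice, assuming the nonparametric Kruskal-Wallis test as the fallback when variances differ; that fallback is an assumption of this example, not something the notebook states. `compare_groups` is a hypothetical helper and the data are synthetic.

```python
import numpy as np
import scipy.stats as stats


def compare_groups(groups, alpha=0.05):
    """Hypothetical helper: choose the group test from Levene's result."""
    samples = list(groups.values())
    levene_p = stats.levene(*samples).pvalue
    if levene_p < alpha:
        # Variances differ: assume a nonparametric Kruskal-Wallis test instead
        test_name, result = "Kruskal-Wallis", stats.kruskal(*samples)
    else:
        # Variances look equal: the one-way ANOVA assumption holds
        test_name, result = "one-way ANOVA", stats.f_oneway(*samples)
    return test_name, levene_p, result.pvalue


# Synthetic, made-up samples (NOT TMDB data)
base = np.linspace(-1, 1, 50)
equal_var = {"A": base, "B": base + 5, "C": base + 10}     # same spread, shifted means
unequal_var = {"A": base, "B": base + 5, "C": 50 * base}   # one group far more spread out

name_equal, levene_equal, p_equal = compare_groups(equal_var)
name_unequal, levene_unequal, p_unequal = compare_groups(unequal_var)
print(name_equal, p_equal)
print(name_unequal, p_unequal)
```

With the notebook's data, the equivalent call would pass `groups_clean`; the p-value returned by the chosen test, rather than Levene's p-value, is the one that addresses the hypothesis about revenue.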

In [None]:
### Second Scenario:

#####