# Project 3 Part 4 - Hypothesis Testing
Cameron Peace

## Task

For part 4 of the project, you will be using your MySQL database from part 3 to answer meaningful questions for your stakeholder. They want you to use your hypothesis testing and statistics knowledge to answer 3 questions about what makes a successful movie.

**Questions to Answer:**

*The stakeholder's first question is: does the MPAA rating of a movie (G/PG/PG-13/R) affect how much revenue the movie generates?*

* [ ] They want you to perform a statistical test to get a mathematically-supported answer.
* [ ] They want you to report if you found a significant difference between ratings.
* [ ] If so, what was the p-value of your analysis?
* [ ] And which rating earns the most revenue?
* [ ] They want you to prepare a visualization that supports your finding.

It is then up to you to come up with 2 additional hypotheses to test that your stakeholder may want answered.
* [ ] Hypothesis 1
* [ ] Hypothesis 2

Some example hypotheses you could test:

* Do movies that are over 2.5 hours long earn more revenue than movies that are 1.5 hours long (or less)?
* Do movies released in 2020 earn less revenue than movies released in 2018?
* How do the years compare for movie ratings?
* Do some movie genres earn more revenue than others?
* Are some genres higher rated than others?
etc.

**Specifications:**

**Your Data:**

* A critical first step for this assignment will be to retrieve additional movie data to add to your SQL database.
* You will want to use the TMDB API again and extract data for additional years.
* You may want to review the optional lesson from Week 1 on "Using glob to Load Many Files" to load and combine all of your API results for each year.
* However, trying to extract the TMDB data for all movies from 2000-2022 could take >24 hours!
* To address this issue, you should EITHER:
    * Define a smaller (but logical) period of time to use for your analyses (e.g. the last 10 years, or 2010-2019 (pre-pandemic), etc.).
    * OR coordinate with cohort-mates and divide the API calls so that you can each download the data for a smaller number of years and then share your downloaded JSON data.
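
The glob-based loading mentioned in the list above can be sketched as follows. This is a minimal sketch, assuming each year's API results were saved as a JSON file holding an array of movie records; the `load_yearly_results` helper and the file pattern in the usage note are hypothetical, not taken from the project.

```python
import glob

import pandas as pd


def load_yearly_results(pattern):
    """Combine every JSON file matching `pattern` into one DataFrame."""
    # sorted() keeps the load order deterministic across runs
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"No files match {pattern!r}")
    # Assumption: each file holds a JSON array of movie records
    frames = [pd.read_json(path) for path in files]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

A call such as `load_yearly_results('Data/tmdb_api_results_*.json')` (a hypothetical path) would then return one DataFrame covering every downloaded year, whether the files were produced locally or shared by cohort-mates.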


**Deliverables:**

* [ ] You should use the same project repository you have been using for Parts 1-3 (for your portfolio).
* [ ] Create a new notebook in your project repository just for the hypothesis testing (like "Part 4 - Hypothesis Testing.ipynb")
* [ ] Make sure the results and visualization for all 3 hypotheses are in your notebook.
* [ ] Please submit the link to your GitHub repository for this assignment.

## Imports

In [10]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
import pymysql
pymysql.install_as_MySQLdb()

## Connecting to Database

In [13]:
# loading password
with open('Data/sqlpass.txt') as f:
    # strip() removes any trailing newline so it does not end up in the connection URL
    my_pass = f.read().strip()

In [15]:
# creating engine
engine = create_engine(f'mysql+pymysql://root:{my_pass}@localhost/movies')

# verifying
engine

Engine(mysql+pymysql://root:***@localhost/movies)
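
One caveat with building the connection string by hand: a password containing characters such as `@`, `/`, or `:` would be misread as part of the URL. The sketch below shows one way to guard against that, assuming the same `root` user, `localhost` host, and `movies` database as above. The `build_mysql_url` helper is hypothetical.

```python
from urllib.parse import quote_plus


def build_mysql_url(password, user="root", host="localhost", database="movies"):
    """Return a mysql+pymysql URL with the password percent-encoded."""
    # strip() drops a trailing newline left over from reading the password file
    safe_password = quote_plus(password.strip())
    return f"mysql+pymysql://{user}:{safe_password}@{host}/{database}"
```

The result can be passed straight to `create_engine`, for example `create_engine(build_mysql_url(my_pass))`.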

In [18]:
# displaying tables
pd.read_sql('show tables', engine)

| | Tables_in_movies |
|---|---|
| 0 | genres |
| 1 | title_basics |
| 2 | title_genres |
| 3 | title_ratings |
| 4 | tmdb_data |


## Loading, Viewing Data

In [20]:
# loading in tables as dataframes
genres = pd.read_sql('select * from genres', engine)
title_basics = pd.read_sql('select * from title_basics', engine)
title_genres = pd.read_sql('select * from title_genres', engine)
title_ratings = pd.read_sql('select * from title_ratings', engine)
tmdb_data = pd.read_sql('select * from tmdb_data', engine)
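
The five near-identical `read_sql` calls could also be written as a loop. This is a minimal sketch of that alternative, assuming the same table names as above. The `load_tables` helper is hypothetical, and it returns a dictionary rather than separate variables.

```python
import pandas as pd

TABLE_NAMES = ["genres", "title_basics", "title_genres", "title_ratings", "tmdb_data"]


def load_tables(connection, table_names=TABLE_NAMES):
    """Read each named table into a DataFrame, keyed by table name."""
    # Table names come from a fixed list, so the f-string is not built from user input
    return {name: pd.read_sql(f"select * from {name}", connection)
            for name in table_names}
```

With the engine above, `tables = load_tables(engine)` would make each frame available as, for example, `tables['tmdb_data']`.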

In [21]:
# initial view of dfs
display(genres.sample(3), title_basics.sample(3), title_genres.sample(3), 
        title_ratings.sample(3), tmdb_data.sample(3))

| | genre_name | genre_id |
|---|---|---|
| 19 | Sci-Fi | 19 |
| 17 | Reality-TV | 17 |
| 14 | Musical | 14 |

| | tconst | primary_title | start_year | runtime_minutes |
|---|---|---|---|---|
| 15841 | tt10476336 | Love Trip | 2018.0 | 86 |
| 35888 | tt1645129 | Planzet | 2010.0 | 53 |
| 82611 | tt8751994 | Flesh | 2018.0 | 120 |

| | tconst | genre_id |
|---|---|---|
| 46734 | tt12892776 | 5 |
| 34190 | tt10846772 | 15 |
| 53560 | tt13919696 | 23 |

| | tconst | average_rating | num_votes |
|---|---|---|---|
| 149797 | tt0519962 | 5.5 | 77 |
| 346985 | tt2123323 | 6.3 | 9 |
| 438155 | tt5865148 | 6.1 | 82 |

| | imdb_id | revenue | budget | certification |
|---|---|---|---|---|
| 46516 | tt3976280 | 0.0 | 0.0 | |
| 3913 | tt0327409 | 0.0 | 7000000.0 | R |
| 41092 | tt2980592 | 2700050.0 | 5000000.0 | R |


## ***Question 1:***

<font color='dodgerblue' size=4><i>
Does the MPAA rating of a movie (G/PG/PG-13/R) affect how much revenue the movie generates?
</i></font>

### Hypothesis Statements + Alpha Value



**Null Hypothesis ($H_0$):**

<font color='forestgreen' size=4>The MPAA rating of a movie has ***no significant effect*** on the revenue it generates
</font>

**Alternative Hypothesis ($H_1$):**

<font color='forestgreen' size=4>The MPAA rating of a movie ***has a significant effect*** on the revenue it generates
</font>

**Alpha Value:**

<font color='forestgreen' size=4>
0.05
</font>
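
With the hypotheses and alpha fixed, the mechanics of the comparison can be sketched. This is a minimal sketch and not necessarily the analysis used in this notebook: it assumes a one-way ANOVA (`scipy.stats.f_oneway`) is a suitable test for comparing revenue across more than two rating groups. The `compare_groups` helper and the revenue figures are hypothetical and made up purely for illustration.

```python
from scipy import stats

ALPHA = 0.05


def compare_groups(groups, alpha=ALPHA):
    """Run a one-way ANOVA across groups and report whether H0 is rejected."""
    # groups maps a rating label to a list of revenue values
    result = stats.f_oneway(*groups.values())
    return {
        "statistic": float(result.statistic),
        "pvalue": float(result.pvalue),
        # Reject H0 when the p-value falls below the chosen alpha
        "reject_null": bool(result.pvalue < alpha),
    }


# Made-up revenue figures for illustration only, not project data
example = {
    "G": [10.0, 12.0, 11.0, 13.0],
    "PG": [20.0, 22.0, 19.0, 21.0],
    "R": [10.5, 11.5, 12.5, 9.5],
}
outcome = compare_groups(example)
```

One-way ANOVA assumes roughly normal groups with similar variances. If those assumptions do not hold for the real revenue data, a nonparametric alternative such as `scipy.stats.kruskal` accepts the groups in the same way.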

In [22]:
tmdb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65850 entries, 0 to 65849
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   imdb_id        65850 non-null  object 
 1   revenue        65849 non-null  float64
 2   budget         65849 non-null  float64
 3   certification  39183 non-null  object 
dtypes: float64(2), object(2)
memory usage: 2.0+ MB
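
The `info()` output shows that `certification` is present for only 39183 of the 65850 rows, and the earlier sample includes rows with a revenue of 0.0. The sketch below narrows the table to rows usable for Question 1, assuming that a revenue of 0 means "not reported" and that only the four ratings named in the question are of interest. The `filter_for_rating_test` helper is hypothetical, and the toy frame only stands in for `tmdb_data`.

```python
import pandas as pd

RATINGS = ["G", "PG", "PG-13", "R"]


def filter_for_rating_test(df, ratings=RATINGS):
    """Keep rows with a rating of interest and a positive revenue."""
    # isin() is False for missing values and empty strings, so those rows drop out
    has_rating = df["certification"].isin(ratings)
    # Assumption: a revenue of 0 marks unreported revenue rather than a true zero;
    # a missing revenue compares as False here and is dropped as well
    has_revenue = df["revenue"] > 0
    return df.loc[has_rating & has_revenue].copy()


# Toy rows shaped like tmdb_data, for illustration only
toy = pd.DataFrame({
    "imdb_id": ["tt1", "tt2", "tt3", "tt4"],
    "revenue": [0.0, 2700050.0, 500.0, None],
    "budget": [7000000.0, 5000000.0, 0.0, 0.0],
    "certification": ["R", "R", None, "PG"],
})
usable = filter_for_rating_test(toy)
```

Whether zero-revenue rows should be dropped is a judgment call. Keeping them would pull every group's revenue toward zero and could mask differences between ratings.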
