# Phase 2 Review

### Imports

In [13]:
import pandas as pd
import numpy as np
import scipy.stats as st
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from statsmodels.formula.api import ols
from statsmodels.stats.proportion import proportions_ztest
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

pd.set_option('display.max_columns', 100)

### Check Your Data … Quickly
The first thing to do with a new dataset is to quickly verify its contents with the `.head()` method.

In [14]:
df = pd.read_csv('movie_metadata.csv')
print(df.shape)
df.head()

(5043, 28)


Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens ...,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0
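Row 4 above already shows several missing fields, so a quick count of missing values is a useful companion to `.head()`. The following is a minimal sketch on a small hypothetical frame (`sample` is an illustrative stand-in, not the real CSV):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few columns of movie_metadata.csv
sample = pd.DataFrame({
    "content_rating": ["PG-13", "R", None, "R"],
    "title_year": [2009.0, 2007.0, np.nan, 2012.0],
    "gross": [760505847.0, np.nan, np.nan, 448130642.0],
})

# Number of missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)

# Share of rows that are complete in the columns we plan to use
complete_share = sample[["content_rating", "gross"]].notna().all(axis=1).mean()
print(complete_share)
```

On the real data, the same two calls on `df` would show how many rows each later filter is likely to lose.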


# Questions

## Question 1: Confidence Interval

A Hollywood executive wants to know how much an R-rated movie released after 2000 will earn. The data above is a sample of some of the movies with that rating during that timeframe, as well as other movies. How would you go about answering her question? Talk through it theoretically and then do it in code.

What is the 95% confidence interval for a post-2000 R-rated movie's box office gross?

### Steps to answer question:

**1) Break down the question**

* Rating: R
* Time: all dates post-2000
* Earnings: based on average data?

**2) Determine steps to get sample data**

* Filter original data for:
    * Titles: "movie_title"
    * Ratings: "content_rating" == "R"
    * Timeframe: "title_year" > 2000
    * Earnings: "gross"

**3) Use the `st.t.interval()` method to determine our interval**

### Filter Data

#### My Code

In [15]:
# # Filter columns for desired topics
# df_filtered = df.loc[:, ['movie_title', 'content_rating', 'title_year', 'gross']]
# df_filtered.dropna(subset=['gross'], inplace=True)
# df_filtered.reset_index(drop=True, inplace= True)

# Filter for "R" rating
# df_full_filter = df_filtered[(df_filtered['content_rating'] == "R") & (df_filtered['title_year'] > 2000)]
# df_full_filter.reset_index(drop=True, inplace = True)

# # Filter for release year > 2000
# df_full_filter = df_fltr_rating[df_fltr_rating].reset_index(drop=True)

# Fully filtered data
# df_full_filter

In [16]:
df_2000r = df[(df['content_rating'] == "R") & (df['title_year'] > 2000)]

In [17]:
# Set sample data for confidence interval
data = df_2000r.gross
data

84              NaN
94      150350192.0
113      34293771.0
124     139259759.0
126     281492479.0
           ...     
5012     10499968.0
5014            NaN
5019            NaN
5021       192467.0
5026       136007.0
Name: gross, Length: 1390, dtype: float64

##### **Answer**

In [18]:
# Determine the 95% confidence interval for the mean gross.
# Drop missing values first: a single NaN makes the mean, and so the interval, NaN.
gross = data.dropna()
# The first argument is the confidence level (0.95, not 0.05), and scale is the
# standard error of the mean rather than the sample standard deviation.
ci_start, ci_end = st.t.interval(0.95, df=len(gross)-1, loc=gross.mean(), scale=st.sem(gross))

print(f'We are 95% confident that the average gross ($) for an R-rated movie released after 2000 falls between ${ci_start:,.0f} and ${ci_end:,.0f}.')



#### Answer Key Code

In [7]:
# do it in code here

df.dropna(subset=['gross'], inplace=True)

df_2000R = df[(df['title_year'] > 2000) & (df['content_rating'] == 'R')]

mean = df_2000R.gross.mean()

sd = df_2000R.gross.std()

n = df_2000R.gross.count()

mean, sd, n

(27648848.437913906, 39088854.94277442, 1208)

In [8]:
se = sd/n**.5

In [9]:
# 95% confidence interval (z-based; reuses the standard error computed above)
mean - 1.96 * se, mean + 1.96 * se

(25444523.810569864, 29853173.065257948)
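The answer key uses the normal critical value 1.96. As a rough check, the sketch below recomputes the interval from the summary statistics printed above and compares it with a t-based interval; with n = 1208 the two should be nearly identical. The variable names are illustrative.

```python
import scipy.stats as st

# Summary statistics reported by the answer key cell above
mean, sd, n = 27648848.437913906, 39088854.94277442, 1208
se = sd / n ** 0.5  # standard error of the mean

# Critical values for a two-sided 95% interval
z_crit = st.norm.ppf(0.975)         # close to the 1.96 used above
t_crit = st.t.ppf(0.975, df=n - 1)  # slightly larger than z_crit

z_interval = (mean - z_crit * se, mean + z_crit * se)
t_interval = (mean - t_crit * se, mean + t_crit * se)
print(z_interval)
print(t_interval)
```

The t critical value matters more for small samples; at this sample size the choice barely moves the endpoints.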

## Question 2a: Stat Tests and Hypotheses

Your ability to answer the first question has the executive excited, and now she has many other questions about the types of movies being made and the differences in those movies' budgets and gross amounts.

Read through the questions below and **determine what type of statistical test you should use** for each question and **write down the null and alternative hypothesis for those tests**.

- Is there a relationship between the number of Facebook likes for a cast and the box office gross of the movie?
- Do foreign films perform differently at the box office than non-foreign films?
- Are 40% of all movies created rated R?
- Is there a relationship between the language of a film and the content rating (G, PG, PG-13, R) of that film?
- Is there a relationship between the content rating of a film and its budget? 

### **Answers**

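1a) Correlation test (Pearson's r, equivalently the slope test in a simple linear regression): both variables are continuous

1b) 
 * H0: There is no linear relationship between cast Facebook likes and box office gross (ρ = 0).

 * H1: There is a linear relationship between cast Facebook likes and box office gross (ρ ≠ 0).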
---
2a) Two-sample t-test (independent samples, two-sided)

2b)
 * H0: Foreign and non-foreign films have the same mean box office gross.

 * H1: Foreign and non-foreign films have different mean box office gross.
---
3a) One-sample z-test for a proportion

3b)
 * H0: 40% of all movies created are rated R (p = 0.40).

 * H1: The share of all movies created that are rated R is not 40% (p ≠ 0.40).
---
4a) Chi-square (χ²) test for independence

4b)
 * H0: There is not a relationship between the language of a film and the content rating (G, PG, PG-13, R) of that film.

 * H1: There is a relationship between the language of a film and the content rating (G, PG, PG-13, R) of that film.
---
5a) One-way ANOVA: content rating is categorical with more than two groups and budget is continuous

5b)
 * H0: Mean budget is the same for every content rating.

 * H1: At least one content rating has a different mean budget.
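To make the mapping from question type to test concrete, here is a hedged sketch that runs one test of each kind. All data are synthetic or made-up counts chosen for illustration (none come from the movie dataset), and the proportion z-test is written out by hand with the null value in the standard error.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)  # synthetic data only, not the movie dataset

# 1) Two numeric variables: Pearson correlation test (H0: rho = 0)
likes = rng.normal(size=200)
gross = 0.5 * likes + rng.normal(size=200)
r, p_corr = st.pearsonr(likes, gross)

# 2) Numeric outcome, two groups: two-sample t-test (H0: equal means)
foreign = rng.normal(loc=0.0, size=80)
domestic = rng.normal(loc=0.3, size=120)
t_stat, p_ttest = st.ttest_ind(foreign, domestic)

# 3) One proportion against a fixed value: z-test (H0: p = 0.40)
n_movies, n_rated_r, p0 = 1000, 430, 0.40  # hypothetical counts
p_hat = n_rated_r / n_movies
z_stat = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n_movies)
p_prop = 2 * st.norm.sf(abs(z_stat))

# 4) Two categorical variables: chi-square test of independence
# rows = two hypothetical languages, columns = G, PG, PG-13, R counts
observed = np.array([[20, 60, 90, 130],
                     [5, 10, 15, 40]])
chi2, p_chi2, dof, expected = st.chi2_contingency(observed)

# 5) Numeric outcome, three or more groups: one-way ANOVA (H0: all means equal)
budget_g = rng.normal(loc=1.0, size=50)
budget_pg = rng.normal(loc=1.2, size=50)
budget_r = rng.normal(loc=0.8, size=50)
f_stat, p_anova = st.f_oneway(budget_g, budget_pg, budget_r)

for name, p in [("correlation", p_corr), ("t-test", p_ttest),
                ("proportion", p_prop), ("chi-square", p_chi2),
                ("ANOVA", p_anova)]:
    print(f"{name}: p = {p:.4f}")
```

In each case the decision rule is the same: reject H0 when the p-value falls below the chosen significance level.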

## Question 2b: Performing Testing

Calculate the answer for the second question:

- Do foreign films perform differently at the box office than non-foreign films?

In [None]:
# df

In [None]:
# Drop rows only where the columns used in this test are missing
df_no_na = df.dropna(subset=['gross', 'country'])
# df_no_na

### Determine the performance of foreign films at the box office

In [None]:
# Identify the non-USA films
df_foreign_films = df_no_na[df_no_na.loc[:, "country"] != 'USA']['gross']
# df_foreign_films.reset_index(drop=True,inplace=True)
df_foreign_films = np.array(df_foreign_films)

In [None]:
# # Determine average box office gross
# df_ff_perf = df_foreign_films
# df_ff_perf = np.array(df_ff_perf)
# df_ff_perf

### Determine performance of USA films at box office

In [None]:
# Identify the USA films
df_usa_films = df_no_na[df_no_na.loc[:, "country"] == 'USA']['gross']
# df_usa_films.reset_index(drop=True,inplace=True)
df_usa_films = np.array(df_usa_films)

In [None]:
# Determine average box office gross
# df_df_perf = df_usa_films['gross']
# df_df_perf = np.array(df_df_perf)
# df_df_perf

### Perform Test

In [None]:
# Perform a two-sample t-test for a difference of means
# (compares the foreign and USA gross arrays defined in the cells above)

t_stat, p_value = st.ttest_ind(df_foreign_films, df_usa_films)
t_stat, p_value

### **Answer:** 

Since our p-value is less than .05, we reject the null hypothesis that foreign and non-foreign films have the same mean box office gross.
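The default `st.ttest_ind` pools the two variances. When group sizes and spreads differ, Welch's version (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic arrays; the group means, spreads, and sizes are assumptions made up for illustration.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)  # synthetic gross values, for illustration only

# Hypothetical groups with different sizes and spreads, plus some missing values
foreign_gross = rng.normal(loc=20e6, scale=15e6, size=150)
usa_gross = rng.normal(loc=30e6, scale=40e6, size=600)
usa_gross[:10] = np.nan

# Remove NaNs from each array before testing
foreign_clean = foreign_gross[~np.isnan(foreign_gross)]
usa_clean = usa_gross[~np.isnan(usa_gross)]

# equal_var=False requests Welch's t-test; the default pools the variances
t_welch, p_welch = st.ttest_ind(foreign_clean, usa_clean, equal_var=False)
t_pooled, p_pooled = st.ttest_ind(foreign_clean, usa_clean)

print(f"Welch:  t = {t_welch:.3f}, p = {p_welch:.4f}")
print(f"Pooled: t = {t_pooled:.3f}, p = {p_pooled:.4f}")
```

If the two versions disagree noticeably, that is a hint that the equal-variance assumption deserves a closer look.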

## Question 3: Feature Selection

Now that you have answered all of those questions, the executive wants you to **create a model that predicts the money a movie will make if it is released next year in the US.** She wants to use this to evaluate different scripts and then decide which one has the largest revenue potential. 

Below is a list of potential features you could use in the model. Would you use all of these features in the model? Identify which features you might drop and why.


*Remember you want to be able to use this model to predict the box office gross of a film **before** anyone has seen it.*

- **budget**: The amount of money spent to make the movie
- **title_year**: The year the movie first came out in the box office
- **years_old**: How long has it been since the movie was released
- **genre**: Each movie is assigned one genre category like action, horror, comedy
- **avg_user_rating**: This rating is taken from Rotten tomatoes, and is the average rating given to the movie by the audience
- **actor_1_facebook_likes**: The number of likes that the most popular actor in the movie has
- **total_cast_facebook_likes**: The sum of likes for the three most popular actors in the movie
- **language**: the original spoken language of the film


Which features should be removed based on the correlation coefficient between each pair of variables? The goal is to avoid multicollinearity.

### **Answer**

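Remove:

* years_old - the movie has not been released yet, and the value is just the current year minus title_year, so it is perfectly collinear with title_year
* avg_user_rating - audiences cannot rate a movie before release, so no scores exist at prediction time
* either actor_1_facebook_likes or total_cast_facebook_likes - the cast total includes the lead actor's likes, so keeping both invites multicollinearity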
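One way to spot collinear features is to scan the pairwise correlation matrix for large absolute values. The sketch below does this on synthetic features built to mimic the relationships described above; the 0.8 cutoff and the generated values are assumptions, not properties of the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # synthetic features, for illustration only
n = 300
title_year = rng.integers(1980, 2021, size=n)
actor_1_likes = rng.gamma(2.0, 5000.0, size=n)

features = pd.DataFrame({
    "title_year": title_year,
    "years_old": 2021 - title_year,  # exact linear function of title_year
    "actor_1_facebook_likes": actor_1_likes,
    # cast total = lead actor plus smaller contributions from the others
    "total_cast_facebook_likes": actor_1_likes + rng.gamma(2.0, 1000.0, size=n),
})

corr = features.corr()

# Collect each pair of distinct features whose |r| exceeds the cutoff
cutoff = 0.8
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > cutoff
]
for a, b, r in high_pairs:
    print(f"{a} vs {b}: r = {r:.3f}")
```

For each flagged pair, keeping only one of the two features is the simplest remedy.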

## Question 4a: Dummies and Modeling

## Create the following variables:

### `years_old`: The number of years since the film was released.

In [None]:
# Create new column in dataframe
# Set the values equal to this year minus the title year

years_old = 2021 - df['title_year']
years_old

In [None]:
df['years_old'] = years_old

In [None]:
# Confirm results
df.columns

### Dummy categories for each of the following ratings:

    - `G`
    - `PG`
    - `R`

#### Dummies via Pandas

In [None]:
# Trying via .loc and/or np.select

# Use .loc to find ratings
# df_g_ratings = df[df['content_rating'] == 'G']

# df_ratings = df[df['content_rating'] == 'G' | df['content_rating'] == 'PG' | df['content_rating'] == 'R']



# condlist = [df['content_rating'] == 'G',
#             df['content_rating'] == 'PG',
#             df['content_rating'] == 'R'
#            ]

# choicelist = [df[g_rating]]

# np.select(condlist, choicelist)
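The scratch work above was heading toward building the indicators by direct comparison. A minimal sketch of that idea follows, using a small hypothetical `ratings` series. Note that combining comparisons with `|` requires parentheses around each one, which is why `.isin` is often easier.

```python
import pandas as pd

# Hypothetical ratings column, including a missing value
ratings = pd.Series(["G", "PG", "R", "PG-13", None, "R"], name="content_rating")

# One 0/1 column per rating of interest; missing values compare as False
manual_dummies = pd.DataFrame({
    rating: (ratings == rating).astype(int) for rating in ["G", "PG", "R"]
})
print(manual_dummies)

# Filtering to several ratings: parenthesize each comparison, or use isin
subset_or = ratings[(ratings == "G") | (ratings == "PG") | (ratings == "R")]
subset_isin = ratings[ratings.isin(["G", "PG", "R"])]
```

Both filters select the same rows; `manual_dummies` has the same shape of result that `pd.get_dummies` produces for these three ratings.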

In [None]:
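# pd.get_dummies: one indicator column per rating, keeping only G, PG, and R
# (dtype=int gives 0/1 columns; drop_first is unnecessary because we select columns explicitly)

dummies_rating = pd.get_dummies(df['content_rating'], dtype=int)
dummies_rating = dummies_rating.loc[:, ["G", "PG", "R"]]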
dummies_rating

In [None]:
df_with_dummies = pd.concat([df, dummies_rating], axis=1)
df_with_dummies.dropna(inplace=True)
df_with_dummies

#### ✨ HELP - Dummies via OHE

In [None]:
# df_nan = df.copy()
# df_nan.dropna(inplace=True)
# df_nan

In [None]:
# ohe = OneHotEncoder(drop='first')
# df_ohe = ohe.fit_transform(df_nan)

In [None]:
# new_df_ohe = pd.DataFrame(df_ohe.todense(), columns=ohe.get_feature_names())
# new_df_ohe.head()

## Create a summary output for the following OLS model:

`gross~cast_total_facebook_likes+budget+years_old+G+PG+R`

In [None]:
# Formula adapted from the 'Simple Linear Regression' notebook; `ols` is imported at the top

ols(formula="gross ~ cast_total_facebook_likes + budget + years_old + G + PG + R", data=df_with_dummies).fit().summary()

## Question 4b: Judging the Model

Below is the summary output you should have gotten above. Identify any key takeaways from it.
- How ‘good’ is this model?
- Which features help to explain the variance in the target variable? 
    - Which do not? 

### Bad summary

<img src="ols_summary.png" style="width:300px;">

### Answer


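* Poor-quality model: R² is .136, so the features explain only about 13.6% of the variance in gross.


* Help explain: every feature other than `years_old` (all have p < .05)


* Do not help explain: `years_old`, whose p-value is > .05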
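The same two readings (overall fit and per-feature p-values) can be pulled from a fitted model directly instead of from the summary table. The sketch below uses a synthetic `toy` frame in which only `budget` drives `gross`; the column names echo the notebook, but the data and coefficients are made up.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(3)  # synthetic data, for illustration only
n = 500
toy = pd.DataFrame({
    "budget": rng.normal(size=n),
    "years_old": rng.normal(size=n),  # unrelated to gross by construction
})
toy["gross"] = 2.0 * toy["budget"] + rng.normal(scale=5.0, size=n)

results = ols("gross ~ budget + years_old", data=toy).fit()

# Share of variance in the target explained by the model
print(f"R-squared: {results.rsquared:.3f}")

# p-value for each coefficient; small values suggest the feature helps
print(results.pvalues)
significant = results.pvalues[results.pvalues < 0.05].index.tolist()
print(significant)
```

A low R² with significant coefficients, as in the summary above, means the features carry real signal but leave most of the variance unexplained.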

## Question 5: Bayes

**Bayes Theorem**

An advertising executive is studying television viewing habits of married men and women during prime time hours. Based on the past viewing records he has determined that during prime time wives are watching television 60% of the time. It has also been determined that when the wife is watching television, 40% of the time the husband is also watching. When the wife is not watching the television, 30% of the time the husband is watching the television. Find the probability that if the husband is watching the television, the wife is also watching the television.

***

**P(wives watching)** = .6

**P(husband watching | wife watching)** = .4

**P(husband watching | wife not watching)** = .3

**P(h

In [None]:
# your answer here
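A short worked calculation, using only the three probabilities given in the problem: the law of total probability gives P(husband watching), and Bayes' theorem then gives P(wife watching | husband watching). Variable names are illustrative.

```python
# Probabilities stated in the problem
p_wife = 0.6
p_husband_given_wife = 0.4
p_husband_given_not_wife = 0.3

# Law of total probability: P(H) = P(H|W)P(W) + P(H|not W)P(not W)
p_husband = (p_husband_given_wife * p_wife
             + p_husband_given_not_wife * (1 - p_wife))

# Bayes' theorem: P(W|H) = P(H|W)P(W) / P(H)
p_wife_given_husband = p_husband_given_wife * p_wife / p_husband

print(round(p_husband, 2))             # 0.36
print(round(p_wife_given_husband, 3))  # 0.667
```

So if the husband is watching, the wife is also watching about two thirds of the time (0.24 / 0.36).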


## Question 6: Type 1 Error

Explain what a Type I error is and how it relates to the significance level when doing a statistical test. 

Answer: A Type I error (a *false positive*) occurs when we reject a null hypothesis that is actually true. The significance level, α, is the Type I error rate we agree to accept before running the test: we reject the null whenever the p-value falls below α. With α = .05, we accept a 5% chance of a false positive when the null hypothesis is true.
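A small simulation can illustrate the link between α and the Type I error rate. In the sketch below every sample is drawn from a population where the null hypothesis is true, so each rejection is a false positive; the number of simulations and the sample size are arbitrary choices.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(4)
alpha = 0.05
n_sims, n_obs = 2000, 30

# Every row is a sample from a population whose true mean is 0,
# so the null hypothesis (mean = 0) is true for every test
samples = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n_obs))

# One-sample t-test per row
t_stats, p_values = st.ttest_1samp(samples, popmean=0.0, axis=1)

# Fraction of tests that wrongly reject the true null
false_positive_rate = np.mean(p_values < alpha)
print(false_positive_rate)
```

The observed rate should land near α, and lowering α lowers it at the cost of making real effects harder to detect.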

## Question 7: CI and T-Tests

How is the confidence interval for a sample related to a one sample t-test?

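A two-sided one-sample t-test and the confidence interval are built from the same sample mean, standard error, and t distribution, so they always agree: the test at significance level α fails to reject H0: μ = μ0 exactly when μ0 lies inside the (1 − α) confidence interval. For example, a hypothesized mean outside the 95% interval is rejected at α = .05.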
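The agreement between the interval and the test can be checked numerically. The sketch below draws one synthetic sample, builds its 95% t-interval, and confirms for several hypothesized means that "inside the interval" and "p-value above α" coincide. The sample and the candidate means are arbitrary.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(5)  # synthetic sample, for illustration only
sample = rng.normal(loc=10.0, scale=2.0, size=40)
alpha = 0.05

# 95% t-based confidence interval for the mean
ci_low, ci_high = st.t.interval(
    1 - alpha, df=len(sample) - 1, loc=sample.mean(), scale=st.sem(sample)
)

def agrees(mu0):
    """True when the test decision matches interval membership for mu0."""
    _, p_value = st.ttest_1samp(sample, popmean=mu0)
    inside = ci_low <= mu0 <= ci_high
    return (p_value > alpha) == inside

checks = {mu0: agrees(mu0) for mu0 in (8.0, 9.5, 10.0, 10.5, 12.0)}
print((ci_low, ci_high))
print(checks)
```

Every entry in `checks` should be `True`, since the two procedures are the same calculation read in two directions.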