# Phase 2 Review

In [76]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from statsmodels.formula.api import ols
from scipy import stats

pd.set_option('display.max_columns', 100)

### Check Your Data … Quickly
The first thing to do with a new dataset is to quickly verify its contents with the `.head()` method.

In [3]:
df = pd.read_csv('movie_metadata.csv')
print(df.shape)
df.head()

(5043, 28)


Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens ...,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0


## Question 1

A Hollywood executive wants to know how much an R-rated movie released after 2000 will earn. The data above is a sample of some of the movies with that rating during that timeframe, as well as other movies. How would you go about answering her question? Talk through it theoretically and then do it in code.

What is the 95% confidence interval for a post-2000 R-rated movie's box office gross?

In [6]:
# Steps:
# 1. Check the count of null values in the data frame.
# 2. Clean the data by dropping rows with a missing "gross", then filter to R-rated movies released after 2000.
# 3. Calculate the mean, standard deviation, and sample size of "gross" for that subset, and find the z (or t) critical value.
# 4. Plug those values into mean ± critical value * std / sqrt(n) to get the 95% confidence interval for the mean gross.

In [9]:
# do it in code here
df.isna().sum()

color                         19
director_name                104
num_critic_for_reviews        50
duration                      15
director_facebook_likes      104
actor_3_facebook_likes        23
actor_2_name                  13
actor_1_facebook_likes         7
gross                        884
genres                         0
actor_1_name                   7
movie_title                    0
num_voted_users                0
cast_total_facebook_likes      0
actor_3_name                  23
facenumber_in_poster          13
plot_keywords                153
movie_imdb_link                0
num_user_for_reviews          21
language                      12
country                        5
content_rating               303
budget                       492
title_year                   108
actor_2_facebook_likes        13
imdb_score                     0
aspect_ratio                 329
movie_facebook_likes           0
dtype: int64

In [42]:
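# 95% confidence interval for the mean gross
# Drop rows with no gross figure (this modifies df in place for the rest of the notebook)
df.dropna(subset=['gross'], inplace=True)
# Keep only R-rated movies released after 2000
df_r2k = df[(df['title_year'] > 2000) & (df['content_rating'] == 'R')]
r2k_mean = df_r2k['gross'].mean()
r2k_std = df_r2k['gross'].std()   # sample standard deviation (ddof=1)
r2k_n = df_r2k['gross'].count()
# Normal-based interval: mean ± z * std / sqrt(n)
r2k_ci = stats.norm.interval(.95, loc=r2k_mean, scale=r2k_std/np.sqrt(r2k_n))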

print(f'mean = {r2k_mean}')
print(f'std = {r2k_std}')
print(f'n = {r2k_n}')
print(f'The 95% confidence interval is {r2k_ci}')

mean = 27648848.437913906
std = 39088854.942774445
n = 1208
The 95% confidence interval is (25444564.31555217, 29853132.56027564)


In [33]:
r2k_std/np.sqrt(r2k_n)

1124655.4221143075
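
The standard error printed above is the only ingredient the interval needs besides the mean and a critical value. The sketch below rebuilds the interval by hand and compares the z-based version with a t-based one. It runs on synthetic, right-skewed data (the `sample` array is a made-up stand-in, not the movie data), so the numbers will not match the output above. Note that either interval describes the mean gross of such movies, not the gross of a single movie.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-in for the gross values (an assumption, not the real data).
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=16.5, sigma=1.2, size=1208)

n = sample.size
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean

# z-based interval, the same formula stats.norm.interval applies
z_crit = stats.norm.ppf(0.975)
ci_z = (mean - z_crit * sem, mean + z_crit * sem)

# t-based interval with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_t = (mean - t_crit * sem, mean + t_crit * sem)

print(ci_z)
print(ci_t)
```

With a sample this large the two intervals are nearly identical; the t-based interval is always slightly wider.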

## Question 2a

Your ability to answer the first question has the executive excited, and now she has many other questions about the types of movies being made and the differences in those movies' budgets and gross amounts.

Read through the questions below and **determine what type of statistical test you should use** for each question and **write down the null and alternative hypothesis for those tests**.

- Is there a relationship between the number of Facebook likes for a cast and the box office gross of the movie?
- Do foreign films perform differently at the box office than non-foreign films?
- Are 40% of all movies created rated R?
- Is there a relationship between the language of a film and the content rating (G, PG, PG-13, R) of that film?
- Is there a relationship between the content rating of a film and its budget? 

#### 1. Simple Linear Regression
* H0: There is no linear relationship between the number of FB likes for a cast and the box office gross of a movie (slope = 0)
* Ha: There is a linear relationship between the number of FB likes for a cast and the box office gross of a movie (slope != 0)

#### 2. Two-sided, two-sample t-test
* H0: Foreign films and non-foreign films have the same mean box office gross
* Ha: Foreign films and non-foreign films have different mean box office gross

#### 3. Two-sided one-sample z-test for a proportion
* H0: 40% of all movies created are rated R (P = .40)
* Ha: The proportion of all movies created that are rated R is not 40% (P != .40)

#### 4. Chi-Squared test of independence
* H0: The content rating of a film is independent of its language
* Ha: The content rating of a film is not independent of its language

#### 5. ANOVA
* H0: The mean budget is the same for every content rating
* Ha: At least one content rating has a different mean budget
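
As a rough guide to how these five tests map onto `scipy.stats`, here is a minimal sketch. All of the arrays and counts are synthetic placeholders invented for illustration (none come from `movie_metadata.csv`), and the proportion test is written out by hand rather than taken from a library helper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 1. Simple linear regression: H0 is slope = 0
likes = rng.normal(10_000, 3_000, size=200)
gross = 2_000 * likes + rng.normal(0, 5e6, size=200)
reg = stats.linregress(likes, gross)          # reg.pvalue tests the slope

# 2. Two-sided, two-sample t-test: H0 is equal group means
group_a = rng.normal(50, 10, size=120)
group_b = rng.normal(45, 10, size=150)
t_stat, p_ttest = stats.ttest_ind(group_a, group_b)

# 3. Two-sided z-test for one proportion: H0 is p = 0.40
n_movies, n_rated_r = 500, 230                # hypothetical counts
p_hat = n_rated_r / n_movies
se0 = np.sqrt(0.40 * 0.60 / n_movies)         # standard error under H0
z_stat = (p_hat - 0.40) / se0
p_prop = 2 * stats.norm.sf(abs(z_stat))

# 4. Chi-squared test of independence on a language x rating table
table = np.array([[30, 45, 60, 65],
                  [10, 15, 20, 55]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# 5. One-way ANOVA: H0 is equal mean budget across ratings
g, pg, r = (rng.normal(m, 8, size=80) for m in (40, 42, 35))
f_stat, p_anova = stats.f_oneway(g, pg, r)

print(reg.pvalue, p_ttest, p_prop, p_chi2, p_anova)
```

In each case a p-value below the chosen alpha leads to rejecting H0.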

## Question 2b

Calculate the answer for the second question:

- Do foreign films perform differently at the box office than non-foreign films?

In [4]:
# your answer here
df.head(3)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000


In [38]:
# Assuming an alpha of .05
# Rows with a missing country are not dropped here: NaN != 'USA' evaluates to True,
# so they are counted in the foreign group below.

domestic = df[df['country'] == 'USA']['gross']
foreign = df[df['country'] != 'USA']['gross']
domestic_mean = domestic.mean()
foreign_mean = foreign.mean()
domestic_std = domestic.std()
foreign_std = foreign.std()
domestic_n = domestic.count()
foreign_n = foreign.count()

print(stats.ttest_ind(foreign, domestic, equal_var=True, nan_policy='omit'))
print(stats.ttest_ind_from_stats(domestic_mean, domestic_std, domestic_n, foreign_mean, foreign_std, foreign_n, equal_var=True))
print("As the P_val is less than the alpha of .05, we reject the null hypothesis that foreign and domestic films perform the same at the box office")

Ttest_indResult(statistic=-12.098302287742106, pvalue=3.8631094668570096e-33)
Ttest_indResult(statistic=12.098302287742106, pvalue=3.863109466861356e-33)
As the P_val is less than the alpha of .05, we reject the null hypothesis that foreign and domestic films perform the same at the box office
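
The test above pools the variances (`equal_var=True`). When the two groups differ a lot in size and spread, Welch's t-test (`equal_var=False`) is a common alternative that drops the equal-variance assumption. The sketch below compares the two on synthetic data; `domestic_demo` and `foreign_demo` are made-up arrays, not the real gross columns.

```python
import numpy as np
from scipy import stats

# Synthetic groups with different sizes and spreads (assumed for illustration only).
rng = np.random.default_rng(1)
domestic_demo = rng.lognormal(mean=17.0, sigma=1.3, size=3000)
foreign_demo = rng.lognormal(mean=16.0, sigma=1.0, size=800)

# Pooled-variance t-test, as used in the cell above
pooled = stats.ttest_ind(foreign_demo, domestic_demo, equal_var=True)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(foreign_demo, domestic_demo, equal_var=False)

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

If both versions lead to the same decision, the conclusion does not hinge on the equal-variance assumption.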


## Question 3

Now that you have answered all of those questions, the executive wants you to create a model that predicts the money a movie will make if it is released next year in the US. She wants to use this to evaluate different scripts and then decide which one has the largest revenue potential. 

Below is a list of potential features you could use in the model. Create a new frame containing only those variables.

Would you use all of these features in the model?

Identify which features you might drop and why.

*Remember you want to be able to use this model to predict the box office gross of a film **before** anyone has seen it.*

- **budget**: The amount of money spent to make the movie
- **title_year**: The year the movie was first released at the box office
- **years_old**: How long it has been since the movie was released
- **genre**: Each movie is assigned one genre category, such as action, horror, or comedy
- **avg_user_rating**: This rating is taken from Rotten Tomatoes and is the average rating given to the movie by the audience
- **actor_1_facebook_likes**: The number of likes that the most popular actor in the movie has
- **total_cast_facebook_likes**: The sum of likes for the three most popular actors in the movie
- **language**: The original spoken language of the film


In [46]:
df.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,Daryl Sabara,John Carter,212204,1873,Polly Walker,1.0,alien|american civil war|male nipple|mars|prin...,http://www.imdb.com/title/tt0401729/?ref_=fn_t...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000


In [77]:
df.loc[0, 'genres'].split('|')

['Action', 'Adventure', 'Fantasy', 'Sci-Fi']
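
The split above suggests two ways to turn the pipe-delimited `genres` string into model features. The sketch below shows both on a tiny hypothetical frame (`genres_demo` is invented for illustration): taking the first listed genre as a single category, or building one indicator column per genre. Treating the first listed genre as the primary one is an assumption, not something the dataset guarantees.

```python
import pandas as pd

# Tiny hypothetical frame that mimics the format of the genres column.
genres_demo = pd.DataFrame({
    'genres': ['Action|Adventure|Fantasy|Sci-Fi', 'Documentary', 'Action|Thriller']
})

# Option 1: keep only the first listed genre as a single category
genres_demo['genre'] = genres_demo['genres'].str.split('|').str[0]

# Option 2: one 0/1 indicator column per genre
genre_dummies = genres_demo['genres'].str.get_dummies(sep='|')

print(genres_demo)
print(genre_dummies)
```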

In [117]:
# your answer here
# Keep the target plus the candidate features; imdb_score is used here as the
# closest available column to avg_user_rating.
# .copy() makes model_prep an independent frame, so the assignments below do not
# trigger SettingWithCopyWarning.
model_prep = df[['gross', 'budget', 'title_year', 'genres', 'imdb_score', 'actor_1_facebook_likes', 
                  'cast_total_facebook_likes', 'content_rating', 'language']].copy()
model_prep = model_prep.dropna(subset=['title_year'])
model_prep['years_old'] = 2021 - model_prep['title_year'].astype(int)
model_prep.head()


Unnamed: 0,gross,budget,title_year,genres,imdb_score,actor_1_facebook_likes,cast_total_facebook_likes,content_rating,language,years_old
0,760505847.0,237000000.0,2009.0,Action|Adventure|Fantasy|Sci-Fi,7.9,1000.0,4834,PG-13,English,12
1,309404152.0,300000000.0,2007.0,Action|Adventure|Fantasy,7.1,40000.0,48350,PG-13,English,14
2,200074175.0,245000000.0,2015.0,Action|Adventure|Thriller,6.8,11000.0,11700,PG-13,English,6
3,448130642.0,250000000.0,2012.0,Action|Thriller,8.5,27000.0,106759,PG-13,English,9
5,73058679.0,263700000.0,2012.0,Action|Adventure|Sci-Fi,6.6,640.0,1873,PG-13,English,9


In [118]:
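# Restrict to numeric columns; recent pandas versions raise an error if string columns are included.
model_prep.select_dtypes('number').corr()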

Unnamed: 0,gross,budget,title_year,imdb_score,actor_1_facebook_likes,cast_total_facebook_likes,years_old
gross,1.0,0.102179,0.030886,0.199432,0.15427,0.247184,-0.030886
budget,0.102179,1.0,0.04499,0.029135,0.017544,0.030189,-0.04499
title_year,0.030886,0.04499,1.0,-0.131504,0.085532,0.112207,-1.0
imdb_score,0.199432,0.029135,-0.131504,1.0,0.088893,0.099612,0.131504
actor_1_facebook_likes,0.15427,0.017544,0.085532,0.088893,1.0,0.945742,-0.085532
cast_total_facebook_likes,0.247184,0.030189,0.112207,0.099612,0.945742,1.0,-0.112207
years_old,-0.030886,-0.04499,-1.0,0.131504,-0.085532,-0.112207,1.0


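####  Answers for model
* 1. I would not use all of the features in the model.
* 2. I would drop `years_old`, `total_cast_facebook_likes`, and `avg_user_rating`. `years_old` is perfectly collinear with `title_year` (correlation of -1.0 above) and adds nothing for titles slated for next year. `total_cast_facebook_likes` is largely redundant with `actor_1_facebook_likes` (correlation of about 0.95). `avg_user_rating` comes from audiences, so it is not available before anyone has seen the film.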
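
One way to spot redundant features like these without reading the correlation matrix by eye is to list every pair whose absolute correlation exceeds a threshold. The sketch below does this on synthetic data built to mimic the two redundancies; the `demo` frame and the 0.9 cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic features that mimic the redundancies discussed above.
rng = np.random.default_rng(7)
n = 300
actor_1 = rng.gamma(2.0, 5_000, size=n)
demo = pd.DataFrame({
    'actor_1_facebook_likes': actor_1,
    'cast_total_facebook_likes': actor_1 + rng.gamma(2.0, 500, size=n),
    'title_year': rng.integers(1980, 2017, size=n),
    'budget': rng.gamma(2.0, 2e7, size=n),
})
demo['years_old'] = 2021 - demo['title_year']

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = demo.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs above the (arbitrary) 0.9 threshold are candidates for dropping one member
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.9]
print(high_pairs)
```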

## Question 4a

Create the following variables:

- `years_old`: The number of years since the film was released.
- Dummy categories for each of the following ratings:
    - `G`
    - `PG`
    - `R`
    
Once you have those variables, create a summary output for the following OLS model:

`gross~cast_total_facebook_likes+budget+years_old+G+PG+R`

In [119]:
model_prep[(model_prep['content_rating'] == 'PG-13')].count()

gross                        1400
budget                       1331
title_year                   1400
genres                       1400
imdb_score                   1400
actor_1_facebook_likes       1400
cast_total_facebook_likes    1400
content_rating               1400
language                     1400
years_old                    1400
dtype: int64

In [120]:
model_prep.head()

Unnamed: 0,gross,budget,title_year,genres,imdb_score,actor_1_facebook_likes,cast_total_facebook_likes,content_rating,language,years_old
0,760505847.0,237000000.0,2009.0,Action|Adventure|Fantasy|Sci-Fi,7.9,1000.0,4834,PG-13,English,12
1,309404152.0,300000000.0,2007.0,Action|Adventure|Fantasy,7.1,40000.0,48350,PG-13,English,14
2,200074175.0,245000000.0,2015.0,Action|Adventure|Thriller,6.8,11000.0,11700,PG-13,English,6
3,448130642.0,250000000.0,2012.0,Action|Thriller,8.5,27000.0,106759,PG-13,English,9
5,73058679.0,263700000.0,2012.0,Action|Adventure|Sci-Fi,6.6,640.0,1873,PG-13,English,9


In [121]:
# your answer here
summary_ols = pd.get_dummies(model_prep, columns=['content_rating']).drop(columns='content_rating_PG-13')

In [122]:
summary_ols.head()

Unnamed: 0,gross,budget,title_year,genres,imdb_score,actor_1_facebook_likes,cast_total_facebook_likes,language,years_old,content_rating_Approved,content_rating_G,content_rating_GP,content_rating_M,content_rating_NC-17,content_rating_Not Rated,content_rating_PG,content_rating_Passed,content_rating_R,content_rating_Unrated,content_rating_X
0,760505847.0,237000000.0,2009.0,Action|Adventure|Fantasy|Sci-Fi,7.9,1000.0,4834,English,12,0,0,0,0,0,0,0,0,0,0,0
1,309404152.0,300000000.0,2007.0,Action|Adventure|Fantasy,7.1,40000.0,48350,English,14,0,0,0,0,0,0,0,0,0,0,0
2,200074175.0,245000000.0,2015.0,Action|Adventure|Thriller,6.8,11000.0,11700,English,6,0,0,0,0,0,0,0,0,0,0,0
3,448130642.0,250000000.0,2012.0,Action|Thriller,8.5,27000.0,106759,English,9,0,0,0,0,0,0,0,0,0,0,0
5,73058679.0,263700000.0,2012.0,Action|Adventure|Sci-Fi,6.6,640.0,1873,English,9,0,0,0,0,0,0,0,0,0,0,0


In [123]:
film_lr = ols(formula='gross~cast_total_facebook_likes+budget+years_old+content_rating_G+content_rating_PG+content_rating_R', data=summary_ols).fit()
film_lr.summary()

Dep. Variable:,gross,R-squared:,0.134
Model:,OLS,Adj. R-squared:,0.132
Method:,Least Squares,F-statistic:,99.95
Date:,"Wed, 03 Mar 2021",Prob (F-statistic):,2.71e-117
Time:,16:13:01,Log-Likelihood:,-75517.0
No. Observations:,3891,AIC:,151000.0
Df Residuals:,3884,BIC:,151100.0
Df Model:,6,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,5.24e+07,2.62e+06,20.018,0.000,4.73e+07,5.75e+07
cast_total_facebook_likes,864.9426,55.572,15.564,0.000,755.990,973.895
budget,0.0270,0.005,5.743,0.000,0.018,0.036
years_old,-1.444e+05,1.06e+05,-1.365,0.172,-3.52e+05,6.3e+04
content_rating_G,2.888e+07,7.06e+06,4.089,0.000,1.5e+07,4.27e+07
content_rating_PG,1.573e+07,3.2e+06,4.913,0.000,9.45e+06,2.2e+07
content_rating_R,-2.851e+07,2.3e+06,-12.378,0.000,-3.3e+07,-2.4e+07

Omnibus:,2340.243,Durbin-Watson:,1.061
Prob(Omnibus):,0.0,Jarque-Bera (JB):,39932.29
Skew:,2.544,Prob(JB):,0.0
Kurtosis:,17.846,Cond. No.,1550000000.0


## Question 4b

Below is the summary output you should have gotten above. Identify any key takeaways from it.
- How ‘good’ is this model?
- Which features help to explain the variance in the target variable? 
    - Which do not? 


#### Question 4b answers
* The R-squared value of 0.134 is low: the model explains only about 13% of the variation of the dependent variable (target) around its mean.
* According to their p-values (all below 0.05), every independent variable other than `years_old` helps to explain the variance in the target variable.
* Because its p-value (0.172) is above 0.05, `years_old` does not significantly help to explain the variance in the target variable.

<img src="ols_summary.png" style="width:300px;">
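
The same takeaways can be read from a fitted model programmatically instead of from the summary table. The sketch below fits a small OLS model on synthetic data (the `demo` frame, with one informative and one pure-noise predictor, is invented for illustration) and pulls out R-squared and the predictors whose p-values fall below 0.05.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: y depends on x_signal only; x_noise is unrelated to y.
rng = np.random.default_rng(3)
n = 500
demo = pd.DataFrame({
    'x_signal': rng.normal(size=n),
    'x_noise': rng.normal(size=n),
})
demo['y'] = 3.0 * demo['x_signal'] + rng.normal(scale=2.0, size=n)

fit = ols('y ~ x_signal + x_noise', data=demo).fit()

# Share of the target's variance explained by the model
print(round(fit.rsquared, 3))

# Terms whose coefficients differ significantly from zero at alpha = 0.05
significant = fit.pvalues[fit.pvalues < 0.05].index.tolist()
print(significant)
```

Applied to `film_lr`, the same two attributes (`rsquared` and `pvalues`) hold the values shown in the summary.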

In [None]:
# your answer here


## Question 5

**Bayes Theorem**

An advertising executive is studying television viewing habits of married men and women during prime time hours. Based on the past viewing records he has determined that during prime time wives are watching television 60% of the time. It has also been determined that when the wife is watching television, 40% of the time the husband is also watching. When the wife is not watching the television, 30% of the time the husband is watching the television. Find the probability that if the husband is watching the television, the wife is also watching the television.

In [None]:
# your answer here
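
One way to work this through, as a sketch: let W be "wife is watching" and H be "husband is watching". Bayes' theorem gives P(W | H) = P(H | W) P(W) / P(H), where P(H) comes from the law of total probability. The variable names below are chosen for illustration.

```python
# Probabilities given in the problem statement
p_wife = 0.60                      # P(W)
p_husband_given_wife = 0.40        # P(H | W)
p_husband_given_not_wife = 0.30    # P(H | not W)

# Law of total probability: P(H) = P(H | W) P(W) + P(H | not W) P(not W)
p_husband = (p_husband_given_wife * p_wife
             + p_husband_given_not_wife * (1 - p_wife))

# Bayes' theorem: P(W | H) = P(H | W) P(W) / P(H)
p_wife_given_husband = p_husband_given_wife * p_wife / p_husband

print(round(p_husband, 2))             # 0.36
print(round(p_wife_given_husband, 4))  # 0.6667
```

So if the husband is watching, the probability that the wife is also watching is 0.24 / 0.36, or about 67%.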


## Question 6

Explain what a Type I error is and how it relates to the significance level when doing a statistical test. 

In [None]:
# your answer here
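
A Type I error is rejecting a null hypothesis that is actually true, and the significance level alpha is the probability of making that error when the null holds. A small simulation can illustrate the link: the sketch below repeatedly tests a true null and counts how often it is rejected. The sample size, number of trials, and seed are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_trials = 2000

# Every row is a sample drawn from a population whose true mean is 0,
# so the null hypothesis "mean = 0" is true in every trial.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, 30))

# One-sample t-test per row against the true mean
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue

# Fraction of trials that (wrongly) reject the true null
type_1_rate = np.mean(p_values < alpha)
print(type_1_rate)
```

The rejection rate should land close to alpha, which is what setting a significance level of 0.05 means in practice.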


## Question 7

How is the confidence interval for a sample related to a one sample t-test?

In [None]:
#your answer here 
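
One way to see the relationship, as a sketch: a two-sided one-sample t-test at level alpha rejects H0: mean = mu0 exactly when mu0 falls outside the (1 - alpha) t-based confidence interval for the mean. The code below checks this on a synthetic sample; the sample and the candidate values of `mu0` are made up for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic sample (an assumption for illustration).
rng = np.random.default_rng(5)
sample = rng.normal(loc=10.0, scale=2.0, size=40)

n = sample.size
mean = sample.mean()
sem = stats.sem(sample)   # sample std (ddof=1) / sqrt(n)

# 95% t-based confidence interval for the mean
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

# For each hypothesized mean, compare "inside the CI" with "fail to reject at 0.05"
results = []
for mu0 in (9.0, 10.0, 11.5):
    p = stats.ttest_1samp(sample, popmean=mu0).pvalue
    inside = bool(ci_low <= mu0 <= ci_high)
    results.append((mu0, inside, bool(p > 0.05)))
    print(mu0, inside, p > 0.05)
```

The last two columns agree for every `mu0`: the confidence interval is the set of hypothesized means the t-test would not reject.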