# Phase 2 Review

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from statsmodels.formula.api import ols
import scipy.stats as stats
from statsmodels.stats.proportion import proportions_ztest

pd.set_option('display.max_columns', 100)

### Check Your Data … Quickly
The first thing to do when you get a new dataset is to quickly verify its contents with the `.head()` method.

In [2]:
df = pd.read_csv('movie_metadata.csv')
print(df.shape)
df.head()

(5043, 28)


Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens ...,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0


## Question 1

A Hollywood executive wants to know how much an R-rated movie released after 2000 will earn. The data above is a sample of some of the movies with that rating during that timeframe, as well as other movies. How would you go about answering her question? Talk through it theoretically and then do it in code.

What is the 95% confidence interval for a post-2000 R-rated movie's box office gross?

**Talk through your answer here**


**My Answer: First we need to filter our data by year and rating. Then we find the degrees of freedom, followed by the mean of the gross, and lastly the standard deviation.**


**Class answer: Subset df for content = R, drop NA, title_year > 2000**

In [3]:
#Class answer. This way you're not creating a bunch of new data frames, which can use up your ram with larger data sets.

#newer_df = df[(df['title_year'] > 2000) & (df['content_rating'] == 'R')].dropna(subset=['gross'])


In [4]:
#se = sd/n**.5

In [5]:
#mean - 1.96 * (se), mean + 1.96* se
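The class answer above can be sketched end to end as follows. This is a minimal illustration on a small made-up DataFrame (the `toy` frame and its values are hypothetical, not rows from `movie_metadata.csv`). It uses the normal critical value 1.96, which is close to the t value only when the sample is large.

```python
import pandas as pd

# Hypothetical stand-in for movie_metadata.csv; the values are made up.
toy = pd.DataFrame({
    "title_year": [1999, 2003, 2005, 2010, 2012, 2015],
    "content_rating": ["R", "R", "PG", "R", "R", "R"],
    "gross": [5.0, 10.0, 7.0, 20.0, None, 30.0],
})

# One chained filter: R-rated, released after 2000, with a known gross.
subset = toy[(toy["title_year"] > 2000) & (toy["content_rating"] == "R")].dropna(subset=["gross"])

n = len(subset)
mean = subset["gross"].mean()
se = subset["gross"].std() / n ** 0.5   # standard error of the mean

# Normal-approximation 95% interval for the mean gross.
ci_low, ci_high = mean - 1.96 * se, mean + 1.96 * se
print(n, mean, (ci_low, ci_high))
```

With only a handful of rows, as in this toy frame, a t critical value would be the safer choice; 1.96 is kept here only to mirror the class answer.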

In [6]:
# Filter columns for desired topics
df1 = df.loc[:, ['movie_title', 'content_rating', 'title_year', 'gross']]
df1.head()

Unnamed: 0,movie_title,content_rating,title_year,gross
0,Avatar,PG-13,2009.0,760505847.0
1,Pirates of the Caribbean: At World's End,PG-13,2007.0,309404152.0
2,Spectre,PG-13,2015.0,200074175.0
3,The Dark Knight Rises,PG-13,2012.0,448130642.0
4,Star Wars: Episode VII - The Force Awakens ...,,,


In [7]:
# Remove null values
df1.dropna(inplace=True)
df1.reset_index(drop=True, inplace= True)
df1

Unnamed: 0,movie_title,content_rating,title_year,gross
0,Avatar,PG-13,2009.0,760505847.0
1,Pirates of the Caribbean: At World's End,PG-13,2007.0,309404152.0
2,Spectre,PG-13,2015.0,200074175.0
3,The Dark Knight Rises,PG-13,2012.0,448130642.0
4,John Carter,PG-13,2012.0,73058679.0
...,...,...,...,...
4087,Cavite,Not Rated,2005.0,70071.0
4088,El Mariachi,R,1992.0,2040920.0
4089,Newlyweds,Not Rated,2011.0,4584.0
4090,Shanghai Calling,PG-13,2012.0,10443.0


In [8]:
# Filter for "R" rating
df1_rating = df1[df1['content_rating'] == "R"].reset_index(drop=True)

In [9]:
# Filter for release year > 2000
df1_filter = df1_rating[df1_rating['title_year'] > 2000].reset_index(drop=True)

In [10]:
df1_filter

Unnamed: 0,movie_title,content_rating,title_year,gross
0,Terminator 3: Rise of the Machines,R,2003.0,150350192.0
1,Alexander,R,2004.0,34293771.0
2,The Matrix Revolutions,R,2003.0,139259759.0
3,The Matrix Reloaded,R,2003.0,281492479.0
4,Mad Max: Fury Road,R,2015.0,153629485.0
...,...,...,...,...
1203,The Legend of God's Gun,R,2007.0,243768.0
1204,Down Terrace,R,2009.0,9609.0
1205,Sabotage,R,2014.0,10499968.0
1206,The Puffy Chair,R,2005.0,192467.0


In [11]:
gross = df1_filter.loc[:, "gross"]
gross

0       150350192.0
1        34293771.0
2       139259759.0
3       281492479.0
4       153629485.0
           ...     
1203       243768.0
1204         9609.0
1205     10499968.0
1206       192467.0
1207       136007.0
Name: gross, Length: 1208, dtype: float64

In [12]:
df_mean = gross.mean()
df_mean

27648848.437913906

In [13]:
dfree = len(df1_filter)-1
dfree

1207

In [14]:
df_std = gross.std()
df_std

39088854.942774445

In [15]:
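# 95% CI for the mean: confidence 0.95 (not 0.05), scale = standard error (output below is from the original alpha=0.05, scale=df_std call)
stats.t.interval(0.95, df=dfree, loc=df_mean, scale=df_std / len(gross) ** 0.5)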

(25197202.549191598, 30100494.326636214)

**We are 95% confident that the mean box office gross of post-2000 R-rated movies falls between 25197202.55 and 30100494.33 dollars.**

## Question 2a

Your ability to answer the first question has the executive excited, and now she has many other questions about the types of movies being made and the differences in those movies' budgets and gross amounts.

Read through the questions below and **determine what type of statistical test you should use** for each question and **write down the null and alternative hypothesis for those tests**.

- Is there a relationship between the number of Facebook likes for a cast and the box office gross of the movie?
- Do foreign films perform differently at the box office than non-foreign films?
- Are 40% of all movies rated R?
- Is there a relationship between the language of a film and the content rating (G, PG, PG-13, R) of that film?
- Is there a relationship between the content rating of a film and its budget? 

**our answers here**


**1. Pearson Correlation (not a statistical test)/Linear regression**

H0: There is no relationship between the number of Facebook likes for a cast and the box office gross of the movie. H0: β = 0


Ha: There is a relationship between the number of Facebook likes for a cast and the box office gross of the movie. Ha: β ≠ 0


**2. Two-sample t-test (independent samples)**


H0: The mean box office gross of foreign films is equal to that of domestic films.

Ha: The mean box office gross of foreign films is not equal to that of domestic films.

**3. Z-test for a proportion**


H0: The proportion of R-rated movies is equal to 0.4.


Ha: The proportion of R-rated movies is not equal to 0.4.


**4. Chi-Squared Test for independence (2 categorical variables)**


H0: The distributions of ratings for different languages are equal. There is no relationship between the language of a film and the content rating.


Ha: The distributions of ratings for different languages are not equal. There is a relationship between the language of a film and the content rating.

**5. ANOVA (1 categorical variable, 1 continuous variable)**


H0: There is no relationship between the content rating of a film and its budget.


Ha: There is a relationship between the content rating of a film and its budget.
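As a rough guide to how the five choices above map onto code, here is a minimal sketch using SciPy on synthetic data. All arrays, counts, and the contingency table are made up for illustration and do not come from the movie file. The proportion z statistic is computed by hand with the null proportion in the standard error, which is one common convention.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # synthetic data only; nothing here comes from the movie file

# 1. Correlation between two continuous variables
likes = rng.normal(size=200)
gross = 2.0 * likes + rng.normal(size=200)
r, p_corr = stats.pearsonr(likes, gross)

# 2. Two-sample t-test on the means of two independent groups
foreign = rng.normal(loc=0.0, size=100)
domestic = rng.normal(loc=1.0, size=100)
t_stat, p_ttest = stats.ttest_ind(foreign, domestic)

# 3. One-sample z-test for a proportion against p0 = 0.4
n_movies, n_r_rated, p0 = 500, 230, 0.4
p_hat = n_r_rated / n_movies
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n_movies)
p_prop = 2 * stats.norm.sf(abs(z))

# 4. Chi-squared test of independence on a contingency table
table = np.array([[30, 40, 50, 80],   # hypothetical counts: English films by rating
                  [10, 10, 15, 15]])  # hypothetical counts: non-English films by rating
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# 5. One-way ANOVA: one categorical factor, one continuous outcome
budgets_by_rating = [rng.normal(loc=m, size=50) for m in (1.0, 1.2, 3.0)]
f_stat, p_anova = stats.f_oneway(*budgets_by_rating)

print(p_corr, p_ttest, p_prop, p_chi2, p_anova)
```

In each case a p-value below the chosen significance level would lead to rejecting the corresponding H0.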

## Question 2b

Calculate the answer for the second question:

- Do foreign films perform differently at the box office than non-foreign films?

In [16]:
#class answer
#domestic = df[df['country'] == 'USA']['gross'].dropna()
#foreign = df[df['country'] != 'USA']['gross'].dropna()

In [17]:
#stats.ttest_ind(domestic, foreign)

In [18]:
df2 = df.loc[:, ['movie_title', 'country', 'gross']]
df2.head()

Unnamed: 0,movie_title,country,gross
0,Avatar,USA,760505847.0
1,Pirates of the Caribbean: At World's End,USA,309404152.0
2,Spectre,UK,200074175.0
3,The Dark Knight Rises,USA,448130642.0
4,Star Wars: Episode VII - The Force Awakens ...,,


In [19]:
df2.dropna(inplace=True)
df2.reset_index(drop=True, inplace= True)
df2

Unnamed: 0,movie_title,country,gross
0,Avatar,USA,760505847.0
1,Pirates of the Caribbean: At World's End,USA,309404152.0
2,Spectre,UK,200074175.0
3,The Dark Knight Rises,USA,448130642.0
4,John Carter,USA,73058679.0
...,...,...,...
4154,Cavite,Philippines,70071.0
4155,El Mariachi,USA,2040920.0
4156,Newlyweds,USA,4584.0
4157,Shanghai Calling,USA,10443.0


In [20]:
df2_domestic = df2[df2['country'] == "USA"].reset_index(drop=True)
df2_domestic

Unnamed: 0,movie_title,country,gross
0,Avatar,USA,760505847.0
1,Pirates of the Caribbean: At World's End,USA,309404152.0
2,The Dark Knight Rises,USA,448130642.0
3,John Carter,USA,73058679.0
4,Spider-Man 3,USA,336530303.0
...,...,...,...
3230,Primer,USA,424760.0
3231,El Mariachi,USA,2040920.0
3232,Newlyweds,USA,4584.0
3233,Shanghai Calling,USA,10443.0


In [21]:
df2_domestic_len = len(df2_domestic)
df2_domestic_len

3235

In [22]:
df2_foreign = df2[df2['country'] != "USA"].reset_index(drop=True)
df2_foreign

Unnamed: 0,movie_title,country,gross
0,Spectre,UK,200074175.0
1,Harry Potter and the Half-Blood Prince,UK,301956980.0
2,Quantum of Solace,UK,168368427.0
3,The Hobbit: The Battle of the Five Armies,New Zealand,255108370.0
4,King Kong,New Zealand,218051260.0
...,...,...,...
919,In the Company of Men,Canada,2856622.0
920,Clean,France,136007.0
921,The Circle,Iran,673780.0
922,The Cure,Japan,94596.0


In [23]:
df2_foreign_len = len(df2_foreign)
df2_foreign_len

924

In [24]:
df2_domestic_gross = df2[df2['country'] == "USA"]['gross']
df2_domestic_gross

0       760505847.0
1       309404152.0
3       448130642.0
4        73058679.0
5       336530303.0
           ...     
4153       424760.0
4155      2040920.0
4156         4584.0
4157        10443.0
4158        85222.0
Name: gross, Length: 3235, dtype: float64

In [25]:
df2_domestic_mean = df2_domestic_gross.mean()
df2_domestic_mean

55214607.22874807

In [26]:
df2_foreign_gross = df2[df2['country'] != "USA"]['gross']
df2_foreign_gross

2       200074175.0
8       301956980.0
11      168368427.0
19      255108370.0
24      218051260.0
           ...     
4144      2856622.0
4150       136007.0
4151       673780.0
4152        94596.0
4154        70071.0
Name: gross, Length: 924, dtype: float64

In [27]:
df2_foreign_mean = df2_foreign_gross.mean()
df2_foreign_mean

24849407.48809524

In [28]:
# your answer here
#proportions_ztest([df2_foreign_len, df2_domestic_len], [df2_foreign_mean, df2_domestic_mean], value = 0)
stats.ttest_ind(df2_foreign_gross, df2_domestic_gross)


Ttest_indResult(statistic=-12.098302287742106, pvalue=3.863109466861356e-33)

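**The p-value (about 3.9e-33) is far below 0.05, so we reject the null hypothesis: foreign and US films differ in mean box office gross, with foreign films averaging about 24.8 million dollars versus about 55.2 million for US films.**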
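The two groups here have very different sizes (3235 vs 924), and their variances may differ as well. One option worth considering is Welch's t-test, which `scipy.stats.ttest_ind` runs when `equal_var=False` is passed. The sketch below compares the two variants on synthetic data; the group means, spreads, and sizes are made-up assumptions, not values from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic groups with different sizes and spreads (made-up numbers).
group_a = rng.normal(loc=55.0, scale=70.0, size=3000)
group_b = rng.normal(loc=25.0, scale=40.0, size=900)

pooled = stats.ttest_ind(group_b, group_a)                  # assumes equal variances
welch = stats.ttest_ind(group_b, group_a, equal_var=False)  # Welch's t-test

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

When both variants agree, as they typically do with a difference this large, the conclusion does not depend on the equal-variance assumption.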

## Question 3

Now that you have answered all of those questions, the executive wants you to create a model that predicts the money a movie will make if it is released next year in the US. She wants to use this to evaluate different scripts and then decide which one has the largest revenue potential. 

Below is a list of potential features you could use in the model. Would you use all of these features in the model? Identify which features you might drop and why.


*Remember you want to be able to use this model to predict the box office gross of a film **before** anyone has seen it.*

- **budget**: The amount of money spent to make the movie
- **title_year**: The year the movie first came out in the box office
- **years_old**: How long has it been since the movie was released
- **genre**: Each movie is assigned one genre category like action, horror, comedy
- **avg_user_rating**: This rating is taken from Rotten Tomatoes, and is the average rating given to the movie by the audience
- **actor_1_facebook_likes**: The number of likes that the most popular actor in the movie has
- **cast_total_facebook_likes**: The sum of likes for the three most popular actors in the movie
- **language**: the original spoken language of the film


# Class answer

**Keep: budget, years_old, avg_user_rating, cast_total_facebook_likes, language**


**Drop: title_year (this is collinear with years_old, so we keep years_old), genre (too complicated and you can't go by alphabetical order), actor_1_facebook_likes (because it is highly correlated with cast_total_facebook_likes; you can drop either)**

# Your answer here
**Keep: budget, genre, actor_1_facebook_likes, cast_total_facebook_likes, language, title_year**


**Drop: avg_user_rating**

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      5024 non-null   object 
 1   director_name              4939 non-null   object 
 2   num_critic_for_reviews     4993 non-null   float64
 3   duration                   5028 non-null   float64
 4   director_facebook_likes    4939 non-null   float64
 5   actor_3_facebook_likes     5020 non-null   float64
 6   actor_2_name               5030 non-null   object 
 7   actor_1_facebook_likes     5036 non-null   float64
 8   gross                      4159 non-null   float64
 9   genres                     5043 non-null   object 
 10  actor_1_name               5036 non-null   object 
 11  movie_title                5043 non-null   object 
 12  num_voted_users            5043 non-null   int64  
 13  cast_total_facebook_likes  5043 non-null   int64

In [36]:
df3 = df.loc[:, ['gross','budget', 'actor_1_facebook_likes', 'cast_total_facebook_likes', 'language', 'title_year', 'content_rating']]
df3.head()

Unnamed: 0,gross,budget,actor_1_facebook_likes,cast_total_facebook_likes,language,title_year,content_rating
0,760505847.0,237000000.0,1000.0,4834,English,2009.0,PG-13
1,309404152.0,300000000.0,40000.0,48350,English,2007.0,PG-13
2,200074175.0,245000000.0,11000.0,11700,English,2015.0,PG-13
3,448130642.0,250000000.0,27000.0,106759,English,2012.0,PG-13
4,,,131.0,143,,,


## Question 4a

Create the following variables:

- `years_old`: The number of years since the film was released.
- Dummy categories for each of the following ratings:
    - `G`
    - `PG`
    - `R`
    
Once you have those variables, create a summary output for the following OLS model:

`gross ~ cast_total_facebook_likes + budget + years_old + G + PG + R`

In [31]:
from statsmodels.formula.api import ols #writing out formula

#from statsmodels.api import OLS #using x,y

In [32]:
# class answer here
df3['years_old'] = 2020 - df3.title_year
df3 = pd.get_dummies(df3, columns=['content_rating']).drop(columns='content_rating_PG-13') 
# Always drop one dummy column: it can be inferred from the others, and keeping it causes collinearity
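To see what the dummy encoding and the dropped baseline look like, here is a minimal sketch on a hypothetical four-row frame (the `toy` frame and its values are made up). It follows the same two steps as the class answer, with `dtype=int` added as an assumption so the indicators print as 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical mini-frame; the ratings and years are made up for illustration.
toy = pd.DataFrame({
    "title_year": [2009, 2012, 2001, 1995],
    "content_rating": ["PG-13", "R", "G", "PG"],
})

toy["years_old"] = 2020 - toy["title_year"]   # same reference year as the cell above

# One indicator column per rating, then drop PG-13 so it becomes the baseline.
dummies = pd.get_dummies(toy, columns=["content_rating"], dtype=int)
dummies = dummies.drop(columns="content_rating_PG-13")

print(dummies.columns.tolist())
print(dummies)
```

The PG-13 row ends up with zeros in every remaining rating column, which is why the other coefficients are read as differences from PG-13.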

In [34]:
df3.columns

Index(['budget', 'actor_1_facebook_likes', 'cast_total_facebook_likes',
       'language', 'title_year', 'years_old', 'content_rating_Approved',
       'content_rating_G', 'content_rating_GP', 'content_rating_M',
       'content_rating_NC-17', 'content_rating_Not Rated', 'content_rating_PG',
       'content_rating_Passed', 'content_rating_R', 'content_rating_TV-14',
       'content_rating_TV-G', 'content_rating_TV-MA', 'content_rating_TV-PG',
       'content_rating_TV-Y', 'content_rating_TV-Y7', 'content_rating_Unrated',
       'content_rating_X'],
      dtype='object')

In [37]:
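# The formula needs "~" between the target and the features, the dummy column names created above, and the data argument
lr_model = ols(formula='gross ~ cast_total_facebook_likes + budget + years_old + content_rating_G + content_rating_PG + content_rating_R', data=df3).fit()
lr_model.summary()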

## Question 4b

Below is the summary output you should have gotten above. Identify any key takeaways from it.
- How ‘good’ is this model?
- Which features help to explain the variance in the target variable? 
    - Which do not? 


<img src="ols_summary.png" style="width:300px;">

In [None]:
# Class answer
# R-squared is very low, so this is a poor model.
# There is no statistically significant relationship between the G rating and gross (based on the p-value).
# G, PG, and R are dummy variables, so each one is compared with the category we left out: PG-13.

In [None]:
# your answer here


## Question 5

**Bayes Theorem**

An advertising executive is studying television viewing habits of married men and women during prime time hours. Based on the past viewing records he has determined that during prime time wives are watching television 60% of the time. It has also been determined that when the wife is watching television, 40% of the time the husband is also watching. When the wife is not watching the television, 30% of the time the husband is watching the television. Find the probability that if the husband is watching the television, the wife is also watching the television.

In [None]:
# your answer here
'''
P(A) = Probability wife is watching tv
P(B) = Probability husband is watching tv
P(A|B) = Probability wife is watching tv given husband is
P(B|A) = Probability husband is watching tv given wife is
'''
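Using the definitions above, one way to work the problem is to get P(B) from the law of total probability and then apply Bayes' theorem. The variable names below are illustrative; the three input probabilities are the ones stated in the question.

```python
# Probabilities stated in the question
p_wife = 0.60                     # P(A)
p_husband_given_wife = 0.40       # P(B|A)
p_husband_given_not_wife = 0.30   # P(B|not A)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_husband = p_husband_given_wife * p_wife + p_husband_given_not_wife * (1 - p_wife)

# Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
p_wife_given_husband = p_husband_given_wife * p_wife / p_husband

print(round(p_husband, 2), round(p_wife_given_husband, 3))  # 0.36 0.667
```

So if the husband is watching, the wife is also watching about two thirds of the time.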

## Question 6

Explain what a Type I error is and how it relates to the significance level when doing a statistical test. 

In [None]:
# your answer here
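One way to explore this question is by simulation: if the null hypothesis is true in every experiment, the share of experiments that still reject it should be close to the significance level. The sketch below assumes normally distributed data and a one-sample t-test; the sample size, number of simulations, and seed are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_sims, n = 2000, 30

# Every sample is drawn with true mean 0, so H0: mu = 0 is true in all simulations.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n))
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue

# Any rejection here is a Type I error (a false positive).
type_1_rate = np.mean(p_values < alpha)
print(type_1_rate)
```

The printed rate should land near `alpha`, which is the sense in which the significance level is the Type I error rate we agree to tolerate.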


## Question 7

How is the confidence interval for a sample related to a one sample t-test?

In [None]:
#your answer here 
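A small numerical check can make the relationship concrete: a hypothesized mean lies inside the 95% t-based confidence interval exactly when the two-sided one-sample t-test fails to reject it at the 0.05 level. The sample values and the two hypothesized means below are made up for illustration.

```python
import numpy as np
from scipy import stats

sample = np.array([4.1, 5.3, 6.2, 5.8, 4.9, 6.5, 5.1, 5.7])  # made-up data
n = len(sample)
mean, se = sample.mean(), stats.sem(sample)

# 95% t-based confidence interval for the mean
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean - t_crit * se, mean + t_crit * se

for mu0 in (5.0, 7.0):
    p = stats.ttest_1samp(sample, popmean=mu0).pvalue
    inside = ci_low <= mu0 <= ci_high
    # mu0 is inside the 95% CI exactly when the test does not reject at 0.05
    print(mu0, round(p, 4), inside)
```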