# 3B: Perform
In this exercise, you will demonstrate your learning of inferential statistics with confidence intervals, bootstrapping, and hypothesis testing. Problems may involve a combination of math and code. 

Recall that you can use LaTeX to nicely format your math inside Markdown cells by enclosing equations in single dollar signs (e.g., $x^2+4=8$) for inline math or double dollar signs for centered equations like $$P(X > 5) = \frac{1}{6}.$$ For a reference if you are new to LaTeX, see the [Overleaf documentation for mathematical expressions](https://www.overleaf.com/learn/latex/mathematical_expressions). 

Show your work and/or briefly explain your answers. In general, you will not receive full credit for numeric answers with no accompanying work or justification (math, code, explanation). For numeric answers, we will accept answers that are very slightly off due to rounding, z score of 2 vs. 1.96, etc. 

When you finish, please go to Kernel --> Restart and Run All, and then double-check that your notebook looks correct before submitting your .ipynb file (the notebook file) on Gradescope.

In [2]:
# Run this code cell to import relevant libraries
import numpy as np
import pandas as pd
from scipy import stats

### Question 1

1. A website is trying to increase registration for first-time visitors, exposing a random subset of these visitors to a new site design. Of 752 randomly sampled visitors over a month who saw the new design, 64 registered. Construct a 95% confidence interval for the percentage of visitors who would register for the website under the new design using the normal distribution. Save your answer in a tuple `q1_1` with 2 `numpy.float64` items, where `q1_1[0]` is the left bound and `q1_1[1]` is the right bound. Use percentages for both bounds; for example, save 50.0 in your answer for 50% of visitors.
2. A study examined the average pay for a random sample of men and women entering the workforce as doctors for 21 different positions. If each gender was equally paid, then we would expect about half of those positions to have men paid more than women and women would be paid more than men in the other half of positions. In the study, men were, on average, paid more in 17 of the 21 positions. Complete a hypothesis test (two-sided or one-sided, just be clear which you are reporting) to examine whether there is significant evidence (at the 0.05 level) of gender discrimination in pay in these positions. Report your p-value and interpret the result. Save your p-value in `q1_2` as a `numpy.float64` and interpret it in the "Answer 1" cell.

In [3]:
# Code for question 1
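# Part 1: 95% confidence interval for a proportion (normal approximation)
conf_level1 = 0.95
n = 752
mu = 64/n                      # sample proportion of visitors who registered
sigma = np.sqrt(mu*(1-mu))     # standard deviation of a single Bernoulli outcome
# Pass the confidence level positionally: the keyword was renamed from `alpha` to `confidence` in newer SciPy releases
c1 = list(stats.norm.interval(conf_level1, loc = mu, scale = sigma/np.sqrt(n)))
c1[0] *= 100                   # convert both bounds to percentages
c1[1] *= 100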

q1_1 = tuple(c1)


# Part 2: two-sided test of H0: p = 0.5 (normal approximation)
mu2 = 0.5
sigma2 = np.sqrt(mu2*(1-mu2))
value2 = (17/21-mu2)/sigma2 * np.sqrt(21)   # z statistic
c2 = 2*(1-stats.norm.cdf(abs(value2)))      # two-sided p-value

q1_2 = c2

q1_2

0.004556349803185089

### Answer 1

We compare two hypotheses. The null hypothesis is that men are paid more than women in half of the 21 positions and women are paid more in the other half. The alternative hypothesis is that the proportion of positions in which men are paid more differs from one half.
Using the normal approximation, I found a two-sided p-value of about 0.0046. Because this is below the 0.05 significance level, we reject the null hypothesis: there is significant evidence of a gender difference in pay across these positions.
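With only 21 positions, the normal approximation used above is rough. As a cross-check, here is a minimal sketch of an exact binomial version of the same two-sided test, assuming that under the null hypothesis each position independently favors men with probability 0.5. The variable names are illustrative and are not part of the graded answer.

```python
from scipy import stats

n_positions = 21      # positions in the study
n_men_higher = 17     # positions where men were paid more
p_null = 0.5          # null hypothesis: either outcome is equally likely

# P(X >= 17) under Binomial(21, 0.5); sf(k) returns P(X > k)
upper_tail = stats.binom.sf(n_men_higher - 1, n_positions, p_null)

# The null distribution is symmetric, so doubling gives the two-sided p-value
exact_p_value = 2 * upper_tail
print(exact_p_value)
```

The exact p-value is about 0.007, slightly larger than the normal approximation but still below 0.05, so the conclusion does not change.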

<!-- END QUESTION -->



## Movie Ratings Data
In the remainder of this assignment you will work with the MovieLens dataset of movie ratings that we have seen before. Below we import and preview the data. It consists of 2 tables: `users` has a row for every individual who has rated any movies, and `movie_ratings` has a row for every rating of a particular movie by a particular user. This means users with multiple ratings appear in `movie_ratings` multiple times. The data is a random sample of all of the movie ratings made on the MovieLens service.

In [4]:
users = pd.read_csv("users.csv")
users.head()

Unnamed: 0,user_id,age,sex,occupation
0,1,24,M,technician
1,2,53,F,other
2,3,23,M,writer
3,4,24,M,technician
4,5,33,F,other


In [5]:
movie_ratings = pd.read_csv("movies-all.csv")
movie_ratings.head()

Unnamed: 0,user_id,age,sex,occupation,movie_id,rating,movie_title
0,1,24,M,technician,61,4,Three Colors: White (1994)
1,13,47,M,educator,61,4,Three Colors: White (1994)
2,18,35,F,other,61,4,Three Colors: White (1994)
3,58,27,M,programmer,61,5,Three Colors: White (1994)
4,59,49,M,educator,61,4,Three Colors: White (1994)


### Question 2
1. Compute a 95% confidence interval for the mean `age` of users using the normal distribution. Save your answer in a tuple `q2_1` with 2 `numpy.float64` items, where `q2_1[0]` is the left bound and `q2_1[1]` is the right bound.
2. Compute a 95% confidence interval for the mean `age` of users who have rated the movie `Casablanca (1942)` using the normal distribution. Save your answer in `q2_2`, with the same requirements as above.
3. Casablanca is an old movie, so one might suspect that the users who rated it are older on average than the users in the entire dataset. Just looking at the confidence intervals you computed in steps 1 and 2, can you conclude that there is significant evidence for this belief? Why or why not? Put your answer in the "Answer 2" cell.

In [6]:
# Code for question 2
# Keep one row per user so that each user's age is counted only once
newmovie = movie_ratings.drop_duplicates(subset="user_id")

#Part 1
alpha = 0.95
mu = newmovie["age"].mean()
sigma = np.std(newmovie["age"])
n = len(newmovie["age"])
c1 = list(stats.norm.interval(alpha, loc = mu, scale = sigma/np.sqrt(n)))

q2_1 = tuple(c1)



#Part 2

movies = movie_ratings[movie_ratings["movie_title"] == "Casablanca (1942)"]
mu1 = movies["age"].mean()
sigma1 = np.std(movies["age"])
n1= len(movies["age"])
c2 = list(stats.norm.interval(alpha, loc = mu1, scale = sigma1/np.sqrt(n1)))

q2_2 = tuple(c2)

q2_2


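#Part 3
# Two-sample t test from summary statistics (all users vs. Casablanca raters)
tester = stats.ttest_ind_from_stats(mean1 = mu, std1=sigma, nobs1 = n,
                                   mean2 = mu1, std2=sigma1, nobs2= n1)

# Halve the two-sided p-value for the one-sided alternative that Casablanca raters are older
pvalue = tester.pvalue/2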
q2_3 = pvalue

q2_3

0.01660840728074272

### Answer 2
The null hypothesis is that the average age of all users and the average age of users who rated Casablanca are equal. The alternative hypothesis is that the average age of Casablanca raters is higher. I looked at the intervals from Parts 1 and 2 and also ran a one-sided two-sample t test. The p-value was about 0.017, which is less than 0.05, so we reject the null hypothesis and conclude that there is significant evidence that users who rated Casablanca are older on average.
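Since the question asks about the confidence intervals themselves, a quick overlap check can accompany the test. The sketch below uses a hypothetical helper, `intervals_overlap`, and made-up interval bounds purely for illustration; substitute `q2_1` and `q2_2` from the cell above. Non-overlapping 95% intervals generally suggest a significant difference, while overlapping intervals are inconclusive on their own.

```python
def intervals_overlap(first, second):
    """Return True if two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]

# Made-up bounds for illustration only; not results from the ratings data
all_users_ci = (33.0, 35.0)
one_movie_ci = (35.5, 39.0)

print(intervals_overlap(all_users_ci, one_movie_ci))  # False
print(intervals_overlap((1.0, 3.0), (2.0, 4.0)))      # True
```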

<!-- END QUESTION -->



### Question 3
Only 18 users have rated the movie `Lost in Space (1998)`.
1. Use bootstrapping with 10,000 bootstrap resamples to compute a 95% confidence interval for the average `age` of users who have rated `Lost in Space (1998)`. Save your answer in a tuple `q3_1` with 2 `numpy.float64` items, where `q3_1[0]` is the left bound and `q3_1[1]` is the right bound.
2. One of the advantages of bootstrapping is that we can easily compute confidence intervals for arbitrary measurements of distributions. Use bootstrapping with 10,000 bootstrap resamples to compute a 95% confidence interval for the **median** `rating` of `Lost in Space (1998)`. Note that numpy provides a vectorized function for [calculating the median](https://numpy.org/doc/stable/reference/generated/numpy.median.html) as well as the mean. Save your answer in a tuple `q3_2` with 2 `numpy.float64` items, with the same requirements as above.

In [7]:
# Code for question 3

#Part 1

bootstraps = 10000   # the question asks for 10,000 bootstrap resamples
movies = movie_ratings[movie_ratings["movie_title"] == "Lost in Space (1998)"]["age"]
n = len(movies)
bootstrap_resample = np.random.choice(movies, size = (bootstraps, n), replace = True)
means = np.average(bootstrap_resample, axis =1)
bootstrapConfIntlower = np.percentile(means, 2.5)
bootstrapConfIntupper = np.percentile(means, 97.5)


q3_1 = (bootstrapConfIntlower,bootstrapConfIntupper)



#Part 2
ratings = movie_ratings[movie_ratings["movie_title"] == "Lost in Space (1998)"]["rating"]
n = len(ratings)
bootstrap_resample = np.random.choice(ratings, size = (bootstraps, n), replace = True)
medians = np.median(bootstrap_resample, axis =1)   # median of each resample
bootstrapConfIntlower = np.percentile(medians, 2.5)
bootstrapConfIntupper = np.percentile(medians, 97.5)

q3_2 = (bootstrapConfIntlower,bootstrapConfIntupper)

q3_2

(2.5, 4.0)
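Because the cell above draws resamples without a fixed seed, its bounds can change slightly from run to run. Below is a minimal sketch of a reproducible percentile bootstrap, assuming NumPy's `default_rng` generator. The function `bootstrap_ci`, the seed, and the sample ratings are hypothetical stand-ins, not the MovieLens data.

```python
import numpy as np

def bootstrap_ci(values, statistic, n_resamples=10_000, level=95, seed=0):
    """Percentile bootstrap confidence interval for `statistic` of `values`."""
    rng = np.random.default_rng(seed)          # fixed seed for repeatable results
    values = np.asarray(values)
    # Each row is one resample drawn with replacement, same size as the data
    resamples = rng.choice(values, size=(n_resamples, len(values)), replace=True)
    estimates = statistic(resamples, axis=1)   # one estimate per resample
    tail = (100 - level) / 2
    return tuple(np.percentile(estimates, [tail, 100 - tail]))

# Made-up ratings for 18 users, for illustration only
sample_ratings = [1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 2, 3, 4, 1]
print(bootstrap_ci(sample_ratings, np.median))
print(bootstrap_ci(sample_ratings, np.mean))
```

Passing the statistic as an argument lets the same helper serve both the mean and the median, which mirrors the two parts of this question.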

### Question 4
The `Star Wars (1977)` film is quite popular, with a median rating of 5 out of 5. However, male users gave it a slightly higher average rating of about 4.4 whereas female users gave the same movie an average rating of about 4.2.

1. Consider the null hypothesis that the average rating of `Star Wars (1977)` is the same for `sex='F'` and `sex='M'` users. The alternative hypothesis is that the average ratings are not equal. Conduct a two-sided t test using [`stats.ttest_ind`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) to evaluate this using the sample ratings data. Report your p-value and interpret it at a significance level of 0.05. Save your p-value in `q4_1` as a `numpy.float64` and interpret it in the "Answer 4" cell.

2. About 51% of female users rated `Star Wars (1977)` a `5` (the highest rating). Consider the null hypothesis that 51% of male users rate `Star Wars (1977)` a `5`. Conduct a two-sided hypothesis test using [`stats.t.cdf`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html) to evaluate this in light of the sample ratings data of male users who rated `Star Wars (1977)`. Report your p-value and interpret it at a significance level of 0.05. Save your p-value in `q4_2` as a `numpy.float64` and interpret it in the "Answer 4" cell.

3. Consider the null hypothesis that female and male users are equally likely to rate `Star Wars (1977)` a `5`. Conduct a two-sided t test using [`stats.ttest_ind_from_stats`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind_from_stats.html) to evaluate this in light of the sample data of female and male users who rated `Star Wars (1977)`. Report your p-value and interpret it at a significance level of 0.05. Save your p-value in `q4_3` as a `numpy.float64` and interpret it in the "Answer 4" cell. You should observe a different p-value than in step 2 despite the hypotheses under consideration being ostensibly similar. Briefly explain why you observe this difference.

In [9]:
# Code for question 4

#Part 1
maleRatings = movie_ratings[(movie_ratings["movie_title"]=="Star Wars (1977)") & (movie_ratings["sex"] == "M")]["rating"]
femaleRatings = movie_ratings[(movie_ratings["movie_title"]=="Star Wars (1977)") & (movie_ratings["sex"] == "F")]["rating"]
maleAverage = maleRatings.mean()
femaleAverage = femaleRatings.mean()
result = stats.ttest_ind(maleRatings,femaleRatings)

pvalue1 = list(result)[1]
q4_1 = pvalue1


#Part 2
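averagefivestar= len(maleRatings[maleRatings==5])/len(maleRatings)   # share of male raters giving a 5
length = len(maleRatings)
decimal = 0.51                                   # proportion under the null hypothesis
stderr = np.sqrt(decimal*(1-decimal)/length)     # standard error under the null
scores = (averagefivestar-decimal)/stderr        # test statistic
# abs() keeps the two-sided p-value valid whichever direction the statistic falls
value = (1- stats.t.cdf(abs(scores),df=length-1))*2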

q4_2 = value


#Part 3 (p-value differs from Part 2: about 0.17)
sigmaMale = np.sqrt(averagefivestar*(1-averagefivestar))
mu_fem = len(femaleRatings[femaleRatings==5])/len(femaleRatings)

sigmafemale = np.sqrt(mu_fem*(1-mu_fem))
result1 = stats.ttest_ind_from_stats(mean1 = averagefivestar, std1= sigmaMale, nobs1 = len(maleRatings),
                                   mean2 = mu_fem, std2 = sigmafemale, nobs2= len(femaleRatings))

pvalue3 = list(result1)[1]
q4_3 = pvalue3

q4_3

0.1717837459146108

### Answer 4

Part 1 Interpretation:
The null hypothesis is that the average rating of Star Wars is the same for both sexes. The alternative hypothesis is that the average ratings are not equal. The test gave a p-value of about 0.06, which is higher than 0.05, so we fail to reject the null hypothesis: there is not significant evidence that the average rating of Star Wars differs between the sexes.


Part 2 Interpretation:
The null hypothesis is that 51% of male users rate Star Wars a 5. The alternative hypothesis is that the proportion is not 51%. The test gave a p-value of about 0.02, which is lower than 0.05, so we reject the null hypothesis and conclude that the percentage of male users who rate Star Wars a 5 is not equal to 51%.

Part 3 Interpretation:
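The null hypothesis is that female and male users are equally likely to rate Star Wars a 5. The alternative hypothesis is that the two proportions are not equal. The test gave a p-value of about 0.17, which is higher than 0.05, so we fail to reject the null hypothesis: there is not significant evidence that the proportion of 5 ratings differs by sex. This p-value is larger than in Part 2 because Part 2 treats 51% as a fixed, known value, whereas Part 3 treats the female proportion as an estimate from a sample with its own sampling variability, which increases the standard error of the difference.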
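To see why Parts 2 and 3 give different p-values, it helps to compare the two standard errors directly. The sketch below uses made-up proportions and sample sizes (not the values from the data) and the usual large-sample formulas. The unpooled two-sample standard error shown here is a simplification and is not exactly the pooled formula that `stats.ttest_ind_from_stats` uses by default.

```python
import numpy as np

# Hypothetical numbers for illustration only
p_male, n_male = 0.57, 400
p_female, n_female = 0.51, 150

# Part 2 style: the female proportion is treated as a fixed constant,
# so only the male sample contributes sampling error
se_one_sample = np.sqrt(p_female * (1 - p_female) / n_male)

# Part 3 style: both proportions are estimates, so both contribute
se_two_sample = np.sqrt(p_male * (1 - p_male) / n_male
                        + p_female * (1 - p_female) / n_female)

difference = p_male - p_female
print(difference / se_one_sample)  # larger test statistic
print(difference / se_two_sample)  # smaller test statistic, larger p-value
```

The same observed difference is divided by a larger standard error in the two-sample case, so the evidence against the null hypothesis looks weaker.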
