# HW10 Problem 7 Implementation

You are not allowed to use a library function that directly calls for a statistical test. For this question, use $\alpha = 0.05$.


(a) Recall that a one-sample $z$-test is calculated by 
$$z = \frac{x - \mu}{\sigma / \sqrt{n}},$$
where $x$ is the sample mean, $\mu$ is the population mean, $\sigma$ is the population standard deviation, and $n$ is the sample size.
    
Write a function $one\_sample\_z\_test$ that takes in the sample and the population parameters (both as lists of numbers) and produces the $z$-score above.


In [1]:
import numpy as np

def one_sample_z_test(sample, params):
    """
    Performs a one-sample z-test and returns the z-score.
    
    :param sample: List of sample values.
    :param params: List of parameters [mu, sigma]
    :return: z-score
    """
    sample_mean = np.mean(sample)
    sample_size = len(sample)
    z_score = (sample_mean - params[0]) / (params[1] / np.sqrt(sample_size))
    return z_score

# Example usage:
sample = [50, 52, 53, 54, 55]  # Replace with your sample data
mu = 50  # Replace with the population mean
sigma = 4  # Replace with the population standard deviation
params = [mu, sigma]

z_score = one_sample_z_test(sample, params)
print(f"The example z-score is: {z_score}")

The example z-score is: 1.5652475842498512
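
A z-score only becomes a decision once it is compared against $\alpha$. As a minimal sketch, the two-sided p-value can be computed from the standard normal tail using only `math.erfc` from the standard library, which is a special function rather than a statistical test. The helper name `two_sided_p_value` is an assumption introduced here for illustration.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a z-score under the standard normal distribution."""
    # P(|Z| >= |z|) = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# z-score from the example above
p_value = two_sided_p_value(1.5652475842498512)
print(f"p-value: {p_value:.4f}, reject H0 at alpha = 0.05: {p_value < 0.05}")
```

For the example above the p-value is roughly 0.12, so the null hypothesis would not be rejected at $\alpha = 0.05$.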


(b) Recall that a two-sample $z$-test is calculated by:
$$z = \frac{1}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \left\{(x_1 - x_2) - (\mu_1 - \mu_2)\right\},$$
where $x_1, \mu_1, \sigma_1$ and $n_1$ are the sample mean, population mean, population standard deviation, and sample size of the first group, and similarly for $x_2, \mu_2, \sigma_2, n_2$.
    
Write a function $two\_sample\_z\_test$ that takes in the two samples and the two populations' parameters (all as lists of numbers) and produces the $z$-score above.

In [2]:
def two_sample_z_test(sample1, sample2, population1_params, population2_params):
    """
    Performs a two-sample z-test and returns the z-score.
    
    :param sample1: List of sample values for the first group.
    :param sample2: List of sample values for the second group.
    :param population1_params: Sequence (mu1, sigma1): mean and standard deviation of the first population.
    :param population2_params: Sequence (mu2, sigma2): mean and standard deviation of the second population.
    :return: z-score
    """
    x1, mu1, sigma1, n1 = np.mean(sample1), *population1_params, len(sample1)
    x2, mu2, sigma2, n2 = np.mean(sample2), *population2_params, len(sample2)

    z_score = ((x1 - x2) - (mu1 - mu2)) / np.sqrt(sigma1**2/n1 + sigma2**2/n2)
    return z_score

# Example usage:
sample1 = [60, 62, 61, 63, 64]  # Replace with your first sample data
sample2 = [55, 57, 56, 58, 59]  # Replace with your second sample data

population1_params = (62, 5)  # Replace with the mean and std of the first population
population2_params = (58, 4)  # Replace with the mean and std of the second population

z_score = two_sample_z_test(sample1, sample2, population1_params, population2_params)
print(f"The example z-score is: {z_score}")

The example z-score is: 0.34921514788478913
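
As a sanity check, the example value can be reproduced by hand from the formula in (b). The sample means are $x_1 = 62$ and $x_2 = 57$, with $n_1 = n_2 = 5$, $\mu_1 = 62$, $\mu_2 = 58$, $\sigma_1 = 5$ and $\sigma_2 = 4$:

$$z = \frac{(62 - 57) - (62 - 58)}{\sqrt{\frac{5^2}{5} + \frac{4^2}{5}}} = \frac{1}{\sqrt{8.2}} \approx 0.349.$$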


(c) Using the functions from above, and the IMDB dataset at the following [link](https://stats-lab-data.surge.sh/IMDB-Movie-Data.csv), test the following hypothesis (as a function of $\mu$):

- $H_{Null}$: The population mean of the Metascore ratings of movies released in the year 2016 is $\mu$.
- $H_{Alternate}$: The population mean of the Metascore ratings of movies released in the year 2016 is different from $\mu$. 

Keep in mind the assumptions of the z-test, and make sure to justify why you can use it with regard to this data. Further, for what values of $\mu$ would the null hypothesis be rejected?


In [5]:
import pandas as pd

alpha = 0.05

# Load the dataset
df = pd.read_csv('https://stats-lab-data.surge.sh/IMDB-Movie-Data.csv')
display(df)

movies_2016 = df[df['Year'] == 2016]
print(movies_2016.shape)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,Secret in Their Eyes,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,,45.0
996,997,Hostel: Part II,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46.0
997,998,Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50.0
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0


(297, 12)


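With 297 rows for movies released in 2016, the sample is large enough that, by the central limit theorem, the sample mean is approximately normally distributed even if individual Metascore ratings are not, so a z-test is reasonable here. The population standard deviation is not known exactly; it is approximated by the standard deviation of the Metascore column over the full dataset.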

In [13]:
# The true population standard deviation is unknown; approximate it with the
# standard deviation of Metascore over the full dataset (pandas skips NaNs)
population_std = df['Metascore'].std()

# Perform the one-sample z-test for every integer mu
# in the observed Metascore range
mu_values = range(int(df['Metascore'].min()), int(df['Metascore'].max()) + 1)
alpha = 0.05
z_critical = 1.96  # Approximate critical z-value for a two-tailed test with alpha = 0.05

rejected_mu_values = []
mu_not_rejected = [] 
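for mu in mu_values:
    # Note: len() inside one_sample_z_test also counts rows with a missing Metascore;
    # passing movies_2016['Metascore'].dropna() would count only observed values
    z_score = one_sample_z_test(movies_2016['Metascore'], [mu, population_std])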
    if abs(z_score) > z_critical:
        rejected_mu_values.append(mu)
    else:
        mu_not_rejected.append(mu)

print(f"Values of mu for which the null hypothesis is not rejected:\n {mu_not_rejected}")
print(f"Values of mu for which the null hypothesis is rejected: \n {rejected_mu_values}")

Values of mu for which the null hypothesis is not rejected:
 [57, 58, 59, 60]
Values of mu for which the null hypothesis is rejected: 
 [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
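
The integer scan above can also be expressed in closed form: the null hypothesis is not rejected exactly when $|z| \le z_{crit}$, that is, when $\mu$ lies in $[x - z_{crit}\,\sigma/\sqrt{n},\ x + z_{crit}\,\sigma/\sqrt{n}]$, and it is rejected for every $\mu$ outside that interval. A minimal sketch, assuming a hypothetical helper `non_rejection_interval` and using `statistics.NormalDist` from the standard library only to look up the critical value:

```python
import math
from statistics import NormalDist

def non_rejection_interval(sample, sigma, alpha=0.05):
    """Return (low, high): the values of mu for which a two-sided z-test does not reject H0."""
    # Drop missing values so that n counts only observed data points
    values = [float(v) for v in sample if not math.isnan(float(v))]
    n = len(values)
    mean = sum(values) / n
    # Two-sided critical value, about 1.96 for alpha = 0.05
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * sigma / math.sqrt(n)
    return mean - margin, mean + margin

# Toy data from part (a), not the IMDB sample
low, high = non_rejection_interval([50, 52, 53, 54, 55], sigma=4)
print(f"H0 is not rejected for mu in [{low:.2f}, {high:.2f}]")
```

With the notebook's data this would be called as `non_rejection_interval(movies_2016['Metascore'], population_std)`. Note that this sketch drops missing values before counting $n$, so its interval can be slightly wider than the one implied by the scan.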


(d) Additionally, test the following hypothesis:
- $H_{Null}$ : There is no difference in the population mean of the Metascore ratings of movies released in the year 2015 and 2007.
- $H_{Alternate}$: There is a significant difference in the population mean of the Metascore ratings of movies released in the year 2015 and 2007.

Is it possible to test this hypothesis with a z-test? Can you test the following hypothesis with a two sample z-test? Is there a better statistical test that we can use that doesn't depend on the distribution of the data?

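**Answer:** Yes. Both samples are large (53 rows for 2007 and 127 rows for 2015), so by the central limit theorem the difference in sample means is approximately normal and a two-sample z-test is reasonable, with the sample standard deviations standing in for the unknown population standard deviations. Because the population standard deviations are unknown, a two-sample t-test would also be appropriate. A test that does not depend on the distribution of the data would be a nonparametric one, such as a permutation test or the Mann-Whitney U test. Below we conduct only the two-sample z-test: $|z| \approx 2.49$ exceeds the critical value 1.96, so we reject the null hypothesis at $\alpha = 0.05$.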

In [15]:
movies_2007 = df[df['Year'] == 2007]
print(movies_2007.shape)

movies_2015 = df[df['Year'] == 2015]
print(movies_2015.shape)

(53, 12)
(127, 12)


In [18]:
metascores_2015 = df[df['Year'] == 2015]['Metascore'].dropna()
metascores_2007 = df[df['Year'] == 2007]['Metascore'].dropna()

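# Under H0 the two population means are equal, so mu1 - mu2 = 0;
# any common value works, and the overall dataset mean is used here
population_mean = df['Metascore'].mean()
mu1, mu2 = population_mean, population_mean
# Sample standard deviations serve as estimates of the unknown population values
std_2015, std_2007 = metascores_2015.std(), metascores_2007.std()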

z_score = two_sample_z_test(metascores_2015, metascores_2007, [mu1, std_2015], [mu2, std_2007])
# Two-tailed test: compare the absolute value of z with the critical value
print(f'The test z-score is {z_score}, whether to reject H0: {abs(z_score) > z_critical}')

The test z-score is -2.49389679875067, whether to reject H0: True
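
One distribution-free option for the last question in (d) is a permutation test on the difference in means. The sketch below is an illustration under assumptions, not part of the solution above: the helper name `permutation_test_mean_diff`, the number of permutations, and the seed are all choices made here.

```python
import numpy as np

def permutation_test_mean_diff(sample1, sample2, n_permutations=10_000, seed=0):
    """Estimate a two-sided permutation p-value for a difference in means."""
    a = np.asarray(sample1, dtype=float)
    b = np.asarray(sample2, dtype=float)
    # Remove missing values before pooling
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        # Re-split the shuffled pool into groups of the original sizes
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return (count + 1) / (n_permutations + 1)

# Toy data from part (b), not the IMDB sample
p_value = permutation_test_mean_diff([60, 62, 61, 63, 64], [55, 57, 56, 58, 59])
print(f"Permutation p-value: {p_value:.4f}")
```

With the notebook's data this would be called as `permutation_test_mean_diff(metascores_2015, metascores_2007)`, rejecting $H_{Null}$ when the returned p-value is below $\alpha$.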
