## Statistics Advance 7

### Question 1

Q1. Write a Python function that takes in two arrays of data and calculates the F-value for a variance ratio
test. The function should return the F-value and the corresponding p-value for the test.



In [3]:

import numpy as np
from scipy.stats import f

def f_test(data1, data2):
    # Calculate the sample variances (ddof=1) of the two data sets
    var1 = np.var(data1, ddof=1)
    var2 = np.var(data2, ddof=1)

    # Put the larger variance in the numerator so that F >= 1 (right-tail test),
    # and keep each degrees-of-freedom value with its own sample
    if var1 >= var2:
        f_value, dfn, dfd = var1 / var2, len(data1) - 1, len(data2) - 1
    else:
        f_value, dfn, dfd = var2 / var1, len(data2) - 1, len(data1) - 1

    # Calculate the one-tailed (upper-tail) p-value using the F-distribution
    p_value = 1 - f.cdf(f_value, dfn, dfd)

    return f_value, p_value


In [4]:
# Example usage with random data
import numpy as np
np.random.seed(0)
data1 = np.random.normal(0, 1, 100)
data2 = np.random.normal(0, 2, 100)

f_value, p_value = f_test(data1, data2)
print(f'F-value: {f_value:.4f}')
print(f'p-value: {p_value:.4f}')

F-value: 4.2154
p-value: 0.0000
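
Because the statistic always places the larger variance in the numerator, the one-tailed p-value returned above is half of the usual two-sided p-value. The following is a minimal simulation sketch of that point, assuming equal sample sizes and normally distributed data with equal true variances; the name `simulate_rejection_rate` and its parameters are illustrative and not part of the original notebook. Under these assumptions, rejecting whenever the one-tailed p-value falls below 0.05 should happen in roughly 10% of simulations rather than 5%.

```python
import numpy as np
from scipy.stats import f


def simulate_rejection_rate(n=20, n_sims=20000, alpha=0.05, seed=0):
    """Estimate how often the max/min F with a one-tailed p-value rejects under H0."""
    rng = np.random.default_rng(seed)
    # Two independent samples per simulation, both with the same true variance
    x = rng.normal(0, 1, size=(n_sims, n))
    y = rng.normal(0, 1, size=(n_sims, n))
    var_x = x.var(axis=1, ddof=1)
    var_y = y.var(axis=1, ddof=1)
    # Larger variance in the numerator, as in f_test above
    f_values = np.maximum(var_x, var_y) / np.minimum(var_x, var_y)
    # Equal sample sizes, so both degrees of freedom are n - 1
    p_one_tailed = f.sf(f_values, n - 1, n - 1)
    return float(np.mean(p_one_tailed < alpha))


rate = simulate_rejection_rate()
print(f"Rejection rate at alpha = 0.05: {rate:.3f}")
```

If a two-sided conclusion is wanted, doubling the one-tailed p-value (as done in Questions 8 and 9) restores the nominal significance level.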


### Question 2

Q2. Given a significance level of 0.05 and the degrees of freedom for the numerator and denominator of an F-distribution, write a Python function that returns the critical F-value for a two-tailed test.

In [5]:
from scipy.stats import f

def f_critical_value(df1, df2, alpha=0.05):
    # Upper critical value; alpha is split equally between the two tails
    f_critical = f.ppf(1 - alpha/2, df1, df2)
    return f_critical

# Example usage with hypothetical degrees of freedom
df1 = 10
df2 = 20
f_critical = f_critical_value(df1, df2)
print(f'Critical F-value: {f_critical:.4f}')


Critical F-value: 2.7737
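
A two-tailed test also has a lower critical value. The sketch below, using a hypothetical helper named `f_critical_values`, returns both bounds and checks the standard relation that the lower bound equals the reciprocal of the upper bound with the degrees of freedom swapped. The degrees of freedom are the same hypothetical ones used above.

```python
from scipy.stats import f


def f_critical_values(df1, df2, alpha=0.05):
    """Return (lower, upper) critical values for a two-tailed F-test."""
    lower = f.ppf(alpha / 2, df1, df2)
    upper = f.ppf(1 - alpha / 2, df1, df2)
    return lower, upper


lower, upper = f_critical_values(10, 20)

# The lower bound equals 1 / (upper bound with the degrees of freedom swapped)
reciprocal = 1 / f.ppf(1 - 0.05 / 2, 20, 10)

print(f"Lower critical F-value: {lower:.4f}")
print(f"Upper critical F-value: {upper:.4f}")
print(f"1 / F(0.975; 20, 10):   {reciprocal:.4f}")
```

The null hypothesis is rejected when the F-value falls below the lower bound or above the upper bound. When the larger variance is always placed in the numerator, only the upper bound is needed.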


### Question 3

Q3. Write a Python program that generates random samples from two normal distributions with known variances and uses an F-test to determine if the variances are equal. The program should output the F-value, degrees of freedom, and p-value for the test.


__Explanation for the Code__

`scipy.stats.f` is SciPy's F-distribution object; its `cdf` method evaluates the cumulative distribution function (CDF) of the F-distribution.

The F-value is the ratio of the two sample variances, with the larger variance in the numerator and the smaller in the denominator. The degrees of freedom are n1-1 and n2-1, where n1 and n2 are the lengths of the input arrays; the numerator degrees of freedom must belong to the sample with the larger variance. The p-value is then the upper-tail probability, 1 - CDF(F-value).

In [6]:
from scipy.stats import f
import numpy as np

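def f_test(array1, array2):
    var1 = np.var(array1, ddof=1)
    var2 = np.var(array2, ddof=1)
    # Larger variance goes in the numerator; its sample supplies the numerator df
    if var1 >= var2:
        f_value, dfn, dfd = var1 / var2, len(array1) - 1, len(array2) - 1
    else:
        f_value, dfn, dfd = var2 / var1, len(array2) - 1, len(array1) - 1
    # One-tailed (upper-tail) p-value
    p_value = 1 - f.cdf(f_value, dfn, dfd)
    return f_value, dfn, dfd, p_value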




In [9]:
# Example usage with random data
np.random.seed(0)
data1 = np.random.normal(0, 1, 100)
data2 = np.random.normal(0, 2, 100)

f_value, df1, df2, p_value = f_test(data1, data2)
print(f'F-value: {f_value:.4f}')
print(f'Degrees of freedom: {df1}, {df2}')
print(f'p-value: {p_value:.20f}')

F-value: 4.2154
Degrees of freedom: 99, 99
p-value: 0.00000000000335131922
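
In the example above both samples have 100 observations, so the order of the degrees of freedom does not matter. With unequal sample sizes it does: the numerator degrees of freedom must come from the sample whose variance is in the numerator. The sketch below illustrates this with a hypothetical function `f_test_ordered` and small made-up arrays; neither is taken from the original notebook.

```python
import numpy as np
from scipy.stats import f


def f_test_ordered(sample_a, sample_b):
    """One-tailed variance-ratio test with df matched to the numerator."""
    var_a = np.var(sample_a, ddof=1)
    var_b = np.var(sample_b, ddof=1)
    df_a = len(sample_a) - 1
    df_b = len(sample_b) - 1
    # The sample with the larger variance supplies the numerator df
    if var_a >= var_b:
        f_value, dfn, dfd = var_a / var_b, df_a, df_b
    else:
        f_value, dfn, dfd = var_b / var_a, df_b, df_a
    p_value = f.sf(f_value, dfn, dfd)
    return f_value, dfn, dfd, p_value


# Made-up data: 10 tightly clustered values and 4 widely spread values
tight = np.array([10, 11, 9, 10, 12, 10, 9, 11, 10, 10])
spread = np.array([5, 15, 2, 18])

result_1 = f_test_ordered(tight, spread)
result_2 = f_test_ordered(spread, tight)
print(f"F = {result_1[0]:.4f}, df = ({result_1[1]}, {result_1[2]}), p = {result_1[3]:.4g}")
print("Same result regardless of argument order:", result_1 == result_2)
```

Here the numerator degrees of freedom are 3 (from the 4-value sample) and the denominator degrees of freedom are 9, whichever order the arrays are passed in.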


### Question 4

Q4. The variances of two populations are known to be 10 and 15. A sample of 12 observations is taken from each population. Conduct an F-test at the 5% significance level to determine if the variances are significantly different.

In [15]:
from scipy.stats import f

# Known population variances
var1 = 10
var2 = 15

# Sample sizes
n1 = 12
n2 = 12

# Calculate F-value
f_value = var1 / var2

# Calculate degrees of freedom
df1 = n1 - 1
df2 = n2 - 1

# Calculate p-value
# Two-tailed test: take the smaller of the two tail areas and double it
p_value = 2 * min(f.cdf(f_value, df1, df2), 1 - f.cdf(f_value, df1, df2))

# Significance level
alpha = 0.05

print(f'F-value: {f_value:.4f}')
print(f'Degrees of freedom: {df1}, {df2}')
print(f'p-value: {p_value:.4f}')

if p_value < alpha:
    print('The variances are significantly different at the 5% significance level.')
else:
    print('The variances are not significantly different at the 5% significance level.')

F-value: 0.6667
Degrees of freedom: 11, 11
p-value: 0.5124
The variances are not significantly different at the 5% significance level.


### Question 5

Q5. A manufacturer claims that the variance of the diameter of a certain product is 0.005. A sample of 25 products is taken, and the sample variance is found to be 0.006. Conduct an F-test at the 1% significance level to determine if the claim is justified.


Hypothesis:

H0: The population variance of the diameter of the product is 0.005

Ha: The population variance of the diameter of the product is greater than 0.005

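Under the null hypothesis, the statistic F = s² / σ₀² follows an F-distribution with degrees of freedom (df1 = n - 1, df2 = ∞). The denominator degrees of freedom are infinite because the claimed variance σ₀² = 0.005 is treated as a known population value rather than an estimate from a sample; only the numerator comes from a sample of known size, n = 25. Equivalently, (n - 1) * s² / σ₀² follows a chi-square distribution with n - 1 degrees of freedom.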

In [37]:
import scipy.stats as stats

# Given data

n = 25
sample_var = 0.006
pop_var = 0.005
alpha = 0.01

# Calculate the F-statistic
f_stat = sample_var / pop_var

# Calculate the p-value
# Note: with an infinite denominator df this call did not return a usable value
# (the printed 1.0 is not a valid upper-tail probability for F = 1.2)
p_value = 1 - stats.f.cdf(f_stat, n-1, float('inf'))

# Calculate the critical value
# critical_value = stats.f.ppf(1-alpha, n-1, float('inf'))  # did not give a usable result

# Critical value taken manually from tables instead:
# F(0.01; 24, ∞) = chi-square(0.99; 24) / 24 = 42.980 / 24 ≈ 1.791
df1 = 24
df2 = "∞"  # infinite, because the claimed variance is treated as known
critical_value = 1.791

# Print the results
print("F-statistic:", f_stat)
print("p-value:", p_value)
print("Critical value:", critical_value)

if f_stat > critical_value:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

F-statistic: 1.2
p-value: 1.0
Critical value: 1.791
Fail to reject null hypothesis


In [36]:
### Another way to carry out this analysis
import numpy as np
from scipy.stats import f

# Given data
n = 25
sample_var = 0.006
pop_var = 0.005
alpha = 0.01

# Calculate the F-statistic
F = sample_var / pop_var

# Calculate the critical value for the F-distribution
df1 = n - 1
df2 = np.inf
# crit_val = f.ppf(q=1-alpha, dfn=df1, dfd=df2)  # gave nan with an infinite dfd

# Taken manually from the F-distribution table instead:
# F(0.01; 24, ∞) = chi-square(0.99; 24) / 24 = 42.980 / 24 ≈ 1.791
critical_value = 1.791

# Calculate the p-value for the F-statistic
# Note: with dfd = inf this call did not return a usable value
# (the printed 1.0 is not a valid upper-tail probability for F = 1.2)
p_val = 1 - f.cdf(F, dfn=df1, dfd=df2)

# Print the results
print("F-statistic:", F)
print("Critical value:", critical_value)
print("p-value:", p_val)

# Test the hypothesis at the given significance level
if F > critical_value:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")


F-statistic: 1.2
Critical value: 1.791
p-value: 1.0
Fail to reject null hypothesis
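
An F-statistic with infinite denominator degrees of freedom is equivalent to a chi-square statistic: (n - 1) * s² / σ₀² follows a chi-square distribution with n - 1 degrees of freedom under the null hypothesis. The sketch below reworks the same one-sided test with `scipy.stats.chi2`, which avoids passing an infinite value to the F-distribution. It is offered as one possible way to obtain a usable p-value and critical value, assuming a normally distributed population; the variable names are illustrative.

```python
from scipy.stats import chi2

# Given data
n = 25
sample_var = 0.006
claimed_var = 0.005
alpha = 0.01

# Chi-square statistic for a single variance: (n - 1) * s^2 / sigma0^2
df = n - 1
chi2_stat = df * sample_var / claimed_var

# Upper-tail p-value and critical value (Ha: variance > claimed value)
p_value = chi2.sf(chi2_stat, df)
chi2_critical = chi2.ppf(1 - alpha, df)

# Equivalent critical value on the F = s^2 / sigma0^2 scale
f_critical = chi2_critical / df

print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Chi-square critical value: {chi2_critical:.4f}")
print(f"Equivalent F critical value: {f_critical:.4f}")
print("Reject null hypothesis" if chi2_stat > chi2_critical else "Fail to reject null hypothesis")
```

The statistic (28.8) is well below the chi-square critical value (about 42.98), so the conclusion is unchanged: the null hypothesis is not rejected at the 1% level.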


### Question 6

Q6. Write a Python function that takes in the degrees of freedom for the numerator and denominator of an F-distribution and calculates the mean and variance of the distribution. The function should return the mean and variance as a tuple.

Mean = dfd / (dfd - 2), defined for dfd > 2

Variance = (2 * dfd^2 * (dfn + dfd - 2)) / (dfn * (dfd - 2)^2 * (dfd - 4)), defined for dfd > 4


In [60]:
def f_dist_mean_var(dfn, dfd):
    if dfn <= 0 or dfd <= 0:
        raise ValueError("Degrees of freedom must be positive")

    # The mean has a finite value only for dfd > 2; otherwise report nan
    if dfd > 2:
        mean = dfd / (dfd - 2)
    else:
        mean = float('nan')

    # The variance has a finite value only for dfd > 4; otherwise report nan
    # (this also avoids a division by zero at dfd = 4)
    if dfd > 4:
        variance = (2 * dfd**2 * (dfn + dfd - 2)) / (dfn * (dfd - 2)**2 * (dfd - 4))
    else:
        variance = float('nan')

    return mean, variance


## Example
f_dist_mean_var(12,18)

(1.125, 0.421875)
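
As a quick cross-check of the formulas, the same moments can be requested from SciPy's F-distribution object through its `stats` method. This is only a verification sketch for the example degrees of freedom used above.

```python
from scipy.stats import f

dfn, dfd = 12, 18

# moments="mv" asks for the mean and the variance
mean, variance = f.stats(dfn, dfd, moments="mv")
mean, variance = float(mean), float(variance)

print(mean, variance)
```

Both values should agree with the hand-coded formulas (1.125 and 0.421875) up to floating-point rounding.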

### Question 7

Q7. A random sample of 10 measurements is taken from a normal population with unknown variance. The sample variance is found to be 25. Another random sample of 15 measurements is taken from another normal population with unknown variance, and the sample variance is found to be 20. Conduct an F-test
at the 10% significance level to determine if the variances are significantly different.

In [42]:
import numpy as np
from scipy.stats import f

# Define the sample variances and sample sizes
s1_sq = 25
s2_sq = 20
n1 = 10
n2 = 15

# Calculate the test statistic
F = s1_sq / s2_sq

# Calculate the p-value
p_value = 2 * min(f.cdf(F, n1-1, n2-1), 1 - f.cdf(F, n1-1, n2-1))

# Compare the p-value to the significance level
alpha = 0.10
if p_value < alpha:
    print("Reject the null hypothesis and conclude that the variances are significantly different.")
else:
    print("Fail to reject the null hypothesis and conclude that there is not enough evidence to suggest that the variances are different.")


Fail to reject the null hypothesis and conclude that there is not enough evidence to suggest that the variances are different.
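
The cell above prints only the conclusion. A small helper that works directly from summary statistics can also report the F-value, degrees of freedom, and p-value. The function name `f_test_from_summary` is hypothetical; the calculation is the same two-tailed one used in Questions 4 and 7.

```python
from scipy.stats import f


def f_test_from_summary(s1_sq, n1, s2_sq, n2):
    """Two-tailed F-test from two sample variances and their sample sizes."""
    f_value = s1_sq / s2_sq
    dfn, dfd = n1 - 1, n2 - 1
    lower_tail = f.cdf(f_value, dfn, dfd)
    upper_tail = f.sf(f_value, dfn, dfd)
    # Double the smaller tail area for a two-tailed p-value
    p_value = 2 * min(lower_tail, upper_tail)
    return f_value, dfn, dfd, p_value


# Question 7: s1^2 = 25 (n1 = 10), s2^2 = 20 (n2 = 15)
f_value, dfn, dfd, p_value = f_test_from_summary(25, 10, 20, 15)
print(f"F-value: {f_value:.4f}")
print(f"Degrees of freedom: {dfn}, {dfd}")
print(f"p-value: {p_value:.4f}")

# Question 4: variances 10 and 15, 12 observations each
q4_result = f_test_from_summary(10, 12, 15, 12)
print(f"Question 4 p-value: {q4_result[3]:.4f}")
```

For Question 7 the p-value is above 0.10, which matches the conclusion printed above.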


### Question 8

Q8. The following data represent the waiting times in minutes at two different restaurants on a Saturday night: Restaurant A: 24, 25, 28, 23, 22, 20, 27; Restaurant B: 31, 33, 35, 30, 32, 36. Conduct an F-test at the 5%
significance level to determine if the variances are significantly different.

In [63]:
from scipy.stats import f
import numpy as np

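def f_test(array1, array2):
    var1 = np.var(array1, ddof=1)
    var2 = np.var(array2, ddof=1)
    # Larger variance goes in the numerator; its sample supplies the numerator df
    if var1 >= var2:
        f_value, dfn, dfd = var1 / var2, len(array1) - 1, len(array2) - 1
    else:
        f_value, dfn, dfd = var2 / var1, len(array2) - 1, len(array1) - 1
    # Two-tailed test: double the upper-tail area (capped at 1)
    p_value = min(2 * (1 - f.cdf(f_value, dfn, dfd)), 1.0)
    return f_value, p_value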

data1 = np.array([24, 25, 28, 23, 22, 20, 27])
data2 = np.array([31, 33, 35, 30, 32, 36])

f_value, p_value = f_test(data1, data2)
print(f'F-value: {f_value:.4f}')
print(f'p-value: {p_value:.4f}')
alpha = 0.05
if p_value < alpha:
    print("\nWe reject the null hypothesis and conclude that the variances are significantly different.")
else:
    print("\nWe fail to reject the null hypothesis and conclude that there is not enough evidence to suggest that the variances are significantly different.")


F-value: 1.4552
p-value: 0.6975

We fail to reject the null hypothesis and conclude that there is not enough evidence to suggest that the variances are significantly different.


In [None]:
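# ### Note: scipy.stats.f_oneway runs a one-way ANOVA, which compares group means,
# ### not variances, so the code below is not an alternative to the F-test above.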

# import numpy as np
# from scipy.stats import f_oneway

# # Define the data for the two restaurants
# restaurant_a = np.array([24, 25, 28, 23, 22, 20, 27])
# restaurant_b = np.array([31, 33, 35, 30, 32, 36])

# # Calculate the F-test and p-value
# F, p_value = f_oneway(restaurant_a, restaurant_b)

# # Define the significance level
# alpha = 0.05

# # Print the results
# print("F-test statistic:", F)
# print("p-value:", p_value)
# if p_value < alpha:
#     print("We reject the null hypothesis and conclude that the variances are significantly different.")
# else:
#     print("We fail to reject the null hypothesis and conclude that there is not enough evidence to suggest that the variances are significantly different.")
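
The commented-out cell above calls `scipy.stats.f_oneway`, which is a one-way ANOVA and therefore tests whether the group means are equal, not the variances. One way to see this is that, for two groups, its F-statistic equals the square of the pooled-variance t-statistic from `scipy.stats.ttest_ind`. The sketch below checks that relation on the restaurant data.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

restaurant_a = np.array([24, 25, 28, 23, 22, 20, 27])
restaurant_b = np.array([31, 33, 35, 30, 32, 36])

# One-way ANOVA: tests whether the group means are equal
anova_f, anova_p = f_oneway(restaurant_a, restaurant_b)

# Pooled-variance two-sample t-test: also a test of equal means
t_stat, t_p = ttest_ind(restaurant_a, restaurant_b, equal_var=True)

print(f"ANOVA F: {anova_f:.4f}, t squared: {t_stat**2:.4f}")
print(f"ANOVA p: {anova_p:.6f}, t-test p: {t_p:.6f}")
```

Because the two restaurants have clearly different average waiting times, this test gives a very different answer from the variance-ratio F-test, so it should not be used to compare variances.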


### Question 9

Q9. The following data represent the test scores of two groups of students: Group A: 80, 85, 90, 92, 87, 83;
Group B: 75, 78, 82, 79, 81, 84. Conduct an F-test at the 1% significance level to determine if the variances
are significantly different.

__Hypothesis:__

H0: The two population variances are equal

Ha: The two population variances are not equal

In [62]:
from scipy.stats import f
import numpy as np

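def f_test(array1, array2):
    var1 = np.var(array1, ddof=1)
    var2 = np.var(array2, ddof=1)
    # Larger variance goes in the numerator; its sample supplies the numerator df
    if var1 >= var2:
        f_value, dfn, dfd = var1 / var2, len(array1) - 1, len(array2) - 1
    else:
        f_value, dfn, dfd = var2 / var1, len(array2) - 1, len(array1) - 1
    # Two-tailed test: double the upper-tail area (capped at 1)
    p_value = min(2 * (1 - f.cdf(f_value, dfn, dfd)), 1.0)
    return f_value, p_value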

GroupA = np.array([80, 85, 90, 92, 87, 83])
GroupB = np.array([75, 78, 82, 79, 81, 84])

f_value, p_value = f_test(GroupA, GroupB)
print(f'F-value: {f_value:.4f}')
print(f'p-value: {p_value:.4f}')

alpha = 0.01

if p_value < alpha:
    print("\nWe reject the null hypothesis and conclude that the variances are significantly different.")
else:
    print("\nWe fail to reject the null hypothesis and conclude that there is not enough evidence to suggest that the variances are significantly different.")


F-value: 1.9443
p-value: 0.4831

We fail to reject the null hypothesis and conclude that there is not enough evidence to suggest that the variances are significantly different.


In [27]:
crit_val

nan