1. Explain the properties of the F-distribution.
The F-distribution is a continuous probability distribution that is commonly used in statistical hypothesis testing, particularly in the analysis of variance (ANOVA) and regression analysis. Here are the key properties of the F-distribution:

Non-Symmetry: The F-distribution is not symmetric; it is skewed to the right.

Degrees of Freedom: The F-distribution is defined by two parameters, known as degrees of freedom. These are typically denoted $v_1$ (numerator degrees of freedom) and $v_2$ (denominator degrees of freedom).

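Non-Negative: An F-distributed variable takes only non-negative values, because it is a ratio of two scaled chi-square variables (in practice, a ratio of variance estimates).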

Reciprocal Property: Swapping the two degrees of freedom and taking the complementary probability inverts the quantile. This is sometimes called the "three-reverse formula":

$F_{\alpha}(n_1, n_2) = \dfrac{1}{F_{1-\alpha}(n_2, n_1)}$
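
The reciprocal relationship can be checked numerically. The following is a minimal sketch, assuming `scipy.stats.f.ppf` as the quantile function; the values of `alpha`, `n1`, and `n2` are arbitrary illustrative choices, not values from the text.

```python
import numpy as np
from scipy.stats import f

alpha = 0.05     # illustrative probability level (assumption)
n1, n2 = 5, 10   # illustrative degrees of freedom (assumption)

# Quantile of F(n1, n2) at probability alpha
lhs = f.ppf(alpha, n1, n2)

# Reciprocal of the quantile of F(n2, n1) at probability 1 - alpha
rhs = 1.0 / f.ppf(1 - alpha, n2, n1)

print(f"F_alpha(n1, n2)         = {lhs:.6f}")
print(f"1 / F_(1-alpha)(n2, n1) = {rhs:.6f}")
print("Equal up to rounding:", np.isclose(lhs, rhs))
```

The identity holds because the reciprocal of an F(n1, n2) variable follows an F(n2, n1) distribution.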

Python Code to Work with the F-Distribution
To work with the F-distribution in Python, you can use the scipy.stats module, which provides various functions to handle the F-distribution. Here is an example of how to use these functions:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f

# Define the degrees of freedom
v1 = 5  # Numerator degrees of freedom
v2 = 10 # Denominator degrees of freedom

# Generate random samples from the F-distribution
samples = f.rvs(v1, v2, size=1000)

# Plot the histogram of the samples
plt.hist(samples, bins=30, density=True, alpha=0.6, color='g')

# Plot the PDF of the F-distribution
x = np.linspace(f.ppf(0.01, v1, v2), f.ppf(0.99, v1, v2), 100)
plt.plot(x, f.pdf(x, v1, v2), 'r-', lw=2, label='F-distribution PDF')

plt.title('F-Distribution with v1=5 and v2=10')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.show()

# Calculate the CDF at a specific point
point = 2.0
cdf_value = f.cdf(point, v1, v2)
print(f'CDF at x={point}: {cdf_value}')

# Calculate the PPF (percent point function) at a specific probability
probability = 0.95
ppf_value = f.ppf(probability, v1, v2)
print(f'PPF at p={probability}: {ppf_value}')

2. In which types of statistical tests is the F-distribution used, and why is it appropriate for these tests?
The F-distribution is used in several types of statistical tests, particularly in the context of comparing variances and means across different groups. Here are the key applications and why the F-distribution is appropriate for these tests:

Types of Statistical Tests Using the F-Distribution
ANOVA (Analysis of Variance):

Purpose: ANOVA is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.
Why F-Distribution: The F-distribution is used to compare the variance between the groups to the variance within the groups. A high F-value indicates that the variance between the groups is significantly greater than the variance within the groups, suggesting that the group means are not equal.
F-Test for Equality of Variances:

Purpose: This test is used to determine if the variances of two populations are equal.
Why F-Distribution: The F-distribution is used to compare the ratio of the variances of two samples. If the variances are equal, the F-ratio should be close to 1. A significantly different F-ratio suggests that the variances are not equal.
Regression Analysis:

Purpose: In regression analysis, the F-distribution is used to test the overall significance of the model.
Why F-Distribution: The F-test in regression compares the variance explained by the model to the residual variance. A high F-value indicates that the model explains a significant portion of the variance in the dependent variable.
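
The regression case can be sketched with a hand-computed overall F-test. This is a minimal illustration on synthetic data, assuming a single predictor and ordinary least squares; the coefficients, sample size, and variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption): y depends linearly on x plus noise
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)

# Fit y = b0 + b1 * x by ordinary least squares
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

p = 1  # number of predictors, excluding the intercept
ss_model = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the model
ss_resid = np.sum((y - y_hat) ** 2)         # residual variation

# F = (explained variance per predictor) / (residual variance)
f_statistic = (ss_model / p) / (ss_resid / (n - p - 1))
p_value = stats.f.sf(f_statistic, p, n - p - 1)

print(f"F-statistic: {f_statistic:.2f}")
print(f"P-value: {p_value:.3g}")
```

With one predictor, this F-statistic equals the square of the t-statistic for the slope.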
Python Code Example
Here is a Python code example demonstrating the use of the F-distribution in an ANOVA test:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate sample data for three groups
np.random.seed(0)
group1 = np.random.normal(loc=5, scale=2, size=100)
group2 = np.random.normal(loc=7, scale=2, size=100)
group3 = np.random.normal(loc=9, scale=2, size=100)

# Perform ANOVA
f_statistic, p_value = stats.f_oneway(group1, group2, group3)

# Print the results
print(f'F-statistic: {f_statistic}')
print(f'P-value: {p_value}')

# Plot the data
plt.figure(figsize=(10, 6))
plt.hist(group1, bins=20, alpha=0.5, label='Group 1')
plt.hist(group2, bins=20, alpha=0.5, label='Group 2')
plt.hist(group3, bins=20, alpha=0.5, label='Group 3')
plt.legend()
plt.title('Histogram of Group Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

3. What are the key assumptions required for conducting an F-test to compare the variances of two
populations?
Key Assumptions for Conducting an F-Test to Compare the Variances of Two Populations
Normality: The data in both populations should be normally distributed. This is a critical assumption because the F-test is sensitive to deviations from normality.

Independence: The samples from the two populations should be independent of each other. This means that the observations in one sample do not influence the observations in the other sample.

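Equal Variances Under the Null Hypothesis: The ratio of sample variances follows an F-distribution only if the two population variances are equal. This equality is the null hypothesis being tested, not a condition the data must satisfy in advance.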
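
Because the F-test is sensitive to non-normality, it can help to check the normality assumption first. The sketch below uses synthetic samples (an assumption) with `scipy.stats.shapiro`, and shows Levene's test (`scipy.stats.levene`) as a commonly used alternative that is less sensitive to departures from normality. It is one possible workflow, not the only one.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumption): two independent normal populations
rng = np.random.default_rng(0)
sample1 = rng.normal(loc=5, scale=2, size=100)
sample2 = rng.normal(loc=7, scale=2, size=100)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal
for name, sample in [("sample1", sample1), ("sample2", sample2)]:
    w_stat, p_norm = stats.shapiro(sample)
    print(f"{name}: Shapiro-Wilk W={w_stat:.3f}, p={p_norm:.3f}")

# Levene's test: the null hypothesis is that the variances are equal
lev_stat, p_levene = stats.levene(sample1, sample2)
print(f"Levene statistic={lev_stat:.3f}, p={p_levene:.3f}")
```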

Python Code Example
Here is a Python code example demonstrating how to perform an F-test to compare the variances of two populations:
import numpy as np
import scipy.stats as stats

# Generate sample data for two populations
np.random.seed(0)
population1 = np.random.normal(loc=5, scale=2, size=100)
population2 = np.random.normal(loc=7, scale=2, size=100)

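# Perform the F-test: ratio of the two sample variances
# (stats.f_oneway would compare the means, not the variances)
var1 = np.var(population1, ddof=1)
var2 = np.var(population2, ddof=1)
f_statistic = var1 / var2
dfn, dfd = len(population1) - 1, len(population2) - 1

# Two-tailed p-value from the F-distribution
cdf = stats.f.cdf(f_statistic, dfn, dfd)
p_value = 2 * min(cdf, 1 - cdf)

# Print the results
print(f'F-statistic: {f_statistic}')
print(f'P-value: {p_value}')

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print('Reject the null hypothesis: The variances are not equal.')
else:
    print('Fail to reject the null hypothesis: No evidence that the variances differ.')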

4. What is the purpose of ANOVA, and how does it differ from a t-test?
Purpose of ANOVA
The purpose of ANOVA (Analysis of Variance) is to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. ANOVA is an omnibus test, meaning it tests for a difference overall between all groups. It compares the variance between the groups to the variance within the groups to determine if the differences in means are significant.

Differences Between ANOVA and T-Test
Number of Groups:

ANOVA: Used to compare the means of three or more groups.
T-Test: Used to compare the means of two groups.
Type of Test:

ANOVA: An omnibus test that provides a single p-value indicating whether there is a significant difference among the group means.
T-Test: Provides a direct comparison between two groups, resulting in a single p-value for that comparison.
Assumptions:

ANOVA: Assumes normality, independence, and homogeneity of variances across groups.
T-Test: Also assumes normality and independence. Student's t-test additionally assumes equal variances in the two groups, while Welch's variant does not.
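
The two tests are closely related: with exactly two groups, one-way ANOVA and the pooled-variance t-test give the same p-value, and the F-statistic equals the squared t-statistic. The sketch below checks this on synthetic data (the group parameters are illustrative assumptions).

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption): two independent normal groups
rng = np.random.default_rng(0)
group_a = rng.normal(loc=5, scale=2, size=30)
group_b = rng.normal(loc=7, scale=2, size=30)

# Pooled-variance t-test (equal_var=True is the default)
t_stat, p_t = stats.ttest_ind(group_a, group_b)

# One-way ANOVA on the same two groups
f_stat, p_f = stats.f_oneway(group_a, group_b)

print(f"t^2 = {t_stat**2:.4f}, F = {f_stat:.4f}")
print(f"t-test p = {p_t:.3g}, ANOVA p = {p_f:.3g}")
```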
Python Code Example
Here is a Python code example demonstrating how to perform an ANOVA and a t-test:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate sample data for three groups
np.random.seed(0)
group1 = np.random.normal(loc=5, scale=2, size=100)
group2 = np.random.normal(loc=7, scale=2, size=100)
group3 = np.random.normal(loc=9, scale=2, size=100)

# Perform ANOVA
f_statistic, p_value_anova = stats.f_oneway(group1, group2, group3)

# Perform t-test between group1 and group2
t_statistic, p_value_ttest = stats.ttest_ind(group1, group2)

# Print the results
print(f'ANOVA F-statistic: {f_statistic}')
print(f'ANOVA P-value: {p_value_anova}')
print(f'T-test t-statistic: {t_statistic}')
print(f'T-test P-value: {p_value_ttest}')

# Plot the data
plt.figure(figsize=(10, 6))
plt.hist(group1, bins=20, alpha=0.5, label='Group 1')
plt.hist(group2, bins=20, alpha=0.5, label='Group 2')
plt.hist(group3, bins=20, alpha=0.5, label='Group 3')
plt.legend()
plt.title('Histogram of Group Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

5. Explain when and why you would use a one-way ANOVA instead of multiple t-tests when comparing more
than two groups.
When and Why to Use One-Way ANOVA Instead of Multiple T-Tests
When to Use One-Way ANOVA
One-way ANOVA is used when you need to compare the means of more than two independent groups. It is particularly useful in experimental designs where you have multiple treatment groups and a control group, or when you want to compare the means of different categories of a single independent variable.

Why Use One-Way ANOVA Instead of Multiple T-Tests
Control of Type I Error: Conducting multiple t-tests increases the likelihood of committing at least one Type I error (incorrectly rejecting a true null hypothesis). One-way ANOVA controls the overall Type I error rate across all group comparisons, making it a more reliable method for multiple group comparisons.

Efficiency: One-way ANOVA provides a single p-value that indicates whether there is a significant difference among the group means. This is more efficient than performing multiple t-tests, which would require multiple comparisons and adjustments for multiple testing.

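Statistical Power: ANOVA pools the within-group variability from all groups into a single error estimate, which has more degrees of freedom than the error estimate of any pairwise t-test. As a result, it is generally more powerful than running multiple t-tests with a multiple-testing correction.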
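
The inflation of the Type I error rate can be made concrete. The sketch below computes the number of pairwise t-tests for k groups and the approximate familywise error rate, 1 - (1 - alpha)^m. It assumes the m tests are independent, which pairwise comparisons are not exactly, so treat the numbers as a rough guide; the function name is hypothetical.

```python
from math import comb

def familywise_error_rate(k_groups, alpha=0.05):
    """Return (number of pairwise tests, approximate familywise error rate).

    Assumes independent tests, which is only an approximation here.
    """
    m = comb(k_groups, 2)          # number of pairwise comparisons
    fwer = 1 - (1 - alpha) ** m    # P(at least one false rejection)
    return m, fwer

for k in (3, 4, 5):
    m, fwer = familywise_error_rate(k)
    print(f"{k} groups -> {m} t-tests, familywise error rate ~ {fwer:.3f}")
```

With three groups, three t-tests at alpha = 0.05 already push the chance of at least one false rejection to roughly 14%.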

Python Code Example
Here is a Python code example demonstrating how to perform a one-way ANOVA:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate sample data for three groups
np.random.seed(0)
group1 = np.random.normal(loc=5, scale=2, size=100)
group2 = np.random.normal(loc=7, scale=2, size=100)
group3 = np.random.normal(loc=9, scale=2, size=100)

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(group1, group2, group3)

# Print the results
print(f'F-statistic: {f_statistic}')
print(f'P-value: {p_value}')

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print('Reject the null hypothesis: There is a significant difference between the group means.')
else:
    print('Fail to reject the null hypothesis: There is no significant difference between the group means.')

# Plot the data
plt.figure(figsize=(10, 6))
plt.hist(group1, bins=20, alpha=0.5, label='Group 1')
plt.hist(group2, bins=20, alpha=0.5, label='Group 2')
plt.hist(group3, bins=20, alpha=0.5, label='Group 3')
plt.legend()
plt.title('Histogram of Group Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

6. Explain how variance is partitioned in ANOVA into between-group variance and within-group variance.
How does this partitioning contribute to the calculation of the F-statistic?
Variance Partitioning in ANOVA
In ANOVA, the total variance of the data is partitioned into two components:

Between-Group Variance (SSB):

Measures the variability between the group means and the overall mean (grand mean).
Indicates how much the group means differ from one another.
Within-Group Variance (SSW):

Measures the variability within each group around their respective group means.
Reflects the natural variability in the data within groups.
F-statistic Calculation
The F-statistic is the ratio of the between-group variance to the within-group variance:

$F = \dfrac{\text{MSB}}{\text{MSW}}$

Where:

Mean Square Between (MSB) = $\dfrac{\text{SSB}}{df_{\text{between}}}$, where $df_{\text{between}} = \text{Number of Groups} - 1$.
Mean Square Within (MSW) = $\dfrac{\text{SSW}}{df_{\text{within}}}$, where $df_{\text{within}} = \text{Total Observations} - \text{Number of Groups}$.
Steps in Variance Partitioning
Calculate the grand mean of all observations.
Compute SSB (sum of squares between groups) by summing the squared differences between each group mean and the grand mean, weighted by group size.
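Compute SSW (sum of squares within groups) by summing the squared differences between each observation and its own group mean.
Divide SSB and SSW by their degrees of freedom to obtain MSB and MSW, then take the ratio F = MSB / MSW.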
import numpy as np

# Example data: Scores from three groups
group1 = np.array([85, 90, 88, 86, 89])
group2 = np.array([78, 76, 80, 79, 77])
group3 = np.array([92, 95, 93, 91, 94])

# Combine groups and calculate grand mean
groups = [group1, group2, group3]
all_data = np.concatenate(groups)
grand_mean = np.mean(all_data)

# Between-Group Variance (SSB)
n_groups = len(groups)
ssb = sum(len(group) * (np.mean(group) - grand_mean)**2 for group in groups)

# Within-Group Variance (SSW)
ssw = sum(np.sum((group - np.mean(group))**2) for group in groups)

# Degrees of freedom
df_between = n_groups - 1
df_within = len(all_data) - n_groups

# Mean Squares
msb = ssb / df_between
msw = ssw / df_within

# F-statistic
f_statistic = msb / msw

# Print Results
print(f"SSB (Between-Group): {ssb}")
print(f"SSW (Within-Group): {ssw}")
print(f"MSB (Mean Square Between): {msb}")
print(f"MSW (Mean Square Within): {msw}")
print(f"F-statistic: {f_statistic}")
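
The manual calculation above can be cross-checked in two ways: the total sum of squares should equal SSB + SSW, and the F-statistic should match `scipy.stats.f_oneway`. The sketch below repeats the same example data so it runs on its own; the `_manual` and `_scipy` variable names are illustrative.

```python
import numpy as np
from scipy.stats import f, f_oneway

# Same example data as above
group1 = np.array([85, 90, 88, 86, 89])
group2 = np.array([78, 76, 80, 79, 77])
group3 = np.array([92, 95, 93, 91, 94])
groups = [group1, group2, group3]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Partition of the total sum of squares
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
sst = ((all_data - grand_mean) ** 2).sum()
print(f"SST = {sst:.2f}, SSB + SSW = {ssb + ssw:.2f}")

# F-statistic and p-value from the partition
df_between = len(groups) - 1
df_within = len(all_data) - len(groups)
f_manual = (ssb / df_between) / (ssw / df_within)
p_manual = f.sf(f_manual, df_between, df_within)

# Reference result from SciPy
f_scipy, p_scipy = f_oneway(*groups)
print(f"Manual F = {f_manual:.4f}, SciPy F = {f_scipy:.4f}")
print(f"Manual p = {p_manual:.3g}, SciPy p = {p_scipy:.3g}")
```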

7. Compare the classical (frequentist) approach to ANOVA with the Bayesian approach. What are the key
differences in terms of how they handle uncertainty, parameter estimation, and hypothesis testing?
Comparison of Classical (Frequentist) and Bayesian ANOVA
1. Handling Uncertainty
Classical (Frequentist) Approach:

Treats parameters as fixed but unknown constants; uncertainty arises from sampling variability in the data.
Provides p-values to determine whether the null hypothesis (no differences between groups) can be rejected.
Does not assign probabilities to hypotheses; the null is either rejected or not rejected.
Bayesian Approach:

Treats parameters as random variables with probability distributions.
Uses prior beliefs (prior distributions) and updates them with data (likelihood) to compute posterior distributions.
Allows direct probability statements about hypotheses (e.g., "The probability of group differences is 95%.").
2. Parameter Estimation
Classical Approach:
Estimates parameters (e.g., means, variances) using point estimates like the sample mean and variance.
Provides confidence intervals to express the uncertainty around these estimates.
Bayesian Approach:
Estimates parameters as distributions (posterior distributions).
Produces credible intervals, which directly represent the range of values with a specified probability (e.g., 95%).
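
On the classical side, a confidence interval for a group mean can be sketched as follows. This is a minimal example, assuming a t-based interval and using the scores that appear as group1 in this document's code; the variable names are illustrative. A Bayesian credible interval would instead be read from the posterior distribution of the mean.

```python
import numpy as np
from scipy import stats

# Scores for one group (same values as group1 in the examples)
scores = np.array([85, 90, 88, 86, 89])

n = len(scores)
mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean (uses ddof=1)

# 95% confidence interval based on the t-distribution with n - 1 df
ci_low, ci_high = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)

print(f"Sample mean: {mean:.2f}")
print(f"95% confidence interval: ({ci_low:.2f}, {ci_high:.2f})")
```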
3. Hypothesis Testing
Classical Approach:

Uses F-statistics and p-values to determine if group differences are statistically significant.
Hypothesis testing is dichotomous: reject or fail to reject the null hypothesis.
Bayesian Approach:

Directly calculates the posterior probability of the null or alternative hypothesis.
Can compare models using Bayesian model comparison techniques, such as Bayes factors.
Allows for nuanced interpretation without requiring binary decision-making.
Python Code: Classical vs. Bayesian ANOVA
import numpy as np
from scipy.stats import f_oneway

# Example data
group1 = [85, 90, 88, 86, 89]
group2 = [78, 76, 80, 79, 77]
group3 = [92, 95, 93, 91, 94]

# Perform one-way ANOVA
f_stat, p_value = f_oneway(group1, group2, group3)

print("Classical ANOVA Results:")
print(f"F-statistic: {f_stat}")
print(f"P-value: {p_value}")
if p_value < 0.05:
    print("Reject the null hypothesis: significant differences exist.")
else:
    print("Fail to reject the null hypothesis: no significant differences.")
Bayesian ANOVA (Using PyMC for Bayesian Modeling)
import pymc as pm
import arviz as az
import numpy as np

# Example data
data = {
    "group1": [85, 90, 88, 86, 89],
    "group2": [78, 76, 80, 79, 77],
    "group3": [92, 95, 93, 91, 94]
}

# Convert data to a single array with group labels
all_data = np.concatenate([data["group1"], data["group2"], data["group3"]])
group_labels = np.concatenate([[1]*len(data["group1"]), [2]*len(data["group2"]), [3]*len(data["group3"])])

# Bayesian ANOVA Model
with pm.Model() as model:
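    # Priors for group means and shared standard deviation
    # (means are centered on the overall data mean so the prior stays weakly informative)
    mu_group = pm.Normal("mu_group", mu=all_data.mean(), sigma=10, shape=3)
    sigma = pm.HalfNormal("sigma", sigma=10)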

    # Likelihood
    obs = pm.Normal("obs", mu=mu_group[group_labels - 1], sigma=sigma, observed=all_data)

    # Posterior sampling
    trace = pm.sample(1000, return_inferencedata=True)

# Summary of results
print("Bayesian ANOVA Results:")
print(az.summary(trace, var_names=["mu_group", "sigma"]))

# Posterior plots
az.plot_posterior(trace, var_names=["mu_group"])


8. Question: You have two sets of data representing the incomes of two different professions:
Profession A: [48, 52, 55, 60, 62]
Profession B: [45, 50, 55, 52, 47]
Perform an F-test to determine if the variances of the two professions' incomes are equal. What are your conclusions based on the F-test?

Task: Use Python to calculate the F-statistic and p-value for the given data.

Objective: Gain experience in performing F-tests and interpreting the results in terms of variance comparison.
F-Test to Compare Variances
The F-test is used to compare the variances of two samples to determine if they are equal. It involves calculating the F-statistic as:

$F = \dfrac{\text{Variance of Group A}}{\text{Variance of Group B}}$

The larger variance is placed in the numerator to ensure $F \geq 1$.
The p-value is calculated based on the F-distribution with degrees of freedom for the two groups.
import numpy as np
from scipy.stats import f

# Data for the two professions
profession_a = [48, 52, 55, 60, 62]
profession_b = [45, 50, 55, 52, 47]

# Variances of the two groups
var_a = np.var(profession_a, ddof=1)  # Sample variance (use ddof=1 for unbiased estimate)
var_b = np.var(profession_b, ddof=1)

# Calculate the F-statistic
if var_a > var_b:
    f_stat = var_a / var_b
    dfn, dfd = len(profession_a) - 1, len(profession_b) - 1
else:
    f_stat = var_b / var_a
    dfn, dfd = len(profession_b) - 1, len(profession_a) - 1

# Calculate the p-value
p_value = min(1.0, 2 * f.sf(f_stat, dfn, dfd))  # Two-tailed test, capped at 1

# Results
print("F-Test Results:")
print(f"Variance of Profession A: {var_a:.2f}")
print(f"Variance of Profession B: {var_b:.2f}")
print(f"F-statistic: {f_stat:.2f}")
print(f"Degrees of Freedom (dfn, dfd): ({dfn}, {dfd})")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: The variances are significantly different.")
else:
    print("Fail to reject the null hypothesis: No significant difference in variances.")
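
The same decision can be reached by comparing the F-statistic with a critical value instead of using the p-value. The sketch below assumes a two-tailed test at alpha = 0.05 with the larger variance in the numerator, so the upper critical value is taken at 1 - alpha/2; the variable names `critical` and `reject` are illustrative.

```python
import numpy as np
from scipy.stats import f

profession_a = [48, 52, 55, 60, 62]
profession_b = [45, 50, 55, 52, 47]

# Sample variances (ddof=1)
var_a = np.var(profession_a, ddof=1)
var_b = np.var(profession_b, ddof=1)

# Put the larger variance in the numerator so that F >= 1
if var_a >= var_b:
    f_stat = var_a / var_b
    dfn, dfd = len(profession_a) - 1, len(profession_b) - 1
else:
    f_stat = var_b / var_a
    dfn, dfd = len(profession_b) - 1, len(profession_a) - 1

# Upper critical value for a two-tailed test at level alpha
alpha = 0.05
critical = f.ppf(1 - alpha / 2, dfn, dfd)
reject = f_stat > critical

print(f"F-statistic: {f_stat:.2f}")
print(f"Critical value: {critical:.2f}")
print("Reject H0" if reject else "Fail to reject H0")
```

For these data the F-statistic falls below the critical value, so there is no evidence at the 5% level that the two income variances differ.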

9. Question: Conduct a one-way ANOVA to test whether there are any statistically significant differences in average heights between three different regions with the following data:
Region A: [160, 162, 165, 158, 164]
Region B: [172, 175, 170, 168, 174]
Region C: [180, 182, 179, 185, 183]
Task: Write Python code to perform the one-way ANOVA and interpret the results.
Objective: Learn how to perform one-way ANOVA using Python and interpret F-statistic and p-value.
One-Way ANOVA to Test Differences in Average Heights
One-way ANOVA compares the means of three or more groups to determine if there are statistically significant differences among them. Here’s how to perform the test and interpret the results.
from scipy.stats import f_oneway

# Data for the three regions
region_a = [160, 162, 165, 158, 164]
region_b = [172, 175, 170, 168, 174]
region_c = [180, 182, 179, 185, 183]

# Perform one-way ANOVA
f_stat, p_value = f_oneway(region_a, region_b, region_c)

# Results
print("One-Way ANOVA Results:")
print(f"F-statistic: {f_stat:.2f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Reject the null hypothesis: There are statistically significant differences in average heights between the regions.")
else:
    print("Fail to reject the null hypothesis: No statistically significant differences in average heights between the regions.")
Steps in the Code
Data Input:
Heights for each region are provided as lists.
One-Way ANOVA Test:
Use scipy.stats.f_oneway() to compute the F-statistic and p-value.
Results Interpretation:
The F-statistic quantifies the ratio of between-group variance to within-group variance.
The p-value determines if the observed differences are statistically significant (compared to the significance level, $\alpha = 0.05$).
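
Because ANOVA is an omnibus test, a significant result does not say which regions differ. One possible follow-up, sketched below, is pairwise t-tests with a Bonferroni correction (each p-value is multiplied by the number of comparisons). This is an illustrative choice, not the only post-hoc method; the `regions` and `adjusted` names are hypothetical.

```python
from itertools import combinations
from scipy.stats import ttest_ind

regions = {
    "A": [160, 162, 165, 158, 164],
    "B": [172, 175, 170, 168, 174],
    "C": [180, 182, 179, 185, 183],
}

pairs = list(combinations(regions, 2))  # (A, B), (A, C), (B, C)
alpha = 0.05
adjusted = {}

for a, b in pairs:
    t_stat, p = ttest_ind(regions[a], regions[b])
    # Bonferroni correction: scale by the number of comparisons, cap at 1
    p_adj = min(1.0, p * len(pairs))
    adjusted[(a, b)] = p_adj
    verdict = "different" if p_adj < alpha else "not clearly different"
    print(f"Region {a} vs Region {b}: adjusted p = {p_adj:.4f} ({verdict})")
```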
