Statistics Advance - 1 Assignment

Q1. Explain the properties of the F-distribution. 

Ans. The F-distribution has these key properties:

1. Non-Negative and Right-Skewed: Takes only non-negative values and is right-skewed with a long right tail.

2. Degrees of Freedom: Shaped by the degrees of freedom of the variances in the numerator and denominator.

3. Uses in Hypothesis Testing: Commonly applied in ANOVA and regression for comparing variances, with critical values used to assess the significance of the F-statistic.

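4. Derived from Variance Ratios: It is the distribution of the ratio of two independent chi-square variables, each divided by its degrees of freedom. Equivalently, the ratio of two independent sample variances from normal populations with equal variances follows an F-distribution.

5. Right-Tail Test: In ANOVA and regression, hypothesis tests using the F-distribution are right-tailed, because departures from the null hypothesis inflate the numerator and produce large F values.

6. As Degrees of Freedom Increase: As both degrees of freedom increase, the F-distribution becomes less skewed, concentrates around 1, and is increasingly well approximated by a normal distribution.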

7. No Symmetry: Unlike the normal distribution, the F-distribution is asymmetric.

8. Depends on Sample Size: The shape of the distribution changes with the sample sizes of the groups being compared.

9. Mode: For numerator degrees of freedom d1 > 2, the mode (peak) is strictly positive, located at ((d1 - 2) / d1) * (d2 / (d2 + 2)), and always lies below 1. For d1 <= 2 the density peaks at zero.
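
A few of these properties can be checked numerically. The sketch below uses `scipy.stats.f`; the degrees-of-freedom pairs are illustrative choices rather than values from the assignment, and the mode is computed from the closed-form expression that holds when the numerator degrees of freedom exceed 2.

```python
from scipy import stats

# Illustrative (numerator df, denominator df) pairs; assumptions for this demo.
df_pairs = [(3, 10), (10, 30), (100, 300)]

skews = []
for dfn, dfd in df_pairs:
    dist = stats.f(dfn, dfd)
    mean, var, skew = dist.stats(moments="mvs")
    # Closed-form mode, valid only for dfn > 2.
    mode = (dfn - 2) / dfn * dfd / (dfd + 2)
    skews.append(float(skew))
    print(f"dfn={dfn:>3}, dfd={dfd:>3}: mean={float(mean):.3f}, "
          f"mode={mode:.3f}, skewness={float(skew):.3f}")

# The support is non-negative, so no probability mass lies at or below zero.
cdf_at_zero = stats.f.cdf(0, 3, 10)
print(f"P(F <= 0) = {cdf_at_zero}")
```

The printed skewness should shrink as both degrees of freedom grow, while the mean and mode both move toward 1.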

Q2. In which types of statistical tests is the F-distribution used, and why is it appropriate for these tests?

Ans. The F-distribution is used in the following statistical tests:

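1. ANOVA: Compares the variance between groups with the variance within groups to test whether group means differ. Under the null hypothesis and normality, the ratio of these two mean squares follows an F-distribution, which is why it supplies the reference values for the test.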

2. Regression Analysis: Tests the overall significance of a regression model by comparing explained to unexplained variance.
    
3. F-Test for Equality of Variances: Compares two sample variances to determine if they come from populations with equal variances.
    
4. Two-Way ANOVA: Analyzes interaction effects between two variables on a dependent variable.

5. Testing Multiple Linear Restrictions: In regression, compares restricted and unrestricted models to test if additional parameters improve the model.
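
As a rough illustration of the regression case, the sketch below computes the overall F-statistic by hand from explained and unexplained variation. The data are simulated; the seed, sample size, and coefficients are assumptions made only for this example.

```python
import numpy as np
from scipy import stats

# Simulated data (hypothetical): n observations, k predictors.
rng = np.random.default_rng(0)
n, k = 30, 2
X = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Ordinary least squares with an intercept column.
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Split total variation into explained and unexplained parts.
residuals = y - X_design @ beta
ss_res = np.sum(residuals**2)            # unexplained
ss_tot = np.sum((y - y.mean()) ** 2)     # total
ss_reg = ss_tot - ss_res                 # explained

# Overall F-test: H0 says every slope coefficient is zero.
f_stat = (ss_reg / k) / (ss_res / (n - k - 1))
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

Each sum of squares is divided by its degrees of freedom before the ratio is taken, which is what makes the F-distribution with (k, n - k - 1) degrees of freedom the right reference.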

Q3. What are the key assumptions required for conducting an F-test to compare the variances of two populations?

Ans. The key assumptions required for conducting an F-test to compare the variances of two populations are:

1. Independence of Samples: The two samples being compared must be independent of each other. The outcome or variance in one sample should not influence the other.

2. Normality of Populations: Both populations from which the samples are drawn should be normally distributed. The F-test is sensitive to deviations from normality, especially with small sample sizes.

3. Random Sampling: The samples should be randomly selected from their respective populations to ensure that the results are unbiased and representative.

4. Ratio of Variances: The F-test compares the ratio of variances. Both populations should have finite, non-zero variances, as the test evaluates whether the two variances are equal.

5. Homoscedasticity (ANOVA context only): When the F-test is used in ANOVA, the groups being compared are assumed to have roughly equal variances. This is not an assumption of the two-sample variance F-test, because equality of the two variances is the null hypothesis being tested.
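
One way to probe the normality assumption before relying on the variance-ratio F-test is sketched below. The two samples are hypothetical. `scipy.stats.shapiro` checks each sample for normality, and `scipy.stats.levene` is shown as a variance comparison that is less sensitive to non-normal data; it is an alternative technique, not the F-test itself.

```python
import numpy as np
from scipy import stats

# Hypothetical samples, for illustration only.
sample_1 = np.array([12.1, 11.8, 12.6, 13.0, 12.4, 11.9, 12.7, 12.2])
sample_2 = np.array([11.5, 13.2, 12.9, 10.8, 12.0, 13.6, 11.1, 12.5])

# Shapiro-Wilk: a small p-value suggests a departure from normality.
for name, sample in [("sample_1", sample_1), ("sample_2", sample_2)]:
    w_stat, p_norm = stats.shapiro(sample)
    print(f"{name}: W = {w_stat:.3f}, p = {p_norm:.3f}")

# Levene's test (median-centred) compares variances without leaning as
# heavily on normality as the variance-ratio F-test does.
lev_stat, p_lev = stats.levene(sample_1, sample_2, center="median")
print(f"Levene: statistic = {lev_stat:.3f}, p = {p_lev:.3f}")
```

With samples this small, normality tests have little power, so a non-significant Shapiro-Wilk result is weak reassurance at best.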

Q4. What is the purpose of ANOVA, and how does it differ from a t-test?

Ans. Purpose of ANOVA: ANOVA (Analysis of Variance) compares the means of three or more groups to determine if there are significant differences by analyzing variance between and within groups.

Purpose of a t-test: A t-test compares the means of two groups to test if the difference between them is statistically significant.

Key Differences:

1. Number of Groups:

ANOVA: Compares 3 or more groups.

t-test: Compares 2 groups.

2. Variance Assessed:

ANOVA: Examines between- and within-group variance.

t-test: Focuses on the mean difference.

3. Type I Error:

ANOVA: A single omnibus test keeps the overall Type I error rate at the chosen significance level.

t-test: Higher risk of Type I error when comparing multiple pairs.

4. Post-hoc Testing:

ANOVA: Requires post-hoc tests to find specific group differences.

t-test: Directly compares two groups without post-hoc tests.
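
The two tests agree when there are exactly two groups: the one-way ANOVA F-statistic equals the square of the pooled-variance t-statistic, and the p-values coincide. A small check with hypothetical data:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for two groups.
group_1 = np.array([23.0, 25.0, 21.0, 24.0, 22.0])
group_2 = np.array([27.0, 29.0, 26.0, 30.0, 28.0])

# Pooled-variance (equal_var=True) two-sample t-test.
t_stat, p_t = stats.ttest_ind(group_1, group_2, equal_var=True)

# One-way ANOVA on the same two groups.
f_stat, p_f = stats.f_oneway(group_1, group_2)

print(f"t^2 = {t_stat**2:.4f}, F = {f_stat:.4f}")
print(f"t-test p = {p_t:.6f}, ANOVA p = {p_f:.6f}")
```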

Q5. Explain when and why you would use a one-way ANOVA instead of multiple t-tests when comparing more than two groups.

Ans. When to Use One-Way ANOVA Instead of Multiple t-tests: Use one-way ANOVA when comparing the means of three or more groups.

Reasons:

1. Control of Type I Error Rate: Reduces the risk of false positives that accumulate with multiple t-tests.

2. Single Hypothesis Test: Tests one hypothesis about the equality of means across all groups, simplifying analysis.

3. Efficiency: Conducts one analysis instead of multiple tests, reducing computational load.

4. Handling Variance: Compares both between-group and within-group variability, providing a comprehensive analysis.

5. Post-hoc Analysis: Allows for systematic post-hoc testing to identify specific group differences if significant results are found.
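
The accumulation of false positives can be made concrete with a short calculation. The sketch below assumes the pairwise tests are independent, which is a simplification (pairwise comparisons share groups), so the figures are approximate. The helper name `familywise_error_rate` is hypothetical.

```python
from math import comb


def familywise_error_rate(n_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes independent tests, each run at significance level alpha.
    """
    n_tests = comb(n_groups, 2)
    return n_tests, 1 - (1 - alpha) ** n_tests


for n_groups in (3, 5, 10):
    n_tests, fwer = familywise_error_rate(n_groups)
    print(f"{n_groups} groups -> {n_tests} t-tests, "
          f"familywise error rate ~ {fwer:.3f}")
```

Even with three groups the chance of at least one false positive is well above the nominal 5%, and it climbs quickly as groups are added.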

Q6. Explain how variance is partitioned in ANOVA into between-group variance and within-group variance. How does this partitioning contribute to the calculation of the F-statistic?

Ans. In ANOVA (Analysis of Variance), variance is partitioned into two components:

1. Between-Group Variance: This measures how much the group means differ from the overall mean. It reflects the variability due to the differences between the groups being compared.

2. Within-Group Variance: This measures the variability of individual observations within each group around their respective group mean. It indicates how much variation exists within each group.

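Contribution to the F-statistic: Each component is first summarised as a sum of squares, the between-group sum of squares (SSB) and the within-group sum of squares (SSW), which together add up to the total sum of squares. Dividing each by its degrees of freedom gives the mean squares MSB = SSB / (k - 1) and MSW = SSW / (N - k), where k is the number of groups and N is the total number of observations. The F-statistic is their ratio, F = MSB / MSW.

An F-statistic that is large relative to the critical value of the F-distribution with (k - 1, N - k) degrees of freedom means the variability between group means exceeds what within-group variability alone would explain, indicating that at least one group mean differs from the others.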
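
The partition can be carried out by hand and compared with `scipy.stats.f_oneway`. The sketch below reuses the regional height data from Q9; the variable names are chosen for this illustration.

```python
import numpy as np
from scipy import stats

# Height data from Q9.
groups = [
    np.array([160, 162, 165, 158, 164], dtype=float),  # Region A
    np.array([172, 175, 170, 168, 174], dtype=float),  # Region B
    np.array([180, 182, 179, 185, 183], dtype=float),  # Region C
]

k = len(groups)                          # number of groups
n_total = sum(g.size for g in groups)    # total observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: group means around the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: observations around their own group mean.
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Mean squares and the F-statistic.
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"SSB = {ss_between:.1f}, SSW = {ss_within:.1f}")
print(f"manual F = {f_manual:.2f}, scipy F = {f_scipy:.2f}")
```

The manual F-statistic should agree with the value `f_oneway` reports for the same data in Q9.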

Q7. Compare the classical (frequentist) approach to ANOVA with the Bayesian approach. What are the key differences in terms of how they handle uncertainty, parameter estimation, and hypothesis testing?

Ans. Comparison of Classical (Frequentist) and Bayesian Approaches to ANOVA:

1. Handling Uncertainty:
   
Frequentist: Uses confidence intervals and p-values to quantify uncertainty, focusing on long-run frequencies.

Bayesian: Expresses uncertainty with probability distributions, incorporating prior knowledge and updating beliefs with observed data.

2. Parameter Estimation:

Frequentist: Estimates parameters using point estimates (e.g., sample means) and treats them as fixed but unknown.

Bayesian: Treats parameters as random variables with probability distributions, providing estimates from the posterior distribution.


3. Hypothesis Testing:

Frequentist: Employs null hypothesis significance testing (NHST) with a fixed threshold (e.g., alpha = 0.05) to decide whether to reject the null hypothesis.

Bayesian: Uses Bayes factors or posterior probabilities to compare the strength of evidence for different hypotheses.

4. Summary:
   
The frequentist approach focuses on long-term properties and fixed estimates, while the Bayesian approach incorporates prior knowledge and provides a more flexible interpretation of uncertainty and hypothesis testing.
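
The contrast in parameter estimation can be sketched for a single group mean. This is a deliberately simplified illustration, not a full Bayesian ANOVA: it uses the Region A heights from Q9, a hypothetical normal prior on the mean, and an observation standard deviation that is assumed known so that the posterior has a closed form.

```python
import numpy as np
from scipy import stats

heights = np.array([160, 162, 165, 158, 164], dtype=float)  # Region A, Q9
n = heights.size
sample_mean = heights.mean()

# Frequentist: point estimate plus a 95% t-based confidence interval.
sem = heights.std(ddof=1) / np.sqrt(n)
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=sample_mean, scale=sem)

# Bayesian: conjugate normal-normal update.
prior_mean, prior_sd = 170.0, 10.0  # hypothetical prior belief about the mean
sigma = 3.0                         # assumed known observation SD
post_precision = 1 / prior_sd**2 + n / sigma**2
post_mean = (prior_mean / prior_sd**2 + n * sample_mean / sigma**2) / post_precision
post_sd = np.sqrt(1 / post_precision)
cred_low, cred_high = stats.norm.interval(0.95, loc=post_mean, scale=post_sd)

print(f"Frequentist: mean = {sample_mean:.2f}, "
      f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"Bayesian: posterior mean = {post_mean:.2f}, "
      f"95% credible interval = ({cred_low:.2f}, {cred_high:.2f})")
```

The posterior mean sits between the sample mean and the prior mean, showing how prior knowledge is blended with the data. The credible interval is a probability statement about the parameter itself, whereas the confidence interval describes the long-run behaviour of the procedure.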

Q8. You have two sets of data representing the incomes of two different professions:

. Profession A: [48, 52, 55, 60, 62]

. Profession B: [45, 50, 55, 52, 47]

Perform an F-test to determine if the variances of the two professions' incomes are equal. What are your conclusions based on the F-test?

Task: Use Python to calculate the F-statistic and p-value for the given data.

Objective: Gain experience in performing F-tests and interpreting the results in terms of variance comparison.

In [1]:
import numpy as np
import scipy.stats as stats

# Data for the two professions
profession_A = np.array([48, 52, 55, 60, 62])
profession_B = np.array([45, 50, 55, 52, 47])

# Calculate the variances of the two professions
var_A = np.var(profession_A, ddof=1)  # Sample variance
var_B = np.var(profession_B, ddof=1)  # Sample variance

# Calculate the F-statistic
F_statistic = var_A / var_B

# Degrees of freedom
df_A = len(profession_A) - 1
df_B = len(profession_B) - 1

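# Calculate the two-sided p-value for the F-test (H1: the variances differ).
# Doubling the smaller tail probability makes the result independent of
# which sample is placed in the numerator.
right_tail = stats.f.sf(F_statistic, df_A, df_B)
left_tail = stats.f.cdf(F_statistic, df_A, df_B)
p_value = 2 * min(right_tail, left_tail)

# Print results
print(f"F-statistic: {F_statistic:.2f}")
print(f"p-value: {p_value:.3f}")

# Conclusion
alpha = 0.05
if p_value > alpha:
    print("Fail to reject the null hypothesis: no significant evidence that the variances differ.")
else:
    print("Reject the null hypothesis: the variances differ significantly.")


F-statistic: 2.09
p-value: 0.493
Fail to reject the null hypothesis: no significant evidence that the variances differ.

Conclusion: The sample variances are 32.8 (Profession A) and 15.7 (Profession B). At the 5% level this difference is not statistically significant, so the data are consistent with equal variances.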


Q9. Conduct a one-way ANOVA to test whether there are any statistically significant differences in average heights between three different regions with the following data:

. Region A: [160, 162, 165, 158, 164]

. Region B: [172, 175, 170, 168, 174]

. Region C: [180, 182, 179, 185, 183]

. Task: Write Python code to perform the one-way ANOVA and interpret the results.

. Objective: Learn how to perform one-way ANOVA using Python and interpret F-statistic and p-value. 

In [2]:
import numpy as np
import scipy.stats as stats

# Data for the three regions
region_A = np.array([160, 162, 165, 158, 164])
region_B = np.array([172, 175, 170, 168, 174])
region_C = np.array([180, 182, 179, 185, 183])

# Collect the groups in a list so they can be unpacked into f_oneway
data = [region_A, region_B, region_C]

# Perform one-way ANOVA
F_statistic, p_value = stats.f_oneway(*data)

# Print results
print(f"F-statistic: {F_statistic:.2f}")
print(f"p-value: {p_value:.6e}")

# Conclusion
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: there are significant differences in average heights among the regions.")
else:
    print("Fail to reject the null hypothesis: no significant differences in average heights among the regions.")


F-statistic: 67.87
p-value: 2.870664e-07
Reject the null hypothesis: there are significant differences in average heights among the regions.
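
A significant ANOVA result says that at least one mean differs but not which one. One possible follow-up, sketched here under the assumption that a Bonferroni correction is acceptable, is to run pairwise pooled-variance t-tests and multiply each p-value by the number of comparisons. Other post-hoc procedures, such as Tukey's HSD, are common alternatives.

```python
from itertools import combinations

import numpy as np
from scipy import stats

regions = {
    "A": np.array([160, 162, 165, 158, 164]),
    "B": np.array([172, 175, 170, 168, 174]),
    "C": np.array([180, 182, 179, 185, 183]),
}

alpha = 0.05
pairs = list(combinations(regions, 2))  # (A, B), (A, C), (B, C)
adjusted_p = {}

for name_1, name_2 in pairs:
    t_stat, p_raw = stats.ttest_ind(regions[name_1], regions[name_2])
    # Bonferroni: scale by the number of comparisons, capped at 1.
    p_adj = min(p_raw * len(pairs), 1.0)
    adjusted_p[(name_1, name_2)] = p_adj
    verdict = "different" if p_adj < alpha else "not distinguishable"
    print(f"Region {name_1} vs Region {name_2}: "
          f"adjusted p = {p_adj:.3g} ({verdict})")
```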
