# _STATISTICS ADVANCE 2_

# **_1. Explain the properties of the F-distribution._**

The F-distribution is a probability distribution that arises frequently in statistical tests, particularly in the analysis of variance (ANOVA). Below are its key properties:

- **Non-Negative**: The F-distribution is defined only for non-negative values (\(x \ge 0\)). This is because it is a ratio of variance estimates, which can never be negative.
- **Asymmetry**: The F-distribution is positively skewed, with the degree of skewness depending on the degrees of freedom in the numerator (\(d_1\)) and denominator (\(d_2\)).
- **Degrees of Freedom**: The shape of the F-distribution is determined by two parameters: \(d_1\) (degrees of freedom of the numerator) and \(d_2\) (degrees of freedom of the denominator).
- **Right-Tailed**: Most tests using the F-distribution are right-tailed because we are interested in whether the observed variance is significantly larger than expected.
- **Mean and Variance**:
  - Mean: The mean of the F-distribution is exactly \( \frac{d_2}{d_2 - 2} \) for \(d_2 > 2\); it depends only on the denominator degrees of freedom and approaches 1 as \(d_2\) grows.
  - Variance: The variance of the F-distribution is \( \frac{2d_2^2(d_1 + d_2 - 2)}{d_1(d_2 - 2)^2(d_2 - 4)} \) for \(d_2 > 4\).
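The mean and variance formulas above can be checked numerically. The following is a minimal sketch using `scipy.stats.f`; the degrees of freedom (\(d_1 = 5\), \(d_2 = 10\)) are arbitrary values chosen for illustration, not taken from the questions.

```python
from scipy.stats import f

# Hypothetical degrees of freedom (d2 > 4 so that both moments exist)
d1, d2 = 5, 10

# Closed-form mean and variance from the formulas above
mean_formula = d2 / (d2 - 2)
var_formula = (2 * d2**2 * (d1 + d2 - 2)) / (d1 * (d2 - 2) ** 2 * (d2 - 4))

# Mean and variance as reported by SciPy for the same distribution
mean_scipy, var_scipy = f.stats(d1, d2, moments="mv")

print("Mean (formula, SciPy):", mean_formula, float(mean_scipy))
print("Variance (formula, SciPy):", var_formula, float(var_scipy))
```

The two columns should agree up to floating-point rounding.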

---

# **_2. In which types of statistical tests is the F-distribution used, and why is it appropriate for these tests?_**

The F-distribution is primarily used in the following statistical tests:

- **Analysis of Variance (ANOVA)**:
  - Used to test whether the means of multiple groups are significantly different.
  - The F-distribution is appropriate because the test statistic is the ratio of between-group variance to within-group variance, and this ratio follows an F-distribution when all group means are equal.

- **F-Test for Variances**:
  - Used to compare the variances of two populations.
  - The F-distribution is suitable because it models the ratio of two sample variances.

- **Regression Analysis**:
  - The F-test in regression evaluates the overall significance of a regression model by comparing the explained variance to unexplained variance.

### Why It’s Appropriate:
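The F-distribution is appropriate because each of these tests reduces to a ratio of two variance estimates. When the null hypothesis holds and the underlying data are normally distributed, that ratio follows an F-distribution, so comparing the observed ratio with the distribution shows whether it is larger than random sampling variation alone would plausibly produce.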
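A small simulation can make this concrete. The sketch below assumes two samples drawn from the same normal population (so the true variances are equal) and checks how often the ratio of sample variances exceeds the 95th percentile of the matching F-distribution. The sample sizes, seed, and number of simulations are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n1, n2 = 8, 12                  # hypothetical sample sizes
n_sim = 20000                   # number of simulated sample pairs

# Both samples come from the same normal population: equal true variances
x = rng.normal(loc=0.0, scale=1.0, size=(n_sim, n1))
y = rng.normal(loc=0.0, scale=1.0, size=(n_sim, n2))

# Ratio of sample variances for each simulated pair
ratios = x.var(axis=1, ddof=1) / y.var(axis=1, ddof=1)

# 95th percentile of the F-distribution with (n1 - 1, n2 - 1) degrees of freedom
critical = f.ppf(0.95, n1 - 1, n2 - 1)
exceed_rate = np.mean(ratios > critical)

print("Critical value:", critical)
print("Share of simulated ratios above it:", exceed_rate)
```

If the ratio really follows the F-distribution, the printed share should be close to 0.05.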

---

# **_3. What are the key assumptions required for conducting an F-test to compare the variances of two populations?_**

The F-test to compare variances relies on several key assumptions:

1. **Normality**: The data in both populations should follow a normal distribution.
2. **Independence**: The samples from the two populations must be independent of each other.
3. **Random Sampling**: The data should be collected through random sampling.
4. **Homogeneity of Variances**: Equal variances is the null hypothesis that this F-test evaluates, not an assumption of the test itself. It becomes an assumption in procedures that build on the result, such as the pooled-variance t-test and ANOVA.

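Violations of these assumptions may lead to incorrect conclusions. The F-test for variances is particularly sensitive to departures from normality, so it is worth checking that assumption before relying on the result.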
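As a rough illustration of checking these assumptions, the sketch below runs a Shapiro-Wilk normality test on each sample and then Levene's test, which compares variances and is generally less sensitive to non-normality than the F-test. The two samples are made-up values for illustration only.

```python
from scipy import stats

# Hypothetical samples, invented for this example
sample_1 = [12.1, 11.8, 12.6, 13.0, 12.4, 11.9, 12.7, 12.2]
sample_2 = [10.9, 12.5, 13.8, 11.2, 14.1, 12.0, 13.3, 10.5]

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution
for name, sample in [("sample_1", sample_1), ("sample_2", sample_2)]:
    w_stat, p_norm = stats.shapiro(sample)
    print(f"{name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p_norm:.3f}")

# Levene's test (median-centered): the null hypothesis is that the variances are equal
lev_stat, p_lev = stats.levene(sample_1, sample_2, center="median")
print(f"Levene statistic = {lev_stat:.3f}, p = {p_lev:.3f}")
```

With samples this small, a normality test has little power, so a non-significant result should not be read as proof of normality.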

---

# **_4. What is the purpose of ANOVA, and how does it differ from a t-test?_**

### Purpose of ANOVA:
ANOVA (Analysis of Variance) is used to test whether there are significant differences among the means of three or more groups. It helps determine if at least one group mean is significantly different without testing every pair individually.

### Differences Between ANOVA and t-Test:
- **Number of Groups**:
  - t-Test: Compares the means of two groups.
  - ANOVA: Compares the means of three or more groups (with only two groups it gives the same result as the pooled-variance t-test).
- **Error Rate**:
  - t-Test: Conducting multiple t-tests increases the Type I error rate.
  - ANOVA: Controls the Type I error rate by testing all groups simultaneously.
- **Output**:
  - t-Test: Provides a direct comparison of two means.
  - ANOVA: Provides an overall test of differences among group means.
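The link between the two tests can be seen in the two-group case, where the one-way ANOVA F-statistic equals the square of the pooled-variance t-statistic and the p-values coincide. The sketch below illustrates this with two made-up groups; the data are hypothetical.

```python
from scipy import stats

# Hypothetical data for two groups
group_1 = [5.1, 4.9, 5.6, 5.8, 6.0]
group_2 = [6.2, 6.8, 7.1, 6.5, 7.4]

# Independent two-sample t-test (pooled variance, SciPy's default equal_var=True)
t_stat, p_t = stats.ttest_ind(group_1, group_2)

# One-way ANOVA on the same two groups
f_stat, p_f = stats.f_oneway(group_1, group_2)

print("t squared:", t_stat**2)
print("F:        ", f_stat)
print("p (t-test):", p_t)
print("p (ANOVA): ", p_f)
```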

---

# **_5. Explain when and why you would use a one-way ANOVA instead of multiple t-tests when comparing more than two groups._**

### When to Use One-Way ANOVA:
Use one-way ANOVA when comparing the means of three or more groups that are categorized by a single independent variable.

### Why Use One-Way ANOVA:
- **Type I Error Control**: Conducting multiple t-tests increases the risk of Type I error (false positives). One-way ANOVA addresses this issue by testing all group means simultaneously.
- **Efficiency**: One-way ANOVA provides a single test for group differences rather than performing multiple pairwise comparisons.
- **Insights**: ANOVA identifies overall differences among groups, which can be followed up with post-hoc tests if needed.
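To see how quickly the Type I error risk grows, the sketch below computes the family-wise error rate for all pairwise comparisons among \(k\) groups. It assumes the tests are independent, which pairwise t-tests on shared groups are not, so the figures are an approximation. The helper name `familywise_error_rate` is introduced here for illustration.

```python
from math import comb


def familywise_error_rate(k, alpha=0.05):
    """Probability of at least one false positive across all pairwise
    tests among k groups, assuming the tests are independent."""
    m = comb(k, 2)  # number of pairwise comparisons
    return 1 - (1 - alpha) ** m


for k in (3, 4, 5, 10):
    print(f"{k} groups: {comb(k, 2)} tests, "
          f"family-wise error rate = {familywise_error_rate(k):.3f}")
```

Even with three groups the rate under this assumption is already about 0.14 rather than the nominal 0.05.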

---


# **_6. Explain how variance is partitioned in ANOVA into between-group variance and within-group variance. How does this partitioning contribute to the calculation of the F-statistic?_**

### Partitioning of Variance:
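- **Total Variance**: The total variability in the data, measured as the total sum of squared deviations from the grand mean, splits exactly into two components (\(SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}}\)):
  - **Between-Group Variance**: Measures the variability of the group means around the grand mean.
  - **Within-Group Variance**: Measures the variability of individual observations around their own group mean.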

### Contribution to the F-Statistic:
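- The F-statistic is the ratio of the between-group mean square to the within-group mean square, where each mean square is a sum of squares divided by its degrees of freedom (\(k\) groups, \(N\) total observations):
  \[
  F = \frac{\text{Between-Group Variance}}{\text{Within-Group Variance}} = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
  \]
- A large F-value indicates that between-group variability is large relative to within-group variability, suggesting that at least one group mean differs from the others. Whether it is significant is judged against the F-distribution with \(k - 1\) and \(N - k\) degrees of freedom.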
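The partition can be verified by hand. The following sketch computes the between-group and within-group sums of squares for three small made-up groups, confirms that they add up to the total, and compares the resulting F-statistic with `scipy.stats.f_oneway`. The data and variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of unequal size
groups = [
    np.array([4.0, 5.5, 6.0, 5.0]),
    np.array([7.0, 8.5, 9.0]),
    np.array([10.0, 11.0, 12.5, 11.5, 10.5]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)       # number of groups
n_total = all_values.size  # total number of observations

# Between-group sum of squares: group means around the grand mean, weighted by group size
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Total sum of squares: observations around the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# Mean squares and the F-statistic
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

f_scipy, p_scipy = stats.f_oneway(*groups)

print("SS_between + SS_within:", ss_between + ss_within)
print("SS_total:              ", ss_total)
print("F (manual):", f_manual)
print("F (SciPy): ", f_scipy)
```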

---

# **_7. Compare the classical (frequentist) approach to ANOVA with the Bayesian approach. What are the key differences in terms of how they handle uncertainty, parameter estimation, and hypothesis testing?_**

### Key Differences:

| Aspect                     | Frequentist ANOVA                       | Bayesian ANOVA                           |
|----------------------------|-----------------------------------------|-----------------------------------------|
| **Uncertainty**            | Uses p-values and confidence intervals to handle uncertainty. | Models uncertainty directly using probability distributions. |
| **Parameter Estimation**   | Estimates fixed parameters (e.g., means, variances) based on sample data. | Uses prior distributions combined with sample data (posterior distributions). |
| **Hypothesis Testing**     | Tests a null hypothesis (e.g., all means are equal) using F-statistics and p-values. | Tests hypotheses by calculating posterior probabilities or Bayes factors. |
| **Interpretation**         | Results are interpreted in terms of rejecting or failing to reject the null hypothesis. | Results are interpreted probabilistically, such as the probability of one model being true compared to another. |
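| **Assumptions**            | Relies on assumptions such as normality, independence, and equal variances. | Still requires a likelihood model (often the same normal model) plus priors; it can incorporate prior knowledge, and the model can be extended (e.g., heavier-tailed errors) when the usual assumptions are doubtful. |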

### Conclusion:
The Bayesian approach provides a more flexible framework by incorporating prior knowledge and offering probabilistic interpretations. However, it is computationally intensive and requires careful specification of priors.

---

# **_8. Question: You have two sets of data representing the incomes of two different professions:_**
- Profession A: [48, 52, 55, 60, 62]
- Profession B: [45, 50, 55, 52, 47]

**Perform an F-test to determine if the variances of the two professions' incomes are equal. What are your conclusions based on the F-test?**

**Task:** Use Python to calculate the F-statistic and p-value for the given data.

__Objective:__ Gain experience in performing F-tests and interpreting the results in terms of variance comparison.

In [2]:
import numpy as np
from scipy.stats import f

# Data for both professions
profession_a = [48, 52, 55, 60, 62]
profession_b = [45, 50, 55, 52, 47]

# Step 1: Calculate the variances of both datasets
var_a = np.var(profession_a, ddof=1)  # Variance of Profession A (sample variance)
var_b = np.var(profession_b, ddof=1)  # Variance of Profession B (sample variance)

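# Step 2: Calculate the F-statistic with the larger variance in the numerator,
# keeping each variance paired with its own degrees of freedom
if var_a >= var_b:
    f_statistic = var_a / var_b
    dof_num, dof_den = len(profession_a) - 1, len(profession_b) - 1
else:
    f_statistic = var_b / var_a
    dof_num, dof_den = len(profession_b) - 1, len(profession_a) - 1

# Step 3: Calculate the two-tailed p-value (capped at 1)
p_value = min(1.0, 2 * (1 - f.cdf(f_statistic, dof_num, dof_den)))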

# Output the results
print("Variance of Profession A:", var_a)
print("Variance of Profession B:", var_b)
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Conclusion: Reject the null hypothesis. The variances differ significantly.")
else:
    print("Conclusion: Fail to reject the null hypothesis. There is no significant evidence that the variances differ.")


Variance of Profession A: 32.8
Variance of Profession B: 15.7
F-Statistic: 2.089171974522293
p-value: 0.49304859900533904
Conclusion: Fail to reject the null hypothesis. There is no significant evidence that the variances differ.


# **_9. Question: Conduct a one-way ANOVA to test whether there are any statistically significant differences in average heights between three different regions with the following data:_**
- Region A: [160, 162, 165, 158, 164]
- Region B: [172, 175, 170, 168, 174]
- Region C: [180, 182, 179, 185, 183]

**Task:** Write Python code to perform the one-way ANOVA and interpret the results.

__Objective:__ Learn how to perform one-way ANOVA using Python and interpret the F-statistic and p-value.

In [3]:
from scipy import stats

# Data for the heights in three regions
region_a = [160, 162, 165, 158, 164]
region_b = [172, 175, 170, 168, 174]
region_c = [180, 182, 179, 185, 183]

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(region_a, region_b, region_c)

# Output the results
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

# Interpretation of the results
alpha = 0.05  # Significance level
if p_value < alpha:
    print("Conclusion: Reject the null hypothesis. There is a significant difference in the average heights between the regions.")
else:
    print("Conclusion: Fail to reject the null hypothesis. There is no significant difference in the average heights between the regions.")


F-Statistic: 67.87330316742101
p-value: 2.8706641879370266e-07
Conclusion: Reject the null hypothesis. There is a significant difference in the average heights between the regions.
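The ANOVA result says that at least one region differs but not which ones. One possible follow-up, sketched below, is pairwise t-tests with a Bonferroni-adjusted significance level. This is only one of several post-hoc approaches and is an addition for illustration, not part of the original task; the dictionary and variable names are assumptions.

```python
from itertools import combinations
from scipy import stats

# Height data from the question above
regions = {
    "A": [160, 162, 165, 158, 164],
    "B": [172, 175, 170, 168, 174],
    "C": [180, 182, 179, 185, 183],
}

alpha = 0.05
pairs = list(combinations(regions, 2))

# Bonferroni correction: divide alpha by the number of comparisons
alpha_adjusted = alpha / len(pairs)

pairwise_p = {}
for first, second in pairs:
    t_stat, p = stats.ttest_ind(regions[first], regions[second])
    pairwise_p[(first, second)] = p
    verdict = "significant" if p < alpha_adjusted else "not significant"
    print(f"Region {first} vs Region {second}: t = {t_stat:.3f}, "
          f"p = {p:.6f} ({verdict} at adjusted alpha = {alpha_adjusted:.4f})")
```

The Bonferroni adjustment is conservative; a dedicated procedure such as Tukey's HSD may be preferable when many groups are compared.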
