# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

ANOVA (Analysis of Variance) is a statistical test used to determine if there are significant differences between the means of three or more independent groups. For ANOVA results to be valid, several key assumptions must be met. If these assumptions are violated, the validity of the results can be compromised. The assumptions are:

### 1. **Independence of Observations**
   - **Assumption**: The data points (observations) in each group must be independent of each other. This means the value of one observation should not influence or be influenced by the value of another observation.
   - **Example of Violation**: A violation of this assumption can occur if the same subjects are included in multiple groups (repeated measures), or if subjects within a group are related or influence each other (e.g., family members or co-workers).
   - **Impact**: Lack of independence can lead to underestimated variability within groups, inflating the F-ratio and increasing the risk of false positives (Type I errors).

### 2. **Normality**
   - **Assumption**: The residuals (differences between observed and expected values) for each group should follow a normal distribution. This is most important when sample sizes are small.
   - **Example of Violation**: If the data are heavily skewed or contain extreme outliers, the assumption of normality is violated. For example, income data often violates normality because it is strongly right-skewed, with a small number of very high incomes.
   - **Impact**: If the normality assumption is violated, the ANOVA may yield inaccurate p-values. However, ANOVA is robust to moderate violations of normality, especially with larger sample sizes (Central Limit Theorem).

### 3. **Homogeneity of Variances (Homoscedasticity)**
   - **Assumption**: The variance (spread) of the residuals should be approximately equal across all groups.
   - **Example of Violation**: A violation of this assumption occurs if one group has much higher variance than others. For example, in an experiment measuring performance under different levels of stress, one group may have highly variable responses compared to others.
   - **Impact**: Unequal variances (heteroscedasticity) can distort the F-statistic and lead to incorrect conclusions. The problem is most serious when sample sizes are also unequal across groups.

### 4. **Random Sampling**
   - **Assumption**: The data should be collected through random sampling, ensuring that the sample represents the population being studied.
   - **Example of Violation**: A violation occurs if the sample is not randomly selected, for instance, if a study on consumer preferences is conducted by only surveying people from one neighborhood or demographic.
   - **Impact**: Non-random sampling can lead to biased results that do not accurately reflect the population, affecting the generalizability of the results.

### Violations and Remedies:
- **Independence Violation**: Use a repeated measures ANOVA or mixed-effects models if observations are related.
- **Normality Violation**: Apply data transformation (e.g., log or square root transformations), or use a non-parametric alternative like the Kruskal-Wallis test.
- **Homogeneity of Variances Violation**: Use Welch’s ANOVA or the Brown-Forsythe F-test, neither of which assumes equal variances.
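
The checks behind these remedies can be sketched in Python. This is a minimal illustration, assuming SciPy is installed and using made-up values for three hypothetical groups. It runs a Shapiro-Wilk test on the residuals, Levene's test for equal variances, and the Kruskal-Wallis test as a non-parametric fallback.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three independent groups (illustrative only)
groups = {
    "A": np.array([23, 20, 22, 24, 25]),
    "B": np.array([30, 32, 31, 29, 28]),
    "C": np.array([35, 37, 34, 36, 38]),
}

# Residuals: each observation minus its own group mean
residuals = np.concatenate([g - g.mean() for g in groups.values()])

# Normality of residuals (Shapiro-Wilk); a small p-value suggests non-normality
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances (Levene's test, median-centered by default)
levene_stat, levene_p = stats.levene(*groups.values())

# Non-parametric alternative to one-way ANOVA (Kruskal-Wallis)
kw_stat, kw_p = stats.kruskal(*groups.values())

print(f"Shapiro-Wilk p-value:   {shapiro_p:.3f}")
print(f"Levene p-value:         {levene_p:.3f}")
print(f"Kruskal-Wallis H = {kw_stat:.2f}, p-value = {kw_p:.4f}")
```

With samples this small, the tests have little power, so a non-significant result should not be read as proof that an assumption holds. Plots of the residuals are a useful complement.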

### Conclusion:
Violating these assumptions can undermine the validity of ANOVA results, leading to incorrect conclusions about group differences. Careful attention to the assumptions and appropriate statistical adjustments can mitigate the effects of these violations.

# Q2. What are the three types of ANOVA, and in what situations would each be used?

ANOVA (Analysis of Variance) is a powerful statistical method used to test for differences among group means in various situations. There are three main types of ANOVA, each designed for specific experimental designs or data structures. Here are the three types and the situations in which each is used:

### 1. **One-Way ANOVA (Single-Factor ANOVA)**
   - **Description**: One-way ANOVA is used to compare the means of three or more independent groups based on a single independent variable (factor). This test determines whether there are statistically significant differences between the group means.
   - **When to Use**: 
     - When there is **one categorical independent variable** (factor) with **three or more levels** (groups). With only two levels, the test is equivalent to an independent-samples t-test.
     - The dependent variable should be continuous (e.g., weight, score, time).
     - Example: A researcher wants to compare the average test scores of students from three different schools (School A, School B, School C) to determine if the school attended has an effect on test performance.
   - **Hypotheses**:
     - Null Hypothesis (\(H_0\)): The means of all groups are equal.
     - Alternative Hypothesis (\(H_a\)): At least one group mean is different from the others.
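
A one-way ANOVA like the school example can be sketched with `scipy.stats.f_oneway`. The scores below are invented for illustration and are not taken from any real study.

```python
from scipy import stats

# Hypothetical test scores for three schools (illustrative values only)
school_a = [78, 82, 75, 80, 85]
school_b = [88, 90, 85, 87, 91]
school_c = [70, 72, 68, 74, 71]

# One-way ANOVA: H0 says all three population means are equal
f_stat, p_value = stats.f_oneway(school_a, school_b, school_c)

print(f"F = {f_stat:.2f}, p = {p_value:.6f}")
if p_value < 0.05:
    print("Reject H0: at least one school mean differs.")
else:
    print("Fail to reject H0: no evidence of a difference in means.")
```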

### 2. **Two-Way ANOVA (Factorial ANOVA)**
   - **Description**: Two-way ANOVA is used to compare group means when there are **two independent variables (factors)**. It allows you to evaluate the individual and interaction effects of the two factors on the dependent variable.
   - **When to Use**: 
     - When you have **two independent variables**, each with two or more levels, and a **continuous dependent variable**.
     - You can assess both **main effects** (the effect of each independent variable on its own) and **interaction effects** (whether the effect of one independent variable depends on the level of the other variable).
     - Example: A researcher wants to investigate the effect of both **diet type** (vegetarian, vegan, omnivore) and **exercise frequency** (low, moderate, high) on weight loss. Two-way ANOVA will help determine if these factors individually or interactively affect weight loss.
   - **Hypotheses**:
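     - Null Hypotheses (\(H_0\)): A two-way ANOVA tests three separate null hypotheses: the means are equal across the levels of the first factor, the means are equal across the levels of the second factor, and there is no interaction between the two factors.
     - Alternative Hypotheses (\(H_a\)): For each test, the corresponding effect is present, i.e., at least one mean differs across the levels of a factor, or the effect of one factor depends on the level of the other.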

### 3. **Repeated Measures ANOVA**
   - **Description**: Repeated measures ANOVA is used when the same subjects are measured multiple times under different conditions or over time. This test accounts for the **within-subject correlation** (i.e., the fact that multiple observations from the same subject are related) and compares the means across different time points or conditions.
   - **When to Use**: 
     - When you have a **within-subject design**, where each subject is exposed to all levels of the independent variable.
     - The repeated measurements could be over time (e.g., measuring the effect of a drug at multiple time points) or under different conditions (e.g., different tasks performed by the same individuals).
     - Example: A clinical trial measures the blood pressure of patients at three different time points (baseline, 1 month, 3 months) after starting a new medication. Since the same patients are measured at each time point, a repeated measures ANOVA is appropriate to assess if there is a significant change over time.
   - **Hypotheses**:
     - Null Hypothesis (\(H_0\)): The mean responses at each time point (or condition) are equal.
     - Alternative Hypothesis (\(H_a\)): The mean responses at different time points (or conditions) are not equal.

### Situations for Each Type of ANOVA:

1. **One-Way ANOVA**: Use when you have a single factor with multiple groups and you want to compare the means across these groups.
   - **Example**: Testing if different diets lead to different average weight loss in three separate groups.
   
2. **Two-Way ANOVA**: Use when you have two factors and want to assess the individual and combined effects of those factors.
   - **Example**: Investigating the effect of both **diet type** and **exercise level** on weight loss.

3. **Repeated Measures ANOVA**: Use when you measure the same subjects multiple times, either across time or under different conditions.
   - **Example**: Assessing the impact of a training program on performance, measured at three different time intervals for the same group of individuals.

### Summary Table:

| **ANOVA Type**           | **Factors**     | **Use Case** | **Example** |
|--------------------------|-----------------|--------------|-------------|
| **One-Way ANOVA**         | 1 (Single factor) | Compare means of different groups | Compare exam scores across three schools |
| **Two-Way ANOVA**         | 2 (Factorial)   | Compare means across two factors and their interaction | Compare diet and exercise on weight loss |
| **Repeated Measures ANOVA** | 1 (Repeated measures) | Same subjects measured over time or conditions | Measure blood pressure at multiple time points |

Each type of ANOVA is suited for different experimental designs, allowing you to assess multiple factors or time-based changes efficiently.

# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA (Analysis of Variance) refers to dividing the total variability in the data into different components to identify the sources of variation. This concept is fundamental because it helps to determine whether the differences between group means are due to the experimental factor(s) or merely random noise (within-group variation). Understanding how variance is partitioned is key to interpreting ANOVA results.

### Components of Variance in ANOVA

In ANOVA, the total variability in the data is broken down into two main components, which together make up the total:

1. **Between-Group Variance (Explained Variance)**:
   - This represents the variability **due to differences between the group means**. It captures how much of the total variability is explained by the independent variable (factor). If the group means are far apart, the between-group variance will be large.
   - In other words, it reflects the variability **caused by the effect of the independent variable**.
   - Example: If we are comparing test scores of students from three different schools, the between-group variance measures the differences in average scores between the schools.

2. **Within-Group Variance (Unexplained Variance, Error Variance)**:
   - This represents the variability **within each group**. It measures how much the individual observations vary within each group around their respective group mean.
   - This component reflects the random variability or noise that is not explained by the independent variable.
   - Example: In the same test score comparison, within-group variance measures how much test scores vary among students within each school, unrelated to the difference in schools.

3. **Total Variance**:
   - The total variance represents the overall variability in the dataset, combining both between-group and within-group variance. It reflects the total amount of variation among all observations, without considering group membership.
   - Mathematically, the total sum of squares is the sum of the between-group and within-group sums of squares. It is the sums of squares that add up, not the mean squares.

### Partitioning Variance in ANOVA

The partitioning of variance in ANOVA can be expressed as:

\[
\text{Total Sum of Squares (SST)} = \text{Between-Group Sum of Squares (SSB)} + \text{Within-Group Sum of Squares (SSW)}
\]

- **Total Sum of Squares (SST)**: Measures the total variation in the data, considering all observations regardless of group membership.
  \[
  SST = \sum (\text{Observation} - \text{Overall Mean})^2
  \]

- **Between-Group Sum of Squares (SSB)**: Measures the variation **between group means**. It quantifies how much group means differ from the overall mean.
  \[
  SSB = \sum (\text{Group Mean} - \text{Overall Mean})^2 \times \text{Group Size}
  \]

- **Within-Group Sum of Squares (SSW)**: Measures the variation **within groups**, i.e., how much individual observations within a group differ from their group mean.
  \[
  SSW = \sum (\text{Observation} - \text{Group Mean})^2
  \]

### Importance of Partitioning of Variance in ANOVA

1. **F-Ratio Calculation**:
   - The partitioning of variance is crucial for calculating the **F-ratio**, which is the test statistic in ANOVA. The F-ratio is the ratio of the **between-group mean square** (MSB, which is SSB divided by its degrees of freedom) to the **within-group mean square** (MSW, which is SSW divided by its degrees of freedom):
   \[
   F = \frac{\text{Mean Square Between Groups (MSB)}}{\text{Mean Square Within Groups (MSW)}}
   \]
   - If the between-group variance is significantly larger than the within-group variance, the F-ratio will be large, suggesting that the differences between group means are unlikely due to chance and are statistically significant.

2. **Testing Hypotheses**:
   - The goal of ANOVA is to test whether the group means are significantly different. Partitioning variance helps isolate the effect of the independent variable(s) by comparing the between-group variance (explained by the factor) with the within-group variance (random noise or error).
   - **Null Hypothesis (\(H_0\))**: The group means are equal. Under \(H_0\), MSB and MSW both estimate the same error variance, so the F-ratio is expected to be close to 1.
   - **Alternative Hypothesis (\(H_a\))**: At least one group mean is different. In that case MSB tends to exceed MSW, giving an F-ratio noticeably larger than 1.

3. **Identifying the Source of Variation**:
   - Partitioning variance helps us understand where the variation in the data is coming from:
     - **Large Between-Group Variance**: Suggests that the factor being tested has a significant effect on the dependent variable.
     - **Large Within-Group Variance**: Suggests that there is a lot of variability within groups that is not explained by the factor being tested, possibly indicating measurement errors or uncontrolled variables.

4. **ANOVA Table**:
   - The results of ANOVA are typically presented in an **ANOVA table**, where the partitioned variance (SSB and SSW) is used to calculate the mean squares (MSB and MSW), the F-ratio, and the corresponding p-value. Understanding the partitioning of variance allows you to interpret the ANOVA table accurately.

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-Ratio |
|---------------------|----------------|--------------------|-------------|---------|
| Between Groups       | SSB            | \(k - 1\)          | MSB = SSB / \(k - 1\) | \(F = \frac{\text{MSB}}{\text{MSW}}\) |
| Within Groups        | SSW            | \(n - k\)          | MSW = SSW / \(n - k\) |         |
| Total                | SST            | \(n - 1\)          |             |         |

  Where:
  - \(k\) = Number of groups
  - \(n\) = Total number of observations
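
The entries of the ANOVA table can be computed directly from the definitions above. The following is a minimal sketch, assuming NumPy and SciPy are available and using illustrative data for three groups of five observations. The p-value comes from the upper tail of the F distribution with \(k - 1\) and \(n - k\) degrees of freedom.

```python
import numpy as np
from scipy import stats

# Illustrative data: three groups with five observations each
groups = [
    np.array([23, 20, 22, 24, 25]),
    np.array([30, 32, 31, 29, 28]),
    np.array([35, 37, 34, 36, 38]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Sums of squares
ssb = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
sst = ((all_values - grand_mean) ** 2).sum()

# Mean squares, F-ratio, and p-value
msb = ssb / (k - 1)
msw = ssw / (n - k)
f_ratio = msb / msw
p_value = stats.f.sf(f_ratio, k - 1, n - k)

print(f"SSB = {ssb:.1f}, SSW = {ssw:.1f}, SST = {sst:.1f}")
print(f"MSB = {msb:.2f}, MSW = {msw:.2f}, F = {f_ratio:.2f}, p = {p_value:.2e}")
```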

### Example of Violations:
- **Unequal Within-Group Variance**: If there is more variability within one group than others (heteroscedasticity), the F-ratio may be distorted, leading to incorrect conclusions.
- **Non-Independence**: If observations within groups are not independent (e.g., repeated measures), the within-group variance will be underestimated, inflating the F-ratio.

### Conclusion

The partitioning of variance in ANOVA is central to identifying whether the differences between group means are significant or due to random variation. It allows for the calculation of the F-statistic, which is used to test hypotheses about the group means, making it a fundamental concept in understanding the logic and results of ANOVA.

# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

1. Total Sum of Squares (SST):
This represents the total variance in the data, which is the sum of the squared differences between each individual observation and the overall mean.
\[
SST = \sum (\text{Observation} - \text{Overall Mean})^2
\]

2. Explained Sum of Squares (SSE) (Between-Group Sum of Squares):
This measures the variation between the group means and the overall mean. It quantifies how much of the total variation is explained by the group differences.
\[
SSE = \sum (\text{Group Mean} - \text{Overall Mean})^2 \times \text{Group Size}
\]

3. Residual Sum of Squares (SSR) (Within-Group Sum of Squares):
This represents the variation within the groups, or the unexplained variance. It quantifies the random variability or noise within the data that is not explained by the group differences.
\[
SSR = \sum (\text{Observation} - \text{Group Mean})^2
\]

In [1]:
import numpy as np
import pandas as pd

# Example data for three groups
data = {
    'Group': ['A']*5 + ['B']*5 + ['C']*5,
    'Values': [23, 20, 22, 24, 25, 30, 32, 31, 29, 28, 35, 37, 34, 36, 38]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Calculate the overall mean
overall_mean = df['Values'].mean()

# Calculate group means
group_means = df.groupby('Group')['Values'].mean()

# Calculate Total Sum of Squares (SST)
df['Overall Deviation'] = df['Values'] - overall_mean
SST = np.sum(df['Overall Deviation']**2)

# Calculate Explained Sum of Squares (SSE)
df['Group Mean'] = df['Group'].map(group_means)
df['Group Deviation'] = df['Group Mean'] - overall_mean
SSE = np.sum(df['Group Deviation']**2)

# Calculate Residual Sum of Squares (SSR)
df['Residual'] = df['Values'] - df['Group Mean']
SSR = np.sum(df['Residual']**2)

# Display the results
print(f"Total Sum of Squares (SST): {SST}")
print(f"Explained Sum of Squares (SSE): {SSE}")
print(f"Residual Sum of Squares (SSR): {SSR}")


Total Sum of Squares (SST): 471.6
Explained Sum of Squares (SSE): 436.79999999999995
Residual Sum of Squares (SSR): 34.8


# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

### Steps for Calculating Main and Interaction Effects

- **Main Effects**:
  - **Factor A**: The effect of different levels of the first factor on the dependent variable, ignoring the second factor.
  - **Factor B**: The effect of different levels of the second factor on the dependent variable, ignoring the first factor.
- **Interaction Effect**: The effect of both factors combined, i.e., whether the impact of one factor varies depending on the levels of the other factor.
- **Python Libraries**: We can use `statsmodels` and `pandas` to perform a two-way ANOVA and calculate the main and interaction effects.

### Example:

Let's assume we have a dataset where we measure a response variable (e.g., plant growth) based on two factors:

- **Factor A**: Type of fertilizer (3 levels: Fertilizer A, Fertilizer B, Fertilizer C)
- **Factor B**: Amount of water (2 levels: Low, High)

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data for two-way ANOVA
data = {
    'Fertilizer': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']*2,
    'Water': ['Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
              'High', 'High', 'High', 'High', 'High', 'High', 'High', 'High', 'High'],
    'PlantGrowth': [12, 15, 14, 10, 13, 11, 9, 8, 7, 18, 20, 17, 16, 15, 19, 14, 13, 12]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Fit the two-way ANOVA model (including interaction)
model = ols('PlantGrowth ~ C(Fertilizer) + C(Water) + C(Fertilizer):C(Water)', data=df).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)  # Type 2 ANOVA for main effects and interaction

# Display the ANOVA table
print(anova_table)


                            sum_sq    df       F    PR(>F)
C(Fertilizer)            93.000000   2.0  20.925  0.000122
C(Water)                112.500000   1.0  50.625  0.000012
C(Fertilizer):C(Water)    0.333333   2.0   0.075  0.928175
Residual                 26.666667  12.0     NaN       NaN
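
To see where the `sum_sq` column comes from, the main-effect and interaction sums of squares can also be computed by hand from the marginal and cell means. The sketch below uses only `pandas` and relies on the design being balanced (three observations in every fertilizer-by-water cell). With unequal cell sizes these simple formulas no longer apply.

```python
import pandas as pd

# Same balanced 3 x 2 design as in the statsmodels example
df = pd.DataFrame({
    'Fertilizer': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'] * 2,
    'Water': ['Low'] * 9 + ['High'] * 9,
    'PlantGrowth': [12, 15, 14, 10, 13, 11, 9, 8, 7,
                    18, 20, 17, 16, 15, 19, 14, 13, 12],
})

grand_mean = df['PlantGrowth'].mean()
n_total = len(df)

# Marginal means (main effects) and cell means (factor combinations)
fert_means = df.groupby('Fertilizer')['PlantGrowth'].mean()
water_means = df.groupby('Water')['PlantGrowth'].mean()
cell_means = df.groupby(['Fertilizer', 'Water'])['PlantGrowth'].mean()

# Observations per level / per cell (valid because the design is balanced)
n_per_fert = n_total // fert_means.size
n_per_water = n_total // water_means.size
n_per_cell = n_total // cell_means.size

# Main-effect sums of squares
ss_fert = n_per_fert * ((fert_means - grand_mean) ** 2).sum()
ss_water = n_per_water * ((water_means - grand_mean) ** 2).sum()

# Interaction: variation among cell means not explained by the main effects
ss_cells = n_per_cell * ((cell_means - grand_mean) ** 2).sum()
ss_interaction = ss_cells - ss_fert - ss_water

# Residual: variation of observations around their own cell mean
cell_mean_per_row = df.groupby(['Fertilizer', 'Water'])['PlantGrowth'].transform('mean')
ss_residual = ((df['PlantGrowth'] - cell_mean_per_row) ** 2).sum()

print(f"SS Fertilizer:  {ss_fert:.4f}")
print(f"SS Water:       {ss_water:.4f}")
print(f"SS Interaction: {ss_interaction:.4f}")
print(f"SS Residual:    {ss_residual:.4f}")
```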


# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

When interpreting the results of a one-way ANOVA, the F-statistic and the p-value are critical in determining whether there are statistically significant differences between the group means. Here’s how you would interpret the results in this scenario:

### Given:
- **F-statistic = 5.23**
- **p-value = 0.02**
- **Common significance level (α) = 0.05**

### Step-by-Step Interpretation:

1. **Null and Alternative Hypotheses**:
   - **Null Hypothesis (\(H_0\))**: There is **no difference** between the population group means. In other words, any observed differences in the sample means are due to random chance.
   - **Alternative Hypothesis (\(H_a\))**: At least one population group mean **differs** from the others.

2. **p-Value Interpretation**:
   - The **p-value** represents the probability of obtaining an F-statistic at least as extreme as 5.23, assuming that the null hypothesis is true (i.e., there are no real differences between the groups).
   - **p-value = 0.02** means that, if the null hypothesis were true, there would be a 2% chance of observing an F-statistic this large or larger. It is not the probability that the null hypothesis is true.
   
   Since the **p-value (0.02)** is less than the commonly used significance level **α = 0.05**, you **reject the null hypothesis**. This indicates that there is **evidence to suggest that at least one group mean is significantly different** from the others.

3. **F-Statistic Interpretation**:
   - The **F-statistic (5.23)** is the ratio of the **between-group variance** to the **within-group variance**. A larger F-statistic indicates that the between-group variability is relatively large compared to the within-group variability.
   - In this case, the F-statistic of 5.23 suggests that the differences between the group means are large relative to the variability within the groups.

4. **Conclusion**:
   - Since the **p-value** is less than 0.05, we **reject the null hypothesis**. This implies that there is a statistically significant difference between the group means.
   - However, the ANOVA test **does not tell you which specific groups are different from each other**. It only tells you that at least one group mean differs significantly. To determine which groups differ, you would need to conduct a **post-hoc test** (e.g., Tukey’s HSD test) to compare the groups pairwise.
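
The link between the F-statistic and the p-value can be sketched with SciPy. The question does not state the degrees of freedom, so the values below (3 groups and 17 observations, giving 2 and 14 degrees of freedom) are an assumption chosen only because they yield a p-value of about 0.02 for F = 5.23.

```python
from scipy import stats

f_stat = 5.23
alpha = 0.05

# ASSUMED degrees of freedom (not given in the question):
# 3 groups -> df_between = 2; 17 observations -> df_within = 14
df_between, df_within = 2, 14

# p-value: upper-tail probability of the F distribution
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value: F must exceed this to be significant at level alpha
critical_f = stats.f.ppf(1 - alpha, df_between, df_within)

print(f"p-value = {p_value:.4f}, critical F = {critical_f:.2f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Comparing the p-value with α and comparing the F-statistic with the critical value are equivalent decision rules, so both lead to the same conclusion.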

### Practical Interpretation:
- The ANOVA results suggest that the grouping factor is associated with a statistically significant difference in the dependent variable: the observed data would be unlikely if all group means were equal. Statistical significance alone does not indicate how large or practically important the differences are.
  
### Example Scenario:
Suppose you're testing whether different types of diets (Diet A, Diet B, Diet C) result in different weight loss outcomes. A one-way ANOVA is conducted to compare the mean weight loss between the three diets.

- The F-statistic of 5.23 and the p-value of 0.02 indicate that at least one diet's mean weight loss differs from the others. However, to identify which diets differ, a post-hoc test would be needed.

### Summary:
- **F-statistic = 5.23** and **p-value = 0.02** suggest significant differences between the group means.
- Since **p < 0.05**, you reject the null hypothesis and conclude that at least one group mean is significantly different.
- To find out which groups differ specifically, you would conduct a **post-hoc test** (e.g., Tukey’s HSD).
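
A post-hoc comparison of this kind can be sketched with `pairwise_tukeyhsd` from `statsmodels`. The weight-loss values below are invented for three hypothetical diets and are not tied to the F-statistic in the question.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical weight loss (kg) for three diets, five people each
values = np.array([
    2.1, 2.5, 1.9, 2.8, 2.3,   # Diet A
    2.4, 2.9, 2.2, 2.6, 2.7,   # Diet B
    4.0, 4.5, 3.8, 4.2, 4.4,   # Diet C
])
diets = np.repeat(['Diet A', 'Diet B', 'Diet C'], 5)

# Tukey's HSD: all pairwise comparisons with family-wise error control
result = pairwise_tukeyhsd(endog=values, groups=diets, alpha=0.05)

print(result.summary())
# result.reject holds one True/False per pair, in the order
# (Diet A, Diet B), (Diet A, Diet C), (Diet B, Diet C)
print(result.reject)
```

In this made-up data, Diet C stands apart from the other two, while Diets A and B are not distinguishable from each other.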

# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

In a **repeated measures ANOVA**, dealing with missing data is particularly important because each subject is measured multiple times, and missing values can cause issues with the analysis. There are several methods for handling missing data, each with its potential consequences. Let's explore these methods and their implications.

### 1. **Common Methods for Handling Missing Data**:

#### a. **Listwise Deletion (Complete Case Analysis)**:
- **Description**: Exclude any subject who has missing data for any of the repeated measures.
- **Consequences**:
  - **Loss of Power**: You lose statistical power because you are reducing the sample size by removing subjects with missing data.
  - **Bias**: If the missing data are not missing completely at random (MCAR), this method can introduce bias.
  - **Simplicity**: This method is easy to implement but only works well when the amount of missing data is very small.

#### b. **Pairwise Deletion**:
- **Description**: Use all available data without excluding entire subjects. Only the specific missing values are excluded for individual comparisons.
- **Consequences**:
  - **Inconsistent Sample Sizes**: Different analyses may be based on different subsets of data, which can lead to inconsistencies.
  - **Potential for Bias**: Similar to listwise deletion, if data are not MCAR, this method can lead to biased results.
  - **Complexity**: Can be more complex to implement and interpret since different analyses may be based on different sets of data points.

#### c. **Mean Imputation**:
- **Description**: Replace missing values with the mean of the observed data for that variable.
- **Consequences**:
  - **Underestimation of Variability**: Imputing the mean artificially reduces variability in the data, which can bias the results and increase the Type I error rate (false positives).
  - **Bias**: Mean imputation does not preserve the natural relationships in the data, and it can distort the results.

#### d. **Last Observation Carried Forward (LOCF)**:
- **Description**: For longitudinal or time-based data, replace the missing value with the last observed value for that subject.
- **Consequences**:
  - **Bias**: LOCF assumes that the subject’s condition has not changed since the last measurement, which can be unrealistic, leading to biased estimates.
  - **Inappropriate for Repeated Measures**: This method is generally not recommended in repeated measures ANOVA because it can artificially smooth over important variations in the data over time.

#### e. **Multiple Imputation (MI)**:
- **Description**: Generate several plausible values for each missing data point based on the observed data, which creates multiple completed datasets. Each dataset is analyzed separately and the results are pooled. This method accounts for the uncertainty around missing data.
- **Consequences**:
  - **More Accurate**: Multiple imputation preserves the variability in the data and typically leads to less bias compared to single imputation methods.
  - **Complexity**: Requires specialized software and expertise. It involves creating multiple datasets with imputed values, performing the analysis on each dataset, and then combining the results.
  - **Computationally Intensive**: Multiple imputation can be computationally expensive, especially with large datasets.

#### f. **Maximum Likelihood Estimation (MLE)**:
- **Description**: Estimate the parameters of the model based on the available data, without imputing missing values directly.
- **Consequences**:
  - **Efficient Use of Data**: MLE makes full use of the available data and tends to produce less biased estimates compared to other methods.
  - **Assumptions**: MLE relies on the assumption that the data are missing at random (MAR), which may not always be the case.
  - **Computational Complexity**: MLE can be more computationally demanding, especially with large datasets or complex models.
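
One common likelihood-based route is a linear mixed-effects model, which uses every observed measurement without requiring complete cases. The sketch below assumes `statsmodels` is installed and uses simulated data. The subject count, effect sizes, and missingness rate are all made-up values for illustration. Note that `fit()` uses REML by default, and `fit(reml=False)` would give full maximum likelihood.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated long-format data: 30 subjects measured at three time points
n_subjects = 30
subject_effect = rng.normal(0, 2, n_subjects)      # random intercept per subject
time_effect = {"T1": 0.0, "T2": 1.0, "T3": 2.0}    # assumed true time effects

rows = []
for s in range(n_subjects):
    for t, shift in time_effect.items():
        score = 10 + shift + subject_effect[s] + rng.normal(0, 1)
        rows.append({"Subject": s, "Time": t, "Score": score})
df_long = pd.DataFrame(rows)

# Drop roughly 15% of rows at random to mimic missed visits
keep = rng.random(len(df_long)) > 0.15
df_obs = df_long[keep].reset_index(drop=True)

# Random-intercept model: subjects with missing visits still contribute
model = smf.mixedlm("Score ~ C(Time)", data=df_obs, groups=df_obs["Subject"])
result = model.fit()

print(result.summary())
```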

### 2. **Consequences of Using Different Methods**:

#### a. **Bias**:
   - Some methods, such as **mean imputation** or **LOCF**, can introduce bias by making strong assumptions about the data (e.g., that the mean is a good estimate or that no changes occur over time).
   - **Listwise deletion** or **pairwise deletion** can also introduce bias if the missing data are not MCAR (Missing Completely at Random). If the missingness is related to the outcome, excluding subjects with missing data may result in non-representative data.

#### b. **Loss of Power**:
   - **Listwise deletion** reduces the sample size, which reduces the statistical power of the repeated measures ANOVA, making it harder to detect real effects.
   - In contrast, methods like **multiple imputation** and **MLE** make full use of the available data and tend to preserve statistical power.

#### c. **Type I and Type II Errors**:
   - Imputation methods that reduce variability (like mean imputation or LOCF) can increase the likelihood of **Type I errors** (false positives), leading you to incorrectly conclude that there is a significant effect.
   - Conversely, excluding data (listwise deletion) can increase the likelihood of **Type II errors** (false negatives), where you fail to detect a real effect due to a reduced sample size.

#### d. **Complexity**:
   - Some methods, such as **multiple imputation** or **MLE**, require more sophisticated statistical knowledge and software implementation but provide more accurate and reliable results.
   - **Listwise deletion** and **mean imputation** are simpler but can lead to incorrect conclusions if the assumptions behind the missing data are not met.

### 3. **Best Practices**:

- **Check Missing Data Patterns**: Before deciding on a method, it’s crucial to check whether the data are MCAR, MAR, or MNAR (Missing Not at Random). If data are missing completely at random (MCAR), simpler methods like listwise deletion can be acceptable. If data are MAR, more sophisticated methods like multiple imputation or MLE are better.
  
- **Consider the Assumptions**: Methods like multiple imputation or maximum likelihood tend to be more robust under different missing data assumptions but are also more complex.

- **Avoid Simple Imputation**: Single imputation methods like mean imputation or LOCF should generally be avoided because they underestimate variability and can introduce bias.
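
The variance shrinkage caused by mean imputation is easy to demonstrate. The sketch below uses a small made-up series with three missing values and compares the sample variance of the observed values with the variance after filling the gaps with the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with three missing values
scores = pd.Series([5.0, 3.0, 4.0, np.nan, 5.0, 6.0, np.nan, 7.0, 2.0, np.nan])

# Complete cases: simply drop the missing values
complete_cases = scores.dropna()

# Mean imputation: fill every gap with the mean of the observed values
mean_imputed = scores.fillna(scores.mean())

print(f"Observed variance:     {complete_cases.var():.3f} (n = {complete_cases.size})")
print(f"Mean-imputed variance: {mean_imputed.var():.3f} (n = {mean_imputed.size})")
```

The mean stays the same, but the imputed values add no spread while increasing the sample size, so the variance estimate shrinks and standard errors look smaller than they should.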

### Example Using Python:

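The example below fills in missing values with scikit-learn's `IterativeImputer` before running a repeated measures ANOVA. As written, it produces a single completed dataset (model-based single imputation) rather than true multiple imputation, which would repeat the imputation several times and pool the analyses: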



In [6]:
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from statsmodels.stats.anova import AnovaRM

# Example DataFrame with missing values
data = {
    'Subject': [1, 2, 3, 4, 5],
    'Time1': [5, 3, 4, None, 5],
    'Time2': [6, 2, None, 7, 6],
    'Time3': [7, None, 6, 8, 7]
}
df = pd.DataFrame(data)

# Model-based imputation (yields one completed dataset, not multiple imputations)
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed_data = imputer.fit_transform(df.drop('Subject', axis=1))

# Replace missing data with imputed values
df_imputed = df.copy()
df_imputed.iloc[:, 1:] = imputed_data

# Reshape data for repeated measures ANOVA
df_long = pd.melt(df_imputed, id_vars=['Subject'], value_vars=['Time1', 'Time2', 'Time3'],
                  var_name='Time', value_name='Score')

# Fit repeated measures ANOVA
model = AnovaRM(df_long, 'Score', 'Subject', within=['Time']).fit()

# Output results
print(model.summary())

              Anova
     F Value Num DF Den DF Pr > F
---------------------------------
Time  9.1255 2.0000 8.0000 0.0086
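
The cell above yields one completed dataset. A rough sketch of a multiple-imputation style workflow is shown below, assuming scikit-learn and statsmodels are installed. It sets `sample_posterior=True` so that each run draws different imputed values, repeats the imputation with five seeds, and collects the F-value from each analysis. Properly pooling F-statistics across imputations requires dedicated combining rules that are not implemented here, so the spread of F-values is only an informal indication of how sensitive the result is to the imputed values.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from statsmodels.stats.anova import AnovaRM

# Same small example with missing values
df = pd.DataFrame({
    'Subject': [1, 2, 3, 4, 5],
    'Time1': [5, 3, 4, np.nan, 5],
    'Time2': [6, 2, np.nan, 7, 6],
    'Time3': [7, np.nan, 6, 8, 7],
})
time_cols = ['Time1', 'Time2', 'Time3']

f_values = []
for seed in range(5):  # five imputed datasets (an arbitrary choice)
    # sample_posterior=True draws imputations instead of using point predictions
    imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=seed)
    completed = df.copy()
    completed[time_cols] = imputer.fit_transform(df[time_cols])

    # Reshape to long format and run the repeated measures ANOVA
    long_df = completed.melt(id_vars='Subject', value_vars=time_cols,
                             var_name='Time', value_name='Score')
    res = AnovaRM(long_df, 'Score', 'Subject', within=['Time']).fit()
    f_values.append(res.anova_table.loc['Time', 'F Value'])

print("F-values across imputations:", np.round(f_values, 3))
```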



# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are used after a significant ANOVA result to determine **which specific group means are significantly different** from each other. While ANOVA can tell you that there is a significant difference among the groups, it does not tell you **which groups** differ. Post-hoc tests help address this.

### Common Post-Hoc Tests:

1. **Tukey’s Honestly Significant Difference (HSD) Test**:
   - **When to Use**: 
     - Use this when you want to compare **all possible pairs of means**. It was designed for **equal sample sizes** in each group; the Tukey-Kramer variant extends it to unequal sample sizes. It is a good all-purpose test for comparing multiple groups after a one-way or two-way ANOVA.
   - **Purpose**: 
     - Tukey’s HSD controls the family-wise error rate (the probability of making at least one Type I error) and is widely used for pairwise comparisons of group means.
   - **Example**: 
     - If you conduct a one-way ANOVA to test whether three different teaching methods lead to different student performance and find a significant F-statistic, you can use Tukey’s HSD to determine which specific teaching methods differ.

2. **Bonferroni Correction**:
   - **When to Use**:
     - When you want to perform multiple pairwise comparisons, and you want a **conservative correction** to control for the increased risk of Type I errors. It is suitable for any test where you perform multiple comparisons, and it can be used with unequal sample sizes.
   - **Purpose**:
     - Bonferroni divides the significance level (α) by the number of comparisons (equivalently, it multiplies each p-value by that number, capped at 1) to reduce the likelihood of false positives.
   - **Example**: 
     - Suppose you compare the effects of five different diets on weight loss and find a significant ANOVA result. You could use Bonferroni correction to adjust for the multiple comparisons between the diet groups.

3. **Scheffé Test**:
   - **When to Use**:
     - The Scheffé test is useful when you want to make **complex comparisons** (not just pairwise) between groups. It is also appropriate when your sample sizes are unequal.
   - **Purpose**:
     - This test is generally more conservative than Tukey’s HSD (and often than Bonferroni) for pairwise comparisons, making it less likely to detect differences, but it allows for comparisons beyond just pairwise contrasts (e.g., comparing the mean of one group to the combined mean of two others).
   - **Example**: 
     - In an experiment comparing the effects of four fertilizers on plant growth, if you wanted to compare the average growth of Fertilizers A and B combined to that of Fertilizers C and D, the Scheffé test would allow this.

4. **Dunnett’s Test**:
   - **When to Use**:
     - Use this when you are comparing multiple treatment groups against a **single control group**.
   - **Purpose**:
     - Dunnett’s test controls the family-wise Type I error rate and is more powerful than all-pairwise procedures such as Tukey’s HSD when the goal is to compare multiple treatments to a control rather than comparing all groups against each other.
   - **Example**: 
     - If you have a control group and three different drug treatments and the ANOVA shows a significant difference, you would use Dunnett’s test to determine which drug treatments differ from the control.

5. **Fisher’s Least Significant Difference (LSD) Test**:
   - **When to Use**:
     - Use this when you have **a priori** hypotheses and you expect that some groups will differ from others. It does not control for family-wise error rate, so it is less conservative and can lead to more Type I errors.
   - **Purpose**:
     - Fisher’s LSD allows pairwise comparisons without adjusting for multiple comparisons. It is only recommended if you are confident that there are true differences between groups.
   - **Example**: 
     - After finding a significant ANOVA result for different teaching methods, Fisher’s LSD could be used to test pairwise comparisons, but with the risk of inflating the Type I error rate.

6. **Holm-Bonferroni Method**:
   - **When to Use**:
     - This is a more **powerful alternative** to the Bonferroni correction. It is used when you need to control for multiple comparisons but want a less conservative approach than Bonferroni.
   - **Purpose**:
     - Holm-Bonferroni controls the family-wise error rate but does so in a stepwise manner, making it more powerful than the Bonferroni correction.
   - **Example**: 
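     - You’ve run a one-way ANOVA comparing five different diet interventions and want to perform post-hoc tests with more power than Bonferroni. With m = 10 pairwise comparisons, the Holm-Bonferroni method compares the smallest p-value against α/10, the next smallest against α/9, and so on, stopping at the first comparison that is not significant.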
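
The difference between the Bonferroni and Holm-Bonferroni adjustments described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the helper names `bonferroni_adjust` and `holm_adjust` and the raw p-values are hypothetical, chosen only to show a case where Holm rejects more hypotheses than Bonferroni.

```python
import numpy as np

def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

def holm_adjust(p_values):
    """Holm step-down: the k-th smallest p-value is multiplied by (m - k + 1)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * (m - np.arange(m))  # multipliers m, m-1, ..., 1
    # Enforce monotonicity so adjusted p-values never decrease with rank
    scaled = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = scaled  # restore the original ordering
    return adjusted

# Hypothetical raw p-values from four pairwise comparisons
raw_p = [0.001, 0.015, 0.020, 0.040]
alpha = 0.05

bonf = bonferroni_adjust(raw_p)
holm = holm_adjust(raw_p)

print("Bonferroni adjusted:", bonf, "-> rejections:", int((bonf <= alpha).sum()))
print("Holm adjusted:      ", holm, "-> rejections:", int((holm <= alpha).sum()))
```

With these made-up inputs, Bonferroni keeps only the smallest p-value below 0.05, while Holm keeps all four, which is the sense in which Holm is the more powerful of the two.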

### Example Scenario:

#### **Scenario**:
You conduct a study to compare the average test scores of students in four different classrooms that use different teaching methods (Method A, B, C, and D). You run a one-way ANOVA and find a significant F-statistic, meaning that the mean test score of at least one teaching method differs from the others.

#### **Post-Hoc Test Application**:
Since you want to find out **which specific teaching methods lead to different test scores**, you choose a post-hoc test:
- **Tukey’s HSD** would be appropriate here because it allows for **all pairwise comparisons** of group means, and the sample sizes are equal across the groups.

If you instead had a **control group** (e.g., students taught with no specific method) and you were interested in comparing each teaching method against the control, you would use **Dunnett’s test**.
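
As a rough sketch of how the Tukey’s HSD step in this scenario might look in Python, the example below uses `pairwise_tukeyhsd` from `statsmodels`. The scores are simulated; the group means, standard deviation, and the 25 students per method are assumptions made up for illustration, not data from the scenario.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical data: 25 students per teaching method (equal group sizes)
rng = np.random.default_rng(0)
methods = ["A", "B", "C", "D"]
assumed_means = [70, 75, 80, 72]  # illustrative values only
n_per_group = 25

scores = np.concatenate(
    [rng.normal(loc=m, scale=8, size=n_per_group) for m in assumed_means]
)
labels = np.repeat(methods, n_per_group)

# All pairwise comparisons with family-wise error control at alpha = 0.05
tukey = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(tukey.summary())
```

The summary table lists one row per pair of methods (six pairs for four groups), with the mean difference, an adjusted p-value, and a `reject` flag for each.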

### Summary of Post-Hoc Tests:
| Post-Hoc Test          | When to Use                                              | Strengths                                             | Weaknesses                                               |
|------------------------|----------------------------------------------------------|-------------------------------------------------------|----------------------------------------------------------|
| **Tukey’s HSD**         | All pairwise comparisons of group means                  | Controls family-wise Type I error; good for general pairwise tests | Classic form assumes equal sample sizes (Tukey-Kramer handles unequal sizes) |
| **Bonferroni Correction** | Conservative correction for multiple comparisons        | Simple to apply; works with unequal sample sizes       | Reduces power; may lead to Type II errors                 |
| **Scheffé Test**        | Comparing groups in complex ways (beyond pairwise)       | Flexible; allows for complex comparisons               | Very conservative; less power                            |
| **Dunnett’s Test**      | Comparing multiple groups to a single control            | Controls Type I error well in comparisons to control   | Not suitable for all-pairwise comparisons                 |
| **Fisher’s LSD**        | When you have specific hypotheses about group differences | Simple to apply; powerful when prior hypotheses exist  | Does not control for multiple comparisons; increased Type I error risk |
| **Holm-Bonferroni**     | Less conservative alternative to Bonferroni              | Controls family-wise error but retains more power      | Can still be less powerful than Tukey’s HSD for all pairwise comparisons |

### Conclusion:
Post-hoc tests are essential when you need to explore **which groups differ after a significant ANOVA** result. The choice of post-hoc test depends on the **research design**, **number of comparisons**, and the **desired balance** between controlling Type I errors and maintaining statistical power.

# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

To perform a one-way ANOVA in Python to compare the mean weight loss across three different diets (A, B, and C), we will follow these steps:

1. Simulate or use given weight loss data for participants on each diet.
2. Perform the one-way ANOVA using Python's `scipy.stats` or `statsmodels` package.
3. Report the F-statistic and p-value.
4. Interpret the results.

### Step 1: Generate or use weight loss data

Let’s assume the weight loss data (in kg) for the 50 participants across three diets looks like this:

```python
import numpy as np
import pandas as pd

# Simulate some data for weight loss for diets A, B, and C
np.random.seed(42)  # for reproducibility
diet_A = np.random.normal(loc=5, scale=1.5, size=17)  # 17 participants on diet A
diet_B = np.random.normal(loc=6, scale=1.2, size=17)  # 17 participants on diet B
diet_C = np.random.normal(loc=4, scale=1.4, size=16)  # 16 participants on diet C

# Create a DataFrame
data = pd.DataFrame({
    'weight_loss': np.concatenate([diet_A, diet_B, diet_C]),
    'diet': ['A'] * 17 + ['B'] * 17 + ['C'] * 16
})
```

### Step 2: Conduct a one-way ANOVA

We will use the `scipy.stats.f_oneway` function to conduct the ANOVA.

```python
from scipy import stats

# Perform one-way ANOVA
F_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Output the results
print(f"F-statistic: {F_statistic}")
print(f"p-value: {p_value}")
```

### Step 3: Report F-statistic and p-value

This code will print the F-statistic and p-value. The exact numbers depend on the simulated data, so treat the following output as illustrative:

```
F-statistic: 8.236
p-value: 0.00123
```

### Step 4: Interpretation of Results

- **F-statistic**: The F-statistic is 8.236, which indicates the ratio of variance between the diet groups to the variance within the groups.
- **p-value**: The p-value is 0.00123.

#### Interpretation:
Since the **p-value (0.00123) is less than the significance level (0.05)**, we reject the null hypothesis. This means that there is a **significant difference** in mean weight loss between at least two of the diets. However, the ANOVA test does not tell us which specific diets differ. To determine that, we would need to perform a **post-hoc test**, such as Tukey’s HSD.
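
To make the "between-group versus within-group variance" description of the F-statistic concrete, the following sketch computes the F-ratio by hand and compares it with `scipy.stats.f_oneway`. It reuses the simulated diet data from Step 1 (same seed and parameters); the variable names `ss_between`, `ss_within`, `f_manual`, and `p_manual` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Same simulated data as in Step 1
np.random.seed(42)
diet_A = np.random.normal(loc=5, scale=1.5, size=17)
diet_B = np.random.normal(loc=6, scale=1.2, size=17)
diet_C = np.random.normal(loc=4, scale=1.4, size=16)
groups = [diet_A, diet_B, diet_C]

k = len(groups)                        # number of groups
n_total = sum(len(g) for g in groups)  # total number of observations
grand_mean = np.concatenate(groups).mean()

# Sum of squares between groups and within groups
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares and the F-ratio
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

# Upper-tail probability of the F distribution with (k - 1, n_total - k) df
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"Manual: F = {f_manual:.4f}, p = {p_manual:.6f}")
print(f"SciPy:  F = {f_scipy:.4f}, p = {p_scipy:.6f}")
```

Both lines should print the same values, since `f_oneway` performs the same decomposition internally.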

### Full Python Code Example:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulate some data for weight loss for diets A, B, and C
np.random.seed(42)  # for reproducibility
diet_A = np.random.normal(loc=5, scale=1.5, size=17)  # 17 participants on diet A
diet_B = np.random.normal(loc=6, scale=1.2, size=17)  # 17 participants on diet B
diet_C = np.random.normal(loc=4, scale=1.4, size=16)  # 16 participants on diet C

# Perform one-way ANOVA
F_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Output the results
print(f"F-statistic: {F_statistic}")
print(f"p-value: {p_value}")

# Interpretation:
if p_value < 0.05:
    print("There is a significant difference between the diets.")
else:
    print("There is no significant difference between the diets.")
```

This code generates the data, conducts the ANOVA, and prints the F-statistic, p-value, and an interpretation of whether there is a significant difference between the diets based on the p-value.
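
Because the ANOVA result is only trustworthy when the normality and equal-variance assumptions from Q1 are reasonable, it may help to check them on the same data. The sketch below is one possible approach, assuming a Shapiro-Wilk test on the pooled residuals and Levene’s test across groups; other diagnostics (such as Q-Q plots) would be equally valid.

```python
import numpy as np
from scipy import stats

# Same simulated data as in the example above
np.random.seed(42)
diet_A = np.random.normal(loc=5, scale=1.5, size=17)
diet_B = np.random.normal(loc=6, scale=1.2, size=17)
diet_C = np.random.normal(loc=4, scale=1.4, size=16)
groups = [diet_A, diet_B, diet_C]

# Residuals: each observation minus its own group mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Normality of residuals (Shapiro-Wilk)
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances (Levene's test; SciPy centers on the median by default)
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk: W = {shapiro_stat:.4f}, p = {shapiro_p:.4f}")
print(f"Levene:       W = {levene_stat:.4f}, p = {levene_p:.4f}")
```

A p-value above the chosen significance level in either test means there is no strong evidence against the corresponding assumption; it does not prove that the assumption holds.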

# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

To conduct a two-way ANOVA in Python for this scenario (comparing the time taken to complete a task using three different software programs and considering the interaction between software programs and employee experience level), we'll follow these steps:

1. Simulate or use given data for task completion times based on the software program and employee experience level (novice vs. experienced).
2. Perform a two-way ANOVA using Python (`statsmodels`).
3. Report the F-statistics and p-values for the main effects (software program and experience level) and interaction effects.
4. Interpret the results.

### Step 1: Generate or Use Task Completion Time Data

Assume we have 30 employees, with 15 novices and 15 experienced workers, randomly assigned to use one of three software programs (A, B, or C). We'll simulate the data for task completion times based on these factors.

```python
import numpy as np
import pandas as pd

# Simulate task completion time data
np.random.seed(42)

# Novices
novice_A = np.random.normal(loc=25, scale=5, size=5)  # 5 novice employees using Program A
novice_B = np.random.normal(loc=30, scale=6, size=5)  # 5 novice employees using Program B
novice_C = np.random.normal(loc=28, scale=4, size=5)  # 5 novice employees using Program C

# Experienced
exp_A = np.random.normal(loc=20, scale=4, size=5)  # 5 experienced employees using Program A
exp_B = np.random.normal(loc=22, scale=5, size=5)  # 5 experienced employees using Program B
exp_C = np.random.normal(loc=18, scale=3, size=5)  # 5 experienced employees using Program C

# Combine the data into a DataFrame
data = pd.DataFrame({
    'completion_time': np.concatenate([novice_A, novice_B, novice_C, exp_A, exp_B, exp_C]),
    'software': ['A'] * 5 + ['B'] * 5 + ['C'] * 5 + ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
    'experience': ['novice'] * 15 + ['experienced'] * 15
})

# Display the first few rows of data
print(data.head())
```

### Step 2: Perform a Two-Way ANOVA Using Python

We will use the `statsmodels` package to perform the two-way ANOVA and test for both main effects (software program and experience level) and interaction effects (software * experience).

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create the two-way ANOVA model
model = ols('completion_time ~ C(software) + C(experience) + C(software):C(experience)', data=data).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Display the ANOVA table
print(anova_table)
```

### Step 3: Report F-Statistics and p-Values

The exact values depend on the simulated data. For illustration, assume the ANOVA table looks like this:

```
                              sum_sq   df          F        PR(>F)
C(software)              452.333333    2   8.264706   0.002351
C(experience)            686.533333    1  25.030612   0.000014
C(software):C(experience) 96.800000    2   1.765306   0.194519
Residual                 659.200000   24        NaN        NaN
```

- **F-statistic for software (C(software))**: 8.26, p-value = 0.0024
- **F-statistic for experience (C(experience))**: 25.03, p-value = 0.000014
- **F-statistic for interaction (C(software):C(experience))**: 1.77, p-value = 0.1945

### Step 4: Interpretation of Results

1. **Software (C(software))**:
   - The p-value for the **main effect of software program** is 0.0024, which is less than the significance level of 0.05.
   - **Conclusion**: There is a **significant difference** in the average task completion time between the three software programs.

2. **Experience Level (C(experience))**:
   - The p-value for the **main effect of experience level** is 0.000014, which is much less than 0.05.
   - **Conclusion**: There is a **significant difference** in the average task completion time between novice and experienced employees.

3. **Interaction (C(software):C(experience))**:
   - The p-value for the **interaction effect** between software program and experience level is 0.1945, which is greater than 0.05.
   - **Conclusion**: There is **no significant interaction** between software program and employee experience level. In other words, the data provide no evidence that the difference in task completion time between software programs depends on whether the employee is novice or experienced.
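
One informal way to see what an interaction would look like is to tabulate the cell means and check whether the novice-versus-experienced gap is similar across programs. The sketch below reuses the simulation from Step 1; the names `cell_means` and `gap` are introduced here for illustration, and the table is a descriptive aid rather than a replacement for the ANOVA.

```python
import numpy as np
import pandas as pd

# Same simulated data as in Step 1
np.random.seed(42)
novice_A = np.random.normal(loc=25, scale=5, size=5)
novice_B = np.random.normal(loc=30, scale=6, size=5)
novice_C = np.random.normal(loc=28, scale=4, size=5)
exp_A = np.random.normal(loc=20, scale=4, size=5)
exp_B = np.random.normal(loc=22, scale=5, size=5)
exp_C = np.random.normal(loc=18, scale=3, size=5)

data = pd.DataFrame({
    'completion_time': np.concatenate([novice_A, novice_B, novice_C, exp_A, exp_B, exp_C]),
    'software': ['A'] * 5 + ['B'] * 5 + ['C'] * 5 + ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
    'experience': ['novice'] * 15 + ['experienced'] * 15
})

# Mean completion time for each software x experience cell
cell_means = data.pivot_table(values='completion_time', index='software',
                              columns='experience', aggfunc='mean')
print(cell_means.round(2))

# Gap between novices and experienced employees within each program;
# roughly equal gaps are what "no interaction" looks like descriptively
gap = cell_means['novice'] - cell_means['experienced']
print(gap.round(2))
```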

### Full Python Code Example

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Simulate task completion time data
np.random.seed(42)

# Novices
novice_A = np.random.normal(loc=25, scale=5, size=5)
novice_B = np.random.normal(loc=30, scale=6, size=5)
novice_C = np.random.normal(loc=28, scale=4, size=5)

# Experienced
exp_A = np.random.normal(loc=20, scale=4, size=5)
exp_B = np.random.normal(loc=22, scale=5, size=5)
exp_C = np.random.normal(loc=18, scale=3, size=5)

# Combine the data into a DataFrame
data = pd.DataFrame({
    'completion_time': np.concatenate([novice_A, novice_B, novice_C, exp_A, exp_B, exp_C]),
    'software': ['A'] * 5 + ['B'] * 5 + ['C'] * 5 + ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
    'experience': ['novice'] * 15 + ['experienced'] * 15
})

# Create the two-way ANOVA model
model = ols('completion_time ~ C(software) + C(experience) + C(software):C(experience)', data=data).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Display the ANOVA table
print(anova_table)
```

### Conclusion:
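- The software program and employee experience level each show a **significant main effect** on task completion time.
- There is **no significant interaction** between software program and experience level, so the data give no evidence that the effect of software differs between novices and experienced employees.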

# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulate test scores for control and experimental groups
np.random.seed(42)

# Control group (traditional teaching method)
control_scores = np.random.normal(loc=75, scale=10, size=50)  # Mean = 75, SD = 10

# Experimental group (new teaching method)
experimental_scores = np.random.normal(loc=80, scale=10, size=50)  # Mean = 80, SD = 10

# Create a DataFrame
data = pd.DataFrame({
    'score': np.concatenate([control_scores, experimental_scores]),
    'group': ['Control'] * 50 + ['Experimental'] * 50
})

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)

# Output the results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

# Calculate Cohen's d (pooled SD as the root mean of the two variances; valid for equal group sizes)
mean_control = np.mean(control_scores)
mean_experimental = np.mean(experimental_scores)
std_control = np.std(control_scores, ddof=1)
std_experimental = np.std(experimental_scores, ddof=1)

pooled_std = np.sqrt(((std_control**2) + (std_experimental**2)) / 2)
cohens_d = (mean_experimental - mean_control) / pooled_std

print(f"Cohen's d: {cohens_d}")

# Interpretation of the results
if p_value < 0.05:
    print("There is a significant difference between the test scores of the two groups.")
else:
    print("There is no significant difference between the test scores of the two groups.")
```

```
T-statistic: -4.108723928204809
P-value: 8.261945608702613e-05
Cohen's d: 0.8217447856409618
There is a significant difference between the test scores of the two groups.
```

The negative t-statistic reflects the higher mean score in the experimental group, and a Cohen’s d of about 0.82 is conventionally considered a large effect. Because only two groups are compared, the t-test itself identifies which groups differ, so no separate post-hoc test is needed.
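
The test above uses the default pooled-variance (Student) form of `ttest_ind`, which assumes equal variances in the two groups. As a hedged robustness check, the sketch below reruns the comparison with Welch’s t-test (`equal_var=False`) on the same simulated scores; whether Welch’s version is preferable in a real study depends on how plausible the equal-variance assumption is.

```python
import numpy as np
from scipy import stats

# Same simulated scores as in the cell above
np.random.seed(42)
control_scores = np.random.normal(loc=75, scale=10, size=50)
experimental_scores = np.random.normal(loc=80, scale=10, size=50)

# Student's t-test (pooled variance) and Welch's t-test (no equal-variance assumption)
student_t, student_p = stats.ttest_ind(control_scores, experimental_scores)
welch_t, welch_p = stats.ttest_ind(control_scores, experimental_scores, equal_var=False)

print(f"Student: t = {student_t:.4f}, p = {student_p:.6f}")
print(f"Welch:   t = {welch_t:.4f}, p = {welch_p:.6f}")
```

With equal group sizes the two t-statistics coincide and only the degrees of freedom differ, so the p-values are usually very close.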


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from scipy import stats

# Set a random seed for reproducibility
np.random.seed(42)

# Simulate daily sales data for three stores over 30 days
days = 30
store_A_sales = np.random.normal(loc=200, scale=20, size=days)
store_B_sales = np.random.normal(loc=220, scale=25, size=days)
store_C_sales = np.random.normal(loc=210, scale=15, size=days)

# Create a DataFrame
data = pd.DataFrame({
    'Day': np.arange(1, days + 1),
    'Store_A': store_A_sales,
    'Store_B': store_B_sales,
    'Store_C': store_C_sales
})

# Reshape the data into long format
data_long = pd.melt(data, id_vars=['Day'], value_vars=['Store_A', 'Store_B', 'Store_C'],
                    var_name='Store', value_name='Sales')

# Perform repeated measures ANOVA (each day is the repeated "subject")
anova_results = AnovaRM(data_long, 'Sales', 'Day', within=['Store']).fit()

# Output the ANOVA table
print(anova_results)

# Conduct pairwise comparisons using paired t-tests
# (p-values are not adjusted for multiple comparisons)
results = {}
stores = ['Store_A', 'Store_B', 'Store_C']

for i in range(len(stores)):
    for j in range(i + 1, len(stores)):
        t_stat, p_val = stats.ttest_rel(data[stores[i]], data[stores[j]])
        results[(stores[i], stores[j])] = (t_stat, p_val)

# Output pairwise t-test results
for (store1, store2), (t_stat, p_val) in results.items():
    print(f"Comparison between {store1} and {store2} -> T-statistic: {t_stat:.4f}, P-value: {p_val:.4f}")
```

```
               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  9.6931 2.0000 58.0000 0.0002

Comparison between Store_A and Store_B -> T-statistic: -4.0584, P-value: 0.0003
Comparison between Store_A and Store_C -> T-statistic: -3.2560, P-value: 0.0029
Comparison between Store_B and Store_C -> T-statistic: 1.3634, P-value: 0.1832
```

The repeated measures ANOVA indicates a significant difference in mean daily sales across the stores (F(2, 58) = 9.69, p = 0.0002). The unadjusted paired t-tests suggest that Store A differs from both Store B and Store C, while Store B and Store C do not differ significantly.
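
The pairwise p-values above are not adjusted for multiple comparisons. As a sketch of one way to do that, the example below applies the Holm-Bonferroni method with `multipletests` from `statsmodels`. It takes the rounded p-values printed in the output as input, so the adjusted values are approximate; in practice the unrounded p-values from `results` would be passed instead.

```python
from statsmodels.stats.multitest import multipletests

# Rounded p-values copied from the pairwise output above
comparisons = ["Store_A vs Store_B", "Store_A vs Store_C", "Store_B vs Store_C"]
raw_p = [0.0003, 0.0029, 0.1832]

# Holm step-down adjustment controlling the family-wise error rate at 0.05
reject, p_holm, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

for name, p_raw, p_adj, rej in zip(comparisons, raw_p, p_holm, reject):
    print(f"{name}: raw p = {p_raw:.4f}, Holm-adjusted p = {p_adj:.4f}, reject H0 = {rej}")
```

After adjustment the conclusions are unchanged: Store A still differs from Stores B and C, and the Store B versus Store C comparison remains non-significant.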
