#### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.
#### Assumptions for ANOVA:
1. **Independence of Observations**: Observations should be independent of one another, both within and across groups.
    - **Violation Example**: Using repeated measures on the same subjects without accounting for the repeated nature of the measurements can violate this assumption.
  
2. **Normality**: The data in each group (equivalently, the model residuals) should be approximately normally distributed.
    - **Violation Example**: A heavily skewed distribution (e.g., salary data with a few high-income outliers) might violate this assumption.
  
3. **Homogeneity of Variances (Homoscedasticity)**: The variances across the groups should be approximately equal.
    - **Violation Example**: If one group's variance is significantly higher or lower than others (e.g., one group has much more variability in test scores), this assumption is violated.

#### Impact of Violations:
1. **Independence**: Violating independence can lead to incorrect inferences, as it may inflate Type I error rates (false positives).
2. **Normality**: If the normality assumption is violated, the ANOVA test may become less reliable, particularly with small sample sizes.
3. **Homogeneity of Variances**: Violating this assumption can lead to incorrect conclusions because the F-statistic can be biased if the variances differ substantially, particularly when group sizes are unequal.
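
The normality and equal-variance assumptions can be screened with standard tests before running ANOVA. The following is a minimal sketch, assuming `scipy` is available; the helper name `check_anova_assumptions` and the simulated scores are hypothetical and only for illustration. Independence cannot be checked this way and has to come from the study design.

```python
import numpy as np
from scipy import stats

def check_anova_assumptions(groups):
    """Return Shapiro-Wilk p-values (one per group) and the Levene p-value."""
    shapiro_p = []
    for g in groups:
        _, p = stats.shapiro(g)      # H0: the group is normally distributed
        shapiro_p.append(p)
    _, levene_p = stats.levene(*groups)  # H0: all groups have equal variance
    return {"shapiro_p": shapiro_p, "levene_p": levene_p}

# Hypothetical test scores for three groups (simulated, not real data)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=70, scale=8, size=20) for _ in range(3)]

result = check_anova_assumptions(groups)
print(result)
```

Small p-values (for example below 0.05) would suggest that the corresponding assumption is questionable for that data.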

---

#### Q2. What are the three types of ANOVA, and in what situations would each be used?

1. **One-Way ANOVA**: Used when comparing the means of three or more independent groups based on one factor or independent variable.
   - **Situation**: Testing if there is a difference in average exam scores among students from three different schools.

2. **Two-Way ANOVA**: Used to evaluate the effect of two factors simultaneously and determine if there is an interaction between them.
   - **Situation**: Assessing the impact of two independent factors (e.g., teaching method and gender) on students' test scores.

3. **Repeated Measures ANOVA**: Used when the same subjects are measured more than once (e.g., across different time points or conditions).
   - **Situation**: Evaluating the effectiveness of a diet plan by measuring the weight of participants at multiple time points.

---

#### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

#### Partitioning of Variance in ANOVA:
1. **Total Sum of Squares (SST)**: The overall variability in the data, which equals the between-group variability plus the within-group variability: SST = SSB + SSW.
2. **Between-Group Sum of Squares (SSB)**: The portion of the total variability that is due to differences between the group means.
3. **Within-Group Sum of Squares (SSW)**: The portion of the total variability that is due to variability within each group.

#### Importance:
- Understanding the partitioning of variance is crucial because it helps to determine whether the observed differences in group means are statistically significant or could have occurred by random chance. It also provides insight into the proportion of total variability explained by the grouping factor, which is important for interpreting the effect size in an ANOVA.
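
As a brief illustration of how the partition is used, the F-statistic and the effect size $ \eta^2 $ follow directly from the sums of squares. The symbols $ k $ (number of groups) and $ N $ (total number of observations) are notation assumed here:

$$ F = \frac{SSB / (k - 1)}{SSW / (N - k)}, \qquad \eta^2 = \frac{SSB}{SST} $$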


#### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

#### Sum of Squares in One-Way ANOVA

To calculate the **Total Sum of Squares (SST)**, **Explained Sum of Squares (SSE)**, and **Residual Sum of Squares (SSR)** in a one-way ANOVA, you can use the following formulas:

1. **Total Sum of Squares (SST):**
   The total variation in the data. It is the sum of the squared differences between each observation and the overall mean.
   
   $$ SST = \sum_{i=1}^{N} (y_i - \bar{y})^2 $$

2. **Explained Sum of Squares (SSE):**
   The variation explained by the groups. It is the sum of the squared differences between each group mean and the overall mean, weighted by the group size $ n_j $.
   
   $$ SSE = \sum_{j=1}^{k} n_j (\bar{y}_j - \bar{y})^2 $$

3. **Residual Sum of Squares (SSR):**
   The variation that is not explained by the groups. It is the sum of the squared differences between each observation and its group mean.
   
   $$ SSR = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2 $$

#### Explanation:
- $ y_i $ is an individual observation; $ y_{ij} $ is the $ i $-th observation in group $ j $.
- $ \bar{y} $ is the overall mean of the data.
- $ \bar{y}_j $ is the mean of group $ j $.
- $ n_j $ is the number of observations in group $ j $.
- $ N $ is the total number of observations across all groups.
- $ k $ is the number of groups.


In [1]:
import numpy as np
import pandas as pd
from scipy import stats

# Example data
data = {
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Value': [4, 5, 6, 5, 6, 7, 6, 7, 8]
}
df = pd.DataFrame(data)

# Group by 'Group' and calculate the group means and overall mean
group_means = df.groupby('Group')['Value'].mean()
overall_mean = df['Value'].mean()

# Calculate SST (Total Sum of Squares)
sst = np.sum((df['Value'] - overall_mean) ** 2)

# Calculate SSE (Sum of Squares Between Groups)
sse = np.sum(df.groupby('Group').size() * (group_means - overall_mean) ** 2)

# Calculate SSR (Sum of Squares Within Groups)
ssr = np.sum((df['Value'] - df.groupby('Group')['Value'].transform('mean')) ** 2)

print(f"Total Sum of Squares (SST): {sst}")
print(f"Explained Sum of Squares (SSE): {sse}")
print(f"Residual Sum of Squares (SSR): {ssr}")


Total Sum of Squares (SST): 12.0
Explained Sum of Squares (SSE): 6.0
Residual Sum of Squares (SSR): 6.0
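
As a sanity check on the sums of squares computed above, the sketch below confirms that SST = SSE + SSR for the same example data and rebuilds the F-statistic from them, comparing it with `scipy.stats.f_oneway`. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Same example data as above, one array per group
groups = {
    "A": np.array([4, 5, 6]),
    "B": np.array([5, 6, 7]),
    "C": np.array([6, 7, 8]),
}
values = np.concatenate(list(groups.values()))
grand_mean = values.mean()

sst = np.sum((values - grand_mean) ** 2)
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups.values())

# F = (SSE / (k - 1)) / (SSR / (N - k))
k, n_total = len(groups), len(values)
f_manual = (sse / (k - 1)) / (ssr / (n_total - k))
f_scipy, p_scipy = stats.f_oneway(*groups.values())

print(f"SST = {sst}, SSE + SSR = {sse + ssr}")
print(f"F (manual) = {f_manual:.3f}, F (scipy) = {f_scipy:.3f}, p = {p_scipy:.3f}")
```

For this data the manual F-statistic matches the one from `f_oneway` (3.0 on 2 and 6 degrees of freedom).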


#### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

##### Two-Way ANOVA with Main and Interaction Effects using Python

In this guide, we'll calculate the main effects and interaction effects in a two-way ANOVA using Python's `statsmodels` library. A two-way ANOVA helps in understanding how two categorical independent variables impact a continuous dependent variable, both individually (main effects) and together (interaction effects).

##### Step-by-Step Process

###### 1. Prepare the Data
Make sure your data is organized into a Pandas DataFrame, with columns for the dependent variable and the two independent variables (factors).

###### 2. Fit the ANOVA Model
Use the `ols` (Ordinary Least Squares) function from `statsmodels.formula.api` to specify the model formula, and fit the model.

###### 3. Calculate the ANOVA Table
Use the `anova_lm` function from `statsmodels` to compute the ANOVA table, which will include the main and interaction effects.

##### Example Code

Here's how to calculate the main and interaction effects using Python:

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example Data: Create a dataset with two factors (FactorA and FactorB) and a dependent variable (Y)
data = pd.DataFrame({
    'FactorA': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2'],
    'FactorB': ['B1', 'B1', 'B1', 'B1', 'B1', 'B1', 'B2', 'B2', 'B2', 'B2', 'B2', 'B2'],
    'Y': [8, 9, 6, 7, 9, 6, 7, 8, 9, 6, 7, 8]
})

# Fit the two-way ANOVA model
model = ols('Y ~ C(FactorA) + C(FactorB) + C(FactorA):C(FactorB)', data=data).fit()

# Calculate the ANOVA table
anova_results = sm.stats.anova_lm(model, typ=2)

# Display the results
print(anova_results)

                             sum_sq   df             F    PR(>F)
C(FactorA)             1.333333e+00  1.0  8.000000e-01  0.397204
C(FactorB)             5.365644e-30  1.0  3.219386e-30  1.000000
C(FactorA):C(FactorB)  3.333333e-01  1.0  2.000000e-01  0.666581
Residual               1.333333e+01  8.0           NaN       NaN


#### Explanation:
##### OLS Function

Specifies and fits the OLS regression model. The model is defined as follows:

`Y ~ C(FactorA) + C(FactorB) + C(FactorA):C(FactorB)`

Where:
- **Y**: Dependent variable.
- **C(FactorA)**: Categorical variable for Factor A.
- **C(FactorB)**: Categorical variable for Factor B.
- **C(FactorA):C(FactorB)**: Interaction term between Factor A and Factor B.

##### anova_lm Function

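Generates the ANOVA table from the fitted model. `typ=2` requests Type II sums of squares, which test each main effect after adjusting for the other main effect. In a balanced design such as this example (three observations per cell), Type I and Type II sums of squares coincide.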

##### Output

The output will be an ANOVA table containing:

- **Sum of Squares (SS)**: For each effect (Factor A, Factor B, Interaction, and Residual).
- **Degrees of Freedom (DF)**: For each effect.
- **F-statistic**: The F-value for the hypothesis test for each effect.
- **p-value**: The significance level for each effect.

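By interpreting this table, you can determine the main effects of each factor and their interaction effect on the dependent variable. In the example output, none of the effects is significant at α = 0.05: the p-values are 0.397 for Factor A, 1.000 for Factor B, and 0.667 for the interaction.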
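
To see where the main and interaction effects come from, it can help to inspect the cell and marginal means directly. The sketch below uses the same example data; `pivot_table` with `margins=True` is one convenient way to do this, and the `interaction_contrast` name is illustrative.

```python
import pandas as pd

# Same example data as in the two-way ANOVA above
data = pd.DataFrame({
    'FactorA': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2'],
    'FactorB': ['B1', 'B1', 'B1', 'B1', 'B1', 'B1', 'B2', 'B2', 'B2', 'B2', 'B2', 'B2'],
    'Y': [8, 9, 6, 7, 9, 6, 7, 8, 9, 6, 7, 8]
})

# Cell means in the body of the table, marginal means in the 'All' row/column
cell_means = data.pivot_table(values='Y', index='FactorA', columns='FactorB',
                              aggfunc='mean', margins=True, margins_name='All')
print(cell_means.round(3))

# Interaction contrast: does the A1 - A2 difference change between B1 and B2?
interaction_contrast = (
    (cell_means.loc['A1', 'B1'] - cell_means.loc['A2', 'B1'])
    - (cell_means.loc['A1', 'B2'] - cell_means.loc['A2', 'B2'])
)
print(f"Interaction contrast: {interaction_contrast:.3f}")
```

The marginal means of Factor B are identical here, which is why its sum of squares in the ANOVA table is essentially zero.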

#### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

##### F-Statistic (5.23)

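The F-statistic is the ratio of the variance between the group means to the variance within the groups. Under the null hypothesis this ratio is expected to be close to 1, so a value of 5.23 means the between-group variability is about five times the within-group variability. Whether that is more than chance alone would produce depends on the degrees of freedom, which is what the p-value accounts for.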

##### p-value (0.02)

The p-value is the probability of observing an F-statistic at least as extreme as 5.23 if the null hypothesis were true. The null hypothesis in a one-way ANOVA states that all group means are equal. A p-value of 0.02 means that, if all group means were truly equal, an F-statistic this large or larger would occur in only about 2% of samples. It is not the probability that the null hypothesis is true.

##### Conclusion

Since the p-value (0.02) is less than the common significance level (α = 0.05), you would reject the null hypothesis.  
This means there is sufficient evidence to conclude that at least one group mean is significantly different from the others.

##### Interpretation

The results suggest that there are statistically significant differences between the means of the groups being compared.  
However, the ANOVA does not indicate which specific groups are different from each other; it only tells you that at least one group is different. To determine which specific groups differ, you would need to conduct a post hoc test (such as Tukey's HSD or Bonferroni correction).

##### In summary:

There is statistically significant evidence (F = 5.23, p = 0.02) that at least one group mean differs from the others; a result this extreme would be expected only about 2% of the time if all group means were equal.
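
The link between an F-statistic and its p-value can be made concrete with the F distribution's survival function. This is a minimal sketch: the question does not state the degrees of freedom, so the values below are hypothetical, and `p_value_from_f` is an illustrative helper name.

```python
from scipy import stats

def p_value_from_f(f_stat, df_between, df_within):
    """Upper-tail probability P(F >= f_stat) under the null hypothesis."""
    return stats.f.sf(f_stat, df_between, df_within)

# Hypothetical degrees of freedom (not given in the question)
for df_between, df_within in [(2, 12), (2, 27), (3, 36)]:
    p = p_value_from_f(5.23, df_between, df_within)
    print(f"F = 5.23 on ({df_between}, {df_within}) df -> p = {p:.4f}")
```

The same F-statistic gives different p-values for different degrees of freedom, which is why an F-value should always be reported together with them.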


#### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in repeated measures ANOVA is critical to ensure accurate results. Several methods can be applied to handle missing data, each with its own implications:

##### Methods for Handling Missing Data:

- **Listwise Deletion:**  
  Removes all cases with missing data from the analysis. This method is straightforward but can reduce the sample size and statistical power, especially if many observations are missing. It assumes that data is missing completely at random (MCAR), which may not always be the case.

- **Pairwise Deletion:**  
  Uses all available data for each analysis without excluding cases entirely. This can retain more data but may introduce bias if the data is not MCAR, that is, if it is only missing at random (MAR) or missing not at random (MNAR). It may also result in varying sample sizes for different comparisons.

- **Mean Imputation:**  
  Replaces missing values with the mean of the observed values for that variable. While easy to implement, this method can underestimate variability and lead to biased parameter estimates, especially if the data is not MCAR.

- **Last Observation Carried Forward (LOCF):**  
  Fills in missing values with the participant's last observed data point. This method assumes no change after the last observation, which may not be valid and can introduce bias if the missing data is not random.

- **Multiple Imputation:**  
  Generates multiple datasets by replacing missing values with plausible estimates, and then combines the results from these datasets. This approach accounts for the uncertainty due to missing data and is generally recommended as it provides more accurate parameter estimates, assuming data is MAR.

##### Consequences of Different Methods:

- **Loss of Power:**  
  Listwise deletion can significantly reduce sample size, leading to a loss of statistical power.

- **Bias in Estimates:**  
  Methods like mean imputation or LOCF can introduce bias if the missing data is not random.

- **Incorrect Variance Estimates:**  
  Some methods may underestimate the true variability in the data, resulting in incorrect inferences.

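- **Reduced Validity of Results:**  
  Deletion and single-imputation methods can undermine the validity of the conclusions when their assumptions do not hold. Multiple imputation is generally preferred because it accounts for the uncertainty due to missing data and typically leads to more valid inferences.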
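
The simpler methods above can be sketched in a few lines of pandas. The data frame below is hypothetical: one row per subject and one column per time point, with columns assumed to be in time order. Multiple imputation is not shown because it requires a dedicated imputation model.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: rows are subjects, columns are time points
wide = pd.DataFrame(
    {
        "t1": [70.0, 82.0, 91.0, 65.0],
        "t2": [68.0, np.nan, 90.0, 64.0],
        "t3": [67.0, 80.0, np.nan, np.nan],
    },
    index=pd.Index(["s1", "s2", "s3", "s4"], name="subject"),
)

# Listwise deletion: keep only subjects with complete data
listwise = wide.dropna()

# Mean imputation: replace each missing value with its column (time point) mean
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each subject's last observed value forward across time points
locf = wide.ffill(axis=1)

print(f"Subjects kept after listwise deletion: {len(listwise)} of {len(wide)}")
print(f"Variance of t3, observed values only: {wide['t3'].var():.2f}")
print(f"Variance of t3 after mean imputation: {mean_imputed['t3'].var():.2f}")
print(locf)
```

In this toy example listwise deletion discards most subjects, and mean imputation visibly shrinks the variance of the affected time point, which matches the consequences described above.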
  


#### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are used after a significant ANOVA result to identify which specific groups differ from each other. Here are some common post-hoc tests and their uses:

##### Tukey's Honest Significant Difference (HSD):
- **When to Use:**  
  Appropriate when you want to compare all possible pairs of means while controlling the family-wise Type I error rate. It was designed for equal group sizes; the Tukey-Kramer variant extends it to unequal group sizes. It is a common choice for pairwise comparisons.

- **Example:**  
  After finding a significant difference in the mean scores of students taught using four different teaching methods, Tukey's HSD can determine which specific teaching methods differ.

##### Bonferroni Correction:
- **When to Use:**  
  Suitable when conducting multiple pairwise comparisons to control the family-wise error rate. It is more conservative and adjusts the significance level for the number of comparisons.

- **Example:**  
  When comparing the effectiveness of five different diets on weight loss, Bonferroni correction reduces the chance of Type I errors from multiple comparisons.

##### Scheffé's Test:
- **When to Use:**  
  Useful for making complex comparisons, including non-pairwise comparisons (e.g., comparing combinations of groups). It is more conservative and flexible than Tukey's HSD.

- **Example:**  
  If researchers want to compare not only individual medications but also combinations of different treatments, Scheffé's test would be suitable.

##### Dunnett's Test:
- **When to Use:**  
  Ideal when comparing multiple treatment groups to a single control group. It controls the Type I error rate while making these specific comparisons.

- **Example:**  
  In a drug trial comparing several new drugs to a single placebo group, Dunnett's test would be used to compare each drug directly to the placebo, without testing the drugs against one another.

##### Example Situation for Post-hoc Tests:  
After performing a one-way ANOVA to assess the effect of four different fertilizers on plant growth, you find a significant difference among the groups. Since ANOVA only indicates that at least one group differs, a post-hoc test, such as Tukey's HSD, would be necessary to determine which specific fertilizers differ in their effect on plant growth.
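
As one concrete illustration of a post-hoc procedure, the sketch below runs all pairwise t-tests for four fertilizer groups and applies a Bonferroni correction using `multipletests` from `statsmodels`. The growth data are simulated and the group names are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical plant growth (cm) for four fertilizers; simulated, not real data
rng = np.random.default_rng(1)
growth = {
    "F1": rng.normal(20, 3, 12),
    "F2": rng.normal(22, 3, 12),
    "F3": rng.normal(25, 3, 12),
    "F4": rng.normal(20, 3, 12),
}

# All pairwise comparisons: 4 groups give 6 pairs
pairs = list(combinations(growth, 2))
raw_p = [stats.ttest_ind(growth[a], growth[b]).pvalue for a, b in pairs]

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p, p_adj, rej in zip(pairs, raw_p, adj_p, reject):
    print(f"{a} vs {b}: raw p = {p:.4f}, adjusted p = {p_adj:.4f}, reject = {rej}")
```

Passing a different `method` to `multipletests` (for example `"holm"`) gives a less conservative correction with the same family-wise error control.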


#### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

- To conduct a one-way ANOVA comparing the mean weight loss of three diets (A, B, and C), we first need data; since the question provides none, we simulate it. The one-way ANOVA test is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups.

Let's assume weight loss data for 50 participants divided into three groups of 16, 17, and 17, each following a different diet.

Here is the code to perform the analysis:

In [3]:
import numpy as np
import scipy.stats as stats

# Simulate some data for weight loss for three different diets
np.random.seed(42)  # For reproducibility

# Simulated weight loss data for diets A, B, and C (in kg)
diet_A = np.random.normal(loc=5, scale=2, size=16)  # Mean = 5, Std = 2, n = 16
diet_B = np.random.normal(loc=4.5, scale=2.5, size=17)  # Mean = 4.5, Std = 2.5, n = 17
diet_C = np.random.normal(loc=6, scale=1.8, size=17)  # Mean = 6, Std = 1.8, n = 17

# Perform one-way ANOVA
F_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Output the results
print(f"F-statistic: {F_statistic:.3f}")
print(f"p-value: {p_value:.3f}")


F-statistic: 2.174
p-value: 0.125


#### Interpreting the Results
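- F-statistic: Measures the ratio of between-group variance to within-group variance. A higher F-statistic indicates greater variance among the group means relative to the variance within groups.
- p-value: The probability of an F-statistic at least this large if all group means were equal. Typically, a p-value less than 0.05 is considered significant, indicating a statistically significant difference between at least two of the diet groups.
- Conclusion: Here F = 2.174 and p = 0.125. Because the p-value exceeds 0.05, we fail to reject the null hypothesis: the simulated data do not show a statistically significant difference in mean weight loss among the three diets.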

#### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

- To conduct a two-way ANOVA, we need to examine the effects of two independent variables on a dependent variable. In this case, we have two independent variables: the software program (Program A, B, and C) and the employee experience level (novice vs. experienced). The dependent variable is the time it takes to complete the task.

- To perform a two-way ANOVA in Python, we can use the statsmodels library, which provides tools for conducting ANOVA and regression analysis. Since no specific data is provided, we will create a simulated dataset.

Here is the code to perform the two-way ANOVA:

In [4]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Simulate data
np.random.seed(42)  # For reproducibility

# Creating data for 30 employees (15 novice, 15 experienced)
data = {
    'Software': np.repeat(['A', 'B', 'C'], 10),
    'Experience': ['Novice'] * 5 + ['Experienced'] * 5 + ['Novice'] * 5 + ['Experienced'] * 5 + ['Novice'] * 5 + ['Experienced'] * 5,
    'Time': np.concatenate([
        np.random.normal(30, 5, 5),  # Program A - Novice
        np.random.normal(25, 5, 5),  # Program A - Experienced
        np.random.normal(35, 5, 5),  # Program B - Novice
        np.random.normal(30, 5, 5),  # Program B - Experienced
        np.random.normal(40, 5, 5),  # Program C - Novice
        np.random.normal(35, 5, 5)   # Program C - Experienced
    ])
}

# Create DataFrame
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Display the ANOVA table
print(anova_table)

                               sum_sq    df          F    PR(>F)
C(Software)                357.276664   2.0  10.947864  0.000418
C(Experience)              194.061665   1.0  11.893084  0.002092
C(Software):C(Experience)    3.360483   2.0   0.102974  0.902547
Residual                   391.612478  24.0        NaN       NaN


#### Interpreting the Results
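The ANOVA table generated by the code contains:

- F-statistics: For each factor (Software, Experience) and their interaction (Software × Experience), the ratio of the variance attributed to that effect to the residual (within-cell) variance.
- p-values: These indicate whether the main effects (Software, Experience) or the interaction effect (Software × Experience) are statistically significant. If a p-value is less than 0.05, the corresponding effect is considered significant.

For the simulated data, both main effects are significant: Software (F = 10.95, p = 0.0004) and Experience (F = 11.89, p = 0.002). The interaction is not (F = 0.10, p = 0.903), so there is no evidence that the effect of software differs between novice and experienced employees.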

#### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

To conduct a two-sample t-test and follow up with a post-hoc test if needed, you can use Python with libraries such as `scipy` and `statsmodels`. Here’s how you can do it:

**1. Two-Sample T-Test**
First, let’s perform the two-sample t-test using `scipy.stats.ttest_ind`.

In [5]:
import numpy as np
from scipy import stats

# Generate example data
# Assume we have test scores for each group
np.random.seed(42)  # For reproducibility
control_scores = np.random.normal(loc=75, scale=10, size=50)  # Traditional method
experimental_scores = np.random.normal(loc=78, scale=10, size=50)  # New method

# Perform the two-sample t-test
t_stat, p_value = stats.ttest_ind(control_scores, experimental_scores)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")


T-statistic: -3.0031208261723967
P-value: 0.003391318551039432


**2. Post-Hoc Test**
For the post-hoc test, if the two-sample t-test shows a significant difference, you might want to perform a more detailed analysis to determine which groups differ. If you have more than two groups, a common approach is to use ANOVA followed by Tukey's HSD test. Since we only have two groups here, the t-test is sufficient.

If you had multiple groups, you’d use `statsmodels` for ANOVA and Tukey's HSD as follows:

In [6]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Combine data into a DataFrame
data = pd.DataFrame({
    'scores': np.concatenate([control_scores, experimental_scores]),
    'group': ['Control'] * 50 + ['Experimental'] * 50
})

# Perform ANOVA
model = ols('scores ~ group', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# Run Tukey's HSD only if the ANOVA is significant
if anova_table.loc['group', 'PR(>F)'] < 0.05:
    tukey = pairwise_tukeyhsd(endog=data['scores'], groups=data['group'], alpha=0.05)
    print(tukey)


               sum_sq    df         F    PR(>F)
group      737.814378   1.0  9.018735  0.003391
Residual  8017.289732  98.0       NaN       NaN
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
Control Experimental   5.4325 0.0034 1.8427 9.0224   True
---------------------------------------------------------


#### Explanation:
**1. Two-Sample T-Test:** This test determines if there is a significant difference between the means of two independent groups. The null hypothesis is that the means of both groups are equal.

**2. Post-Hoc Test:** If you have more than two groups and ANOVA shows significant differences, Tukey's HSD helps identify which specific groups have significant differences.

In this case, with only two groups, the t-test should suffice. If the p-value is less than your significance level (usually 0.05), you can conclude there is a significant difference between the two teaching methods.
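
With exactly two groups, the pooled-variance t-test and a one-way ANOVA are equivalent: the F-statistic equals the squared t-statistic and the p-values agree, which is why the outputs above share the same p-value. The sketch below checks this on the same simulated scores and also shows Welch's t-test (`equal_var=False`), a common alternative when equal variances are in doubt.

```python
import numpy as np
from scipy import stats

# Same simulated scores as in the t-test example above
np.random.seed(42)
control_scores = np.random.normal(loc=75, scale=10, size=50)
experimental_scores = np.random.normal(loc=78, scale=10, size=50)

# Pooled-variance t-test and one-way ANOVA on the same two groups
t_stat, p_t = stats.ttest_ind(control_scores, experimental_scores)
f_stat, p_f = stats.f_oneway(control_scores, experimental_scores)

# Welch's t-test does not assume equal variances
t_welch, p_welch = stats.ttest_ind(control_scores, experimental_scores, equal_var=False)

print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
print(f"p (t-test) = {p_t:.6f}, p (ANOVA) = {p_f:.6f}")
print(f"Welch t = {t_welch:.4f}, p = {p_welch:.6f}")
```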

#### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

To perform a repeated measures ANOVA in Python and follow up with a post-hoc test, you can use libraries such as `pandas` for data manipulation, `scipy` for statistical tests, and `statsmodels` for ANOVA and post-hoc tests. Here’s a step-by-step guide to achieving this:

**1. Prepare Your Data**
Assume you have a DataFrame `df` with columns: `Day`, `Store`, and `Sales`.

In [7]:
import pandas as pd

# Example DataFrame creation with sample data
data = {
    'Day': list(range(1, 31)) * 3,
    'Store': ['Store A']*30 + ['Store B']*30 + ['Store C']*30,
    'Sales': list(range(100, 130)) * 3  # Replace this with actual sales data
}
df = pd.DataFrame(data)

**2. Conduct Repeated Measures ANOVA**

You can use the `statsmodels` library for this. First, make sure to install `statsmodels` if you haven’t already:

pip install statsmodels

Then, perform the ANOVA:

In [8]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit the model
model = ols('Sales ~ C(Store)', data=df).fit()

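# Compute the ANOVA table. Note: this OLS fit treats the 90 rows as independent
# and does not model Day as the repeated (subject) factor, so it is a one-way
# between-groups ANOVA; statsmodels' AnovaRM class fits the repeated measures design.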
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

                sum_sq    df             F  PR(>F)
C(Store)  2.466068e-26   2.0  1.591011e-28     1.0
Residual  6.742500e+03  87.0           NaN     NaN


**3. Check Results and Perform Post-Hoc Test**

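If the p-value from the ANOVA table is less than 0.05, you’ll need to perform a post-hoc test to determine which stores differ significantly. With the placeholder data above, every store has identical sales, so the p-value is 1.0 and no post-hoc test is actually required; the code is shown for completeness. Tukey's HSD test is commonly used for this, although `pairwise_tukeyhsd` treats the groups as independent samples; with repeated measures, paired comparisons with a multiplicity correction are a common alternative.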

Here’s how you can perform the Tukey's HSD test:

In [9]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Perform Tukey's HSD test
tukey = pairwise_tukeyhsd(endog=df['Sales'], groups=df['Store'], alpha=0.05)
print(tukey)

Multiple Comparison of Means - Tukey HSD, FWER=0.05
 group1  group2 meandiff p-adj lower upper reject
-------------------------------------------------
Store A Store B      0.0   1.0 -5.42  5.42  False
Store A Store C      0.0   1.0 -5.42  5.42  False
Store B Store C      0.0   1.0 -5.42  5.42  False
-------------------------------------------------


**Summary of Steps:**
1. Prepare the Data: Ensure it’s in the format needed for analysis.
2. Fit the Repeated Measures ANOVA Model: Use statsmodels to fit and analyze the model.
3. Perform Post-Hoc Test: If ANOVA results are significant, use Tukey’s HSD to identify specific differences.
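
The `ols` fit in step 2 does not account for the fact that all three stores are measured on the same days. A repeated measures analysis can be sketched with `AnovaRM` from `statsmodels`, treating `Day` as the subject and `Store` as the within-subject factor. The sales figures below are simulated under assumed store means and a shared day-to-day effect; they are hypothetical and not the placeholder data above.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Simulated data: each day has its own baseline shared by all three stores
rng = np.random.default_rng(42)
n_days = 30
day_effect = rng.normal(0, 10, n_days)
store_means = {"Store A": 100, "Store B": 105, "Store C": 110}  # hypothetical

rows = []
for store, mean in store_means.items():
    sales = mean + day_effect + rng.normal(0, 5, n_days)
    rows.extend(
        {"Day": day, "Store": store, "Sales": s}
        for day, s in enumerate(sales, start=1)
    )
df_rm = pd.DataFrame(rows)

# Day is the repeated "subject"; Store is the within-subject factor
rm_result = AnovaRM(df_rm, depvar="Sales", subject="Day", within=["Store"]).fit()
print(rm_result.anova_table)
```

`AnovaRM` expects balanced data with exactly one observation per day and store, so any missing values need to be handled before fitting.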