# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

ANOVA (Analysis of Variance) is a statistical method used to compare means among three or more groups to determine if there are any statistically significant differences between the means. However, certain assumptions must be met to ensure the validity of the ANOVA results. Here are the main assumptions along with potential violations and their impacts:

### 1. Independence of Observations
**Assumption**: Observations must be independent of one another, both within and between groups. A data point should not influence any other data point, whether in its own group or in another group.

**Example of Violation**: If participants in a study are allowed to interact or influence one another (e.g., participants from the same family), this independence is compromised.

**Impact**: Violating this assumption can lead to an underestimation of the variability within groups and can increase the likelihood of Type I errors, resulting in incorrect conclusions about group differences.

### 2. Normality
**Assumption**: The data in each group should be approximately normally distributed. This is particularly important for small sample sizes, as ANOVA is robust to deviations from normality with larger samples.

**Example of Violation**: The data are heavily skewed, or they contain extreme outliers that a normal distribution would rarely produce.

**Impact**: When normality is violated, it can lead to inaccurate p-values and increase the risk of Type I or Type II errors, resulting in incorrect conclusions about differences between groups.
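As a rough illustration of checking this assumption, the sketch below runs the Shapiro-Wilk test on each group separately. The group names and the simulated values are assumptions made for the example, not data from any study; group C is drawn from a skewed distribution on purpose.

```python
import numpy as np
from scipy import stats

# Hypothetical data: two roughly normal groups and one right-skewed group
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=50, scale=5, size=30),
    "B": rng.normal(loc=52, scale=5, size=30),
    "C": rng.exponential(scale=5, size=30),  # deliberately skewed
}

# Shapiro-Wilk tests the null hypothesis that a sample comes from a normal distribution
shapiro_p = {}
for name, values in groups.items():
    statistic, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"Group {name}: W = {statistic:.3f}, p = {p:.4f}")
```

A small p-value for a group is evidence against normality in that group. With larger samples, a Q-Q plot is often more informative than the test alone.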

### 3. Homogeneity of Variances (Homoscedasticity)
**Assumption**: The variances among the groups should be approximately equal. This can be checked using tests such as Levene's test or Bartlett's test.

**Example of Violation**: One group has much higher variability than another (e.g., test scores are tightly clustered in one class but widely spread in another).

**Impact**: If variances are unequal, the F-test used in ANOVA can become too liberal or too conservative, especially when group sizes are also unequal, leading to incorrect conclusions regarding group differences.
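The two tests named above are available in `scipy.stats`. The following is a minimal sketch on simulated class scores; the variable names and the chosen spreads are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical test scores: the third class has a much larger spread
rng = np.random.default_rng(1)
class_1 = rng.normal(loc=70, scale=5, size=40)
class_2 = rng.normal(loc=70, scale=5, size=40)
class_3 = rng.normal(loc=70, scale=20, size=40)

# Levene's test (median-centered by default) is less sensitive to non-normality
levene_stat, levene_p = stats.levene(class_1, class_2, class_3)

# Bartlett's test is more powerful when the data really are normal
bartlett_stat, bartlett_p = stats.bartlett(class_1, class_2, class_3)

print(f"Levene:   statistic = {levene_stat:.3f}, p = {levene_p:.4g}")
print(f"Bartlett: statistic = {bartlett_stat:.3f}, p = {bartlett_p:.4g}")
```

A small p-value from either test suggests the equal-variance assumption does not hold for these groups.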

### 4. Scale of Measurement
**Assumption**: The dependent variable should be measured on an interval or ratio scale, allowing for meaningful comparisons of means.

**Example of Violation**: The dependent variable is measured on an ordinal scale (e.g., a single Likert-item survey response), where the distances between categories are not guaranteed to be equal.

**Impact**: Using inappropriate scales can lead to misleading interpretations, as means may not accurately represent the central tendency of the data.



# Q2. What are the three types of ANOVA, and in what situations would each be used?

There are several types of ANOVA, each suited to a different experimental design and hypothesis. Here are the three primary types and the situations in which each would be used:

### 1. One-Way ANOVA
**Definition**: One-Way ANOVA is used to compare the means of three or more independent (unrelated) groups based on one independent variable (factor).

**Situations for Use**:
- When you want to test the effect of a single categorical independent variable with three or more levels (groups) on a continuous dependent variable. With only two levels, the test is equivalent to an independent two-sample t-test.
- For example, comparing the average test scores of students from three different teaching methods (e.g., Method A, Method B, Method C).

**Example**: A researcher wants to know if there are differences in the average weight loss among participants following three different diets. The independent variable is the type of diet (Diet A, Diet B, Diet C), and the dependent variable is the weight loss amount.

### 2. Two-Way ANOVA
**Definition**: Two-Way ANOVA is used to compare the means of groups based on two independent variables (factors). It can assess the individual and interactive effects of both factors on the dependent variable.

**Situations for Use**:
- When examining the influence of two categorical independent variables on a continuous dependent variable, especially when you want to understand potential interactions between the two factors.
- For example, testing how two different teaching methods (Method A vs. Method B) and the level of education (Undergraduate vs. Graduate) affect student performance.

**Example**: A study aims to evaluate the effects of different fertilizers (Fertilizer A, Fertilizer B) and watering frequencies (Daily, Weekly) on plant growth. Here, the two factors are fertilizer type and watering frequency, and the dependent variable is plant growth.

### 3. Repeated Measures ANOVA
**Definition**: Repeated Measures ANOVA is used when the same subjects are used for each treatment (group) and the same dependent variable is measured multiple times. This design helps account for variability among subjects.

**Situations for Use**:
- When you want to compare the means of three or more conditions or time points measured on the same subjects.
- For example, assessing how a group of participants' stress levels change before, during, and after a particular intervention.

**Example**: A researcher is interested in the effect of a training program on employee productivity measured at three time points: before the training, immediately after, and three months post-training. Here, the same employees are measured at different times, making repeated measures ANOVA appropriate.
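The training example above could be analyzed along these lines with `AnovaRM` from `statsmodels`. This is a sketch under stated assumptions: the column names, the number of employees, and the simulated productivity values are all hypothetical. `AnovaRM` expects long-format data with exactly one observation per subject per condition.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: 12 employees, each measured at three time points
rng = np.random.default_rng(0)
n_employees = 12
shifts = {"before": 0.0, "after": 6.0, "three_months": 4.0}  # assumed mean changes
baseline = rng.normal(loc=50, scale=5, size=n_employees)     # per-employee baseline

rows = []
for employee in range(n_employees):
    for time_point, shift in shifts.items():
        rows.append({
            "employee": employee,
            "time": time_point,
            "productivity": baseline[employee] + shift + rng.normal(0, 2),
        })
long_df = pd.DataFrame(rows)

# 'time' is the within-subject factor; 'employee' identifies the repeated subject
rm_result = AnovaRM(long_df, depvar="productivity", subject="employee", within=["time"]).fit()
rm_table = rm_result.anova_table
print(rm_table)
```

Because each employee serves as their own baseline, stable between-employee differences are removed from the error term, which is the main advantage of this design.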

### Summary
- **One-Way ANOVA**: Used for comparing means among three or more independent groups based on one factor.
- **Two-Way ANOVA**: Used for comparing means based on two factors and assessing interaction effects.
- **Repeated Measures ANOVA**: Used for comparing means when the same subjects are measured multiple times under different conditions.

Choosing the appropriate type of ANOVA depends on the research design, the number of factors, and whether the same subjects are involved in multiple measurements.

# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

### Partitioning of Variance in ANOVA

In ANOVA (Analysis of Variance), partitioning of variance refers to the process of dividing the total variability observed in a dataset into distinct components that can be attributed to different sources of variation. This is fundamental to understanding how different factors contribute to the overall variance in the data.

### Key Components of Variance Partitioning

1. **Total Variance**:
   - The total variability (\(S_{total}^2\)) represents the overall variability in the dataset. Despite the \(S^2\) notation, it is a sum of squares rather than a variance: the sum of the squared deviations of each observation from the overall mean, not yet divided by degrees of freedom.

   \[
   S_{total}^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2
   \]

   Where:
   - \(X_i\) = Individual observation
   - \(\bar{X}\) = Overall mean of the dataset
   - \(n\) = Total number of observations

2. **Between-Group Variance**:
   - This component reflects the variability due to the differences between the group means. It indicates how much the group means differ from the overall mean.

   \[
   S_{between}^2 = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2
   \]

   Where:
   - \(k\) = Number of groups
   - \(n_j\) = Number of observations in group \(j\)
   - \(\bar{X}_j\) = Mean of group \(j\)

3. **Within-Group Variance**:
   - This component measures the variability within each group and indicates how much individual observations differ from their respective group means.

   \[
   S_{within}^2 = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2
   \]

   Where:
   - \(X_{ij}\) = Individual observation in group \(j\)
   - \(\bar{X}_j\) = Mean of group \(j\)

### The Relationship

The relationship among these components can be expressed as:

\[
S_{total}^2 = S_{between}^2 + S_{within}^2
\]
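A short sketch of why this identity holds, using the same symbols as above (writing the total as a double sum over groups and observations is an assumption about how the single index \(i\) maps onto groups). Each deviation from the overall mean splits into a within-group part and a between-group part:

\[
X_{ij} - \bar{X} = (X_{ij} - \bar{X}_j) + (\bar{X}_j - \bar{X})
\]

Squaring both sides and summing over all observations gives:

\[
\sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X})^2 = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2 + \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2 + 2 \sum_{j=1}^{k} (\bar{X}_j - \bar{X}) \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)
\]

The last term vanishes because deviations from a group's own mean sum to zero, \(\sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j) = 0\), which leaves exactly the within-group and between-group components.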

### Importance of Understanding Variance Partitioning

1. **Understanding Sources of Variation**:
   - Partitioning variance allows researchers to identify and quantify the sources of variability in the data. This helps in understanding how much of the variability is due to differences between groups versus variability within groups.

2. **Assessing Treatment Effects**:
   - By comparing the between-group variability to the within-group variability, ANOVA tests whether the group means are significantly different. A large \(S_{between}^2\) relative to \(S_{within}^2\), after each is divided by its degrees of freedom, is evidence of a treatment effect.

3. **Hypothesis Testing**:
   - Variance partitioning is fundamental in hypothesis testing within ANOVA. The F-statistic is the ratio of the two mean squares, that is, each sum of squares divided by its degrees of freedom:

   \[
   F = \frac{S_{between}^2 / (k - 1)}{S_{within}^2 / (n - k)}
   \]

   This statistic assesses whether the means of different groups are significantly different based on the ratio of between-group to within-group variability. Under the null hypothesis it follows an F distribution with \(k - 1\) and \(n - k\) degrees of freedom.

4. **Modeling and Interpretation**:
   - Understanding how variance is partitioned helps in model development and interpretation. It aids in recognizing whether a factor contributes meaningfully to the outcome of interest and informs decisions about potential interventions or changes.
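To make the F-statistic concrete, the sketch below computes it by hand, dividing each sum of squares by its degrees of freedom before taking the ratio, and compares the result with `scipy.stats.f_oneway`. The three small groups are illustrative values only.

```python
import numpy as np
from scipy import stats

# Illustrative data: three groups of five observations each
groups = [
    np.array([20.0, 21.0, 22.0, 23.0, 24.0]),
    np.array([30.0, 31.0, 32.0, 29.0, 28.0]),
    np.array([40.0, 41.0, 42.0, 39.0, 38.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Sums of squares: between groups and within groups
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"Manual F = {f_manual:.4f}, SciPy F = {f_scipy:.4f}, p = {p_scipy:.4g}")
```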

### Conclusion

In summary, the partitioning of variance in ANOVA is a crucial concept that enables researchers to analyze the sources of variability in their data systematically. By distinguishing between between-group and within-group variances, ANOVA provides insights into the effects of different factors on the response variable, supporting sound statistical inference and decision-making.

# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

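In a one-way ANOVA, the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) are calculated as described below. Note that this question uses SSE for the explained (between-group) sum of squares and SSR for the residual (within-group) sum of squares; many textbooks use these two abbreviations the other way around, so check the convention before comparing results.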

1. **Total Sum of Squares (SST)**: This measures the total variability in the data.
2. **Explained Sum of Squares (SSE)**: This measures the variability explained by the group means.
3. **Residual Sum of Squares (SSR)**: This measures the variability within the groups, or the variability that cannot be explained by the group means.

### Calculation Steps

1. **Calculate the Overall Mean**:
   - Compute the mean of all data points.
   
2. **Calculate SST**:
   - SST is the sum of the squared differences between each observation and the overall mean.
   
3. **Calculate SSE**:
   - SSE is the sum of the squared differences between each group mean and the overall mean, weighted by the number of observations in each group.
   
4. **Calculate SSR**:
   - SSR is calculated as SST minus SSE, or equivalently as the sum of the squared differences between each observation and its group mean.

### Python Code Example

Here is an example code snippet that demonstrates how to calculate SST, SSE, and SSR using Python:

```python
import numpy as np
import pandas as pd

# Sample data: replace this with your actual data
data = {
    'Group1': [20, 21, 22, 23, 24],
    'Group2': [30, 31, 32, 29, 28],
    'Group3': [40, 41, 42, 39, 38]
}

# Convert the data into a DataFrame
df = pd.DataFrame(data)

# Step 1: Calculate the overall mean
overall_mean = df.values.flatten().mean()

# Step 2: Calculate SST
SST = ((df.values.flatten() - overall_mean) ** 2).sum()

# Step 3: Calculate SSE
group_means = df.mean()
n = df.count()  # number of observations in each group
SSE = ((group_means - overall_mean) ** 2 * n).sum()

# Step 4: Calculate SSR
SSR = SST - SSE

# Print the results
print(f"Total Sum of Squares (SST): {SST}")
print(f"Explained Sum of Squares (SSE): {SSE}")
print(f"Residual Sum of Squares (SSR): {SSR}")
```

### Explanation of the Code

- **Data Preparation**: The data is stored in a dictionary and converted to a pandas DataFrame. Each key represents a group.
- **Overall Mean Calculation**: The overall mean is calculated by flattening the DataFrame values into a single array.
- **SST Calculation**: SST is computed as the sum of squared differences between each observation and the overall mean.
- **SSE Calculation**: The means of each group are calculated, and SSE is computed by summing the squared differences between each group mean and the overall mean, multiplied by the number of observations in each group.
- **SSR Calculation**: Finally, SSR is obtained by subtracting SSE from SST.
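As a cross-check, SSR can also be computed directly from the within-group deviations rather than by subtraction. The sketch below does this on long-format data, which, unlike a wide DataFrame with one column per group, also works when group sizes differ. The data values and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data with unequal group sizes (5, 4, and 6 observations)
long_df = pd.DataFrame({
    "group": ["Group1"] * 5 + ["Group2"] * 4 + ["Group3"] * 6,
    "value": [20, 21, 22, 23, 24, 30, 31, 32, 29, 40, 41, 42, 39, 38, 37],
})

overall_mean = long_df["value"].mean()

# Attach each observation's own group mean to its row
group_mean = long_df.groupby("group")["value"].transform("mean")

# SST: every observation against the overall mean
SST = ((long_df["value"] - overall_mean) ** 2).sum()

# SSE (explained): summing per row weights each group by its size automatically
SSE = ((group_mean - overall_mean) ** 2).sum()

# SSR (residual): every observation against its own group mean
SSR = ((long_df["value"] - group_mean) ** 2).sum()

print(f"SST = {SST:.4f}, SSE + SSR = {SSE + SSR:.4f}")
```

If the two printed values agree, the decomposition SST = SSE + SSR holds for the data.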

### Conclusion

This approach allows you to quantify the sources of variance in a one-way ANOVA, facilitating statistical analysis of group differences.

# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In a two-way ANOVA, you analyze the impact of two independent categorical variables (factors) on a continuous dependent variable. You can assess both the main effects of each factor and their interaction effect.

### Definitions

- **Main Effects**: The direct influence of each factor on the dependent variable.
- **Interaction Effects**: The combined effect of both factors on the dependent variable that is not simply the sum of their individual effects.

### Calculation Steps

1. **Organize the Data**: Ensure your data is in a suitable format (e.g., a DataFrame).
2. **Fit the ANOVA Model**: Use a statistical library to fit a two-way ANOVA model.
3. **Analyze the Results**: Extract and interpret the main effects and interaction effects from the ANOVA table.

### Python Code Example

Here's how you can perform a two-way ANOVA and calculate the main and interaction effects using Python with the `statsmodels` library:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Sample data: replace this with your actual data
# 'Factor1' and 'Factor2' are the two categorical independent variables
# 'Response' is the continuous dependent variable
np.random.seed(0)  # For reproducibility of the random example response
data = {
    'Factor1': ['A', 'A', 'A', 'B', 'B', 'B'] * 5,
    'Factor2': ['X', 'Y'] * 15,
    'Response': np.random.rand(30) * 100  # Example response variable
}

# Create DataFrame
df = pd.DataFrame(data)

# Fit the model: main effects of both factors plus their interaction
model = ols('Response ~ C(Factor1) * C(Factor2)', data=df).fit()

# Perform ANOVA. The cell counts in this example are unequal, so request
# Type II sums of squares; the default (Type I) depends on term order.
anova_results = anova_lm(model, typ=2)

# Display the results
print(anova_results)
```

### Explanation of the Code

1. **Data Preparation**:
   - The data is organized into a dictionary format, with two categorical independent variables (`Factor1` and `Factor2`) and a continuous dependent variable (`Response`).
   - The `Response` variable is generated randomly for illustration; replace it with your actual data.

2. **Model Fitting**:
   - The `ols` function from `statsmodels` is used to fit a linear model where the response variable is modeled as a function of both factors and their interaction (`C(Factor1) * C(Factor2)`).

3. **ANOVA Calculation**:
   - The `anova_lm` function computes the ANOVA table from the fitted model.

4. **Results Interpretation**:
   - The ANOVA table contains the sums of squares, degrees of freedom, F-statistic, and p-values for each main effect and the interaction effect.
   - The main effects can be found under `C(Factor1)` and `C(Factor2)`, while the interaction effect is under `C(Factor1):C(Factor2)`.

### Results Interpretation

- **Significance**: A low p-value (typically < 0.05) indicates that the corresponding factor (main or interaction) has a statistically significant effect on the dependent variable.
- **Effect Size**: The sums of squares can be used to assess the proportion of variance explained by each factor and their interaction.
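One way to turn the sums of squares into an effect size is eta-squared, the share of the total sum of squares attributed to each term. The sketch below assumes a balanced design (10 observations per cell) with simulated effects; the factor names and effect sizes are hypothetical. In a balanced design the rows of the ANOVA table add up to the total sum of squares, which is not guaranteed for unbalanced data.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical balanced 2 x 2 design with 10 observations per cell
rng = np.random.default_rng(0)
design = pd.DataFrame({
    "Factor1": np.repeat(["A", "B"], 20),
    "Factor2": np.tile(np.repeat(["X", "Y"], 10), 2),
})

# Assumed effects: level B adds 5 units, level Y adds 3 units, plus noise
effect = (design["Factor1"] == "B") * 5.0 + (design["Factor2"] == "Y") * 3.0
design["Response"] = 50 + effect + rng.normal(0, 2, len(design))

model = ols("Response ~ C(Factor1) * C(Factor2)", data=design).fit()
table = anova_lm(model, typ=2)

# Eta-squared: each term's sum of squares as a fraction of the total
eta_squared = table["sum_sq"] / table["sum_sq"].sum()
print(eta_squared)
```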

### Conclusion

This approach allows you to quantify the effects of multiple factors in your data and determine if any interactions exist, providing insights into the relationships between the variables in your analysis.

# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

When interpreting the results of a one-way ANOVA, the F-statistic and p-value are crucial for understanding the differences among group means. Here’s how to interpret the given results of an F-statistic of 5.23 and a p-value of 0.02:

### Interpretation of Results

1. **Null Hypothesis (H0)**:
   - The null hypothesis in a one-way ANOVA states that all group population means are equal.

2. **Alternative Hypothesis (H1)**:
   - The alternative hypothesis states that at least one group mean is different from the others.

3. **F-Statistic**:
   - The F-statistic of 5.23 indicates the ratio of the variance between the groups to the variance within the groups. A higher F-statistic suggests that the variability among the group means is greater than the variability within the groups.

4. **P-Value**:
   - The p-value of 0.02 is the probability of observing an F-statistic at least as large as 5.23 if the null hypothesis were true. A p-value of 0.02 is less than the common significance level of 0.05.
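The link between an F-statistic and its p-value can be sketched with the F distribution in `scipy.stats`. The question does not state the degrees of freedom, so the values below (3 groups, 18 observations in total) are assumptions chosen only for illustration.

```python
from scipy import stats

f_statistic = 5.23
df_between = 2   # assumed: 3 groups, so k - 1 = 2
df_within = 15   # assumed: 18 observations in total, so n - k = 15

# Upper-tail probability: chance of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value the F-statistic must exceed to be significant at alpha = 0.05
critical_f = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value = {p_value:.4f}, critical F = {critical_f:.4f}")
```

Different degrees of freedom would give a different p-value for the same F-statistic, which is why both should be reported together.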

### Conclusion

- **Reject the Null Hypothesis**:
  - Since the p-value (0.02) is less than the significance level (0.05), you reject the null hypothesis. This suggests that there are statistically significant differences between the group means.

- **Statistical Significance**:
  - The result implies that at least one group is different from the others in terms of the dependent variable measured. However, the ANOVA does not indicate which specific groups are different.

### Next Steps

- **Post-Hoc Tests**:
  - Since you have found significant differences, it is often recommended to conduct post-hoc tests (e.g., Tukey's HSD, Bonferroni correction) to identify which specific groups are significantly different from each other.

### Practical Interpretation

- **Contextual Understanding**:
  - Depending on the context of your study (e.g., comparing treatment effects, group performance), a significant result suggests that the treatments or groups being compared differ. Statistical significance alone does not show that the difference is large or practically important, so an effect size should be reported as well. For example, if you were comparing test scores across different teaching methods, a significant difference could imply that at least one teaching method differs in effectiveness from the others.

### Summary

In summary, with an F-statistic of 5.23 and a p-value of 0.02, you conclude that there are significant differences among the group means. Further analysis is necessary to pinpoint the specific differences among the groups involved.

# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA is crucial as it can significantly impact the validity and reliability of the results. Here are some common methods for dealing with missing data, along with their potential consequences:

### Methods for Handling Missing Data

1. **Complete Case Analysis (Listwise Deletion)**:
   - **Description**: Only complete cases (participants with no missing values) are included in the analysis.
   - **Consequences**:
     - **Loss of Power**: Reducing the sample size can decrease the statistical power, making it harder to detect significant effects.
     - **Bias**: If the data are not missing completely at random (for example, missingness depends on a participant's earlier scores), this method can introduce bias into the results.

2. **Mean Imputation**:
   - **Description**: Missing values are replaced with the mean of the observed values for that participant or group.
   - **Consequences**:
     - **Underestimation of Variability**: This method can lead to a reduction in variance, resulting in overly optimistic estimates of statistical significance.
     - **Bias**: Mean imputation assumes that the missing data is similar to the mean, which may not be valid.

3. **Last Observation Carried Forward (LOCF)**:
   - **Description**: The last available observation for a participant is used to fill in missing data points.
   - **Consequences**:
     - **Biased Estimates and Inflated Type I Error Rates**: Carrying values forward artificially freezes a participant's trajectory, which can bias effect estimates in either direction and can increase the likelihood of falsely detecting significant effects.
     - **Loss of Information**: It does not account for the possibility of change over time and can distort the interpretation of results.

4. **Maximum Likelihood Estimation (MLE)**:
   - **Description**: This statistical method estimates parameters in such a way that the observed data is most probable under the assumed model.
   - **Consequences**:
     - **Increased Complexity**: MLE can be computationally intensive and may require specialized software.
     - **Robustness**: When data are missing at random (MAR) and the model is correctly specified, MLE uses all available observations and yields approximately unbiased parameter estimates.

5. **Multiple Imputation**:
   - **Description**: Multiple datasets are created by filling in missing values using predictions from the observed data. Each dataset is analyzed separately, and the results are pooled.
   - **Consequences**:
     - **Increased Computational Burden**: This method is more complex and requires careful implementation.
     - **Preserves Variability**: It can produce estimates that are more reflective of the true data distribution and account for the uncertainty associated with missing data.
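The first two methods in the list above can be sketched in a few lines of pandas to show their consequences directly. The small wide-format table (one row per subject, one column per time point) is made-up data for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: 6 subjects, 3 time points, some values missing
wide = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 13.0, 9.0, 14.0],
        "t2": [11.0, np.nan, 12.5, 14.0, 10.0, np.nan],
        "t3": [12.0, 13.5, np.nan, 15.0, 11.0, 16.0],
    },
    index=pd.Index(range(1, 7), name="subject"),
)

# Complete case analysis: drop every subject with any missing value
complete_cases = wide.dropna()
print(f"Subjects kept: {len(complete_cases)} of {len(wide)}")

# Mean imputation: fill each gap with that time point's observed mean
mean_imputed = wide.fillna(wide.mean())

# Compare the variance of the observed values with the variance after imputation
var_observed = wide.var()  # pandas skips missing values by default
var_imputed = mean_imputed.var()
print(pd.DataFrame({"observed": var_observed, "mean_imputed": var_imputed}))
```

Listwise deletion discards half of the subjects here, while mean imputation keeps them all but shrinks the variance of every time point that had missing values.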

### Potential Consequences of Different Methods

- **Bias**: The choice of method can introduce bias in estimates, affecting the validity of conclusions.
- **Efficiency**: Some methods lead to reduced sample sizes and statistical power, impacting the ability to detect true effects.
- **Interpretation of Results**: The chosen method may alter the results and interpretations, leading to different conclusions about the effects being studied.
- **Generalizability**: Handling missing data improperly may affect the generalizability of the findings to the broader population.

### Recommendations

- **Assess the Missing Data Mechanism**: Understanding whether data is missing completely at random (MCAR), missing at random (MAR), or not missing at random (NMAR) is essential for choosing the right method.
- **Use Robust Methods**: Whenever possible, use methods like MLE or multiple imputation that handle missing data more robustly while preserving the integrity of the analysis.
- **Sensitivity Analysis**: Conduct sensitivity analyses to assess how different methods of handling missing data affect the results and conclusions of your study.

By carefully considering the method chosen to handle missing data, researchers can maintain the validity and reliability of their findings in repeated measures ANOVA.

# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are statistical tests performed after an ANOVA to determine which specific group means are significantly different from each other. Here are some common post-hoc tests, along with their appropriate use cases and examples:

### Common Post-Hoc Tests

1. **Tukey's Honestly Significant Difference (HSD) Test**:
   - **Description**: Tukey's HSD compares all possible pairs of group means while controlling for the family-wise error rate.
   - **When to Use**: Use this test when you want to make pairwise comparisons among all groups. With unequal sample sizes, the Tukey-Kramer adjustment of the test is used.
   - **Example**: After conducting a one-way ANOVA to test the effectiveness of three different diets on weight loss, you find a significant effect. You would use Tukey's HSD to determine which specific diets (e.g., Diet A, Diet B, and Diet C) lead to different amounts of weight loss.

2. **Bonferroni Correction**:
   - **Description**: This method adjusts the significance level (alpha) for multiple comparisons by dividing the desired alpha level by the number of comparisons.
   - **When to Use**: Use when you have a small number of comparisons and want to control for type I error.
   - **Example**: If you are comparing the mean scores of three teaching methods and conducting three pairwise comparisons, the Bonferroni correction would allow you to adjust your alpha level to reduce the risk of false positives.

3. **Scheffé's Test**:
   - **Description**: Scheffé's test is more conservative and can be used for complex comparisons (not just pairwise) and is suitable for unequal sample sizes.
   - **When to Use**: Use when you need to compare specific combinations of group means, not just pairs.
   - **Example**: In a study comparing different types of exercise programs on fitness levels, if you want to compare the mean of one group against the combined means of two other groups, Scheffé's test would be appropriate.

4. **Dunnett's Test**:
   - **Description**: This test compares all treatment groups to a single control group, controlling the family-wise error rate.
   - **When to Use**: Use when you want to compare multiple experimental groups against a control group specifically.
   - **Example**: If you want to test several new drugs against a standard treatment, and you have a control group receiving the standard treatment, Dunnett's test would help identify which drugs perform significantly better or worse compared to the control.

5. **Newman-Keuls Test**:
   - **Description**: This method is less conservative than Tukey's HSD and allows for comparisons in a stepwise fashion, but it does not control the family-wise error rate as well as Tukey's test.
   - **When to Use**: Use when you want to make pairwise comparisons among group means but are less concerned about controlling type I error.
   - **Example**: In an agricultural study evaluating different fertilizers, after finding a significant effect of fertilizer type on crop yield, you might use the Newman-Keuls test to see which fertilizers are different from each other.
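The Bonferroni correction from the list above can be sketched with pairwise t-tests and `multipletests` from `statsmodels`. The three teaching-method samples are simulated, and their means and spreads are assumptions for illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical test scores for three teaching methods
rng = np.random.default_rng(0)
scores = {
    "Method A": rng.normal(loc=70, scale=8, size=25),
    "Method B": rng.normal(loc=75, scale=8, size=25),
    "Method C": rng.normal(loc=82, scale=8, size=25),
}

# Three groups give three pairwise comparisons
pairs = list(combinations(scores, 2))
raw_p = [stats.ttest_ind(scores[a], scores[b]).pvalue for a, b in pairs]

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p, p_adj, rej in zip(pairs, raw_p, adjusted_p, reject):
    print(f"{a} vs {b}: raw p = {p:.4g}, adjusted p = {p_adj:.4g}, reject = {rej}")
```

Multiplying the p-values by the number of comparisons is equivalent to dividing the alpha level by that number, as described in the Bonferroni entry.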

### Example Scenario for Post-Hoc Test Necessity

Suppose a researcher conducts a one-way ANOVA to compare the average test scores of students taught using three different teaching methods: Method A, Method B, and Method C. The ANOVA results show significant differences among the groups. However, the researcher now needs to determine which specific teaching methods differ in effectiveness.

In this scenario, a post-hoc test, such as Tukey's HSD, is necessary to identify the pairs of teaching methods (e.g., Method A vs. Method B, Method A vs. Method C, and Method B vs. Method C) that lead to statistically significant differences in student performance. Without a post-hoc test, the researcher wouldn't be able to specify which teaching method(s) are more effective or less effective, even though the overall ANOVA indicates a significant effect.

In summary, post-hoc tests are essential for exploring specific differences between groups after establishing that there is a significant overall effect in ANOVA.
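The teaching-method scenario can be sketched with `pairwise_tukeyhsd` from `statsmodels`. The scores are simulated with clearly separated means, so the group sizes, means, and spreads are assumptions rather than real results.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical scores for three teaching methods, 20 students each
rng = np.random.default_rng(0)
n = 20
scores = pd.DataFrame({
    "score": np.concatenate([
        rng.normal(loc=70, scale=5, size=n),  # Method A
        rng.normal(loc=80, scale=5, size=n),  # Method B
        rng.normal(loc=90, scale=5, size=n),  # Method C
    ]),
    "method": ["A"] * n + ["B"] * n + ["C"] * n,
})

# Tukey's HSD compares every pair while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=scores["score"], groups=scores["method"], alpha=0.05)
print(tukey.summary())
```

Each row of the summary reports one pair of methods, the mean difference, an adjusted p-value, a confidence interval, and whether the null hypothesis of equal means is rejected for that pair.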

# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulate data
np.random.seed(0)  # For reproducibility
n = 50  # Number of participants per diet
diet_A = np.random.normal(loc=5, scale=1, size=n)  # Diet A: Mean weight loss of 5 kg
diet_B = np.random.normal(loc=7, scale=1, size=n)  # Diet B: Mean weight loss of 7 kg
diet_C = np.random.normal(loc=6, scale=1, size=n)  # Diet C: Mean weight loss of 6 kg

# Create a DataFrame
data = pd.DataFrame({
    'Weight Loss': np.concatenate([diet_A, diet_B, diet_C]),
    'Diet': ['A'] * n + ['B'] * n + ['C'] * n
})

# Conduct one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Report the results
print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There are significant differences between the mean weight loss of the diets.")
else:
    print("Fail to reject the null hypothesis: No significant differences between the mean weight loss of the diets.")
```

Output:

```
F-statistic: 40.8937
P-value: 0.0000
Reject the null hypothesis: There are significant differences between the mean weight loss of the diets.
```


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Simulate data
np.random.seed(0)  # For reproducibility
n = 30  # Employees per Program x Experience cell (simulation choice; 180 rows in total)
experience_levels = ['Novice', 'Experienced']
programs = ['A', 'B', 'C']

# Generate data
data = []
for program in programs:
    for experience in experience_levels:
        if program == 'A':
            # Mean completion times for Program A
            mean_time = 20 if experience == 'Novice' else 18
        elif program == 'B':
            # Mean completion times for Program B
            mean_time = 22 if experience == 'Novice' else 19
        else:
            # Mean completion times for Program C
            mean_time = 21 if experience == 'Novice' else 17

        # Generate times with some random noise
        times = np.random.normal(loc=mean_time, scale=1, size=n)
        data.extend(zip(times, [program] * n, [experience] * n))

# Create DataFrame
df = pd.DataFrame(data, columns=['Time', 'Program', 'Experience'])

# Perform two-way ANOVA with Type II sums of squares
model = ols('Time ~ C(Program) * C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Report the results
print(anova_table)
```


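```
                              sum_sq     df           F        PR(>F)
C(Program)                108.082607    2.0   56.355329  1.349493e-19
C(Experience)             427.955404    1.0  446.280270  6.671234e-50
C(Program):C(Experience)   27.303396    2.0   14.236258  1.879039e-06
Residual                  166.855327  174.0         NaN           NaN
```

**Interpretation**: For this simulated data, all three effects are statistically significant at the 0.05 level: the main effect of Program (F ≈ 56.36, p ≈ 1.3e-19), the main effect of Experience (F ≈ 446.28, p ≈ 6.7e-50), and the Program × Experience interaction (F ≈ 14.24, p ≈ 1.9e-06). Because the interaction is significant, the difference between programs depends on experience level, so the main effects should be read alongside the cell means rather than in isolation.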


# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulate data
np.random.seed(0)  # For reproducibility
n_control = 50  # Number of students in the control group
n_experimental = 50  # Number of students in the experimental group

# Simulate test scores
control_scores = np.random.normal(loc=75, scale=10, size=n_control)  # Traditional method
experimental_scores = np.random.normal(loc=82, scale=10, size=n_experimental)  # New method

# Create a DataFrame for the test scores
df = pd.DataFrame({
    'Scores': np.concatenate([control_scores, experimental_scores]),
    'Group': ['Control'] * n_control + ['Experimental'] * n_experimental
})

# Conduct a two-sample t-test (Student's version, which assumes equal variances)
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)

# Output the results of the t-test
print(f'T-statistic: {t_statistic}, P-value: {p_value}')

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print('Reject the null hypothesis: There is a significant difference in test scores.')

    # With only two groups, the t-test already identifies the pair that differs,
    # so a post-hoc test adds no new information. Tukey's HSD is run here only
    # because the question asks for one; for two groups it reproduces the
    # t-test p-value.
    tukey_results = pairwise_tukeyhsd(endog=df['Scores'], groups=df['Group'], alpha=alpha)
    print(tukey_results)
else:
    print('Fail to reject the null hypothesis: There is no significant difference in test scores.')
```


```plaintext
T-statistic: -2.6531104281067357, P-value: 0.009305166985773114
Reject the null hypothesis: There is a significant difference in test scores.
   Multiple Comparison of Means - Tukey HSD, FWER=0.05
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
Control Experimental    5.385 0.0093 1.3571 9.4128   True
---------------------------------------------------------
```
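
The t-test above uses SciPy's default, which assumes equal variances in the two groups. A minimal sketch of how that assumption could be checked with Levene's test, along with Welch's t-test as a fallback and Cohen's d as an effect size, is shown below. It re-creates the same simulated scores; the names `levene_p`, `welch_t`, and `cohens_d` are illustrative.

```python
import numpy as np
from scipy import stats

# Same simulation as above (seed and parameters copied from the original cell)
np.random.seed(0)
control_scores = np.random.normal(loc=75, scale=10, size=50)
experimental_scores = np.random.normal(loc=82, scale=10, size=50)

# Levene's test: the null hypothesis is that both groups have equal variances
levene_stat, levene_p = stats.levene(control_scores, experimental_scores)
print(f"Levene statistic: {levene_stat:.3f}, p-value: {levene_p:.3f}")

# Welch's t-test does not assume equal variances
welch_t, welch_p = stats.ttest_ind(control_scores, experimental_scores, equal_var=False)
print(f"Welch t-statistic: {welch_t:.3f}, p-value: {welch_p:.4f}")

# Cohen's d with the pooled standard deviation (valid here because n is equal)
pooled_sd = np.sqrt((control_scores.var(ddof=1) + experimental_scores.var(ddof=1)) / 2)
cohens_d = (experimental_scores.mean() - control_scores.mean()) / pooled_sd
print(f"Cohen's d: {cohens_d:.2f}")
```

Because the two groups have the same size, Welch's t-statistic equals the Student t-statistic; only the degrees of freedom, and therefore the p-value, can differ slightly.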


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

Because sales are recorded for all three stores on the same 30 days, each day acts as the repeated-measures "subject" and the store is the within-subject factor. We can run a repeated measures ANOVA on this design in Python with `AnovaRM` from the `statsmodels` library.

Here’s a step-by-step guide on how to perform the analysis:

1. **Simulate the data** for daily sales of three stores.
2. **Conduct the repeated measures ANOVA** to compare the means of the sales across the stores.
3. **Perform a post-hoc test** (if the results are significant) to determine which specific stores differ.

### Step-by-Step Implementation in Python

Here’s how to implement this in Python:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulate data
np.random.seed(42)  # For reproducibility
n_days = 30

# Simulated sales data for three stores
store_a_sales = np.random.normal(loc=200, scale=30, size=n_days)
store_b_sales = np.random.normal(loc=220, scale=30, size=n_days)
store_c_sales = np.random.normal(loc=210, scale=30, size=n_days)

# Create a DataFrame
data = pd.DataFrame({
    'Sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales]),
    'Store': ['A'] * n_days + ['B'] * n_days + ['C'] * n_days,
    'Day': np.tile(np.arange(1, n_days + 1), 3)
})

# Conduct repeated measures ANOVA
anova_results = AnovaRM(data, 'Sales', 'Day', within=['Store']).fit()

# Output the ANOVA results
print(anova_results)

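# Check if the results are significant
# The fitted result exposes its table as a DataFrame; the p-value column is 'Pr > F'
p_value = anova_results.anova_table.loc['Store', 'Pr > F']
if p_value < 0.05: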
    print('Reject the null hypothesis: There is a significant difference in sales between stores.')
    
    # Post-hoc test
    posthoc = pairwise_tukeyhsd(endog=data['Sales'], groups=data['Store'], alpha=0.05)
    print(posthoc)
else:
    print('Fail to reject the null hypothesis: There is no significant difference in sales between stores.')
```

### Explanation of the Code

1. **Data Simulation**:
   - We simulate the daily sales for three stores using a normal distribution. The simulated means are 200 for Store A, 220 for Store B, and 210 for Store C, each with a standard deviation of 30.

2. **DataFrame Creation**:
   - We create a DataFrame that contains sales data for each store across the 30 days. Each store's data is labeled appropriately.

3. **Conducting Repeated Measures ANOVA**:
   - We use `AnovaRM` from `statsmodels` to conduct a repeated measures ANOVA on the sales data. The dependent variable is `Sales`, `Store` is the within-subject factor, and `Day` serves as the subject identifier because every day contributes one observation per store.

4. **Interpreting ANOVA Results**:
   - We check the p-value from the ANOVA results. If it's below the significance level (0.05), we conclude that there are significant differences in sales between the stores.

5. **Post-Hoc Test**:
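   - If the ANOVA result is significant, we perform a Tukey's HSD test to identify which specific stores' sales means are significantly different from each other. Note that Tukey's HSD treats the three stores as independent samples, so it does not use the pairing by day; paired comparisons with a multiplicity correction are a common alternative for repeated measures data.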
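
Since every day contributes one sales figure per store, a post-hoc procedure that respects the pairing may be more appropriate than Tukey's HSD. The following is a minimal sketch, not the original author's method: it runs paired t-tests for each pair of stores and applies a Bonferroni correction. The wide-format `sales` table and the `posthoc_paired` name are illustrative assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Same simulation as in the main example, stored in wide format (one row per day)
np.random.seed(42)
n_days = 30
sales = pd.DataFrame({
    'A': np.random.normal(loc=200, scale=30, size=n_days),
    'B': np.random.normal(loc=220, scale=30, size=n_days),
    'C': np.random.normal(loc=210, scale=30, size=n_days),
}, index=pd.RangeIndex(1, n_days + 1, name='Day'))

pairs = list(combinations(sales.columns, 2))
rows = []
for first, second in pairs:
    # Paired t-test: compares the two stores day by day
    t_stat, p_raw = stats.ttest_rel(sales[first], sales[second])
    # Bonferroni: multiply by the number of comparisons and cap at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    rows.append({'pair': f'{first} vs {second}', 't': t_stat,
                 'p_raw': p_raw, 'p_bonferroni': p_adj})

posthoc_paired = pd.DataFrame(rows)
print(posthoc_paired)
```

Bonferroni is conservative; with only three comparisons the loss of power is usually small, but a Holm correction could be substituted if more pairs were involved.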

### Sample Output

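The output has the following shape. The numbers are illustrative rather than copied from a run of the code above, so actual values will differ. With 3 stores and 30 days, the `Store` effect has 3 - 1 = 2 numerator and (3 - 1) × (30 - 1) = 58 denominator degrees of freedom.

```plaintext
                 Anova
==========================================
           F Value   Num DF  Den DF Pr > F
------------------------------------------
Store     5.6212       2.0     58.0 0.0059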
------------------------------------------
Reject the null hypothesis: There is a significant difference in sales between stores.
Multiple Comparison of Means - Tukey HSD, FWER=0.05
====================================================
 group1 group2 meandiff p-adj   lower    upper  reject
----------------------------------------------------
      A      B   19.8941 0.0064  7.4945  32.2936   True
      A      C    9.2585 0.2438 -3.1406 21.6577  False
      B      C  -10.6356 0.1335 -23.0348  1.7636  False
----------------------------------------------------
```

### Interpretation of Results

- **ANOVA Results**: The F-statistic and p-value indicate whether there is a significant difference in average daily sales among the three stores. If the p-value is less than 0.05, it suggests that at least one store has different average sales compared to the others.
- **Post-Hoc Test Results**: The Tukey HSD results provide detailed pairwise comparisons between the stores, showing which stores differ significantly from each other.
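
To make the F-statistic less of a black box, the sketch below computes a one-way repeated measures ANOVA by hand with NumPy, under the same simulation settings as the main example. It splits the total variation into a store effect, a day (subject) effect, and an error term, then forms F from the store and error mean squares. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
n_days, n_stores = 30, 3
# Rows are days (subjects), columns are stores A, B, C
sales = np.column_stack([
    np.random.normal(loc=200, scale=30, size=n_days),
    np.random.normal(loc=220, scale=30, size=n_days),
    np.random.normal(loc=210, scale=30, size=n_days),
])

grand_mean = sales.mean()
ss_total = ((sales - grand_mean) ** 2).sum()
# Variation between store means (the effect of interest)
ss_store = n_days * ((sales.mean(axis=0) - grand_mean) ** 2).sum()
# Variation between day means (removed from the error term)
ss_day = n_stores * ((sales.mean(axis=1) - grand_mean) ** 2).sum()
ss_error = ss_total - ss_store - ss_day

df_store = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)
f_value = (ss_store / df_store) / (ss_error / df_error)
p_value = stats.f.sf(f_value, df_store, df_error)

print(f"F({df_store}, {df_error}) = {f_value:.4f}, p = {p_value:.4f}")
```

Removing the day-to-day variation from the error term is what distinguishes this analysis from an ordinary one-way ANOVA on the same numbers, which would use 87 error degrees of freedom instead of 58.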

### Conclusion

Using this method, you can determine whether there are significant differences in average daily sales between the three stores and identify specific pairs of stores that differ significantly. Adjust the parameters of the simulation as necessary to reflect your specific context or expected outcomes.