### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

### Assumptions Required to Use ANOVA

Analysis of Variance (ANOVA) is a statistical technique used to compare means across multiple groups. For the results of an ANOVA to be valid, certain assumptions must be met:

1. **Independence of Observations**:
   - Each observation should be independent of others.
   - **Violation Example**: In a study where participants are measured multiple times, the observations are not independent.

2. **Normality**:
   - The residuals (equivalently, the data within each group) should be approximately normally distributed.
   - **Violation Example**: If the data in one group is heavily skewed, the normality assumption is violated.

3. **Homogeneity of Variances (Homoscedasticity)**:
   - The variances among the groups should be approximately equal.
   - **Violation Example**: If one group's variance is much larger or smaller than the others, this assumption is violated.

4. **Random Sampling**:
   - The data should be collected from a random sample of the population.
   - **Violation Example**: If a convenience sample is used instead of a random sample, this assumption is violated.

### Examples of Violations and Their Impact

1. **Independence of Observations**:
   - **Example**: In a classroom study where students' test scores are used, scores of students from the same group or classroom may not be independent.
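   - **Impact**: Students in the same classroom share a teacher and environment, so their scores tend to be positively correlated. Treating them as independent understates the true variability, which inflates the F-statistic and the Type I error rate.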

2. **Normality**:
   - **Example**: In a medical study measuring blood pressure, if the data for one group is highly skewed due to an outlier, this assumption is violated.
   - **Impact**: ANOVA is fairly robust to mild non-normality when group sizes are large and similar, but strong skew or outliers in small samples can distort the Type I error rate and reduce the power of the test.

3. **Homogeneity of Variances**:
   - **Example**: In an educational study comparing test scores across different schools, if one school has much more variability in scores than others, this assumption is violated.
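   - **Impact**: Unequal variances can distort the Type I error rate, especially when group sizes are also unequal: the F-test becomes too liberal when the smaller groups have the larger variances and too conservative in the opposite case.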

4. **Random Sampling**:
   - **Example**: In a market research study, if participants are selected based on convenience rather than randomly, this assumption is violated.
   - **Impact**: Non-random sampling can introduce bias, making it difficult to generalize the results to the broader population.

### How to Check for Assumptions

1. **Independence**:
   - This is usually determined by the study design.
   - Ensure the data collection process is structured to avoid dependency among observations.

2. **Normality**:
   - Use graphical methods such as Q-Q plots or statistical tests like the Shapiro-Wilk test to check for normality.

3. **Homogeneity of Variances**:
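   - Use Levene's test or Bartlett's test to assess the equality of variances. Bartlett's test is sensitive to non-normality, so Levene's test is usually the safer choice when normality is in doubt.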

4. **Random Sampling**:
   - Ensure the sampling method used is truly random and representative of the population.
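
The normality and equal-variance checks above can be sketched with `scipy.stats`. The three groups below are made-up illustrative data, and the group names are hypothetical; in practice the Shapiro-Wilk test is often applied to the model residuals rather than to each group separately.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three independent groups (illustrative values only)
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=50, scale=5, size=30),
    "B": rng.normal(loc=52, scale=5, size=30),
    "C": rng.normal(loc=55, scale=5, size=30),
}

# Normality: Shapiro-Wilk test within each group
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p

# Homogeneity of variances: Levene's test across all groups
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test is evidence against the corresponding assumption; large p-values do not prove the assumption holds, so a Q-Q plot is still worth inspecting.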

### What to Do if Assumptions are Violated

1. **Independence**:
   - Use a different statistical method that accounts for dependencies, such as mixed-effects models.

2. **Normality**:
   - Transform the data (e.g., log transformation) or use non-parametric tests like the Kruskal-Wallis test.

3. **Homogeneity of Variances**:
   - Use a different test such as Welch's ANOVA, which does not assume equal variances.

4. **Random Sampling**:
   - Improve the sampling method or use caution when interpreting the results, acknowledging the potential bias.
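
As a rough illustration of the normality remedies listed above, the sketch below applies a log transformation followed by a standard one-way ANOVA, and separately runs the Kruskal-Wallis test on the untransformed values. The skewed data are simulated and purely hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed data (e.g., reaction times) for three groups
rng = np.random.default_rng(1)
g1 = rng.lognormal(mean=3.0, sigma=0.5, size=25)
g2 = rng.lognormal(mean=3.2, sigma=0.5, size=25)
g3 = rng.lognormal(mean=3.4, sigma=0.5, size=25)

# Option 1: log-transform the values, then run the usual one-way ANOVA
f_stat, f_p = stats.f_oneway(np.log(g1), np.log(g2), np.log(g3))

# Option 2: Kruskal-Wallis test on the original values (rank-based)
h_stat, h_p = stats.kruskal(g1, g2, g3)

print(f"ANOVA on log-transformed data: F = {f_stat:.3f}, p = {f_p:.4f}")
print(f"Kruskal-Wallis on raw data:    H = {h_stat:.3f}, p = {h_p:.4f}")
```

Note that the two options test slightly different hypotheses: the ANOVA on logs compares means on the log scale, while Kruskal-Wallis compares the rank distributions of the groups.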


### Q2. What are the three types of ANOVA, and in what situations would each be used?

### Types of ANOVA and Their Uses

### Main Points:

1. **One-Way ANOVA**
   - Compares means of three or more independent groups based on one independent variable.
   - Used when there is one categorical independent variable with three or more levels (groups) and one continuous dependent variable.
   - Example: Comparing test scores of students from different teaching methods.

2. **Two-Way ANOVA**
   - Compares means of groups based on two independent variables and assesses the interaction effect between them.
   - Used when there are two categorical independent variables and one continuous dependent variable.
   - Example: Studying effects of different diets and exercise routines on weight loss.

3. **Repeated Measures ANOVA**
   - Used when the same subjects are measured multiple times under different conditions.
   - Used when there is one categorical independent variable with repeated measures and one continuous dependent variable.
   - Example: Measuring blood pressure of patients at different times after administering a drug.

### Summary Table:

| Type of ANOVA          | Number of Independent Variables | Situation Example                             |
|------------------------|---------------------------------|----------------------------------------------|
| One-Way ANOVA          | 1                               | Comparing test scores from different teaching methods |
| Two-Way ANOVA          | 2                               | Studying the effects of different diets and exercise routines on weight loss |
| Repeated Measures ANOVA| 1 (with repeated measures)      | Measuring blood pressure of patients at different times after administering a drug |
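
One-way and two-way ANOVA are worked through in code later in this document; for completeness, here is a minimal sketch of a repeated measures ANOVA using `AnovaRM` from `statsmodels`. The blood pressure values, the column names (`patient`, `time`, `bp`), and the three time points are all hypothetical.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: 5 patients, each measured at 3 time points
data = pd.DataFrame({
    "patient": [1, 2, 3, 4, 5] * 3,
    "time": ["t0"] * 5 + ["t1"] * 5 + ["t2"] * 5,
    "bp": [140, 135, 150, 145, 138,
           132, 130, 141, 139, 131,
           128, 126, 137, 133, 127],
})

# One within-subject factor (time); every patient needs exactly one
# observation per time point for AnovaRM to run
result = AnovaRM(data, depvar="bp", subject="patient", within=["time"]).fit()
print(result.anova_table)
```

`AnovaRM` requires a balanced design with no missing cells, which is one reason missing data needs special handling in repeated measures analyses.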


### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

### Partitioning of Variance in ANOVA

### What is Partitioning of Variance?

Partitioning of variance in ANOVA involves breaking down the total variability in the data into components attributable to different sources. This shows how much of the total variability is explained by the factors being studied and how much is due to random error. In a one-way design, the total sum of squares splits exactly into a between-group part and a within-group part: $\text{SST} = \text{SSB} + \text{SSW}$.

### Components of Variance

1. **Total Sum of Squares (SST)**
   - Represents the total variability in the data.
   - Calculated as the sum of the squared differences between each observation and the overall mean.
   - Formula:
     $$
     \text{SST} = \sum_{i=1}^{N} (X_i - \bar{X})^2
     $$

2. **Between-Group Sum of Squares (SSB)**
   - Represents the variability due to the differences between group means.
   - Calculated as the sum of the squared differences between each group mean and the overall mean, weighted by the number of observations in each group.
   - Formula:
     $$
     \text{SSB} = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2
     $$

3. **Within-Group Sum of Squares (SSW)**
   - Represents the variability within each group.
   - Calculated as the sum of the squared differences between each observation and its respective group mean.
   - Formula:
     $$
     \text{SSW} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2
     $$

### Importance of Understanding Partitioning of Variance

1. **Identifying Sources of Variability**
   - Helps in determining how much of the total variability is due to differences between groups (explained variability) and how much is due to random error (unexplained variability).

2. **Hypothesis Testing**
   - The partitioning of variance is crucial for conducting hypothesis tests in ANOVA. It allows for the calculation of the F-statistic, which is used to test if the group means are significantly different.

3. **Interpreting Results**
   - Understanding the proportion of total variability explained by the factors being studied (effect size) can provide insights into the strength and importance of the factors.

4. **Model Assessment**
   - Helps in assessing the goodness of fit of the model. A large between-group sum of squares relative to the within-group sum of squares indicates that group membership accounts for a large share of the variability in the data.
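
To make the link between the partition and the F-test explicit, the standard one-way ANOVA identities can be written as follows, with $k$ groups and $N$ total observations as in the formulas above. The symbols MSB and MSW (the between-group and within-group mean squares) are introduced here for illustration and are not used elsewhere in this document.

$$
\text{SST} = \text{SSB} + \text{SSW}, \qquad
\text{MSB} = \frac{\text{SSB}}{k - 1}, \qquad
\text{MSW} = \frac{\text{SSW}}{N - k}, \qquad
F = \frac{\text{MSB}}{\text{MSW}}, \qquad
\eta^2 = \frac{\text{SSB}}{\text{SST}}
$$

Each sum of squares is divided by its degrees of freedom before the ratio is taken, which is why the F-statistic is a ratio of mean squares rather than of raw sums of squares.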

### Summary

- **Total Sum of Squares (SST)**: Total variability in the data.
- **Between-Group Sum of Squares (SSB)**: Variability due to differences between group means.
- **Within-Group Sum of Squares (SSW)**: Variability within each group.

Understanding the partitioning of variance is essential for conducting ANOVA, interpreting the results, and assessing the significance and impact of the factors being studied.


### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

#### To calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python, you can follow these steps:

- Total Sum of Squares (SST): Measures the total variation in the data.
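- Explained Sum of Squares (SSE): Measures the variation explained by the groups. This is the between-group sum of squares (SSB) from Q3.
- Residual Sum of Squares (SSR): Measures the variation within the groups. This is the within-group sum of squares (SSW) from Q3.

Some textbooks use these abbreviations the other way round (SSE for the error sum of squares and SSR for the regression sum of squares), so always check the definitions being used.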

```python
import numpy as np

# Sample data for three groups
group1 = np.array([23, 25, 27, 30, 22])
group2 = np.array([31, 29, 27, 36, 38])
group3 = np.array([19, 20, 23, 21, 24])

# Combine all groups into a single array
all_groups = np.concatenate([group1, group2, group3])

# Overall mean
overall_mean = np.mean(all_groups)

# Group means
mean_group1 = np.mean(group1)
mean_group2 = np.mean(group2)
mean_group3 = np.mean(group3)

# Total sum of squares (SST)
sst = np.sum((all_groups - overall_mean) ** 2)

# Explained sum of squares (SSE)
sse = (len(group1) * (mean_group1 - overall_mean) ** 2 +
       len(group2) * (mean_group2 - overall_mean) ** 2 +
       len(group3) * (mean_group3 - overall_mean) ** 2)

# Residual sum of squares (SSR)
ssr = (np.sum((group1 - mean_group1) ** 2) +
       np.sum((group2 - mean_group2) ** 2) +
       np.sum((group3 - mean_group3) ** 2))

print(f'Total Sum of Squares (SST): {sst}')
print(f'Explained Sum of Squares (SSE): {sse}')
print(f'Residual Sum of Squares (SSR): {ssr}')
```

```text
Total Sum of Squares (SST): 443.33333333333337
Explained Sum of Squares (SSE): 298.13333333333355
Residual Sum of Squares (SSR): 145.2
```
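
As a sanity check on the calculation above, the following sketch confirms that the explained and residual sums of squares add up to the total, and that the F-statistic built from them agrees with `scipy.stats.f_oneway`. It reuses the same three sample groups; the variable names are chosen for this illustration.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([23, 25, 27, 30, 22]),
    np.array([31, 29, 27, 36, 38]),
    np.array([19, 20, 23, 21, 24]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Sums of squares, computed group by group
sst = np.sum((all_values - grand_mean) ** 2)
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # explained (between)
ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups)            # residual (within)

# F-statistic from mean squares: k - 1 and N - k degrees of freedom
k, n = len(groups), len(all_values)
f_manual = (sse / (k - 1)) / (ssr / (n - k))
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"SSE + SSR = {sse + ssr:.4f}, SST = {sst:.4f}")
print(f"F (manual) = {f_manual:.4f}, F (scipy) = {f_scipy:.4f}, p = {p_scipy:.4f}")
```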


### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

### Two-Way ANOVA: Calculating Main Effects and Interaction Effects Using Python

In a two-way ANOVA, the goal is to assess the main effects of two independent variables and their interaction effect on a dependent variable. Here's how you can calculate these effects using Python:

1. **Main Effects**: The effect of each independent variable on the dependent variable.
2. **Interaction Effect**: Whether the effect of one independent variable on the dependent variable depends on the level of the other independent variable.

We'll use the `statsmodels` library, which provides tools for conducting ANOVA.

First, install the `statsmodels` library if you haven't already:

```bash
pip install statsmodels
```

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
# Suppose we have two factors: Factor A (with levels A1, A2) and Factor B (with levels B1, B2)
# and a dependent variable (DV)
data = {
    'FactorA': np.repeat(['A1', 'A2'], 10),
    'FactorB': np.tile(np.repeat(['B1', 'B2'], 5), 2),
    'DV': [10, 15, 14, 10, 12, 20, 23, 21, 19, 22, 12, 14, 13, 11, 16, 21, 22, 20, 18, 21]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('DV ~ C(FactorA) + C(FactorB) + C(FactorA):C(FactorB)', data=df).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)
```

```text
                       sum_sq    df          F        PR(>F)
C(FactorA)                0.2   1.0   0.058394  8.121220e-01
C(FactorB)              320.0   1.0  93.430657  4.397672e-08
C(FactorA):C(FactorB)     3.2   1.0   0.934307  3.481307e-01
Residual                 54.8  16.0        NaN           NaN
```


### Explanation

#### Data Preparation

We create a dataset with two factors (`FactorA` and `FactorB`) and a dependent variable (`DV`).

#### Model Fitting

We use the `ols` function to fit an ordinary least squares regression model. The formula `'DV ~ C(FactorA) + C(FactorB) + C(FactorA):C(FactorB)'` specifies that we want to include main effects for `FactorA` and `FactorB`, as well as their interaction effect.

#### ANOVA Table

We use `sm.stats.anova_lm(model, typ=2)` to perform the ANOVA and get the ANOVA table with Type II sums of squares.

### Output

The ANOVA table contains the following columns:

- `sum_sq`: Sum of squares for each source of variation.
- `df`: Degrees of freedom for each source of variation.
- `F`: F-statistic for each source of variation.
- `PR(>F)`: p-value for the F-statistic.

This table provides information about the main effects of `FactorA` and `FactorB`, as well as their interaction effect. The significance of each effect can be determined by examining the p-values. In the output above, only `FactorB` has a significant main effect (F ≈ 93.43, p < 0.001); the main effect of `FactorA` (p ≈ 0.81) and the interaction (p ≈ 0.35) are not significant.
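
To see what the main effects and the interaction mean in terms of the data, it can help to look at the cell means directly. The sketch below rebuilds the same sample dataset and computes marginal means (main effects) and a simple interaction contrast; the variable names are chosen for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'FactorA': np.repeat(['A1', 'A2'], 10),
    'FactorB': np.tile(np.repeat(['B1', 'B2'], 5), 2),
    'DV': [10, 15, 14, 10, 12, 20, 23, 21, 19, 22,
           12, 14, 13, 11, 16, 21, 22, 20, 18, 21],
})

# Cell means: rows are levels of FactorA, columns are levels of FactorB
cell_means = df.groupby(['FactorA', 'FactorB'])['DV'].mean().unstack()

# Marginal means: differences between these reflect the main effects
means_a = df.groupby('FactorA')['DV'].mean()
means_b = df.groupby('FactorB')['DV'].mean()

# Interaction contrast: does the B2 - B1 difference change between A1 and A2?
b_effect_by_a = cell_means['B2'] - cell_means['B1']
interaction_contrast = b_effect_by_a['A2'] - b_effect_by_a['A1']

print(cell_means)
print(f"Main effect of A (A2 - A1): {means_a['A2'] - means_a['A1']:.2f}")
print(f"Main effect of B (B2 - B1): {means_b['B2'] - means_b['B1']:.2f}")
print(f"Interaction contrast: {interaction_contrast:.2f}")
```

The large B2 - B1 difference and the near-zero A2 - A1 difference line up with the ANOVA table, where only `FactorB` is significant.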

### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

### Interpretation of One-Way ANOVA Results

Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. Here's how you can interpret these results:

### Hypotheses in One-Way ANOVA

- **Null Hypothesis ($H_0$)**: The means of all groups are equal. There are no differences between the groups.
- **Alternative Hypothesis ($H_a$)**: At least one group mean is different from the others.

### F-Statistic and P-Value

- **F-Statistic (5.23)**: This value is the ratio of the between-group mean square to the within-group mean square. A higher F-statistic means the group means vary more than would be expected from the within-group variability alone.
- **P-Value (0.02)**: The p-value is the probability of observing an F-statistic at least as large as 5.23, assuming the null hypothesis is true. A lower p-value indicates stronger evidence against the null hypothesis.

### Interpretation

- **Significance Level ($\alpha$)**: Commonly, a significance level of 0.05 is used.
- **P-Value Comparison**: The obtained p-value (0.02) is less than the significance level (0.05).

### Conclusion

Since the p-value (0.02) is less than 0.05, we reject the null hypothesis. This means there is statistically significant evidence to conclude that there are differences between the group means. In other words, at least one group mean is significantly different from the others.

### Next Steps

- **Post-Hoc Tests**: To determine which specific groups differ from each other, you can perform post-hoc tests such as Tukey's HSD (Honestly Significant Difference) test.
- **Effect Size**: Consider calculating the effect size (e.g., eta squared) to understand the magnitude of the differences between groups.
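
As a rough sketch of the effect size step, eta squared for a one-way ANOVA can be recovered from the F-statistic and its degrees of freedom. The question does not state the degrees of freedom, so the values below (3 groups and 30 subjects) are assumptions made purely for illustration, and `eta_squared_from_f` is a hypothetical helper name.

```python
def eta_squared_from_f(f_stat: float, df_between: int, df_within: int) -> float:
    """Eta squared for a one-way ANOVA, recovered from F and its degrees of freedom.

    Uses SSB / SSW = F * df_between / df_within and eta^2 = SSB / (SSB + SSW).
    """
    return (f_stat * df_between) / (f_stat * df_between + df_within)


# Assumed degrees of freedom: 3 groups (df_between = 2), 30 subjects (df_within = 27)
eta_sq = eta_squared_from_f(5.23, df_between=2, df_within=27)
print(f"eta squared = {eta_sq:.3f}")
```

With different degrees of freedom the same F-statistic gives a different eta squared, which is why the F-statistic alone does not indicate the size of an effect.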

### Summary

The one-way ANOVA results indicate that there are significant differences between the group means, given an F-statistic of 5.23 and a p-value of 0.02. Further analysis through post-hoc tests can help identify the specific groups that differ.


### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

### Handling Missing Data in Repeated Measures ANOVA

In a repeated measures ANOVA, missing data can be a challenge because the same subjects are measured under different conditions or over time. Handling missing data appropriately is crucial to maintaining the integrity of the analysis.

### Methods to Handle Missing Data

1. **Listwise Deletion (Complete Case Analysis)**
    - **Description**: Exclude any subject with missing data from the analysis.
    - **Pros**: Simple to implement; maintains consistency in sample size across measurements.
    - **Cons**: Reduces sample size and statistical power; can introduce bias if the missing data are not completely random.

2. **Pairwise Deletion**
    - **Description**: Use all available data by excluding missing data on a case-by-case basis.
    - **Pros**: Uses more data compared to listwise deletion; can be more efficient.
    - **Cons**: Can lead to inconsistencies in the analysis; more complex to implement and interpret.

3. **Mean Imputation**
    - **Description**: Replace missing values with the mean of the observed values for that variable.
    - **Pros**: Simple to implement; retains all subjects in the analysis.
    - **Cons**: Underestimates variability; can bias parameter estimates downward.

4. **Last Observation Carried Forward (LOCF)**
    - **Description**: Replace missing values with the last observed value for that subject.
    - **Pros**: Retains all subjects in the analysis; useful in longitudinal studies.
    - **Cons**: Assumes stability of the variable over time; can bias results if this assumption is not met.

5. **Multiple Imputation**
    - **Description**: Replace missing values with multiple sets of simulated values to create several complete datasets, analyze each one, and then combine the results.
    - **Pros**: Accounts for uncertainty in the imputation process; provides more accurate standard errors and confidence intervals.
    - **Cons**: Computationally intensive; requires more advanced statistical knowledge and software.

6. **Maximum Likelihood Estimation (MLE)**
    - **Description**: Use all available data to estimate the model parameters directly.
    - **Pros**: Efficient use of all data; provides unbiased estimates under the assumption of missing at random (MAR).
    - **Cons**: Requires sophisticated software and understanding of likelihood-based methods.

### Potential Consequences of Different Methods

1. **Listwise Deletion**: Reduces the sample size, which can lead to a loss of statistical power and potentially biased results if the data are not missing completely at random (MCAR).

2. **Pairwise Deletion**: Can introduce inconsistencies in sample size across different analyses, complicating interpretation and possibly leading to biased results.

3. **Mean Imputation**: Reduces variability and can bias parameter estimates downward, potentially leading to incorrect conclusions.

4. **LOCF**: Assumes no change over time, which can lead to biased results if the assumption is incorrect. It can also underestimate variability.

5. **Multiple Imputation**: More accurate and robust, but computationally intensive and requires more advanced statistical knowledge.

6. **Maximum Likelihood Estimation**: Efficient and unbiased under the MAR assumption but requires sophisticated software and statistical understanding.

### Recommendations

- **Assess Missing Data Mechanism**: Before choosing a method, assess the pattern and mechanism of the missing data (e.g., MCAR, MAR, or not missing at random (NMAR)).
- **Use Advanced Methods**: If possible, prefer multiple imputation or MLE, as they provide more accurate and reliable results by accounting for the uncertainty in the missing data.
- **Sensitivity Analysis**: Perform sensitivity analyses to check how different methods of handling missing data affect the results.
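
The simpler methods above can be sketched in a few lines of `pandas`. The dataset below is hypothetical (five subjects, three time points, wide format); the example illustrates how listwise deletion shrinks the sample and how mean imputation shrinks the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 13.0, 9.0],
    "t2": [11.0, np.nan, 12.0, 14.0, 10.0],
    "t3": [12.0, 14.0, np.nan, 15.0, np.nan],
})

# Listwise deletion: keep only subjects with complete data
listwise = wide.dropna()

# Mean imputation: fill each time point with its observed mean
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each subject's last observed value forward across time points
locf = wide.ffill(axis=1)

print(f"Subjects kept after listwise deletion: {len(listwise)} of {len(wide)}")
print(f"Variance of t3 (observed only): {wide['t3'].var():.3f}")
print(f"Variance of t3 (mean-imputed):  {mean_imputed['t3'].var():.3f}")
```

Multiple imputation and likelihood-based mixed models need dedicated tooling and are not shown here.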

### Summary

Handling missing data in repeated measures ANOVA is crucial for maintaining the validity of the analysis. Different methods have various pros and cons, and the choice of method can significantly impact the results. Advanced methods like multiple imputation or maximum likelihood estimation are generally preferred for their accuracy and robustness.


### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

### Common Post-Hoc Tests Used After ANOVA

Post-hoc tests are performed after an ANOVA when the null hypothesis is rejected, indicating that at least one group mean is different from the others. These tests help determine which specific groups differ from each other. Here are some common post-hoc tests and when to use them:

### 1. Tukey's Honestly Significant Difference (HSD) Test
- **Use Case**: When you want to compare all possible pairs of group means.
- **Assumptions**: Equal variances across groups; the Tukey-Kramer variant handles unequal sample sizes.
- **Example**: Comparing the effectiveness of different teaching methods (e.g., traditional, online, hybrid) on student performance.

### 2. Bonferroni Correction
- **Use Case**: When you want to control the family-wise error rate by adjusting the significance level.
- **Assumptions**: Same as the original ANOVA test.
- **Example**: Multiple comparisons in a clinical trial where several treatments are tested against a control.

### 3. Scheffé's Method
- **Use Case**: When you need a conservative test that covers all possible contrasts (not only pairwise comparisons) and works with unequal sample sizes.
- **Assumptions**: Same as ANOVA, including equal variances; suited to complex comparisons involving linear combinations of group means.
- **Example**: Comparing different diet plans on weight loss where sample sizes differ and combined groups (e.g., the average of two diets versus a third) are of interest.

### 4. Dunnett's Test
- **Use Case**: When comparing multiple treatment groups to a single control group.
- **Assumptions**: Assumes equal variances across groups.
- **Example**: Testing new drug formulations against a standard control drug.

### 5. Holm's Sequential Bonferroni Procedure
- **Use Case**: When you need a stepwise method to control the family-wise error rate.
- **Assumptions**: Similar to the Bonferroni correction.
- **Example**: Comparing different fertilizer types on plant growth.

### 6. Fisher's Least Significant Difference (LSD) Test
- **Use Case**: When you want to perform multiple comparisons without adjusting for multiple testing (more liberal approach).
- **Assumptions**: Equal variances and normally distributed errors.
- **Example**: Initial screening of potential factors affecting product quality in a manufacturing process.
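
To illustrate how the Bonferroni and Holm adjustments differ, the sketch below applies both to the same set of unadjusted p-values using `multipletests` from `statsmodels`. The four p-values are hypothetical and stand in for the results of four pairwise comparisons.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical unadjusted p-values from four pairwise comparisons
raw_p = np.array([0.01, 0.015, 0.03, 0.20])

# Bonferroni: multiply every p-value by the number of comparisons
reject_bonf, p_bonf, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

# Holm: step-down procedure, never less powerful than Bonferroni
reject_holm, p_holm, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

print("Bonferroni adjusted p-values:", p_bonf, "reject:", reject_bonf)
print("Holm adjusted p-values:      ", p_holm, "reject:", reject_holm)
```

Both procedures control the family-wise error rate, but Holm rejects at least as many hypotheses as Bonferroni for the same data.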

### Example Situation

Suppose you conducted an ANOVA to evaluate the effectiveness of different study techniques (Group A: Flashcards, Group B: Highlighting, Group C: Summarization, Group D: Rereading) on exam scores. The ANOVA results show a significant difference between the groups (p-value < 0.05). To determine which specific groups differ, you can perform a post-hoc test.

### Performing Tukey's HSD Test in Python

Here's how you can perform Tukey's HSD test using `statsmodels`:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Sample data
data = {
    'StudyTechnique': np.repeat(['Flashcards', 'Highlighting', 'Summarization', 'Rereading'], 10),
    'ExamScore': [78, 85, 84, 79, 88, 85, 82, 86, 87, 84, 75, 73, 78, 77, 80, 79, 82, 81, 83, 84,
                  88, 85, 87, 86, 90, 88, 87, 89, 88, 90, 72, 70, 75, 74, 73, 75, 74, 76, 73, 72]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Perform Tukey's HSD test
tukey_results = pairwise_tukeyhsd(endog=df['ExamScore'], groups=df['StudyTechnique'], alpha=0.05)

print(tukey_results)
```

```text
       Multiple Comparison of Means - Tukey HSD, FWER=0.05        
   group1        group2    meandiff p-adj   lower    upper  reject
------------------------------------------------------------------
  Flashcards  Highlighting     -4.6 0.0026  -7.8319 -1.3681   True
  Flashcards     Rereading    -10.4    0.0 -13.6319 -7.1681   True
  Flashcards Summarization      4.0 0.0103   0.7681  7.2319   True
Highlighting     Rereading     -5.8 0.0001  -9.0319 -2.5681   True
Highlighting Summarization      8.6    0.0   5.3681 11.8319   True
   Rereading Summarization     14.4    0.0  11.1681 17.6319   True
------------------------------------------------------------------
```


### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

### One-Way ANOVA: Comparing Mean Weight Loss of Three Diets

A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. We'll conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets.

### Step-by-Step Process

1. **Import Libraries**: We'll use `pandas`, `numpy`, and `statsmodels` for the analysis.
2. **Prepare Data**: Create a dataset with the weight loss data for the three diets.
3. **Fit ANOVA Model**: Use `statsmodels` to fit the ANOVA model.
4. **Perform ANOVA**: Get the ANOVA table and extract the F-statistic and p-value.
5. **Interpret Results**: Determine if there are significant differences between the diets.


```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
np.random.seed(0)
diet_A = np.random.normal(loc=5, scale=1.5, size=17)  # Mean weight loss = 5, SD = 1.5
diet_B = np.random.normal(loc=6, scale=1.5, size=17)  # Mean weight loss = 6, SD = 1.5
diet_C = np.random.normal(loc=4.5, scale=1.5, size=16)  # Mean weight loss = 4.5, SD = 1.5

# Combine data into a DataFrame
data = {
    'WeightLoss': np.concatenate([diet_A, diet_B, diet_C]),
    'Diet': ['A'] * 17 + ['B'] * 17 + ['C'] * 16
}
df = pd.DataFrame(data)

# Fit the one-way ANOVA model
model = ols('WeightLoss ~ C(Diet)', data=df).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)
```

```text
              sum_sq    df         F    PR(>F)
C(Diet)    37.529415   2.0  7.175732  0.001907
Residual  122.906107  47.0       NaN       NaN
```


### Interpretation
The F-statistic is 7.1757, and the p-value is 0.0019. Since the p-value is less than the common significance level of 0.05, we reject the null hypothesis. This indicates that there are significant differences in mean weight loss among the three diets.

### Conclusion
The one-way ANOVA results suggest that the mean weight loss differs significantly between at least two of the diets (A, B, and C). To determine which specific diets differ, a post-hoc test such as Tukey's HSD can be performed.

### Performing Post-Hoc Test (Tukey's HSD)
To identify which specific diets differ, you can perform Tukey's HSD test:

```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD test
tukey_results = pairwise_tukeyhsd(endog=df['WeightLoss'], groups=df['Diet'], alpha=0.05)

print(tukey_results)
```

```text
Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     A      B  -0.1749 0.9467 -1.5172  1.1674  False
     A      C  -1.9383 0.0034 -3.3014 -0.5751   True
     B      C  -1.7634 0.0083 -3.1265 -0.4002   True
----------------------------------------------------
```


### Summary
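By conducting a one-way ANOVA, we found significant differences in mean weight loss between the three diets (F = 7.1757, p = 0.0019). Tukey's HSD test shows that diet C differs significantly from both diet A (mean difference = -1.94, adjusted p = 0.0034) and diet B (mean difference = -1.76, adjusted p = 0.0083), while diets A and B do not differ significantly (adjusted p = 0.9467). In this simulated sample, diet C therefore produced less weight loss than the other two diets.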

### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.


### To conduct a two-way ANOVA using Python, we'll follow these steps:

- **Create a synthetic dataset:** We will simulate data for 30 employees using three different software programs and classify them as novice or experienced.
- **Perform the two-way ANOVA:** We will use the statsmodels library to perform the two-way ANOVA and analyze the main effects and interaction effects.
- **Interpret the results:** We'll report the F-statistics and p-values and interpret them.
### Step 1: Create a synthetic dataset
Let's create a synthetic dataset where 30 employees are randomly assigned to one of three software programs and classified as novice or experienced. We'll also generate random task completion times for each combination of software and experience level.

```python
import pandas as pd
import numpy as np

# Set seed for reproducibility
np.random.seed(42)

# Generate data
data = {
    'Employee': np.arange(1, 31),
    'Program': np.random.choice(['Program A', 'Program B', 'Program C'], 30),
    'Experience': np.random.choice(['Novice', 'Experienced'], 30),
    'Time': np.random.normal(loc=50, scale=10, size=30)
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Adjust task completion times for interaction effects
for i in df.index:
    if df.loc[i, 'Program'] == 'Program A':
        df.loc[i, 'Time'] += 5 if df.loc[i, 'Experience'] == 'Novice' else -5
    elif df.loc[i, 'Program'] == 'Program B':
        df.loc[i, 'Time'] += 3 if df.loc[i, 'Experience'] == 'Novice' else -3
    # Program C: no adjustment for either experience level

df.head()
```

|   | Employee | Program   | Experience  | Time      |
|---|----------|-----------|-------------|-----------|
| 0 | 1        | Program C | Experienced | 43.993613 |
| 1 | 2        | Program A | Experienced | 42.083063 |
| 2 | 3        | Program C | Experienced | 43.982934 |
| 3 | 4        | Program C | Experienced | 68.522782 |
| 4 | 5        | Program A | Experienced | 44.865028 |


### Step 2: Perform the two-way ANOVA
We will use the statsmodels library to perform the two-way ANOVA and analyze the main effects and interaction effects.

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit the model
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=df).fit()

# Perform two-way ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

anova_table
```

|                          | sum_sq      | df   | F        | PR(>F)   |
|--------------------------|-------------|------|----------|----------|
| C(Program)               | 63.506916   | 2.0  | 0.336109 | 0.717855 |
| C(Experience)            | 117.699761  | 1.0  | 1.245847 | 0.275398 |
| C(Program):C(Experience) | 221.449736  | 2.0  | 1.172018 | 0.326849 |
| Residual                 | 2267.368865 | 24.0 | NaN      | NaN      |


### Step 3: Interpret the results
We will report the F-statistics and p-values from the ANOVA table and interpret them.

### ANOVA Results

| Source             | Sum of Squares | Df | F-statistic | p-value  |
|--------------------|----------------|----|-------------|----------|
| Program            | 63.506916      | 2  | 0.336109    | 0.717855 |
| Experience         | 117.699761     | 1  | 1.245847    | 0.275398 |
| Program:Experience | 221.449736     | 2  | 1.172018    | 0.326849 |
| Residual           | 2267.368865    | 24 | NA          | NA       |

### Interpretation

- The F-statistic for `Program` is 0.34 with a p-value of 0.718, indicating that there is no significant main effect of the software program on task completion time.
- The F-statistic for `Experience` is 1.25 with a p-value of 0.275, suggesting that there is no significant main effect of employee experience on task completion time.
- The F-statistic for the interaction effect `Program:Experience` is 1.17 with a p-value of 0.327, indicating that there is no significant interaction effect between the software program and employee experience level on task completion time.

Based on these results, we find no statistically significant main effect of software program or of experience level on task completion time, and no significant interaction between them. With only 30 employees spread across six program-by-experience cells, the test has limited power, so a non-significant result here does not show that the effects are absent.
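
Because programs and experience levels are drawn independently at random in Step 1, the six cells will generally not contain equal numbers of employees. A quick way to check this is to tabulate the cell sizes; the sketch below regenerates the two factor columns with the same seed and counts the employees per cell. When the design is unbalanced, the choice of sums-of-squares type (`typ` in `anova_lm`) can change the results.

```python
import numpy as np
import pandas as pd

# Same seed and same order of random draws as in Step 1
np.random.seed(42)
program = np.random.choice(['Program A', 'Program B', 'Program C'], 30)
experience = np.random.choice(['Novice', 'Experienced'], 30)

# Number of employees in each Program x Experience cell
cell_counts = pd.crosstab(pd.Series(program, name='Program'),
                          pd.Series(experience, name='Experience'))
print(cell_counts)
```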

### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

### To conduct a two-sample t-test in Python, we'll follow these steps:

- **Create a synthetic dataset:** Simulate test scores for 100 students assigned to either a control group (traditional teaching method) or an experimental group (new teaching method).
- **Perform the two-sample t-test:** Use the scipy.stats library to perform the t-test.
- **Interpret the results:** Report the t-statistic and p-value and interpret them.

- **Follow up with a post-hoc test if results are significant:** Perform additional analysis to identify which groups differ significantly if the initial test is significant.

### Step 1: Create a Synthetic Dataset
We'll create a synthetic dataset for 100 students with their test scores.

In [3]:
import pandas as pd
import numpy as np

# Set seed for reproducibility
np.random.seed(42)

# Generate data
data = {
    'Student': np.arange(1, 101),
    'Group': np.random.choice(['Control', 'Experimental'], 100),
    'Score': np.random.normal(loc=75, scale=10, size=100)
}

# Add a 5-point shift to the experimental group (vectorized, in place)
data['Score'][data['Group'] == 'Experimental'] += 5

# Convert to DataFrame
df = pd.DataFrame(data)

df.head()


,Student,Group,Score
0,1,Control,82.384666
1,2,Experimental,81.713683
2,3,Control,73.843517
3,4,Control,71.988963
4,5,Control,60.21478


### Step 2: Perform the Two-Sample T-Test
We'll use the `scipy.stats` library to perform the t-test.

In [4]:
from scipy.stats import ttest_ind

# Separate the scores by group
control_scores = df[df['Group'] == 'Control']['Score']
experimental_scores = df[df['Group'] == 'Experimental']['Score']

# Perform the two-sample t-test
t_stat, p_value = ttest_ind(control_scores, experimental_scores)

t_stat, p_value


(-1.7936665923984396, 0.07595057557272443)
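
To see where such a statistic comes from, the pooled-variance (Student's) t-statistic can be computed by hand and compared with `ttest_ind`. This is a minimal sketch on hypothetical data drawn with `np.random.default_rng`, not the dataset above, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical scores, generated only for illustration
rng = np.random.default_rng(0)
control = rng.normal(loc=75, scale=10, size=50)
experimental = rng.normal(loc=80, scale=10, size=50)

n1, n2 = control.size, experimental.size

# Pooled variance: sample variances weighted by their degrees of freedom
pooled_var = ((n1 - 1) * control.var(ddof=1)
              + (n2 - 1) * experimental.var(ddof=1)) / (n1 + n2 - 2)

# Standard error of the difference in means
se_diff = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# t = difference in means / standard error, with n1 + n2 - 2 degrees of freedom
t_manual = (control.mean() - experimental.mean()) / se_diff
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

# scipy's default (equal_var=True) uses the same pooled-variance formula
t_scipy, p_scipy = stats.ttest_ind(control, experimental)

print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

The sign of the statistic depends only on the order of the arguments: passing the control group first gives a negative t when the experimental mean is higher.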

### Interpretation
Based on the results of the two-sample t-test:

- The t-statistic is approximately -1.79.
- The p-value is approximately 0.076.

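Since the p-value (0.076) is greater than 0.05, we fail to reject the null hypothesis at the 5% significance level: the data do not provide sufficient evidence of a difference in mean test scores between the control group (traditional teaching method) and the experimental group (new teaching method). Because the synthetic scores were generated with a 5-point shift for the experimental group, this non-significant result reflects limited power at this sample size rather than the absence of an effect.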
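
The `ttest_ind` call above relies on the default `equal_var=True`, which assumes both groups share the same variance. A hedged sketch of two common companions to that test follows: Welch's t-test (`equal_var=False`), which drops the equal-variance assumption, and Cohen's d as a standardized effect size. The data here are hypothetical and regenerated for illustration, so the numbers will not match the output above.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical scores, generated only for illustration
rng = np.random.default_rng(1)
control = rng.normal(loc=75, scale=10, size=50)
experimental = rng.normal(loc=80, scale=10, size=50)

# Welch's t-test: does not assume equal variances in the two groups
t_welch, p_welch = ttest_ind(control, experimental, equal_var=False)

# Cohen's d: mean difference divided by the pooled standard deviation
n1, n2 = control.size, experimental.size
pooled_sd = np.sqrt(((n1 - 1) * control.var(ddof=1)
                     + (n2 - 1) * experimental.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (experimental.mean() - control.mean()) / pooled_sd

print(f"Welch t = {t_welch:.3f}, p = {p_welch:.3f}")
print(f"Cohen's d = {cohens_d:.3f}")
```

Reporting an effect size alongside the p-value helps separate "no detectable effect" from "an effect too small for this sample size to detect".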

### Conclusion
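The t-test does not show a statistically significant difference in test scores between the two groups, so no follow-up test is performed. With only two groups, a significant t-test would already identify which groups differ, so a post-hoc test would be unnecessary in either case.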

### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

### To conduct a repeated measures ANOVA in Python, we'll follow these steps:

- **Create a synthetic dataset:** Simulate daily sales data for 30 days for three stores (Store A, Store B, Store C).
- **Perform the repeated measures ANOVA:** Use the statsmodels library to perform the analysis.
- **Interpret the results:** Report the F-statistic and p-value and interpret them.
- **Follow up with a post-hoc test if results are significant:** Use a post-hoc test to determine which store(s) differ significantly if the initial test is significant.

In [6]:
import pandas as pd
import numpy as np
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Set seed for reproducibility
np.random.seed(42)

# Generate data
days = np.arange(1, 31)
store_a_sales = np.random.normal(loc=200, scale=20, size=30)
store_b_sales = np.random.normal(loc=220, scale=20, size=30)
store_c_sales = np.random.normal(loc=210, scale=20, size=30)

# Create DataFrame
df = pd.DataFrame({
    'Day': days,
    'Store_A': store_a_sales,
    'Store_B': store_b_sales,
    'Store_C': store_c_sales
})

# Reshape the DataFrame to long format
df_long = pd.melt(df, id_vars=['Day'], value_vars=['Store_A', 'Store_B', 'Store_C'], 
                var_name='Store', value_name='Sales')

# Perform the repeated measures ANOVA
aovrm = AnovaRM(df_long, 'Sales', 'Day', within=['Store'])
res = aovrm.fit()

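# Perform the post-hoc test if the results are significant
# (.iloc[0] selects the first row by position and avoids the pandas
#  FutureWarning raised by integer-key lookup on a labeled Series)
if res.anova_table['Pr > F'].iloc[0] < 0.05:
    # Note: Tukey's HSD treats the stores as independent groups and ignores the pairing by day
    posthoc = pairwise_tukeyhsd(df_long['Sales'], df_long['Store'])
    posthoc_results = posthoc.summary()
else:
    posthoc_results = "No significant differences found, no post-hoc test needed."

df.head(), res.anova_table, posthoc_results

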
(   Day     Store_A     Store_B     Store_C
 0    1  209.934283  207.965868  200.416515
 1    2  197.234714  257.045564  206.286820
 2    3  212.953771  219.730056  187.873301
 3    4  230.460597  198.845781  186.075868
 4    5  195.316933  236.450898  226.250516,
          F Value  Num DF  Den DF    Pr > F
 Store  10.340843     2.0    58.0  0.000144,
 <class 'statsmodels.iolib.table.SimpleTable'>)
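
Because the same 30 days are measured for every store, a post-hoc procedure that respects the pairing may be more appropriate than Tukey's HSD, which assumes independent groups. One common option, shown here as a sketch rather than as the method used above, is to run paired t-tests (`scipy.stats.ttest_rel`) on each pair of stores and apply a Bonferroni correction. The data are regenerated with the same seed and parameters as the cell above; the column names of `posthoc_paired` are illustrative.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel

# Regenerate wide-format data: one row per day, one column per store
np.random.seed(42)
df = pd.DataFrame({
    'Store_A': np.random.normal(loc=200, scale=20, size=30),
    'Store_B': np.random.normal(loc=220, scale=20, size=30),
    'Store_C': np.random.normal(loc=210, scale=20, size=30),
})

pairs = list(combinations(df.columns, 2))

rows = []
for store_1, store_2 in pairs:
    # Paired t-test: compares the two stores day by day
    t_stat, p_raw = ttest_rel(df[store_1], df[store_2])
    # Bonferroni correction: multiply by the number of comparisons, cap at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    rows.append({'store_1': store_1, 'store_2': store_2,
                 't': t_stat, 'p_raw': p_raw, 'p_bonferroni': p_adj})

posthoc_paired = pd.DataFrame(rows)
print(posthoc_paired)
```

Pairs whose `p_bonferroni` falls below 0.05 would be reported as significantly different. Bonferroni is conservative; Holm's step-down adjustment is a frequently used alternative.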