In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

In [None]:
ANOVA (Analysis of Variance) is a statistical method used to compare means among three or more groups. 
To ensure the validity of the results obtained from ANOVA, several assumptions must be met:

### Assumptions of ANOVA

1. **Independence of Observations**:
   - Each sample must be independent of the others. This means the data collected from one group should not influence 
the data collected from another group.
   - **Violation Example**: If participants in one group influence or communicate with participants in another group 
    (e.g., in a study comparing teaching methods, if students from different groups share study materials), the 
    independence assumption is violated.

2. **Normality**:
   - The residuals (equivalently, the observations within each group) should be approximately normally distributed. 
This assumption is particularly important for smaller sample sizes.
   - **Violation Example**: If the data for one or more groups are heavily skewed or have outliers (e.g., a group of 
    students scoring exceptionally high or low on a test), this can affect the normality assumption.

3. **Homogeneity of Variances (Homoscedasticity)**:
   - The variances among the groups should be approximately equal. This ensures that the comparison of means is valid.
   - **Violation Example**: If one group has a much larger variance than another (e.g., one group has very consistent 
    scores while another group has widely varying scores), this can lead to inaccurate results. Levene's test can be 
    used to check for this assumption.

### Impact of Violations

- **Independence**: Violating this assumption can lead to underestimated standard errors and inflated Type I error 
rates, meaning you might incorrectly reject the null hypothesis.
  
- **Normality**: If the normality assumption is violated, particularly with small sample sizes, the F-test may become
less reliable. With larger samples, the Central Limit Theorem provides some robustness, but extreme deviations can 
still impact results.

- **Homogeneity of Variances**: If this assumption is violated, the ANOVA F-statistic may be biased, leading to 
unreliable results. This can result in either an increased risk of Type I errors (falsely rejecting the null hypothesis)
or Type II errors (failing to reject the null hypothesis when it is false).

### Remedies for Violations

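- **For Independence**: Ensure proper experimental design, such as random assignment. If observations are inherently 
related (e.g., repeated measurements on the same subjects), use a method that models the dependence, such as 
repeated measures ANOVA.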
  
- **For Normality**: Use transformations (e.g., log, square root) on the data, or use non-parametric alternatives 
like the Kruskal-Wallis test if normality cannot be achieved.

- **For Homogeneity of Variances**: If variances are unequal, you can use Welch’s ANOVA, which is robust to violations
of this assumption.
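
The checks and the non-parametric fallback mentioned above can be sketched with `scipy.stats`. This is a minimal illustration, not a full diagnostic workflow: the three groups are simulated (hypothetical data), and group `C` is deliberately given a larger spread so that Levene's test has something to detect.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three groups (simulated, for illustration only)
rng = np.random.default_rng(42)
groups = {
    "A": rng.normal(loc=70, scale=5, size=30),
    "B": rng.normal(loc=75, scale=5, size=30),
    "C": rng.normal(loc=72, scale=15, size=30),  # deliberately larger spread
}
samples = list(groups.values())

# Normality: Shapiro-Wilk test per group (H0: the sample is normally distributed)
for name, values in groups.items():
    w_stat, p_norm = stats.shapiro(values)
    print(f"Shapiro-Wilk, group {name}: W = {w_stat:.3f}, p = {p_norm:.3f}")

# Homogeneity of variances: Levene's test (H0: all group variances are equal)
lev_stat, p_levene = stats.levene(*samples)
print(f"Levene: W = {lev_stat:.3f}, p = {p_levene:.4f}")

# Non-parametric alternative: Kruskal-Wallis test on ranks
h_stat, p_kruskal = stats.kruskal(*samples)
print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {p_kruskal:.4f}")
```

A small p-value from Shapiro-Wilk or Levene's test is evidence against the corresponding assumption. With large samples these tests flag even minor deviations, so they are best read alongside plots such as histograms or Q-Q plots.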

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?

In [None]:
There are three main types of ANOVA, each suited for different experimental designs and research questions.
Here’s a brief overview of each type and when to use them:

### 1. One-Way ANOVA

**Definition**: One-way ANOVA is used to compare the means of three or more independent groups based on one 
    categorical independent variable (factor).

**When to Use**:
- When you have one independent variable with three or more levels (groups) and you want to see if there are 
significant differences in the dependent variable.
- Example Situations:
  - Comparing the average test scores of students from three different teaching methods (e.g., traditional, 
    online, hybrid).
  - Assessing the effect of three different diets on weight loss.

### 2. Two-Way ANOVA

**Definition**: Two-way ANOVA is used to examine the influence of two independent categorical variables on a 
    dependent variable. It also allows for the examination of interaction effects between the two factors.

**When to Use**:
- When you have two independent variables and you want to see how they individually and interactively affect the 
dependent variable.
- Example Situations:
  - Investigating the effects of both teaching method (traditional vs. online) and student gender (male vs. female) 
on test scores.
  - Analyzing the impact of different fertilizers (factor 1) and watering frequency (factor 2) on plant growth.

### 3. Repeated Measures ANOVA

**Definition**: Repeated measures ANOVA is used when the same subjects are measured multiple times under different 
    conditions. It accounts for the correlation between repeated measurements on the same subjects.

**When to Use**:
- When you have one or more independent variables and the same subjects are measured at multiple time points or 
under different conditions.
- Example Situations:
  - Assessing the effectiveness of a new drug by measuring patients’ blood pressure at several time points before 
and after treatment.
  - Comparing the performance of athletes on a fitness test before training, after a month of training, and after 
    two months of training.

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
Partitioning of variance in ANOVA is a fundamental concept that helps explain how the total variability in a dataset
can be divided into different components. Understanding this concept is crucial for interpreting the results of ANOVA
and for identifying the sources of variation in your data.

### Components of Variance in ANOVA

1. **Total Variance (SST)**:
   - This is the overall variability in the data, calculated as the sum of the squared differences between each
observation and the grand mean (the mean of all observations).
   - Formula: 
     \[
     SST = \sum_{k} \sum_{i} (X_{ik} - \bar{X})^2
     \]
   where \(X_{ik}\) is observation \(i\) in group \(k\), and \(\bar{X}\) is the grand mean.

2. **Between-Group Variance (SSB)**:
   - This component measures the variability due to the differences between the group means. It reflects how much 
the group means vary from the grand mean.
   - Formula:
     \[
     SSB = \sum_{k} n_k (\bar{X}_k - \bar{X})^2
     \]
   where \(n_k\) is the number of observations in group \(k\), \(\bar{X}_k\) is the mean of group \(k\), and \(\bar{X}\)
is the grand mean.

3. **Within-Group Variance (SSW)**:
   - This component measures the variability within each group. It reflects how much the individual observations 
within each group deviate from their respective group means.
   - Formula:
     \[
     SSW = \sum_{k} \sum_{i} (X_{ik} - \bar{X}_k)^2
     \]
   where \(X_{ik}\) is observation \(i\) in group \(k\) and \(\bar{X}_k\) is the mean of group \(k\).

### Relationship Between Components

The total variance can be expressed as the sum of the between-group variance and the within-group variance:
\[
SST = SSB + SSW
\]
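
The identity can be checked numerically. The sketch below uses three small hypothetical groups (made-up values chosen so the arithmetic is easy to follow by hand) and computes each sum of squares directly from its definition.

```python
import numpy as np

# Hypothetical observations for three groups (illustrative values only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([1.0, 2.0, 3.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total: squared deviations of every observation from the grand mean
sst = np.sum((all_values - grand_mean) ** 2)

# Between: squared deviations of group means from the grand mean, weighted by group size
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within: squared deviations of observations from their own group mean
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)

print(f"SST = {sst:.2f}, SSB = {ssb:.2f}, SSW = {ssw:.2f}")
# prints: SST = 60.00, SSB = 54.00, SSW = 6.00
```

Here the grand mean is 5 and the group means are 5, 8, and 2, so almost all of the variability (54 of 60) lies between the groups. Because the groups are stored as separate arrays, the same code also works when group sizes differ.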

### Importance of Understanding Variance Partitioning

1. **Identifying Sources of Variation**:
   - Understanding how much of the total variability is attributed to differences between groups versus variability 
within groups helps researchers identify the sources of effects observed in the data.

2. **Hypothesis Testing**:
   - In ANOVA, the F-statistic is the ratio F = MSB / MSW, where the mean square between groups is 
MSB = SSB / (number of groups - 1) and the mean square within groups is 
MSW = SSW / (total number of observations - number of groups). An F-value that is large relative to the 
corresponding F-distribution suggests that at least one group mean differs from the others.

3. **Model Interpretation**:
   - The partitioning of variance allows researchers to better understand the effectiveness of experimental treatments 
or conditions. It helps in interpreting how much of the observed effect is due to the treatment versus random 
variability.

4. **Assumptions Check**:
   - Variance partitioning can also help in checking the assumptions of ANOVA, such as homogeneity of variances. 
If the within-group variance is significantly larger for some groups, it may indicate a violation of this assumption.

5. **Improving Experimental Design**:
   - Understanding how variance is partitioned can inform the design of experiments, leading to better controls and
potentially more powerful tests.
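
To make the hypothesis-testing step concrete, the following sketch builds the F-statistic from the mean squares and compares it with `scipy.stats.f_oneway`. The data are hypothetical, and the variable names (`msb`, `msw`, `f_manual`) are chosen here for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical observations for three groups of unequal size
groups = [
    np.array([12.0, 15.0, 14.0, 13.0]),
    np.array([18.0, 17.0, 19.0, 20.0, 16.0]),
    np.array([11.0, 10.0, 13.0]),
]
k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Mean squares: each sum of squares divided by its degrees of freedom
msb = ssb / (k - 1)
msw = ssw / (n_total - k)

# F-statistic and its upper-tail probability under the F-distribution
f_manual = msb / msw
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"Manual: F = {f_manual:.4f}, p = {p_manual:.6f}")
print(f"SciPy:  F = {f_scipy:.4f}, p = {p_scipy:.6f}")
```

The two lines of output should agree, which confirms that the F-test is nothing more than the ratio of the two partitioned variance estimates.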

In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
import numpy as np
import pandas as pd

# Sample data: groups and their observations
data = {
    'Group_A': [23, 21, 22, 24, 25],
    'Group_B': [30, 29, 31, 32, 28],
    'Group_C': [20, 19, 21, 22, 20]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Calculate overall mean
overall_mean = df.values.flatten().mean()

# Total Sum of Squares (SST)
SST = np.sum((df.values.flatten() - overall_mean) ** 2)

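# Explained Sum of Squares (SSE): between-group variation, called SSB in Q3
group_means = df.mean(axis=0)
n = df.count()  # Number of observations in each group
SSE = np.sum(n * (group_means - overall_mean) ** 2)

# Residual Sum of Squares (SSR): within-group variation, called SSW in Q3
# (the column-wise subtraction relies on every group having the same size)
SSR = np.sum((df.values - group_means.values.reshape(1, -1)) ** 2)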

# Output results
print(f"Total Sum of Squares (SST): {SST:.2f}")
print(f"Explained Sum of Squares (SSE): {SSE:.2f}")
print(f"Residual Sum of Squares (SSR): {SSR:.2f}")

In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Sample data
data = {
    'Factor_A': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2'],
    'Factor_B': ['B1', 'B1', 'B2', 'B1', 'B1', 'B2', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2'],
    'Response': [5, 6, 7, 6, 7, 8, 7, 8, 6, 5, 6, 7]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Response ~ C(Factor_A) * C(Factor_B)', data=df).fit()

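# Perform ANOVA. The cell counts in this sample are unequal, so request Type II
# sums of squares (the default Type I results depend on the order of the factors)
anova_results = anova_lm(model, typ=2)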

# Output results
print(anova_results)

In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In [None]:
When conducting a one-way ANOVA, the F-statistic and p-value provide important insights into the differences between 
group means. Here’s how to interpret the results given an F-statistic of 5.23 and a p-value of 0.02:

### Interpretation of the Results

1. **F-statistic**:
   - The F-statistic measures the ratio of the variance between the groups to the variance within the groups. 
A higher F-statistic indicates that the group means are more spread out than would be expected by chance alone.

2. **p-value**:
   - The p-value indicates the probability of observing the data (or something more extreme) if the null hypothesis 
(which states that all group means are equal) is true. A smaller p-value suggests that there is evidence against the
null hypothesis.

### Conclusion Based on p-value

- **Significance Level**: Common significance levels are 0.05, 0.01, and 0.10. In this case, with a p-value of 0.02:
  - If you are using a significance level of 0.05, you would **reject the null hypothesis**. This means there is 
statistically significant evidence to conclude that at least one group mean is different from the others.
  - If using a more stringent significance level (e.g., 0.01), you would fail to reject the null hypothesis since 
    0.02 is greater than 0.01, indicating that the results may not be considered significant at that level.

### Practical Interpretation

- **Differences Between Groups**: Since you reject the null hypothesis at the 0.05 significance level, you can 
    conclude that there are significant differences in the means of the groups being compared. However, it does 
    not indicate which specific groups are different. Additional post-hoc tests (like Tukey's HSD) would be needed 
    to identify which pairs of groups are significantly different.
  
- **Effect Size**: It may also be helpful to calculate the effect size (e.g., eta-squared) to understand the
    magnitude of the differences, as the p-value only indicates whether the differences are statistically significant,
    not how meaningful they are.
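
The decision rule and the effect-size calculation can be sketched as follows. The F-statistic and p-value come from the question. The degrees of freedom are NOT given in the question, so the values used here (2 and 27, i.e. three groups and 30 observations) are an assumption made purely for illustration.

```python
f_stat = 5.23    # from the question
p_value = 0.02   # from the question

# Decision at the common significance levels
decisions = {}
for alpha in (0.10, 0.05, 0.01):
    decisions[alpha] = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: {decisions[alpha]}")

# Eta-squared (SSB / SST) can be recovered from F and the degrees of freedom:
#   eta^2 = (F * df_between) / (F * df_between + df_within)
# ASSUMPTION: hypothetical degrees of freedom, not stated in the question.
df_between, df_within = 2, 27
eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)
print(f"eta-squared (under the assumed df): {eta_squared:.3f}")
```

Under these assumed degrees of freedom, roughly 28% of the total variability would be attributable to group membership. With different degrees of freedom the same F-statistic would imply a different effect size, which is why reporting both is useful.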


In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

In [None]:
Handling missing data in a repeated measures ANOVA is crucial because the nature of this type of analysis involves 
measurements taken on the same subjects under different conditions or over time. There are several approaches to 
dealing with missing data, each with its own implications.

### Methods for Handling Missing Data

1. **Listwise Deletion**:
   - This approach involves excluding any participant with missing data from the analysis. Only complete cases are 
included.
   - **Pros**: Simple to implement and keeps the analysis straightforward.
   - **Cons**: Can lead to a significant reduction in sample size, which may affect the power of the analysis and 
    could introduce bias if the missingness is not random (e.g., if certain types of participants are more likely 
    to have missing data).

2. **Pairwise Deletion**:
   - This method uses all available data for each analysis, allowing different participants to be included in 
different comparisons based on what data they have.
   - **Pros**: Retains more data than listwise deletion, which can improve the power of the analysis.
   - **Cons**: Can lead to inconsistencies in sample size across different tests, making it harder to interpret 
    results and potentially leading to biased estimates.

3. **Mean Imputation**:
   - Missing values are replaced with the mean of the observed values for that variable.
   - **Pros**: Simple and maintains the sample size.
   - **Cons**: Underestimates variability, can bias results, and may distort relationships among variables, 
    leading to incorrect conclusions.

4. **Last Observation Carried Forward (LOCF)**:
   - For longitudinal data, the last available measurement is used to fill in missing values.
   - **Pros**: Preserves the continuity of data for repeated measures.
   - **Cons**: Assumes that the last observation is a reasonable estimate of the missing value, which may not be true,
    leading to biased results.

5. **Multiple Imputation**:
   - Missing values are estimated multiple times to create several complete datasets, which are then analyzed 
separately and combined.
   - **Pros**: Provides a more robust estimate of uncertainty and preserves relationships among variables.
   - **Cons**: More complex to implement and requires appropriate modeling assumptions.

6. **Maximum Likelihood Estimation (MLE)**:
   - This method estimates parameters based on the available data without imputing missing values (e.g., by fitting 
a linear mixed-effects model instead of a classical repeated measures ANOVA).
   - **Pros**: Utilizes all available information and can produce unbiased estimates when the data are missing at 
random (MAR).
   - **Cons**: May require more complex modeling and assumptions about the data distribution.

### Consequences of Different Methods

- **Bias and Accuracy**: Methods that ignore or poorly estimate missing values can introduce bias into the results. 
    For instance, listwise deletion may produce biased results if the missing data are not random (e.g., if certain 
    demographic groups are more likely to have missing data).
  
- **Statistical Power**: Reducing sample size through deletion methods can decrease the power of the analysis, 
    increasing the risk of Type II errors (failing to detect a true effect).

- **Variability**: Imputation methods that underestimate variability (like mean imputation) can lead to 
    overconfident conclusions about the significance of effects.

- **Generalizability**: The way missing data are handled can affect the generalizability of findings. 
    For example, if missingness is systematic (e.g., patients dropping out of a treatment study), the results may 
    not be applicable to the entire population.

### Best Practices

- **Assess Missingness**: Determine the missingness mechanism (Missing Completely at Random - MCAR, Missing at 
Random - MAR, or Not Missing at Random - NMAR) to choose the appropriate method. For example, listwise deletion is 
generally unbiased only under MCAR.
  
- **Consider Multiple Approaches**: It may be beneficial to try several methods for handling missing data and compare
results to assess the robustness of your findings.

- **Report Handling Method**: Clearly report how missing data were handled in any analysis to allow for proper 
interpretation of results.
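
Three of the simpler methods can be compared side by side with pandas. This is a minimal sketch on a hypothetical wide-format table (one row per subject, one column per time point). The column names `t1` to `t3` and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data: rows are subjects, columns are time points
wide = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 13.0],
        "t2": [11.0, np.nan, 12.0, 14.0],
        "t3": [12.0, 14.0, np.nan, 15.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4"], name="subject"),
)

# Listwise deletion: keep only subjects with complete data
listwise = wide.dropna()

# Mean imputation: replace each missing value with the mean of its time point
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each subject's last observed value forward across time points
locf = wide.ffill(axis=1)

print(f"Subjects kept after listwise deletion: {len(listwise)} of {len(wide)}")
print(f"Variance of t2, observed values only: {wide['t2'].var():.3f}")
print(f"Variance of t2, after mean imputation: {mean_imputed['t2'].var():.3f}")
print(locf)
```

Listwise deletion halves the sample here, and mean imputation shrinks the variance of `t2`, which illustrates the loss of power and the underestimated variability described above.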


In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

In [None]:
Post-hoc tests are conducted after an ANOVA when the null hypothesis is rejected, indicating that there are
significant differences among group means. These tests help identify which specific groups are different from 
each other. Here are some common post-hoc tests, along with when to use each one:

### Common Post-Hoc Tests

1. **Tukey's Honestly Significant Difference (HSD)**:
   - **When to Use**: Ideal for comparing all possible pairs of group means while controlling for the Type I error rate. It assumes equal variances and is best when the sample sizes are equal or nearly equal.
   - **Example**: If you conducted a one-way ANOVA to compare the mean test scores of students taught using three 
    different teaching methods (Method A, Method B, Method C), and found significant differences, Tukey's HSD could
    be used to determine which teaching methods differ from each other.

2. **Bonferroni Correction**:
   - **When to Use**: Suitable when making a limited number of comparisons, particularly when you want to control for 
    Type I errors in a conservative way. It adjusts the significance level based on the number of comparisons being 
    made.
   - **Example**: If you have three groups and you are testing specific hypotheses about the means (e.g., Group 1 vs. 
    Group 2, Group 1 vs. Group 3), Bonferroni can be applied to adjust the significance level accordingly.

3. **Scheffé's Test**:
   - **When to Use**: Appropriate for comparing complex contrasts (not just simple pairwise comparisons). It is more 
        flexible but also more conservative than Tukey's HSD, making it less powerful in some situations.
   - **Example**: If you have more than three groups and want to test specific hypotheses about combinations of groups,
    Scheffé's Test can help you identify significant differences while controlling for Type I error.

4. **Dunnett's Test**:
   - **When to Use**: Used when comparing multiple groups to a single control group. It is specifically designed to 
        control the Type I error rate when making these comparisons.
   - **Example**: If you are testing the effects of different drugs on blood pressure compared to a control (no drug),
    Dunnett's Test would help identify which drugs significantly differ from the control.

5. **Newman-Keuls Test**:
   - **When to Use**: Useful for stepwise comparisons among means ordered by size. It is less conservative (and so 
    more powerful) than Tukey’s HSD, but with more than three groups it does not fully control the familywise 
    Type I error rate, so it should be used with caution.
   - **Example**: If you have a treatment that varies in intensity (low, medium, high) and you want to see how these
    intensities compare, Newman-Keuls can be applied.

### Example Situation for Post-Hoc Tests

Imagine a researcher conducts an experiment to test the effectiveness of three different diets (Diet A, Diet B, Diet C)
on weight loss over eight weeks. After performing a one-way ANOVA, the researcher finds a significant difference in 
the average weight loss among the three diets.

To determine which specific diets differ from one another, the researcher would use a post-hoc test. For instance:

- If the researcher wants to compare all pairs of diets to see which ones lead to different amounts of weight loss, 
**Tukey's HSD** would be appropriate.
- If the researcher was only interested in comparing each diet to a control diet (like a standard diet with no 
restrictions), **Dunnett's Test** would be more suitable.
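
As an illustration of the diet scenario, the sketch below runs Bonferroni-corrected pairwise t-tests using only `scipy.stats`. The weight-loss values are simulated (hypothetical), and pairwise t-tests with a Bonferroni adjustment are just one of the options listed above, not necessarily the best choice for a given study.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical weight loss (kg) for three diets, simulated for illustration
rng = np.random.default_rng(0)
diets = {
    "A": rng.normal(loc=3.0, scale=1.0, size=20),
    "B": rng.normal(loc=5.0, scale=1.0, size=20),
    "C": rng.normal(loc=3.2, scale=1.0, size=20),
}

# All pairwise comparisons: (A, B), (A, C), (B, C)
pairs = list(combinations(diets, 2))

# Bonferroni: divide the familywise alpha by the number of comparisons
alpha = 0.05
alpha_adjusted = alpha / len(pairs)

results = {}
for g1, g2 in pairs:
    t_stat, p_raw = stats.ttest_ind(diets[g1], diets[g2])
    results[(g1, g2)] = (p_raw, p_raw < alpha_adjusted)
    print(f"{g1} vs {g2}: t = {t_stat:.2f}, p = {p_raw:.4f}, "
          f"significant at adjusted alpha {alpha_adjusted:.4f}: {p_raw < alpha_adjusted}")
```

Each comparison is judged against 0.05 / 3 rather than 0.05, which keeps the chance of at least one false positive across the three tests at or below 5%.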


In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Sample data: weight loss for three diets
np.random.seed(0)  # For reproducibility
diet_a = np.random.normal(loc=5, scale=1.5, size=20)  # Diet A
diet_b = np.random.normal(loc=7, scale=1.5, size=15)  # Diet B
diet_c = np.random.normal(loc=6, scale=1.5, size=15)  # Diet C

# Combine into a DataFrame
data = {
    'WeightLoss': np.concatenate([diet_a, diet_b, diet_c]),
    'Diet': ['A'] * 20 + ['B'] * 15 + ['C'] * 15
}
df = pd.DataFrame(data)

# Fit the one-way ANOVA model
model = ols('WeightLoss ~ C(Diet)', data=df).fit()

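# Perform ANOVA
anova_results = anova_lm(model)

# Output the results
print(anova_results)

# Report the F-statistic and p-value for the Diet factor and interpret them
f_stat = anova_results.loc['C(Diet)', 'F']
p_value = anova_results.loc['C(Diet)', 'PR(>F)']
print(f"F-statistic: {f_stat:.2f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: at least one diet's mean weight loss differs from the others.")
else:
    print("Fail to reject H0: no significant difference in mean weight loss was detected.")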

In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Set the seed for reproducibility
np.random.seed(0)

# Sample data: completion times for three programs and two experience levels.
# 5 employees per cell x 6 cells = 30 employees, matching the problem statement.
n_per_cell = 5
program_a_novice = np.random.normal(loc=20, scale=5, size=n_per_cell)  # Novice with Program A
program_a_experienced = np.random.normal(loc=15, scale=5, size=n_per_cell)  # Experienced with Program A
program_b_novice = np.random.normal(loc=25, scale=5, size=n_per_cell)  # Novice with Program B
program_b_experienced = np.random.normal(loc=18, scale=5, size=n_per_cell)  # Experienced with Program B
program_c_novice = np.random.normal(loc=30, scale=5, size=n_per_cell)  # Novice with Program C
program_c_experienced = np.random.normal(loc=22, scale=5, size=n_per_cell)  # Experienced with Program C

# Combine into a DataFrame
data = {
    'Time': np.concatenate([program_a_novice, program_a_experienced,
                            program_b_novice, program_b_experienced,
                            program_c_novice, program_c_experienced]),
    'Program': ['A'] * (2 * n_per_cell) + ['B'] * (2 * n_per_cell) + ['C'] * (2 * n_per_cell),
    # Labels follow the concatenation order above: novice, then experienced, for each program
    'Experience': (['Novice'] * n_per_cell + ['Experienced'] * n_per_cell) * 3
}

df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Time ~ C(Program) * C(Experience)', data=df).fit()

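# Perform ANOVA
anova_results = anova_lm(model)

# Output the results
print(anova_results)

# Report each effect at the 0.05 level (the Residual row has no F-test)
for effect, row in anova_results.drop(index='Residual').iterrows():
    verdict = 'significant' if row['PR(>F)'] < 0.05 else 'not significant'
    print(f"{effect}: F = {row['F']:.2f}, p = {row['PR(>F)']:.4f} ({verdict})")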

In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Set the seed for reproducibility
np.random.seed(0)

# Sample data: test scores for control and experimental groups
control_group = np.random.normal(loc=75, scale=10, size=50)  # Traditional method
experimental_group = np.random.normal(loc=80, scale=10, size=50)  # New method

# Combine into a DataFrame
data = {
    'Score': np.concatenate([control_group, experimental_group]),
    'Group': ['Control'] * 50 + ['Experimental'] * 50
}
df = pd.DataFrame(data)

# Conduct a two-sample t-test
t_stat, p_value = stats.ttest_ind(control_group, experimental_group)

# Output t-test results
print(f"T-statistic: {t_stat}, P-value: {p_value}")

# Check if the result is significant
if p_value < 0.05:
    print("The results are significant.")

    # With only two groups, no post-hoc test is needed: the t-test already
    # identifies the one pair that differs, so report the group means directly
    mean_control = np.mean(control_group)
    mean_experimental = np.mean(experimental_group)
    print(f"Mean Control Group Score: {mean_control:.2f}")
    print(f"Mean Experimental Group Score: {mean_experimental:.2f}")

else:
    print("The results are not significant.")

In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a post-
hoc test to determine which store(s) differ significantly from each other.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Set the seed for reproducibility
np.random.seed(0)

# Sample data: daily sales for three stores over 30 days
store_a_sales = np.random.normal(loc=200, scale=30, size=30)  # Store A sales
store_b_sales = np.random.normal(loc=220, scale=30, size=30)  # Store B sales
store_c_sales = np.random.normal(loc=210, scale=30, size=30)  # Store C sales

# Create a DataFrame in long format for repeated measures
data = {
    'Day': np.tile(np.arange(1, 31), 3),
    'Sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales]),
    'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30
}
df = pd.DataFrame(data)

# Conduct repeated measures ANOVA
anova_results = AnovaRM(df, 'Sales', 'Day', within=['Store']).fit()

# Output the ANOVA results
print(anova_results)

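# Check if the results are significant
# (AnovaRM results expose the ANOVA table as a DataFrame via `anova_table`)
store_p_value = anova_results.anova_table.loc['Store', 'Pr > F']
if store_p_value < 0.05:
    print("The results are significant, proceeding with post-hoc tests.")

    # Perform post-hoc test (Tukey's HSD).
    # Note: Tukey's HSD treats the stores as independent groups and ignores the
    # pairing by day, so it is only a rough follow-up for a repeated measures design.
    tukey_results = pairwise_tukeyhsd(endog=df['Sales'], groups=df['Store'], alpha=0.05)
    print(tukey_results)
else:
    print("The results are not significant.")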