# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Analysis of Variance (ANOVA) is a statistical test used to compare the means of three or more groups. To ensure the validity of the results and accurate interpretations, several assumptions need to be met when using ANOVA. These assumptions include:

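1. Independence of Observations: The observations must be independent of each other, both within each group and across groups. This means that no data point should be related to or dependent on any other data point. Independence is a property of the study design (for example, random sampling or random assignment) rather than something that can be fixed after data collection.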

2. Normality: The dependent variable within each group (equivalently, the model residuals) should be approximately normally distributed. ANOVA assumes that the populations from which the samples are drawn are normally distributed. The F-test tolerates mild departures when group sizes are large and similar, but strong skewness or heavy tails in small samples can distort the p-values.

3. Homogeneity of Variance (Homoscedasticity): The variance of the dependent variable should be approximately equal across all groups. In other words, the spread of data points should be consistent in each group. Homogeneity of variance ensures that the groups have a similar level of variability. If the variance differs significantly among groups, the results of ANOVA may be unreliable.

4. Homogeneity of Regression Slopes (Only for ANCOVA): This assumption applies when a continuous covariate is added to the model (analysis of covariance), not to a standard two-way ANOVA. It requires that the slope of the regression line relating the covariate to the dependent variable be equal across groups.

Examples of violations of these assumptions that could impact the validity of ANOVA results:

1. Violation of Normality: If the data within each group are far from normally distributed, the F-test's p-values can be inaccurate, particularly with small or unequal group sizes. For example, reaction times or incomes are often strongly right-skewed, and a few extreme values can dominate the within-group variance.

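2. Violation of Homoscedasticity: If the variability differs substantially among groups, the pooled error variance used by the F-test no longer represents any single group well. For example, if one group's standard deviation is several times larger than the others and the group sizes are unequal, the Type I error rate can be inflated or deflated and confidence intervals for group differences become unreliable.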

3. Violation of Independence: If the data points within groups are not independent, such as in a repeated measures design or clustered data, it violates the assumption of independence. This can lead to underestimation or overestimation of the significance of group differences.

4. Violation of Homogeneity of Regression Slopes: In an ANCOVA, if the slope relating the covariate to the dependent variable differs across groups, the covariate-adjusted group means depend on where along the covariate they are compared, so a single adjusted group difference is no longer meaningful.
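
The normality and equal-variance assumptions listed above can be screened with standard tests. The following is a minimal sketch, assuming made-up data for three hypothetical groups; it uses the Shapiro-Wilk test for normality and Levene's test for equal variances from SciPy. Independence cannot be checked this way because it depends on how the data were collected.

```python
import numpy as np
from scipy import stats

# Hypothetical samples for three groups (made-up data for illustration only)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=2.0, size=20) for m in (10.0, 11.0, 12.0)]

# Normality: Shapiro-Wilk test on each group's residuals (value minus group mean)
for i, g in enumerate(groups, start=1):
    w_stat, p_norm = stats.shapiro(g - g.mean())
    print(f"group {i}: Shapiro-Wilk W = {w_stat:.3f}, p = {p_norm:.3f}")

# Homogeneity of variance: Levene's test across all groups
lev_stat, p_lev = stats.levene(*groups)
print(f"Levene statistic = {lev_stat:.3f}, p = {p_lev:.3f}")
```

A small p-value in either test suggests that the corresponding assumption may not hold; with small samples these tests have little power, so plots of the residuals are a useful complement.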



# Q2. What are the three types of ANOVA, and in what situations would each be used?
The three types of ANOVA are:

1. One-Way ANOVA (Analysis of Variance): One-Way ANOVA is used when we have one categorical independent variable (also called a factor) with three or more levels (groups), and we want to compare the means of the dependent variable across these groups. This test helps determine if there are any significant differences in the means of the dependent variable between the groups. One-Way ANOVA is appropriate when we have a single independent variable and want to investigate its impact on the dependent variable.

Example: A researcher wants to compare the test scores of students from three different schools to see if there are any significant differences in academic performance.

2. Two-Way ANOVA: Two-Way ANOVA is used when we have two categorical independent variables (factors) and one dependent variable. It allows us to examine the main effects of each independent variable separately, as well as their interaction effect (how the two independent variables together influence the dependent variable). This test is suitable when we want to investigate the combined influence of two factors on the dependent variable.

Example: A researcher wants to study the effects of both gender and teaching method on the test scores of students. The independent variables are gender (male or female) and teaching method (traditional or online).

3. Three-Way ANOVA: Three-Way ANOVA extends the concept of Two-Way ANOVA by adding a third categorical independent variable (factor) along with one dependent variable. It allows us to explore the main effects of each independent variable and their interaction effects in the context of three factors.

Example: A researcher wants to study the effects of temperature, humidity, and light intensity on plant growth. The independent variables are temperature (low, medium, high), humidity (low, medium, high), and light intensity (low, medium, high).



# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
Partitioning of variance in ANOVA means breaking the total variability in the data into components, each attributable to a specific source of variation, such as a factor in the study or random error.

In a typical One-Way ANOVA, the total variance in the data is partitioned into two main components:

1. Between-Groups Variance (or Treatment Variance): This component measures the variability between the group means. It represents the differences in the dependent variable caused by the independent variable (treatment) being studied. Larger between-groups variance indicates that the groups have different means.

2. Within-Groups Variance (or Error Variance): This component measures the variability within each group. It represents the random fluctuations or variability in the data that cannot be explained by the independent variable. It includes individual differences, measurement error, and any other unaccounted sources of variability.

The partitioning of variance allows us to assess the relative contributions of the treatment effect and random variability to the total variability in the data. By comparing the magnitude of these variance components, ANOVA helps us determine if the differences between group means (between-groups variance) are significantly larger than what we would expect due to random chance (within-groups variance).

Understanding the concept of partitioning of variance is important for several reasons:

1. Identifying Significant Effects: ANOVA helps us determine if the variation between groups (treatment effect) is statistically significant. If the between-groups variance is much larger than the within-groups variance, it suggests that the independent variable (treatment) has a significant effect on the dependent variable.

2. Interpreting Results: By understanding the partitioning of variance, we can interpret the results of ANOVA correctly. We can infer whether the observed differences between groups are likely due to the treatment effect or simply due to random fluctuations.

3. Quantifying Effect Size: The ratio of the between-groups sum of squares to the total sum of squares (known as eta-squared) provides a measure of effect size, indicating what proportion of the total variability in the data can be attributed to the independent variable.

4. Model Improvement: Understanding the partitioning of variance can guide researchers in refining their models and experimental designs. By understanding the sources of variability in the data, researchers can optimize the design and control factors that influence the dependent variable.
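
For a one-way design, the partition described above can be written out as follows. This is a sketch in standard notation, where the symbols are assumptions introduced here: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, $\bar{x}_j$ the mean of group $j$, and $\bar{x}$ the overall mean.

```latex
\begin{aligned}
SS_{\text{total}}   &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}\right)^2 \\
SS_{\text{between}} &= \sum_{j=1}^{k} n_j \left(\bar{x}_j - \bar{x}\right)^2 \\
SS_{\text{within}}  &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}_j\right)^2 \\
SS_{\text{total}}   &= SS_{\text{between}} + SS_{\text{within}} \\
F &= \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)},
\qquad
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}
\end{aligned}
```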



# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?



In [3]:
import numpy as np

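# Example data: three groups of five observations each
group_A = np.array([25, 28, 32, 29, 31])
group_B = np.array([21, 20, 24, 23, 25])
group_C = np.array([27, 29, 28, 26, 30])

data = np.concatenate([group_A, group_B, group_C])

overall_mean = np.mean(data)

# Total sum of squares: every observation against the overall mean
SST = np.sum((data - overall_mean) ** 2)

group_A_mean = np.mean(group_A)
group_B_mean = np.mean(group_B)
group_C_mean = np.mean(group_C)

# Explained (between-group) sum of squares: group means against the
# overall mean, weighted by group size
SSE = len(group_A) * (group_A_mean - overall_mean) ** 2 + \
      len(group_B) * (group_B_mean - overall_mean) ** 2 + \
      len(group_C) * (group_C_mean - overall_mean) ** 2

# Residual (within-group) sum of squares: observations against their own
# group mean
SSR = np.sum((group_A - group_A_mean) ** 2) + \
      np.sum((group_B - group_B_mean) ** 2) + \
      np.sum((group_C - group_C_mean) ** 2)

print('SST :', SST)
print('SSE :', SSE)
print('SSR :', SSR)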

SST : 175.73333333333335
SSE : 118.53333333333327
SSR : 57.2
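
The sums of squares above can be turned into an F-statistic and an effect size. The sketch below reuses the same three groups, checks that the explained and residual parts add up to the total, and compares the hand-computed F with `scipy.stats.f_oneway`. The names `ss_between` and `ss_within` are introduced here and correspond to SSE (explained) and SSR (residual) as the question defines them; note that many textbooks use these two abbreviations the other way round (SSE for the error sum of squares).

```python
import numpy as np
from scipy import stats

groups = [
    np.array([25, 28, 32, 29, 31]),
    np.array([21, 20, 24, 23, 25]),
    np.array([27, 29, 28, 26, 30]),
]
data = np.concatenate(groups)
grand_mean = data.mean()

ss_total = np.sum((data - grand_mean) ** 2)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(data) - len(groups)

# F is the ratio of the two mean squares; the p-value is the right-tail area
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)
eta_squared = ss_between / ss_total

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print("partition holds:", np.isclose(ss_total, ss_between + ss_within))
print("F (manual):", f_manual, " F (scipy):", f_scipy)
print("p (manual):", p_manual, " p (scipy):", p_scipy)
print("eta-squared:", eta_squared)
```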


# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [7]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = {
    'group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'factor1': [10, 12, 15, 8, 9, 11, 13, 14, 16],
    'factor2': [20, 21, 23, 18, 19, 22, 24, 25, 27]
}

df = pd.DataFrame(data)

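# Note: `group` is regressed on two numeric predictors and their product, so
# the coefficients printed below are regression slopes. A conventional two-way
# ANOVA would instead model a numeric response with two categorical factors.
formula = 'factor1 + factor2 + factor1:factor2'

model = ols('group ~ ' + formula, data=df).fit()

# Keep the interaction coefficient separate from the two main-effect coefficients
main_effects = model.params.drop(['Intercept', 'factor1:factor2'])
interaction_effects = model.params['factor1:factor2']

print("Main Effects:")
print(main_effects)

print("\nInteraction Effect:")
print(interaction_effects)

Main Effects:
factor1   -1.130586
factor2    0.300133
dtype: float64

Interaction Effect:
0.02731269707332209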
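
In a two-way ANOVA with categorical factors, main effects and interaction effects can also be computed directly from the cell means. The following is a minimal sketch, assuming a made-up balanced 2x2 design; the column names `method`, `gender`, and `score` are hypothetical. A main effect is a level's marginal mean minus the grand mean, and an interaction effect is what remains of a cell mean after the grand mean and both main effects are removed.

```python
import pandas as pd

# Hypothetical balanced 2x2 design with two observations per cell (made-up data)
df = pd.DataFrame({
    "method": ["traditional"] * 4 + ["online"] * 4,
    "gender": ["male", "male", "female", "female"] * 2,
    "score":  [70, 72, 75, 77, 80, 82, 78, 76],
})

grand_mean = df["score"].mean()

# Main effects: marginal mean of each level minus the grand mean
main_method = df.groupby("method")["score"].mean() - grand_mean
main_gender = df.groupby("gender")["score"].mean() - grand_mean

# Interaction effects: cell mean minus grand mean minus both main effects
cell_means = df.groupby(["method", "gender"])["score"].mean().unstack("gender")
interaction = (
    cell_means
    .sub(main_method, axis=0)   # remove the row (method) effect
    .sub(main_gender, axis=1)   # remove the column (gender) effect
    - grand_mean
)

print("Main effect of method:\n", main_method, sep="")
print("\nMain effect of gender:\n", main_gender, sep="")
print("\nInteraction effects:\n", interaction, sep="")
```

These are descriptive effect estimates for a balanced design. Testing whether they are statistically significant requires the F-tests from an ANOVA table, for example via `sm.stats.anova_lm` on a model fitted with `ols`.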


# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?



Interpretation:
1. F-statistic: The F-statistic of 5.23 represents the test statistic that compares the variance between groups to the variance within groups. A larger F-statistic indicates that there is more variability between the group means relative to the variability within the groups.

2. p-value: The p-value of 0.02 is the probability of observing an F-statistic at least as large as 5.23 if the null hypothesis (all group means are equal) were true. It is below the commonly used significance level of 0.05 (5%), so the result is statistically significant at that level.

Conclusions:
Based on the F-statistic and the p-value obtained from the one-way ANOVA:
- We can conclude that at least one group mean differs significantly from the others. The F-test does not identify which groups differ; a post-hoc test is needed for that.
- The differences observed in the group means are unlikely to be due to random chance alone, as the p-value (0.02) is less than the chosen significance level (0.05). Therefore, we reject the null hypothesis that the means of all groups are equal.
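
The link between an F-statistic and its p-value can be sketched with SciPy's F distribution. The question does not state the degrees of freedom, so the values below (3 groups, 15 observations) are hypothetical, and the resulting p-value is not expected to match the reported 0.02 exactly.

```python
from scipy import stats

f_observed = 5.23
alpha = 0.05

# Hypothetical degrees of freedom: 3 groups and 15 observations in total
df_between, df_within = 2, 12

# p-value: area under the F distribution to the right of the observed statistic
p_value = stats.f.sf(f_observed, df_between, df_within)

# Critical value: the F-statistic that would give exactly p = alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print("p-value:", p_value)
print("critical F at alpha = 0.05:", f_critical)
print("reject null hypothesis:", p_value < alpha)
```

Comparing the p-value with alpha and comparing the observed F with the critical value are two views of the same decision rule.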


# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA is a crucial step to ensure the validity and accuracy of the analysis. Missing data can occur when participants have incomplete responses or drop out of the study before all measurements are collected. There are several methods to handle missing data, each with its advantages and potential consequences:

1. Complete Case Analysis (Listwise Deletion):
   - This method involves excluding all participants with missing data on any variable used in the analysis. Only the complete cases are retained for the analysis.
   - Advantage: It is straightforward to implement and requires no imputation model.
   - Consequences: Complete case analysis reduces the sample size and statistical power. Its estimates are unbiased only if the data are missing completely at random (MCAR); if missingness is related to the outcome or predictors, the results can be biased.

2. Mean Imputation:
   - In this method, missing values are replaced with the mean value of the variable across all participants.
   - Advantage: Mean imputation is simple and can preserve the sample size.
   - Consequences: Mean imputation can underestimate the variability in the data, leading to an inflated Type I error rate. It may also distort the true relationships between variables, especially if missingness is related to specific subgroups.

3. Last Observation Carried Forward (LOCF):
   - LOCF involves using the last observed value for a participant for subsequent missing time points.
   - Advantage: LOCF is simple to apply and preserves the sample size.
   - Consequences: LOCF assumes that a participant's response stays unchanged after the last observed time point. If responses would have continued to change, the imputed values are biased, and the artificial constancy underestimates variability.

4. Multiple Imputation (MI):
   - Multiple imputation creates multiple plausible imputations for the missing data based on a statistical model, and then combines the results from each imputed dataset.
   - Advantage: MI accounts for the uncertainty in the imputation process. When the data are missing at random (MAR) and the imputation model is correctly specified, it yields unbiased parameter estimates and accurate standard errors.
   - Consequences: MI can be computationally intensive and requires careful specification of the imputation model. However, it is considered a robust method for handling missing data.

5. Maximum Likelihood Estimation (MLE):
   - MLE is a statistical technique that uses all available data and estimates model parameters while accounting for the missing data mechanism.
   - Advantage: MLE is a principled approach that provides unbiased estimates under the missing at random (MAR) assumption.
   - Consequences: MLE may be sensitive to the specific assumptions made about the missing data mechanism, and its performance may degrade if the missing data are not missing at random.

The choice of missing data handling method should be based on the underlying missing data mechanism, the amount of missingness, and the goals of the analysis. It is essential to consider the potential consequences of each method and perform sensitivity analyses to assess the robustness of the results to different missing data approaches. Additionally, researchers should be cautious in interpreting the results of the analysis and acknowledge the potential limitations associated with missing data.
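
The three simplest methods above can be compared on a small example. This is a minimal sketch, assuming made-up wide-format data with one row per participant and hypothetical time-point columns `t1` to `t3`; it shows how listwise deletion shrinks the sample and how mean imputation shrinks the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
wide = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 13.0, 9.0, 14.0],
    "t2": [11.0, np.nan, 12.0, 15.0, 10.0, 15.0],
    "t3": [13.0, 14.0, np.nan, 16.0, np.nan, 17.0],
})

# Complete case analysis: drop every participant with any missing value
complete_cases = wide.dropna()

# Mean imputation: fill each gap with that time point's observed mean
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each participant's last observed value forward across time points
locf = wide.ffill(axis=1)

print("participants kept (listwise):", len(complete_cases), "of", len(wide))
print("t3 variance, observed values:", wide["t3"].var())
print("t3 variance, mean-imputed:   ", mean_imputed["t3"].var())
print("LOCF result:\n", locf, sep="")
```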

# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.


Some common post-hoc tests used after ANOVA include:

1. Tukey's Honestly Significant Difference (HSD) Test:
   - Tukey's HSD test compares all possible pairs of group means and identifies which pairs have significantly different means. It is suitable for situations where you have multiple groups and want to determine the specific differences between them.
   - Example: Suppose you conducted a one-way ANOVA to compare the test scores of students from four different schools. The ANOVA showed a significant difference in means. To identify which schools have significantly different test scores, you can use Tukey's HSD test.

2. Bonferroni Correction:
   - The Bonferroni correction adjusts the significance level for multiple comparisons to maintain an overall family-wise error rate. It divides the desired significance level (e.g., 0.05) by the number of pairwise comparisons. If the p-value for a pairwise comparison is less than the adjusted significance level, then that comparison is considered significant.
   - Example: If you have conducted multiple t-tests to compare the means of different groups, the Bonferroni correction can be applied to control the overall Type I error rate.

3. Scheffé's Test:
   - Scheffé's test is a conservative post-hoc test that controls the family-wise error rate for all possible contrasts, not only pairwise comparisons. It is suitable when you want to test complex contrasts (for example, the average of two groups against a third) or when group sizes are unequal. For pairwise comparisons alone it has less power than Tukey's HSD.
   - Example: In a study with several treatment groups, you can use Scheffé's test to compare the combined mean of two treatments against the mean of a third.

4. Dunnett's Test:
   - Dunnett's test is used when you have a control group and want to compare the means of other treatment groups against the control group mean. It controls the Type I error rate for the multiple comparisons while taking into account the control group comparison.
   - Example: In a medical trial, you have a control group receiving a placebo and several treatment groups receiving different doses of a drug. Dunnett's test can be used to compare each treatment group to the control group.
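
The Bonferroni correction described above can be sketched with pairwise t-tests. The example assumes made-up scores for three hypothetical schools and uses `multipletests` from statsmodels, which reports adjusted p-values (each raw p-value multiplied by the number of comparisons, capped at 1) rather than an adjusted significance level; the two formulations lead to the same decisions.

```python
from itertools import combinations

from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical test scores for three schools (made-up data)
samples = {
    "school_1": [72, 75, 78, 74, 77, 73],
    "school_2": [80, 83, 79, 85, 82, 81],
    "school_3": [74, 76, 79, 75, 78, 77],
}

# One independent-samples t-test per pair of schools
pairs = list(combinations(samples, 2))
raw_p = [stats.ttest_ind(samples[a], samples[b]).pvalue for a, b in pairs]

# Bonferroni adjustment across all pairwise comparisons
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p, p_adj, r in zip(pairs, raw_p, adj_p, reject):
    print(f"{a} vs {b}: raw p = {p:.4f}, adjusted p = {p_adj:.4f}, reject = {r}")
```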




# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.


In [1]:
from scipy import stats
import numpy as np

diet_A = [2, 3, 4, 5, 2, 3, 4, 3, 2, 1, 2, 4, 3, 3, 2, 3, 1, 2, 3, 4, 5, 3, 2, 3, 4, 3, 2, 4, 5, 2, 3, 4, 3, 3, 2, 1, 2, 3, 4, 5, 3, 2, 3, 4, 3, 2, 4, 5]
diet_B = [3, 4, 5, 6, 3, 4, 5, 3, 3, 2, 4, 5, 4, 4, 2, 4, 3, 3, 4, 5, 6, 4, 3, 4, 5, 4, 4, 5, 6, 3, 4, 5, 6, 4, 3, 2, 4, 5, 4, 4, 3, 2, 4, 5, 4, 4, 5, 6]
diet_C = [4, 5, 6, 7, 4, 5, 6, 4, 3, 4, 5, 6, 5, 5, 3, 5, 4, 4, 5, 6, 7, 5, 4, 5, 6, 5, 5, 6, 7, 4, 5, 6, 7, 5, 4, 3, 5, 6, 5, 5, 4, 3, 5, 6, 5, 5, 6, 7]

# Note: these are illustrative values; the list lengths do not match the
# 50 participants stated in the question.



f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

print("F-statistic:", f_statistic)
print("p-value:", p_value)

if p_value < 0.05:
    print("There is a significant difference in the mean weight loss between the diets (p < 0.05).")
else:
    print("There is no significant difference in the mean weight loss between the diets (p >= 0.05).")

F-statistic: 41.423629324341206
p-value: 7.0525624646654424e-15
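There is a significant difference in the mean weight loss between the diets (p < 0.05).

Interpretation: With F ≈ 41.42 and p ≈ 7.05e-15, far below 0.05, we reject the null hypothesis that the three diets produce the same mean weight loss. The test shows only that at least one diet's mean differs; a post-hoc test such as Tukey's HSD would be needed to identify which pairs of diets differ.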


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.



In [2]:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols


data = {
    'Software': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'A', 'B', 'B', 'C', 'C', 'A', 'A', 'B', 'B', 'C', 'C', 'A', 'A', 'B', 'B', 'C', 'C', 'A', 'A', 'B', 'B', 'C', 'C'],
    'Experience': ['Novice', 'Experienced'] * 15,
    'Time': [10, 12, 8, 9, 13, 14, 11, 10, 9, 11, 14, 13, 12, 11, 10, 9, 15, 14, 11, 12, 10, 11, 15, 16, 12, 10, 9, 10, 15, 14]
}

df = pd.DataFrame(data)


formula = 'Time ~ Software + Experience + Software:Experience'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)



                         sum_sq    df          F        PR(>F)
Software             115.266667   2.0  64.037037  2.387112e-10
Experience             0.133333   1.0   0.148148  7.037012e-01
Software:Experience    1.666667   2.0   0.925926  4.098595e-01
Residual              21.600000  24.0        NaN           NaN


Interpretation:


- Software: F ≈ 64.04, p ≈ 2.4e-10. The p-value is far below 0.05, so the average time to complete the task differs significantly across the three software programs.
- Experience: F ≈ 0.15, p ≈ 0.70. There is no significant difference in the average time to complete the task between novice and experienced employees.
- Software:Experience: F ≈ 0.93, p ≈ 0.41. The interaction is not significant, so there is no evidence that the effect of software on completion time differs between novice and experienced employees.



# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [3]:

import numpy as np
import pandas as pd
import scipy.stats as stats


control_group = [75, 80, 85, 78, 82, 79, 81, 84, 76, 80, 77, 81, 83, 78, 82, 80, 79, 81, 85, 79, 82, 78, 81, 80, 84, 78, 79, 80, 83, 82, 79, 81, 80, 78, 79, 82, 80, 81, 78, 80, 82, 85, 79, 81, 80, 78, 79, 82, 80, 81]

experimental_group = [85, 89, 91, 86, 90, 88, 92, 87, 86, 88, 90, 87, 89, 88, 87, 86, 89, 91, 88, 90, 87, 89, 90, 88, 86, 87, 90, 89, 91, 87, 89, 88, 90, 87, 86, 89, 90, 88, 92, 87, 89, 90, 88, 86, 87, 90, 89, 91, 88, 87]

t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)


print("Two-sample t-test:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Interpretation
if p_value < 0.05:
    print("There is a significant difference in test scores between the control and experimental groups (p < 0.05).")
else:
    print("There is no significant difference in test scores between the control and experimental groups (p >= 0.05).")



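# With only two groups, the t-test above already identifies which groups
# differ; Tukey's HSD is run here only because the question asks for a
# post-hoc step, and it leads to the same conclusion.
import statsmodels.stats.multicomp as mc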


all_data = np.concatenate([control_group, experimental_group])


group_labels = ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group)


posthoc = mc.pairwise_tukeyhsd(all_data, group_labels)

print("\nPost-hoc test (Tukey's HSD):")
print(posthoc)


Two-sample t-test:
t-statistic: -20.366190118761846
p-value: 5.41551424565112e-37
There is a significant difference in test scores between the control and experimental groups (p < 0.05).

Post-hoc test (Tukey's HSD):
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental      8.1   0.0 7.3107 8.8893   True
--------------------------------------------------------
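
With exactly two groups, a pooled-variance t-test and a one-way ANOVA are the same test, which is why a post-hoc comparison adds no new information here. The sketch below illustrates this on made-up scores for two hypothetical groups: the F-statistic equals the squared t-statistic and the p-values coincide.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for two groups (made-up data)
control = [75, 80, 78, 82, 79, 81]
experimental = [85, 89, 86, 90, 88, 87]

# Pooled-variance (equal variance) two-sample t-test, SciPy's default
t_stat, p_t = stats.ttest_ind(control, experimental)

# One-way ANOVA on the same two groups
f_stat, p_f = stats.f_oneway(control, experimental)

print("t squared:", t_stat ** 2, " F:", f_stat)
print("p (t-test):", p_t, " p (ANOVA):", p_f)
print("equivalent:", np.isclose(t_stat ** 2, f_stat) and np.isclose(p_t, p_f))
```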


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [5]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import statsmodels.stats.multicomp as mc


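# Placeholder data: sales are drawn at random without a seed, so the output
# changes on every run. No day identifier is recorded, so the model below
# treats the 90 values as independent observations (a one-way ANOVA) rather
# than as repeated measures on the same 30 days.
data = {
    'Store': ['A', 'B', 'C'] * 30,
    'Sales': np.random.randint(1000, 2000, 90)
}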


df = pd.DataFrame(data)

formula = 'Sales ~ C(Store)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)



posthoc = mc.pairwise_tukeyhsd(df['Sales'], df['Store'])

print("\nPost-hoc test (Tukey's HSD):")
print(posthoc)


             sum_sq    df         F    PR(>F)
C(Store)    55629.6   2.0  0.367618  0.693452
Residual  6582609.3  87.0       NaN       NaN

Post-hoc test (Tukey's HSD):
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj    lower    upper   reject
-------------------------------------------------------
     A      B     -5.8 0.9963 -175.1509 163.5509  False
     A      C    -55.4 0.7162 -224.7509 113.9509  False
     B      C    -49.6 0.7651 -218.9509 119.7509  False
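-------------------------------------------------------

Interpretation: With F ≈ 0.37 and p ≈ 0.69, there is no evidence of a difference in mean sales between the three stores, and Tukey's HSD rejects none of the pairwise comparisons. This is expected, because the sales figures were generated at random.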
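
Because the same 30 days are observed for every store, the day can be treated as the repeated "subject". The following is a minimal sketch of a repeated measures ANOVA using `AnovaRM` from statsmodels. The data are simulated with a fixed seed, the `Day` column and the simulated day-to-day variation are assumptions introduced for illustration, and no store is given a true advantage.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(42)
n_days = 30

# Simulated day-to-day variation shared by all stores (assumption for illustration)
day_effect = rng.normal(0, 100, n_days)

# Long format: one row per (day, store) combination
long = pd.DataFrame({
    "Day": np.repeat(np.arange(n_days), 3),
    "Store": np.tile(["A", "B", "C"], n_days),
})
long["Sales"] = (
    1500
    + day_effect[long["Day"].to_numpy()]
    + rng.normal(0, 50, len(long))
)

# Repeated measures ANOVA: Store is the within-subject factor, Day is the subject
res = AnovaRM(long, depvar="Sales", subject="Day", within=["Store"]).fit()
print(res.anova_table)
```

If the store effect were significant, one common follow-up in a repeated design is a set of paired t-tests with a Bonferroni correction, since Tukey's HSD as used above assumes independent groups.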
