**Q1.** Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

**Answer 1** -

ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups on a continuous outcome variable. The following are the assumptions required to use ANOVA:

1. Independence: The observations within each group should be independent of each other. This means that the value of one observation should not be influenced by the value of another observation.

2. Normality: The outcome variable should follow a normal distribution within each group. This means that the distribution of the outcome variable should be approximately bell-shaped and symmetric.

3. Homogeneity of variance: The variance of the outcome variable should be approximately equal across all groups. This means that the variability of the outcome variable should be similar in each group.

Violations of these assumptions could impact the validity of the results of ANOVA. 

For example:

1. Violation of independence: This occurs when observations within a group are correlated. Correlated observations carry less information than the same number of independent ones, so standard errors are underestimated, confidence intervals are too narrow, and the Type I error rate is inflated. For example, in a study of the effect of teaching style on student grades, students taught in the same classroom share a teacher and an environment, so their grades tend to be correlated.

2. Violation of normality: This occurs when the outcome variable is not normally distributed within a group, which can affect the accuracy of the p-value and confidence intervals. For example, in a study of the effect of a drug on blood pressure, if the blood pressure measurements are strongly skewed and the groups are small, the F-test may be unreliable. With large groups of similar size, ANOVA is fairly robust to moderate departures from normality.

3. Violation of homogeneity of variance: This occurs when the variance of the outcome variable is not equal across groups, which can lead to incorrect conclusions about group differences, especially when the group sizes are also unequal. For example, in a study of the effect of fertilizer on crop yield, if the variance of the yield is much higher for one type of fertilizer, the standard F-test may not be appropriate and an alternative such as Welch's ANOVA can be used instead.

In summary, ANOVA requires the assumptions of independence, normality, and homogeneity of variance. Violations of these assumptions can impact the validity of the results. It is important to check for these violations and, if present, consider alternative statistical methods.
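
The normality and equal-variance checks mentioned above can be sketched in Python. This is a minimal illustration, assuming SciPy is available. The three groups are simulated here and are not from any real study, and the 0.05 threshold is a convention rather than a rule. Independence cannot be checked this way, since it depends on how the data were collected.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three simulated groups (not from any real study)
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=5.0, scale=1.0, size=30),
    "B": rng.normal(loc=5.5, scale=1.0, size=30),
    "C": rng.normal(loc=6.0, scale=1.0, size=30),
}

# Normality within each group (Shapiro-Wilk): a small p-value suggests non-normality
normality_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variance across groups (Levene): a small p-value suggests unequal variances
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in normality_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A p-value below the chosen significance level in either test is a signal to look more closely (for example with plots) before relying on the standard F-test.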

**Q2.** What are the three types of ANOVA, and in what situations would each be used?

**Answer** -

The three types of ANOVA are:

1. One-way ANOVA: This is used when we want to compare the means of three or more groups defined by one categorical independent variable (factor). For example, we might want to compare the mean test scores of students who have studied for different durations of time (e.g., 1 hour, 2 hours, 3 hours).

2. Two-way ANOVA: This is used when group membership is defined by two factors. It tests the main effect of each factor and whether the factors interact. For example, we might want to compare the mean test scores of students who have studied for different durations of time (e.g., 1 hour, 2 hours, 3 hours) and who come from different schools (e.g., School A, School B).

3. Three-way ANOVA: This is used when group membership is defined by three factors. It tests the three main effects, the two-way interactions, and the three-way interaction. For example, we might want to compare the mean test scores of students who have studied for different durations of time (e.g., 1 hour, 2 hours, 3 hours), who come from different schools (e.g., School A, School B), and who have different ages (e.g., 10 years old, 11 years old, 12 years old).

In summary, we use one-way ANOVA when we have one independent variable, two-way ANOVA when we have two independent variables, and three-way ANOVA when we have three independent variables.
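
One way to see how the three designs relate is through the model formula that would be passed to statsmodels' `ols`. The helper below is a hypothetical illustration (the function name and the column names `score`, `hours`, `school`, and `age` are assumptions, not part of any library): it builds a full-factorial formula for any number of categorical factors, where `*` requests all main effects and interactions.

```python
def anova_formula(response, factors):
    """Build a full-factorial formula string for statsmodels' formula API."""
    if not factors:
        raise ValueError("at least one factor is required")
    # C(...) marks a column as categorical; '*' requests main effects plus interactions
    return f"{response} ~ " + " * ".join(f"C({name})" for name in factors)


one_way = anova_formula("score", ["hours"])
two_way = anova_formula("score", ["hours", "school"])
three_way = anova_formula("score", ["hours", "school", "age"])

print(one_way)
print(two_way)
print(three_way)
```

The only thing that changes between the designs is the number of factors on the right-hand side of the formula.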

**Q3.** What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

**Answer 3** 

The partitioning of variance in ANOVA refers to the process of dividing the total variance observed in a dataset into two or more components. The components are then used to test hypotheses about the effects of different factors on the dependent variable. 

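The total variability in a dataset, measured by the total sum of squares, is the sum of the within-group sum of squares (the variability of the observations around their own group mean) and the between-group sum of squares (the variability of the group means around the grand mean). Dividing each sum of squares by its degrees of freedom gives the mean squares, and their ratio is the F-statistic.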

By partitioning the variance, ANOVA helps to determine the relative contributions of different sources of variability to the observed differences in the means of the groups. This can be useful in determining the factors that are most important in explaining the variability in the data, and in identifying potential sources of error or bias in the analysis.

Understanding the concept of partitioning of variance is important because it provides a framework for analyzing the results of an ANOVA and interpreting the significance of different factors on the dependent variable. Additionally, it helps to identify which factors may be contributing more to the observed differences in means, and can be used to guide further investigation or analysis.
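
For a one-way design the partition can be written out explicitly. The notation is introduced here for illustration: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, $\bar{y}_j$ the mean of group $j$, and $\bar{y}$ the grand mean.

$$
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}\right)^2}_{\text{total}}
\;=\;
\underbrace{\sum_{j=1}^{k} n_j\left(\bar{y}_j-\bar{y}\right)^2}_{\text{between groups}}
\;+\;
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}_j\right)^2}_{\text{within groups}}
$$

$$
F \;=\; \frac{\text{SS}_{\text{between}}/(k-1)}{\text{SS}_{\text{within}}/(N-k)}
$$

A large F means the group means vary more than the scatter within groups would lead us to expect.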

**Q4.** How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

**Answer -**

In a one-way ANOVA, the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) can be calculated using Python.

The example below builds a small dataframe named "df", fits a one-way model with statsmodels, and reads the sums of squares from the resulting ANOVA table:

In [6]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a pandas dataframe with a dependent variable and an independent variable
dependent_variable = np.array([2, 4, 5, 3, 6, 7, 8, 9, 10, 11])
group_variable = np.array(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'])
df = pd.DataFrame({'dependent_variable': dependent_variable, 'group_variable': group_variable})

# Define the ANOVA model
model = ols('dependent_variable ~ group_variable', data=df).fit()

# Build the ANOVA table once and read the sums of squares by row label
anova_table = sm.stats.anova_lm(model, typ=1)

# Explained (between-group) sum of squares (SSE): the row for the factor
sse = anova_table.loc['group_variable', 'sum_sq']

# Residual (within-group) sum of squares (SSR): the 'Residual' row
ssr = anova_table.loc['Residual', 'sum_sq']

# Total sum of squares (SST): explained plus residual
sst = sse + ssr

print(f'SST: {sst:.4f}')
print(f'SSE: {sse:.4f}')
print(f'SSR: {ssr:.4f}')

SST: 82.5000
SSE: 64.1667
SSR: 18.3333


In this code, "dependent_variable" is the variable you are measuring (the "y" variable), and "group_variable" is the factor you are testing (the "x" variable). The ANOVA table returned by anova_lm has one row per model term plus a "Residual" row. The factor row holds SSE, the sum of the squared deviations of each group mean from the grand mean, weighted by the size of each group. The "Residual" row holds SSR, the sum of the squared deviations of each observation from its group mean. SST, the sum of the squared deviations of each observation from the grand mean, is not a row in the table, so it is obtained as SSE + SSR. The table also reports the degrees of freedom for each component, which are used to form the mean squares and the F-statistic.

Note that the groups here are not balanced (groups A and B have three observations each and group C has four). The decomposition SST = SSE + SSR holds regardless, but the F-test still assumes that the group variances are equal. If that assumption is violated, an alternative such as Welch's ANOVA may be necessary.
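
The three sums of squares can also be computed directly from their definitions, which is a useful cross-check on any library output. The sketch below uses only pandas and NumPy on the same toy data as above. It is an illustration of the definitions, not a replacement for the model-based table.

```python
import numpy as np
import pandas as pd

# Same toy data as in the example above
df = pd.DataFrame({
    "dependent_variable": [2, 4, 5, 3, 6, 7, 8, 9, 10, 11],
    "group_variable": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
})

y = df["dependent_variable"]
grand_mean = y.mean()

# Mean of each observation's own group, aligned row by row with y
group_means = df.groupby("group_variable")["dependent_variable"].transform("mean")

sst = ((y - grand_mean) ** 2).sum()            # total
sse = ((group_means - grand_mean) ** 2).sum()  # explained (between groups)
ssr = ((y - group_means) ** 2).sum()           # residual (within groups)

print(f"SST: {sst:.4f}")
print(f"SSE: {sse:.4f}")
print(f"SSR: {ssr:.4f}")
```

Because `transform("mean")` repeats each group mean once per observation, summing over rows automatically weights each group by its size.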

**Q5.** In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

**Answer -** 

In a two-way ANOVA, the main effects of each factor and the interaction effect between the factors can be calculated using the statsmodels library in Python.

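To calculate the main effects and the interaction, we can use the ols function to fit a single linear model that contains both factors and their interaction (the `*` operator in the formula expands to both main effects plus the interaction term), and then read each term's row from the ANOVA table. For example, with two factors named independent_variable_1 and independent_variable_2: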

In [7]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a pandas dataframe with two independent variables and a dependent variable
independent_variable_1 = np.array(['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'])
independent_variable_2 = np.array(['X', 'X', 'X', 'Y', 'Y', 'Y', 'X', 'X', 'X', 'Y', 'Y', 'Y'])
dependent_variable = np.array([10, 12, 8, 14, 16, 12, 18, 20, 16, 22, 24, 20])
df = pd.DataFrame({'independent_variable_1': independent_variable_1, 'independent_variable_2': independent_variable_2, 'dependent_variable': dependent_variable})


# Define the ANOVA model with interaction
model_interaction = ols('dependent_variable ~ independent_variable_1 * independent_variable_2', data=df).fit()

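# Build the ANOVA table once and read each term's sum of squares by row label
anova_table = sm.stats.anova_lm(model_interaction, typ=1)
main_effect_1 = anova_table.loc['independent_variable_1', 'sum_sq']
main_effect_2 = anova_table.loc['independent_variable_2', 'sum_sq']
interaction_effect = anova_table.loc['independent_variable_1:independent_variable_2', 'sum_sq']

print(f'Main effect of independent variable 1: {main_effect_1:.4f}')
print(f'Main effect of independent variable 2: {main_effect_2:.4f}')
print(f'Interaction effect: {interaction_effect:.4f}')

Main effect of independent variable 1: 32.0000
Main effect of independent variable 2: 48.0000
Interaction effect: 0.0000

These values are sums of squares; the F-statistic and p-value for each term are in the 'F' and 'PR(>F)' columns of the same table. The interaction sum of squares is zero (up to floating-point error) because the cell means in this example are exactly additive.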
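
To make the idea of a main effect and an interaction effect concrete, they can also be computed from marginal and cell means. The sketch below uses only pandas on the same twelve values as the example above; the shortened column names (`factor_1`, `factor_2`, `response`) are introduced here for readability. It illustrates the definitions and does not perform any significance test.

```python
import pandas as pd

df = pd.DataFrame({
    "factor_1": ["A", "B", "C"] * 4,
    "factor_2": (["X"] * 3 + ["Y"] * 3) * 2,
    "response": [10, 12, 8, 14, 16, 12, 18, 20, 16, 22, 24, 20],
})

grand_mean = df["response"].mean()

# Main effect of a level = its marginal mean minus the grand mean
effect_1 = df.groupby("factor_1")["response"].mean() - grand_mean
effect_2 = df.groupby("factor_2")["response"].mean() - grand_mean

# Interaction effect of a cell = cell mean minus what the two main effects predict
cell_means = df.groupby(["factor_1", "factor_2"])["response"].mean().unstack()
additive = grand_mean + effect_1.values[:, None] + effect_2.values[None, :]
interaction = cell_means - additive

print(effect_1)
print(effect_2)
print(interaction)
```

When every entry of `interaction` is zero, the effect of one factor is the same at every level of the other, which is what "no interaction" means.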


**Q6.** Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

**Answer 6-**

If the F-statistic of 5.23 and a p-value of 0.02 was obtained in a one-way ANOVA, it means that there is evidence of significant differences between the groups being compared. Specifically, it means that the variation in the sample means across the groups is greater than what would be expected by chance alone.

The p-value of 0.02 indicates that the probability of observing an F-statistic at least as large as the one obtained, assuming all group means are equal, is only 0.02. This is below the commonly used significance level of 0.05, suggesting that the null hypothesis (i.e., all group means are equal) should be rejected in favor of the alternative hypothesis (i.e., at least one group mean differs from the others). The F-test does not say which groups differ; a post-hoc test is needed for that.

The effect size can also be calculated to better understand the magnitude of the differences between the groups. For example, one commonly used effect size measure is eta-squared (η²), which is the proportion of the total variance in the dependent variable that is explained by the group membership. The formula for calculating eta-squared is:

η² = SSE / SST

where SSE is the explained (between-group) sum of squares and SST is the total sum of squares.

If eta-squared was found to be 0.15, for example, this would mean that 15% of the variance in the dependent variable can be explained by the group membership.
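
Eta-squared is straightforward to compute. The sketch below is illustrative only: the function names are hypothetical and the numbers passed in are made up to show the calls. The second function uses the algebraically equivalent form based on the F-statistic and its degrees of freedom, which helps when a report gives F but not the sums of squares.

```python
def eta_squared(ss_explained, ss_total):
    """Proportion of total variability explained by group membership."""
    if ss_total <= 0:
        raise ValueError("ss_total must be positive")
    return ss_explained / ss_total


def eta_squared_from_f(f_stat, df_between, df_within):
    """Equivalent form when only F and the degrees of freedom are reported."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)


# Hypothetical sums of squares chosen only to illustrate the call
print(eta_squared(15.0, 100.0))

# Hypothetical degrees of freedom (3 groups, 30 observations) for F = 5.23
print(eta_squared_from_f(5.23, df_between=2, df_within=27))
```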

**Q7.** In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

**Answer 7-** In a repeated measures ANOVA, missing data can be handled in several ways, such as:

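1. Pairwise deletion (available-case analysis): Each calculation uses every case that has data for the variables involved, so a participant is excluded only from the comparisons that need their missing measurement. Different comparisons are then based on different subsets of participants, and the estimates can be biased if the missing data are not missing completely at random (MCAR).

2. Listwise deletion (complete-case analysis): This involves deleting every case that has missing data on any variable in the analysis, so a participant with one missing measurement is removed entirely. This is what a standard repeated measures ANOVA requires, and it can lead to a substantial loss of power if many participants have at least one missing value.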

3. Imputation: This involves replacing missing values with estimated values. Imputation methods can be either single imputation or multiple imputation. Single imputation methods include mean imputation, median imputation, and regression imputation. Multiple imputation involves creating multiple imputed datasets based on the observed data and using these datasets to estimate the parameters of interest. 

The consequences of using different methods to handle missing data in a repeated measures ANOVA depend on the extent and pattern of missing data and the method used for handling it. If the missing data are MCAR, deletion methods yield unbiased estimates but lose power because data are discarded. If the missing data are not MCAR, pairwise or listwise deletion can lead to biased estimates as well as reduced power. Single imputation methods such as mean imputation understate the variability in the data, which makes standard errors too small even when the data are MCAR. Multiple imputation accounts for the uncertainty in the imputed values, but it can still lead to biased estimates if the imputation model is misspecified or if its assumptions are violated.

It is important to carefully consider the pattern and extent of missing data in a repeated measures ANOVA and to use appropriate methods for handling missing data to ensure valid and reliable results.
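
The difference between two of these approaches can be seen on a toy example. The sketch below uses pandas on a small, made-up wide-format table (one row per participant, one column per time point); the column names and values are hypothetical. It contrasts listwise deletion with mean imputation and shows how mean imputation shrinks the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 14.0, 13.0],
    "t2": [11.0, np.nan, 12.0, 15.0, 14.0],
    "t3": [13.0, 14.0, np.nan, 16.0, 15.0],
})

# Listwise deletion: keep only participants observed at every time point
complete_cases = scores.dropna()

# Mean imputation: fill each gap with that time point's observed mean
mean_imputed = scores.fillna(scores.mean())

print("Participants before deletion:", len(scores))
print("Participants after listwise deletion:", len(complete_cases))
print("Variance of t2, observed values only:", scores["t2"].var())
print("Variance of t2, after mean imputation:", mean_imputed["t2"].var())
```

Two of the five participants are lost to listwise deletion even though only two of the fifteen values are missing, while mean imputation keeps everyone but makes the data look less variable than they are.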

**Q8.** What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

**Answer -** Post-hoc tests are used after ANOVA to determine which groups differ significantly from each other. Some common post-hoc tests include Tukey's Honestly Significant Difference (HSD) test, Bonferroni correction, Scheffé's method, and Dunnett's test.

Tukey's HSD test is used when all pairwise comparisons between groups are of interest; it controls the family-wise error rate and is generally more powerful than Bonferroni or Scheffé for that purpose. The Bonferroni correction is simple and conservative, and works best when only a small number of pre-planned comparisons are made. Scheffé's method is the most conservative of these and is suited to testing complex contrasts (for example, one group against the average of two others) as well as pairwise comparisons. Dunnett's test is used when there is one control group and several treatment groups are each compared against it. When the group variances are unequal, the Games-Howell test is a common alternative.

For example, suppose a researcher conducts a study to determine if there is a difference in the mean test scores between three different teaching methods. After conducting an ANOVA, the researcher finds that there is a significant difference between the groups. To determine which groups differ significantly from each other, the researcher can conduct a post-hoc test such as Tukey's HSD test.

**Q9.** A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets.

Report the F-statistic and p-value, and interpret the results.

In [8]:
import numpy as np
from scipy.stats import f_oneway

# Generate simulated data (here 50 participants per diet), assuming normal distributions with the same variance
np.random.seed(1)
diet_A = np.random.normal(5, 1, 50)
diet_B = np.random.normal(4, 1, 50)
diet_C = np.random.normal(3, 1, 50)

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

# Set significance level
alpha = 0.05

# Null hypothesis: The mean weight loss is the same for all three diets.
# Alternative hypothesis: The mean weight loss is different for at least one diet.
null_hypothesis = "The mean weight loss is the same for all three diets."
alternate_hypothesis = "The mean weight loss is different for at least one diet."

print("F-statistic:", f_statistic)
print("p-value:", p_value)
if p_value < alpha:
    print("We reject the null hypothesis.")
    print(f"Conclusion : {alternate_hypothesis}")
else:
    print("We fail to reject the null hypothesis.")
    print(f"Conclusion : {null_hypothesis}")

F-statistic: 57.06379442059458
p-value: 4.5619061215783055e-19
We reject the null hypothesis.
Conclusion : The mean weight loss is different for at least one diet.


## Performing Tukey's test for mean difference:

In [9]:
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Create a DataFrame to hold the data
df = pd.DataFrame({'weight_loss': list(diet_A) + list(diet_B) + list(diet_C),
                   'group': np.repeat(['A', 'B', 'C'], repeats=50)})

# perform Tukey's test
tukey = pairwise_tukeyhsd(endog=df['weight_loss'],
                          groups=df['group'],
                          alpha=0.05)

# Print results
print(tukey)

Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower   upper  reject
---------------------------------------------------
     A      B  -0.8278   0.0 -1.2477 -0.4079   True
     A      C  -1.8898   0.0 -2.3097 -1.4699   True
     B      C   -1.062   0.0 -1.4819 -0.6421   True
---------------------------------------------------


## Interpretation: all three pairs of diets differ significantly (reject is True for every pair)
The meandiff column is the mean of group2 minus the mean of group1.
1. Mean difference between diet_A and diet_B is -0.8278
2. Mean difference between diet_A and diet_C is -1.8898
3. Mean difference between diet_B and diet_C is -1.062

**Q10.** A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs.experienced). Report the F-statistics and p-values, and interpret the results.

In [10]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Setting random seed for reproducibility
np.random.seed(123)

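# Generate completion times for 30 novice and 30 experienced employees (60 in total)
time_novice = np.random.normal(loc=15, scale=2, size=30)
time_expert = np.random.normal(loc=10, scale=2, size=30)

# Assemble the simulated data
# Note: with this layout every Software A user is a novice and every Software C user
# is experienced, so software and experience are partially confounded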
data = pd.DataFrame({
    'Software': ['A']*20 + ['B']*20 + ['C']*20,
    'Experience': ['Novice']*30 + ['Experienced']*30,
    'Time': list(time_novice)+list(time_expert)
})

# Print the simulated data head 
print('Simulated Data example :')
print(data.head())

print('\n======================================================================================\n')

# Fit the two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
table = sm.stats.anova_lm(model, typ=1)

# Set significance level
alpha = 0.05

# Main effects and interaction effect
print(table)
print('\n')
# Look up each p-value by row label rather than by position
if table.loc['C(Software)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant main effect of software.")
else:
    print("Conclusion: There is no significant main effect of software.")

if table.loc['C(Experience)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant main effect of experience.")
else:
    print("Conclusion: There is no significant main effect of experience.")

if table.loc['C(Software):C(Experience)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant interaction effect between software and experience.")
else:
    print("Conclusion: There is no significant interaction effect between software and experience.")

Simulated Data example :
  Software Experience       Time
0        A     Novice  12.828739
1        A     Novice  16.994691
2        A     Novice  15.565957
3        A     Novice  11.987411
4        A     Novice  13.842799


                             df      sum_sq     mean_sq          F  \
C(Software)                 2.0  204.881181  102.440590  18.135666   
C(Experience)               1.0  165.079097  165.079097  29.224933   
C(Software):C(Experience)   2.0   17.481552    8.740776   1.547431   
Residual                   56.0  316.319953    5.648571        NaN   

                                 PR(>F)  
C(Software)                8.460472e-07  
C(Experience)              1.375177e-06  
C(Software):C(Experience)  2.217544e-01  
Residual                            NaN  


Conclusion: There is a significant main effect of software.
Conclusion: There is a significant main effect of experience.
Conclusion: There is no significant interaction effect between software and experience.


## Here are the interpretations of the three conclusions:
1. "There is a significant main effect of software": Mean completion time differs across the three software programs. Because the table uses Type I (sequential) sums of squares and software is entered first, this effect is not adjusted for experience. In the simulated data every Software A user is a novice and every Software C user is experienced, so software and experience are confounded, and the apparent software effect may simply reflect the difference in experience rather than the software itself.

2. "There is a significant main effect of experience": Experience level is associated with completion time after accounting for software. In this simulation experienced employees were generated with a mean time of 10 and novices with a mean time of 15, so experienced employees complete the task faster. This finding can be helpful for the company to identify the best employees for a given task and to provide appropriate training for new employees.

3. "There is NO significant interaction effect between software and experience": There is no evidence that the effect of software on completion time depends on the experience level of the employees. Failing to reject is not proof that no interaction exists. Here two of the six software/experience combinations (Software A with experienced employees and Software C with novices) contain no observations, so the interaction cannot be properly estimated. A design in which employees of both experience levels use every program would be needed before the company could choose a program without considering experience.
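
Before fitting a two-way model it is worth checking that every combination of factor levels actually contains observations. The sketch below shows one way to simulate a balanced design and verify the cell counts with `pd.crosstab`. The cell size of 10 and the group means are assumptions made for illustration. The resulting DataFrame has the same columns as the example above and could be passed to the same `ols` formula.

```python
import itertools

import numpy as np
import pandas as pd

rng = np.random.default_rng(123)

software_levels = ["A", "B", "C"]
experience_levels = ["Novice", "Experienced"]
n_per_cell = 10  # assumed cell size, giving 60 employees in total

rows = []
for software, experience in itertools.product(software_levels, experience_levels):
    # Assumed means: novices average 15, experienced employees average 10
    mean_time = 15 if experience == "Novice" else 10
    for time in rng.normal(loc=mean_time, scale=2, size=n_per_cell):
        rows.append({"Software": software, "Experience": experience, "Time": time})

balanced = pd.DataFrame(rows)

# Every software/experience combination should have the same number of rows
cell_counts = pd.crosstab(balanced["Software"], balanced["Experience"])
print(cell_counts)
```

With equal counts in every cell the two factors are not confounded, so the main effects and the interaction can be estimated separately.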

**Q11.** An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

## 1. Two-sample t-test, alpha = 0.05

In [11]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

# Setting numpy random seed
np.random.seed(45)

# Generating normally distributed test scores with the same variance for both groups
test_score_control = np.random.normal(loc=70, scale=3, size=50)
test_score_experimental = np.random.normal(loc=85, scale=3, size=50)

# Creating the dataframe
df = pd.DataFrame({'test_score':list(test_score_control)+list(test_score_experimental),
                   'group':['control']*50 + ['experimental']*50})

# printing the sample dataframe
print('Simulated data for test_scores:')
print(df.head())
print('\n===============================\n')

null_hypothesis = "There is NO difference in test scores between the control and experimental groups."
alt_hypothesis = "There is SIGNIFICANT difference in test scores between the control and experimental groups."

# Conduct the two-sample t-test
control_scores = df[df['group'] == 'control']['test_score']
experimental_scores = df[df['group'] == 'experimental']['test_score']
t_stat, p_val = ttest_ind(control_scores, experimental_scores, equal_var=True)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_val}")
print('\n')

# Significance value 
alpha = 0.05
if p_val<alpha:
    print('Reject the Null Hypothesis')
    print(f'Conclusion : {alt_hypothesis}')
else:
    print('Failed to reject the Null Hypothesis')
    print(f'Conclusion : {null_hypothesis}')

Simulated data for test_scores:
   test_score    group
0   70.079124  control
1   70.780965  control
2   68.814563  control
3   69.387097  control
4   66.185102  control


t-statistic: -28.5074, p-value: 3.096206271894725e-49


Reject the Null Hypothesis
Conclusion : There is SIGNIFICANT difference in test scores between the control and experimental groups.


## 2. Tukey's HSD test

In [12]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Conduct post-hoc Tukey's test
tukey_results = pairwise_tukeyhsd(df['test_score'], df['group'], 0.05)
print(tukey_results)

   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
control experimental  15.8829   0.0 14.7773 16.9886   True
----------------------------------------------------------


## Tukey's Results Interpretation

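1. reject = True indicates a significant difference between the control and experimental group means; the adjusted p-value (p-adj) is shown as 0.0 because it is smaller than the displayed precision.

2. The mean test score in the experimental group is approximately 15.88 points higher than in the control group (meandiff is the mean of group2 minus the mean of group1).

3. The 95% confidence interval for this mean difference is (14.78, 16.99).

With only two groups there is just one pairwise comparison, so the post-hoc test leads to the same conclusion as the t-test. Post-hoc tests become informative when three or more groups are compared.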
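
The reason a post-hoc test adds little for two groups can be checked numerically: with two groups, the one-way ANOVA F-statistic equals the square of the equal-variance t-statistic, and the two p-values coincide. The sketch below demonstrates this with SciPy on freshly simulated scores. The generator and seed are chosen here for illustration, so the values are not those of the cell above.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Simulated scores (illustrative; not the same draws as the example above)
rng = np.random.default_rng(45)
control = rng.normal(loc=70, scale=3, size=50)
experimental = rng.normal(loc=85, scale=3, size=50)

# Equal-variance two-sample t-test and one-way ANOVA on the same two groups
t_stat, t_p = ttest_ind(control, experimental, equal_var=True)
f_stat, f_p = f_oneway(control, experimental)

print(f"t squared: {t_stat ** 2:.4f}, F: {f_stat:.4f}")
print(f"t-test p-value: {t_p:.3e}, ANOVA p-value: {f_p:.3e}")
```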

**Q12.** A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

## Answer - 

1. Assumed significance level of 0.05

In [13]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# set random seed for reproducibility
np.random.seed(456)

# generate sales data for Store A, B, and C
sales_a = np.random.normal(loc=1000, scale=100, size=(30,))
sales_b = np.random.normal(loc=1050, scale=150, size=(30,))
sales_c = np.random.normal(loc=800, scale=80, size=(30,))

# create a DataFrame to store the sales data
sales_df = pd.DataFrame({'Store A': sales_a, 'Store B': sales_b, 'Store C': sales_c})

# reshape the DataFrame for repeated measures ANOVA
sales_melted = pd.melt(sales_df.reset_index(), id_vars=['index'], value_vars=['Store A', 'Store B', 'Store C'])
sales_melted.columns = ['Day', 'Store', 'Sales']

# Printing top 5 rows of generated data
print('Generated data top 5 rows : ')
print(sales_melted.head())

print('\n================================================\n')

# perform repeated measures ANOVA
rm_anova = AnovaRM(sales_melted, 'Sales', 'Day', within=['Store'])
rm_results = rm_anova.fit()
print(rm_results)

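# check if null hypothesis should be rejected based on p-value (looked up by row label)
if rm_results.anova_table.loc['Store', 'Pr > F'] < 0.05:
    # perform post-hoc Tukey test (note: it treats the stores as independent groups and ignores the pairing by day)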
    print('Reject the Null Hypothesis : Atleast one of the group has different mean.\n')
    print('Tukey HSD posthoc test:')
    tukey_results = pairwise_tukeyhsd(sales_melted['Sales'], sales_melted['Store'])
    print(tukey_results)
else:
    print('NO significant difference between groups.')

Generated data top 5 rows : 
   Day    Store        Sales
0    0  Store A   933.187150
1    1  Store A   950.179048
2    2  Store A  1061.857582
3    3  Store A  1056.869225
4    4  Store A  1135.050948


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store 51.5040 2.0000 58.0000 0.0000

Reject the Null Hypothesis : Atleast one of the group has different mean.

Tukey HSD posthoc test:
    Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1  group2  meandiff p-adj    lower     upper   reject
-----------------------------------------------------------
Store A Store B   21.2439 0.6945   -40.881   83.3688  False
Store A Store C -207.8078    0.0 -269.9328 -145.6829   True
Store B Store C -229.0517    0.0 -291.1766 -166.9268   True
-----------------------------------------------------------


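## Interpretation of the above - 

1. In the repeated measures ANOVA the p-value (Pr > F) is displayed as 0.0000, which is less than 0.05, so we reject the null hypothesis. This means that the mean sales of at least one store differ from the others.

2. Tukey's post-hoc test gives the following interpretation (meandiff is the mean of group2 minus the mean of group1):

- No significant difference between the sales of Store A and Store B (reject=False, p-adj = 0.6945). Store B's mean is 21.24 dollars higher than Store A's, but a difference of this size is consistent with chance.
- Significant difference between the sales of Store A and Store C (reject=True). Store C's mean sales are approximately 207.8 dollars lower than Store A's.
- Significant difference between the sales of Store B and Store C (reject=True). Store C's mean sales are approximately 229.1 dollars lower than Store B's.

Note that Tukey's HSD treats the three stores as independent groups and does not use the pairing by day, so it is only an approximate follow-up to a repeated measures ANOVA.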
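
A follow-up that respects the repeated-measures structure is to run paired t-tests between stores and correct for the number of comparisons. The sketch below uses SciPy's `ttest_rel` with a Bonferroni correction. This is a swapped-in technique rather than the Tukey test used above, and the sales figures are simulated afresh here for illustration, so the numbers will not match the output above.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(456)

# Hypothetical daily sales: one value per store per day, same 30 days for every store
sales = {
    "Store A": rng.normal(loc=1000, scale=100, size=30),
    "Store B": rng.normal(loc=1050, scale=150, size=30),
    "Store C": rng.normal(loc=800, scale=80, size=30),
}

pairs = list(combinations(sales, 2))
alpha = 0.05

results = {}
for first, second in pairs:
    # Paired test: compares the two stores day by day
    t_stat, p_raw = ttest_rel(sales[first], sales[second])
    # Bonferroni: multiply by the number of comparisons, capped at 1
    p_adj = min(1.0, p_raw * len(pairs))
    results[(first, second)] = (t_stat, p_adj)
    print(f"{first} vs {second}: t = {t_stat:.3f}, adjusted p = {p_adj:.4g}, reject = {p_adj < alpha}")
```

Bonferroni is conservative, so with many stores a less strict correction (for example Holm's method) may be preferable.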