**Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.**

**Ans:**

ANOVA (Analysis of Variance) is a statistical test used to determine whether there are significant differences among the means of two or more groups. It is based on several assumptions that need to be met for the results to be valid. These assumptions include:

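1. Normality: The residuals (the observations within each group) should be approximately normally distributed. With large samples, the sampling distribution of the group means is approximately normal even when the raw data are not.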

2. Homogeneity of variances: The variances of each group should be equal.

3. Independence of samples: The samples should be independent of each other.

4. Absence of extreme outliers: This is a practical condition rather than a formal assumption, but extreme values can distort the group means and inflate the variances.



Examples of violations that could impact the validity of the results include:


1. Violation of normality assumption: 

   If the data within the groups are strongly skewed or heavy-tailed, the p-value of the F-test may be inaccurate, especially with small samples. In such cases, a non-parametric test such as the Kruskal-Wallis test may be more appropriate.


2. Violation of homogeneity of variances: 

   If the group variances differ substantially, the F-test can be too liberal or too conservative, particularly when the group sizes are also unequal. For example, one group may have a standard deviation several times larger than the others. In such cases, Welch's ANOVA may be more appropriate.


3. Violation of independence assumption: 

   If the samples are not independent, the ANOVA results may be biased. For example, if the same subjects are used in each group, the assumption of independence may be violated. In such cases, a repeated-measures ANOVA test may be more appropriate.


4. Presence of outliers: 

   If there are extreme values or outliers in the data, the ANOVA results may be affected. Outliers can have a disproportionate effect on the group means and variances and can lead to incorrect conclusions. In such cases, it may be appropriate to investigate the outliers, remove them only when they reflect errors, or use a non-parametric test.
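
The normality and equal-variance assumptions above can be checked before running the ANOVA. The following is a minimal sketch, assuming three hypothetical groups of simulated data; the group names, sizes, and parameters are illustrative and not part of the original question. It uses the Shapiro-Wilk test for normality, Levene's test for homogeneity of variances, and the Kruskal-Wallis test as a rank-based fallback.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three independent groups drawn from normal populations
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10, scale=2, size=30),
    "B": rng.normal(loc=11, scale=2, size=30),
    "C": rng.normal(loc=12, scale=2, size=30),
}

# Normality: Shapiro-Wilk test on each group (H0: the sample is normal)
shapiro_p = {name: stats.shapiro(values)[1] for name, values in groups.items()}

# Homogeneity of variances: Levene's test (H0: all group variances are equal)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")

# Rank-based alternative if normality or equal variances look doubtful
kw_stat, kw_p = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis p-value: {kw_p:.3f}")
```

Small p-values in the first two checks suggest that the corresponding assumption may not hold. These tests are themselves sensitive to sample size, so they are best read alongside plots of the data.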

**Q2. What are the three types of ANOVA, and in what situations would each be used?**

**Ans:**

1. One-way ANOVA: 

    A one-way ANOVA is used when there is only one independent variable or factor with two or more levels. The aim is to compare the means of the dependent variable across different levels of the independent variable. 

    For example, a one-way ANOVA can be used to compare the mean height of plants grown in three different soil types (sand, loam, clay).
    

2. Repeated measures ANOVA: 

    A repeated measures ANOVA is used when the same group of participants is measured under different conditions or at different time points. The aim is to compare the means of the dependent variable across different levels of the within-subjects factor (i.e., the different conditions or time points). 

    For example, a repeated measures ANOVA can be used to compare the mean scores of participants on a cognitive task measured before and after a treatment intervention.
    

3. Factorial ANOVA: 

    A factorial ANOVA is used when there are two or more independent variables or factors, each with two or more levels. The aim is to examine the main effects of each independent variable and their interaction effect on the dependent variable. 

    For example, a factorial ANOVA can be used to examine the main effects of gender (male vs female) and age group (young vs old) on a measure of physical fitness, as well as the interaction effect between these two independent variables.

**Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?**

**Ans:**

Partitioning of variance in ANOVA refers to the process of dividing the total variance of the dependent variable into different sources of variation that can be attributed to different factors or sources. This is important because it allows us to identify which factors are contributing significantly to the variation in the dependent variable and which are not.

The total variance in the dependent variable can be divided into two components: the within-group variance and the between-group variance. The within-group variance represents the variation in the dependent variable that is due to random or uncontrolled factors within each group, while the between-group variance represents the variation that is due to the differences between the group means.

The partitioning itself produces sums of squares; dividing each by its degrees of freedom gives the between-group and within-group mean squares. The F-test in ANOVA is the ratio of the between-group mean square to the within-group mean square. If this ratio is large enough that the corresponding p-value falls below the chosen significance level, it suggests that at least one group mean differs from the others, and we can reject the null hypothesis of no group differences.

Understanding the partitioning of variance is important because it helps us to determine which factors are most important in explaining the variation in the dependent variable. This information can be used to guide further research, develop interventions, or make decisions based on the results of the study. It also helps us to identify potential confounding variables that may need to be controlled for in future studies.
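
The partition can be verified by hand. The sketch below computes the total, between-group, and within-group sums of squares with NumPy for the same ten values used in the Q4 example, checks that the total equals the sum of the two components, and compares the hand-computed F ratio with `scipy.stats.f_oneway`. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import f_oneway

# The ten values from the Q4 example, split by group (A, B, C)
groups = [
    np.array([2.0, 4.0, 5.0]),
    np.array([3.0, 6.0, 7.0]),
    np.array([8.0, 9.0, 10.0, 11.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation around the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()
# Variation of the group means around the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares and the F ratio
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_value = f_oneway(*groups)

print(f"SS total: {ss_total:.4f}")
print(f"SS between + SS within: {ss_between + ss_within:.4f}")
print(f"F (manual): {f_manual:.4f}, F (scipy): {f_scipy:.4f}, p-value: {p_value:.4f}")
```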

**Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?**

**Ans:**

In [3]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a pandas dataframe with a dependent variable and an independent variable
dependent_variable = np.array([2, 4, 5, 3, 6, 7, 8, 9, 10, 11])
group_variable = np.array(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'])
df = pd.DataFrame({'dependent_variable': dependent_variable, 'group_variable': group_variable})

# Define the ANOVA model
model = ols('dependent_variable ~ group_variable', data=df).fit()

# Build the ANOVA table once; its rows are 'group_variable' and 'Residual'
anova_table = sm.stats.anova_lm(model, typ=1)

# Explained (between-group) sum of squares (SSE)
sse = anova_table.loc['group_variable', 'sum_sq']

# Residual (within-group) sum of squares (SSR)
ssr = anova_table.loc['Residual', 'sum_sq']

# Total sum of squares (SST) is the sum of the two components
sst = sse + ssr

print(f'SST: {sst:.4f}')
print(f'SSE: {sse:.4f}')
print(f'SSR: {ssr:.4f}')

SST: 82.5000
SSE: 64.1667
SSR: 18.3333


**Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?**

**Ans:**

In [5]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a pandas dataframe with two independent variables and a dependent variable
independent_variable_1 = np.array(['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'])
independent_variable_2 = np.array(['X', 'X', 'X', 'Y', 'Y', 'Y', 'X', 'X', 'X', 'Y', 'Y', 'Y'])
dependent_variable = np.array([10, 12, 8, 14, 16, 12, 18, 20, 16, 22, 24, 20])
df = pd.DataFrame({'independent_variable_1': independent_variable_1, 'independent_variable_2': independent_variable_2, 'dependent_variable': dependent_variable})


# Define the ANOVA model with interaction
model_interaction = ols('dependent_variable ~ independent_variable_1 * independent_variable_2', data=df).fit()

# Build the ANOVA table once; its rows are labelled by term name
anova_table = sm.stats.anova_lm(model_interaction, typ=1)

# Sums of squares for the two main effects and their interaction
main_effect_1 = anova_table.loc['independent_variable_1', 'sum_sq']
main_effect_2 = anova_table.loc['independent_variable_2', 'sum_sq']
interaction_effect = anova_table.loc['independent_variable_1:independent_variable_2', 'sum_sq']

print(f'Main effect of independent variable 1: {main_effect_1:.4f}')
print(f'Main effect of independent variable 2: {main_effect_2:.4f}')
print(f'Interaction effect: {interaction_effect:.4f}')

Main effect of independent variable 1: 32.0000
Main effect of independent variable 2: 48.0000
Interaction effect: 0.0000


**Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?**

**Ans:**

In a one-way ANOVA, the F-statistic tests the null hypothesis that the means of all the groups are equal, against the alternative hypothesis that at least one group mean is different from the others. The p-value associated with the F-statistic represents the probability of obtaining a test statistic as extreme or more extreme than the observed F-statistic, assuming that the null hypothesis is true.

In this case, the obtained F-statistic is 5.23 and the p-value is 0.02. Since the p-value is less than the commonly used significance level of 0.05, we can reject the null hypothesis and conclude that at least one group mean is different from the others.

We can interpret this result as evidence of statistically significant differences between the groups on the dependent variable. However, we cannot determine which group(s) is/are different from the others based solely on the ANOVA results. To determine which groups are different, we need to conduct post-hoc tests (e.g., Tukey's HSD, Bonferroni correction, etc.) or use other statistical techniques (e.g., contrasts).

It's important to note that statistical significance does not necessarily imply practical significance or importance. The size of the differences between the groups should also be considered when interpreting the results.
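
To make the point about effect size concrete, the sketch below recovers a p-value and eta squared from an F-statistic. The question does not state the degrees of freedom, so the values used here (3 groups and 30 observations in total) are assumptions for illustration only, and the resulting p-value need not equal the 0.02 quoted in the question. The helper names are hypothetical.

```python
from scipy.stats import f


def p_value_from_f(f_stat, df_between, df_within):
    """Upper-tail probability of the F distribution."""
    return f.sf(f_stat, df_between, df_within)


def eta_squared_from_f(f_stat, df_between, df_within):
    """Eta squared for a one-way ANOVA, recovered from F and its degrees of freedom."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)


# Assumed degrees of freedom: 3 groups, 30 observations in total
df_between, df_within = 2, 27
f_stat = 5.23

p = p_value_from_f(f_stat, df_between, df_within)
eta_sq = eta_squared_from_f(f_stat, df_between, df_within)

print(f"p-value: {p:.4f}")
print(f"eta squared: {eta_sq:.4f}")
```

Eta squared is the share of the total variation attributed to group membership, so it complements the p-value when judging practical importance.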

**Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?**

**Ans:**

In a repeated measures ANOVA, missing data can be handled in several ways:


1. Pairwise deletion: 

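    This involves using all available data for each individual comparison: a participant is left out only of those comparisons that involve a measurement they are missing. This keeps more data than listwise deletion, but different comparisons are then based on different subsets of participants, and the results can still be biased if the data are not missing completely at random.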


2. Listwise deletion: 

    This involves excluding any participants with missing data on any variable in the analysis. This approach can also result in a loss of statistical power and potentially biased results if the missing data is not missing completely at random.


3. Imputation: 

    This involves filling in missing data with plausible values based on the observed data. This approach maintains the sample size and can improve statistical power and reduce bias. There are several methods for imputing missing data, including mean imputation, regression imputation, and multiple imputation. Multiple imputation is generally preferred because single-value methods such as mean imputation understate the variability in the data.




The potential consequences of using different methods to handle missing data include:


1. Bias: 

    If the missing data is not missing completely at random, then the results of the analysis may be biased. For example, if participants with missing data are systematically different from those without missing data, then excluding them from the analysis or imputing missing data based on the observed data may result in biased estimates.


2. Loss of power: 

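    Excluding participants with missing data reduces the sample size and therefore the statistical power, which can make it more difficult to detect significant effects. Imputation keeps the sample size, but single-value imputation can make the data look less variable than they are and so overstate significance.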


3. Inaccurate estimates: 

    Different imputation methods can result in different estimates of the effect sizes, standard errors, and p-values. Using inappropriate or invalid imputation methods can result in inaccurate estimates.
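
The difference between these approaches can be seen on a small example. The sketch below uses a hypothetical wide-format table (one row per participant, one column per time point) with made-up values and compares listwise deletion, pairwise deletion for a single comparison, and mean imputation using pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
wide = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5],
    "t1": [10.0, 12.0, 11.0, 13.0, 9.0],
    "t2": [11.0, np.nan, 12.0, 14.0, 10.0],
    "t3": [12.0, 13.0, np.nan, 15.0, 11.0],
})
time_cols = ["t1", "t2", "t3"]

# Listwise deletion: keep only participants observed at every time point
listwise = wide.dropna(subset=time_cols)

# Pairwise deletion: the t1 vs t2 comparison uses everyone observed at both points
pair_t1_t2 = wide.dropna(subset=["t1", "t2"])

# Mean imputation: replace each missing value with its column mean
# (simple, but it understates the variability of the data)
mean_imputed = wide.copy()
mean_imputed[time_cols] = wide[time_cols].fillna(wide[time_cols].mean())

print(f"Participants after listwise deletion: {len(listwise)}")
print(f"Participants available for t1 vs t2: {len(pair_t1_t2)}")
print(mean_imputed)
```

Listwise deletion drops two of the five participants here, while the pairwise comparison of `t1` and `t2` loses only one.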

**Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.**

**Ans:**

Post-hoc tests are used in ANOVA to compare specific pairs of means after a statistically significant result has been found. 

There are several common post-hoc tests used after ANOVA, including:


1. Tukey's Honestly Significant Difference (HSD) test: 

    This test compares all possible pairs of means while controlling the family-wise error rate. It is a common default when all pairwise comparisons are of interest, particularly when group sizes are equal (the Tukey-Kramer variant handles unequal group sizes).


2. Bonferroni correction: 

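    This method adjusts the significance level (or, equivalently, the p-values) of the individual comparisons by the number of comparisons made. It is simple and can be applied to any set of tests, and it is best suited to a small number of planned comparisons because it becomes very conservative as the number of comparisons grows.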


3. Scheffe's test: 

    This test controls the overall Type I error rate across all possible contrasts, including complex ones that combine several groups. It is the most conservative of the three for simple pairwise comparisons, so it is mainly used when complex contrasts are of interest.


An example of a situation where a post-hoc test might be necessary is if a one-way ANOVA showed that there was a statistically significant difference in the mean scores on an exam across different levels of study habits. A post-hoc test could be used to determine which specific groups differed significantly from each other. For instance, if we found that the ANOVA F-test was significant and showed a difference in the mean exam scores across the groups with different study habits, we could conduct Tukey's HSD test to see which groups differ significantly from each other. This would allow us to make more specific conclusions about the differences between the groups, rather than simply concluding that there is a significant difference between them.
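
The study-habits example can be sketched with `pairwise_tukeyhsd` from statsmodels. The exam scores below are simulated, and the group labels (`low`, `medium`, `high`), means, and sample sizes are assumptions chosen only to illustrate the call.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical exam scores for three study-habit groups (20 students each)
rng = np.random.default_rng(42)
scores = np.concatenate([
    rng.normal(loc=60, scale=5, size=20),  # low
    rng.normal(loc=75, scale=5, size=20),  # medium
    rng.normal(loc=90, scale=5, size=20),  # high
])
habits = np.repeat(["low", "medium", "high"], 20)

# Tukey's HSD compares every pair of groups at a family-wise error rate of 0.05
tukey = pairwise_tukeyhsd(endog=scores, groups=habits, alpha=0.05)
print(tukey.summary())
```

Each row of the summary reports the mean difference for one pair of groups, its adjusted p-value, and whether the null hypothesis of equal means is rejected for that pair.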

**Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.**

**Ans:**

H0 :- There is no significant difference between the mean weight loss of the three diets.

H1 :- At least one diet has a significantly different mean weight loss.

In [15]:
import numpy as np
from scipy.stats import f_oneway

# Simulate weight loss (lbs) for each diet: means of 5, 6, and 4 with a standard deviation of 1
# Note: 50 values are simulated per diet, and no random seed is set, so results vary between runs
diet_a = np.random.normal(5,1,50)
diet_b = np.random.normal(6,1,50)
diet_c = np.random.normal(4,1,50)

# Combine all the data in a single array
all_data = np.concatenate([diet_a,diet_b,diet_c])

# Create the list of labels for the three diets
labels = ['A'] * 50 + ['B'] * 50 + ['C'] * 50

f_stat, p_val = f_oneway(diet_a,diet_b,diet_c)

print(f"f_stat: {f_stat} \np_val: {p_val}")

# Compare the p-value with the significance level
alpha = 0.05
if p_val < alpha:
    print('We reject the null hypothesis.')
else:
    print('We failed to reject null hypothesis')

f_stat: 44.75829213306713 
p_val: 6.591687716951498e-16
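We reject the null hypothesis.

Because the p-value is far below 0.05, the null hypothesis of equal mean weight loss is rejected: at least one diet differs from the others. The ANOVA alone does not show which diets differ; a post-hoc test such as Tukey's HSD would be needed for that.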


**Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.**

**Ans:**


In [18]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Setting random seed for reproducibility
np.random.seed(123)

# Generating 2 random time samples for novice and expert
time_novice = np.random.normal(loc=15, scale=2, size=30)
time_expert = np.random.normal(loc=10, scale=2, size=30)

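# Generate simulated data
# Caveat: labels are assigned in blocks, so Software A contains only novices and
# Software C only experienced employees. Two of the six cells are empty, so the
# interaction row of the ANOVA table below should be read with caution.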
data = pd.DataFrame({
    'Software': ['A']*20 + ['B']*20 + ['C']*20,
    'Experience': ['Novice']*30 + ['Experienced']*30,
    'Time': list(time_novice)+list(time_expert)
})

# Print the simulated data head 
print('Simulated Data example :')
print(data.head())

print('\n \n')

# Fit the two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
table = sm.stats.anova_lm(model, typ=1)

# Set significance level
alpha = 0.05

# Main effects and interaction effect
print(table)
print('\n')
if table.loc['C(Software)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant main effect of software.")
else:
    print("Conclusion: There is no significant main effect of software.")

if table.loc['C(Experience)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant main effect of experience.")
else:
    print("Conclusion: There is no significant main effect of experience.")

if table.loc['C(Software):C(Experience)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant interaction effect between software and experience.")
else:
    print("Conclusion: There is no significant interaction effect between software and experience.")

Simulated Data example :
  Software Experience       Time
0        A     Novice  12.828739
1        A     Novice  16.994691
2        A     Novice  15.565957
3        A     Novice  11.987411
4        A     Novice  13.842799

 

                             df      sum_sq     mean_sq          F  \
C(Software)                 2.0  204.881181  102.440590  18.135666   
C(Experience)               1.0  165.079097  165.079097  29.224933   
C(Software):C(Experience)   2.0   17.481552    8.740776   1.547431   
Residual                   56.0  316.319953    5.648571        NaN   

                                 PR(>F)  
C(Software)                8.460472e-07  
C(Experience)              1.375177e-06  
C(Software):C(Experience)  2.217544e-01  
Residual                            NaN  


Conclusion: There is a significant main effect of software.
Conclusion: There is a significant main effect of experience.
Conclusion: There is no significant interaction effect between software and experience.


**Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.**

**Ans:**

In [20]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

# Setting numpy random seed
np.random.seed(45)

# Generating normally distributed test scores with the same variance for both groups
test_score_control = np.random.normal(loc=70, scale=3, size=50)
test_score_experimental = np.random.normal(loc=85, scale=3, size=50)

# Creating the dataframe
df = pd.DataFrame({'test_score':list(test_score_control)+list(test_score_experimental),
                   'group':['control']*50 + ['experimental']*50})

# printing the sample dataframe
print('Simulated data for test_scores:')
print(df.head())
print('\n \n')

null_hypothesis = "There is NO difference in test scores between the control and experimental groups."
alt_hypothesis = "There is SIGNIFICANT difference in test scores between the control and experimental groups."

# Conduct the two-sample t-test
control_scores = df[df['group'] == 'control']['test_score']
experimental_scores = df[df['group'] == 'experimental']['test_score']
t_stat, p_val = ttest_ind(control_scores, experimental_scores, equal_var=True)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_val}")
print('\n')

# Significance value 
alpha = 0.05
if p_val<alpha:
    print('Reject the Null Hypothesis')
    print(f'Conclusion : {alt_hypothesis}')
else:
    print('Failed to reject the Null Hypothesis')
    print(f'Conclusion : {null_hypothesis}')

Simulated data for test_scores:
   test_score    group
0   70.079124  control
1   70.780965  control
2   68.814563  control
3   69.387097  control
4   66.185102  control

 

t-statistic: -28.5074, p-value: 3.096206271894725e-49


Reject the Null Hypothesis
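Conclusion : There is SIGNIFICANT difference in test scores between the control and experimental groups.

With only two groups, no post-hoc test is needed: the significant t-test already shows that the control and experimental groups differ, and the negative t-statistic indicates that the experimental group's mean score is higher.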


**Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.**

**Ans:**

In [22]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# set random seed for reproducibility
np.random.seed(456)

# generate sales data for Store A, B, and C
sales_a = np.random.normal(loc=1000, scale=100, size=(30,))
sales_b = np.random.normal(loc=1050, scale=150, size=(30,))
sales_c = np.random.normal(loc=800, scale=80, size=(30,))

# create a DataFrame to store the sales data
sales_df = pd.DataFrame({'Store A': sales_a, 'Store B': sales_b, 'Store C': sales_c})

# reshape the DataFrame for repeated measures ANOVA
sales_melted = pd.melt(sales_df.reset_index(), id_vars=['index'], value_vars=['Store A', 'Store B', 'Store C'])
sales_melted.columns = ['Day', 'Store', 'Sales']

# Printing top 5 rows of generated data
print('Generated data top 5 rows : ')
print(sales_melted.head())

print('\n \n')

# perform repeated measures ANOVA
rm_anova = AnovaRM(sales_melted, 'Sales', 'Day', within=['Store'])
rm_results = rm_anova.fit()
print(rm_results)

# check if null hypothesis should be rejected based on p-value
if rm_results.anova_table['Pr > F'].iloc[0] < 0.05:
    # perform post-hoc Tukey test
    # note: Tukey HSD treats the stores as independent groups and ignores the pairing by day
    print('Reject the Null Hypothesis : At least one group has a different mean.\n')
    print('Tukey HSD posthoc test:')
    tukey_results = pairwise_tukeyhsd(sales_melted['Sales'], sales_melted['Store'])
    print(tukey_results)
else:
    print('NO significant difference between groups.')

Generated data top 5 rows : 
   Day    Store        Sales
0    0  Store A   933.187150
1    1  Store A   950.179048
2    2  Store A  1061.857582
3    3  Store A  1056.869225
4    4  Store A  1135.050948

 

               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store 51.5040 2.0000 58.0000 0.0000

Reject the Null Hypothesis : At least one group has a different mean.

Tukey HSD posthoc test:
    Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1  group2  meandiff p-adj    lower     upper   reject
-----------------------------------------------------------
Store A Store B   21.2439 0.6786  -40.8832    83.371  False
Store A Store C -207.8078  0.001 -269.9349 -145.6807   True
Store B Store C -229.0517  0.001 -291.1788 -166.9246   True
-----------------------------------------------------------
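
Tukey's HSD as used above treats the three stores as independent groups, whereas the sales here are paired by day. One common alternative for repeated measures is a set of paired t-tests with a Bonferroni adjustment. The sketch below assumes simulated sales with the same means and standard deviations as above but a different random generator, so its numbers will not match the output shown; the `results` dictionary is an illustrative name.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_rel

# Hypothetical daily sales for the same 30 days at each store
rng = np.random.default_rng(456)
sales = {
    "Store A": rng.normal(loc=1000, scale=100, size=30),
    "Store B": rng.normal(loc=1050, scale=150, size=30),
    "Store C": rng.normal(loc=800, scale=80, size=30),
}

# Paired t-test for every pair of stores, with Bonferroni-adjusted p-values
pairs = list(combinations(sales, 2))
results = {}
for first, second in pairs:
    t_stat, p_raw = ttest_rel(sales[first], sales[second])
    p_adj = min(p_raw * len(pairs), 1.0)  # Bonferroni: multiply by number of comparisons
    results[(first, second)] = (t_stat, p_adj)

for (first, second), (t_stat, p_adj) in results.items():
    print(f"{first} vs {second}: t = {t_stat:.3f}, adjusted p = {p_adj:.4f}")
```

A pair is declared different when its adjusted p-value falls below the chosen significance level, for example 0.05.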
