Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

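Assumptions in ANOVA

- Normality: the sampling distribution of the means (in practice, the residuals within each group) is approximately normal
- Absence of outliers
- Homogeneity of variance: the groups have approximately equal variances
- Independence: the samples are random and the observations are independent of one another

Examples of violations: strongly skewed data such as reaction times or incomes (normality), one group being far more variable than the others (homogeneity of variance), and repeated measurements of the same participants being treated as separate, independent observations (independence). ANOVA is fairly robust to moderate departures from normality, so violations of the homogeneity of variance assumption tend to be more impactful, especially when sample sizes are unequal between conditions.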
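The normality and homogeneity assumptions can be checked before running the ANOVA. The following is a minimal sketch, assuming three small made-up groups and using the Shapiro-Wilk and Levene tests from scipy; other diagnostics (Q-Q plots, Bartlett's test) would serve equally well.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three independent groups
group1 = np.array([10, 12, 14, 16, 18])
group2 = np.array([9, 11, 13, 15, 17])
group3 = np.array([8, 10, 12, 14, 16])
groups = [group1, group2, group3]

# Normality: Shapiro-Wilk test on the residuals (each value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value from either test (for example below 0.05) suggests that the corresponding assumption is violated.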

Q2. What are the three types of ANOVA, and in what situations would each be used?

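- One-way ANOVA
- Repeated measures ANOVA
- Factorial ANOVA

One-way ANOVA: used when there is one independent variable with three or more independent groups. Example: your independent variable is social media use, and you assign groups to low, medium, and high levels of social media use to find out if there is a difference in hours of sleep per night.

Repeated measures ANOVA: used when the same subjects are measured under every condition or at every time point. Example: cigarette consumption measured at three time points, immediately before, 1 month after, and 6 months after a hypnotherapy programme.

Factorial ANOVA: used when there are two or more independent variables and you want to test their main effects and their interaction. Example: testing whether hours of sleep depend on both social media use (low, medium, high) and age group.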

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

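Partitioning of variance means dividing the total variance in the data into the various sources of that variation: the variation between the groups (explained by the factor) and the variation within the groups (unexplained, or error).

It is important because it provides a way to quantify the sources of variation in a dataset and to assess the significance of different factors or treatments: the F-statistic is the ratio of the between-group variance to the within-group variance.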
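Written out for a one-way design, the partition looks as follows. The notation is assumed here for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $\bar{x}_i$ the mean of group $i$, and $\bar{x}$ the grand mean.

```latex
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2}_{SS_{\text{total}}}
  = \underbrace{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2}_{SS_{\text{between}}}
  + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}_{SS_{\text{within}}}
\qquad
F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
```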

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

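SST (total sum of squares) = the sum of the squared deviations of every observation from the grand mean. It equals SS(between) + SS(within).

SSE (sum of squares within the samples) = the squared deviations of each observation from its own group mean, summed over all samples.

SSR (sum of squares between the groups) = the squared deviations of each group mean from the grand mean, counted once per observation in that group and summed.

Naming varies between textbooks: the question calls the between-group part SSE (explained) and the within-group part SSR (residual), whereas the code below uses SSE for the error (within-group) sum of squares and SSR for the regression (between-group) sum of squares.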

In [3]:
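import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Three samples of five observations each
group1 = np.array([10, 12, 14, 16, 18])
group2 = np.array([9, 11, 13, 15, 17])
group3 = np.array([8, 10, 12, 14, 16])

# Stack all observations into one array, with a matching array of group labels
data = np.concatenate([group1, group2, group3])
groups = np.array(['group1'] * 5 + ['group2'] * 5 + ['group3'] * 5)

# One-way ANOVA expressed as a linear model with the group as a categorical predictor
model = ols('data ~ groups', data={'data': data, 'groups': groups}).fit()

# Total: squared deviations of every observation from the grand mean
sst = np.sum((data - np.mean(data))**2)

# Error (within groups): squared residuals around each group's fitted mean
sse = np.sum(model.resid**2)

# Regression (between groups): squared deviations of the fitted group means from the grand mean
ssr = np.sum((model.fittedvalues - np.mean(data))**2)

print('SST:', sst)
print('SSE:', sse)
print('SSR:', ssr)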


SST: 130.0
SSE: 120.00000000000001
SSR: 10.000000000000018


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

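# "data.csv" is assumed to contain the columns response, factorA and factorB
data = pd.read_csv("data.csv")

# Fit a model with both main effects and their interaction
model = ols('response ~ C(factorA) + C(factorB) + C(factorA):C(factorB)', data=data).fit()

# A single ANOVA table (Type II sums of squares) holds all three effects
anova_table = sm.stats.anova_lm(model, typ=2)

# Rows for the two main effects and for the interaction
main_effects = anova_table.loc[['C(factorA)', 'C(factorB)']]
interaction_effect = anova_table.loc[['C(factorA):C(factorB)']]

print(main_effects)
print(interaction_effect)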


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

The p-value of 0.02 means that, if the null hypothesis of no difference between the group means were true, the probability of obtaining an F-statistic as large as or larger than 5.23 would be only 2%. Because 0.02 is below the usual significance level of 0.05, the null hypothesis can be rejected, and there is evidence to support the alternative hypothesis that at least one group mean is different from the others. The F-test alone does not show which groups differ, so a post-hoc test is needed to locate the differences.
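The link between an F-statistic and its p-value can be sketched with scipy. The degrees of freedom are not given in the question, so the values below (3 groups, 18 observations) are hypothetical and chosen only for illustration.

```python
from scipy import stats

f_stat = 5.23

# Hypothetical degrees of freedom: 3 groups, 18 observations in total
df_between = 3 - 1
df_within = 18 - 3

# p-value: area of the F distribution to the right of the observed statistic
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value for a test at the 5% significance level
f_crit = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = 0.05: {f_crit:.2f}")
print("reject H0" if f_stat > f_crit else "fail to reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 are two views of the same decision.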

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

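The simplest approach is listwise deletion: remove the data for any participant (or animal, or other unit) with a missing measurement entirely from the data table before running the ANOVA. Alternatives are to impute the missing values (for example with the mean of that time point, or by multiple imputation) or to fit a linear mixed-effects model, which can use every available observation. The choice of method matters, because it can lead to:

- Bias in estimates, if the data are not missing completely at random
- Reduced statistical power, because deletion shrinks the sample
- Increased standard errors
- Different conclusions from the same dataset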
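The effect of two of these choices can be seen on a toy example. This is a minimal sketch with made-up cigarette counts in wide format (one row per participant); the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
wide = pd.DataFrame({
    'participant': [1, 2, 3, 4, 5],
    'before': [20, 25, 18, 30, 22],
    'month_1': [15, np.nan, 12, 24, 16],
    'month_6': [10, 14, 9, np.nan, 11],
})
time_cols = ['before', 'month_1', 'month_6']

# Option 1: listwise deletion, keep only participants with complete data
complete = wide.dropna(subset=time_cols)

# Option 2: mean imputation, fill each gap with the mean of that time point
imputed = wide.copy()
imputed[time_cols] = imputed[time_cols].fillna(imputed[time_cols].mean())

print(len(wide), "participants originally,", len(complete), "after listwise deletion")
print(wide[time_cols].std().round(2))     # spread of the observed values
print(imputed[time_cols].std().round(2))  # spread after mean imputation
```

Listwise deletion discards two of the five participants, while mean imputation keeps everyone but reduces the variability of the time points that had gaps, which can make effects look more certain than they are.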

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

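Some common post-hoc tests used after ANOVA:

- Tukey's HSD (Honestly Significant Difference): all pairwise comparisons when the group variances are roughly equal
- Bonferroni correction: a small number of planned comparisons; simple but conservative
- Scheffé's method: complex contrasts (combinations of groups), at the cost of being the most conservative
- Dunnett's test: comparing each treatment group against a single control group
- Games-Howell test: pairwise comparisons when variances or sample sizes are unequal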

Example: suppose a study compares three treatments for the same condition. The ANOVA might find a significant difference between the three treatments, but it does not specify which treatments are different from each other. In this case, a post-hoc test, such as Tukey's HSD or Scheffé's method, can be used to determine which treatments are significantly different from each other. This can help the researchers to make more informed decisions about which treatment to recommend to patients.
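As one illustration of a post-hoc procedure, pairwise t-tests with a Bonferroni correction can be sketched as follows. The treatment scores are made up, and `multipletests` from statsmodels is used for the adjustment.

```python
from itertools import combinations
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical outcome scores for three treatments
treatments = {
    'A': [12, 14, 11, 13, 15, 12],
    'B': [16, 18, 17, 15, 19, 17],
    'C': [12, 13, 14, 12, 15, 13],
}

# One independent-samples t-test per pair of treatments
pairs = list(combinations(treatments, 2))
raw_p = [stats.ttest_ind(treatments[a], treatments[b]).pvalue for a, b in pairs]

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')

for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, significant = {r}")
```

In this made-up data, treatment B stands apart from A and C, while A and C are not distinguishable from each other.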

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [22]:
import scipy.stats as stats

In [23]:
diet_a = [5,4,6,9,8,7,5,3,2,5,6,4,5,1,5,8,9,5,6,2,3,6,4,5,8,9,6,8,9,8,5,5,6,5,7,5,5,8,5,6,9,8,6,5,2,4,5,6,8,7]
diet_b = [8,5,8,9,6,5,7,8,6,5,1,2,3,5,3,6,1,2,4,5,6,9,8,5,1,6,3,5,6,3,2,8,5,4,8,9,2,6,2,4,2,6,8,2,8,8,8,8,2,8]
diet_c = [8,9,6,5,2,3,6,5,4,1,2,5,4,8,7,5,4,1,2,3,6,5,9,8,6,5,2,6,5,4,8,7,5,9,6,5,3,5,6,4,6,9,5,4,9,6,5,2,3,9]

In [24]:
# One-way ANOVA across the three diets
f_stat, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

In [25]:
f_stat, p_value

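(0.9200341871210703, 0.4007876504607588)

The F-statistic is about 0.92 and the p-value is about 0.40. Because the p-value is well above 0.05, we fail to reject the null hypothesis: there is no evidence of a significant difference in mean weight loss between the three diets.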

Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

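# Assumption: the original factor lists held 18 and 36 labels for 30 timings, so the
# same repeating patterns are truncated here to 30 rows to match the length of 'Time'
software = (['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'] * 4)[:30]
experience = (['Novice'] * 9 + ['Experienced'] * 9 + ['Novice'] * 9 + ['Experienced'] * 9)[:30]

data = pd.DataFrame({
    'Software': software,
    'Experience': experience,
    'Time': [10, 12, 11, 8, 9, 7, 15, 13, 14, 11, 13, 12, 6, 7, 5, 18, 17, 16, 9, 11, 10, 7, 8, 6, 13, 15, 14, 5, 6, 7]
})

# C() marks both predictors as categorical factors
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()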
table = sm.stats.anova_lm(model, typ=2)

print(table)


In [None]:
                            sum_sq    df          F        PR(>F)
C(Software)             494.111111   2.0  22.198582  7.399063e-08
C(Experience)             4.666667   1.0   0.420290  5.230134e-01
C(Software):C(Experience)  8.222222   2.0   0.369565  6.943090e-01
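Residual                200.000000  24.0        NaN           NaN

Read as printed, the table shows a significant main effect of Software on completion time (F = 22.20, p < 0.001), no significant main effect of Experience (F = 0.42, p = 0.52), and no significant Software x Experience interaction (F = 0.37, p = 0.69). These figures should be regenerated before being reported, however: the cell is marked In [None], meaning it was never executed, and the sums of squares listed exceed the total sum of squares of the 30 Time values (405.5), so the table cannot have been produced from this data.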


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [36]:
import pandas as pd
from scipy import stats

data = pd.DataFrame({
    'Group': ['Control'] * 50 + ['Experimental'] * 50,
    'Score': [80, 75, 85, 72, 89, 78, 85, 79, 81, 83, 85, 82, 76, 78, 80, 73, 71, 75, 80, 77,
              88, 87, 82, 79, 85, 84, 83, 82, 79, 81, 90, 85, 89, 87, 92, 83, 80, 86, 88, 79,
              81, 85, 84, 88, 92, 86, 84, 89, 87, 84, 86, 85, 88, 84, 83, 85, 87, 84, 89, 87,
              80, 75, 78, 71, 83, 72, 86, 81, 85, 88, 83, 87, 75, 79, 80, 73, 71, 75, 80, 77,
              78, 80, 75, 72, 81, 78, 79, 77, 73, 71, 77, 80, 79, 75, 81, 82, 83, 85, 87, 84]})

control_scores = data.loc[data['Group'] == 'Control', 'Score']
experimental_scores = data.loc[data['Group'] == 'Experimental', 'Score']
t_stat, p_value = stats.ttest_ind(control_scores, experimental_scores)

print(f"t-statistic: {t_stat:.2f}")
print(f"p-value: {p_value:.4f}")


t-statistic: 2.34
p-value: 0.0211


In [38]:
import statsmodels.stats.multicomp as mc

tukey_results = mc.MultiComparison(data['Score'], data['Group']).tukeyhsd()

print(tukey_results)


    Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj   lower   upper  reject
-----------------------------------------------------------
Control Experimental    -2.38 0.0211 -4.3947 -0.3653   True
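-----------------------------------------------------------

The difference between the groups is significant (t = 2.34, p = 0.021), but in the opposite direction to the hypothesis: the control group scored on average 2.38 points higher than the experimental group, so this data does not show the new teaching method improving test scores. With only two groups a post-hoc test adds nothing new: Tukey's HSD compares the same single pair and returns the same p-value (0.0211) as the t-test.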


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [30]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd


In [31]:
data = {'store': ['A', 'B', 'C'] * 30,
        'sales': [10, 15, 20, 12, 18, 24, 11, 16, 21] * 10}
df = pd.DataFrame(data)


In [32]:
model = ols('sales ~ store', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)


               sum_sq    df           F        PR(>F)
store     1706.666667   2.0  484.173913  7.101957e-48
Residual   153.333333  87.0         NaN           NaN


In [33]:
posthoc = pairwise_tukeyhsd(df['sales'], df['store'], alpha=0.05)
print(posthoc)


Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower  upper  reject
-------------------------------------------------
     A      B   5.3333   0.0  4.516 6.1507   True
     A      C  10.6667   0.0 9.8493 11.484   True
     B      C   5.3333   0.0  4.516 6.1507   True
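-------------------------------------------------

The ANOVA is highly significant (F = 484.17, p < 0.001), and Tukey's HSD rejects the null hypothesis for every pair: Store B sells about 5.33 more than Store A, and Store C about 5.33 more than Store B, so all three stores differ from each other. Note that this model treats the 90 observations as independent (a one-way ANOVA) and so does not use the fact that the same 30 days were recorded for every store.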
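Because every store is observed on the same 30 days, a repeated measures analysis needs to know which day each row belongs to. The sketch below uses `AnovaRM` from statsmodels on the same sales figures; the `day` column is an added assumption (each consecutive block of three rows is taken to be one day), since the original data has no day identifier.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Same sales figures as above; the 'day' column is an assumption:
# each consecutive block of three rows (stores A, B, C) is treated as one day
df_rm = pd.DataFrame({
    'day': np.repeat(np.arange(1, 31), 3),
    'store': ['A', 'B', 'C'] * 30,
    'sales': [10, 15, 20, 12, 18, 24, 11, 16, 21] * 10,
})

# Days play the role of "subjects"; store is the within-subject factor
rm_result = AnovaRM(data=df_rm, depvar='sales', subject='day', within=['store']).fit()
print(rm_result.anova_table)
```

Here the day-to-day variation is removed from the error term, so the test has 2 and 58 degrees of freedom instead of 2 and 87, and the F-statistic differs from the one-way result.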
