In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

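Normality: The dependent variable should be approximately normally distributed within each group (equivalently, the residuals should be approximately normal). In practice, each group's data should follow a roughly symmetric, bell-shaped curve, so the mean, median, and mode are close together.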

Homogeneity of variance: The variance of the data in each group should be roughly equal. This means that the spread of the data should be the same for all groups.

Independence: Observations in each group should be independent of each other.
* Violations:

Non-normality: If the data is not normally distributed, ANOVA may give incorrect results. For example, if one group has a skewed distribution, it may be necessary to transform the data before conducting the analysis.

Heterogeneity of variance: If the variance is not equal across groups, ANOVA may give incorrect results. For example, if one group has much larger variance than the others, it can lead to inflated Type I error rates.

Dependence: If observations in one group are related to observations in another group, ANOVA may give incorrect results. For example, if some observations are repeated measures or if the observations are clustered in some way, it may be necessary to use a different statistical method that accounts for dependence.
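The normality and equal-variance assumptions above can be screened before running an ANOVA. The following is a minimal sketch, assuming hypothetical simulated groups (`group_a`, `group_b`, `group_c` are illustrative names, not data from this notebook). It uses the Shapiro-Wilk test for normality and Levene's test for homogeneity of variance, both from `scipy.stats`.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three simulated groups, fixed seed for reproducibility.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=52, scale=5, size=30)
group_c = rng.exponential(scale=5, size=30)  # deliberately skewed group

# Shapiro-Wilk tests the null hypothesis that a sample is normally distributed.
for name, sample in [("A", group_a), ("B", group_b), ("C", group_c)]:
    w_stat, p_normal = stats.shapiro(sample)
    print(f"Group {name}: Shapiro-Wilk W={w_stat:.3f}, p={p_normal:.4f}")

# Levene's test checks the null hypothesis of equal variances across groups.
levene_stat, p_levene = stats.levene(group_a, group_b, group_c)
print(f"Levene: W={levene_stat:.3f}, p={p_levene:.4f}")
```

A small p-value from either test suggests the corresponding assumption may be violated. Independence cannot be tested this way; it has to be judged from the study design.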

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?

One-way ANOVA: It is used when comparing means of a single dependent variable across two or more independent groups or categories. For example, a study comparing the test scores of students in different schools.

Two-way ANOVA: It is used when comparing means of a single dependent variable across the levels of two independent variables (factors). It tests the main effect of each factor and whether the two factors interact. For example, a study comparing the test scores of students in different schools (independent variable 1) and genders (independent variable 2).

Repeated measures ANOVA: It is used when comparing means of a single dependent variable across two or more conditions or time points within the same group of participants. For example, a study comparing the reaction time of participants before and after a caffeine dose.

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance is a fundamental concept in ANOVA (analysis of variance). It refers to the decomposition of the total variance of the dependent variable into different components that can be attributed to specific sources of variation. This partitioning is important because it allows us to determine the proportion of variance that can be explained by the independent variables and the proportion that remains unexplained.
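As a small illustration of this decomposition, the sketch below uses three hypothetical groups (illustrative values only). It computes the total, between-group, and within-group sums of squares with NumPy, then checks that the resulting F-statistic agrees with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three groups (illustrative values only).
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([1.0, 2.0, 3.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: every observation against the grand mean.
ss_total = np.sum((all_values - grand_mean) ** 2)
# Between-group variation: group means against the grand mean, weighted by group size.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: observations against their own group mean.
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F is the ratio of the between-group and within-group mean squares.
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)
f_scipy, p_scipy = stats.f_oneway(*groups)

print(ss_total, ss_between, ss_within)
print(f_manual, f_scipy)
```

For these values the total sum of squares is 60, split into 54 between groups and 6 within groups, which gives F = 27.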

In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [4]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({'hours': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
                             3, 4, 4, 4, 5, 5, 6, 7, 7, 8],
                   'score': [68, 76, 74, 80, 76, 78, 81, 84, 86, 83,
                             88, 85, 89, 94, 93, 94, 96, 89, 92, 97]})
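y = df['score']

# Fit score ~ hours by ordinary least squares (with an intercept).
x = df[['hours']]
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()

# Residual (error) sum of squares: observed values vs. fitted values.
sse = np.sum((model.fittedvalues - df.score)**2)
print(sse)
# Explained (regression) sum of squares: fitted values vs. the overall mean.
ssr = np.sum((model.fittedvalues - df.score.mean())**2)
print(ssr)
# Total sum of squares is the sum of the two components.
sst = ssr + sse
print(sst)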


331.0748847926268
917.4751152073737
1248.5500000000006


In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [8]:
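import statsmodels.api as sm
from statsmodels.formula.api import ols
# Note: both factor columns are built by the same np.repeat call, so
# Fertilizer and Watering are perfectly confounded in this toy data set.
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
                          'Watering': np.repeat(['daily', 'weekly'], 15),
                          'height': [14, 16, 15, 15, 16, 13, 12, 11,
                                     14, 15, 16, 16, 17, 18, 14, 13,
                                     14, 14, 14, 15, 16, 16, 17, 18,
                                     14, 13, 14, 14, 14, 15]})

# Main effects of each factor plus the Fertilizer x Watering interaction.
model = ols('height ~ C(Fertilizer) + C(Watering) + C(Fertilizer):C(Watering)',
            data=dataframe).fit()
# anova_lm takes the keyword `typ` (not `type`). typ=1 is the default and
# matches the table printed below; pass typ=2 for Type II sums of squares.
result = sm.stats.anova_lm(model, typ=1)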

print(result)


                             df     sum_sq   mean_sq         F    PR(>F)
C(Fertilizer)               1.0   0.033333  0.033333  0.012069  0.913305
C(Watering)                 1.0   0.092308  0.092308  0.033422  0.856260
C(Fertilizer):C(Watering)   1.0   0.057692  0.057692  0.020889  0.886118
Residual                   28.0  77.333333  2.761905       NaN       NaN


In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

Based on the F-statistic of 5.23 and a p-value of 0.02, we can conclude that there is a statistically significant difference between the group means at the conventional 0.05 significance level. The null hypothesis in the ANOVA is that all group means are equal, and a p-value below 0.05 indicates that we have enough evidence to reject this null hypothesis in favor of the alternative hypothesis, which states that at least one group mean is different from the others. The ANOVA alone does not tell us which groups differ; a post-hoc test is needed for that.
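The link between an F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are purely an assumption for illustration. The resulting p-value will therefore not necessarily equal the 0.02 quoted above.

```python
from scipy import stats

f_statistic = 5.23
# Assumed degrees of freedom (not given in the question): 3 groups, 30 observations.
df_between = 2
df_within = 27

# The p-value is the upper-tail probability of the F distribution.
p_value = stats.f.sf(f_statistic, df_between, df_within)
# Critical value of F for a 5% significance level.
f_critical = stats.f.ppf(0.95, df_between, df_within)

alpha = 0.05
print(f"p-value: {p_value:.4f}, critical F: {f_critical:.2f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Under these assumed degrees of freedom the observed F exceeds the critical value, so the decision is the same as in the answer above: reject the null hypothesis of equal means.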

In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

There are different methods to handle missing data, such as listwise deletion, pairwise deletion, mean imputation, and maximum likelihood estimation.

Listwise deletion involves removing all cases with missing data, which reduces the sample size and statistical power and can bias the results if the data are not missing completely at random.

Pairwise deletion involves using all available data for each comparison, which can result in different sample sizes for different comparisons, and it can lead to biased estimates and reduced power.

Mean imputation involves replacing missing values with the mean value of the observed data, which can lead to biased estimates and an underestimation of standard errors.

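Maximum likelihood estimation does not fill in the missing values. Instead, it estimates the model parameters directly from all observed data (for example, by fitting a linear mixed-effects model), and it gives unbiased estimates when the data are missing at random.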
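The effect of two of these methods can be seen on a small example. The sketch below assumes a hypothetical wide-format table (one row per participant, columns `t1` to `t3` as time points; all values are illustrative) and compares listwise deletion with mean imputation using pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point.
wide = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 13.0, 9.0],
    "t2": [11.0, np.nan, 12.0, 15.0, 10.0],
    "t3": [13.0, 14.0, np.nan, 16.0, 12.0],
})

# Listwise deletion: drop every participant with at least one missing value.
listwise = wide.dropna()

# Mean imputation: fill each gap with the column mean of the observed values.
imputed = wide.fillna(wide.mean())

print("Participants kept after listwise deletion:", len(listwise))
print("Variance of t2, observed only:", wide["t2"].var())
print("Variance of t2, mean-imputed: ", imputed["t2"].var())
```

Listwise deletion keeps only three of the five participants. Mean imputation keeps all five but shrinks the variance of `t2`, which is why it tends to underestimate standard errors.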

In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Tukey's HSD: This test is commonly used to compare all pairs of group means. It was derived for equal sample sizes; the Tukey-Kramer variant extends it to unequal sizes. Tukey's HSD tests whether the difference between any two means is significantly different from zero while controlling the family-wise Type I error rate. This test is useful for identifying which specific groups have significantly different means. For example, after performing an ANOVA analysis of test scores among four different schools, we may use Tukey's HSD test to compare the mean scores of each school to identify which schools have significantly different scores.

Bonferroni correction: This adjustment is commonly used when multiple comparisons are made. It divides the significance level by the number of comparisons (equivalently, it multiplies each p-value by that number) to control the overall Type I error rate. It is simple and minimizes the chance of falsely identifying a significant difference between groups, but it becomes conservative when there are many comparisons. For example, after performing an ANOVA analysis of the effect of three different treatments on patient outcomes, we may use Bonferroni correction to adjust the significance level of each of the three pairwise comparisons to 0.05 / 3.

Scheffé's method: This test is commonly used when the comparisons of interest go beyond simple pairs, such as comparing one group against the average of several others. It controls the family-wise error rate across all possible contrasts and can be applied when the sample sizes are unequal among groups. Because it protects against so many comparisons, it is more conservative than Tukey's HSD for purely pairwise comparisons. For example, after performing an ANOVA analysis of the effect of three different diets on weight loss, we may use Scheffé's method to compare the mean weight loss of each diet group while taking into account the unequal sample sizes.

Dunnett's test: This test is commonly used when the goal is to compare each group to a control group. Dunnett's test compares the mean of each group to the mean of the control group and tests whether each treatment-versus-control difference is significantly different from zero. Because it makes fewer comparisons than an all-pairs test, it has more power for this purpose. For example, after performing an ANOVA analysis of the effect of four different fertilizers on plant growth, we may use Dunnett's test to compare the mean plant growth of each fertilizer group to the mean plant growth of the control group.
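As a concrete illustration of a post-hoc comparison, the following sketch runs Tukey's HSD with `pairwise_tukeyhsd` from statsmodels. The scores for three schools are hypothetical (illustrative values only): schools A and C have the same mean, while school B scores clearly higher.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical test scores for three schools (illustrative values only).
scores = np.array([70, 72, 71, 69, 73,
                   80, 82, 81, 79, 83,
                   71, 70, 72, 73, 69], dtype=float)
schools = np.repeat(["A", "B", "C"], 5)

# Tukey's HSD compares every pair of groups while controlling the
# family-wise error rate at alpha.
tukey = pairwise_tukeyhsd(endog=scores, groups=schools, alpha=0.05)
print(tukey)

# tukey.reject holds one boolean per pair, ordered A-B, A-C, B-C.
print(tukey.reject)
```

Here the A-B and B-C comparisons are flagged as significant and A-C is not. A significant overall F-test cannot provide this pair-level detail.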

In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [9]:
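import pandas as pd
import scipy.stats as stats

# Six illustrative observations (two per diet); the output below was computed from these values.
data = pd.DataFrame({
    'Diet': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Weight Loss': [2.5, 1.8, 3.2, 2.1, 1.9, 3.3]
})

# One-way ANOVA: pass the weight-loss values of each diet group to f_oneway.
f_statistic, p_value = stats.f_oneway(data[data['Diet'] == 'A']['Weight Loss'],
                                      data[data['Diet'] == 'B']['Weight Loss'],
                                      data[data['Diet'] == 'C']['Weight Loss'])

print('F-statistic:', f_statistic)
print('p-value:', p_value)
# A p-value below 0.05 rejects the null hypothesis that all three diets
# have the same mean weight loss.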


F-statistic: 34.055555555555436
p-value: 0.008665142023030993


In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

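# The `...` entries are placeholders for the remaining observations;
# replace them with the full data set before running.
data = pd.DataFrame({
    'Program': ['A', 'A', 'B', 'B', 'C', 'C', ...],
    'Experience': ['Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced', ...],
    'Completion_Time': [24.5, 28.2, 26.8, 29.1, 23.9, 26.5, ...]
})
# The column is named Completion_Time (no space) so the formula can reference it directly.
# The model includes both main effects and the Program x Experience interaction.
model = ols('Completion_Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=data).fit()
# typ=2 requests Type II sums of squares.
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)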


In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [None]:
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# The `...` entries are placeholders for the remaining observations.
data = pd.DataFrame({
    'Group': ['Control', 'Control', 'Experimental', 'Experimental', ...],
    'Score': [76, 82, 89, 93, ...]
})
control_scores = data.loc[data['Group'] == 'Control', 'Score']
experimental_scores = data.loc[data['Group'] == 'Experimental', 'Score']
# Welch's t-test (equal_var=False) does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(control_scores, experimental_scores, equal_var=False)

print("t-statistic: {:.2f}, p-value: {:.4f}".format(t_stat, p_value))
# With only two groups the t-test already identifies which groups differ;
# Tukey's HSD is included because the question asks for a post-hoc follow-up.
tukey_results = pairwise_tukeyhsd(data['Score'], data['Group'])
print(tukey_results)


In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Long format: one row per (Day, Store) pair. The `...` entries are placeholders.
data = pd.DataFrame({
    'Store': ['A', 'A', ..., 'C', 'C', ...],
    'Sales': [1000, 1200, ..., 800, 950, ...],
    'Day': [1, 2, ..., 29, 30, ...]
})

# Repeated measures ANOVA: each day is the repeated "subject" and Store is the
# within-subject factor. AnovaRM requires balanced data (every store on every day).
rm_results = AnovaRM(data, depvar='Sales', subject='Day', within=['Store']).fit()
print(rm_results)

# Tukey's HSD treats the stores as independent groups, so it ignores the pairing by day;
# paired t-tests with a Bonferroni correction are a common alternative.
tukey_results = pairwise_tukeyhsd(data['Sales'], data['Store'], alpha=0.05)
print(tukey_results)
