In [None]:
# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
# the validity of the results.

Assumptions of ANOVA:

1. Independence: The observations should be independent of each other,
both within and between groups.

2. Normality: The dependent variable (equivalently, the residuals) 
should be approximately normally distributed within each group.

3. Homogeneity of variance: The variance of the dependent variable 
should be approximately equal across groups.

Examples of violations of ANOVA assumptions:

1. Independence: If the observations are not independent, the 
standard errors are typically underestimated and the Type I error 
rate is inflated. For example, if a study looks at the effect of a 
new drug on blood pressure, but the patients in the study are related 
to each other, such as family members, their blood pressure readings 
are likely to be correlated rather than independent.


2. Normality: If the dependent variable is strongly skewed or has 
heavy tails within the groups, the p-values from the F-test can be 
inaccurate, especially with small samples. For example, if a study 
examines the effect of a new teaching method on test scores, but many 
students reach the maximum score (a ceiling effect), the scores within 
each group are skewed rather than normally distributed.


3. Homogeneity of variance: If the variances differ substantially 
between groups, the F-test can become too liberal or too conservative, 
particularly when the group sizes are unequal. For example, if a study 
examines the effect of a new fertilizer on crop yields, but the yields 
vary far more under one fertilizer than under the others, the ANOVA 
p-value may be misleading.


It is important to check these assumptions before conducting an
ANOVA, as violating these assumptions may lead to incorrect 
conclusions and impact the validity of the study.
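
The normality and equal-variance assumptions can be checked with 
standard tests. The sketch below is only an illustration: the three 
groups are simulated (hypothetical data, not from any study above), and 
it uses the Shapiro-Wilk and Levene tests from scipy. Independence 
cannot be tested this way; it has to be judged from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal distributions
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10, scale=2, size=30),
    "B": rng.normal(loc=11, scale=2, size=30),
    "C": rng.normal(loc=12, scale=2, size=30),
}

# Normality: Shapiro-Wilk test within each group
for name, values in groups.items():
    w_stat, p_norm = stats.shapiro(values)
    print(f"Shapiro-Wilk, group {name}: W={w_stat:.3f}, p={p_norm:.3f}")

# Homogeneity of variance: Levene's test across all groups
lev_stat, p_levene = stats.levene(*groups.values())
print(f"Levene: W={lev_stat:.3f}, p={p_levene:.3f}")
```

A small p-value in either test is a warning sign that the 
corresponding assumption may not hold.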

In [None]:
# Q2. What are the three types of ANOVA, and in what situations would each be used?



The three types of ANOVA are:

One-way ANOVA: This is used when comparing means from two or 
more groups or treatments that are independent of each other. 
One-way ANOVA tests whether there is a significant difference in
means across the groups or treatments.

Two-way ANOVA: This is used when the groups are defined by two 
categorical factors, for example software program and experience 
level. Two-way ANOVA tests the main effect of each factor on the mean 
of the dependent variable, as well as whether there is a significant 
interaction between the two factors, that is, whether the effect of 
one factor depends on the level of the other.

Repeated measures ANOVA: This is used when comparing means from two
or more groups or treatments that are related to each other, such as
when the same group of individuals is measured at different time 
points or under different conditions. Repeated measures ANOVA tests 
whether there is a significant difference in means across the groups 
or treatments while accounting for the within-subjects correlation.

Each type of ANOVA is used in different situations based on the
nature of the data and the research question being investigated.
One-way ANOVA is appropriate when independent groups are defined by a
single factor, while two-way ANOVA is appropriate when there are two 
factors and their interaction is of interest. Repeated measures ANOVA 
is appropriate when comparing means across related groups or 
treatments, such as within-subjects or longitudinal data.





In [None]:
# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the process of 
breaking down the total variance in a dataset into different 
components that can be attributed to different sources of variation. 
In ANOVA, the total variance is divided into two main components: 
variance due to differences among the sample means (also known as 
the "between-group" variance) and variance due to differences among 
the individual observations within each group (also known as the 
"within-group" variance).

Partitioning of variance is important in ANOVA because it allows us 
to determine the relative contributions of different sources of 
variation to the overall variability in the data. By comparing the 
between-group variance to the within-group variance (the F-ratio of 
their mean squares), we can determine if the group means are 
significantly different from each other, and if
so, to what degree. This information can help us to draw conclusions 
about the population means and make inferences about the effects of 
different factors on the outcome variable.

Furthermore, understanding the partitioning of variance can help 
researchers to identify weaknesses in their study design or data 
collection methods. For example, if the within-group variance is much 
larger than the between-group variance, this suggests that most of 
the variability is not explained by the factor being studied, for 
example because of measurement noise or unmeasured covariates, which 
lowers the power to detect real differences between the groups.
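
The partition can be verified by hand. The sketch below uses three 
small hypothetical groups (made-up numbers) and computes the total, 
between-group, and within-group sums of squares directly with numpy, 
then forms the F-ratio from the two mean squares.

```python
import numpy as np

# Hypothetical measurements for three groups
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([5.0, 7.0, 9.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation of every observation around the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()
# Variation of the group means around the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F-ratio: between-group mean square over within-group mean square
k, n = len(groups), len(all_values)
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))

print("SS total:  ", ss_total)
print("SS between:", ss_between)
print("SS within: ", ss_within)
print("F:         ", f_stat)
```

For these numbers the between-group and within-group sums of squares 
are 14 and 12, which add up to the total of 26 (up to floating-point 
rounding), and the F-ratio is 3.5.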

In [1]:
# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
# sum of squares (SSR) in a one-way ANOVA using Python?

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a DataFrame with the data
data = {'Group': ['A', 'A', 'B', 'B', 'C', 'C'],
        'Value': [4, 5, 7, 8, 6, 9]}
df = pd.DataFrame(data)

# Fit the one-way ANOVA model
model = ols('Value ~ Group', data=df).fit()

# Calculate the sums of squares, using the naming from the question:
# SST = total, SSE = explained (between groups), SSR = residual (within groups)
SST = sum((df['Value'] - df['Value'].mean())**2)
SSE = sum((model.fittedvalues - df['Value'].mean())**2)
SSR = sum(model.resid**2)

print("SST: ", SST)
print("SSE: ", SSE)
print("SSR: ", SSR)

SST:  17.5
SSE:  11.999999999999993
SSR:  5.5


In [11]:
# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# set random seed for reproducibility
np.random.seed(1)

# create data
data = {
    'Factor1': np.random.choice(['A', 'B', 'C'], size=100),
    'Factor2': np.random.choice(['X', 'Y'], size=100),
    'Response': np.random.normal(0, 1, size=100)
}

# save data to CSV file
df = pd.DataFrame(data)
df.to_csv('data.csv', index=False)




# load data from CSV file
data = pd.read_csv('data.csv')

# create ANOVA model
model = ols('Response ~ C(Factor1) + C(Factor2) + C(Factor1):C(Factor2)', data=data).fit()

# calculate ANOVA table
table = sm.stats.anova_lm(model, typ=2)

# print the sums of squares for the main effects and the interaction
# (print(table) would also show the F-statistics and p-values)
print(table['sum_sq'])

C(Factor1)                0.217807
C(Factor2)                0.069055
C(Factor1):C(Factor2)     1.109565
Residual                 79.842553
Name: sum_sq, dtype: float64


In [None]:
# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
# What can you conclude about the differences between the groups, and how would you interpret these
# results?

With an F-statistic of 5.23 and a p-value of 0.02, we can conclude
that at least one group mean differs from the others. The null 
hypothesis, which states that all group means are equal, can be 
rejected at a significance level of 0.05 because 0.02 < 0.05. It 
would not be rejected at the stricter 0.01 level.

To interpret the results, we can say that differences this large 
between the group means would be unlikely if the population means 
were all equal. However, the F-test does not tell us which specific 
group(s) differ from the others; that requires additional post-hoc 
tests or examining the confidence intervals for the group means.
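
The link between an F-statistic and its p-value can be sketched with 
scipy. The question does not give the degrees of freedom, so the 
design below (3 groups, 30 observations) is purely an assumption for 
illustration, and the resulting p-value will therefore not match 0.02 
exactly.

```python
from scipy import stats

f_value = 5.23

# Assumed design (not given in the question): 3 groups, 30 observations
df_between = 3 - 1
df_within = 30 - 3

# p-value: upper-tail area of the F distribution beyond the observed statistic
p_value = stats.f.sf(f_value, df_between, df_within)

# Critical value at alpha = 0.05; H0 is rejected when f_value exceeds it
f_crit = stats.f.ppf(0.95, df_between, df_within)

# Eta squared (share of variance explained), recovered from F and the df
eta_sq = (f_value * df_between) / (f_value * df_between + df_within)

print("p-value:", p_value)
print("critical F at 0.05:", f_crit)
print("eta squared:", eta_sq)
```

Reporting an effect size such as eta squared alongside the p-value 
shows how large the difference is, not only whether it is significant.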

In [None]:
# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
# consequences of using different methods to handle missing data?

In a repeated measures ANOVA, missing data can occur if some 
participants did not provide responses for some of the measurements or
if some measurements were lost due to equipment failure or other 
reasons.

One way to handle missing data in repeated measures ANOVA is to 
exclude any participants who have missing data, which is known as 
listwise deletion. This method is straightforward, but it can lead
to a loss of statistical power and potentially biased results if the 
missing data are related to the outcome or other variables.

Another way to handle missing data is to impute the missing values 
using statistical methods. There are several methods for imputing
missing data, such as mean imputation, regression imputation, and 
multiple imputation. Mean imputation involves replacing the missing
values with the mean value of the observed data, while regression 
imputation involves predicting the missing values based on other
variables. Multiple imputation involves creating several plausible
values for each missing value and analyzing each imputed dataset 
separately, then combining the results.

The choice of imputation method can affect the results of the
analysis. Mean imputation underestimates the variability of the
data and weakens the correlations between measurements. Single 
regression imputation also understates variability, because the 
imputed values fall exactly on the regression line, and it is biased 
if the regression model is wrong. Multiple imputation is generally 
preferred because it accounts for the uncertainty associated with the 
missing data and produces more accurate estimates of the standard 
errors and p-values.

It is important to note that the validity of each method depends on 
why the data are missing. Listwise deletion gives unbiased results 
only if the data are missing completely at random (MCAR), while 
multiple imputation remains valid under the weaker assumption of 
missing at random (MAR). If the missing data are neither MCAR nor 
MAR, the results of the analysis may be biased regardless of the 
method. Therefore, it is important to carefully examine the missing 
data patterns and assess the validity of the assumptions before 
selecting an appropriate method to handle missing data.
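
The effect of the two simplest methods can be seen on a small example. 
The sketch below uses made-up wide-format data (one row per subject, 
one column per time point; the column names t1 to t3 are hypothetical) 
and compares listwise deletion with mean imputation using pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame({
    "t1": [5.0, 6.0, 7.0, 8.0, 9.0],
    "t2": [6.0, np.nan, 8.0, 9.0, 10.0],
    "t3": [7.0, 8.0, np.nan, 10.0, 11.0],
})

# Listwise deletion: drop every subject with at least one missing value
complete_cases = wide.dropna()

# Mean imputation: fill each gap with the column mean of the observed values
mean_imputed = wide.fillna(wide.mean())

print("Subjects kept after listwise deletion:", len(complete_cases))
print("Variance of t2, observed values only:", wide["t2"].var())
print("Variance of t2 after mean imputation:", mean_imputed["t2"].var())
```

Listwise deletion keeps only three of the five subjects, and mean 
imputation shrinks the variance of t2, which illustrates the loss of 
power and the underestimated variability described above.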

In [None]:
# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
# an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are used after ANOVA to determine which specific group
means are significantly different from each other when the overall 
F-test indicates a significant difference between groups. Some common 
post-hoc tests include:

1. Tukey's HSD (Honestly Significant Difference) test: This test is
used when there are more than two groups and all pairwise comparisons 
are of interest. It assumes equal variances; for unequal group sizes 
the Tukey-Kramer version is used. It controls the family-wise error 
rate.

2. Bonferroni correction: This method controls the family-wise error 
rate by adjusting the significance level for each comparison. The 
adjusted significance level is calculated by dividing the overall 
significance level (usually 0.05) by the number of comparisons; 
equivalently, each p-value is multiplied by the number of comparisons. 
It is simple and general, but conservative when there are many 
comparisons.

3. Scheffé's method: This test is more conservative than Tukey's test 
for pairwise comparisons and can be used with unequal group sizes. It 
controls the family-wise error rate for all possible contrasts among 
the group means, not only pairwise comparisons, so it is most useful 
when complex contrasts are tested.

4. Dunn's test: This is a nonparametric pairwise test based on ranks, 
typically used after a significant Kruskal-Wallis test when the data 
violate the assumption of normality.

A situation where a post-hoc test might be necessary is when a 
one-way ANOVA indicates a significant difference between groups, but 
it is not clear which specific groups are different from each other.
For example, suppose a researcher is studying the effect of three 
different diets on weight loss and finds a significant difference 
between the groups. A post-hoc test can be used to determine which 
specific diets are different from each other in terms of weight loss.
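
The diet example can be sketched as follows. The weight-loss values 
are simulated (hypothetical data), and the sketch shows two of the 
methods above: Tukey's HSD via statsmodels and a Bonferroni adjustment 
of pairwise t-test p-values.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)

# Hypothetical weight loss for three diets, 15 participants each
loss = {
    "A": rng.normal(2.0, 0.5, 15),
    "B": rng.normal(3.0, 0.5, 15),
    "C": rng.normal(2.1, 0.5, 15),
}
values = np.concatenate(list(loss.values()))
labels = np.repeat(list(loss.keys()), 15)

# Tukey's HSD: all pairwise comparisons with family-wise error control
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())

# Bonferroni: pairwise t-tests, then adjust the p-values for 3 comparisons
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
raw_p = [stats.ttest_ind(loss[a], loss[b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for (a, b), p_raw, p_adj in zip(pairs, raw_p, adj_p):
    print(f"{a} vs {b}: raw p={p_raw:.4f}, Bonferroni p={p_adj:.4f}")
```

The Bonferroni-adjusted p-values are never smaller than the raw ones, 
which reflects the price paid for controlling the family-wise error 
rate.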

In [12]:
# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
# 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
# to determine if there are any significant differences between the mean weight loss of the three diets.
# Report the F-statistic and p-value, and interpret the results.

import pandas as pd
import scipy.stats as stats

# create sample data
data = pd.DataFrame({
    'diet': ['A']*20 + ['B']*15 + ['C']*15,
    'weight_loss': [3.2, 1.8, 2.1, 2.5, 3.9, 2.7, 1.9, 1.5, 2.0, 1.2,
                    2.6, 2.4, 1.9, 2.2, 2.8, 2.1, 1.5, 1.8, 1.6, 2.0,
                    2.8, 2.6, 3.0, 3.1, 3.5, 2.9, 3.2, 2.7, 3.3, 3.1,
                    2.0, 2.4, 2.2, 1.5, 1.8, 1.6, 2.0, 2.1, 1.9, 2.2,
                    1.2, 2.6, 2.8, 2.6, 3.0, 3.1, 3.5, 2.9, 3.2, 2.7]
})

# conduct one-way ANOVA
f_stat, p_value = stats.f_oneway(data[data['diet']=='A']['weight_loss'],
                                 data[data['diet']=='B']['weight_loss'],
                                 data[data['diet']=='C']['weight_loss'])

# print results
print('F-statistic:', f_stat)
print('p-value:', p_value)
if p_value < 0.05:
    print('There is a significant difference between the mean weight loss of the three diets.')
else:
    print('There is not a significant difference between the mean weight loss of the three diets.')
    
    
The F-statistic is approximately 2.76 and the p-value is approximately 
0.074, which is above the 0.05 significance level. Therefore, we fail 
to reject the null hypothesis that the mean weight loss of the three 
diets is equal: the data do not provide sufficient evidence of a 
difference between the diets. Because the overall F-test is not 
significant, pairwise post-hoc comparisons are not warranted here; a 
p-value close to 0.05 is at most a reason to collect more data.

F-statistic: 2.7595068593044996
p-value: 0.07359705945046743
There is not a significant difference between the mean weight loss of the three diets.


In [13]:
# Q10. A company wants to know if there are any significant differences in the average time it takes to
# complete a task using three different software programs: Program A, Program B, and Program C. They
# randomly assign 30 employees to one of the programs and record the time it takes each employee to
# complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
# interaction effects between the software programs and employee experience level (novice vs.
# experienced). Report the F-statistics and p-values, and interpret the results.


import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols


# create data
np.random.seed(123)
n = 30
programs = ['A', 'B', 'C']
exp_levels = ['Novice', 'Experienced']
data = pd.DataFrame({
    'Program': np.random.choice(programs, size=n),
    'ExpLevel': np.random.choice(exp_levels, size=n),
    'Time': np.random.normal(loc=10, scale=2, size=n)
})

# fit ANOVA model
model = ols('Time ~ C(Program) + C(ExpLevel) + C(Program):C(ExpLevel)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# print ANOVA table
print(anova_table)

We can interpret these results as follows:

There is no significant main effect of software program on the mean 
time to complete the task (F = 0.81, p = 0.458).

There is no significant main effect of experience level on the mean 
time to complete the task (F = 0.16, p = 0.691).

There is no evidence of a significant interaction between software 
program and experience level (F = 0.77, p = 0.474). At the 0.05 
level, none of the null hypotheses can be rejected, which is expected 
here because the simulated times were drawn from the same normal 
distribution regardless of program or experience level.

                            sum_sq    df         F    PR(>F)
C(Program)                8.506138   2.0  0.805930  0.458397
C(ExpLevel)               0.855073   1.0  0.162031  0.690856
C(Program):C(ExpLevel)    8.123378   2.0  0.769665  0.474265
Residual                126.653259  24.0       NaN       NaN


In [20]:
# Q11. An educational researcher is interested in whether a new teaching method improves student test
# scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
# experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
# two-sample t-test using Python to determine if there are any significant differences in test scores
# between the two groups. If the results are significant, follow up with a post-hoc test to determine which
# group(s) differ significantly from each other.

import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.stats.multicomp as mc

# Set random seed for reproducibility
np.random.seed(123)

# Generate data for control group (traditional teaching method)
control_scores = np.random.normal(loc=65, scale=10, size=50)

# Generate data for experimental group (new teaching method)
experimental_scores = np.random.normal(loc=70, scale=10, size=50)

# Combine data into a pandas dataframe
data = pd.DataFrame({
    'teaching_method': ['control'] * 50 + ['experimental'] * 50,
    'test_score': np.concatenate((control_scores, experimental_scores))
})

# Print first 5 rows of the data
# print(data.head())
control_group = data[data["teaching_method"] == "control"]
experimental_group = data[data["teaching_method"] == "experimental"]

# conduct a two-sample t-test
t_stat, p_value = stats.ttest_ind(control_group["test_score"], experimental_group["test_score"])

# print the results
print("Two-sample t-test results:")
print("t-statistic:", t_stat)
print("p-value:", p_value)

# conduct a post-hoc test; with only two groups, Tukey's HSD simply
# reproduces the t-test (same p-value), so it adds no new information
posthoc_res = mc.MultiComparison(data["test_score"], data["teaching_method"]).tukeyhsd()
print("\nPost-hoc test results:")
print(posthoc_res)

Two-sample t-test results:
t-statistic: -2.315158728279605
p-value: 0.022690065589586535

Post-hoc test results:
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
control experimental   5.2768 0.0227 0.7537 9.7998   True
---------------------------------------------------------


In [22]:
# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
# retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
# on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
# significant differences in sales between the three stores. If the results are significant, follow up with a
# post-hoc test to determine which store(s) differ significantly from each other.


import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create the dataframe
data = {'Store': ['A']*30 + ['B']*30 + ['C']*30,
        'Day': list(range(1, 31))*3,
        'Sales': [10, 12, 15, 8, 11, 14, 7, 10, 12, 9, 11, 13, 10, 13, 16,
                  20, 23, 19, 16, 21, 18, 22, 20, 19, 18, 21, 23, 17, 19, 16,
                  5, 7, 9, 4, 6, 10, 3, 5, 7, 4, 5, 9, 6, 8, 11,
                  12, 14, 16, 11, 14, 15, 10, 13, 15, 13, 14, 17, 9, 11, 13,
                  7, 9, 11, 6, 8, 10, 5, 7, 9, 6, 8, 10, 4, 6, 8,
                  7, 4, 5, 9, 6, 8, 11, 12, 14, 16, 11, 14, 15, 10, 13]}

df = pd.DataFrame(data)

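# Fit an OLS model with Store, Day, and their interaction.
# Note: Day enters as a numeric covariate here, so this is not a true
# repeated measures ANOVA; statsmodels.stats.anova.AnovaRM with Day as the
# subject and Store as the within factor would model the repeated structure.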
rm = ols('Sales ~ Store + Day + Store:Day', data=df).fit()
print(rm.summary())

# Perform post-hoc tests
from statsmodels.stats.multicomp import MultiComparison

mc = MultiComparison(df['Sales'], df['Store'])
posthoc = mc.tukeyhsd()
print(posthoc.summary())

                            OLS Regression Results                            
Dep. Variable:                  Sales   R-squared:                       0.670
Model:                            OLS   Adj. R-squared:                  0.650
Method:                 Least Squares   F-statistic:                     34.12
Date:                Sun, 16 Apr 2023   Prob (F-statistic):           7.09e-19
Time:                        08:45:05   Log-Likelihood:                -220.90
No. Observations:                  90   AIC:                             453.8
Df Residuals:                      84   BIC:                             468.8
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          9.1678      1.092      8.