In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.
Ans:
Analysis of Variance (ANOVA) is a statistical technique used to determine whether there are significant differences among the means of two or more groups.
For its results to be valid, ANOVA requires several assumptions to be met:

Independence: The observations should be independent of each other, both within and between groups.
Normality: The data within each group (equivalently, the model residuals) should be approximately normally distributed.
Homogeneity of variance: The population variances of the groups should be equal.
Random Sampling: The data should be collected using random sampling (or, in an experiment, random assignment to groups).
Violating these assumptions can undermine the validity of the ANOVA results.
Some common violations and their consequences are:

Violation of independence: When observations are not independent, the standard error is typically underestimated, which inflates the F-statistic and raises the probability of a false positive (Type I error).
For example, if the same patients are measured before and after treatment, the two measurements from each patient are correlated, so they cannot be treated as independent groups.

Violation of normality: ANOVA is fairly robust to moderate departures from normality when samples are large and balanced, but strong skew or outliers in small samples can distort the results.
Non-normal data can lead to incorrect estimates of variability, leading to an increased probability of detecting a significant difference when there is none, or of missing a difference that exists.
For example, if a study measures the effect of a drug on pain and the pain scores are heavily skewed towards low values, the p-value from the ANOVA may not be trustworthy.

Violation of homogeneity of variance: ANOVA pools the within-group variances into a single error term, so when the group variances differ, the F-ratio can be over- or underestimated.
The problem is most serious when the group sizes are also unequal.
For example, if a study compares the effect of different fertilizers on plant growth and one fertilizer produces far more variable growth than the others, the F-test may be misleading.

Violation of random sampling: When the data are not collected using random sampling or random assignment, the groups may differ systematically in ways unrelated to the factor being studied, which biases the results.
For example, if a study compares the effects of different diets on weight loss and the participants choose their own diet, differences in motivation between the groups can be mistaken for an effect of the diet.
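
Two of the assumptions above, normality and homogeneity of variance, can be screened with standard tests before running an ANOVA. The following is a minimal sketch using scipy.stats.shapiro and scipy.stats.levene. The three groups are simulated purely for illustration (the group names, sizes, and seed are assumptions, not data from this notebook).

```python
import numpy as np
from scipy import stats

# Simulated data, assumed for illustration only
rng = np.random.default_rng(42)
groups = {
    "narrow": rng.normal(loc=50, scale=1, size=50),   # normal, small spread
    "wide": rng.normal(loc=50, scale=10, size=50),    # normal, large spread
    "skewed": rng.exponential(scale=5, size=200),     # strongly right-skewed
}

# Shapiro-Wilk test per group: the null hypothesis is that the data are normal
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk {name}: p = {p:.4g}")

# Levene's test: the null hypothesis is that all groups share the same variance
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene: W = {levene_stat:.2f}, p = {levene_p:.4g}")
```

A small p-value from either test suggests the corresponding assumption is doubtful. Independence and random sampling cannot be tested this way; they have to be judged from the study design.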

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?
Ans:
One-Way ANOVA: This type of ANOVA is used when there is one categorical independent variable (factor) with two or more levels, most commonly three or more, and one continuous dependent variable.
It tests whether the means of the groups defined by that factor differ significantly.
For example, one-way ANOVA can be used to compare the mean heights of people from three different countries.

Two-Way ANOVA: This type of ANOVA is used when there are two independent variables, each with two or more levels.
Two-way ANOVA is used to determine if there is an interaction between the independent variables and if there are any main effects. 
The interaction between the independent variables indicates that the effect of one variable depends on the level of the other variable.
It is commonly used in experimental designs with two categorical independent variables and one continuous dependent variable. 
For example, two-way ANOVA can be used to compare the mean weights of people grouped by gender and by age group.

Repeated Measures ANOVA: This type of ANOVA is used when the same participants are measured under different conditions or at different times.
Repeated measures ANOVA is used to determine if there is a significant difference in the means of the conditions or time points. 
It is commonly used in experimental designs where each participant is exposed to multiple treatments, and the same dependent variable is measured repeatedly.
For example, repeated measures ANOVA can be used to compare the mean scores of participants in a memory task under different conditions, such as before and after taking a drug.

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
Ans:
The partitioning of variance in Analysis of Variance (ANOVA) refers to the division of the total variation observed in a dataset into components that can be attributed to specific sources of variation. 
Specifically, ANOVA aims to decompose the observed variation in a dependent variable into two or more sources of variation, which may include factors or covariates, and residual error.

The importance of understanding the partitioning of variance lies in its ability to provide insights into the relative contribution of different sources of variation to the dependent variable. 
This information can be used to determine which factors or covariates have a significant effect on the outcome variable, and to identify potential interactions between them.
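Furthermore, the partition underlies the F-test itself: dividing each sum of squares by its degrees of freedom gives a mean square, and the ratio of the between-group mean square to the within-group mean square is the F-statistic.
This statistic is used to test whether the observed differences between groups or conditions are statistically significant, which helps researchers draw more accurate conclusions about the underlying population from which the data were sampled.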
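
The partition can be checked numerically. The sketch below uses a small made-up dataset (three groups of three values, chosen only so the arithmetic is easy to follow) and shows that the total sum of squares equals the between-group part plus the within-group part.

```python
import numpy as np

# Hypothetical toy data: three groups of three observations each
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: every observation against the grand mean
ss_total = np.sum((all_values - grand_mean) ** 2)
# Between-group variation: each group mean against the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group (residual) variation: each observation against its own group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F-statistic: ratio of the between-group and within-group mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print("SS total:", ss_total)
print("SS between + SS within:", ss_between + ss_within)
print("F:", f_stat)
```

For this toy data the total of 32 splits into 26 between groups and 6 within groups, giving F = (26 / 2) / (6 / 6) = 13, up to floating-point rounding.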

In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?
Ans:
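To calculate the total sum of squares (SST), explained sum of squares (SSE),
and residual sum of squares (SSR) in a one-way ANOVA using Python, you can fit the model with the statsmodels library and compute each sum of squares with NumPy.
Following the naming in the question, SSE is the explained (between-group) sum of squares and SSR is the residual (within-group) sum of squares, so SST = SSE + SSR.

In [2]:
import statsmodels.api as sm
import numpy as np
from statsmodels.formula.api import ols
# Load the PlantGrowth dataset (fetched from the Rdatasets repository)
data = sm.datasets.get_rdataset('PlantGrowth').data
# Fit the one-way ANOVA as a linear model: weight explained by group
model = ols('weight ~ group', data=data).fit()
# Total: every observation against the grand mean
sst = np.sum((data['weight'] - np.mean(data['weight'])) ** 2)
# Explained: fitted group means against the grand mean
sse = np.sum((model.fittedvalues - np.mean(data['weight'])) ** 2)
# Residual: every observation against its fitted group mean
ssr = np.sum(model.resid ** 2)

print('SST:', sst)
print('SSE:', sse)
print('SSR:', ssr)

SST: 14.258429999999999
SSE: 3.7663399999999947
SSR: 10.492090000000001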


In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?
Ans:

In [None]:
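For a 2x2 design, the main effects and the interaction effect can be estimated from the group means using the following formulae:
Main effect of factor A = (mean of group A1 - mean of group A2)
Main effect of factor B = (mean of group B1 - mean of group B2)

Interaction effect = (mean of group A1B1 - mean of group A1B2 - mean of group A2B1 + mean of group A2B2)
where A and B are the two factors, 1 and 2 are the two levels of each factor, and A1B1 denotes the cell with level 1 of both factors.
In Python, the F-tests for these effects come from fitting a linear model with statsmodels ols and passing it to sm.stats.anova_lm, as done in Q10.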

In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?
Ans:
If you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, 
you can conclude that there is at least one significant difference between the means of the groups being compared.

The F-statistic indicates the ratio of between-group variability to within-group variability,
and a larger value indicates a larger difference between the means of the groups relative to the variability within each group. 
The p-value represents the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the null hypothesis (no differences between the group means) is true.
In this case, the p-value is less than the typical significance level of 0.05, which suggests that the null hypothesis can be rejected in favor of the alternative hypothesis 
(that there is at least one significant difference between the group means).

It is important to note that the one-way ANOVA does not indicate which specific groups differ significantly from one another.
To determine which group means are significantly different from one another, post-hoc tests such as Tukey's HSD or pairwise comparisons with a Bonferroni correction can be used.

In summary, the obtained F-statistic and p-value indicate that there is at least one significant difference between the means of the groups being compared. 
However, further analysis is necessary to determine which specific groups differ significantly from one another.
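
The link between the F-statistic and the p-value can be made concrete with scipy.stats.f. The question does not state the degrees of freedom, so the sketch below assumes a hypothetical design with 3 groups and 15 observations in total. The resulting p-value therefore only approximates the reported 0.02.

```python
from scipy import stats

f_stat = 5.23
alpha = 0.05

# Hypothetical design (not given in the question): 3 groups, 15 observations
n_groups, n_total = 3, 15
df_between = n_groups - 1        # numerator degrees of freedom
df_within = n_total - n_groups   # denominator degrees of freedom

# p-value: upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_stat, df_between, df_within)
# Critical value: the F value that would be exceeded with probability alpha under the null
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha
print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = {alpha}: {f_crit:.3f}")
print("reject null hypothesis:", reject_null)
```

Under these assumed degrees of freedom the p-value is about 0.023, and the observed F exceeds the critical value, which is the same decision expressed in two ways.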

In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?
Ans:
In a repeated measures ANOVA, missing data can be handled in different ways depending on the specific analysis plan and the nature of the missing data.
Here are a few common methods to handle missing data in repeated measures ANOVA:

Pairwise deletion: This method involves using only the available data for each pair of variables, and excluding cases that have missing data for any variable in the pair.
This can result in different sample sizes for different variables, and may lead to biased estimates of the means and variances if the missing data are not missing completely at random.

Listwise deletion: This method involves excluding all cases that have missing data for any variable in the analysis.
This can lead to a smaller sample size and potentially biased results if the missing data are related to the variables being analyzed.

Imputation: This method involves filling in the missing data with estimated values.
There are different types of imputation methods, such as mean imputation, regression imputation, and multiple imputation.
Imputation preserves the sample size and statistical power, but the validity of the results depends on the method used: mean imputation, for instance, understates the variance, whereas multiple imputation carries the uncertainty of the imputed values into the analysis.

The potential consequences of using different methods to handle missing data in a repeated measures ANOVA include bias in the estimates of means and variances, reduced statistical power, and potentially incorrect conclusions about the significance of the effects being tested.
The choice of method depends on the specific research question, the amount and nature of the missing data, and the assumptions made about the missing data mechanism.
It is important to carefully consider the potential consequences of each method and to use appropriate sensitivity analyses to assess the robustness of the results to the choice of method.
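
The trade-offs described above can be seen on a small example. The sketch below uses a made-up wide-format table (one row per participant, one column per time point; all names and values are hypothetical) and compares listwise deletion with mean imputation in pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 13.0, 9.0],
    "t2": [11.0, np.nan, 12.0, 15.0, 10.0],
    "t3": [13.0, 14.0, np.nan, 16.0, 12.0],
}, index=["p1", "p2", "p3", "p4", "p5"])

# Listwise deletion: keep only participants with a value at every time point
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with the mean of its column
mean_imputed = scores.fillna(scores.mean())

print("participants before deletion:", len(scores))
print("participants after listwise deletion:", len(complete_cases))
print("variance of t2, available data:", scores["t2"].var())
print("variance of t2, after mean imputation:", mean_imputed["t2"].var())
```

Listwise deletion drops two of the five participants here, while mean imputation keeps all five but shrinks the variance of the imputed column, which is one way it can distort the F-test.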

In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.
Ans:
Post-hoc tests are used in ANOVA to determine which specific groups differ significantly from one another after the overall ANOVA has detected a significant difference between groups. 
Some common post-hoc tests include:

1. Tukey's Honestly Significant Difference (HSD): This test compares all possible pairs of group means and adjusts for multiple comparisons to control the family-wise error rate.
It is a good default when every pairwise comparison is of interest, and it is less conservative than the Bonferroni correction or Scheffe's test in that situation.
2. Bonferroni correction: This method involves dividing the significance level (typically 0.05) by the number of comparisons being made.
This results in a more stringent criterion for statistical significance. It is simple and general, but it becomes very conservative as the number of comparisons grows, so it is best suited to a small number of planned comparisons.
3. Scheffe's test: This test controls the family-wise error rate across all possible contrasts among the group means, not only pairwise comparisons. This makes it the most conservative of the tests listed here, and it is appropriate when complex contrasts are chosen after looking at the data.
4. Dunnett's test: This test compares each group mean to a control group mean and is appropriate when there is a control group that is being compared to multiple treatment groups.

An example of a situation where a post-hoc test might be necessary is in a study comparing the mean scores on a test across four different treatment groups.
After conducting an ANOVA, the overall test may detect a significant difference between the groups.
However, the ANOVA does not indicate which specific groups differ significantly from one another.
In this case, a post-hoc test such as Tukey's HSD or pairwise comparisons with a Bonferroni correction would be appropriate to determine which specific groups differ significantly from one another.
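
As a minimal sketch of such a follow-up, Tukey's HSD can be run with pairwise_tukeyhsd from statsmodels. The data below are made up for illustration and use three groups rather than four to keep the output short. Groups a and b have similar means, while group c is clearly higher.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical test scores for three treatment groups (five participants each)
scores = np.array([4.8, 5.1, 5.0, 4.9, 5.2,       # group a
                   5.0, 5.2, 4.9, 5.1, 5.3,       # group b
                   9.8, 10.1, 10.0, 9.9, 10.2])   # group c
labels = np.array(["a"] * 5 + ["b"] * 5 + ["c"] * 5)

# All pairwise comparisons with the family-wise error rate held at 5%
result = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(result)

# One True/False decision per pair, in the order (a, b), (a, c), (b, c)
reject = [bool(r) for r in result.reject]
print(reject)
```

The reject column of the printed table shows which pairs differ: here only the comparisons involving group c are flagged.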

In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [13]:
import scipy.stats as stats

# Illustrative sample data: 25 weight-loss values per diet (75 in total, rather than the 50 participants in the question)
diet_A = [5.2, 4.8, 6.1, 3.9, 5.5, 4.3, 5.9, 6.3, 5.1, 4.7, 5.6, 4.8, 6.2, 5.3, 4.9, 5.4, 4.5, 4.2, 5.8, 4.4, 4.6, 4.7, 5.0, 6.0, 5.7]
diet_B = [6.9, 7.2, 8.0, 6.1, 6.5, 7.8, 5.9, 7.1, 6.8, 6.6, 7.3, 6.5, 7.5, 7.6, 6.7, 6.4, 6.9, 6.2, 7.7, 6.3, 7.2, 7.0, 6.6, 7.4, 6.8]
diet_C = [9.4, 8.6, 8.8, 9.0, 7.8, 8.5, 9.3, 9.2, 8.3, 8.4, 7.9, 8.2, 8.9, 9.1, 8.7, 8.4, 8.1, 8.6, 9.5, 8.0, 9.4, 9.1, 8.7, 8.2, 9.6]
f_stat, p_val = stats.f_oneway(diet_A, diet_B, diet_C)

In [14]:
(f_stat,p_val)

(226.19933067729116, 9.039682845587076e-32)

In [None]:
Interpreting the results: Since the p-value (about 9.0e-32) is far below 0.05 (or any other conventional significance level chosen beforehand),
we reject the null hypothesis of no difference between the mean weight loss of the three diets.
In other words, we conclude that there are significant differences in the mean weight loss between at least two of the diets.
The F-statistic, F(2, 72) = 226.20, indicates a very large difference between the group means relative to the within-group variability.

In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [15]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
data = {'Program': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C',
                    'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'A'],
        'Experience': ['Novice', 'Experienced'] * 15,
        'Time': [12.5, 14.2, 13.9, 11.6, 11.8, 11.3, 10.2, 10.8, 9.9, 13.6, 14.1, 13.7, 11.2, 10.8, 11.5,
                 9.5, 9.8, 10.5, 12.3, 13.1, 12.6, 10.9, 11.1, 11.8, 9.6, 9.9, 10.3, 12.9, 13.4, 13.8]}

df = pd.DataFrame(data)

model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

                             sum_sq    df           F        PR(>F)
C(Program)                57.502048   2.0  105.899667  1.236016e-12
C(Experience)              0.286770   1.0    1.056271  3.143146e-01
C(Program):C(Experience)   0.368785   2.0    0.679180  5.165127e-01
Residual                   6.515833  24.0         NaN           NaN


In [None]:
Interpreting the results: The results show that there is a significant main effect of the software program (F(2, 24) = 105.90, p = 1.24e-12),
but no significant main effect of employee experience level (F(1, 24) = 1.06, p = 0.31).
Additionally, there is no significant interaction effect between the software program and employee experience level (F(2, 24) = 0.68, p = 0.52).
Therefore, we can conclude that the software program has a significant effect on task completion time, while there is no evidence that experience level affects completion time or that the effect of the program depends on experience level.

In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [21]:
import numpy as np
from scipy.stats import ttest_ind
control_group = np.random.normal(loc=75, scale=10, size=50)
experimental_group = np.random.normal(loc=80, scale=10, size=50)
t_stat, p_value = ttest_ind(control_group, experimental_group)

print("t-statistic: {:.2f}".format(t_stat))
print("p-value: {:.3f}".format(p_value))


t-statistic: -1.77
p-value: 0.079


In [None]:
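The output shows the t-statistic and p-value.
Since the p-value (0.079) is larger than 0.05, we fail to reject the null hypothesis: there is no statistically significant difference in test scores between the control and experimental groups.
Because the data are simulated without a fixed random seed, the exact values change from run to run.
With only two groups, the t-test is already the single pairwise comparison, so no post-hoc test is needed.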
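
The question also asks about a post-hoc follow-up. With only two groups there is nothing further to compare, because the pooled-variance t-test is equivalent to a one-way ANOVA on the same two groups (F equals t squared, with the same p-value). The sketch below checks this on simulated data. The fixed seed is an assumption added for reproducibility, so its numbers will not match the output shown earlier.

```python
import numpy as np
from scipy import stats

# Simulated scores; the seed is an assumption added so the sketch is reproducible
rng = np.random.default_rng(0)
control = rng.normal(loc=75, scale=10, size=50)
experimental = rng.normal(loc=80, scale=10, size=50)

# Pooled-variance two-sample t-test (the default for ttest_ind)
t_stat, t_p = stats.ttest_ind(control, experimental)
# One-way ANOVA on the same two groups
f_stat, f_p = stats.f_oneway(control, experimental)

print("t squared:", t_stat ** 2)
print("F:", f_stat)
print("p-values equal:", np.isclose(t_p, f_p))
```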

In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a post-
hoc test to determine which store(s) differ significantly from each other.

In [13]:
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Long-format data: one row per store per day, so each day acts as the repeated-measures "subject"
data = {'Store': ['A']*30 + ['B']*30 + ['C']*30,
        'Day': list(range(1,31))*3,
        'Sales': [10, 12, 11, 9, 13, 14, 11, 10, 12, 13, 15, 13, 12, 11, 9, 11, 10, 12, 13, 14, 15, 14, 13, 12, 9, 10, 11, 13, 12, 14,
                  11, 10, 12, 13, 15, 13, 12, 11, 9, 11, 10, 12, 13, 14, 15, 14, 13, 12, 9, 10, 11, 13, 12, 14, 11, 10, 12, 13, 15, 13,
                  12, 11, 9, 11, 10, 12, 13, 14, 15, 14, 13, 12, 9,12,23,12,21,23,31,12,32,12,32,12,22,11,33,12,32,12]}
df = pd.DataFrame(data)
# Repeated measures ANOVA with Store as the within-subject factor
print(AnovaRM(df, 'Sales', 'Day', within=['Store']).fit())

               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  9.1933 2.0000 58.0000 0.0003



In [None]:
The repeated measures ANOVA gives F(2, 58) = 9.19 with a p-value of 0.0003 for the Store factor,
which is less than the standard significance level of 0.05,
so we can conclude that there are significant differences in mean daily sales between the three stores.
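
Because every day contributes one observation per store, pairwise follow-up comparisons can keep that pairing. One common option is paired t-tests with a Bonferroni correction. The sketch below illustrates the idea on a small made-up dataset (six days, hypothetical sales figures), not on the data above.

```python
from itertools import combinations
from scipy import stats

# Hypothetical sales for three stores recorded on the same six days
sales = {
    "A": [10, 12, 11, 13, 12, 14],
    "B": [11, 12, 12, 13, 13, 14],
    "C": [20, 23, 21, 25, 22, 26],
}

pairs = list(combinations(sales, 2))  # (A, B), (A, C), (B, C)
adjusted = {}
for first, second in pairs:
    # Paired t-test: each day contributes one difference between the two stores
    _, p = stats.ttest_rel(sales[first], sales[second])
    # Bonferroni: multiply by the number of comparisons and cap at 1
    adjusted[(first, second)] = min(p * len(pairs), 1.0)

for pair, p_adj in adjusted.items():
    decision = "differ" if p_adj < 0.05 else "no significant difference"
    print(pair, f"adjusted p = {p_adj:.4f}", decision)
```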

In [14]:
print(pairwise_tukeyhsd(df['Sales'], df['Store']))

Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj   lower  upper  reject
---------------------------------------------------
     A      B   0.1667   0.99 -2.7834 3.1168  False
     A      C      4.7 0.0008  1.7499 7.6501   True
     B      C   4.5333 0.0012  1.5832 7.4834   True
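---------------------------------------------------

In [None]:
Tukey's HSD shows that Store C differs significantly from both Store A (mean difference 4.70, adjusted p = 0.0008) and Store B (mean difference 4.53, adjusted p = 0.0012), with higher mean sales in Store C.
Stores A and B do not differ significantly (mean difference 0.17, adjusted p = 0.99).
Note that pairwise_tukeyhsd treats all observations as independent, so it ignores the repeated-measures structure of the data.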