Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Assumptions required to use ANOVA and violations that could impact the validity of the results:

#### Assumptions:

Independence of observations

Normality of the residuals

Homogeneity of variances

Random sampling from each group/population

#### Violations:

Non-independence of observations (e.g., repeated measures or clustering)

Non-normality of the residuals (e.g., skewness or outliers)

Heterogeneity of variances (e.g., unequal variances between groups)

Non-random sampling (e.g., selection bias or confounding)

Violations of these assumptions can lead to biased or inefficient estimates of the population 
parameters and incorrect inference about the statistical significance of the results.
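
Two of these assumptions, normality of the residuals and homogeneity of variances, can be checked numerically. The following is a minimal sketch using made-up data (the three groups, their sizes, and their means are assumptions for illustration). Independence and random sampling cannot be tested this way; they depend on how the study was designed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: three groups of 20 observations each
groups = [rng.normal(loc=m, scale=1.0, size=20) for m in (5.0, 5.5, 6.0)]

# Residuals in a one-way layout are deviations from each group's own mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: the null hypothesis is that the residuals are normally distributed
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: the null hypothesis is that all groups have equal variance
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
print(f"Levene p = {levene_p:.3f}")
```

A small p-value from either test suggests the corresponding assumption may be violated.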

Q2. What are the three types of ANOVA, and in what situations would each be used?

Three types of ANOVA and their usage:

#### One-way ANOVA: 
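used to test for differences in the mean of a single continuous outcome variable across the levels of one categorical independent variable (typically three or more groups; with two groups it is equivalent to an independent-samples t-test).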

#### Two-way ANOVA:

used to test for main and interaction effects of two categorical independent variables on a single continuous outcome variable.

#### Three-way ANOVA:

used to test for main and interaction effects of three categorical independent variables on a single continuous outcome variable.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Partitioning of variance refers to the decomposition of the total variability in the outcome variable (the total sum of squares) into a between-group component, which is explained by the independent variable(s), and a within-group component, which is the unexplained or error variability. This is important because the F-statistic is the ratio of these two components after each is divided by its degrees of freedom, and because the share of total variability explained by the independent variable(s) gives an estimate of effect size.
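
For a one-way design, the decomposition can be written as a single identity. The notation here is an assumption introduced for illustration: $y_{ij}$ is observation $i$ in group $j$, $n_j$ is the size of group $j$, $k$ is the number of groups, $\bar{y}_j$ is the mean of group $j$, and $\bar{y}$ is the overall mean.

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}\right)^2}_{\text{total}}
=
\underbrace{\sum_{j=1}^{k} n_j\left(\bar{y}_j-\bar{y}\right)^2}_{\text{between groups}}
+
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}_j\right)^2}_{\text{within groups}}
```

Note that the between-group term weights each group's squared deviation by its size $n_j$.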

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

Calculation of SST, SSE, and SSR in a One-Way ANOVA using Python:
To calculate SST, SSE, and SSR in a One-Way ANOVA using Python, we need to first import the necessary libraries and load the data into a DataFrame. Then we can calculate the mean of the dependent variable for each group, the overall mean, and the total sum of squares (SST).

In [1]:
import pandas as pd
import numpy as np

# load data into DataFrame (expects columns 'group' and 'dependent_variable')
df = pd.read_csv('data.csv')

# calculate mean of dependent variable for each group, and each group's size
group_means = df.groupby('group')['dependent_variable'].mean()
group_sizes = df.groupby('group')['dependent_variable'].count()

# calculate overall mean
overall_mean = df['dependent_variable'].mean()

# calculate total sum of squares (SST)
SST = np.sum((df['dependent_variable'] - overall_mean)**2)

# calculate explained sum of squares (SSE): squared deviation of each group mean
# from the overall mean, weighted by the number of observations in that group
SSE = np.sum(group_sizes * (group_means - overall_mean)**2)

# calculate residual sum of squares (SSR)
SSR = SST - SSE
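
As a sanity check on the decomposition, here is a self-contained sketch that needs no CSV file. The data values are made up for illustration. It computes all three sums of squares directly (using the question's naming: SSE is explained, SSR is residual) and compares the resulting F-statistic with `scipy.stats.f_oneway`.

```python
import pandas as pd
from scipy import stats

# Made-up data for illustration only: three groups of unequal size
df = pd.DataFrame({
    "group": ["a"] * 3 + ["b"] * 4 + ["c"] * 5,
    "dependent_variable": [4.0, 5.0, 6.0, 6.0, 7.0, 8.0, 7.0,
                           9.0, 10.0, 11.0, 10.0, 10.0],
})

y = df["dependent_variable"]
overall_mean = y.mean()
# Broadcast each group's mean back onto its own rows
group_mean_per_row = df.groupby("group")["dependent_variable"].transform("mean")

SST = ((y - overall_mean) ** 2).sum()                    # total
SSE = ((group_mean_per_row - overall_mean) ** 2).sum()   # explained (between groups)
SSR = ((y - group_mean_per_row) ** 2).sum()              # residual (within groups)

# F is the ratio of the two mean squares
k = df["group"].nunique()
n = len(df)
F_manual = (SSE / (k - 1)) / (SSR / (n - k))

samples = [g.to_numpy() for _, g in df.groupby("group")["dependent_variable"]]
F_scipy, p_scipy = stats.f_oneway(*samples)

print(f"SST = {SST:.3f}, SSE + SSR = {SSE + SSR:.3f}")
print(f"F (manual) = {F_manual:.3f}, F (scipy) = {F_scipy:.3f}")
```

Summing the explained term over rows rather than over groups weights each group by its size automatically.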

Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# load data into DataFrame
df = pd.read_csv('data.csv')

# perform Two-Way ANOVA
model = ols('dependent_variable ~ C(factor1) + C(factor2) + C(factor1):C(factor2)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

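# extract the sum of squares attributed to each main effect and to the interaction
# (the 'F' and 'PR(>F)' columns of anova_table hold the test statistics and p-values)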
main_effect_factor1 = anova_table.loc['C(factor1)', 'sum_sq']
main_effect_factor2 = anova_table.loc['C(factor2)', 'sum_sq']
interaction_effect = anova_table.loc['C(factor1):C(factor2)', 'sum_sq']



Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

### Interpretation of Results from a One-Way ANOVA:

A one-way ANOVA tests the null hypothesis that all group means are equal. In this case, the F-statistic is 5.23 and the p-value is 0.02. Because the p-value is less than the conventional significance level of 0.05, we reject the null hypothesis and conclude that at least one group mean differs from the others. The F-statistic is the ratio of the between-group mean square to the within-group mean square, so a value of 5.23 means the variability between the group means is about five times what within-group variability alone would produce. The test does not say which groups differ or how large the differences are; a post-hoc test and an effect size measure are needed for that.
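
The p-value is the upper-tail probability of the F distribution at the observed statistic. The question does not give the degrees of freedom, so the sketch below assumes 3 groups and 30 observations purely for illustration. The p-value it produces will therefore not necessarily match the 0.02 quoted above.

```python
from scipy import stats

F_observed = 5.23

# Assumed for illustration only: 3 groups, 30 observations in total
df_between = 3 - 1
df_within = 30 - 3

# Upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(F_observed, df_between, df_within)

# Smallest F that would be significant at the 0.05 level with these degrees of freedom
critical_F = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value = {p_value:.4f}, critical F = {critical_F:.3f}")
```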

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Handling Missing Data in Repeated Measures ANOVA:
In a repeated measures ANOVA, missing data can occur when a participant fails to complete one or more measurements. To handle missing data, there are several methods that can be used, such as pairwise deletion, listwise deletion, mean substitution, and maximum likelihood estimation.

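#### Pairwise deletion: 
Each comparison uses every participant who has data for the measurements involved, so a participant with one missing value is dropped only from the comparisons that need it. This retains more data than listwise deletion, but different comparisons rest on different subsets of participants, and estimates can be biased if the data are not missing completely at random (MCAR).

#### Listwise deletion: 
The analysis is conducted on only the participants with complete data, and any participants with missing values are excluded. This reduces power and can bias estimates if the data are not MCAR.

#### Mean substitution: 
Each missing value is replaced with the mean of the observed values for that measurement. This keeps every participant but artificially shrinks the variance, which can make results look more significant than they are.

#### Maximum likelihood estimation: 
The model (for example, a linear mixed model) is fitted to all available observations without discarding or filling in anything. The estimates are valid under the weaker assumption that the data are missing at random (MAR).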
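
To see how much the choice of method changes the data that reach the analysis, here is a minimal pandas sketch. The participants, time points, and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up wide-format data: one row per participant, one column per time point
wide = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 13.0],
        "t2": [11.0, np.nan, 12.0, 14.0],
        "t3": [12.0, 14.0, np.nan, 15.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Listwise deletion: keep only participants with every measurement present
listwise = wide.dropna()

# Pairwise deletion: each pair of time points uses everyone observed at both
n_pairwise_t1_t2 = wide[["t1", "t2"]].dropna().shape[0]
n_pairwise_t2_t3 = wide[["t2", "t3"]].dropna().shape[0]

# Mean substitution: fill each gap with that time point's observed mean
mean_filled = wide.fillna(wide.mean())

print("listwise participants:", list(listwise.index))
print("pairwise n (t1, t2):", n_pairwise_t1_t2)
print("pairwise n (t2, t3):", n_pairwise_t2_t3)
print("t2 variance before/after mean substitution:",
      wide["t2"].var(), mean_filled["t2"].var())
```

In this toy data, listwise deletion keeps two of the four participants, and mean substitution keeps all four at the cost of a smaller variance at `t2`.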

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.


Common Post-Hoc tests and their usage:

Tukey's HSD: used to compare all possible pairs of group means.

Bonferroni: used to control for type I error rates when multiple pairwise comparisons are made.

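Scheffe: used when complex contrasts (not only pairwise comparisons) are of interest; it is the most conservative of these tests. When the assumption of homogeneity of variance is violated, the Games-Howell test is the usual choice instead.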

Dunnett: used to compare multiple groups with a control group.

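A post-hoc test is necessary when a significant difference is found in the ANOVA, and there are more than two groups being compared. For example, if a one-way ANOVA comparing the mean weight loss of three diets is significant, it shows only that the diets are not all equal; Tukey's HSD would then identify which pairs of diets differ.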
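
Here is a minimal sketch of that workflow, using made-up data in which the third group is deliberately shifted away from the other two. It assumes SciPy 1.8 or later, which provides `scipy.stats.tukey_hsd`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Made-up groups for illustration; the third is shifted well away from the others
a = rng.normal(10.0, 1.0, 30)
b = rng.normal(10.2, 1.0, 30)
c = rng.normal(13.0, 1.0, 30)

# Omnibus test: is there any difference among the three means?
F, p = stats.f_oneway(a, b, c)

# Tukey's HSD compares every pair of groups while controlling the
# family-wise error rate; res.pvalue[i, j] is the p-value for group i vs j
res = stats.tukey_hsd(a, b, c)

print(f"ANOVA: F = {F:.2f}, p = {p:.4g}")
if p < 0.05:
    # Pairwise results are interpreted only when the omnibus test is significant
    print(res.pvalue)
```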

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [3]:
import pandas as pd
import scipy.stats as stats

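# assumes data.csv has one column of weight-loss values per diet (A, B, C);
# groups of unequal size leave NaN padding, which dropna() removes
data = pd.read_csv("data.csv")

F, p = stats.f_oneway(data['A'].dropna(), data['B'].dropna(), data['C'].dropna())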
print("F-statistic:", F)
print("p-value:", p)

The output will show the F-statistic and p-value. The researcher can interpret the results by comparing the p-value to the significance level (e.g., 0.05). If the p-value is below it, the null hypothesis that the three diets have the same mean weight loss is rejected, meaning at least one diet differs from the others; otherwise there is not enough evidence to conclude that the diets differ.

Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [4]:
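import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# assumes data.csv has columns 'Time', 'Software' and 'Experience'
data = pd.read_csv("data.csv")

# C() marks both predictors as categorical; '*' expands to both main effects plus their interaction
model = ols('Time ~ C(Software) * C(Experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)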


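The output will show the F-statistics and p-values for the two main effects (software program and experience level) and for their interaction. Each p-value is compared to the significance level to decide whether to reject the corresponding null hypothesis of no effect. The interaction should be examined first: if it is significant, the effect of the software program depends on experience level, and the main effects should be interpreted with caution.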

Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.



In [5]:
import pandas as pd
import scipy.stats as stats

data = pd.read_csv("data.csv")

control = data[data['Group']=='Control']['Scores']
experimental = data[data['Group']=='Experimental']['Scores']

t, p = stats.ttest_ind(control, experimental)
print("t-statistic:", t)
print("p-value:", p)

The output will show the t-statistic and p-value. If the p-value is less than the significance level, the researcher can conclude that there is a significant difference in test scores between the two groups. Because only two groups are being compared, no post-hoc test is needed: a significant result already identifies the pair that differs, and comparing the two group means shows which teaching method produced the higher scores.

Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.


In [6]:
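import pandas as pd
from statsmodels.stats.anova import AnovaRM

# assumes long-format data with columns 'Day', 'Store' and 'Sales':
# one row per store per day, with every store observed on every day
data = pd.read_csv("data.csv")

# each day is the repeated "subject"; Store is the within-subject factor
results = AnovaRM(data, depvar='Sales', subject='Day', within=['Store']).fit()

print(results)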


The output will show the F-statistic and p-value for the repeated measures ANOVA. If the p-value is less than the significance level, the researcher can conclude that there is a significant difference in sales between the three stores. The researcher can then perform a post-hoc test to determine which store(s) differ significantly from each other. If there is missing data, the researcher can handle it using methods like listwise deletion, imputation, or mixed models. However, the choice of method can affect the results and the conclusions drawn from them.
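
One possible post-hoc follow-up for the store comparison is a set of paired t-tests with a Bonferroni correction, since the same days are measured for every store. The sketch below uses made-up wide-format data (one row per day, one column per store); the sales figures and the size of the shared day-to-day effect are assumptions for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
# Made-up data: 30 days, with day-to-day variation shared by all stores
day_effect = rng.normal(0.0, 5.0, 30)
sales = pd.DataFrame({
    "A": 100.0 + day_effect + rng.normal(0.0, 2.0, 30),
    "B": 101.0 + day_effect + rng.normal(0.0, 2.0, 30),
    "C": 110.0 + day_effect + rng.normal(0.0, 2.0, 30),
})

pairs = list(combinations(sales.columns, 2))
adjusted = {}
for s1, s2 in pairs:
    # Paired test: each day contributes one difference between the two stores
    t, p = stats.ttest_rel(sales[s1], sales[s2])
    # Bonferroni: multiply by the number of comparisons, capped at 1
    adjusted[(s1, s2)] = min(p * len(pairs), 1.0)

for pair, p_adj in adjusted.items():
    print(pair, f"adjusted p = {p_adj:.4g}")
```

Bonferroni is simple but conservative; with only three comparisons the loss of power is usually modest.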