In [None]:
'''Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.
Answer-ANOVA (Analysis of Variance) is a statistical test used to determine whether there are significant differences 
between the means of three or more groups. To use ANOVA, there are several assumptions that need to be met:

Independence: The data points in each group should be independent of each other. In other words, the value of one data 
point should not influence the value of another data point within the same group.

Normality: The data in each group (more precisely, the residuals of the model) should be approximately normally
distributed. This means that the data should follow a bell-shaped curve with a symmetrical distribution.

Homogeneity of variance: The variance of the data in each group should be equal. This means that the spread of the data 
should be roughly the same for each group.

If these assumptions are not met, it can impact the validity of the ANOVA results. Here are some examples of violations
and their impact:

Violation of independence: If the data points are not independent, the F-test typically understates the true
variability and produces too many false positives. For example, if you compare the performance of employees across
three departments but measure the same employee several times, or if employees on the same team influence one
another's performance, the observations are correlated. This violation can be addressed through the study design
(random sampling and random assignment) or by using a method that models the dependence, such as a repeated measures
ANOVA.

Violation of normality: If the data in each group is not normally distributed, it can lead to inaccurate results. For
example, if you are comparing the weight of three different types of apples and the weight data is skewed, the ANOVA
results may not accurately reflect the differences between the means. This violation can be addressed by transforming
the data or using a non-parametric test instead.

Violation of homogeneity of variance: If the variance of the data in each group is not equal, it can lead to inaccurate
results. For example, if you are comparing the growth rates of three different types of plants and one type of plant
has a much larger variance in growth rates than the others, the ANOVA results may not accurately reflect the
differences between the means. This violation can be addressed by using Welch's ANOVA or transforming the data.

In summary, ANOVA requires the assumptions of independence, normality, and homogeneity of variance to be met. Violations
of these assumptions can impact the validity of the ANOVA results and should be addressed appropriately.'''
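
The normality and equal-variance assumptions described above can be checked before running ANOVA. The following is a minimal sketch, assuming simulated data (the group means, spread, and sample sizes are made up for illustration) and using the Shapiro-Wilk and Levene tests from SciPy. Independence cannot be tested this way; it has to come from the study design.

```python
import numpy as np
from scipy import stats

# Simulated data (hypothetical): three groups with different means and equal spread
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mean, scale=1.0, size=30) for mean in (5.0, 5.5, 6.0)]

# Normality: Shapiro-Wilk test per group (null hypothesis: the sample is normally distributed)
shapiro_p = [stats.shapiro(g)[1] for g in groups]

# Homogeneity of variance: Levene's test (null hypothesis: all groups have equal variance)
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_p, 3))
print("Levene p-value:", round(float(levene_p), 3))
```

A small p-value in either test is evidence against the corresponding assumption; a large p-value does not prove the assumption holds, so plots of the data are a useful complement.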

In [None]:
'''Q2. What are the three types of ANOVA, and in what situations would each be used?
Answer-There are three types of ANOVA:

One-Way ANOVA: One-Way ANOVA is used to compare the means of three or more groups that are independent of each other on a
single independent variable (also known as a factor). For example, you might use One-Way ANOVA to compare the average 
sales of three different stores in a retail chain. The independent variable is the store location, and the dependent 
variable is the sales.

Two-Way ANOVA: Two-Way ANOVA is used to examine how two independent variables (factors) affect a single dependent
variable, including whether the two factors interact. For example, you might use Two-Way ANOVA to compare the average
salaries of employees in three different departments of a company, while also taking into account their level of
experience. The two independent variables are the department and the level of experience, and the dependent variable
is the salary.

Three-Way ANOVA: Three-Way ANOVA is used to examine how three independent variables, and the interactions among them,
affect a single dependent variable. For example, you might use Three-Way ANOVA to compare the average test scores of
students in three different schools, while also taking into account their age group and gender. The three independent
variables are the school, age group, and gender, and the dependent variable is the test score.

In summary, One-Way ANOVA is used when there is a single independent variable, Two-Way ANOVA is used when there are two 
independent variables, and Three-Way ANOVA is used when there are three independent variables. The appropriate type of
ANOVA to use depends on the research question and the number of independent variables being examined.'''

In [None]:
'''Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
Answer-The partitioning of variance in ANOVA refers to the process of breaking down the total variance in the data into
different components based on the sources of variability. ANOVA partitions the total variance into two types of variance: 
the variance between groups and the variance within groups. The variance between groups represents the differences 
between the means of the different groups being compared, while the variance within groups represents the variation 
within each group.

Partitioning of variance is important in ANOVA because it allows us to determine the contribution of each independent
variable to the dependent variable. By examining the proportion of variance accounted for by each variable, we can assess
the importance of each variable in explaining the differences in the dependent variable. This can help us to identify the
most important factors that are influencing the outcome variable and to determine whether the differences between groups
are statistically significant.

Furthermore, how the variance is partitioned depends on the design: One-Way ANOVA splits the total variance into a
between-group and a within-group part, while Two-Way and Three-Way ANOVA further split the between-group part into
main effects of each independent variable and their interactions.

Overall, understanding the partitioning of variance in ANOVA is essential for interpreting the results of the analysis, 
identifying the sources of variation in the data, and drawing meaningful conclusions from the study.'''
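
The partitioning described above can be verified numerically: the total sum of squares equals the between-group plus the within-group sum of squares. This is a minimal sketch on simulated data (the group means and sizes are assumptions for illustration), using only NumPy.

```python
import numpy as np

# Simulated data (hypothetical): three groups of 20 observations each
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mean, scale=1.0, size=20) for mean in (5.0, 6.0, 7.0)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: squared deviations of every observation from the grand mean
sst = ((all_values - grand_mean) ** 2).sum()
# Between-group variation: squared deviations of group means from the grand mean, weighted by group size
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: squared deviations of observations from their own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F-statistic: ratio of the between-group to the within-group mean square
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_statistic = (ssb / df_between) / (ssw / df_within)

print("SST:", sst, "SSB + SSW:", ssb + ssw)
print("F-statistic:", f_statistic)
```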

In [None]:
'''Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?'''
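import pandas as pd
from statsmodels.formula.api import ols

# load data (the file name and column names are placeholders)
data = pd.read_csv('data.csv')
# fit one-way ANOVA model; C() treats the independent variable as categorical
model = ols('dependent_variable ~ C(independent_variable)', data=data).fit()

# extract sum of squares values
# note: statsmodels uses different names from the question:
#   model.ess = explained (between-group) sum of squares, called SSE in the question
#   model.ssr = residual (within-group) sum of squares, called SSR in the question
sse = model.ess
ssr = model.ssr
sst = model.centered_tss  # total sum of squares, equal to sse + ssr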


In [None]:
'''Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?'''
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# load data (the file name and column names are placeholders)
data = pd.read_csv('data.csv')
# fit two-way ANOVA model; a * b expands to both main effects plus the a:b interaction
model = ols('dependent_variable ~ C(independent_variable_1) * C(independent_variable_2)', data=data).fit()

# ANOVA table: one row per main effect, one for the interaction, one for the residual
anova_table = sm.stats.anova_lm(model, typ=2)

# extract main effects and interaction effect (sum of squares, df, F, and p-value for each)
main_effect_1 = anova_table.loc['C(independent_variable_1)']
main_effect_2 = anova_table.loc['C(independent_variable_2)']
interaction_effect = anova_table.loc['C(independent_variable_1):C(independent_variable_2)']


In [None]:
'''Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?
Answer-If we conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, we can conclude that
there is a statistically significant difference between the means of the groups.

The F-statistic tells us the ratio of the variation between the groups (explained variation) to the variation within the 
groups (unexplained variation). A higher F-statistic indicates that the difference between the means of the groups is 
larger relative to the variation within the groups.

The p-value of 0.02 indicates that, if the null hypothesis is true (there is no difference between the means of the
groups), there is only a 2% chance of observing an F-statistic at least this large. Since this p-value is smaller than
the conventional alpha level of 0.05, we reject the null hypothesis and conclude that the difference between the means
of the groups is statistically significant.

In terms of interpretation, we can say that the groups are not all the same and that at least one group has a different
mean from the others. However, we cannot say which group(s) have different means without further analysis, such as 
post-hoc tests. It's important to note that statistical significance does not necessarily imply practical significance, 
and we should also consider the magnitude of the differences between the means and whether they are meaningful in the 
context of the study.'''
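
To illustrate how the p-value follows from the F-statistic, and how an effect size can complement it, here is a minimal sketch. The degrees of freedom are assumptions (the question does not state the number of groups or observations), so the computed p-value will not exactly match the reported 0.02.

```python
from scipy import stats

f_statistic = 5.23
df_between = 2   # assumption: 3 groups
df_within = 15   # assumption: 18 observations in total

# p-value: probability of an F at least this large under the null hypothesis (upper tail)
p_value = stats.f.sf(f_statistic, df_between, df_within)

# eta squared: share of the total variation explained by group membership, SSB / (SSB + SSW)
eta_squared = (f_statistic * df_between) / (f_statistic * df_between + df_within)

print("p-value:", p_value)
print("eta squared:", eta_squared)
```

Eta squared is one common way to judge practical significance: it expresses the between-group variation as a proportion of the total variation.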

In [None]:
'''Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?
Answer-Handling missing data in a repeated measures ANOVA can be challenging, as missing values can potentially bias
the results and reduce the statistical power of the analysis. There are several methods to handle missing data, and
the choice of method can have consequences for the results of the analysis.

One approach to handling missing data is to simply exclude cases with missing values from the analysis. This approach is
called listwise deletion or complete-case analysis. This method is straightforward but can lead to loss of statistical
power and potentially biased results if the missing data are related to the outcome or predictors of interest.

Another approach is to impute missing data, where missing values are replaced with estimates based on the observed data.
There are several methods for imputing missing data, such as mean imputation, regression imputation, and multiple
imputation. Mean imputation involves replacing missing values with the mean of the observed values for that variable;
it is simple but artificially reduces the variance of the variable. Regression imputation involves predicting the
missing values based on other variables in the dataset; it uses more information but can overstate the strength of the
relationships between variables. Multiple imputation involves generating several plausible imputed datasets based on a
model that accounts for the uncertainty in the missing values, analyzing each dataset, and pooling the results.'''
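
The consequences of listwise deletion and mean imputation can be seen on a small example. This is a minimal sketch with simulated data in wide format (one row per subject, one hypothetical column per measurement occasion `t1`, `t2`, `t3`); the values and the positions of the missing entries are made up.

```python
import numpy as np
import pandas as pd

# Simulated repeated measures (hypothetical): 8 subjects measured at 3 time points
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.normal(50, 10, size=(8, 3)), columns=["t1", "t2", "t3"])
wide.loc[1, "t2"] = np.nan  # subject 1 missed the second measurement
wide.loc[5, "t3"] = np.nan  # subject 5 missed the third measurement

# Listwise deletion: drop every subject with at least one missing measurement
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with the mean of its column
mean_imputed = wide.fillna(wide.mean())

print("Subjects before/after listwise deletion:", len(wide), len(complete_cases))
print("Variance of t2, observed vs mean-imputed:", wide["t2"].var(), mean_imputed["t2"].var())
```

Listwise deletion discards all measurements of two subjects although each is missing only one value, and mean imputation yields a smaller variance for `t2` than the observed data.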

In [None]:
'''Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.
Answer-Post-hoc tests are used to determine which specific group means differ from each other after a statistically 
significant result is obtained in an ANOVA. There are several common post-hoc tests, including:

Tukey's Honestly Significant Difference (HSD) test: This test controls the family-wise error rate (FWER) and can be
used when the group sizes are equal or unequal. It compares all possible pairwise differences between group means and
provides simultaneous confidence intervals for the differences. It is often used in situations where there is no prior
hypothesis about which specific group means differ from each other.

Bonferroni correction: This method controls the FWER and is appropriate when the number of comparisons is small.
It divides the alpha level by the number of comparisons to obtain a more stringent alpha level for each comparison.

Scheffe's test: This test controls the FWER for all possible contrasts between group means, not only pairwise
comparisons. It is more conservative than Tukey's HSD test for pairwise comparisons, but it is useful when complex
comparisons, such as one group against the average of several others, are of interest.

For example, if a one-way ANOVA comparing three diets is significant, the F-test only shows that at least one mean
differs; a post-hoc test such as Tukey's HSD is needed to determine which pairs of diets differ.'''
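
Two of the post-hoc approaches above can be sketched in Python. This is a minimal illustration on simulated data (the group labels, means, and sizes are assumptions): Tukey's HSD via `pairwise_tukeyhsd` from statsmodels, and pairwise t-tests with a Bonferroni correction via `multipletests`.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Simulated data (hypothetical): groups A and B share a mean, group C is clearly higher
rng = np.random.default_rng(0)
samples = {"A": rng.normal(5, 1, 30), "B": rng.normal(5, 1, 30), "C": rng.normal(8, 1, 30)}

values = np.concatenate(list(samples.values()))
labels = np.repeat(list(samples.keys()), 30)

# Tukey's HSD: all pairwise comparisons with family-wise error control
tukey = pairwise_tukeyhsd(values, labels, alpha=0.05)
print(tukey)

# Bonferroni: pairwise t-tests, then adjust the p-values for the number of comparisons
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
raw_p = [stats.ttest_ind(samples[a], samples[b])[1] for a, b in pairs]
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
for (a, b), p, r in zip(pairs, adjusted_p, reject):
    print(a, "vs", b, "adjusted p =", p, "reject =", r)
```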

In [8]:
'''Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.'''
import numpy as np
from scipy.stats import f_oneway

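# generate sample data (simulated: 50 participants per diet, i.e. 150 in total, whereas the question
# describes 50 participants overall)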
np.random.seed(1)
diet_a = np.random.normal(5, 1, 50)
diet_b = np.random.normal(6, 1, 50)
diet_c = np.random.normal(7, 1, 50)

# conduct one-way ANOVA
f_statistic, p_value = f_oneway(diet_a, diet_b, diet_c)

# report results
print("F-statistic: {:.2f}".format(f_statistic))
print("p-value: {:.4f}".format(p_value))

if p_value < 0.05:
    print("There is a significant difference between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")


F-statistic: 71.07
p-value: 0.0000
There is a significant difference between the mean weight loss of the three diets.


In [10]:
'''Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.'''
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

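# create sample data (simulated: 90 observations, 15 per Program x Experience cell, whereas the question
# describes 30 employees; all times come from the same distribution, so no real effects are built in)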
np.random.seed(1)
program = ["A", "B", "C"] * 30
experience = ["Novice"] * 45 + ["Experienced"] * 45
time = np.random.normal(10, 2, 90)

# convert data to dataframe
df = pd.DataFrame({"Program": program, "Experience": experience, "Time": time})

# conduct two-way ANOVA
model = ols("Time ~ C(Program) + C(Experience) + C(Program):C(Experience)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# report results
print(anova_table)


                              sum_sq    df         F    PR(>F)
C(Program)                  6.234931   2.0  0.956295  0.388458
C(Experience)               7.218401   1.0  2.214272  0.140484
C(Program):C(Experience)   13.489107   2.0  2.068918  0.132712
Residual                  273.835194  84.0       NaN       NaN


In [11]:
'''Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.'''
import numpy as np
from scipy.stats import ttest_ind

# Generate some example data (simulated: 100 students per group)
np.random.seed(1)
control_scores = np.random.normal(70, 10, 100)
experimental_scores = np.random.normal(75, 10, 100)

# Conduct two-sample t-test
t_stat, p_value = ttest_ind(control_scores, experimental_scores)
print("t-statistic:", t_stat)
print("p-value:", p_value)

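# Conduct post-hoc test (e.g. Tukey's HSD test)
# Note: with only two groups, the t-test already shows which groups differ, so a post-hoc test is
# redundant; Tukey's HSD is shown for illustration and reproduces the same single comparison.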
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey_results = pairwise_tukeyhsd(np.concatenate([control_scores, experimental_scores]), 
                                  np.concatenate([np.repeat("Control", 100), np.repeat("Experimental", 100)]))

print(tukey_results)


t-statistic: -4.584315463985094
p-value: 8.059088190829134e-06
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   5.9221   0.0 3.3746 8.4696   True
--------------------------------------------------------


In [None]:
'''Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant
differences in sales between the three stores. If the results are significant, follow up with a post-hoc
test to determine which store(s) differ significantly from each other.'''
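
One possible approach to this question is sketched below. It assumes simulated sales figures (the store means, the shared day-to-day fluctuation, and the noise level are made up), treats each day as the repeated "subject" measured under three conditions (the stores), and uses `AnovaRM` from statsmodels. The post-hoc step uses paired t-tests with a Bonferroni correction, which is one of several reasonable choices.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multitest import multipletests

# Simulated data (hypothetical): 30 days, each with one sales figure per store
rng = np.random.default_rng(1)
n_days = 30
day_effect = rng.normal(0, 50, n_days)  # fluctuation shared by all stores on a given day
sales = pd.DataFrame({
    "Day": np.tile(np.arange(n_days), 3),
    "Store": np.repeat(["A", "B", "C"], n_days),
    "Sales": np.concatenate([
        1000 + day_effect + rng.normal(0, 30, n_days),
        1020 + day_effect + rng.normal(0, 30, n_days),
        1100 + day_effect + rng.normal(0, 30, n_days),
    ]),
})

# Repeated measures ANOVA: Day is the subject, Store is the within-subject factor
result = AnovaRM(sales, depvar="Sales", subject="Day", within=["Store"]).fit()
print(result.anova_table)
p_value = result.anova_table["Pr > F"].iloc[0]

# Post-hoc: paired t-tests between stores, Bonferroni-adjusted, only if the ANOVA is significant
if p_value < 0.05:
    wide = sales.pivot(index="Day", columns="Store", values="Sales")
    pairs = list(combinations(wide.columns, 2))
    raw_p = [stats.ttest_rel(wide[a], wide[b])[1] for a, b in pairs]
    reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
    for (a, b), p, r in zip(pairs, adjusted_p, reject):
        print("Store", a, "vs Store", b, "adjusted p =", p, "reject =", r)
```

Paired tests are used in the post-hoc step because the three sales figures recorded on the same day are not independent of each other.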