**Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.**

ANOVA (Analysis of Variance) is a statistical method used to compare means among two or more groups. It assumes that the samples are drawn from populations that have a normal distribution, homogeneity of variance, and independence of observations.

Assumptions of ANOVA are:

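Normality: Within each group, the observations (equivalently, the model residuals) are approximately normally distributed. The assumption is violated when the data are strongly skewed or contain extreme outliers, for example reaction times or incomes with a long right tail. The violation matters most when the samples are small, because the F-test becomes more robust to non-normality as group sizes grow.

Homogeneity of variance: The population variances of the groups are equal. For example, if one diet group has a standard deviation several times larger than the others, the F-test can have an inflated Type I error rate (false positives) or reduced power (ability to detect true effects), especially when the group sizes are unequal.

Independence of observations: The observations are independent of each other, both within and across groups. This assumption is violated when observations are correlated, for example when the same participant is measured several times, when students are sampled from the same classroom, or when participants are not randomly assigned to groups.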
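The first two assumptions can be checked numerically. The following is a minimal sketch, assuming simulated data (the group names, means, and seed are hypothetical) and using the Shapiro-Wilk and Levene tests from SciPy as one possible choice of diagnostics:

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 observations each
rng = np.random.default_rng(0)
group_a = rng.normal(10, 2, 30)
group_b = rng.normal(12, 2, 30)
group_c = rng.normal(11, 2, 30)
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk test on the residuals
# (each observation minus the mean of its own group)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print('Shapiro-Wilk p-value:', shapiro_p)
print('Levene p-value:', levene_p)
```

A small p-value in either test suggests that the corresponding assumption may not hold. Independence cannot be tested this way; it has to be judged from how the data were collected.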

**Q2. What are the three types of ANOVA, and in what situations would each be used?**

There are three types of ANOVA:

One-way ANOVA: One-way ANOVA is used when there is one independent variable with three or more levels or groups. It is used to determine if there is a significant difference in the means of the groups. For example, a one-way ANOVA could be used to compare the mean weight loss of three different diets.

Two-way ANOVA: Two-way ANOVA is used when there are two independent variables or factors. It is used to determine if there is a significant interaction between the two factors and if there is a significant main effect of each factor. For example, a two-way ANOVA could be used to determine if there is a significant difference in test scores based on both gender and teaching method.

Repeated measures ANOVA: Repeated measures ANOVA is used when the same group of participants is tested under multiple conditions or at multiple time points. It is used to determine if there is a significant difference between the conditions or time points. For example, a repeated measures ANOVA could be used to determine if there is a significant difference in anxiety levels measured before, immediately after, and one month after a mindfulness intervention.

**Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?**

ANOVA (Analysis of Variance) is a statistical method used to analyze the differences between two or more groups of data. The partitioning of variance refers to the division of the total variance in a data set into its different sources of variation.

The partitioning of variance in a one-way ANOVA involves dividing the total variation into two components: the variation between groups (also known as the "treatment" or "explained" variation) and the variation within groups (also known as the "error" or "unexplained" variation). The variation between groups measures the extent to which the group means differ from each other, while the variation within groups measures the variability of the data around each group's own mean. The two components add up exactly: the total sum of squares equals the between-group sum of squares plus the within-group sum of squares.

Understanding the partitioning of variance in ANOVA is important because it is the basis of the F-test. Each sum of squares is divided by its degrees of freedom to give a mean square, and the F-statistic is the ratio of the between-group mean square to the within-group mean square. A ratio well above 1 indicates that the differences between the group means are larger than would be expected from within-group variability alone.

This information is critical for researchers who are trying to understand the factors that influence a particular outcome. By identifying the sources of variation that are most important, researchers can focus their efforts on designing interventions or treatments that are most likely to be effective. Additionally, understanding the partitioning of variance can help researchers to avoid drawing false conclusions based on small or biased samples, which can lead to inaccurate or misleading results.
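The partition can be verified by hand on a small data set. The following is a minimal sketch with NumPy, assuming three hypothetical groups of three observations each (the group names and values are made up for illustration):

```python
import numpy as np

# Hypothetical data: three groups with three observations each
groups = {
    'A': np.array([4.0, 5.0, 6.0]),
    'B': np.array([7.0, 8.0, 9.0]),
    'C': np.array([1.0, 2.0, 3.0]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Total variation: every observation against the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# Between-group variation: each group mean against the grand mean,
# weighted by the group size
ss_between = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in groups.values())

# Within-group variation: every observation against its own group mean
ss_within = sum(((v - v.mean()) ** 2).sum() for v in groups.values())

# Mean squares and the F-statistic
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print('SS total:', ss_total)
print('SS between + SS within:', ss_between + ss_within)
print('F-statistic:', f_stat)
```

For these values the between-group and within-group sums of squares are 54 and 6, which add up to the total of 60.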







**Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?**

In [None]:
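import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data into a pandas DataFrame.
# Placeholder values: the group labels are illustrative, replace with real data.
data = pd.DataFrame({
    'group_variable': ['A', 'A', 'B', 'B', 'C', 'C'],
    'response_variable': [1, 2, 3, 4, 5, 6],
})

# Fit the ANOVA model using the formula interface; C() marks the column as categorical
model = ols('response_variable ~ C(group_variable)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=1)

# Explained (between-group) sum of squares, called SSE in the question
sse = anova_table.loc['C(group_variable)', 'sum_sq']

# Residual (within-group) sum of squares, called SSR in the question
ssr = anova_table.loc['Residual', 'sum_sq']

# Total sum of squares (SST) is the sum of the explained and residual parts
sst = sse + ssr

# Print the results
print('SST:', sst)
print('SSE:', sse)
print('SSR:', ssr)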


**Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?**

In [None]:
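import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data into a pandas DataFrame.
# Placeholder values with two observations per cell, replace with real data.
data = pd.DataFrame({
    'variable_1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'variable_2': ['x', 'x', 'y', 'y', 'x', 'x', 'y', 'y'],
    'response_variable': [1, 2, 3, 4, 5, 6, 8, 9],
})

# Fit the two-way model using the formula interface; C() marks a column as categorical
model = ols('response_variable ~ C(variable_1) + C(variable_2) + C(variable_1):C(variable_2)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=1)

# Sum of squares for the main effect of variable_1
me_1 = anova_table.loc['C(variable_1)', 'sum_sq']

# Sum of squares for the main effect of variable_2
me_2 = anova_table.loc['C(variable_2)', 'sum_sq']

# Sum of squares for the interaction between variable_1 and variable_2
ie = anova_table.loc['C(variable_1):C(variable_2)', 'sum_sq']

# Print the results
print('Main effect of variable 1:', me_1)
print('Main effect of variable 2:', me_2)
print('Interaction effect:', ie)

# The F-statistic and p-value of each effect are in the 'F' and 'PR(>F)' columns
print(anova_table)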


**Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?**

In this scenario, you have conducted a one-way ANOVA, which is a statistical test used to determine whether there are significant differences between the means of three or more groups. The F-statistic is a test statistic that is calculated as the ratio of the between-group variance to the within-group variance, and it is used to determine whether there is a significant difference between the means of the groups.

In this case, the F-statistic is 5.23, and the p-value is 0.02. The p-value is the probability of observing an F-statistic at least as large as the one calculated if the null hypothesis (all group means are equal) were true.

A p-value of 0.02 means that, if all group means were truly equal, an F-statistic of 5.23 or larger would occur in about 2% of samples. It is not the probability that the null hypothesis is true. Since this p-value is less than the conventional threshold of 0.05, we reject the null hypothesis and conclude that at least one group mean differs from the others.

However, the ANOVA test only tells us that a difference exists somewhere among the groups; it does not tell us which groups differ from each other. To determine this, we need to conduct post-hoc tests, such as Tukey's HSD test or pairwise comparisons with a Bonferroni correction.

Therefore, in summary, we can conclude that there are significant differences between the groups, but we cannot determine which groups are different without conducting post-hoc tests.
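The link between the F-statistic and the p-value can be illustrated with SciPy. The question does not state the degrees of freedom, so the sketch below assumes 3 groups and 18 observations in total (2 and 15 degrees of freedom); these values are hypothetical and were chosen only because they give a p-value close to the reported 0.02:

```python
from scipy import stats

f_stat = 5.23
df_between = 2   # assumed: 3 groups - 1
df_within = 15   # assumed: 18 observations - 3 groups

# p-value: upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value: the F-statistic needed for significance at alpha = 0.05
critical_f = stats.f.ppf(0.95, df_between, df_within)

print('p-value:', p_value)
print('Critical F at alpha = 0.05:', critical_f)
print('Reject null hypothesis:', f_stat > critical_f)
```

With different degrees of freedom the same F-statistic would give a different p-value, which is why both numbers should be reported together with the degrees of freedom.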







**Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?**

In a repeated measures ANOVA, missing data can occur when a participant has missing values on one or more of the measures across time points. Handling missing data is important to ensure that the results are not biased and to maximize the statistical power of the analysis.

There are several methods for handling missing data in a repeated measures ANOVA, including:

Complete case analysis: This method (also called listwise deletion) uses only participants who have complete data for all time points. It is straightforward to implement. However, it reduces power because every participant with even one missing value is dropped, and it gives biased estimates unless the data are missing completely at random (MCAR), meaning that the missingness is unrelated to both the observed and the unobserved values.

Imputation methods: Imputation methods replace missing values with estimates based on the available data. Common choices are mean imputation, hot deck imputation, and multiple imputation. Single imputation methods such as mean imputation retain all participants but understate variability, which makes standard errors too small. Multiple imputation propagates the uncertainty about the imputed values and gives valid results when the data are missing at random (MAR), meaning that the missingness depends only on observed variables and not on the missing values themselves.

Mixed-effects models: Mixed-effects models use all available observations from every participant, including those with incomplete records, and estimate the model parameters by maximum likelihood without filling in the missing values. The estimates are valid under MAR, so this approach is useful when the MCAR assumption is not plausible.

The potential consequences of using different methods to handle missing data can vary. Complete case analysis can lead to a loss of power and, if the data are not MCAR, to bias. Imputation and mixed-effects models can both give biased estimates if the data are missing not at random (MNAR), that is, when the missingness depends on the unobserved values. Mixed-effects models are also more complex to specify and fit than a classical repeated measures ANOVA.

Therefore, it is important to carefully consider the missing data mechanism and choose a method for handling missing data that is appropriate for the data set and research question. It is also recommended to report the missing data patterns and the methods used to handle missing data in the analysis, to increase the transparency and replicability of the study.
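The difference between complete case analysis and a mixed-effects model can be seen in how many observations each one uses. The following is a minimal sketch, assuming simulated scores for 20 hypothetical participants at three time points, with the last time point deleted for five of them; the column names, effect sizes, and seed are all made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_subjects = 20
time_points = ['t1', 't2', 't3']

# Simulated long-format data: a subject-specific offset plus a small time trend
subjects = np.repeat(np.arange(n_subjects), len(time_points))
times = np.tile(time_points, n_subjects)
time_shift = np.tile([0.0, 2.0, 4.0], n_subjects)
subject_effect = np.repeat(rng.normal(0, 5, n_subjects), len(time_points))
noise = rng.normal(0, 3, n_subjects * len(time_points))
long = pd.DataFrame({'subject': subjects, 'time': times,
                     'score': 50 + time_shift + subject_effect + noise})

# Hypothetical missingness: the first five subjects have no t3 score
missing = (long['subject'] < 5) & (long['time'] == 't3')
long.loc[missing, 'score'] = np.nan

# Complete case analysis keeps only subjects observed at every time point
wide = long.pivot(index='subject', columns='time', values='score')
complete_cases = wide.dropna()
print('Subjects kept by complete case analysis:', len(complete_cases))

# A mixed-effects model with a random intercept per subject uses every observed row
observed = long.dropna(subset=['score'])
mixed = smf.mixedlm('score ~ C(time)', observed, groups=observed['subject']).fit()
print('Observations used by the mixed model:', len(observed))
print(mixed.params)
```

Here complete case analysis discards all three measurements of the five incomplete participants, while the mixed model still uses their two observed scores.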







**Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.**

Post-hoc tests are used after conducting an ANOVA to determine which specific groups differ significantly from each other. The choice of a specific post-hoc test depends on the research question, the number of groups being compared, and the assumptions of the data.

Here are some common post-hoc tests used after ANOVA:

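Tukey's Honestly Significant Difference (HSD): This post-hoc test compares every pair of group means and controls the familywise Type I error rate. It assumes homogeneous variances and was derived for equal group sizes; the Tukey-Kramer variant extends it to unequal group sizes. It is a common default when all pairwise comparisons are of interest.

Bonferroni correction: This approach divides the significance level by the number of comparisons (or, equivalently, multiplies each p-value by that number). It gives strict control of the familywise error rate and works with any set of tests, but it becomes very conservative as the number of comparisons grows, so it is best suited to a small number of planned comparisons.

Scheffé's method: This post-hoc test controls the familywise error rate across all possible contrasts, including complex ones such as comparing the average of two groups with a third. It works with unequal group sizes but assumes homogeneous variances, and it is the most conservative of these tests when only pairwise comparisons are needed.

Games-Howell test: This post-hoc test is used when the assumption of equal variances is not met or the group sizes are unequal. It still assumes approximately normal data within each group.

Dunnett's test: This post-hoc test is used when each treatment group is compared with a single control group rather than with every other group. Because it makes fewer comparisons, it has more power than Tukey's HSD for that purpose while still controlling the familywise error rate.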

An example of a situation where a post-hoc test might be necessary is when a researcher is interested in comparing the mean scores of three or more groups on a variable. For instance, let's say that a researcher wants to investigate whether there are any differences in the average weight loss among three different diet programs. After conducting an ANOVA, the researcher finds that there is a significant difference in the mean weight loss among the three groups. However, to determine which specific groups differ significantly from each other, a post-hoc test, such as Tukey's HSD or Scheffé's method, would be necessary. This would enable the researcher to determine which diet programs lead to significantly different weight loss outcomes.
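As an illustration of one of these procedures, the sketch below runs pairwise t-tests with a Bonferroni correction. The three groups and their values are hypothetical, and `multipletests` from statsmodels is used here as one convenient way to apply the correction:

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Hypothetical weight loss values for three diet programs
groups = {
    'A': np.array([5.1, 6.0, 5.5, 6.2, 5.8]),
    'B': np.array([4.0, 4.4, 3.9, 4.6, 4.2]),
    'C': np.array([4.1, 4.5, 4.3, 3.8, 4.4]),
}

# One unadjusted t-test per pair of groups
pairs = list(combinations(groups, 2))
raw_p = [ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]

# Bonferroni: each p-value is multiplied by the number of comparisons (capped at 1)
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')

for (a, b), p_raw, p_adj, rej in zip(pairs, raw_p, adjusted_p, reject):
    print(f'{a} vs {b}: raw p = {p_raw:.4f}, adjusted p = {p_adj:.4f}, reject = {rej}')
```

With three groups there are three pairwise comparisons, so each adjusted p-value is three times the raw one.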







**Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.**

In [2]:
import numpy as np
from scipy.stats import f_oneway

# Sample weight loss data for diets A, B, and C
diet_a = np.array([5.2, 7.1, 3.5, 6.2, 4.7, 3.9, 5.8, 6.1, 4.2, 7.8,
                   6.7, 5.5, 4.8, 5.6, 6.4, 5.9, 4.4, 7.2, 4.1, 5.0,
                   5.3, 6.9, 5.7, 7.0, 5.1])
diet_b = np.array([4.5, 3.8, 2.9, 4.1, 5.4, 3.1, 2.5, 3.6, 4.9, 3.3,
                   4.2, 5.6, 3.7, 5.0, 3.9, 4.6, 3.2, 5.3, 4.8, 3.4,
                   2.8, 4.0, 5.1, 3.0, 4.7])
diet_c = np.array([2.1, 1.9, 1.8, 2.6, 2.3, 3.0, 1.7, 1.6, 2.2, 2.9,
                   2.5, 2.8, 2.0, 1.5, 1.8, 2.4, 2.7, 2.1, 2.6, 2.2,
                   1.9, 1.6, 2.3, 2.0, 2.5])

# Perform one-way ANOVA
f_stat, p_val = f_oneway(diet_a, diet_b, diet_c)

# Print results
print("F-statistic:", f_stat)
print("p-value:", p_val)


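F-statistic: 95.11301918095292
p-value: 6.189286329579846e-21

Interpretation: The p-value (about 6.2e-21) is far below 0.05, so we reject the null hypothesis that the three diets produce the same mean weight loss; at least one diet differs from the others. The sample arrays contain 25 observations per diet, so the test has 2 and 72 degrees of freedom. The F-test alone does not show which diets differ, so a post-hoc test is needed for that.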
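To see which diets differ, the ANOVA can be followed by a post-hoc test. The following is a minimal sketch that applies Tukey's HSD to the same three arrays, assuming `pairwise_tukeyhsd` from statsmodels as the implementation:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Same sample weight loss data as in the ANOVA cell
diet_a = np.array([5.2, 7.1, 3.5, 6.2, 4.7, 3.9, 5.8, 6.1, 4.2, 7.8,
                   6.7, 5.5, 4.8, 5.6, 6.4, 5.9, 4.4, 7.2, 4.1, 5.0,
                   5.3, 6.9, 5.7, 7.0, 5.1])
diet_b = np.array([4.5, 3.8, 2.9, 4.1, 5.4, 3.1, 2.5, 3.6, 4.9, 3.3,
                   4.2, 5.6, 3.7, 5.0, 3.9, 4.6, 3.2, 5.3, 4.8, 3.4,
                   2.8, 4.0, 5.1, 3.0, 4.7])
diet_c = np.array([2.1, 1.9, 1.8, 2.6, 2.3, 3.0, 1.7, 1.6, 2.2, 2.9,
                   2.5, 2.8, 2.0, 1.5, 1.8, 2.4, 2.7, 2.1, 2.6, 2.2,
                   1.9, 1.6, 2.3, 2.0, 2.5])

# Stack the values and build a matching array of group labels
weight_loss = np.concatenate([diet_a, diet_b, diet_c])
diet_labels = np.repeat(['A', 'B', 'C'], [len(diet_a), len(diet_b), len(diet_c)])

# Tukey's HSD compares every pair of diets at a familywise alpha of 0.05
tukey = pairwise_tukeyhsd(endog=weight_loss, groups=diet_labels, alpha=0.05)
print(tukey.summary())
```

The `reject` column of the summary shows, for each pair of diets, whether the difference in mean weight loss is significant at the chosen level.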


**Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.**

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe of the data.
# Assumed balanced layout (not specified in the question): 3 programs x 2 experience
# levels x 5 employees per cell, so that all three columns have 30 entries.
data = pd.DataFrame({'program': ['A'] * 10 + ['B'] * 10 + ['C'] * 10,
                     'experience': (['novice'] * 5 + ['experienced'] * 5) * 3,
                     'time': [17, 19, 20, 21, 22, 24, 15, 18, 19, 24,
                              26, 27, 12, 14, 16, 18, 19, 21, 16, 18,
                              19, 20, 22, 23, 14, 15, 16, 21, 23, 25]})

# Conduct two-way ANOVA (Type II sums of squares)
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# The 'F' and 'PR(>F)' columns hold the F-statistic and p-value of each effect
print(anova_table)

**Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.**

In [5]:
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Create a dataframe of the data
data = pd.DataFrame({'group': ['control'] * 50 + ['experimental'] * 50,
                     'score': [78, 81, 74, 85, 69, 82, 73, 80, 76, 77, 78, 83, 79, 84, 75, 72, 77, 83, 70, 81,
                               85, 71, 72, 76, 81, 77, 75, 73, 74, 79, 72, 75, 78, 79, 76, 71, 82, 85, 73, 77,
                               72, 81, 78, 79, 84, 77, 75, 82, 83, 80, 81, 78, 75, 82, 83, 86, 90, 88, 85, 91,
                               87, 88, 90, 89, 91, 92, 85, 87, 89, 93, 92, 90, 86, 87, 91, 92, 88, 90, 91, 93,
                               89, 87, 92, 91, 93, 94, 92, 91, 88, 90, 93, 95, 94, 90, 92, 93, 95, 94, 93, 92]})

# Conduct two-sample t-test
control_scores = data[data['group'] == 'control']['score']
experimental_scores = data[data['group'] == 'experimental']['score']
t, p = ttest_ind(control_scores, experimental_scores)

# Report results
print('Two-sample t-test:')
print('t-statistic:', t)
print('p-value:', p)

# Conduct post-hoc test (Tukey's HSD test) with statsmodels.
# With only two groups there is a single pairwise comparison, so this
# repeats the conclusion of the t-test.
tukey_results = pairwise_tukeyhsd(endog=data['score'], groups=data['group'], alpha=0.05)
print('\nPost-hoc test (Tukey\'s HSD test):')
print(tukey_results)

**Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-
hoc test to determine which store(s) differ significantly from each other.**

In [6]:
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# create the data
store_a_sales = np.random.normal(50, 10, 30)
store_b_sales = np.random.normal(60, 15, 30)
store_c_sales = np.random.normal(55, 12, 30)

sales_data = pd.DataFrame({'Store A': store_a_sales,
                           'Store B': store_b_sales,
                           'Store C': store_c_sales})


In [7]:
f_stat, p_val = f_oneway(sales_data['Store A'], sales_data['Store B'], sales_data['Store C'])
print('F-statistic:', f_stat)
print('p-value:', p_val)


F-statistic: 14.211026617872603
p-value: 4.56588403349219e-06
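The `f_oneway` call above treats the three stores as independent samples. Because the same 30 days are observed for every store, the measurements are paired by day, and a repeated measures analysis is closer to what the question asks for. The following is a minimal sketch, assuming `AnovaRM` from statsmodels, the same simulation parameters as above, and a fixed seed added here so that the run is repeatable; the `day`, `store`, and `sales` column names are hypothetical:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Simulated sales with a fixed seed (an assumption, the original cell is unseeded)
rng = np.random.default_rng(0)
n_days = 30
wide = pd.DataFrame({
    'day': np.arange(n_days),
    'Store A': rng.normal(50, 10, n_days),
    'Store B': rng.normal(60, 15, n_days),
    'Store C': rng.normal(55, 12, n_days),
})

# AnovaRM expects long format: one row per (day, store) combination
long = wide.melt(id_vars='day', var_name='store', value_name='sales')

# Each day is the repeated "subject"; store is the within-subject factor
rm_result = AnovaRM(data=long, depvar='sales', subject='day', within=['store']).fit()
rm_table = rm_result.anova_table
print(rm_table)
```

If the store effect is significant, one possible follow-up is a set of paired t-tests between stores with a Bonferroni adjustment, since Tukey's HSD assumes independent groups.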
