**Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.**


Analysis of Variance (ANOVA) is a statistical technique used to compare means across two or more groups to determine if there are significant differences among them. It makes several assumptions about the data, and violating these assumptions could impact the validity of the ANOVA results. The main assumptions for ANOVA are:

Independence: Observations are assumed to be independent of one another, both within each group and across groups. One measurement should carry no information about any other, which is usually ensured by random sampling or random assignment.

Normality: The residuals (differences between observed values and predicted values) within each group should follow a normal distribution. This assumption is particularly important when the group sizes are small. However, ANOVA is somewhat robust to deviations from normality, especially when sample sizes are large.

Homogeneity of Variance (Homoscedasticity): The variability of observations within each group should be approximately equal across all groups. In other words, the variances of the groups should be similar.

Examples of violations of these assumptions that could impact the validity of ANOVA results:

Independence Violation: If observations are not independent, the analysis treats correlated measurements as separate pieces of evidence (pseudoreplication), which understates the error variance and inflates the Type I error rate. For example, if you're comparing test scores across classes and some students within a class collaborate or influence each other's scores, the assumption of independence is violated.

Normality Violation: If the residuals within groups do not follow a normal distribution, ANOVA results may not be reliable, especially when dealing with small sample sizes. For example, if you're comparing the reaction times of different groups of participants, and the reaction times are heavily skewed or have outliers, the assumption of normality could be violated.

Homoscedasticity Violation: If the variability within groups is not equal, the pooled error term no longer describes every group well. This can lead to increased Type I error rates or decreased power, and the problem is most serious when the group sizes are also unequal. For instance, if you're comparing the heights of individuals from different regions, and the height variability in one region is much larger than in others, the assumption of homoscedasticity might be violated.
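
The normality and equal-variance assumptions can be screened with standard tests. The following is a minimal sketch, assuming SciPy is available; the three groups are hypothetical values chosen so that one group is visibly more spread out than the others. Independence cannot be tested this way and has to be judged from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only).
group_a = np.linspace(9.0, 11.0, 20)
group_b = np.linspace(9.5, 11.5, 20)
group_c = np.linspace(0.0, 20.0, 20)  # deliberately much more spread out
groups = [group_a, group_b, group_c]

# In a one-way ANOVA the residuals are deviations from each group's own mean.
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene's test: a small p-value suggests the group variances are not equal.
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.4g}")
print(f"Levene p-value: {levene_p:.4g}")
```

With these values Levene's test flags the unequal spread of the third group. In practice, plots such as a Q-Q plot of the residuals are a useful complement to the formal tests.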

**Q2. What are the three types of ANOVA, and in what situations would each be used?**


There are three main types of ANOVA: one-way ANOVA, two-way ANOVA, and repeated measures ANOVA. Each is used in different situations to analyze variations in data:

One-Way ANOVA:

Situation: Used when comparing means across three or more independent groups defined by a single categorical factor.

Example: Comparing the effectiveness of three different treatments on patients' recovery times.

Two-Way ANOVA:

Situation: Used when there are two categorical independent variables, and you want to examine their separate (main) effects and their interaction on a continuous dependent variable.

Example: Analyzing the impact of both gender and age group on test scores among students.

Repeated Measures ANOVA:

Situation: Used when measuring the same subjects multiple times under different conditions or treatments.

Example: Assessing changes in heart rate before, during, and after exercise for the same group of participants.

Each type of ANOVA addresses specific research questions and study designs, allowing you to explore variations in data that involve different factors or repeated measurements.


**Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?**

The partitioning of variance in ANOVA refers to breaking down the total variability in the data into components attributed to different sources. In a one-way ANOVA, the total variability splits into a between-group component (differences among group means) and a within-group component (variation of observations around their own group mean). This concept is vital because the F-statistic is the ratio of these two components, each divided by its degrees of freedom: a large between-group share relative to the within-group share is evidence that the group means differ. Understanding the breakdown also shows how much of the total variability each factor accounts for, which speaks to practical relevance as well as statistical significance.
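
For a one-way design, the partition can be written out explicitly. The notation here is an assumption made for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $y_{ij}$ the $j$-th observation in group $i$, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the grand mean.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{\text{total}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{\text{between groups}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{\text{within groups}}
$$

$$
F = \frac{\text{between groups}\,/\,(k-1)}{\text{within groups}\,/\,(N-k)}
$$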

**Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?**

The three quantities are defined below. This question uses SSE for the explained (between-group) sum of squares and SSR for the residual (within-group) sum of squares; many textbooks assign these two abbreviations the other way around, so check the definitions before comparing formulas.

Total Sum of Squares (SST):

SST represents the total variability in the dependent variable (response variable).
It measures the total deviation of individual data points from the grand mean.
SST = SSE (explained variability) + SSR (unexplained variability).

Explained Sum of Squares (SSE):

SSE measures the variability explained by the differences between group means and the grand mean.
It quantifies the extent to which group means differ from the grand mean.
A larger SSE indicates greater differences among group means.

Residual Sum of Squares (SSR):

SSR represents the unexplained variability that remains after accounting for differences in group means.
It measures the variation within each group and indicates how much individual data points deviate from their respective group means.
A larger SSR means more variability within groups that the differences in group means do not explain.

In summary, SST is the total variability in the data, SSE represents the variability explained by group means, and SSR captures the unexplained variability within each group. These components are used to assess the significance of the differences among group means in a one-way ANOVA.
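
The definitions above can be sketched in Python with NumPy alone. The data below are hypothetical values chosen for illustration, and the variable names follow this question's convention (SSE = explained, SSR = residual).

```python
import numpy as np

# Hypothetical data: three groups of four observations each (illustrative only).
groups = {
    "A": np.array([4.0, 5.0, 6.0, 5.0]),
    "B": np.array([7.0, 8.0, 9.0, 8.0]),
    "C": np.array([5.0, 6.0, 7.0, 6.0]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# SST: squared deviations of every observation from the grand mean.
sst = np.sum((all_values - grand_mean) ** 2)

# SSE (explained): group size times squared deviation of each group mean
# from the grand mean.
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())

# SSR (residual): squared deviations of observations from their own group mean.
ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups.values())

# F-statistic: explained mean square divided by residual mean square.
k = len(groups)
n = len(all_values)
f_statistic = (sse / (k - 1)) / (ssr / (n - k))

print(f"SST = {sst:.4f}")
print(f"SSE = {sse:.4f}")
print(f"SSR = {ssr:.4f}")
print(f"F   = {f_statistic:.4f}")
```

The check `SST == SSE + SSR` (up to floating-point rounding) is a quick way to confirm that the partition was computed correctly.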

In [15]:
# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Small balanced example: two factors with two levels each, two observations per cell
data = pd.DataFrame({
    'factor_A': ['A1', 'A1', 'A2', 'A2', 'A1', 'A1', 'A2', 'A2'],
    'factor_B': ['B1', 'B2', 'B1', 'B2', 'B2', 'B1', 'B2', 'B1'],
    'dependent_variable': [10, 12, 15, 13, 14, 11, 16, 12]
})

# Fit one model that contains both main effects and their interaction;
# 'factor_A * factor_B' expands to factor_A + factor_B + factor_A:factor_B
formula = 'dependent_variable ~ factor_A * factor_B'
model = ols(formula, data).fit()
anova_table = sm.stats.anova_lm(model)

# Read every effect from the same table so that each test uses the two-way
# residual error (a separate one-way test per factor would ignore the other factor)
print(f"Main Effect A p-value: {anova_table.loc['factor_A', 'PR(>F)']:.4f}")
print(f"Main Effect B p-value: {anova_table.loc['factor_B', 'PR(>F)']:.4f}")
print(f"Interaction Effect p-value: {anova_table.loc['factor_A:factor_B', 'PR(>F)']:.4f}")



Main Effect A p-value: 0.1338
Main Effect B p-value: 0.2182
Interaction Effect p-value: 0.5655


**Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?**

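With an F-statistic of 5.23 and a p-value of 0.02 from the one-way ANOVA:

Because the p-value (0.02) is below the conventional significance level of 0.05, you reject the null hypothesis that all group means are equal and conclude that at least one group mean differs from the others.

The p-value means that, if all group means were truly equal, an F-statistic of 5.23 or larger would be expected only about 2% of the time.

The F-test does not identify which groups differ or how large the differences are, so a post-hoc test and an effect size measure are needed to complete the interpretation.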
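
The link between the F-statistic, its p-value, and an effect size can be sketched as follows. The question does not state the degrees of freedom, so the values below (3 groups, 17 observations in total) are an assumption chosen only because they are roughly consistent with the reported p-value.

```python
from scipy import stats

f_statistic = 5.23

# Assumed for illustration only: 3 groups and 17 observations in total.
df_between = 2   # number of groups minus 1
df_within = 14   # total observations minus number of groups

# p-value: upper-tail probability of the F distribution at the observed statistic.
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Eta squared (share of total variability explained by group membership),
# recovered from F and the degrees of freedom.
eta_squared = (f_statistic * df_between) / (f_statistic * df_between + df_within)

print(f"p-value: {p_value:.3f}")
print(f"eta squared: {eta_squared:.3f}")
```

Under these assumed degrees of freedom the effect size is large, but with different sample sizes the same F-statistic would correspond to a different p-value and a different eta squared.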


**Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?**


In a repeated measures ANOVA, handling missing data involves choosing a method to deal with data points that are not available. Different methods have distinct consequences:

Listwise Deletion:

Exclude cases with any missing data.

Consequence: Loss of data, reduced power, potential bias if missingness is non-random.

Mean Imputation:

Replace missing values with the mean of available data.

Consequence: Underestimation of variability, bias, distorted relationships.

Last Observation Carried Forward (LOCF):

Use the last observed value for missing data.

Consequence: Distorted temporal trends, potential overestimation of effects.

Multiple Imputation:

Generate multiple datasets with imputed values, combine results.

Consequence: Reflects uncertainty, accurate if assumptions met, computationally intensive.

Model-Based Imputation:

Predict missing values using statistical models.

Consequence: Preserves relationships, depends on model assumptions.

Zero Imputation:

Replace missing values with zero.

Consequence: Distorted distributions, relationships, biased results.

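Weighted Analysis:

Weight the observed cases, for example by the inverse of their estimated probability of being observed, so that they also stand in for similar cases with missing data.

Consequence: Partial mitigation of bias, incomplete solution; depends on how well the missingness is modeled.

Select a method based on how much data is missing, why it is missing, and the assumptions each method makes. Proper handling maintains data integrity and validity of repeated measures ANOVA results.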
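
The consequences of three of these methods can be seen on a small example. This is a minimal sketch using pandas; the wide-format table (one row per subject, one column per time point) and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point.
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 14.0, 13.0],
        "t2": [12.0, np.nan, 13.0, 16.0, 15.0],
        "t3": [15.0, 16.0, np.nan, 19.0, 18.0],
    },
    index=["s1", "s2", "s3", "s4", "s5"],
)

# Listwise deletion: drop every subject with at least one missing value.
listwise = scores.dropna()

# Mean imputation: fill each gap with the observed mean of that time point.
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each subject's previous observation forward along the row.
locf = scores.ffill(axis=1)

print("Subjects kept after listwise deletion:", len(listwise))
print("Variance of t2, observed values only:", scores["t2"].var())
print("Variance of t2 after mean imputation:", mean_imputed["t2"].var())
print("Mean of t3, observed values only:", scores["t3"].mean())
print("Mean of t3 after LOCF:", locf["t3"].mean())
```

Listwise deletion discards two of the five subjects, mean imputation shrinks the variance at `t2`, and LOCF pulls the `t3` mean down because the carried-forward value comes from an earlier, lower time point.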


**Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.**

Common post-hoc tests after ANOVA:

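Tukey's HSD: Compares every pair of group means while controlling the familywise error rate. It assumes equal variances; the Tukey-Kramer version handles unequal sample sizes.

Bonferroni Correction: Controls the familywise error rate by dividing the significance level by the number of comparisons. It is simple but conservative, so it suits a small number of planned pairwise comparisons.

Scheffe's: Controls the error rate for all possible contrasts, not only pairwise ones, at the cost of wider confidence intervals. It suits complex comparisons and unequal sample sizes.

Dunn's: Non-parametric pairwise test based on ranks, typically used after a Kruskal-Wallis test when the normality assumption is not met.

Holm-Bonferroni: Step-down version of Bonferroni that controls the same familywise error rate and is never less powerful than the plain Bonferroni correction.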

Example Situation:
In an educational study comparing the effectiveness of three teaching methods (A, B, C), ANOVA reveals a significant difference. To pinpoint which methods differ, Tukey's HSD might be used. For instance, if Tukey's test reveals that Method A and Method B have significantly different outcomes, you gain insights into which specific methods are better for student learning.
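
A post-hoc step of this kind can be sketched with SciPy alone. Note that this example uses Bonferroni-corrected pairwise t-tests rather than Tukey's HSD, because the correction is easy to follow line by line; the scores for the three teaching methods are hypothetical.

```python
from itertools import combinations
from scipy import stats

# Hypothetical test scores for three teaching methods (illustrative only).
scores = {
    "A": [72, 75, 78, 74, 71, 77, 73, 76],
    "B": [85, 88, 84, 87, 90, 86, 83, 89],
    "C": [74, 76, 73, 78, 75, 72, 77, 74],
}

# Omnibus one-way ANOVA first: is there any difference at all?
f_statistic, p_value = stats.f_oneway(*scores.values())
print(f"ANOVA: F = {f_statistic:.2f}, p = {p_value:.4g}")

# Pairwise follow-up with a Bonferroni correction.
pairs = list(combinations(scores, 2))
alpha = 0.05
results = {}
for first, second in pairs:
    _, raw_p = stats.ttest_ind(scores[first], scores[second])
    # Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1.
    adjusted_p = min(raw_p * len(pairs), 1.0)
    results[(first, second)] = adjusted_p
    verdict = "differ" if adjusted_p < alpha else "do not differ"
    print(f"{first} vs {second}: adjusted p = {adjusted_p:.4g} ({verdict})")
```

With these values, Method B differs from both A and C, while A and C are not distinguishable from each other.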

In [29]:
# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data 
# from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using 
# Python to determine if there are any significant differences between the mean weight loss of the three 
# diets. Report the F-statistic and p-value, and interpret the results

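import numpy as np
import scipy.stats as stats

# Simulated placeholder data: 50 weight-loss values (between 0 and 4) per diet, all drawn
# from the same uniform distribution with no fixed seed, so every run gives different numbers.
# The question describes 50 participants in total; replace these arrays with the real measurements.
diet_A = np.random.rand(50) * 4
diet_B = np.random.rand(50) * 4
diet_C = np.random.rand(50) * 4

# One-way ANOVA: the null hypothesis is that all three diets have the same mean weight loss
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Compare the p-value with the 0.05 significance level
if p_value < 0.05:
    print("There are significant differences in mean weight loss between the diet groups.")
else:
    print("There are no significant differences in mean weight loss between the diet groups.")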



F-statistic: 3.356140004737559
p-value: 0.037561645884032616
There are significant differences in mean weight loss between the diet groups.


In [37]:
# Q10. A company wants to know if there are any significant differences in the average time it takes to
# complete a task using three different software programs: Program A, Program B, and Program C. They
# randomly assign 30 employees to one of the programs and record the time it takes each employee to
# complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
# interaction effects between the software programs and employee experience level (novice vs. experienced).
# Report the F-statistics and p-values, and interpret the results.

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Simulated placeholder data (no fixed seed): 90 rows, 15 per Software x Experience cell.
# The question describes 30 employees; replace this frame with the real measurements.
data = pd.DataFrame({
    'Software': ['A', 'B', 'C'] * 30,
    'Experience': ['Novice'] * 45 + ['Experienced'] * 45,
    'Time': np.random.randint(10, 60, size=90)
})

# Two main effects plus their interaction ('Software:Experience')
formula = 'Time ~ Software + Experience + Software:Experience'
model = ols(formula, data).fit()
anova_table = anova_lm(model)

print(anova_table)

# Look up p-values by row label rather than by position
p_values = anova_table['PR(>F)']

if p_values['Software'] < 0.05:
    print("There is a significant main effect of Software.")
else:
    print("There is no significant main effect of Software.")

if p_values['Experience'] < 0.05:
    print("There is a significant main effect of Experience.")
else:
    print("There is no significant main effect of Experience.")

if p_values['Software:Experience'] < 0.05:
    print("There is a significant interaction effect between Software and Experience.")
else:
    print("There is no significant interaction effect between Software and Experience.")


                       df        sum_sq     mean_sq         F    PR(>F)
Software              2.0    277.355556  138.677778  0.508296  0.603358
Experience            1.0    187.777778  187.777778  0.688263  0.409107
Software:Experience   2.0    145.088889   72.544444  0.265898  0.767160
Residual             84.0  22917.600000  272.828571       NaN       NaN
There is no significant main effect of Software.
There is no significant main effect of Experience.
There is no significant interaction effect between Software and Experience.


In [35]:
# Q11. An educational researcher is interested in whether a new teaching method improves student test
# scores. They randomly assign 100 students to either the control group (traditional teaching method)
# or the experimental group (new teaching method) and administer a test at the end of the semester.
# Conduct a two-sample t-test using Python to determine if there are any significant differences in
# test scores between the two groups. If the results are significant, follow up with a post-hoc test
# to determine which group(s) differ significantly from each other.

import numpy as np
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated placeholder scores (50 to 99), 100 per group, with no fixed seed.
# The question describes 100 students in total; replace these arrays with the real scores.
control_group = 50 + np.random.randint(0, 50, size=100)
experimental_group = 50 + np.random.randint(0, 50, size=100)

# Independent two-sample t-test (assumes equal variances by default)
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

print("t-statistic:", t_statistic)
print("p-value:", p_value)

if p_value < 0.05:
    print("There is a significant difference in test scores between the two groups.")
    # With only two groups the t-test already shows which groups differ;
    # Tukey's HSD is run here only because the question asks for a post-hoc step.
    data = np.concatenate((control_group, experimental_group))
    group_labels = ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group)
    tukey_results = pairwise_tukeyhsd(data, group_labels, alpha=0.05)
    print(tukey_results)
else:
    print("There is no significant difference in test scores between the two groups.")



t-statistic: -0.5961353074623413
p-value: 0.5517657284161626
There is no significant difference in test scores between the two groups.


In [36]:
# Q12. A researcher wants to know if there are any significant differences in the average daily sales of
# three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales
# for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there
# are any significant differences in sales between the three stores. If the results are significant,
# follow up with a post-hoc test to determine which store(s) differ significantly from each other.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated placeholder data in long format: one row per (Day, Store) pair
data = pd.DataFrame({
    'Day': list(range(1, 31)) * 3,
    'Store': ['Store A'] * 30 + ['Store B'] * 30 + ['Store C'] * 30,
    'Sales': np.random.randint(50, 150, size=90)  # Replace with actual data
})

# Note: this model is a one-way (between-groups) ANOVA on Store. It does not use the
# Day column, so it does not account for the repeated measurements taken on each day.
anova_results = sm.stats.anova_lm(sm.OLS.from_formula('Sales ~ Store', data=data).fit())

print(anova_results)

# Look up the p-value by row label rather than by position
if anova_results.loc['Store', 'PR(>F)'] < 0.05:
    tukey_results = pairwise_tukeyhsd(data['Sales'], data['Store'], alpha=0.05)
    print(tukey_results)


            df        sum_sq     mean_sq         F    PR(>F)
Store      2.0   1710.288889  855.144444  1.081952  0.343449
Residual  87.0  68762.333333  790.371648       NaN       NaN
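
Because the same 30 days are observed for every store, a repeated measures analysis treats each day as the repeated "subject" and removes day-to-day variability from the error term. The following is a minimal sketch of that calculation done by hand with NumPy and SciPy on a small hypothetical table (5 days, 3 stores); statsmodels also provides an `AnovaRM` class for this design.

```python
import numpy as np
from scipy import stats

# Hypothetical sales: rows are days (the repeated "subjects"),
# columns are Store A, Store B, Store C.
sales = np.array([
    [100.0, 110.0, 105.0],
    [90.0, 102.0, 96.0],
    [120.0, 128.0, 125.0],
    [80.0, 93.0, 85.0],
    [110.0, 119.0, 116.0],
])
n_days, n_stores = sales.shape
grand_mean = sales.mean()

# Partition the total variability into store, day, and error components.
ss_total = np.sum((sales - grand_mean) ** 2)
ss_stores = n_days * np.sum((sales.mean(axis=0) - grand_mean) ** 2)
ss_days = n_stores * np.sum((sales.mean(axis=1) - grand_mean) ** 2)
ss_error = ss_total - ss_stores - ss_days

# Repeated measures F-test for the store effect.
df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)
f_rm = (ss_stores / df_stores) / (ss_error / df_error)
p_rm = stats.f.sf(f_rm, df_stores, df_error)

# For comparison: a one-way ANOVA that ignores which day each value came from.
f_oneway, p_oneway = stats.f_oneway(*sales.T)

print(f"Repeated measures: F = {f_rm:.2f}, p = {p_rm:.4g}")
print(f"One-way (ignores days): F = {f_oneway:.2f}, p = {p_oneway:.4g}")
```

In this example most of the variability comes from differences between days, so the one-way analysis fails to detect store differences that the repeated measures analysis picks up clearly. This sketch does not check the sphericity assumption, which a full analysis would need to address.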
