#### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Assumptions required to use ANOVA are:

1. Normality: The data for each group should be normally distributed. This means that the distribution of the data should be bell-shaped and symmetric.

2. Homogeneity of variances: The variances of the groups should be equal. This means that the spread of the data should be similar across all groups.

3. Independence: The observations within each group should be independent of each other. This means that the value of one observation should not be related to the value of another observation within the same group.

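4. No extreme outliers: ANOVA is sensitive to extreme values because they distort group means and inflate variances. Outliers should be investigated, and removed only when they are errors, rather than discarded automatically.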

Violations of these assumptions can impact the validity of the ANOVA results. Here are some examples of violations and their impact:

1. Violation of normality: If the data is not normally distributed, ANOVA may not be appropriate and may lead to incorrect conclusions. For example, if the data is skewed or has outliers, this may violate the normality assumption. In such cases, a transformation of the data may be necessary to achieve normality.

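2. Violation of homogeneity of variances: If the group variances are not equal (the "equal variance" or homoscedasticity assumption fails), the F-test used in ANOVA can be biased, and its type I error rate is affected most when the group sizes are also unequal. For example, the scores in one group may be far more spread out than in the others. In such cases, an alternative such as Welch's ANOVA, which does not assume equal variances, may be more appropriate.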

3. Violation of independence: If the observations within each group are not independent, ANOVA may not be appropriate and may lead to incorrect conclusions. For example, if there is clustering or dependence between the observations, this may violate the independence assumption. In such cases, a different statistical method that accounts for the dependence may be necessary.

Overall, it is important to check the assumptions before using ANOVA to ensure that the results are valid and reliable. If the assumptions are violated, appropriate corrective measures should be taken to ensure that the results are accurate and meaningful.
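As a rough illustration of how the normality and equal-variance assumptions might be checked before running ANOVA, the sketch below applies the Shapiro-Wilk test to each group and Levene's test across groups. The data are simulated and the group sizes, means, and seed are assumptions chosen only for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: three groups of 30 observations with equal spread
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (10.0, 11.0, 12.0)]

# Normality: Shapiro-Wilk test per group (null hypothesis: the data are normal)
shapiro_p = [stats.shapiro(g).pvalue for g in groups]

# Homogeneity of variances: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_p, 3))
print("Levene p-value:", round(float(levene_p), 3))
```

Small p-values in either test would suggest the corresponding assumption is questionable. Independence cannot be tested this way; it has to come from the study design.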

#### Q2. What are the three types of ANOVA, and in what situations would each be used?

The three types of ANOVA are:

1. One-way ANOVA: This is used when there is one independent variable with three or more levels or groups. The aim is to determine whether there are any significant differences between the means of the groups. For example, a one-way ANOVA could be used to compare the mean exam scores of students who have been taught by three different teachers.

2. Two-way ANOVA: This is used when there are two independent variables (factors), each with two or more levels or groups. The aim is to determine whether each factor has a significant main effect and whether there is an interaction effect between the two factors. For example, a two-way ANOVA could be used to examine the effects of teaching method and gender on exam scores, including whether the effect of teaching method differs by gender.

3. MANOVA (Multivariate ANOVA): This is used when there are two or more dependent variables and one or more independent variables. The aim is to determine whether there are any significant differences between the means of the groups on the dependent variables. For example, a MANOVA could be used to compare the means of several different personality traits (dependent variables) between two different groups of people (independent variable).



#### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the process of decomposing the total variance in a dataset into different sources of variation. ANOVA partitions the total variance into two components: the variance between groups and the variance within groups. This partitioning allows us to determine the extent to which the differences between groups are due to the independent variable being tested or due to chance.

The partitioning of variance is important to understand because it helps us to:

1. Determine the significance of the independent variable: By comparing the variance between groups to the variance within groups, we can determine whether the differences between groups are statistically significant. If the variance between groups is much larger than the variance within groups, it suggests that the independent variable has a significant effect on the dependent variable.

2. Estimate effect sizes: By calculating the proportion of variance accounted for by the independent variable (i.e., the ratio of the between-groups variance to the total variance), we can estimate the effect size of the independent variable. Larger effect sizes suggest that the independent variable has a stronger impact on the dependent variable.

3. Identify potential sources of error: By examining the variance within groups, we can identify potential sources of error that may be contributing to the variability in the data. This can help us to refine our experimental design and control for sources of error in future studies.
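The decomposition described above can be written out as a worked identity. The notation is an assumption for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, group means $\bar{x}_i$, and grand mean $\bar{x}$.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{x}_i-\bar{x}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_i\right)^2}_{SS_{\text{within}}}
$$

$$
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)},
\qquad
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}
$$

Here $F$ is the statistic used to judge significance, and $\eta^2$ is the proportion of total variance accounted for by the independent variable.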

#### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [2]:
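import numpy as np
import scipy.stats as stats

# Simulated data: three groups of 20 observations each
group1 = np.random.normal(10, 3, 20)
group2 = np.random.normal(15, 3, 20)
group3 = np.random.normal(20, 3, 20)

data = np.concatenate([group1, group2, group3])

# Total sum of squares: squared deviations of every observation from the grand mean
overall_mean = np.mean(data)
SST = np.sum((data - overall_mean) ** 2)

# Explained (between-group) sum of squares: group size times the squared deviation
# of each group mean from the grand mean (all groups have the same size here)
group_means = np.array([np.mean(group1), np.mean(group2), np.mean(group3)])
SSE = np.sum((group_means - overall_mean) ** 2 * len(group1))

# Residual (within-group) sum of squares: what is left after removing the explained part
SSR = SST - SSE

print("Total sum of squares (SST):", SST)
print("Explained sum of squares (SSE):", SSE)
print("Residual sum of squares (SSR):", SSR)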

Total sum of squares (SST): 1704.6681732426407
Explained sum of squares (SSE): 1169.5356011986323
Residual sum of squares (SSR): 356.4492088540144
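One way to sanity-check the sums of squares is to compute the residual term directly from the within-group deviations and confirm that the F-statistic built from these pieces agrees with `scipy.stats.f_oneway`. The sketch below does this on freshly simulated data; the seed and the lowercase variable names are assumptions for the example, not part of the cell above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the check is reproducible
groups = [rng.normal(loc=mu, scale=3.0, size=20) for mu in (10.0, 15.0, 20.0)]
data = np.concatenate(groups)
overall_mean = data.mean()

# Total, explained (between-group), and residual (within-group) sums of squares
sst = np.sum((data - overall_mean) ** 2)
sse = sum(len(g) * (g.mean() - overall_mean) ** 2 for g in groups)
ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups)  # computed directly

# F = (explained mean square) / (residual mean square)
k, n_total = len(groups), len(data)
f_manual = (sse / (k - 1)) / (ssr / (n_total - k))
f_scipy, p_scipy = stats.f_oneway(*groups)

print("SST equals SSE + SSR:", np.isclose(sst, sse + ssr))
print("Manual F matches f_oneway:", np.isclose(f_manual, f_scipy))
```

Computing the residual term directly, rather than by subtraction, makes the identity SST = SSE + SSR a genuine check instead of something true by construction.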


#### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [36]:
import pandas as pd
import numpy as np
import scipy.stats as stats

np.random.seed(1234)

factor1 = ['A', 'B']
factor2 = ['X', 'Y', 'Z']
n = 10

data = []
for f1 in factor1:
    for f2 in factor2:
        group_data = np.random.normal(loc=10, scale=2, size=n)
        data.extend(zip([f1]*n, [f2]*n, group_data))

        
df = pd.DataFrame(data, columns=['Factor1', 'Factor2', 'Data'])


# scipy's f_oneway runs a single one-way ANOVA. Treating the six Factor1 x Factor2
# cells as six groups gives one omnibus F-test; it does not separate the main
# effects of Factor1 and Factor2 from their interaction.
cells = [
    df.loc[(df['Factor1'] == f1) & (df['Factor2'] == f2), 'Data']
    for f1 in factor1
    for f2 in factor2
]
result = stats.f_oneway(*cells)

# Print the results
print(f"Omnibus F-statistic: {result.statistic}")
print(f"p-value: {result.pvalue}")

Omnibus F-statistic: 0.8842163578598713
p-value: 0.4980872026179599




#### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

The F-statistic is the ratio of the between-group mean square to the within-group mean square. A value of 5.23 means the variability between the group means is about five times what would be expected from the within-group variability alone.

Because the p-value of 0.02 is below the conventional 0.05 significance level, we reject the null hypothesis that all group means are equal. The p-value means that, if the group means were actually equal, an F-statistic at least this large would occur about 2% of the time. The test shows only that at least one group mean differs from the others; it does not say which groups differ, so a post-hoc test is needed to locate the differences.
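To make the link between an F-statistic and its p-value concrete, the sketch below evaluates the upper tail of the F distribution. The question does not state the degrees of freedom, so the values used here (3 groups, 30 observations in total) are purely hypothetical, and the resulting p-value will not necessarily equal 0.02.

```python
from scipy import stats

f_stat = 5.23
df_between, df_within = 2, 27  # hypothetical: 3 groups, 30 observations in total

# p-value: probability of an F at least this large when all group means are equal
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value: the F that must be exceeded to reject at alpha = 0.05
f_crit = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = 0.05: {f_crit:.2f}")
```

The null hypothesis is rejected when the observed F exceeds the critical value, which is equivalent to the p-value falling below 0.05.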

#### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

There are a few main ways to handle missing data in a repeated measures ANOVA:

1. Listwise deletion - Delete any cases with any missing data. This is the simplest method but can significantly reduce sample size and statistical power.

2. Pairwise deletion - Delete cases only for the specific analysis they are missing data for. This retains more data but can create non-independence between comparisons.

3. Imputation - Replace missing values with estimates based on the other data. Simple imputation uses means; regression-based or multiple imputation is more sophisticated. This retains the full sample size, but single imputation understates the uncertainty in the data.

The potential consequences of these different methods include:

1. Reduced sample size and statistical power with listwise deletion.

2. Increased type I error rate (false positives) with pairwise deletion due to non-independence.

3. Underestimated variability with simple imputation: filling gaps with means or regression predictions shrinks standard errors, which can inflate the type I error rate (false positives) unless the added uncertainty is accounted for, for example with multiple imputation.
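The sketch below illustrates two of these options on a small, made-up wide-format table (one row per subject, one column per time point). The column names and values are hypothetical. It shows how many subjects survive listwise deletion and how mean imputation changes the spread of a column.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated-measures data: rows are subjects, columns are time points
wide = pd.DataFrame({
    "t1": [5.0, 6.0, 7.0, 8.0, 9.0],
    "t2": [5.5, np.nan, 7.5, 8.5, 9.5],
    "t3": [6.0, 6.5, np.nan, 9.0, 10.0],
})

# Listwise deletion: keep only subjects with complete data
listwise = wide.dropna()

# Mean imputation: fill each gap with the mean of its column
mean_imputed = wide.fillna(wide.mean())

print("Subjects kept after listwise deletion:", len(listwise))
print("Std of t2 before imputation:", round(wide["t2"].std(), 3))
print("Std of t2 after mean imputation:", round(mean_imputed["t2"].std(), 3))
```

Listwise deletion loses two of the five subjects here, while mean imputation keeps everyone but reduces the standard deviation of the imputed column, because the filled-in values sit exactly at the mean.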

#### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Some common post-hoc tests used after ANOVA are:

1. Tukey HSD - Used when all group comparisons are of interest. Compares all possible pairs of groups.  
    Example: Comparing the efficacy of 4 different drug treatments.

2. Bonferroni correction - Makes multiple comparisons more stringent by adjusting the alpha level. Less prone to type I error but lower power.
    Example: Evaluating differences between 10 classroom interventions.  

3. Fisher's LSD - The most liberal test. Higher probability of type I error but higher power. Generally recommended only after a significant omnibus F-test and when the number of groups is small.
    Example: Determining which of 5 strains of bacteria grew the most in nutrient broth.

4. Dunnett's test - Compares multiple treatment groups to a single control group.    
    Example: Testing 3 drug doses against a placebo.

A post-hoc test is necessary when:

1) The ANOVA detects a statistically significant difference between at least two groups but does not indicate which specific groups differ. Post-hoc tests identify where the significant differences lie.

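2) The pairwise comparisons were not specified before the experiment. Comparisons planned in advance are a priori contrasts and can be tested directly; post-hoc tests cover unplanned comparisons, and their multiplicity adjustment guards against data dredging.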

For example, say an experiment compares 4 drug treatments and finds a significant difference in efficacy via ANOVA (F=3.44, p = 0.04). Post-hoc tests are then run to determine specifically which drug treatments differ significantly in efficacy. This provides more informative results to guide further research.

The choice of post-hoc test depends on your planned comparisons, number of groups, willingness to adjust alpha levels, and risk tolerance for type I and type II errors. Reporting multiple post-hoc tests can provide greater insight into the robustness of the findings.
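As an illustration of the Bonferroni approach, the sketch below runs every pairwise t-test among four simulated drug treatments and adjusts the p-values with `statsmodels`' `multipletests`. The treatment names, means, sample sizes, and seed are all hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(7)

# Hypothetical outcome scores for four drug treatments, 15 patients each
treatments = {
    name: rng.normal(loc=mu, scale=2.0, size=15)
    for name, mu in [("drug1", 10.0), ("drug2", 10.5), ("drug3", 12.0), ("drug4", 13.0)]
}

# All 6 pairwise comparisons, each with an unadjusted two-sample t-test
pairs = list(combinations(treatments, 2))
raw_p = [stats.ttest_ind(treatments[a], treatments[b]).pvalue for a, b in pairs]

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, reject = {r}")
```

With six comparisons, each raw p-value must fall below 0.05 / 6 to be declared significant, which is why Bonferroni becomes conservative as the number of groups grows.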

#### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [40]:
import numpy as np
import scipy.stats as stats

np.random.seed(1234)
weight_loss_a = np.random.normal(loc=5.0, scale=2.0, size=50)
weight_loss_b = np.random.normal(loc=6.0, scale=2.0, size=50)
weight_loss_c = np.random.normal(loc=7.0, scale=2.0, size=50)

weight_loss = np.concatenate([weight_loss_a, weight_loss_b, weight_loss_c])

groups = np.concatenate([
    np.repeat('A', 50),
    np.repeat('B', 50),
    np.repeat('C', 50)
])

f_stat, p_value = stats.f_oneway(weight_loss_a, weight_loss_b, weight_loss_c)

print(f"F-statistic: {f_stat:.2f}")
print(f"p-value: {p_value:.4f}")

F-statistic: 13.57
p-value: 0.0000


Interpretation: Since the p-value is below 0.05 (it prints as 0.0000, i.e. p < 0.0001), we reject the null hypothesis that the three diets have equal mean weight loss (F = 13.57) and conclude that at least one diet is associated with a different mean weight loss than the others. The ANOVA does not show which specific diets differ from each other; post-hoc tests are needed to determine this. Note that the simulation above uses 50 participants per diet.
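A possible follow-up is Tukey's HSD, which compares every pair of diets while controlling the family-wise error rate. The sketch below regenerates the simulated data with the same seed and parameters as the cell above. Choosing Tukey's HSD here is an assumption; other post-hoc tests would also be reasonable.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

np.random.seed(1234)  # same seed and parameters as the ANOVA cell
weight_loss_a = np.random.normal(loc=5.0, scale=2.0, size=50)
weight_loss_b = np.random.normal(loc=6.0, scale=2.0, size=50)
weight_loss_c = np.random.normal(loc=7.0, scale=2.0, size=50)

# Stack the observations and label each one with its diet
weight_loss = np.concatenate([weight_loss_a, weight_loss_b, weight_loss_c])
groups = np.repeat(["A", "B", "C"], 50)

# One row per pair of diets: mean difference, adjusted p-value, confidence interval
tukey = pairwise_tukeyhsd(weight_loss, groups, alpha=0.05)
print(tukey)
```

Each row with `reject = True` marks a pair of diets whose mean weight loss differs at the 0.05 family-wise level.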

#### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [42]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

np.random.seed(1234)
times_a_novice = np.random.normal(loc=10.0, scale=2.0, size=30)
times_a_exp = np.random.normal(loc=8.0, scale=2.0, size=30)
times_b_novice = np.random.normal(loc=12.0, scale=2.0, size=30)
times_b_exp = np.random.normal(loc=9.0, scale=2.0, size=30)
times_c_novice = np.random.normal(loc=14.0, scale=2.0, size=30)
times_c_exp = np.random.normal(loc=11.0, scale=2.0, size=30)

data = pd.DataFrame({
    'time': np.concatenate([
        times_a_novice, times_a_exp,
        times_b_novice, times_b_exp,
        times_c_novice, times_c_exp
    ]),
    'program': np.repeat(['A', 'B', 'C'], 60),
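    # Caution: np.tile alternates novice/experienced row by row, whereas the times
    # above are stacked in blocks of 30 (a novice block, then an experienced block).
    # np.tile(np.repeat(['novice', 'experienced'], 30), 3) would match the blocks.
    'experience': np.tile(['novice', 'experienced'], 90)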
})

model = ols('time ~ program * experience', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

                         sum_sq     df          F        PR(>F)
program              372.987890    2.0  30.981361  3.093207e-12
experience            11.313202    1.0   1.879409  1.721670e-01
program:experience    24.338887    2.0   2.021652  1.355356e-01
Residual            1047.402228  174.0        NaN           NaN


Interpretation: There is a significant main effect of software program (F = 30.98, p < 0.001), indicating that the mean task completion time differs for at least one pair of programs. The main effect of experience level is not significant in this table (F = 1.88, p = 0.17), and neither is the program-by-experience interaction (F = 2.02, p = 0.14), so this output gives no evidence that the effect of program depends on experience. Note, however, that the `experience` labels in the code alternate row by row while the simulated times are stacked in blocks of 30, so the labels do not correspond to the novice and experienced cells that were simulated. The experience and interaction results should be read with that in mind.

#### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [44]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

np.random.seed(1234)
scores_control = np.random.normal(loc=70.0, scale=10.0, size=50)
scores_experimental = np.random.normal(loc=75.0, scale=10.0, size=50)

data = pd.DataFrame({
    'score': np.concatenate([scores_control, scores_experimental]),
    'group': np.repeat(['control', 'experimental'], 50)
})

control_scores = data.loc[data['group'] == 'control', 'score']
experimental_scores = data.loc[data['group'] == 'experimental', 'score']
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)

print('t-statistic:', t_statistic)
print('p-value:', p_value)

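# With only two groups, Tukey HSD reduces to the two-sample t-test above (same
# p-value), so it adds no new information; post-hoc tests matter when there are
# three or more groups.
tukey_results = pairwise_tukeyhsd(data['score'], data['group'])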

print('Tukey HSD results:')
print(tukey_results)

t-statistic: -2.0949294926210147
p-value: 0.03875602556376061
Tukey HSD results:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower upper  reject
--------------------------------------------------------
control experimental   4.2108 0.0388 0.222 8.1996   True
--------------------------------------------------------


#### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [50]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

np.random.seed(1234)
sales_a = np.random.normal(loc=5000.0, scale=1000.0, size=30)
sales_b = np.random.normal(loc=5500.0, scale=1000.0, size=30)
sales_c = np.random.normal(loc=6000.0, scale=1000.0, size=30)

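# Caution: sales are stacked store by store (30 values for A, then B, then C),
# but the labels below cycle A, B, C on every row and repeat each day three times,
# so the store labels do not line up with the simulated stores. Labels that match
# the stacking would be np.tile(np.arange(1, 31), 3) for 'day' and
# np.repeat(['A', 'B', 'C'], 30) for 'store'.
data = pd.DataFrame({
    'sales': np.concatenate([sales_a, sales_b, sales_c]),
    'day': np.repeat(np.arange(1, 31), 3),
    'store': np.tile(['A', 'B', 'C'], 30)
})

# AnovaRM(data, depvar, subject, within): here 'store' is passed as the subject and
# 'day' as the within factor, so the table below tests the effect of day. Testing
# stores across days would use subject='day' and within=['store'].
model = AnovaRM(data, 'sales', 'store', within=['day'])
results = model.fit()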


print(results.summary())


tukey_results = pairwise_tukeyhsd(data['sales'], data['store'])


print(tukey_results)

              Anova
    F Value  Num DF  Den DF Pr > F
----------------------------------
day  0.9531 29.0000 58.0000 0.5446

   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
group1 group2  meandiff p-adj    lower     upper   reject
---------------------------------------------------------
     A      B -315.4651 0.5205 -1003.2059 372.2758  False
     A      C  178.1571 0.8108  -509.5838 865.8979  False
     B      C  493.6221 0.2067  -194.1187 1181.363  False
---------------------------------------------------------
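In the output above, the ANOVA row is labelled `day`, which means the test was run on the day factor rather than on store. A minimal sketch of a store-focused version follows, treating each day as the repeated "subject" and store as the within factor, with labels aligned to the way the sales are stacked. The use of `default_rng` and of Bonferroni-adjusted paired t-tests as the follow-up are assumptions for this example; paired tests are used because the three stores are measured on the same days.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(1234)
n_days = 30
store_means = {"A": 5000.0, "B": 5500.0, "C": 6000.0}

# Sales are stacked store by store, so 'store' repeats in blocks and 'day' cycles
data = pd.DataFrame({
    "sales": np.concatenate(
        [rng.normal(loc=mu, scale=1000.0, size=n_days) for mu in store_means.values()]
    ),
    "day": np.tile(np.arange(1, n_days + 1), len(store_means)),
    "store": np.repeat(list(store_means), n_days),
})

# Each day is measured once per store: day is the subject, store the within factor
results = AnovaRM(data, depvar="sales", subject="day", within=["store"]).fit()
print(results.anova_table)

# Follow-up: paired t-tests between stores with a Bonferroni adjustment
wide = data.pivot(index="day", columns="store", values="sales")
pairs = list(combinations(wide.columns, 2))
for a, b in pairs:
    p = stats.ttest_rel(wide[a], wide[b]).pvalue
    print(f"{a} vs {b}: adjusted p = {min(p * len(pairs), 1.0):.4f}")
```

The ANOVA table should now contain a single `store` row with 2 and 58 degrees of freedom.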
