# Q1

Assumptions of ANOVA:

1. Independence: Observations should be independent of one another, both within and across groups. No data point should influence or be related to another. Violations arise when observations are correlated (for example, repeated measurements on the same subject or clustered sampling), which distorts the standard errors and the resulting F-test.

2. Normality: The residuals (the differences between observed values and the values predicted by the model, i.e., the group means) within each group should follow a normal distribution. This assumption matters most for small samples; with larger samples, ANOVA is fairly robust to moderate departures from normality. Severe departures can lead to inaccurate p-values and incorrect conclusions about the significance of group differences.

3. Homogeneity of Variance (Homoscedasticity): The variances of the residuals within each group should be roughly equal. In other words, the spread of data points should be consistent across all groups. If there's a significant difference in the variability of the groups, the results of ANOVA can be compromised. This can be tested using techniques like Levene's test or the Bartlett test.

4. Homogeneity of Regression Slopes (if applicable): If you're performing an ANOVA with a covariate (e.g., ANCOVA), the relationship between the dependent variable and the covariate should be similar across all groups. If the slopes differ between groups (a group-by-covariate interaction), this assumption is violated and the adjusted group means become hard to interpret.

5. Equal Group Sizes (for one-way ANOVA): Roughly equal group sizes are preferred rather than strictly required. A balanced design maximizes statistical power and makes the F-test more robust to unequal variances; with unbalanced groups, violations of homoscedasticity distort the results more strongly.

6. Random Sampling: The data should be collected through a random sampling process, ensuring that the observations are representative of the broader population. This assumption is important for making valid inferences about the population.

Examples of Violations and Impacts:

    Non-Normality: If the residuals within each group do not follow a normal distribution, the p-values and confidence intervals obtained from ANOVA might be inaccurate. This can lead to incorrect conclusions about group differences.

    Heteroscedasticity: Unequal variances across groups can lead to inflated Type I error rates (incorrectly rejecting the null hypothesis) and reduced statistical power. The F-test in ANOVA assumes equal variances, so violations can compromise the validity of the results.

    Independence Violation: Correlation among observations within groups can lead to biased results. For example, in a study involving repeated measurements on the same subjects, treating the measurements as independent understates the standard errors and inflates the Type I error rate.

    Unequal Group Sizes (for one-way ANOVA): Unequal group sizes can affect the precision of the results. Smaller groups might have less statistical power to detect true differences between means.
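
The normality and homogeneity-of-variance assumptions above can be checked before running the ANOVA. The following is a minimal sketch using `scipy.stats`; the three groups are simulated, hypothetical data, and the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 observations each (assumption, not from the text)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=52, scale=5, size=30)
group_c = rng.normal(loc=55, scale=5, size=30)
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk test on the residuals (each observation minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test (median-centered by default) and Bartlett's test
levene_stat, levene_p = stats.levene(*groups)
bartlett_stat, bartlett_p = stats.bartlett(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value:       {levene_p:.3f}")
print(f"Bartlett p-value:     {bartlett_p:.3f}")
```

A small p-value in any of these tests suggests the corresponding assumption may not hold. Levene's test is usually the safer choice when normality itself is in doubt, because Bartlett's test is sensitive to non-normal data.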

# Q2

There are three common types of ANOVA:

1. One-Way ANOVA:

Example: Imagine you are studying the effect of different teaching methods on student test scores. You have three groups: Group A with traditional teaching, Group B with online teaching, and Group C with interactive teaching. One-way ANOVA can be used to determine if there are significant differences in test scores among these teaching methods.

2. Two-Way ANOVA:

Example: Suppose you are investigating the effects of both teaching method (traditional, online, interactive) and gender (male, female) on student test scores. Two-way ANOVA can help you determine if there are main effects of teaching method and gender, as well as whether the interaction between these two factors has a significant impact on test scores.

3. Repeated Measures ANOVA:

Example: You are conducting a study to examine the effect of a new drug on blood pressure. You measure the blood pressure of each participant before taking the drug, after one week of taking the drug, and after two weeks of taking the drug. Repeated measures ANOVA can help you determine if there are significant differences in blood pressure across the different time points.
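
The repeated measures example can be sketched with `AnovaRM` from statsmodels. The blood pressure values below are simulated, and the column names (`subject`, `time`, `bp`) and the assumed time-point means are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical study: 12 participants, each measured at three time points
rng = np.random.default_rng(1)
n_subjects = 12
time_means = {"before": 140.0, "week1": 135.0, "week2": 131.0}  # assumed values
subject_offsets = rng.normal(0, 6, size=n_subjects)  # stable per-person differences

rows = []
for subject in range(n_subjects):
    for time, mean_bp in time_means.items():
        bp = mean_bp + subject_offsets[subject] + rng.normal(0, 3)
        rows.append({"subject": subject, "time": time, "bp": bp})
long_df = pd.DataFrame(rows)  # one row per subject and time point

# Within-subject factor: time. Each subject must have exactly one value per level.
rm_result = AnovaRM(long_df, depvar="bp", subject="subject", within=["time"]).fit()
print(rm_result.anova_table)
```

`AnovaRM` requires complete, balanced data, which is one reason missing values need special handling in repeated measures designs.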

# Q3

The partitioning of variance in ANOVA refers to the process of breaking down the total variability observed in a dataset into different components that can be attributed to various sources of variation. This concept is fundamental to understanding the sources of variation and their contributions to the overall variability in the data. ANOVA achieves this partitioning by decomposing the total sum of squares (SS) into several components, including the sum of squares between groups (SSB), the sum of squares within groups (SSW), and sometimes the sum of squares due to interactions (if applicable).
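
For a one-way design, the decomposition described above can be written out as follows. The notation is an assumption introduced here for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the grand mean.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2}_{SST}
= \underbrace{\sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2}_{SSB}
+ \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}_{SSW}
$$

$$
F = \frac{SSB / (k - 1)}{SSW / (N - k)}
$$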

The partitioning of variance is important for several reasons:

1. Source of Variation: By breaking down the total variance into different components, ANOVA helps you identify which sources of variation contribute significantly to the differences between groups. This allows you to attribute variability to specific factors or treatments under investigation.

2. Hypothesis Testing: ANOVA uses the partitioning of variance to perform hypothesis tests that compare the variation between groups to the variation within groups. This is essential for determining whether the observed differences in means among groups are statistically significant or simply due to random chance.

3. F-Statistic Calculation: The partitioning of variance is used to calculate the F-statistic, which is the ratio of the between-group mean square (SSB divided by its degrees of freedom) to the within-group mean square (SSW divided by its degrees of freedom). The F-statistic is then compared to a critical value to determine whether the group means differ significantly.

4. Interpretation of Results: Understanding the partitioning of variance allows researchers to interpret ANOVA results more effectively. You can determine the proportion of variance explained by the independent variable (or factors) and compare it to the proportion that remains unexplained.

5. Design and Analysis Improvement: If the partitioning of variance reveals that a significant proportion of the total variance is due to a particular factor or interaction, it suggests that this factor has a substantial effect on the dependent variable. This insight can guide further research, experimental design adjustments, or exploration of potential covariates.

6. Assumptions Checking: The partitioning of variance highlights the role of variance components, helping researchers identify potential violations of ANOVA assumptions, such as homoscedasticity or normality. If the partitioned variance components are not as expected, it may signal the need for further investigation or alternative analysis methods.

# Q4


In [7]:
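import numpy as np

group1 = np.array([15, 18, 20, 22, 25])
group2 = np.array([28, 30, 32, 35, 38])
group3 = np.array([42, 45, 48, 50, 53])

# Total sum of squares: squared deviations of every observation from the grand mean
all_data = np.concatenate([group1, group2, group3])
overall_mean = np.mean(all_data)
sst = np.sum((all_data - overall_mean) ** 2)

# Explained (between-group) sum of squares: squared deviations of the group means
# from the grand mean, weighted by group size (every group has 5 observations here)
group_means = np.array([np.mean(group1), np.mean(group2), np.mean(group3)])
sse = np.sum((group_means - overall_mean) ** 2) * len(group1)

# Residual (within-group) sum of squares: what remains after removing the explained part
# Note: some texts swap these abbreviations (SSR = regression, SSE = error)
ssr = sst - sse

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)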


Total Sum of Squares (SST): 2103.6000000000004
Explained Sum of Squares (SSE): 1909.2000000000003
Residual Sum of Squares (SSR): 194.4000000000001
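
As a cross-check on the values above, the sketch below computes the within-group sum of squares directly (rather than by subtraction) and compares the resulting F-statistic with `scipy.stats.f_oneway`. It reuses the same three groups; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([15, 18, 20, 22, 25]),
    np.array([28, 30, 32, 35, 38]),
    np.array([42, 45, 48, 50, 53]),
]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Each component computed independently from its definition
ss_total = np.sum((all_data - grand_mean) ** 2)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F is the ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_data) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print("SST equals SSB + SSW:", np.isclose(ss_total, ss_between + ss_within))
print("Manual F matches f_oneway:", np.isclose(f_manual, f_scipy))
```

Computing the within-group term from its definition also works for unequal group sizes, whereas multiplying by `len(group1)` relies on all groups being the same size.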


# Q5

In [8]:
import numpy as np

# Rows are levels of factor A, columns are levels of factor B,
# with a single observation per cell
data = np.array([
    [12, 15, 18],
    [20, 25, 28],
    [30, 32, 35]
], dtype=float)
n_a, n_b = data.shape

mean_factor_a = np.mean(data, axis=1)
mean_factor_b = np.mean(data, axis=0)
overall_mean = np.mean(data)

# Main-effect sums of squares: each level mean is based on n_b (or n_a) observations,
# so the squared deviations are weighted by that count
main_effect_a = n_b * np.sum((mean_factor_a - overall_mean) ** 2)
main_effect_b = n_a * np.sum((mean_factor_b - overall_mean) ** 2)

# Interaction: what remains of each cell after removing both main effects
# (the grand mean must be added back)
interaction_effect = np.sum(
    (data - mean_factor_a[:, np.newaxis] - mean_factor_b + overall_mean) ** 2
)

# Within-cell error: with one observation per cell, each cell mean equals the
# observation itself, so this term is zero and cannot be separated from the interaction
cell_means = data
ss_error = np.sum((data - cell_means) ** 2)

print(f"Main Effect of Factor A: {main_effect_a:.4f}")
print(f"Main Effect of Factor B: {main_effect_b:.4f}")
print(f"Interaction Effect: {interaction_effect:.4f}")
print(f"Error: {ss_error:.4f}")


Main Effect of Factor A: 451.5556
Main Effect of Factor B: 60.2222
Interaction Effect: 3.1111
Error: 0.0000


# Q6

In your case, you obtained an F-statistic of 5.23 and a p-value of 0.02. Here's how you can interpret these results:

    F-Statistic: The F-statistic of 5.23 indicates the ratio of variability between group means to the variability within groups. A larger F-statistic suggests that the variation between the group means is relatively larger compared to the variation within the groups.

    p-value: The p-value of 0.02 indicates that the probability of observing an F-statistic as extreme as 5.23 (or more extreme) under the assumption of no real differences between the group means is 0.02. In other words, if the null hypothesis were true (i.e., if the group means were all equal), you would only expect to see an F-statistic as extreme as 5.23 about 2% of the time.

    Conclusion: With a p-value of 0.02, you would typically use a predetermined significance level (often denoted as α, commonly set to 0.05) to make a decision. If the p-value is less than or equal to α, you reject the null hypothesis. In this case, since the p-value (0.02) is less than α (0.05), you would conclude that there is sufficient evidence to reject the null hypothesis.

    Interpretation: Based on these results, you can interpret that there are likely significant differences between the groups. In other words, at least one group mean is different from the others. However, the ANOVA itself doesn't tell you which specific groups have different means. If you reject the null hypothesis, you might proceed with post hoc tests (such as Tukey's HSD or Bonferroni corrections) to determine which groups differ significantly from each other.
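
The decision rule above can be sketched with the F distribution in `scipy.stats`. The degrees of freedom are not given in the question, so the values below (3 groups, 15 observations in total) are a hypothetical assumption chosen for illustration; under that assumption the p-value comes out at roughly 0.02.

```python
from scipy import stats

f_statistic = 5.23
dfn, dfd = 2, 12  # hypothetical: k - 1 = 2 and N - k = 12 (assumption, not from the text)
alpha = 0.05

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_statistic, dfn, dfd)

# Equivalent decision via the critical value at the chosen significance level
f_critical = stats.f.ppf(1 - alpha, dfn, dfd)

print(f"p-value: {p_value:.3f}")
print(f"Critical F at alpha={alpha}: {f_critical:.2f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Comparing the p-value with α and comparing the F-statistic with the critical value always lead to the same decision.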

# Q7 

Handling missing data in a repeated measures ANOVA is essential to ensure the validity and reliability of your results. Missing data can occur for various reasons, such as participant dropout, measurement errors, or other sources of nonresponse. There are several methods to handle missing data in a repeated measures ANOVA, each with its own implications:

    Listwise Deletion (Complete Case Analysis): This method involves removing any participant with missing data from the analysis. While it's straightforward, it can lead to a reduction in sample size, loss of statistical power, and potential bias if the missing data is not random. The remaining participants may not be representative of the original sample.

    Mean Imputation: Missing values are replaced with the mean value of the available data for that variable. While simple to implement, mean imputation can underestimate standard errors, distort relationships, and potentially result in biased estimates if the data is not missing at random.

    Last Observation Carried Forward (LOCF): Missing values are replaced with the last observed value for that participant. This method can lead to incorrect estimates of group differences if the data is not missing completely at random and if the pattern of missingness is related to the dependent variable.

    Linear Interpolation: Missing values are estimated based on the trend of the observed data points before and after the missing value. This method assumes a linear relationship and may not work well for non-linear data.

    Multiple Imputation: This involves creating multiple plausible imputed datasets and analyzing each separately. The results are then combined to produce valid estimates and appropriate standard errors. Multiple imputation provides a robust approach if assumptions about the missing data mechanism are met, but it requires more complex analyses.

    Model-Based Methods: Advanced methods, such as mixed-effects models, can be used to simultaneously model the repeated measures and handle missing data. These methods take advantage of the available data and account for the within-subject correlations.

The potential consequences of using different methods to handle missing data include:

    Bias: Methods like mean imputation or LOCF can introduce bias if the missing data mechanism is not random.
    Loss of Power: Removing participants with missing data reduces the sample size and lowers statistical power.
    Invalid Inferences: Incorrectly handling missing data can lead to incorrect conclusions and invalid inferences about group differences and relationships.
    Inaccurate Estimates: Some methods may underestimate or overestimate variability and relationships in the data.
    Assumption Violations: Some methods may assume that the missing data mechanism is ignorable or missing at random, which might not hold in real-world situations.
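
To make the trade-off concrete, here is a minimal sketch contrasting listwise deletion with a mixed-effects model, which keeps every available measurement. The data are simulated, the column names (`subject`, `time`, `bp`) are hypothetical, and `mixedlm` from statsmodels is used as one possible model-based approach rather than the only option.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical repeated measures data in wide format: 30 subjects, 3 time points
rng = np.random.default_rng(0)
n_subjects = 30
subject_effect = rng.normal(0, 5, size=n_subjects)
wide = pd.DataFrame({
    "subject": np.arange(n_subjects),
    "baseline": 140 + subject_effect + rng.normal(0, 3, n_subjects),
    "week1": 135 + subject_effect + rng.normal(0, 3, n_subjects),
    "week2": 130 + subject_effect + rng.normal(0, 3, n_subjects),
})

# Introduce some missing follow-up measurements (e.g., dropout)
wide.loc[[2, 7, 11, 19], "week2"] = np.nan
wide.loc[[5, 11], "week1"] = np.nan

# Listwise deletion: any subject with a missing value is dropped entirely
complete_cases = wide.dropna()
print("Subjects kept by listwise deletion:", len(complete_cases), "of", n_subjects)

# Mixed-effects model: long format, dropping only the individual missing rows
long = wide.melt(id_vars="subject", var_name="time", value_name="bp").dropna()
print("Measurements kept for the mixed model:", len(long), "of", n_subjects * 3)

# Random intercept per subject accounts for the within-subject correlation
result = smf.mixedlm("bp ~ C(time)", long, groups=long["subject"]).fit()
print(result.params)
```

The mixed model uses the partial records of subjects who missed a visit, which is valid when the data are missing at random given the observed values.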

# Q8
Post-hoc tests are used after an ANOVA finds a significant overall difference, to identify which specific groups differ while controlling for multiple comparisons. Some common post-hoc tests include:

    Tukey's Honestly Significant Difference (HSD): Tukey's HSD compares all possible pairs of group means while controlling the familywise error rate. It is suitable when you have three or more groups and want to assess all pairwise differences among them.

    Bonferroni Correction: This method divides the significance level (alpha) by the number of comparisons to control the familywise error rate. It is simple and applies to any set of tests, but it is conservative: as the number of comparisons grows, it increases the risk of Type II errors (false negatives).

    Sidak Correction: Similar to the Bonferroni correction, the Sidak correction adjusts the significance level for each comparison. However, it's a slightly less conservative correction, providing a balance between controlling Type I errors and maintaining power.

    Dunn's Test: Dunn's test is a non-parametric post-hoc test, typically run after a significant Kruskal-Wallis test when the assumptions of normality and homoscedasticity are violated. It compares groups pairwise using rank sums rather than means, usually with a Bonferroni-type adjustment to control the familywise error rate.

    Holm's Method: Holm's method is a step-down procedure that adjusts p-values for multiple comparisons. It's less conservative than Bonferroni correction and maintains familywise error control.

    Fisher's Least Significant Difference (LSD): Fisher's LSD runs unadjusted pairwise t-tests after a significant overall F-test, so it is more powerful but less stringent than Tukey's HSD. It offers only weak protection against familywise error as the number of groups grows, so it is best reserved for a small number of comparisons planned in advance.
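
The difference between the Bonferroni, Sidak, and Holm adjustments can be seen by applying each to the same set of p-values. This sketch uses `multipletests` from statsmodels; the three raw p-values are hypothetical.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical unadjusted p-values from three pairwise comparisons
raw_p_values = np.array([0.01, 0.02, 0.04])

adjusted_by_method = {}
for method in ["bonferroni", "sidak", "holm"]:
    # multipletests returns (reject flags, adjusted p-values, Sidak alpha, Bonferroni alpha)
    reject, adjusted, _, _ = multipletests(raw_p_values, alpha=0.05, method=method)
    adjusted_by_method[method] = adjusted
    print(f"{method:>10}: adjusted p = {np.round(adjusted, 4)}, reject = {reject}")
```

Holm's adjusted p-values are never larger than Bonferroni's, which is why it rejects at least as many hypotheses while still controlling the familywise error rate.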

Example Situation:
Suppose you conducted an experiment to compare the effectiveness of three different workout routines (A, B, and C) on cardiovascular fitness. After performing a one-way ANOVA, you found a significant difference among the means. Now, you want to determine which specific workout routines show significant differences in terms of cardiovascular fitness. Because all three pairwise comparisons (A vs. B, A vs. C, and B vs. C) are of interest, Tukey's HSD is a natural choice here.
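
The workout example could be followed up roughly as below. The fitness scores are simulated, hypothetical data (20 participants per routine with assumed means), and `pairwise_tukeyhsd` from statsmodels is used for the pairwise comparisons.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical cardiovascular fitness scores, 20 participants per routine
rng = np.random.default_rng(7)
routine_a = rng.normal(loc=40, scale=4, size=20)
routine_b = rng.normal(loc=44, scale=4, size=20)
routine_c = rng.normal(loc=50, scale=4, size=20)

scores = np.concatenate([routine_a, routine_b, routine_c])
labels = np.repeat(["A", "B", "C"], 20)

# All pairwise comparisons with the familywise error rate held at 5%
tukey = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(tukey.summary())
```

Each row of the summary reports the mean difference for one pair of routines, its confidence interval, and whether the null hypothesis of equal means is rejected for that pair.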

# Q9

In [9]:
import numpy as np
from scipy import stats
diet_A = np.array([3.2, 4.1, 2.8, 5.6, 3.9, 4.3, 2.7, 3.0, 4.7, 5.2,
                   3.8, 4.0, 4.5, 2.9, 3.7, 4.2, 3.4, 4.6, 5.1, 3.6,
                   2.7, 3.3, 4.8, 3.5, 4.6, 3.8, 4.2, 2.8, 4.5, 3.9,
                   5.0, 3.2, 4.1, 3.6, 3.8, 4.9, 3.3, 4.7, 3.6, 4.0,
                   2.9, 3.1, 4.3, 3.5, 5.2, 4.5, 3.4, 3.7, 4.6, 3.0])
diet_B = np.array([2.5, 3.0, 3.2, 2.0, 2.8, 3.6, 2.9, 3.5, 3.1, 2.4,
                   2.7, 2.3, 2.8, 2.6, 3.3, 2.2, 2.9, 3.1, 2.5, 2.7,
                   3.0, 2.4, 2.8, 3.2, 2.1, 2.7, 2.9, 2.6, 2.5, 3.0,
                   3.4, 2.8, 2.3, 2.6, 2.9, 3.0, 2.2, 3.3, 2.4, 3.1,
                   2.6, 2.7, 2.5, 2.8, 3.0, 2.9, 2.3, 2.1, 2.7, 2.5])
diet_C = np.array([1.8, 1.9, 2.2, 2.5, 1.7, 2.3, 1.6, 2.0, 2.4, 1.5,
                   2.1, 1.8, 2.3, 2.2, 1.7, 2.5, 2.0, 1.9, 1.8, 2.4,
                   2.2, 2.3, 1.6, 2.1, 2.0, 2.4, 1.9, 1.7, 2.3, 2.2,
                   1.5, 2.0, 2.1, 2.4, 1.8, 2.3, 2.2, 1.7, 2.5, 2.0,
                   1.9, 1.6, 2.4, 2.3, 2.1, 1.8, 1.7, 2.2, 2.0, 1.9])
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)
print("F-Statistic:", f_statistic)
print("p-value:", p_value)
alpha = 0.05
if p_value < alpha:
    print("There are significant differences between the mean weight loss of the diets.")
else:
    print("There are no significant differences between the mean weight loss of the diets.")

F-Statistic: 172.21724007247386
p-value: 2.9860750504035665e-39
There are significant differences between the mean weight loss of the diets.


# Q10

In [10]:
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
np.random.seed(0)
software_programs = np.random.choice(['A', 'B', 'C'], size=90)
experience_level = np.random.choice(['Novice', 'Experienced'], size=90)
completion_times = np.random.normal(loc=10, scale=2, size=90)
data = pd.DataFrame({'Software': software_programs,
                     'Experience': experience_level,
                     'Time': completion_times})
formula = 'Time ~ C(Software) + C(Experience) + C(Software):C(Experience)'
model = ols(formula, data).fit()
anova_table = anova_lm(model, typ=2)
print(anova_table)

                               sum_sq    df         F    PR(>F)
C(Software)                  4.600606   2.0  0.532542  0.589080
C(Experience)                1.359515   1.0  0.314741  0.576279
C(Software):C(Experience)   15.102201   2.0  1.748150  0.180369
Residual                   362.836289  84.0       NaN       NaN


# Q11

In [11]:
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
np.random.seed(0)
control_group_scores = np.random.normal(loc=70, scale=10, size=50)
experimental_group_scores = np.random.normal(loc=75, scale=10, size=50)

t_statistic, p_value = stats.ttest_ind(control_group_scores, experimental_group_scores)

print("Two-sample t-test results:")
print("T-Statistic:", t_statistic)
print("p-value:", p_value)

alpha = 0.05

if p_value < alpha:
    print("There is a significant difference in test scores between the two groups.")
    print("Performing post hoc test...")
    
    all_scores = np.concatenate([control_group_scores, experimental_group_scores])
    group_labels = np.array(['Control'] * len(control_group_scores) + ['Experimental'] * len(experimental_group_scores))
    
    tukey_result = pairwise_tukeyhsd(all_scores, group_labels, alpha=0.05)
    print(tukey_result)
else:
    print("There is no significant difference in test scores between the two groups.")


Two-sample t-test results:
T-Statistic: -1.6677351961320235
p-value: 0.09856078338184605
There is no significant difference in test scores between the two groups.


# Q12

In [12]:
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
store_A_sales = np.random.randint(500, 1000, size=30)
store_B_sales = np.random.randint(450, 950, size=30)
store_C_sales = np.random.randint(400, 900, size=30)
all_sales = np.concatenate([store_A_sales, store_B_sales, store_C_sales])
store_labels = np.array(['Store A'] * 30 + ['Store B'] * 30 + ['Store C'] * 30)
f_statistic, p_value = stats.f_oneway(store_A_sales, store_B_sales, store_C_sales)
print("One-way ANOVA results:")
print("F-Statistic:", f_statistic)
print("p-value:", p_value)
alpha = 0.05
if p_value < alpha:
    print("There is a significant difference in daily sales between the three stores.")
    print("Performing post hoc test...")
    tukey_result = pairwise_tukeyhsd(all_sales, store_labels, alpha=0.05)
    print(tukey_result)
else:
    print("There is no significant difference in daily sales between the three stores.")

One-way ANOVA results:
F-Statistic: 1.865941659055099
p-value: 0.16089061995020176
There is no significant difference in daily sales between the three stores.
