# 1.

Analysis of Variance (ANOVA) is a statistical method used for comparing means of three or more groups. Like any statistical technique, ANOVA has certain assumptions that, if violated, can impact the validity of the results. Here are the key assumptions of ANOVA:

Normality: The dependent variable should be normally distributed within each group. While ANOVA is known for being robust to moderate violations of normality, severe departures from normality can affect the results. Violations could include highly skewed or heavy-tailed distributions.

Homogeneity of Variances (Homoscedasticity): The variances of the dependent variable should be approximately equal across all groups. ANOVA pools the within-group variances into a single error term, so this assumption matters most when group sizes are unequal. Violations, known as heteroscedasticity, can distort the Type I error rate of the F-test and affect the precision of the estimated group means.

Independence: Observations within and between groups should be independent, meaning that the value of one observation should not be influenced by the value of another. Violations occur in repeated measures or clustered data, where observations from the same subject or cluster are correlated.

Examples of Violations and Their Impact:

Non-Normality:

Example: Suppose a dataset has a highly skewed distribution within one group.
Impact: ANOVA is generally robust to mild deviations from normality. However, severe non-normality may lead to unreliable results. In such cases, transformations or non-parametric alternatives might be considered.
Heteroscedasticity:

Example: Unequal variances between groups.
Impact: Heteroscedasticity can distort the F-test, especially when group sizes are unequal. Welch's ANOVA, which does not assume equal variances, or a variance-stabilizing transformation are possible alternatives.
Dependent Observations:

Example: Observations within groups are correlated, as in a repeated measures design.
Impact: Violating independence typically leads to underestimated standard errors and inflated Type I error rates. Mixed-effects models or repeated measures ANOVA may be more appropriate.
Outliers:

Example: Extreme values in one or more groups.
Impact: Outliers can distort group means and inflate the within-group variance, which can either mask real differences or produce spurious ones. Robust ANOVA methods or data transformation may be considered.
Categorical Variables:

Example: Including a categorical variable as a covariate without proper coding, for example treating arbitrary numeric category codes as a continuous covariate.
Impact: Results can be misinterpreted and assumptions violated, because ANCOVA (ANOVA with covariates) assumes that each covariate has a linear relationship with the dependent variable.
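
The normality and equal-variance assumptions described above can be screened before running ANOVA. The following is a minimal sketch, assuming SciPy is available. The helper name `check_anova_assumptions` and the simulated groups are illustrative and not part of the original material. It applies the Shapiro-Wilk test to each group, Levene's test across groups, and the Kruskal-Wallis test as one possible non-parametric fallback.

```python
import numpy as np
from scipy import stats


def check_anova_assumptions(groups):
    """Return p-values for common assumption checks (hypothetical helper)."""
    # Shapiro-Wilk: the null hypothesis is that a group is normally distributed
    normality_p = [stats.shapiro(g).pvalue for g in groups]
    # Levene (median-centered): the null hypothesis is equal variances across groups
    levene_p = stats.levene(*groups, center="median").pvalue
    # Kruskal-Wallis: rank-based alternative that does not assume normality
    kruskal_p = stats.kruskal(*groups).pvalue
    return {"normality_p": normality_p, "levene_p": levene_p, "kruskal_p": kruskal_p}


# Simulated data (an assumption for illustration): the third group has a much larger spread
rng = np.random.default_rng(0)
groups = [rng.normal(50, 5, 40), rng.normal(52, 5, 40), rng.normal(55, 20, 40)]

checks = check_anova_assumptions(groups)
for name, value in checks.items():
    print(name, value)
```

Small p-values from the Shapiro-Wilk or Levene tests suggest that the corresponding assumption may not hold. With large samples these tests flag even trivial departures, so they are best read alongside plots of the data.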

# 2.

Analysis of Variance (ANOVA) can be categorized into three main types based on the design of the study and the number of independent variables involved:

One-Way ANOVA:

Situation: Used when there is one categorical independent variable (factor) with two or more levels (groups), and the dependent variable is continuous.
Example: Examining if there are differences in test scores among students taught by different teachers (where the teachers represent the levels of the factor).
Two-Way ANOVA:

Situation: Used when there are two categorical independent variables (factors) and a continuous dependent variable. It tests the main effect of each factor and the interaction between the two factors.
Example: Investigating the effects of both treatment (drug dosage) and gender on the response variable (e.g., blood pressure). Here, treatment and gender are the two factors.
Repeated Measures ANOVA:

Situation: Used when measurements are taken on the same subjects under different conditions or at different time points. It accounts for the within-subject variability.
Example: Assessing the impact of a new teaching method on students' performance by measuring their test scores before, during, and after the intervention. Each student serves as their own control.
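
The repeated measures design in the last example can be sketched with `AnovaRM` from statsmodels. This is a minimal illustration under stated assumptions: the data are simulated, and the column names (`student`, `time`, `score`) and the size of the time effect are made up for the example.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Simulated long-format data (an assumption for illustration):
# 12 students, each measured before, during, and after the intervention
rng = np.random.default_rng(1)
n_students = 12
times = ["before", "during", "after"]
time_shift = {"before": 0.0, "during": 3.0, "after": 6.0}
baseline = rng.normal(70, 8, n_students)  # each student's own ability level

rows = []
for student in range(n_students):
    for t in times:
        score = baseline[student] + time_shift[t] + rng.normal(0, 2)
        rows.append({"student": student, "time": t, "score": score})
scores = pd.DataFrame(rows)

# One within-subject factor (time); every student must appear once per level
result = AnovaRM(scores, depvar="score", subject="student", within=["time"]).fit()
table = result.anova_table
print(table)
```

Because each student contributes a score at every time point, stable differences between students are removed from the error term. That is the sense in which each student acts as their own control.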

# 3.

The partitioning of variance in Analysis of Variance (ANOVA) refers to the breakdown of the total variance observed in the data into different components associated with various sources of variation. Understanding this concept is crucial because it allows researchers to assess the relative importance of different factors or sources that may influence the variability in the dependent variable.

In ANOVA, the total variability in the data (SST) is partitioned into two components, SSB and SSW. All three quantities are sums of squares rather than variances; dividing each by its degrees of freedom gives the corresponding mean square.

Between-Group Sum of Squares (SSB):

Definition: Represents the variability among the group means.
Interpretation: Measures how much the group means differ from the grand mean, weighted by group size. A large SSB relative to SSW suggests that at least some group means differ.

Within-Group Sum of Squares (SSW):

Definition: Represents the variability within each group.
Interpretation: Measures the variability of individual observations around their respective group means. A larger SSW indicates greater variability within groups.

Total Sum of Squares (SST):

Definition: Represents the overall variability in the data.
Interpretation: The sum of the squared differences between each observation and the grand mean. SST equals SSB plus SSW.

The importance of understanding the partitioning of variance in ANOVA lies in its ability to provide insights into the factors contributing to variability in the dependent variable. Researchers can use this information to:

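Assess Group Differences: SSB on its own is not a test. For k groups and N observations, the mean squares MSB = SSB / (k - 1) and MSW = SSW / (N - k) are compared through the ratio F = MSB / MSW, which shows whether the group means differ by more than within-group noise would explain.

Evaluate Within-Group Variability: SSW pools the variability inside every group into a single error term. It does not reveal whether individual groups have different variances; that needs a separate check such as Levene's test.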

Interpret Overall Variability: SST provides a reference point for understanding the total variability in the data. It serves as a baseline against which the contributions of between-group and within-group variability are evaluated.
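
The partition can also be written out explicitly. The notation here is an assumption, since the text above does not fix one: there are k groups, group i has n_i observations x_ij and mean x̄_i, x̄ is the grand mean, and N is the total number of observations.

```latex
\begin{aligned}
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}\right)^2}_{SST}
&= \underbrace{\sum_{i=1}^{k} n_i\left(\bar{x}_i-\bar{x}\right)^2}_{SSB}
 + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_i\right)^2}_{SSW} \\
N-1 &= (k-1) + (N-k)
\end{aligned}
```

The split is exact because the cross term vanishes: deviations from a group mean sum to zero within each group. The second line shows that the degrees of freedom partition in the same way.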
    

# 4.

In [3]:
import numpy as np

def one_way_anova_sums_of_squares(groups):
    # Combine all data into a single array
    all_data = np.concatenate(groups)

    # Calculate the grand mean
    grand_mean = np.mean(all_data)

    # Calculate Total Sum of Squares (SST)
    sst = np.sum((all_data - grand_mean)**2)

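    # Calculate Explained Sum of Squares (SSE): the between-group term, called SSB in section 3.
    # Note: many texts use "SSE" for the error sum of squares instead.
    sse = np.sum([len(group) * (np.mean(group) - grand_mean)**2 for group in groups])

    # Calculate Residual Sum of Squares (SSR): the within-group term, called SSW in section 3
    ssr = np.sum([(x - np.mean(group))**2 for group in groups for x in group])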

    return sst, sse, ssr

# Example data for three groups
group1 = [80, 85, 90, 92, 87, 83]
group2 = [75, 78, 82, 79, 81, 84]
group3 = [70, 72, 76, 74, 77, 73]

# Calculate sums of squares
sst, sse, ssr = one_way_anova_sums_of_squares([group1, group2, group3])

# Print the results
print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)


Total Sum of Squares (SST): 651.7777777777776
Explained Sum of Squares (SSE): 468.7777777777778
Residual Sum of Squares (SSR): 183.0
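
The sums of squares computed above become an F-test once each is divided by its degrees of freedom. The sketch below repeats the calculation on the same three groups and compares the result with `scipy.stats.f_oneway`. The variable names are illustrative and use the between/within terminology from section 3.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([80, 85, 90, 92, 87, 83]),
    np.array([75, 78, 82, 79, 81, 84]),
    np.array([70, 72, 76, 74, 77, 73]),
]
k = len(groups)                        # number of groups
n_total = sum(len(g) for g in groups)  # total number of observations
grand_mean = np.mean(np.concatenate(groups))

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

# F-statistic and its upper-tail probability under the F(k-1, N-k) distribution
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Cross-check against SciPy's implementation
f_scipy, p_scipy = stats.f_oneway(*groups)
print("Manual F:", f_manual, "p:", p_manual)
print("SciPy  F:", f_scipy, "p:", p_scipy)
```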


# 5.

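In a two-way ANOVA, the main effects and the interaction effect can be estimated from the cell means $\bar{y}_{ij}$, the row means $\bar{y}_{i\cdot}$, the column means $\bar{y}_{\cdot j}$, and the grand mean $\bar{y}_{\cdot\cdot}$:
Main Effect of Factor A (level i): $\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}$
Main Effect of Factor B (level j): $\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot}$
Interaction Effect (cell i, j): $\bar{y}_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j} + \bar{y}_{\cdot\cdot}$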

In [None]:
# Example
import numpy as np

def two_way_anova_effects(data):
    # Get the dimensions of the data
    I, J = data.shape

    # Calculate the grand mean
    grand_mean = np.mean(data)

    # Calculate Main Effect of Factor A (Main Effect A)
    mea = np.mean(data, axis=1) - grand_mean

    # Calculate Main Effect of Factor B (Main Effect B)
    meb = np.mean(data, axis=0) - grand_mean

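    # Calculate Interaction Effect (Interaction AB) for every cell:
    # cell value - row mean - column mean + grand mean.
    # With one observation per cell, this term cannot be separated from error.
    ia = data - grand_mean - mea[:, np.newaxis] - meb[np.newaxis, :]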

    return mea, meb, ia

# Example data for a 2x3 design (2 levels of Factor A, 3 levels of Factor B)
data = np.array([[10, 12, 14],
                 [15, 18, 21]])

# Calculate main effects and interaction effect
mea, meb, ia = two_way_anova_effects(data)

# Print the results
print("Main Effect of Factor A (MEA):", mea)
print("Main Effect of Factor B (MEB):", meb)
print("Interaction Effect (IA):", ia)


# 6.


In a one-way ANOVA, the F-statistic is used to test whether there are significant differences among the means of three or more groups. The associated p-value helps determine the statistical significance of the observed differences. Here's how to interpret the given results:

F-Statistic:

Value (Given): 5.23
Interpretation: The F-statistic represents the ratio of the variability between group means to the variability within groups. A higher F-value indicates a larger between-group variability relative to within-group variability.
P-Value:

Value (Given): 0.02
Interpretation: The p-value is the probability of observing an F-statistic at least as large as the one obtained if the null hypothesis (all group means are equal) were true. A smaller p-value indicates stronger evidence against the null hypothesis.
Conclusion:

P-Value < Significance Level (e.g., 0.05): If the p-value is less than the chosen significance level (commonly 0.05), you reject the null hypothesis.
Interpretation: In this case (p-value = 0.02 < 0.05), there is enough evidence to reject the null hypothesis. This implies that at least one group mean differs from the others; the F-test alone does not say which.
Post hoc Tests (if applicable):

If your one-way ANOVA indicates significant differences, you may perform post hoc tests (e.g., Tukey's HSD, Bonferroni) to identify which specific groups differ from each other.
Effect Size:

Additionally, it is often useful to consider the effect size, such as eta-squared (η² = SSB / SST), which quantifies the proportion of total variability in the dependent variable explained by group membership.
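
The link between the F-statistic, its p-value, and eta-squared can be sketched as follows. The text above gives F = 5.23 but not the degrees of freedom, so the values used here (2 and 15, as for three groups with 18 observations in total) are assumptions for illustration only. The helper name `f_test_summary` is also hypothetical.

```python
from scipy import stats


def f_test_summary(f_value, df_between, df_within):
    """Return the p-value and eta-squared implied by an F-statistic (hypothetical helper)."""
    # Upper-tail probability of the F distribution
    p_value = stats.f.sf(f_value, df_between, df_within)
    # eta-squared = SSB / SST, rewritten in terms of F and the degrees of freedom
    eta_squared = (f_value * df_between) / (f_value * df_between + df_within)
    return p_value, eta_squared


# F = 5.23 comes from the text; the degrees of freedom are assumed
p_value, eta_squared = f_test_summary(5.23, df_between=2, df_within=15)
print("p-value:", p_value)
print("eta-squared:", eta_squared)
```

The same F-value gives a different p-value and effect size for other degrees of freedom, so the actual values from the analysis should be substituted.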

# 7.

Handling missing data in a repeated measures ANOVA is an important consideration to ensure accurate and unbiased results. There are several methods to handle missing data, each with its own potential consequences. Here are some common approaches:

Complete Case Analysis (Listwise Deletion):

Handling: Exclude any participant with missing data from the analysis.
Consequences: Reduces the sample size and statistical power. Estimates remain unbiased only if data are missing completely at random (MCAR); otherwise the remaining participants may no longer represent the population, which biases results and limits generalizability.
Mean Imputation:

Handling: Replace missing values with the mean of the observed values for that variable.
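Consequences: Preserves the sample size but shrinks the variance of the imputed variable and understates standard errors, even when data are missing completely at random. It also introduces bias when data are not MCAR, because it assumes that missing values have the same mean as observed values.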
Last Observation Carried Forward (LOCF):

Handling: Impute missing values with the last observed value for that participant.
Consequences: Assumes that the participant's last observed value is a good estimate of the missing value. This method may not be suitable for all types of data and can lead to biased results, especially if there is a trend in the data.
Linear Interpolation:

Handling: Estimate missing values by linearly interpolating between adjacent observed values.
Consequences: Assumes a linear relationship between observed values. This method may be suitable for continuous variables with a clear pattern.
Multiple Imputation:

Handling: Generate multiple datasets with imputed values, incorporating uncertainty about missing data.
Consequences: Provides more accurate estimates and standard errors by accounting for variability across imputations. However, it typically assumes that data are missing at random (MAR) given the variables in the imputation model, and it can be computationally intensive.
Model-Based Imputation:

Handling: Impute missing values based on a statistical model, such as regression imputation.
Consequences: Assumes a specific relationship between variables. Can be more accurate than simple imputation methods but requires careful model specification.
Maximum Likelihood Estimation (MLE):

Handling: Estimate model parameters using all available data, including incomplete cases.
Consequences: Provides unbiased estimates if the missing data mechanism is ignorable (for example, missing at random) and makes efficient use of the available data. The results depend on that assumption and on the model being correctly specified.

Potential Consequences:

Bias: The chosen method may introduce bias if the missing data mechanism is not completely random.
Loss of Power: Complete case analysis reduces the sample size and statistical power.
Invalid Inferences: Choosing an inappropriate imputation method can lead to invalid inferences.
Assumption Violations: Imputation methods assume a certain distribution or relationship between variables, which may not hold in the actual data.
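
Several of the simpler approaches above map directly onto pandas operations. The sketch below uses a small hypothetical wide-format table, with one row per subject and one column per time point, to show listwise deletion, mean imputation, LOCF, and linear interpolation side by side. The data and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format scores: one row per subject, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 14.0],
        "t2": [12.0, np.nan, 13.0, 15.0],
        "t3": [14.0, 16.0, np.nan, 18.0],
        "t4": [16.0, 18.0, 17.0, 20.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Complete case analysis: drop every subject with at least one missing value
complete_cases = scores.dropna()

# Mean imputation: replace each gap with the mean of that time point
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each subject's previous observation forward along the row
locf = scores.ffill(axis=1)

# Linear interpolation between a subject's neighboring time points
interpolated = scores.interpolate(method="linear", axis=1)

print(complete_cases, mean_imputed, locf, interpolated, sep="\n\n")
```

In this toy table the scores rise over time, so LOCF fills each gap with a value below its neighbors. This illustrates the trend-related bias noted above.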

# 8.

Post-hoc tests are used after an Analysis of Variance (ANOVA) to identify specific group differences when the overall ANOVA indicates that there are significant differences among groups. Common post-hoc tests include:

Tukey's Honestly Significant Difference (HSD):

When to Use: Use Tukey's HSD when you have three or more groups, and you want to test all possible pairwise comparisons.
Example: In a study comparing the mean scores of four different teaching methods, an ANOVA indicates significant differences among the groups. Tukey's HSD can be used to identify which pairs of teaching methods differ significantly.
Bonferroni Correction:

When to Use: Use Bonferroni correction when you have three or more groups, and you want to control the familywise error rate by adjusting the significance level for each pairwise comparison.
Example: In a clinical trial comparing the efficacy of four treatments, an ANOVA indicates significant differences. To conduct pairwise comparisons while controlling the overall Type I error rate, Bonferroni correction can be applied.
Sidak Correction:

When to Use: Similar to Bonferroni, Sidak correction controls the familywise error rate. It is slightly less conservative than Bonferroni and is exact when the comparisons are independent.
Example: In a marketing study comparing the mean sales across five different promotional strategies, an ANOVA suggests significant differences. Sidak correction can be applied to conduct pairwise comparisons while controlling the overall Type I error rate.
Duncan's Multiple Range Test:

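When to Use: Duncan's test is used when you have three or more groups and want more power to detect pairwise differences than Tukey's HSD offers. The trade-off is that it does not strictly control the familywise error rate, so false positives are more likely.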
Example: In an agricultural experiment comparing the yields of five different fertilizers, an ANOVA reveals significant differences. Duncan's test can help identify which pairs of fertilizers result in significantly different yields.
Games-Howell Test:

When to Use: Use the Games-Howell test when the assumption of equal variances is violated, and you have three or more groups.
Example: In a psychology study comparing the mean scores of three different therapeutic interventions, an ANOVA shows significant differences. As the variances are unequal, the Games-Howell test can be used for pairwise comparisons.
Bonferroni-Dunn Test:

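When to Use: The Bonferroni-Dunn test applies a Bonferroni adjustment to a planned set of comparisons, such as comparing each of several treatments with a control group. Dunnett's test is a common alternative designed specifically for treatment-versus-control comparisons.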
Example: In a pharmaceutical study comparing the effects of multiple drug treatments and a placebo, an ANOVA indicates significant differences. The Bonferroni-Dunn test can be applied to compare each drug treatment with the placebo.
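
The Bonferroni and Sidak adjustments mentioned above reduce to simple formulas for the per-comparison significance level. The following sketch computes both for all pairwise comparisons among a given number of groups. The function name `per_comparison_alpha` is illustrative.

```python
from math import comb


def per_comparison_alpha(n_groups, familywise_alpha=0.05):
    """Per-comparison alpha under Bonferroni and Sidak (hypothetical helper)."""
    m = comb(n_groups, 2)  # number of pairwise comparisons
    bonferroni = familywise_alpha / m
    sidak = 1 - (1 - familywise_alpha) ** (1 / m)
    return m, bonferroni, sidak


# Four groups, as in the teaching-methods and clinical-trial examples
m, bonferroni, sidak = per_comparison_alpha(4)
print("Comparisons:", m)
print("Bonferroni alpha per comparison:", bonferroni)
print("Sidak alpha per comparison:", sidak)
```

The Sidak threshold is always slightly larger than the Bonferroni threshold, which is why it is described as less conservative.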

# 9.

To conduct a one-way ANOVA in Python, you can use the scipy.stats module. Here's an example code snippet to perform a one-way ANOVA on weight loss data for three diets (A, B, and C):

In [7]:
import numpy as np
from scipy.stats import f_oneway

# Generate example weight loss data for three diets
np.random.seed(42)  # For reproducibility
data_A = np.random.normal(loc=2, scale=1, size=50)  # Mean weight loss for diet A
data_B = np.random.normal(loc=3, scale=1, size=50)  # Mean weight loss for diet B
data_C = np.random.normal(loc=2.5, scale=1, size=50)  # Mean weight loss for diet C

# Combine the data into a single array
all_data = np.concatenate([data_A, data_B, data_C])

# Create corresponding group labels
group_labels = ['A'] * 50 + ['B'] * 50 + ['C'] * 50

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(data_A, data_B, data_C)

# Print the results
print("One-way ANOVA Results:")
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("The one-way ANOVA indicates significant differences between the mean weight loss of the three diets.")
else:
    print("There is not enough evidence to conclude significant differences between the mean weight loss of the three diets.")


One-way ANOVA Results:
F-Statistic: 21.809565795751933
p-value: 5.076768176045347e-09
The one-way ANOVA indicates significant differences between the mean weight loss of the three diets.


# 10.

In [8]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate example data
np.random.seed(42)

# Create a DataFrame with random data
data = pd.DataFrame({
    'Time': np.random.normal(loc=20, scale=5, size=90),  # Overall mean time
    'Program': np.repeat(['A', 'B', 'C'], 30),
    'Experience': np.tile(['Novice', 'Experienced'], 45),
})

# Convert categorical variables to categorical data type
data['Program'] = data['Program'].astype('category')
data['Experience'] = data['Experience'].astype('category')

# Fit the two-way ANOVA model
formula = 'Time ~ Program + Experience + Program:Experience'
model = ols(formula, data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

# Interpretation of main effects and interaction effects
alpha = 0.05
print("\nInterpretation:")
if anova_table['PR(>F)']['Program'] < alpha:
    print("There is a significant main effect of Program.")
else:
    print("There is no significant main effect of Program.")

if anova_table['PR(>F)']['Experience'] < alpha:
    print("There is a significant main effect of Experience.")
else:
    print("There is no significant main effect of Experience.")

if anova_table['PR(>F)']['Program:Experience'] < alpha:
    print("There is a significant interaction effect between Program and Experience.")
else:
    print("There is no significant interaction effect between Program and Experience.")


                         sum_sq    df         F    PR(>F)
Program               15.717327   2.0  0.344485  0.709581
Experience             2.994142   1.0  0.131248  0.718051
Program:Experience     9.952457   2.0  0.218133  0.804472
Residual            1916.273490  84.0       NaN       NaN

Interpretation:
There is no significant main effect of Program.
There is no significant main effect of Experience.
There is no significant interaction effect between Program and Experience.


# 11.

In [9]:
import numpy as np
from scipy.stats import ttest_ind, f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Set a random seed for reproducibility
np.random.seed(42)

# Generate example data for control group (traditional teaching) and experimental group (new teaching)
control_group = np.random.normal(loc=70, scale=10, size=50)
experimental_group = np.random.normal(loc=75, scale=10, size=50)

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group, experimental_group)

# Print the results of the t-test
print("Two-sample t-test Results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Check for significance
alpha = 0.05
if p_value < alpha:
    print("The two-sample t-test indicates a significant difference in test scores between the control and experimental groups.")
else:
    print("There is no significant difference in test scores between the control and experimental groups.")

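# Follow-up analysis (Tukey HSD, not a Bonferroni correction). With only two groups this
# adds nothing new: the ANOVA F-statistic equals t**2 and Tukey HSD repeats the single comparison.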
if p_value < alpha:
    # Combine data for post-hoc test
    all_data = np.concatenate([control_group, experimental_group])
    
    # Create corresponding group labels
    group_labels = ['Control'] * 50 + ['Experimental'] * 50
    
    # Perform one-way ANOVA for post-hoc test
    f_statistic, anova_p_value = f_oneway(control_group, experimental_group)
    
    # Print the results of the one-way ANOVA
    print("\nOne-way ANOVA Results for Post-hoc Test:")
    print("F-statistic:", f_statistic)
    print("p-value:", anova_p_value)
    
    # Perform post-hoc Tukey HSD test
    tukey_results = pairwise_tukeyhsd(all_data, group_labels)
    
    # Print the results of the post-hoc Tukey HSD test
    print("\nPost-hoc Tukey HSD Test Results:")
    print(tukey_results)


Two-sample t-test Results:
t-statistic: -4.108723928204809
p-value: 8.261945608702611e-05
The two-sample t-test indicates a significant difference in test scores between the control and experimental groups.

One-way ANOVA Results for Post-hoc Test:
F-statistic: 16.88161231820275
p-value: 8.261945608702588e-05

Post-hoc Tukey HSD Test Results:
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
Control Experimental   7.4325 0.0001 3.8427 11.0224   True
----------------------------------------------------------


# 12.

In [10]:
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Set a random seed for reproducibility
np.random.seed(42)

# Generate example data for daily sales of three stores
store_A_sales = np.random.normal(loc=1000, scale=50, size=30)
store_B_sales = np.random.normal(loc=1100, scale=60, size=30)
store_C_sales = np.random.normal(loc=1050, scale=55, size=30)

# Combine data for one-way ANOVA
all_sales = np.concatenate([store_A_sales, store_B_sales, store_C_sales])

# Create corresponding group labels
group_labels = ['Store A'] * 30 + ['Store B'] * 30 + ['Store C'] * 30

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(store_A_sales, store_B_sales, store_C_sales)

# Print the results of the one-way ANOVA
print("One-way ANOVA Results:")
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Check for significance
alpha = 0.05
if p_value < alpha:
    print("The one-way ANOVA indicates a significant difference in daily sales between the three stores.")
else:
    print("There is no significant difference in daily sales between the three stores.")

# Post-hoc test (Tukey HSD)
if p_value < alpha:
    # Perform post-hoc Tukey HSD test
    tukey_results = pairwise_tukeyhsd(all_sales, group_labels)
    
    # Print the results of the post-hoc Tukey HSD test
    print("\nPost-hoc Tukey HSD Test Results:")
    print(tukey_results)


One-way ANOVA Results:
F-statistic: 29.199185903529578
p-value: 1.9849398798210062e-10
The one-way ANOVA indicates a significant difference in daily sales between the three stores.

Post-hoc Tukey HSD Test Results:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1  group2 meandiff p-adj   lower    upper   reject
--------------------------------------------------------
Store A Store B 102.1376    0.0  70.1016 134.1736   True
Store A Store C   60.116 0.0001    28.08   92.152   True
Store B Store C -42.0216 0.0067 -74.0576  -9.9856   True
--------------------------------------------------------
