Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Ans:

ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups to determine if at least one group mean is significantly different from the others. However, the validity of ANOVA results depends on certain assumptions. Here are the primary assumptions required for using ANOVA, along with examples of violations that could impact the validity of the results:

Assumptions of ANOVA
Independence of Observations:

Description: The observations within each group and between groups must be independent of each other.
Violation Example: Measuring the same subjects under several conditions and then analysing the scores as if they came from separate groups. Repeated measurements on one individual are correlated, so a standard between-subjects ANOVA that ignores the repeated nature of the data violates independence.
Normality:

Description: The values within each group (equivalently, the model residuals) should be approximately normally distributed.
Violation Example: If the data is heavily skewed or has outliers, the normality assumption is violated. For example, income data in socio-economic studies often violates normality due to a few very high incomes.
Homogeneity of Variances (Homoscedasticity):

Description: The variances within each group should be approximately equal.
Violation Example: If one group's variance is significantly larger or smaller than others, this assumption is violated. For instance, comparing test scores from different schools where one school has much more variability in student performance than the others.
Impact of Violations
Violation of Independence:

Impact: Correlated errors distort the F-test, increasing the risk of Type I or Type II errors. For example, treating several measurements from the same subject as independent observations overstates the effective sample size, which can understate the error variability and inflate the F-statistic.
Violation of Normality:

Impact: ANOVA is relatively robust to moderate violations of normality, especially with large sample sizes, due to the Central Limit Theorem. However, severe departures from normality can affect the validity of the F-test, increasing the likelihood of Type I or Type II errors. For example, with small sample sizes, non-normal data can lead to an inaccurate assessment of group differences.
Violation of Homogeneity of Variances:

Impact: If variances are unequal, the pooled within-group variance no longer describes every group, so the actual Type I error rate of the F-test can drift away from the nominal level. The distortion is worst when group sizes are also unequal: the test tends to be too liberal when the smaller groups have the larger variances and too conservative in the opposite case. For example, when using ANOVA to compare groups with highly unequal variances, significant results might be driven by variance differences rather than mean differences.
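
The normality and equal-variance assumptions can be screened before running the test. The sketch below is a minimal illustration, assuming SciPy is available. The `groups` dictionary and its values are hypothetical, and the Shapiro-Wilk and Levene tests are one common choice of checks rather than the only option.

```python
import numpy as np
from scipy import stats

# Hypothetical test scores for three groups (illustrative values only)
groups = {
    'A': np.array([72, 75, 78, 74, 71, 77, 73, 76]),
    'B': np.array([80, 82, 79, 85, 81, 83, 78, 84]),
    'C': np.array([65, 90, 70, 95, 60, 88, 72, 99]),  # visibly more spread out
}

# Normality: Shapiro-Wilk test within each group (H0: the data are normal)
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p

# Homogeneity of variances: Levene's test across groups (H0: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f'Shapiro-Wilk p-value for group {name}: {p:.3f}')
print(f"Levene's test: W = {levene_stat:.3f}, p = {levene_p:.4f}")
```

A small p-value from either test suggests the corresponding assumption is doubtful. Independence cannot be tested this way; it has to be judged from how the data were collected.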

Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans:
    
ANOVA (Analysis of Variance) comes in three main types, each suited for different experimental designs and research questions. These types are:

One-Way ANOVA:

Description: One-Way ANOVA is used to compare the means of three or more independent groups based on one factor or independent variable.

Situations for Use:
When you have a single categorical independent variable (factor) with three or more levels (groups) and one continuous dependent variable.

Example: Comparing the average test scores of students from three different teaching methods.
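
As a minimal sketch of this example, assuming SciPy is available, `scipy.stats.f_oneway` runs a one-way ANOVA directly on the group samples. The scores and the names `method_1` to `method_3` are made up for illustration.

```python
from scipy import stats

# Hypothetical test scores under three teaching methods (illustrative values only)
method_1 = [78, 82, 75, 80, 79]
method_2 = [85, 88, 84, 90, 86]
method_3 = [70, 72, 68, 74, 71]

# One factor (teaching method) with three levels, one continuous outcome (score)
f_stat, p_value = stats.f_oneway(method_1, method_2, method_3)

print(f'F-statistic: {f_stat:.2f}')
print(f'p-value: {p_value:.6f}')
```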

Two-Way ANOVA:

Description: Two-Way ANOVA is used to examine the effect of two independent factors on a dependent variable and can also explore the interaction effect between the two factors.

Situations for Use:
When you have two categorical independent variables (factors), each with two or more levels, and one continuous dependent variable.

Example: Studying the impact of both teaching method (three levels) and study environment (two levels) on student test scores. This setup allows for analyzing both the main effects of each factor and their interaction effect.

Repeated Measures ANOVA:

Description: Repeated Measures ANOVA is used when the same subjects are measured multiple times under different conditions or over time.

Situations for Use:
When you have a single group of subjects exposed to multiple conditions or measured at different time points.

Example: Measuring the blood pressure of patients before, during, and after taking a medication.
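
The blood pressure example can be sketched by hand with NumPy, which also shows how a repeated measures design removes between-subject variability from the error term. The readings below are hypothetical, the variable names are my own, and the sketch assumes a complete, balanced design and does not check sphericity. statsmodels also provides an `AnovaRM` class for the same kind of analysis.

```python
import numpy as np
from scipy import stats

# Hypothetical systolic blood pressure readings (illustrative values only)
# rows = patients, columns = before / during / after medication
bp = np.array([
    [140, 135, 130],
    [150, 142, 138],
    [138, 136, 131],
    [145, 139, 133],
    [152, 147, 140],
], dtype=float)
n_subjects, n_conditions = bp.shape
grand_mean = bp.mean()

# Split the total variability into condition, subject, and error parts
ss_conditions = n_subjects * np.sum((bp.mean(axis=0) - grand_mean) ** 2)
ss_subjects = n_conditions * np.sum((bp.mean(axis=1) - grand_mean) ** 2)
ss_total = np.sum((bp - grand_mean) ** 2)
ss_error = ss_total - ss_conditions - ss_subjects

df_conditions = n_conditions - 1
df_error = (n_conditions - 1) * (n_subjects - 1)

# F-test for the within-subject factor (time point)
f_stat = (ss_conditions / df_conditions) / (ss_error / df_error)
p_value = stats.f.sf(f_stat, df_conditions, df_error)

print(f'F({df_conditions}, {df_error}) = {f_stat:.2f}, p = {p_value:.5f}')
```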

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans:
    
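The partitioning of variance in ANOVA (Analysis of Variance) is the breakdown of the total variability in the data into components attributable to different sources. It is important because the F-test is built from these components: the F-statistic compares the between-group variability with the within-group variability, so understanding the partition explains what a large or small F actually means.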

Partitioning of Variance in ANOVA
In a one-way ANOVA, three sums of squares describe this breakdown: the total, and the two components it splits into.

Total Sum of Squares (SST):

Description: Represents the total variability in the data. It is the sum of the squared differences between each observation and the overall mean.

Interpretation: This quantifies the overall dispersion of the data points around the grand mean.

Between-Group Sum of Squares (SSB or SSA):

Description: Represents the variability due to differences between the group means. It measures how much of the total variance can be explained by the differences between the group means.

Interpretation: This quantifies the variability due to the factor or independent variable being tested.

Within-Group Sum of Squares (SSW or SSE):

Description: Represents the variability within each group. It measures the dispersion of individual observations around their respective group means.

Interpretation: This quantifies the variability due to random error or inherent variability within the groups.
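
For a one-way design, the three quantities above can be written out as follows. The notation is an assumption introduced here for illustration: there are $k$ groups, group $j$ has $n_j$ observations $x_{ij}$ with mean $\bar{x}_j$, $\bar{x}$ is the grand mean, and $N$ is the total number of observations.

```latex
\begin{aligned}
SST &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}\right)^2 \\
SSB &= \sum_{j=1}^{k} n_j \left(\bar{x}_j - \bar{x}\right)^2 \\
SSW &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}_j\right)^2 \\
SST &= SSB + SSW \\
F &= \frac{SSB / (k - 1)}{SSW / (N - k)}
\end{aligned}
```

The last line shows why the partition matters: the F-statistic is the ratio of the between-group mean square to the within-group mean square.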

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

# Example data: Three groups with different observations
data = {
    'Group': ['A']*5 + ['B']*5 + ['C']*5,
    'Value': [2, 4, 3, 5, 6, 3, 4, 5, 6, 7, 6, 7, 8, 7, 9]
}

# Convert to a DataFrame
df = pd.DataFrame(data)

# Calculate overall mean
overall_mean = df['Value'].mean()

# Calculate total sum of squares (SST)
sst = np.sum((df['Value'] - overall_mean) ** 2)

# Calculate group means
group_means = df.groupby('Group')['Value'].mean()

# Calculate explained sum of squares (SSE): the between-group term, called SSB in Q3
sse = np.sum(df.groupby('Group').size() * (group_means - overall_mean) ** 2)

# Calculate residual sum of squares (SSR): the within-group term, called SSW (or SSE) in Q3
ssr = np.sum((df['Value'] - df['Group'].map(group_means)) ** 2)

print(f'Total Sum of Squares (SST): {sst}')
print(f'Explained Sum of Squares (SSE): {sse}')
print(f'Residual Sum of Squares (SSR): {ssr}')

# Verify the identity; compare with a tolerance because floating-point
# rounding makes an exact == check fail
print(f'SST = SSE + SSR: {np.isclose(sst, sse + ssr)}')


Total Sum of Squares (SST): 55.733333333333334
Explained Sum of Squares (SSE): 30.533333333333342
Residual Sum of Squares (SSR): 25.2
SST = SSE + SSR: True
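
The sums of squares can be turned into an F-statistic by dividing each by its degrees of freedom. The sketch below repeats the same example data, computes F by hand, and cross-checks it against `scipy.stats.f_oneway`. The variable names are my own and the block is meant as an illustration, not as the only way to do this.

```python
import numpy as np
from scipy import stats

# Same example data as in the cell above
groups = {
    'A': [2, 4, 3, 5, 6],
    'B': [3, 4, 5, 6, 7],
    'C': [6, 7, 8, 7, 9],
}
all_values = np.concatenate([np.asarray(v, dtype=float) for v in groups.values()])
overall_mean = all_values.mean()

# Explained (between-group) and residual (within-group) sums of squares
ss_explained = sum(len(v) * (np.mean(v) - overall_mean) ** 2 for v in groups.values())
ss_residual = sum(np.sum((np.asarray(v) - np.mean(v)) ** 2) for v in groups.values())

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_explained / df_between) / (ss_residual / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against SciPy's built-in one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups.values())

print(f'Manual: F = {f_manual:.4f}, p = {p_manual:.4f}')
print(f'SciPy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}')
```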


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [5]:
pip install pandas statsmodels

Note: you may need to restart the kernel to use updated packages.


In [12]:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data (illustrative values): two factors (FactorA and FactorB), three
# levels each, with two observations per cell. Replication is required here:
# with a single observation per cell the interaction model leaves no residual
# degrees of freedom, and anova_lm fails with
# "ValueError: array must not contain infs or NaNs".
data = {
    'FactorA': ['A1'] * 6 + ['A2'] * 6 + ['A3'] * 6,
    'FactorB': ['B1', 'B1', 'B2', 'B2', 'B3', 'B3'] * 3,
    'Value': [20, 21, 21, 23, 19, 20,
              22, 24, 23, 24, 21, 23,
              24, 25, 25, 27, 23, 22]
}

# Convert to a DataFrame
df = pd.DataFrame(data)

# Define the model formula: two main effects plus their interaction
formula = 'Value ~ C(FactorA) + C(FactorB) + C(FactorA):C(FactorB)'

# Fit the model using OLS (Ordinary Least Squares)
model = ols(formula, data=df).fit()

# Perform the ANOVA (Type II sums of squares)
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table: the C(FactorA) and C(FactorB) rows are the main effects,
# the C(FactorA):C(FactorB) row is the interaction, and Residual is the error term
print(anova_table)

Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

Ans:

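With an F-statistic of 5.23 and a p-value of 0.02, the result is statistically significant at the conventional 0.05 level (0.02 < 0.05), so we reject the null hypothesis that all group means are equal. The F-statistic indicates that the variability between group means is about 5.23 times the variability within groups. This tells us that at least one group mean differs from the others, but not which one; post-hoc tests such as Tukey's HSD are required to pinpoint the specific differences. Note that the same result would not be significant at a stricter 0.01 level.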
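
One simple way to follow up a significant F-test is to compare every pair of groups and correct for multiple comparisons. The sketch below uses pairwise t-tests with a Bonferroni correction, a more conservative alternative to Tukey's HSD. The data and names are hypothetical, since the question gives only the F-statistic and p-value.

```python
from itertools import combinations
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
groups = {
    'A': [23, 25, 21, 24, 22],
    'B': [30, 29, 31, 28, 32],
    'C': [24, 26, 23, 25, 22],
}

pairs = list(combinations(groups, 2))
alpha = 0.05

results = {}
for first, second in pairs:
    # Independent two-sample t-test for this pair of groups
    t_stat, p_raw = stats.ttest_ind(groups[first], groups[second])
    # Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
    p_adj = min(p_raw * len(pairs), 1.0)
    results[(first, second)] = p_adj
    verdict = 'different' if p_adj < alpha else 'no evidence of a difference'
    print(f'{first} vs {second}: adjusted p = {p_adj:.4f} ({verdict})')
```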


Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?