In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.
Assumptions of ANOVA:

Independence of Observations: The samples must be independent of each other.
Violation Example: Data collected from the same subjects over multiple time points without accounting for the repeated measures.
Normality: The data in each group (equivalently, the model residuals) should be approximately normally distributed.
Violation Example: Strongly skewed or heavy-tailed data, especially in small samples, can make the p-value inaccurate and lead to incorrect conclusions.
Homogeneity of Variances: The variance among the groups should be approximately equal (homoscedasticity).
Violation Example: If one group has a much larger variance than the others, this can distort the F-statistic.
Violations of these assumptions can lead to increased Type I or Type II errors, making the results of the ANOVA unreliable.
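
The normality and equal-variance assumptions can be screened before running the ANOVA. The sketch below is one possible check, assuming simulated data for three hypothetical groups (the group names, means, and seed are illustrative and not part of the original answer). It uses the Shapiro-Wilk test per group and Levene's test across groups. Independence is a matter of study design and cannot be verified this way.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed so the sketch is reproducible

# Hypothetical samples for three independent groups (simulated, illustrative only)
groups = {
    "A": rng.normal(50, 5, 30),
    "B": rng.normal(52, 5, 30),
    "C": rng.normal(55, 5, 30),
}

# Normality within each group: Shapiro-Wilk (null hypothesis: the sample is normal)
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p

# Homogeneity of variances: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test is a warning sign rather than proof of a problem; plots of the residuals are usually worth checking as well.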

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?
One-Way ANOVA: Used when comparing the means of three or more independent groups based on one factor.
Situation: Comparing the average test scores of students from different schools.
Two-Way ANOVA: Used when comparing the means of groups based on two factors.
Situation: Assessing how crop yield depends on the type of fertilizer and the type of crop, including whether the fertilizer effect differs from crop to crop (the interaction).
Repeated Measures ANOVA: Used when the same subjects are measured multiple times under different conditions.
Situation: Measuring the blood pressure of patients at multiple time points after administering a drug.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
Partitioning of Variance:

Total Sum of Squares (SST): The total variation in the data.
Between-Group Sum of Squares (SSB): Variation due to differences between the group means.
Within-Group Sum of Squares (SSW): Variation due to differences within each group (residual variation).
Importance:
The total variation splits exactly into the two components: SST = SSB + SSW. Understanding this partition helps in determining whether the group means are significantly different from each other. Dividing SSB and SSW by their degrees of freedom and taking the ratio gives the F-statistic, and its p-value is used to test the null hypothesis that all group means are equal.
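
To make the link between the partition and the F-statistic concrete, here is a minimal sketch that assumes three small hypothetical groups (the values are illustrative only). It computes the three sums of squares by hand, forms F as the ratio of the between-group and within-group mean squares, and compares the result with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical measurements for three groups (illustrative values only)
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([7.0, 8.0, 6.0, 7.0]),
    np.array([9.0, 10.0, 11.0, 10.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group, and within-group sums of squares
sst = ((all_values - grand_mean) ** 2).sum()
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares
k = len(groups)        # number of groups
n = len(all_values)    # total number of observations
f_manual = (ssb / (k - 1)) / (ssw / (n - k))

f_scipy, p_scipy = f_oneway(*groups)

print("SST:", sst, "SSB + SSW:", ssb + ssw)
print("F (manual):", f_manual, "F (scipy):", f_scipy)
```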

In [None]:
4.
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Example data
data = {
    'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Value': [10, 12, 14, 20, 22, 24, 30, 32, 34]
}
df = pd.DataFrame(data)

# Overall mean
overall_mean = df['Value'].mean()

# Group means
group_means = df.groupby('Group')['Value'].mean()

# Total Sum of Squares (SST)
sst = sum((df['Value'] - overall_mean) ** 2)

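# Explained Sum of Squares (SSB): group size times the squared deviation
# of each group mean from the overall mean
ssb = sum(df.groupby('Group').size() * (group_means - overall_mean) ** 2)

# Residual Sum of Squares (SSR), i.e. the within-group sum of squares (SSW):
# squared deviation of each value from its own group mean
ssr = sum((df.set_index('Group')['Value'] - group_means) ** 2)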

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSB):", ssb)
print("Residual Sum of Squares (SSR):", ssr)


In [None]:
5.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
data = {
    'Factor1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Factor2': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
    'Value': [10, 12, 14, 20, 22, 24, 30, 32, 34]
}
df = pd.DataFrame(data)

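# Fit the model. With only one observation per Factor1/Factor2 cell, an
# interaction term would use up every degree of freedom and leave none for
# the error term, so only the two main effects are fitted here.
model = ols('Value ~ C(Factor1) + C(Factor2)', data=df).fit()

# ANOVA table. This toy data is exactly additive (the residual sum of squares
# is essentially zero), so the F-values come out extremely large; real data
# would include noise and, ideally, several observations per cell.
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)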


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?
Conclusion:
Since the p-value (0.02) is less than the significance level (typically 0.05), we reject the null hypothesis that all group means are equal. This indicates that at least one group mean differs significantly from the others.

Interpretation:
If all group means were truly equal, an F-statistic of 5.23 or larger would be expected only about 2% of the time, so the observed differences are unlikely to be due to random chance alone. The F-test does not show which groups differ, so post-hoc tests are needed to identify the specific pairs.
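
The p-value follows from the F-statistic and the two degrees of freedom, which the question does not state. As a rough sketch, assuming 3 groups and 15 observations in total (both numbers are hypothetical), the upper-tail probability of the F distribution can be computed with `scipy.stats.f`:

```python
from scipy import stats

f_stat = 5.23
df_between = 2   # assumption: k - 1 with k = 3 groups
df_within = 12   # assumption: N - k with N = 15 observations

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value that F must exceed to be significant at the 0.05 level
critical_f = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha = 0.05: {critical_f:.3f}")
```

Under these assumed degrees of freedom the p-value comes out close to the 0.02 quoted in the question; other group counts or sample sizes would give a different p-value for the same F.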

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?
Handling Missing Data:

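Listwise Deletion: Remove any cases with missing values.
Consequence: In a repeated measures design, a subject with even one missing time point is dropped entirely. This reduces the sample size and statistical power, and it can bias the results unless the data are missing completely at random.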
Mean Imputation: Replace missing values with the mean of the available data.
Consequence: Underestimates variability and can bias results.
Last Observation Carried Forward (LOCF): Use the last observed value to fill in missing data.
Consequence: Assumes no change over time, which may not be valid.
Multiple Imputation: Use statistical models to estimate missing values multiple times and combine results.
Consequence: Reflects the uncertainty about the missing values and generally gives more accurate estimates than single imputation, but is more complex to implement.
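
The first three strategies can be sketched with pandas. The example below assumes a small hypothetical wide-format table (one row per subject, one column per time point; all names and values are illustrative) and shows how each method changes the data.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 13.0],
        "t2": [11.0, np.nan, 12.0, 14.0],
        "t3": [12.0, 14.0, np.nan, 15.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Listwise deletion: drop every subject with at least one missing time point
listwise = scores.dropna()

# Mean imputation: fill each gap with the mean of that time point
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each subject's previous observation forward across time points
locf = scores.ffill(axis=1)

print("Subjects kept after listwise deletion:", len(listwise), "of", len(scores))
print("Variance of t2, observed values only:", scores["t2"].var())
print("Variance of t2 after mean imputation:", mean_imputed["t2"].var())
print("LOCF result:")
print(locf)
```

Here listwise deletion discards half of the subjects, and mean imputation lowers the variance of `t2`, which illustrates the consequences listed above.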

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.
Common Post-Hoc Tests:

Tukey's Honestly Significant Difference (HSD): Used to compare all possible pairs of group means.
Situation: After finding a significant F-statistic in a one-way ANOVA with more than two groups.
Bonferroni Correction: Divides the significance level by the number of comparisons to control the Type I error rate.
Situation: When performing a small, pre-planned set of pairwise tests and needing to control the family-wise error rate.
Scheffé's Test: More conservative than Tukey's HSD; it covers all possible contrasts among the group means, not just pairwise differences.
Situation: When making complex comparisons, such as one group against the average of several others, or when the groups have unequal sample sizes.
Example Situation: After conducting a one-way ANOVA on the mean test scores of students from different schools, and finding a significant F-statistic, use Tukey's HSD to determine which specific schools' mean scores differ from each other.
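
The school example can be sketched with `pairwise_tukeyhsd` from statsmodels. The scores below are hypothetical values made up for illustration; in practice the test would follow a significant one-way ANOVA on the real data.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical test scores for three schools (illustrative values only)
scores = np.array([70, 72, 68, 71, 69,    # school A
                   71, 70, 69, 72, 68,    # school B
                   85, 88, 86, 87, 84])   # school C
schools = np.array(["A"] * 5 + ["B"] * 5 + ["C"] * 5)

# Tukey's HSD compares every pair of schools at a family-wise alpha of 0.05
tukey = pairwise_tukeyhsd(endog=scores, groups=schools, alpha=0.05)
print(tukey.summary())

# One True/False flag per pair, in the order A-B, A-C, B-C
reject = [bool(flag) for flag in tukey.reject]
print("Reject equal means (A-B, A-C, B-C):", reject)
```

With these made-up scores, schools A and B have the same mean, so only the comparisons involving school C are flagged as significant.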

In [None]:
9.
import pandas as pd
import numpy as np
import scipy.stats as stats

# Example data
data = {
    'Diet': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
             'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
             'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
    'Weight_Loss': [3, 4, 5, 2, 3, 4, 5, 6, 3, 4,
                    2, 3, 1, 2, 3, 2, 1, 3, 2, 1,
                    5, 6, 7, 5, 6, 5, 6, 7, 6, 5]
}
df = pd.DataFrame(data)

# Perform one-way ANOVA
f_stat, p_val = stats.f_oneway(df[df['Diet'] == 'A']['Weight_Loss'],
                               df[df['Diet'] == 'B']['Weight_Loss'],
                               df[df['Diet'] == 'C']['Weight_Loss'])

print("F-statistic:", f_stat)
print("P-value:", p_val)

# Interpretation
if p_val < 0.05:
    print("Reject the null hypothesis. There are significant differences between the mean weight loss of the three diets.")
else:
    print("Fail to reject the null hypothesis. No significant difference in mean weight loss was detected among the three diets.")


In [None]:
10.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
data = {
    'Program': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C',
                'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Experience': ['Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice',
                   'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice',
                   'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice',
                   'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice'],
    'Time': [45, 40, 47, 35, 50, 42, 60, 55, 58, 48, 38, 46, 34, 48, 40, 62, 57, 59,
             44, 39, 45, 36, 47, 43, 59, 54, 57, 46, 41, 47, 33, 49, 41, 61, 56, 58]
}
df = pd.DataFrame(data)

# Fit the model
model = ols('Time ~ C(Program) * C(Experience)', data=df).fit()

# ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)


In [None]:
11.
import numpy as np
import scipy.stats as stats

# Example data (seeded so the results are reproducible; the seed value is arbitrary)
rng = np.random.default_rng(42)
control_group = rng.normal(70, 10, 50)
experimental_group = rng.normal(75, 10, 50)

# Perform two-sample t-test
t_stat, p_val = stats.ttest_ind(control_group, experimental_group)

print("T-statistic:", t_stat)
print("P-value:", p_val)

# Interpretation
if p_val < 0.05:
    print("Reject the null hypothesis. There are significant differences in test scores between the control and experimental groups.")
else:
    print("Fail to reject the null hypothesis. No significant difference in test scores was detected between the control and experimental groups.")


In [None]:
12.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Example data: 3 stores observed on the same 9 days (27 sales values).
# Assumption: the sales values are listed in blocks of three days per store,
# so the Store and Day columns are laid out to match that order and every
# column has the same length.
data = {
    'Store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'] * 3,
    'Day': [1, 2, 3] * 3 + [4, 5, 6] * 3 + [7, 8, 9] * 3,
    'Sales': [200, 210, 215, 220, 225, 230, 250, 255, 260,
              205, 215, 220, 225, 235, 240, 255, 260, 265,
              210, 220, 225, 230, 240, 245, 260, 265, 270]
}
df = pd.DataFrame(data)

# Perform repeated measures ANOVA: each day is the repeated unit ("subject")
# and Store is the within-subject factor (every day is observed in all stores)
aovrm = AnovaRM(df, depvar='Sales', subject='Day', within=['Store'])
res = aovrm.fit()
print(res)

# Interpretation: look up the p-value for the Store effect by label
p_val = res.anova_table.loc['Store', 'Pr > F']
if p_val < 0.05:
    print("Reject the null hypothesis. There are significant differences in sales between the three stores.")
else:
    print("Fail to reject the null hypothesis. No significant difference in sales was detected between the three stores.")
