Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact 
the validity of the results.


In [1]:
# ANOVA (Analysis of Variance) is a statistical technique used to compare the means of two or more groups. However, to obtain valid and reliable results 
# from ANOVA, certain assumptions need to be met. Violating these assumptions can lead to misleading or incorrect conclusions. 

# The main assumptions for ANOVA are as follows:

# Independence of Observations: Observations are assumed to be independent of one another, both within and between groups. This means that one data point
# should not influence, or be related to, any other data point.

# Normality: The data within each group should follow a normal distribution. ANOVA is robust to mild departures from normality, especially when sample
# sizes are large. However, severe departures from normality can affect the validity of the results.

# Homogeneity of Variance: The variance of the data in each group should be approximately equal. Homogeneity of variance means that the spread of data
# points should be similar across all groups. If the variances are unequal, particularly when group sizes also differ, the actual Type I error rate of the
# F-test can drift away from the nominal alpha level and the test can lose power.

# Examples of Violations and Their Impact on ANOVA:

# Non-Independence: In a study that takes multiple measurements from the same subject, the independence assumption of a standard (between-groups) ANOVA is
# violated. For example, when the same participants are tested at different time points, the observations within each participant are likely to be correlated,
# so a repeated measures ANOVA or a mixed-effects model should be used instead.

# Non-Normality: If the data within each group deviates significantly from a normal distribution, it may impact the accuracy of p-values and confidence intervals. 
# For instance, if the data is heavily skewed or contains extreme outliers, ANOVA results may not be valid.

# Heterogeneity of Variance: Unequal variance across groups can lead to inflated or deflated F-statistics, affecting the interpretation of significance levels.
# For instance, if the smallest group has a much larger variance than the others, the F-test tends to reject the null hypothesis too often (inflated Type I error).
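The normality and equal-variance assumptions above can be screened with standard tests. The sketch below is a minimal illustration, assuming three small hypothetical groups (the values are invented for the example) and using the Shapiro-Wilk and Levene tests from scipy. Independence is a property of the study design and cannot be checked this way.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three independent groups (illustrative values only)
group_a = np.array([10, 12, 15, 8, 11, 13, 9, 12])
group_b = np.array([20, 18, 22, 19, 23, 21, 20, 17])
group_c = np.array([30, 35, 32, 28, 31, 29, 33, 34])
groups = {"A": group_a, "B": group_b, "C": group_c}

# Normality within each group: Shapiro-Wilk (null hypothesis: the data are normal)
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p

# Homogeneity of variance: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test is a warning sign rather than a verdict; with small samples these tests have little power, so plots of the residuals are a useful complement.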

Q2. What are the three types of ANOVA, and in what situations would each be used?


In [2]:
## One-Way ANOVA:
# One-Way ANOVA is used to compare the means of independent groups (levels) defined by a single categorical independent variable.
# It is typically applied when the variable has three or more levels; with only two levels it is equivalent to an independent two-sample t-test.
# It helps determine if there are any significant differences among the means of the groups.
# For example, it can be used to compare the performance of students from different schools or the effectiveness of different drug treatments.

## Two-Way ANOVA:
# Two-Way ANOVA is used when we have two categorical independent variables and want to analyze their combined effects on a continuous dependent variable. 
# It allows us to investigate main effects (the effect of each independent variable individually) as well as the interaction effect (the joint effect of both 
# independent variables). For example, in a study of the effects of a new drug treatment, we might want to investigate the main effects of dosage and gender, 
# as well as the interaction between dosage and gender.

## Repeated Measures ANOVA:
# Repeated Measures ANOVA is used when we have a single group of subjects measured at multiple time points or under different conditions.
# It allows us to analyze within-subjects effects over time or conditions. This design is commonly used in longitudinal studies or experiments where the
# same subjects are measured under various conditions. For example, in a study on the effect of cognitive training, the same group of participants might 
# be tested before training, after one week of training, and after one month of training.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [4]:
# The partitioning of variance in ANOVA refers to the decomposition of the total variation observed in the data into different components that can be attributed
# to specific sources of variation. These components include the between-group variance and the within-group variance. Understanding this concept is crucial
# because it allows researchers to determine the relative contributions of different factors to the total variation in the data. 

# The partitioning of variance is important for the following reasons:

# Identifying Significant Effects: By comparing the between-group variance with the within-group variance, ANOVA helps determine if there are any significant
# differences among the group means. If the between-group variance is much larger than the within-group variance, it suggests that there are significant differences 
# between the groups.

# Assessing the Magnitude of Effects: Understanding the partitioning of variance allows researchers to assess the magnitude of the effects of the independent
# variable(s) on the dependent variable. Larger between-group variance relative to the within-group variance indicates stronger effects.

# Design and Experimental Planning: Partitioning variance helps in experimental design and planning. Researchers can focus on factors that contribute the most 
# to the overall variation, allowing them to design more efficient experiments and allocate resources more effectively.

# Interpreting Results: Knowing the partitioning of variance helps in interpreting the results of the ANOVA analysis. Researchers can explain how much of the 
# variation is due to group differences and how much is due to individual variability.
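For a one-way design, the decomposition described above can be written out explicitly. The notation here is an assumption introduced for illustration: k groups, n_i observations in group i, N observations in total, group means \bar{y}_i, and grand mean \bar{y}.

```latex
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}_{SS_{\text{within}}},
\qquad
F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
```

The F-statistic is the ratio of the two mean squares, which is why a large between-group component relative to the within-group component points to differences among the group means.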

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual 
sum of squares (SSR) in a one-way ANOVA using Python?

In [8]:
import numpy as np
import scipy.stats as stats

group1 = [10, 12, 15, 8, 11]
group2 = [20, 18, 22, 19, 23]
group3 = [30, 35, 32, 28, 31]

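# Labels follow the question: SSE = explained (between-group), SSR = residual (within-group).
# Many textbooks use SSE for the error term instead, so check the convention before comparing results.
all_data = np.concatenate([group1, group2, group3])
overall_mean = np.mean(all_data)
# Total sum of squares: squared deviations of every observation from the grand mean
sst = np.sum((all_data - overall_mean) ** 2)
group_means = np.array([np.mean(group1), np.mean(group2), np.mean(group3)])
# Explained sum of squares; multiplying by len(group1) is valid only because all groups have the same size
sse = np.sum((group_means - overall_mean) ** 2) * len(group1)
# Residual sum of squares is what remains after removing the explained part
ssr = sst - sse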

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)

Total Sum of Squares (SST): 1072.9333333333334
Explained Sum of Squares (SSE): 1002.1333333333334
Residual Sum of Squares (SSR): 70.79999999999995
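The calculation above relies on equal group sizes. A slightly more general sketch, assuming a hypothetical helper named `one_way_sums_of_squares` (not part of any library), weights each group by its own size and cross-checks the resulting F-statistic against `scipy.stats.f_oneway` on the same three groups.

```python
import numpy as np
from scipy import stats

def one_way_sums_of_squares(*groups):
    """Return (ss_total, ss_between, ss_within) for any number of groups."""
    arrays = [np.asarray(g, dtype=float) for g in groups]
    pooled = np.concatenate(arrays)
    grand_mean = pooled.mean()
    ss_total = np.sum((pooled - grand_mean) ** 2)
    # Weight each group's squared mean deviation by that group's own size
    ss_between = sum(len(a) * (a.mean() - grand_mean) ** 2 for a in arrays)
    # Squared deviations of each observation from its own group mean
    ss_within = sum(np.sum((a - a.mean()) ** 2) for a in arrays)
    return ss_total, ss_between, ss_within

group1 = [10, 12, 15, 8, 11]
group2 = [20, 18, 22, 19, 23]
group3 = [30, 35, 32, 28, 31]

ss_total, ss_between, ss_within = one_way_sums_of_squares(group1, group2, group3)

k = 3                                   # number of groups
n_total = len(group1) + len(group2) + len(group3)
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_scipy = stats.f_oneway(group1, group2, group3)

print("SS total:", ss_total)
print("SS between:", ss_between)
print("SS within:", ss_within)
print("F (manual):", f_manual, "F (scipy):", f_scipy)
```

Computing the within-group term directly, rather than by subtraction, gives an independent check that the two components add up to the total.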


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [13]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = {'GroupA': [10, 12, 15, 8, 11, 20, 18, 22, 19, 23],
        'GroupB': [30, 35, 32, 28, 31, 40, 38, 42, 39, 43],
        'Outcome': [60, 70, 80, 50, 65, 85, 90, 75, 95, 100]}
df = pd.DataFrame(data)
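# Note: GroupA and GroupB are numeric columns, so this formula treats them as continuous predictors.
# For a classical two-way ANOVA the factors should be categorical, e.g. wrapped as C(GroupA) and C(GroupB).
model = ols('Outcome ~ GroupA + GroupB + GroupA:GroupB', data=df).fit()
# anova_lm defaults to Type I (sequential) sums of squares
anova_table = sm.stats.anova_lm(model)
# The values extracted below are sums of squares attributed to each term, not effect estimates
main_effect_groupA = anova_table.loc['GroupA', 'sum_sq']
main_effect_groupB = anova_table.loc['GroupB', 'sum_sq']
interaction_effect = anova_table.loc['GroupA:GroupB', 'sum_sq']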

print("Main Effect for GroupA:", main_effect_groupA)
print("Main Effect for GroupB:", main_effect_groupB)
print("Interaction Effect:", interaction_effect)


Main Effect for GroupA: 1724.9452269170602
Main Effect for GroupB: 2.4781103987525217
Interaction Effect: 177.5667832633544
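For comparison, the sums of squares of a balanced two-way ANOVA with categorical factors can be computed by hand from the marginal and cell means. The sketch below assumes a hypothetical 2 x 2 design with three replicates per cell (all values are invented for illustration) and is valid only for balanced data.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced design: axis 0 = level of factor A, axis 1 = level of factor B,
# axis 2 = replicate within the cell (illustrative values only)
y = np.array([
    [[60, 65, 62], [70, 72, 75]],
    [[80, 78, 83], [85, 90, 88]],
], dtype=float)
a, b, n = y.shape

grand = y.mean()
mean_a = y.mean(axis=(1, 2))   # marginal means of factor A
mean_b = y.mean(axis=(0, 2))   # marginal means of factor B
cell = y.mean(axis=2)          # cell means

ss_a = b * n * np.sum((mean_a - grand) ** 2)
ss_b = a * n * np.sum((mean_b - grand) ** 2)
# Interaction: what the cell means show beyond the two additive main effects
ss_ab = n * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_error = np.sum((y - cell[:, :, None]) ** 2)
ss_total = np.sum((y - grand) ** 2)

df_a, df_b, df_ab, df_error = a - 1, b - 1, (a - 1) * (b - 1), a * b * (n - 1)
ms_error = ss_error / df_error

for name, ss, df in [("A", ss_a, df_a), ("B", ss_b, df_b), ("A:B", ss_ab, df_ab)]:
    f_value = (ss / df) / ms_error
    p_value = stats.f.sf(f_value, df, df_error)   # upper-tail probability of the F distribution
    print(f"{name}: SS={ss:.3f}, F={f_value:.3f}, p={p_value:.4f}")
```

With unbalanced data the terms are no longer orthogonal, and a model-based approach with an explicit choice of sum-of-squares type is the safer route.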


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. 
What can you conclude about the differences between the groups, and how would you interpret these 
results?

In [14]:
# In a one-way ANOVA, the F-statistic is the ratio of the between-group variance (mean square) to the within-group variance (mean square). The p-value
# associated with the F-statistic is the probability of observing an F-statistic at least this large if the null hypothesis (all group means are equal) is true.

# Given the F-statistic of 5.23 and a p-value of 0.02, we can make the following conclusions:

# Statistical Significance:
# The p-value (0.02) is less than the significance level (usually denoted by alpha) typically set at 0.05. This means that there is sufficient evidence to 
# reject the null hypothesis, suggesting that there are significant differences between at least two of the groups.

# Group Differences:
# The F-statistic of 5.23 indicates that the variation between the group means is greater than the variation within the groups. This suggests that there are 
# differences in the means of at least some of the groups. However, the F-statistic does not provide information about which specific groups differ from each other.

# Further Analysis:
# Since the one-way ANOVA test indicates significant differences among the groups, it is appropriate to perform post hoc tests (e.g., Tukey's HSD, Bonferroni,
# or Dunnett's test) to determine which specific groups differ significantly from each other. These post hoc tests will help identify the pairwise comparisons that
# contribute to the significant F-statistic.

# Practical Significance:
# While the results may be statistically significant, it is essential to consider the practical significance of the group differences. A significant difference 
# does not always imply a large or meaningful effect size. Evaluating the effect size will provide insight into the magnitude of the differences among the groups
# and their practical relevance.

# In summary, with an F-statistic of 5.23 and a p-value of 0.02, we can conclude that there are significant differences between at least some of the groups.
# However, further post hoc tests and consideration of effect sizes are necessary to determine the specific group differences and their practical implications.
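To judge practical significance, an effect size such as eta squared can be recovered from the F-statistic and its degrees of freedom. The question does not state the degrees of freedom, so the sketch below assumes 3 groups and 30 observations (2 and 27 degrees of freedom) purely for illustration; the p-value it prints therefore need not equal the 0.02 quoted in the question. `eta_squared_from_f` is a hypothetical helper name.

```python
from scipy import stats

def eta_squared_from_f(f_value, df_between, df_within):
    """Eta squared for a one-way ANOVA: SS_between / SS_total expressed through F."""
    return (f_value * df_between) / (f_value * df_between + df_within)

f_value = 5.23
df_between, df_within = 2, 27   # ASSUMED: 3 groups, 30 observations in total

# Upper-tail probability of the F distribution under the assumed degrees of freedom
p_value = stats.f.sf(f_value, df_between, df_within)
eta_sq = eta_squared_from_f(f_value, df_between, df_within)

print(f"p-value under the assumed df: {p_value:.4f}")
print(f"eta squared: {eta_sq:.3f}")
```

Eta squared is the share of total variation attributable to group membership, which makes it easier to compare across studies than the F-statistic itself.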

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential 
consequences of using different methods to handle missing data?

In [15]:
# Handling missing data in a repeated measures ANOVA is essential to obtain valid and reliable results. The approach you choose to handle missing data can 
# have significant consequences on the validity of your analysis and the interpretation of the results. Here are some common methods for handling missing data 
# in a repeated measures ANOVA and their potential consequences:

# Complete Case Analysis (Listwise Deletion):
# In this method, any participant with missing data on any variable is excluded from the analysis. While it is straightforward, it discards valuable
# information and reduces statistical power. Its estimates are unbiased only if the data are missing completely at random (MCAR); if data are missing at random
# (MAR) or missing not at random (MNAR), this method can introduce bias and make the analysis less representative of the entire sample.

# Mean Imputation:
# Mean imputation involves replacing missing values with the mean of the observed values for that variable. This method can artificially reduce variability and
# underestimate standard errors, leading to an inflated Type I error rate. It also does not account for uncertainty in the imputed values, which can lead to 
# biased estimates and invalid standard errors.

# Last Observation Carried Forward (LOCF) or Next Observation Carried Backward (NOCB):
# LOCF involves using the last observed value to replace missing data, while NOCB uses the next observed value. These methods can introduce bias if there is a 
# trend in the data over time or if missing values are related to treatment response. They may not be suitable for repeated measures ANOVA if there is no reasonable
# assumption that missing data remain constant between measurement points.

# Multiple Imputation:
# Multiple imputation involves creating multiple plausible imputations for each missing value, incorporating uncertainty in the imputed values.
# The analysis is performed on each imputed dataset separately, and the results are combined to obtain unbiased estimates and valid standard errors. 
# Multiple imputation can be computationally intensive but is considered a robust and statistically valid approach to handle missing data.

# Mixed-Effects Models (Longitudinal Data Analysis):
# Mixed-effects models (also known as hierarchical linear models or random-effects models) can handle missing data naturally within the framework of the analysis.
# These models use all available data, including data from participants with some missing time points, and account for individual-level variability. Their
# estimates remain valid when the data are missing at random (MAR), which reduces the impact of missing data on the analysis.
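The simpler strategies above are easy to compare on a toy example. The sketch below assumes a hypothetical wide-format table (one row per participant, one column per time point, invented values) and shows how listwise deletion shrinks the sample while mean imputation shrinks the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: rows are participants, columns are measurement occasions
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 14.0, 13.0, 9.0],
    "t2": [12.0, np.nan, 13.0, 15.0, np.nan, 11.0],
    "t3": [15.0, 16.0, np.nan, 18.0, 17.0, 13.0],
})

# Listwise deletion: keep only participants observed at every time point
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with that column's observed mean
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each participant's last observed value forward across time points
locf = scores.ffill(axis=1)

print("Participants before / after listwise deletion:", len(scores), "/", len(complete_cases))
print("Variance of t2, observed values only:", scores["t2"].var())
print("Variance of t2 after mean imputation:", mean_imputed["t2"].var())
print(locf)
```

Half of the participants are lost to listwise deletion here, and the imputed column is less variable than the observed data, which is the mechanism behind the understated standard errors described above.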

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide 
an example of a situation where a post-hoc test might be necessary.

In [16]:
# After obtaining a significant result from ANOVA (Analysis of Variance), post-hoc tests are used to compare specific pairs of groups to determine which groups
# differ significantly from each other. Post-hoc tests are necessary when ANOVA indicates that there are overall group differences, but it does not specify which
# specific groups are different from each other. Some common post-hoc tests include:

# Tukey's Honestly Significant Difference (HSD) Test:
# Tukey's HSD test is used to compare all possible pairs of group means. It controls the familywise error rate, making it suitable for multiple
# pairwise comparisons. It was designed for equal sample sizes (a balanced design); the Tukey-Kramer variant extends it to unequal group sizes.

# Bonferroni Correction:
# Bonferroni correction is a conservative method that adjusts the significance level for each individual comparison. It divides the desired alpha level by the 
# number of comparisons being made to control the familywise error rate. It is suitable for situations where the number of pairwise comparisons is relatively small.

# Scheffe's Method:
# Scheffe's method is a more conservative approach that is used when testing complex contrasts (not only pairwise comparisons) or when the sample sizes are
# unequal. It offers robust control of the familywise error rate but typically has wider confidence intervals than Tukey's HSD for pairwise comparisons.

# Dunn's Test (for Nonparametric Data):
# If the data do not meet the assumptions of normality or homogeneity of variance, a nonparametric test such as Kruskal-Wallis can replace ANOVA, and Dunn's
# test can be used as its post-hoc follow-up. Dunn's test is rank-based, so it is also suitable for situations where the data are ranked or ordinal.

# Example Situation:
# Suppose you conducted a study to compare the effectiveness of four different treatment methods (A, B, C, and D) for pain relief. You collected pain intensity
# scores from participants in each treatment group. After performing one-way ANOVA, you find that there is a significant difference among the four 
# treatment groups (p < 0.05).

# Now, you want to determine which specific treatment groups differ significantly from each other. To do this, you would use a post-hoc test such as Tukey's HSD,
# Bonferroni correction, or Scheffe's method. These post-hoc tests would allow you to make pairwise comparisons between the treatment groups and identify which 
# treatments are significantly different in terms of pain relief.

# For example, you might find that Treatment A and Treatment B have significantly lower pain intensity scores than Treatment C and Treatment D. 
# However, there may be no significant difference between Treatment A and Treatment B. The post-hoc test provides the necessary information to 
# interpret the differences between the groups and make more detailed comparisons beyond the overall ANOVA result.
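As a concrete illustration of the pain-relief scenario, the sketch below applies a Bonferroni correction to all pairwise t-tests. The pain intensity scores are hypothetical values invented for the example, and the adjustment is done by multiplying each raw p-value by the number of comparisons (capped at 1), which is equivalent to dividing alpha.

```python
from itertools import combinations
from scipy import stats

# Hypothetical pain intensity scores for four treatments (illustrative values only)
treatments = {
    "A": [3, 4, 2, 3, 4, 3],
    "B": [4, 3, 3, 4, 2, 3],
    "C": [6, 7, 5, 6, 7, 6],
    "D": [7, 6, 8, 7, 6, 7],
}

pairs = list(combinations(treatments, 2))
n_comparisons = len(pairs)   # 4 groups give 6 pairwise comparisons

adjusted = {}
for first, second in pairs:
    _, p_raw = stats.ttest_ind(treatments[first], treatments[second])
    # Bonferroni: scale the raw p-value by the number of comparisons, capped at 1
    adjusted[(first, second)] = min(p_raw * n_comparisons, 1.0)

for pair, p_adj in adjusted.items():
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(pair, round(p_adj, 4), verdict)
```

In this made-up data set A and B have identical means, so that pair stays non-significant, while comparisons against C or D do not.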

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python 
to determine if there are any significant differences between the mean weight loss of the three diets. 
Report the F-statistic and p-value, and interpret the results.

In [19]:
import numpy as np
import scipy.stats as stats

diet_A = [2, 3, 4, 3, 5, 4, 6, 3, 4, 5, 3, 2, 4, 5, 3, 4, 5, 4, 6, 3, 4, 5, 6, 4, 3, 5, 6, 4, 3, 2, 4, 3, 5, 6, 5, 4, 3, 2, 4, 3, 5, 4, 3, 2, 4, 3, 5, 6, 4, 3]
diet_B = [3, 4, 5, 4, 6, 5, 7, 4, 5, 6, 4, 3, 5, 6, 4, 5, 6, 5, 7, 4, 5, 6, 7, 5, 4, 6, 7, 5, 4, 3, 5, 4, 6, 7, 6, 5, 4, 3, 5, 4, 6, 5, 4, 3, 5, 4, 3, 5, 6, 5, 4]
diet_C = [4, 5, 6, 5, 7, 6, 8, 5, 6, 7, 5, 4, 6, 7, 5, 6, 7, 6, 8, 5, 6, 7, 8, 6, 5, 7, 8, 6, 5, 4, 6, 5, 7, 8, 7, 6, 5, 4, 6, 5, 7, 6, 5, 4, 6, 5, 4, 6, 7, 6, 5]

# Note: the illustrative lists above hold 50, 51, and 51 values, so the sample is larger than the 50 participants mentioned in the question.
# f_oneway takes each group's observations as a separate argument.
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

print("F-Statistic:", f_statistic)
print("P-value:", p_value)

F-Statistic: 34.35319172893017
P-value: 5.382392031934751e-13

# Interpretation: With F ≈ 34.35 and p ≈ 5.4e-13, far below the 0.05 significance level, we reject the null hypothesis that the three diets produce the same
# mean weight loss. At least one diet's mean differs from the others; a post-hoc test such as Tukey's HSD is needed to identify which pairs differ.


Q10. A company wants to know if there are any significant differences in the average time it takes to 
complete a task using three different software programs: Program A, Program B, and Program C. They 
randomly assign 30 employees to one of the programs and record the time it takes each employee to 
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or 
interaction effects between the software programs and employee experience level (novice vs. 
experienced). Report the F-statistics and p-values, and interpret the results.

Q11. An educational researcher is interested in whether a new teaching method improves student test 
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the 
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a 
two-sample t-test using Python to determine if there are any significant differences in test scores 
between the two groups. If the results are significant, follow up with a post-hoc test to determine which 
group(s) differ significantly from each other.

In [27]:
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
control_group_scores = [75, 78, 80, 82, 70, 72, 77, 76, 79, 74, 73, 75, 80, 71, 75, 76, 79, 82, 78, 81, 75, 79, 77, 78, 80, 76, 74, 78, 72, 79]
experimental_group_scores = [85, 82, 84, 88, 83, 86, 87, 89, 81, 84, 85, 83, 82, 86, 87, 88, 85, 82, 86, 87, 83, 89, 84, 83, 87, 86, 85, 84, 88, 89]

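# Perform an independent two-sample t-test (ttest_ind assumes equal variances by default;
# pass equal_var=False for Welch's t-test if the group variances differ)
t_statistic, p_value = stats.ttest_ind(control_group_scores, experimental_group_scores)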

print("Two-Sample T-Test Results:")
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)
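# With only two groups, a significant t-test already identifies which groups differ;
# Tukey's HSD is run here only because the question asks for a post-hoc follow-up.
if p_value < 0.05: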
    all_scores = np.concatenate([control_group_scores, experimental_group_scores])
    group_labels = ['Control Group'] * len(control_group_scores) + ['Experimental Group'] * len(experimental_group_scores)

    tukey_results = pairwise_tukeyhsd(all_scores, group_labels)

    print("\nTukey's HSD Post-Hoc Test Results:")
    print(tukey_results)


Two-Sample T-Test Results:
T-Statistic: -11.856700966043
P-Value: 3.8870232625941334e-17

Tukey's HSD Post-Hoc Test Results:
         Multiple Comparison of Means - Tukey HSD, FWER=0.05         
    group1          group2       meandiff p-adj lower   upper  reject
---------------------------------------------------------------------
Control Group Experimental Group   8.5667   0.0 7.1204 10.0129   True
---------------------------------------------------------------------
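With two groups, an effect size is usually more informative than a post-hoc test. The sketch below reuses the same two score lists to run Welch's t-test (which does not assume equal variances) and to compute Cohen's d with a pooled standard deviation; `cohens_d` is a hypothetical helper written for this example.

```python
import numpy as np
from scipy import stats

control_group_scores = [75, 78, 80, 82, 70, 72, 77, 76, 79, 74, 73, 75, 80, 71, 75, 76, 79, 82, 78, 81, 75, 79, 77, 78, 80, 76, 74, 78, 72, 79]
experimental_group_scores = [85, 82, 84, 88, 83, 86, 87, 89, 81, 84, 85, 83, 82, 86, 87, 88, 85, 82, 86, 87, 83, 89, 84, 83, 87, 86, 85, 84, 88, 89]

def cohens_d(sample_1, sample_2):
    """Standardized mean difference (sample_2 minus sample_1) using the pooled standard deviation."""
    x1 = np.asarray(sample_1, dtype=float)
    x2 = np.asarray(sample_2, dtype=float)
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    return (x2.mean() - x1.mean()) / np.sqrt(pooled_var)

# Welch's t-test: equal_var=False drops the equal-variance assumption
welch_t, welch_p = stats.ttest_ind(control_group_scores, experimental_group_scores, equal_var=False)

mean_difference = np.mean(experimental_group_scores) - np.mean(control_group_scores)
effect_size = cohens_d(control_group_scores, experimental_group_scores)

print("Welch t-statistic:", welch_t)
print("Welch p-value:", welch_p)
print("Mean difference (experimental - control):", mean_difference)
print("Cohen's d:", effect_size)
```

The mean difference matches the `meandiff` column of the Tukey table above, and Cohen's d expresses that difference in units of the pooled standard deviation.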


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three 
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store 
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any 
significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [29]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

store_a_sales = [100, 120, 130, 110, 90, 100, 80, 140, 130, 120, 110, 95, 105, 125, 115, 135, 115, 125, 105, 100, 110, 115, 105, 120, 110, 90, 100, 105, 115, 120]
store_b_sales = [85, 95, 105, 90, 75, 85, 65, 110, 100, 95, 90, 80, 90, 105, 100, 115, 100, 105, 90, 85, 90, 100, 90, 105, 95, 75, 85, 90, 100, 105]
store_c_sales = [120, 140, 150, 130, 110, 120, 100, 160, 150, 140, 130, 115, 125, 145, 135, 155, 135, 145, 125, 120, 130, 135, 125, 140, 130, 110, 120, 125, 135, 140]

data = pd.DataFrame({'Sales': store_a_sales + store_b_sales + store_c_sales,
                     'Store': ['Store A'] * 30 + ['Store B'] * 30 + ['Store C'] * 30})

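# Note: this fits a one-way (between-groups) ANOVA, which treats the 90 observations as independent.
# Because the same 30 days are measured for every store, a repeated measures ANOVA would also account for the day (subject) effect.
model = ols('Sales ~ C(Store)', data=data).fit()
anova_table = sm.stats.anova_lm(model)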
F_statistic = anova_table.loc['C(Store)', 'F']
p_value = anova_table.loc['C(Store)', 'PR(>F)']

print("One-Way ANOVA Results:")
print("F-Statistic:", F_statistic)
print("P-Value:", p_value)
if p_value < 0.05:
    tukey_results = pairwise_tukeyhsd(data['Sales'], data['Store'])

    print("\nTukey's HSD Post-Hoc Test Results:")
    print(tukey_results)


One-Way ANOVA Results:
F-Statistic: 63.04010695187157
P-Value: 1.1952578128754204e-17

Tukey's HSD Post-Hoc Test Results:
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1  group2 meandiff p-adj  lower    upper  reject
------------------------------------------------------
Store A Store B    -18.0   0.0 -26.0734 -9.9266   True
Store A Store C     20.0   0.0  11.9266 28.0734   True
Store B Store C     38.0   0.0  29.9266 46.0734   True
------------------------------------------------------
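Because every store is observed on the same 30 days, the design is within-subjects, with the day playing the role of the subject. The sketch below is one possible way to run that analysis, assuming statsmodels' `AnovaRM` class and a long-format table with an added `Day` identifier (an assumption introduced here); it reuses the sales figures from the cell above.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

store_a_sales = [100, 120, 130, 110, 90, 100, 80, 140, 130, 120, 110, 95, 105, 125, 115, 135, 115, 125, 105, 100, 110, 115, 105, 120, 110, 90, 100, 105, 115, 120]
store_b_sales = [85, 95, 105, 90, 75, 85, 65, 110, 100, 95, 90, 80, 90, 105, 100, 115, 100, 105, 90, 85, 90, 100, 90, 105, 95, 75, 85, 90, 100, 105]
store_c_sales = [120, 140, 150, 130, 110, 120, 100, 160, 150, 140, 130, 115, 125, 145, 135, 155, 135, 145, 125, 120, 130, 135, 125, 140, 130, 110, 120, 125, 135, 140]

# Long format: one row per (day, store) pair; Day identifies the repeated unit
days = list(range(1, 31))
long_df = pd.DataFrame({
    "Day": days * 3,
    "Store": ["Store A"] * 30 + ["Store B"] * 30 + ["Store C"] * 30,
    "Sales": store_a_sales + store_b_sales + store_c_sales,
})

# Store is the within-subject factor; each day contributes exactly one value per store
rm_results = AnovaRM(data=long_df, depvar="Sales", subject="Day", within=["Store"]).fit()
rm_table = rm_results.anova_table

print(rm_table)
```

Removing day-to-day variation from the error term generally makes this test more sensitive than the one-way ANOVA above. Pairwise follow-ups for a within-subjects design are usually paired comparisons with a multiplicity correction rather than Tukey's HSD on pooled data.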
