In [1]:
# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
# the validity of the results.

In [2]:
# Analysis of Variance (ANOVA) is a statistical method used to compare means across multiple groups.
# The validity of ANOVA results relies on certain assumptions. Here are the key assumptions and examples
# of violations that could impact the validity of ANOVA results:

# ### Assumptions of ANOVA:

# 1. **Normality:** The data within each group should be approximately normally distributed.
   
# 2. **Homogeneity of Variances (Homoscedasticity):** The variances of the groups should be roughly equal.

# 3. **Independence:** Observations within each group should be independent of each other.

# ### Examples of Violations:

# 1. **Non-Normality:**
#    - **Issue:** If the assumption of normality is violated, it may affect the accuracy of p-values and confidence intervals.
#    - **Example:** A group with skewed or heavily-tailed data.

# 2. **Heteroscedasticity:**
#    - **Issue:** Unequal variances can lead to inaccurate F-statistics and affect the validity of ANOVA results.
#    - **Example:** One group has much larger variability than the others.

# 3. **Lack of Independence:**
#    - **Issue:** Violation of independence can lead to biased estimates and inaccurate standard errors.
#    - **Example:** Data points within a group are correlated or dependent.

# ### Mitigating Strategies:

# 1. **Transformations:** If normality is violated, transformations (e.g., log transformation) may bring the data closer to normality and can also help stabilize variances.

# 2. **Use of Robust Tests:** For cases of unequal variances, robust ANOVA tests (e.g., Welch's ANOVA) can be considered.

# 3. **Randomized Experimental Design:** Ensure that the experimental design involves random assignment and independent observations.

# 4. **Nonparametric Alternatives:** If assumptions cannot be met, nonparametric alternatives like the Kruskal-Wallis test can be considered.

# ### Important Note:
# Violations of assumptions do not always have serious consequences: ANOVA is fairly robust to moderate non-normality,
# especially when group sizes are equal and reasonably large. However, it's crucial to assess the degree of violation
# and consider alternative methods if needed.

# Always examine the residuals, consider data transformations, and explore robust methods when assumptions are in doubt.

# It's recommended to tailor the approach based on the specific characteristics of the data and the context of the analysis.
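
# The checks described above can be sketched as follows. This is a minimal illustration on simulated data, not a full
# diagnostic workflow: the group means, spread, sample sizes and seed are assumptions made up for the example.
# It runs Shapiro-Wilk on the residuals (normality) and Levene's test across the groups (homogeneity of variances).

```python
import numpy as np
from scipy.stats import levene, shapiro

# Hypothetical example data: three independent groups of 30 observations each
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=2.0, size=30) for m in (10.0, 12.0, 11.0)]

# Normality: test the residuals (each value minus its own group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = shapiro(residuals)

# Homogeneity of variances: Levene's test, median-centered variant
levene_stat, levene_p = levene(*groups, center='median')

print(f"Shapiro-Wilk on residuals: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Levene's test: W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

# Small p-values in either test point to a possible violation; independence cannot be tested this way and has to come
# from the study design.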

In [3]:
# Q2. What are the three types of ANOVA, and in what situations would each be used?

In [4]:
# Analysis of Variance (ANOVA) can be categorized into three main types based on the experimental design and the number of factors involved:

# ### 1. One-Way ANOVA:
#    - **Use Case:**
#      - **Situation:** When comparing means across two or more independent groups (levels) for a single factor (independent variable).
#      - **Example:** Examining the impact of different teaching methods on student test scores, where students are divided into several groups,
#     each taught by a different method.

# ### 2. Two-Way ANOVA:
#    - **Use Case:**
#      - **Situation:** When comparing means across two independent factors (independent variables), considering the main effects of each factor
#         and their interaction.
#      - **Example:** Investigating the influence of both gender and educational background on exam performance. This involves two independent 
#     variables: gender (male/female) and educational background (science/humanities).

# ### 3. Three-Way (and Higher) ANOVA:
#    - **Use Case:**
#      - **Situation:** When comparing means across three or more independent factors (independent variables), considering the main effects of 
#         each factor and their interactions.
#      - **Example:** Analyzing the impact of factors like temperature, humidity, and light on the growth of plants. Each of these factors has
#     multiple levels, leading to a three-way or higher ANOVA.

# ### Key Points:
# - **Main Effects:** In ANOVA, "main effects" refer to the influence of each individual factor on the dependent variable.
# - **Interactions:** ANOVA allows the assessment of interactions between factors, revealing whether the combined effect of factors is different 
# from the sum of their individual effects.

# ### Considerations:
# - **Post hoc Tests:** Following ANOVA, post hoc tests (e.g., Tukey's HSD) may be employed to identify specific group differences.
# - **Assumptions:** ANOVA assumes normality, homogeneity of variances, and independence of observations.

# ### Summary:
# - **One-Way ANOVA:** Compares means across multiple independent groups for a single factor.
# - **Two-Way ANOVA:** Compares means across two independent factors, considering their main effects and interaction.
# - **Three-Way (and Higher) ANOVA:** Extends the analysis to three or more independent factors.

# The choice of ANOVA type depends on the experimental design and the number of factors under investigation.
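
# As a rough sketch, the three designs differ mainly in how many factors enter the model formula. The example below
# uses a made-up balanced dataset (the column names temperature, humidity, light and growth, and all values, are
# assumptions for illustration) and fits one-, two- and three-way models with statsmodels.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical full-factorial design: 2 x 2 x 2 factor levels with 5 replicates per cell
rng = np.random.default_rng(1)
levels = ['low', 'high']
design = pd.MultiIndex.from_product(
    [levels, levels, levels, range(5)],
    names=['temperature', 'humidity', 'light', 'rep'],
).to_frame(index=False)
design['growth'] = rng.normal(loc=20, scale=3, size=len(design))

# One factor: a single main effect
one_way = anova_lm(ols('growth ~ C(temperature)', data=design).fit())

# Two factors: two main effects plus their interaction
two_way = anova_lm(ols('growth ~ C(temperature) * C(humidity)', data=design).fit())

# Three factors: three main effects, three two-way interactions, one three-way interaction
three_way = anova_lm(
    ols('growth ~ C(temperature) * C(humidity) * C(light)', data=design).fit()
)

print(one_way, two_way, three_way, sep='\n\n')
```

# Each table has one row per model term plus a Residual row, so the number of rows grows with the number of factors.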

In [5]:
# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [6]:
# The partitioning of variance in Analysis of Variance (ANOVA) refers to the process of decomposing the total variability
# observed in the dependent variable into different components or sources of variation. Understanding this concept is crucial
# for gaining insights into the factors that contribute to the variability in the data and for assessing the significance of
# these factors in explaining the observed differences in means.

# In a one-way ANOVA, the total variability (measured as a sum of squares) is split into two components, which add up to the total:

# 1. **Between-Group Variation (SSB or SSBetween):**
#    - Represents the variability among the group means.
#    - Measures the extent to which the group means differ from the overall mean.
#    - If this component is large relative to the within-group variation, it suggests that the groups have different means.

# 2. **Within-Group Variation (SSW or SSWithin):**
#    - Represents the variability within each group.
#    - Measures the spread of individual scores around their respective group means.
#    - If this component is small, it indicates that observations within groups are relatively consistent.

# 3. **Total Variation (SST or SSTotal):**
#    - Represents the overall variability in the entire dataset.
#    - Equals the sum of the between-group and within-group sums of squares: SST = SSB + SSW.

# Understanding the partitioning of variance is essential for several reasons:

# - **Hypothesis Testing:** ANOVA tests whether the means of the groups are significantly different. The partitioning helps 
# in assessing the contribution of between-group variability relative to within-group variability.

# - **Effect Size:** By examining the proportion of total variance attributed to between-group variance, researchers can determine
# the practical significance or effect size of the observed differences.

# - **Model Evaluation:** Understanding how the total variance is divided allows researchers to evaluate the effectiveness of the 
# model in explaining the variability in the data.

# - **Identifying Sources of Variation:** Researchers can identify which factors or independent variables contribute significantly 
# to the observed differences in means.

# - **Assumptions:** Understanding variance components is linked to ANOVA assumptions, such as homogeneity of variances and normality, 
# as they relate to the reliability of the ANOVA results.

# In summary, the partitioning of variance in ANOVA provides a structured way to analyze and interpret sources of variability in the 
# data, aiding researchers in drawing meaningful conclusions about the effects of different factors on the dependent variable.
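
# The partition can be checked numerically. The sketch below is only an illustration on three small example groups:
# it computes the between-group, within-group and total sums of squares by hand, derives the F-statistic and
# eta squared (SSB / SST) from them, and compares the F-statistic with scipy's f_oneway.

```python
import numpy as np
from scipy.stats import f_oneway

# Small example data: three groups of five observations each
groups = [np.array([4, 5, 6, 7, 8]),
          np.array([9, 10, 11, 12, 13]),
          np.array([14, 15, 16, 17, 18])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), all_values.size

# Between-group: squared distance of each group mean from the grand mean, weighted by group size
ssb = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group: squared distance of each observation from its own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total: squared distance of each observation from the grand mean
sst = ((all_values - grand_mean) ** 2).sum()

# F is the ratio of the between-group to the within-group mean square
f_manual = (ssb / (k - 1)) / (ssw / (n_total - k))
eta_squared = ssb / sst
f_scipy, p_scipy = f_oneway(*groups)

print("SSB:", ssb, "SSW:", ssw, "SST:", sst)
print("F (manual):", f_manual, "F (scipy):", f_scipy)
print("Eta squared:", eta_squared)
```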

In [7]:
# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
# sum of squares (SSR) in a one-way ANOVA using Python?

In [8]:
import pandas as pd
import numpy as np
from scipy import stats

# Example dataset (replace this with your dataset)
data = {'Group1': [4, 5, 6, 7, 8],
        'Group2': [9, 10, 11, 12, 13],
        'Group3': [14, 15, 16, 17, 18]}

df = pd.DataFrame(data)

# Flatten the data for ANOVA
flat_data = df.values.flatten()

# Calculate overall mean
overall_mean = np.mean(flat_data)

# Calculate Total Sum of Squares (SST)
sst = np.sum((flat_data - overall_mean)**2)

# Calculate Explained Sum of Squares (SSE): squared deviation of each group mean
# from the overall mean, weighted by the number of observations in the group
group_means = df.mean()
n_per_group = len(df)  # rows per column; the groups are balanced here
sse = np.sum((group_means - overall_mean)**2 * n_per_group)

# Calculate Residual Sum of Squares (SSR): squared deviations from each group's own mean
ssr = ((df - group_means)**2).to_numpy().sum()

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)


Total Sum of Squares (SST): 280.0
Explained Sum of Squares (SSE): 250.0
Residual Sum of Squares (SSR): 30.0


In [9]:
# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [23]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example dataset (replace this with your dataset).
# Every A x B cell needs at least two observations: with a single value per cell
# the interaction model is saturated and no residual variance is left for the F-tests.
data = {'A': ['A1', 'A1', 'A2', 'A2', 'A3', 'A3'] * 2,
        'B': ['B1', 'B2', 'B1', 'B2', 'B1', 'B2'] * 2,
        'Value': [10, 12, 15, 18, 8, 10,    # first observation per cell
                  11, 13, 14, 19, 9, 11]}   # second observation per cell (illustrative values)

df = pd.DataFrame(data)

# Fit a two-way ANOVA model with both main effects and their interaction
formula = 'Value ~ A + B + A:B'
model = ols(formula, data=df).fit()

# Print the ANOVA table; the F and PR(>F) columns test each main effect and the interaction
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# Mean squares (sum of squares / degrees of freedom) for each term
mean_square_A = anova_table.loc['A', 'sum_sq'] / anova_table.loc['A', 'df']
mean_square_B = anova_table.loc['B', 'sum_sq'] / anova_table.loc['B', 'df']
mean_square_interaction = anova_table.loc['A:B', 'sum_sq'] / anova_table.loc['A:B', 'df']

print("Mean Square A:", mean_square_A)
print("Mean Square B:", mean_square_B)
print("Mean Square Interaction:", mean_square_interaction)


In [24]:
# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
# What can you conclude about the differences between the groups, and how would you interpret these
# results?

In [25]:
# In a one-way ANOVA, the F-statistic tests the null hypothesis that the means of the groups are equal. 
# The obtained F-statistic and p-value provide information on whether there are significant differences between the group means.

# Here's how to interpret the results:

# 1. **F-Statistic:**
#    - The F-statistic measures the ratio of variance between the groups to variance within the groups. 
#     A larger F-statistic suggests greater differences between the group means.

# 2. **P-Value:**
#    - The p-value associated with the F-statistic indicates the probability of obtaining the observed results
#     (or more extreme) if the null hypothesis (equal group means) is true.
#    - In this case, the p-value is 0.02, which is less than the common significance level of 0.05.

# **Interpretation:**
#    - Since the p-value (0.02) is less than the chosen significance level (e.g., 0.05), we reject the null hypothesis.
#    - This suggests that there are significant differences between at least two group means.

# **Conclusion:**
#    - There is evidence to suggest that the means of the groups are not all equal. However, the ANOVA test itself does 
#     not tell us which specific groups differ from each other.
#    - If the ANOVA is followed by post-hoc tests (e.g., Tukey's HSD, Bonferroni), you can identify which groups have significantly different means.

# In summary, the results of the one-way ANOVA suggest that there are statistically significant differences between the groups.
# Further post-hoc analysis would be needed to identify the specific groups that differ.
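
# The link between an F-statistic and its p-value can be sketched with scipy's F distribution. The question does not
# state the degrees of freedom, so the values below (3 groups, 30 observations) are assumptions for illustration only,
# and the printed p-value will not necessarily match the 0.02 reported in the question.

```python
from scipy.stats import f

f_statistic = 5.23
df_between = 2    # assumed: 3 groups, so k - 1 = 2
df_within = 27    # assumed: 30 observations, so N - k = 27

# p-value: probability of an F at least this large if all group means are equal
p_value = f.sf(f_statistic, df_between, df_within)

# Critical value at the 5% significance level for comparison
critical_value = f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha = 0.05: {critical_value:.3f}")
print("Reject H0:", f_statistic > critical_value)
```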
                                                 

In [26]:
# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
# consequences of using different methods to handle missing data?

In [27]:
# Handling missing data in a repeated measures ANOVA requires careful consideration, and different methods can have various
# consequences. Here are common approaches and their potential consequences:

# 1. **Complete Case Analysis (CCA):**
#    - **Method:** Exclude cases with missing data from the analysis.
#    - **Consequences:**
#      - Reduces sample size, potentially leading to loss of statistical power.
#      - May introduce bias if missing data are not completely at random (i.e., if there is a systematic pattern to the missingness).

# 2. **Mean Imputation:**
#    - **Method:** Replace missing values with the mean of the observed values for that variable.
#    - **Consequences:**
#      - Preserves sample size but may underestimate the variability in the data.
#      - Assumes that missing values are missing completely at random and assumes that imputed values are typical.

# 3. **Last Observation Carried Forward (LOCF):**
#    - **Method:** Use the last available observation for the missing value.
#    - **Consequences:**
#      - Assumes that the last observation is a good estimate of the missing value, which may not be valid in dynamic or changing conditions.
#      - Can introduce bias, especially if missing values occur systematically over time.

# 4. **Multiple Imputation:**
#    - **Method:** Generate multiple sets of imputed values, creating a distribution of possible values.
#    - **Consequences:**
#      - Preserves sample size and accounts for uncertainty in imputed values.
#      - Requires advanced statistical techniques and assumes that missing data are missing at random.

# 5. **Interpolation or Extrapolation:**
#    - **Method:** Estimate missing values based on observed trends or patterns.
#    - **Consequences:**
#      - Can be useful if missing values follow a clear pattern, but may be unreliable if patterns are complex or subject to change.

# **Considerations:**
# - Always assess the nature of missing data. If missingness is related to the outcome or other variables, it could introduce bias.
# - Choose the method based on the assumptions it makes and the nature of the data.
# - Sensitivity analyses can be performed to assess the impact of different methods on results.

# In summary, there is no one-size-fits-all solution for handling missing data in repeated measures ANOVA. The choice of method should
# be guided by the characteristics of the data and the assumptions of each approach. Multiple imputation is often preferred when data
# are missing at random (MAR) rather than completely at random, because it keeps all cases and reflects the uncertainty of the imputed values.
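
# Three of the simpler approaches above can be sketched with pandas. The data frame below is a made-up wide-format
# example (one row per subject, one column per time point; the column names t1, t2, t3 are assumptions), and the
# snippet only shows the mechanics, not which method is appropriate for a given dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data with two missing values
wide = pd.DataFrame({
    't1': [5.0, 6.0, 7.0, 8.0],
    't2': [5.5, np.nan, 7.5, 8.5],
    't3': [6.0, 6.5, np.nan, 9.0],
})

# Complete case analysis: drop every subject with at least one missing value
complete_cases = wide.dropna()

# Mean imputation: fill each gap with the mean of that time point
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry the last observed value forward within each subject (across columns)
locf = wide.ffill(axis=1)

print("Subjects kept by complete case analysis:", len(complete_cases))
print(mean_imputed)
print(locf)
```

# Note how complete case analysis halves the sample here, while both imputation variants keep all four subjects but
# replace the gaps with values that carry no extra uncertainty.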

In [28]:
# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
# an example of a situation where a post-hoc test might be necessary.

In [29]:
# After conducting an Analysis of Variance (ANOVA) and finding a significant difference among group means, post-hoc tests are often 
# employed to identify specific pairs of groups that differ from each other. Some common post-hoc tests include:

# 1. **Tukey's Honestly Significant Difference (HSD):**
#    - **Use:** Used when there are three or more groups. It controls the overall Type I error rate.
#    - **Example:** After conducting an ANOVA comparing the mean scores of multiple teaching methods, Tukey's HSD can identify which
# specific pairs of methods have significantly different means.

# 2. **Bonferroni Correction:**
#    - **Use:** Suitable when making multiple comparisons to control the familywise error rate.
#    - **Example:** In genetics, comparing multiple genes for differential expression, the Bonferroni correction adjusts the significance
# level for each individual test to maintain an overall alpha level.

# 3. **Scheffe's Test:**
#    - **Use:** A conservative post-hoc test that works with unequal sample sizes and covers any contrast, not only pairwise comparisons. It still assumes equal group variances.
#    - **Example:** In a study comparing the effectiveness of various treatments on patients with different medical conditions, Scheffe's 
# test can identify specific pairs of treatments with significantly different effects.

# 4. **Dunnett's Test:**
#    - **Use:** Specifically designed for comparing each treatment group against a single control group.
#    - **Example:** In a drug trial with multiple experimental groups and a control group, Dunnett's test can be used to identify which 
# experimental groups show significant differences from the control.

# 5. **Holm's Method:**
#    - **Use:** A step-down procedure that controls the familywise error rate like Bonferroni but can be less conservative.
#    - **Example:** In a study comparing the performance of different software algorithms, Holm's method can be used to identify pairs of 
# algorithms with significantly different processing times.

# **Example Scenario:**
# Suppose a study investigates the effect of different types of exercise on cardiovascular fitness. After conducting an ANOVA and finding
# a significant difference among the means of at least three exercise types, a post-hoc test (such as Tukey's HSD) would be necessary
# to identify which specific pairs of exercise types have significantly different effects on cardiovascular fitness.

# Post-hoc tests are crucial in situations with multiple groups to pinpoint where the differences lie, ensuring a more detailed
# understanding of the relationships between groups.
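
# The exercise scenario can be sketched in code. The data are simulated (the exercise names, means, spread, group size
# and seed are all assumptions for illustration). The snippet runs Tukey's HSD with statsmodels and, for comparison,
# adjusts pairwise t-test p-values with the Bonferroni and Holm procedures.

```python
import itertools
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Hypothetical fitness scores for three exercise types, 20 participants each
rng = np.random.default_rng(0)
groups = {
    'Walking': rng.normal(30, 5, 20),
    'Running': rng.normal(38, 5, 20),
    'Cycling': rng.normal(33, 5, 20),
}
scores = np.concatenate(list(groups.values()))
labels = np.repeat(list(groups.keys()), 20)

# Tukey's HSD: all pairwise comparisons with familywise error control
tukey = pairwise_tukeyhsd(scores, labels, alpha=0.05)
print(tukey)

# Pairwise t-tests with Bonferroni and Holm adjustments
pairs = list(itertools.combinations(groups, 2))
raw_p = [ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]
reject_bonf, p_bonf, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')
reject_holm, p_holm, _, _ = multipletests(raw_p, alpha=0.05, method='holm')

for (a, b), pb, ph in zip(pairs, p_bonf, p_holm):
    print(f"{a} vs {b}: Bonferroni p = {pb:.4f}, Holm p = {ph:.4f}")
```

# Holm-adjusted p-values are never larger than the Bonferroni-adjusted ones, which is why Holm is described as less conservative.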

In [30]:
# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
# 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
# to determine if there are any significant differences between the mean weight loss of the three diets.
# Report the F-statistic and p-value, and interpret the results.

In [31]:
import numpy as np
from scipy.stats import f_oneway

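# Simulated weight loss data for three diets (replace with actual data).
# Note: this draws 50 values per diet (150 in total), whereas the question describes 50 participants overall,
# and no random seed is set, so the numbers change from run to run.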
diet_A = np.random.normal(loc=3, scale=1, size=50)
diet_B = np.random.normal(loc=4, scale=1, size=50)
diet_C = np.random.normal(loc=2.5, scale=1, size=50)

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

# Print results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Interpret results
if p_value < 0.05:
    print("There is a significant difference in mean weight loss among the three diets.")
else:
    print("There is no significant difference in mean weight loss among the three diets.")


F-statistic: 48.59084087352546
p-value: 6.322847192092407e-17
There is a significant difference in mean weight loss among the three diets.


In [32]:
# Q10. A company wants to know if there are any significant differences in the average time it takes to
# complete a task using three different software programs: Program A, Program B, and Program C. They
# randomly assign 30 employees to one of the programs and record the time it takes each employee to
# complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
# interaction effects between the software programs and employee experience level (novice vs.
# experienced). Report the F-statistics and p-values, and interpret the results.

In [33]:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Simulated data (replace with actual data).
# Note: this simulates 90 observations, whereas the question describes 30 employees.
np.random.seed(42)  # for reproducibility
data = pd.DataFrame({
    'Program': np.random.choice(['A', 'B', 'C'], size=90),
    'Experience': np.random.choice(['Novice', 'Experienced'], size=90),
    'Time': np.random.normal(loc=10, scale=2, size=90),
})

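# Perform two-way ANOVA.
# anova_lm defaults to Type I (sequential) sums of squares, which depend on the order of terms
# when cell sizes are unequal, as they are here; typ=2 is a common alternative.
formula = 'Time ~ Program + Experience + Program:Experience'
model = ols(formula, data).fit()
anova_results = anova_lm(model)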

# Print F-statistics and p-values
print(anova_results)

# Interpret results
alpha = 0.05
if any(anova_results['PR(>F)'] < alpha):
    print("There is a significant main effect or interaction effect.")
else:
    print("There is no significant main effect or interaction effect.")


                      df      sum_sq   mean_sq         F    PR(>F)
Program              2.0    1.489533  0.744766  0.216246  0.805984
Experience           1.0    5.096305  5.096305  1.479736  0.227223
Program:Experience   2.0    8.396750  4.198375  1.219018  0.300694
Residual            84.0  289.301266  3.444063       NaN       NaN
There is no significant main effect or interaction effect.


In [34]:
# Q11. An educational researcher is interested in whether a new teaching method improves student test
# scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
# experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
# two-sample t-test using Python to determine if there are any significant differences in test scores
# between the two groups. If the results are significant, follow up with a post-hoc test to determine which
# group(s) differ significantly from each other.

In [35]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated data (replace with actual data)
np.random.seed(42)  # for reproducibility
control_group = np.random.normal(loc=70, scale=10, size=50)
experimental_group = np.random.normal(loc=75, scale=10, size=50)

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group, experimental_group)
print("Two-sample t-test results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Check if the results are significant
alpha = 0.05
if p_value < alpha:
    print("There is a significant difference in test scores between the two groups.")
else:
    print("There is no significant difference in test scores between the two groups.")

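# Perform post-hoc test (Tukey's HSD), as the question requests.
# With only two groups there is a single comparison, so this repeats the t-test above rather than adding information.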
data = pd.DataFrame({
    'Score': np.concatenate([control_group, experimental_group]),
    'Group': ['Control'] * 50 + ['Experimental'] * 50,
})

posthoc_results = pairwise_tukeyhsd(data['Score'], data['Group'])
print("\nPost-hoc test (Tukey's HSD) results:")
print(posthoc_results)


Two-sample t-test results:
t-statistic: -4.108723928204809
p-value: 8.261945608702611e-05
There is a significant difference in test scores between the two groups.

Post-hoc test (Tukey's HSD) results:
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
Control Experimental   7.4325 0.0001 3.8427 11.0224   True
----------------------------------------------------------


In [36]:
# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
# retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
# on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
# significant differences in sales between the three stores. If the results are significant, follow up with a
# post-hoc test to determine which store(s) differ significantly from each other.

In [40]:
import pandas as pd
import numpy as np
from statsmodels.stats.anova import AnovaRM

# Simulated data (replace with your actual data)
np.random.seed(42)
days = np.tile(np.arange(1, 31), 3)
store_labels = np.repeat(['Store A', 'Store B', 'Store C'], 30)
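# One mean per store, repeated to line up with store_labels
# (a length-3 loc cannot be broadcast to size=90 and raises a ValueError)
sales = np.random.normal(loc=np.repeat([50, 60, 55], 30), scale=10, size=90)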

# Create a DataFrame
data = pd.DataFrame({'Day': days, 'Store': store_labels, 'Sales': sales})

# Convert 'Day' and 'Store' to categorical variables
data['Day'] = pd.Categorical(data['Day'])
data['Store'] = pd.Categorical(data['Store'])

# Perform repeated measures ANOVA
rm_anova = AnovaRM(data, 'Sales', 'Day', within=['Store'])
rm_results = rm_anova.fit()

print("Repeated Measures ANOVA results:")
print(rm_results)
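
# The question also asks for a post-hoc follow-up. One possible approach, sketched here under assumptions, is pairwise
# paired t-tests (the same days are measured for every store) with a Holm correction. The data are simulated again in
# wide format so the snippet runs on its own; the means, spread and seed are illustrative, not real sales figures.

```python
import itertools
import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

# Hypothetical daily sales: one row per day, one column per store
rng = np.random.default_rng(42)
stores = ['Store A', 'Store B', 'Store C']
wide = pd.DataFrame({
    store: rng.normal(loc=mean, scale=10, size=30)
    for store, mean in zip(stores, [50, 60, 55])
})

# Paired t-test for every pair of stores (observations are matched by day)
pairs = list(itertools.combinations(stores, 2))
raw_p = [ttest_rel(wide[a], wide[b]).pvalue for a, b in pairs]

# Holm correction to control the familywise error rate across the three comparisons
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method='holm')

for (a, b), p, r in zip(pairs, p_adj, reject):
    print(f"{a} vs {b}: Holm-adjusted p = {p:.4f}, reject H0 = {r}")
```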
