Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.


In [None]:
"""
Assumptions of ANOVA:

Independence: The observations in each group should be independent of each other. This assumption is 
              violated when there is dependence or correlation among the observations within groups.

Normality: The residuals (the differences between observed values and the group means) should follow a 
           normal distribution within each group. This assumption is violated when the residuals are 
           strongly skewed or heavy-tailed.

Homoscedasticity: The variability of the residuals should be relatively constant across all groups. Violation 
                  of this assumption, known as heteroscedasticity, occurs when the variability is significantly 
                  different among groups.





Examples of Violations:

Independence Violation: In a study comparing the academic performance of students from different schools, if the sampled 
                        students within a school come from the same few classrooms, their scores are correlated through 
                        shared teachers and lessons, and the independence assumption is violated.

Normality Violation: In a study comparing the effects of a new drug on different age groups, if the residuals exhibit
                     a skewed distribution within each age group, the normality assumption is violated.

Homoscedasticity Violation: Consider a study investigating the impact of different fertilizers on crop yield. If the 
                            variability of crop yields varies significantly among the different fertilizer groups, 
                            the assumption of homogeneity of variance is violated.
"""                            
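
Two of the assumptions above (normality and homoscedasticity) can be checked numerically. The cell below is a minimal sketch, assuming hypothetical crop-yield samples generated only for illustration; it uses the Shapiro-Wilk test per group and Levene's test across groups. Independence depends on the study design and cannot be checked this way.

```python
import numpy as np
from scipy import stats

# Hypothetical crop-yield samples for three fertilizer groups (illustrative only)
rng = np.random.default_rng(0)
groups = [
    rng.normal(loc=50, scale=5, size=30),
    rng.normal(loc=55, scale=5, size=30),
    rng.normal(loc=60, scale=15, size=30),  # deliberately larger spread
]

# Normality: Shapiro-Wilk test on the residuals of each group
# (residual = observation minus its group mean)
shapiro_p = [stats.shapiro(g - g.mean())[1] for g in groups]

# Homoscedasticity: Levene's test for equal variances across the groups
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values per group:", shapiro_p)
print("Levene p-value:", levene_p)
```

A small p-value in either test is evidence against the corresponding assumption; with small samples these tests have little power, so residual plots are a useful complement.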

Q2. What are the three types of ANOVA, and in what situations would each be used?


In [None]:
"""
There are three main types of ANOVA:

1-One-Way ANOVA: This type of ANOVA is used when you have one categorical independent variable (factor) 
                 and one continuous dependent variable. It is used to determine whether there are statistically 
                 significant differences in the means of the dependent variable across different levels of the 
                 categorical variable.

Example: A one-way ANOVA could be used to compare the average test scores of students from three different schools 
         to see if there are significant differences in performance.
         
         
         

2-Two-Way ANOVA: This type of ANOVA involves two independent variables (factors) and one continuous dependent variable. 
                 It assesses the individual (main) effect of each factor on the dependent variable as well as the 
                 interaction between the two factors.

Example: A two-way ANOVA could be used to examine the effects of both gender and different teaching methods on student test 
         scores to see if there are any interactions between these factors.




3-Repeated Measures ANOVA: This type of ANOVA is used when the same subjects are measured under different conditions or at 
                           multiple time points. It assesses whether there are significant differences in the means of the 
                           dependent variable across these conditions or time points.

Example: A repeated measures ANOVA could be used to analyze changes in participants' anxiety levels before and after different 
         therapeutic interventions to see if there are statistically significant changes.
"""         

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?


In [None]:
"""
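The partitioning of variance in ANOVA breaks the total variability in the data into a part explained by differences between 
the group means (between-group variability) and a part left over inside the groups (within-group, or residual, variability).
It's vital to understand because the F-statistic is the ratio of these two parts, each divided by its degrees of freedom. That 
ratio tells us whether the group differences are larger than chance alone would produce, aiding in drawing accurate conclusions 
about the effects of variables or treatments being studied.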
"""
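
For a one-way design the partition described above can be written out explicitly. The notation here is an assumption introduced for illustration: k groups, n_i observations in group i, N observations in total, group means \bar{x}_i and grand mean \bar{x}.

```latex
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{x}_i-\bar{x}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_i\right)^2}_{SS_{\text{within}}},
\qquad
F=\frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
```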

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?


In [1]:
import numpy as np
from scipy import stats

# Sample data for each group
group1 = np.array([10, 12, 14, 16, 18])
group2 = np.array([20, 22, 24, 26, 28])
group3 = np.array([30, 32, 34, 36, 38])
groups = [group1, group2, group3]
all_data = np.concatenate(groups)

# Naming used in this cell: SSR = explained (between-group) sum of squares,
# SSE = residual (within-group) sum of squares, SST = total sum of squares

# Grand mean over all observations (stays correct when group sizes differ)
x_bar = np.mean(all_data)

# SSR: squared distance of each group mean from the grand mean, weighted by group size
ssr = sum(len(g) * (np.mean(g) - x_bar) ** 2 for g in groups)

# SSE: squared deviations of each observation from its own group mean
sse = sum(np.sum((g - np.mean(g)) ** 2) for g in groups)

# SST: squared deviations of every observation from the grand mean (equals SSR + SSE)
sst = np.sum((all_data - x_bar) ** 2)

# Degrees of freedom
df1 = len(groups) - 1              # k - 1, k = number of groups
df2 = len(all_data) - len(groups)  # N - k, N = total number of observations

# Mean squares and F-ratio
msr = ssr / df1
mse = sse / df2
f_test = msr / mse

# Upper-tail probability of the F distribution
# (stats.f.sf(f_test, df1, df2) is a numerically safer equivalent)
p_value = 1 - stats.f.cdf(f_test, df1, df2)

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSR):", ssr)
print("Residual Sum of Squares (SSE):", sse)
print("F-Statistic:", f_test)
print("P-Value:", p_value)

Total Sum of Squares (SST): 1120.0
Explained Sum of Squares (SSR): 1000.0
Residual Sum of Squares (SSE): 120.0
F-Statistic: 50.0
P-Value: 1.5127924217761546e-06
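
The manual F-ratio can be cross-checked against `scipy.stats.f_oneway`, which runs the same one-way ANOVA directly on the groups. This is a small sketch reusing the sample data from the cell above.

```python
import numpy as np
from scipy import stats

# Same sample data as in the manual calculation
group1 = np.array([10, 12, 14, 16, 18])
group2 = np.array([20, 22, 24, 26, 28])
group3 = np.array([30, 32, 34, 36, 38])

# f_oneway returns the F-statistic and its p-value
f_check, p_check = stats.f_oneway(group1, group2, group3)

print("F-Statistic (scipy):", f_check)
print("P-Value (scipy):", p_check)
```

Both values should agree with the manual result up to floating-point rounding.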


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?


In [19]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

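# Create a sample dataset (one observation per Factor1 x Factor2 cell)
data = {'Factor1': [1, 1, 1, 2, 2, 2, 3, 3, 3],
        'Factor2': [1, 2, 3, 1, 2, 3, 1, 2, 3],
        'Dependent': [10, 12, 15, 20, 22, 25, 30, 32, 35]}
df = pd.DataFrame(data)

# Fit the two-way ANOVA model with both main effects and their interaction.
# With a single observation per cell this model is saturated: it uses all 8 available
# degrees of freedom and leaves 0 for the residual, so the F-statistics and p-values
# cannot be computed. Replicate observations per cell are needed to test the interaction.
formula = 'Dependent ~ C(Factor1) + C(Factor2) + C(Factor1):C(Factor2)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model)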

# Print the ANOVA table
print(anova_table)


                        df        sum_sq       mean_sq    F  PR(>F)
C(Factor1)             2.0  6.000000e+02  3.000000e+02  0.0     NaN
C(Factor2)             2.0  3.800000e+01  1.900000e+01  0.0     NaN
C(Factor1):C(Factor2)  4.0  1.656608e-29  4.141520e-30  0.0     NaN
Residual               0.0  1.798603e-28           inf  NaN     NaN
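
When every cell holds more than one observation, the main effects and the interaction can be computed by hand from the cell, row and column means. The cell below is a minimal sketch for a balanced design, assuming a hypothetical 3 x 2 layout with 2 replicates per cell; the values and variable names are illustrative and not taken from the dataset above.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced design: 3 levels of factor A x 2 levels of factor B,
# 2 replicate observations per cell. Array shape: (A levels, B levels, replicates)
y = np.array([
    [[10.0, 12.0], [14.0, 15.0]],
    [[20.0, 21.0], [26.0, 24.0]],
    [[30.0, 33.0], [31.0, 35.0]],
])
a, b, r = y.shape

grand = y.mean()
mean_a = y.mean(axis=(1, 2))   # mean of each level of A
mean_b = y.mean(axis=(0, 2))   # mean of each level of B
mean_cell = y.mean(axis=2)     # mean of each A x B cell

# Sums of squares for the main effects, the interaction and the residual
ss_a = b * r * np.sum((mean_a - grand) ** 2)
ss_b = a * r * np.sum((mean_b - grand) ** 2)
ss_ab = r * np.sum((mean_cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_error = np.sum((y - mean_cell[:, :, None]) ** 2)
ss_total = np.sum((y - grand) ** 2)

# Degrees of freedom
df_a, df_b = a - 1, b - 1
df_ab = df_a * df_b
df_error = a * b * (r - 1)

# F-test for each effect against the residual mean square
ms_error = ss_error / df_error
for name, ss, df in [("A", ss_a, df_a), ("B", ss_b, df_b), ("A:B", ss_ab, df_ab)]:
    f_value = (ss / df) / ms_error
    p_value = stats.f.sf(f_value, df, df_error)
    print(f"{name}: SS={ss:.3f}, df={df}, F={f_value:.3f}, p={p_value:.4f}")
```

Because the design is balanced, the four sums of squares add up to the total, and the residual keeps a * b * (r - 1) degrees of freedom, which is what makes the F-tests possible.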




Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?


In [None]:
"""
In a one-way ANOVA, the F-statistic is used to test the null hypothesis that all group means are equal against the alternative 
hypothesis that at least one group mean is different. The p-value associated with the F-statistic tells you the probability 
of observing the obtained F-statistic (or a more extreme value) if the null hypothesis were true.


Given that you obtained an F-statistic of 5.23 and a p-value of 0.02, here's how you can interpret these results:


1-Interpretation of the F-Statistic:
The F-statistic value of 5.23 indicates the ratio of variability between the group means to the variability within the groups.
A higher F-statistic suggests that there might be significant differences between the group means. However, the interpretation 
of the F-statistic alone is not sufficient to make a definitive conclusion.


2-Interpretation of the P-Value:
The p-value of 0.02 is the probability of observing an F-statistic as extreme as 5.23 (or more extreme) under the assumption that 
the group means are equal. A p-value of 0.02 indicates that there is a 2% chance of observing such a large F-statistic if the null
hypothesis (all group means are equal) were true.

3-Decision:
Typically, in hypothesis testing, you compare the p-value to a predetermined significance level (alpha). If the p-value is less than 
alpha, you reject the null hypothesis. Common significance levels are 0.05 or 0.01. In your case, with a p-value of 0.02, if you're
using a significance level of 0.05, you would reject the null hypothesis.

4-Conclusion:
With a p-value of 0.02 and the decision to reject the null hypothesis, you can conclude that there are statistically significant 
differences between at least some of the groups. In other words, there is evidence to suggest that at least one group mean is different 
from the others. The ANOVA itself does not show which groups differ; a post-hoc test is needed for that.
"""
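
The decision rule above can be sketched in code. The question does not state the degrees of freedom, so the values below (3 groups, 15 observations) are an assumption made only for illustration; under that assumption the computed p-value lands close to the reported 0.02.

```python
from scipy import stats

f_statistic = 5.23
alpha = 0.05

# Assumed degrees of freedom (not given in the question): 3 groups, 15 observations
df_between = 3 - 1
df_within = 15 - 3

# p-value: upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value: the F value that the statistic must exceed at this alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha
print("p-value:", p_value)
print("Critical F at alpha = 0.05:", f_critical)
print("Reject the null hypothesis:", reject_null)
```

Comparing the p-value with alpha and comparing the F-statistic with the critical value are two views of the same decision.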

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?


In [None]:
"""
In a repeated measures ANOVA, you can handle missing data using methods like:

1-Listwise Deletion: Removing cases with any missing data. Can lead to loss of information and bias.

2-Mean Imputation: Replacing missing values with variable means. Underestimates variability and distorts relationships.

3-Last Observation Carried Forward (LOCF): Using the last observed value for imputing missing data. Assumes stability over time.

4-Linear Interpolation: Estimating missing values based on observed trends. Assumes linearity and may not fit all data.

5-Multiple Imputation: Generating multiple plausible values to account for uncertainty. Accurate but computationally intensive.

6-Mixed Models: Incorporating all available data with random effects. Handles missing data while considering within-subject correlation.




Potential consequences of using different methods:

1-Bias: Listwise deletion and mean imputation can introduce bias if missingness is related to studied variables.

2-Underestimation of Variability: Mean imputation and LOCF underestimate variability by ignoring uncertainty.

3-Distorted Relationships: Imputation methods can distort variable relationships, leading to incorrect conclusions.

4-Efficiency and Precision: Methods like multiple imputation and mixed models are more efficient and precise, but can be complex.
"""
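
The first two consequences can be seen on a toy example. The cell below is a minimal sketch, assuming a small hypothetical score matrix (rows are subjects, columns are time points); it compares listwise deletion with mean imputation.

```python
import numpy as np

# Hypothetical scores: rows are subjects, columns are three time points,
# NaN marks a missing measurement
scores = np.array([
    [5.0, 6.0, 7.0],
    [4.0, np.nan, 6.0],
    [6.0, 7.0, 9.0],
    [3.0, 5.0, np.nan],
    [5.0, 8.0, 8.0],
])

# Listwise deletion: keep only subjects with complete data
complete = scores[~np.isnan(scores).any(axis=1)]

# Mean imputation: replace each missing value with the mean of its time point
col_means = np.nanmean(scores, axis=0)
imputed = np.where(np.isnan(scores), col_means, scores)

observed_var = np.nanvar(scores, axis=0, ddof=1)
imputed_var = imputed.var(axis=0, ddof=1)

print("Subjects kept after listwise deletion:", complete.shape[0], "of", scores.shape[0])
print("Variance of observed values per time point:", observed_var)
print("Variance after mean imputation:            ", imputed_var)
```

Listwise deletion discards two of the five subjects, while mean imputation keeps everyone but shrinks the variance at every time point that had a missing value.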

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.


In [None]:
"""
Common post-hoc tests after ANOVA and their uses:

1-Tukey's HSD: Compares all possible pairs of group means; useful when you want to examine all pairwise differences simultaneously.
Situation: Comparing the effects of different fertilizer treatments on crop yield.

2-Bonferroni Correction: Controls familywise error rate when making multiple comparisons.
Situation: Investigating differences in reaction times among various experimental conditions.

3-Dunnett's Test: Compares treatment groups against a control group.
Situation: Analyzing the impact of various drugs compared to a placebo in a clinical trial.

4-Scheffe's Test: Covers all possible contrasts (not only pairwise comparisons) while controlling familywise error rate; more conservative than Tukey's HSD for pairwise comparisons.
Situation: Exploring the impact of different marketing strategies across multiple regions.

5-Holm's Method: Sequentially adjusts p-values to control familywise error rate.
Situation: Comparing sales performance of different products after a marketing campaign.

6-Games-Howell Test: Accounts for unequal group variances.
Situation: Evaluating the effectiveness of different teaching methods across classrooms with varying levels of student variability.

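7-Fisher's LSD: Runs pairwise t-tests only after a significant overall F-test, with no further adjustment for multiple comparisons; best limited to a small number of groups (e.g. three).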
Situation: Studying the effect of temperature on enzyme activity across different enzyme types.
"""
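
The Bonferroni and Holm adjustments listed above are simple enough to write out by hand. The cell below is a minimal sketch, assuming hypothetical crop yields for three fertilizers; it runs all pairwise t-tests and adjusts the resulting p-values with both methods.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical crop yields for three fertilizers (illustrative values)
groups = {
    "A": np.array([20.1, 21.4, 19.8, 22.0, 20.7]),
    "B": np.array([23.5, 24.1, 22.8, 25.0, 23.9]),
    "C": np.array([20.9, 21.1, 22.3, 20.4, 21.8]),
}

# Unadjusted p-values from all pairwise two-sample t-tests
pairs = list(combinations(groups, 2))
raw_p = np.array([stats.ttest_ind(groups[a], groups[b])[1] for a, b in pairs])
m = len(raw_p)

# Bonferroni: multiply every p-value by the number of comparisons
bonferroni_p = np.minimum(raw_p * m, 1.0)

# Holm: multiply the k-th smallest p-value by (m - k), k = 0, 1, ...,
# then make the adjusted values non-decreasing in that order
order = np.argsort(raw_p)
holm_sorted = np.maximum.accumulate(raw_p[order] * (m - np.arange(m)))
holm_p = np.empty(m)
holm_p[order] = np.minimum(holm_sorted, 1.0)

for (a, b), p0, p1, p2 in zip(pairs, raw_p, bonferroni_p, holm_p):
    print(f"{a} vs {b}: raw={p0:.4f}, Bonferroni={p1:.4f}, Holm={p2:.4f}")
```

Holm's adjusted p-values are never larger than Bonferroni's, which is why Holm's method is at least as powerful while still controlling the familywise error rate.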

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.


In [37]:
from scipy import stats
import numpy as np

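# Simulated placeholder data: integer weight loss between 1 and 5 for 50 participants
# per diet (the prompt describes 50 participants in total). No seed is set, so the
# values change on every run; replace these arrays with the real measurements.
diet_a = np.random.randint(1, 6, 50)
diet_b = np.random.randint(1, 6, 50)
diet_c = np.random.randint(1, 6, 50)

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

# Print results
print("F-Statistic:", f_statistic)
print("P-Value:", p_value)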

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("There are significant differences between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")

F-Statistic: 0.20769587389893374
P-Value: 0.8126920913022598
There is no significant difference between the mean weight loss of the three diets.


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.


In [4]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Set a random seed for reproducibility
np.random.seed(42)

# Number of employees per experience level (60 simulated rows in total,
# whereas the prompt describes 30 employees overall)
sample_size = 30

# Software programs and experience levels
software_programs = ["A", "B", "C"]
experience_levels = ["Novice", "Experienced"]

# Generate random completion time data
data = {
    "Software": np.random.choice(software_programs, size=sample_size*len(experience_levels)),
    "Experience": np.repeat(experience_levels, sample_size),
    "CompletionTime": np.random.normal(loc=20, scale=5, size=sample_size*len(experience_levels))
}

df = pd.DataFrame(data)

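# Perform two-way ANOVA. anova_lm defaults to Type I (sequential) sums of squares;
# Software is drawn at random, so the cells are unbalanced and the term order matters.
formula = "CompletionTime ~ C(Software) + C(Experience) + C(Software):C(Experience)"
model = ols(formula, df).fit()
anova_results = anova_lm(model)

# An effect is significant at the 5% level only if its PR(>F) value is below 0.05.
# The completion times are generated independently of both factors, so no effect is expected.
print("Two-way ANOVA results:")
print(anova_results)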

Two-way ANOVA results:
                             df       sum_sq    mean_sq         F    PR(>F)
C(Software)                 2.0    14.915179   7.457590  0.284149  0.753773
C(Experience)               1.0     5.327206   5.327206  0.202977  0.654132
C(Software):C(Experience)   2.0    48.659798  24.329899  0.927018  0.401937
Residual                   54.0  1417.248092  26.245335       NaN       NaN


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.


In [2]:
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import MultiComparison

# Placeholder test scores; replace both arrays with the actual data.
# The prompt describes 100 students split between the two groups, whereas these
# placeholders use 100 random control scores and only 3 experimental scores.
control_scores = np.random.randint(70,100,100)
experimental_scores = np.array([75,100,100])

# Perform a two-sample t-test
t_stat, p_value = stats.ttest_ind(control_scores, experimental_scores)

print("Two-sample t-test results:")
print("T-statistic:", t_stat)
print("P-value:", p_value)

# Check if the p-value is significant (e.g., p < 0.05)
if p_value < 0.05:
    print("There is a significant difference between the groups.")

    # Post-hoc test (Tukey's HSD). With only two groups there is a single pairwise
    # comparison, so this adds no new information beyond the t-test.
    data = np.concatenate([control_scores, experimental_scores])
    group_labels = ['Control'] * len(control_scores) + ['Experimental'] * len(experimental_scores)
    mc = MultiComparison(data, group_labels)
    result = mc.tukeyhsd()

    print("\nPost-hoc (Tukey's HSD) test results:")
    print(result)

else:
    print("There is no significant difference between the groups.")


Two-sample t-test results:
T-statistic: -1.5145993158288127
P-value: 0.13299720382873373
There is no significant difference between the groups.


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences
in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which 
store(s) differ significantly from each other.

In [17]:
import pandas as pd
import pingouin as pg

# sales_df is assumed to be a wide-format DataFrame (not defined in this notebook)
# with one row per day and the columns 'Day', 'Store_A', 'Store_B', 'Store_C'

# Melt the data to long format for repeated measures ANOVA
long_df = pd.melt(sales_df, id_vars=['Day'], value_vars=['Store_A', 'Store_B', 'Store_C'],
                  var_name='Store', value_name='Sales')

# Repeated measures ANOVA (each day acts as the "subject" measured at all three stores)
rm_anova = pg.rm_anova(data=long_df, dv='Sales', within='Store', subject='Day')
print(rm_anova)

# Post-hoc pairwise comparisons using Tukey's HSD
# Note: Tukey's HSD treats the stores as independent groups and ignores the pairing by day
posthoc = pg.pairwise_tukey(data=long_df, dv='Sales', between='Store')
print(posthoc)


  Source  ddof1  ddof2          F         p-unc     p-GG-corr       ng2  \
0  Store      2     58  37.819864  3.071000e-11  4.125821e-09  0.348213   

        eps  sphericity   W-spher   p-spher  
0  0.764743       False  0.692372  0.005818  
         A        B  mean(A)     mean(B)       diff        se         T  \
0  Store_A  Store_B    132.0  124.000000   8.000000  1.835509  4.358464   
1  Store_A  Store_C    132.0  119.666667  12.333333  1.835509  6.719299   
2  Store_B  Store_C    124.0  119.666667   4.333333  1.835509  2.360835   

        p-tukey    hedges  
0  1.051842e-04  1.214909  
1  5.424461e-09  1.699230  
2  5.291392e-02  0.561383  
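
Because the same 30 days are measured at every store, a post-hoc analysis that respects the pairing is paired t-tests with a multiplicity correction. The cell below is a minimal sketch, assuming simulated sales in place of the original `sales_df`; it uses `scipy.stats.ttest_rel` with a Bonferroni adjustment.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Simulated daily sales (hypothetical values, not the notebook's data):
# a shared day effect makes the three stores correlated across days
rng = np.random.default_rng(1)
n_days = 30
day_effect = rng.normal(0, 10, n_days)
sales = {
    "Store_A": 130 + day_effect + rng.normal(0, 5, n_days),
    "Store_B": 125 + day_effect + rng.normal(0, 5, n_days),
    "Store_C": 120 + day_effect + rng.normal(0, 5, n_days),
}

# Paired t-test for every pair of stores, Bonferroni-adjusted
pairs = list(combinations(sales, 2))
results = []
for a, b in pairs:
    t_stat, p_raw = stats.ttest_rel(sales[a], sales[b])
    p_adj = min(p_raw * len(pairs), 1.0)
    results.append((a, b, t_stat, p_adj))
    print(f"{a} vs {b}: t={t_stat:.3f}, Bonferroni-adjusted p={p_adj:.4g}")
```

The paired test works on the day-by-day differences, so variation shared by all stores on a given day cancels out instead of inflating the error term.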
