Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

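- Normality of the sampling distribution of means: The sample means of each group are normally distributed. This holds when the scores within each group are approximately normal, or when the groups are large enough for the central limit theorem to apply.
- Absence of outliers: Outlying scores can distort group means and variances, so they need to be identified and, where justified, removed from the dataset.
- Homogeneity of variance: The populations being compared have the same variance. That is, the population variances at the different levels of each independent variable are equal.
- Independence: The samples are independent and randomly drawn.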

Violations that could impact the validity of the results:
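- Normality: ANOVA is fairly robust to violations of normality as long as the sample sizes are equal and sufficiently large. However, the group distributions should be symmetrical or at least similar in shape. For example, strongly skewed scores in small groups can make the F-test unreliable.
- Outliers: A single extreme score can pull a group mean toward it and inflate the within-group variance, which can either hide a real difference or create a spurious one.
- Homogeneity of variance: Both the independent samples t-test and ANOVA assume that all comparison groups have the same variance. The assumption is violated when, for example, one group's scores are far more spread out than another's. This is most damaging when the group sizes are also unequal, because the Type I error rate of the F-test then drifts away from the nominal level.
- Independence: The assumption is violated when observations are related, for example when the same participant contributes scores to more than one group.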
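
The normality and equal-variance assumptions can be checked before running the ANOVA. The following is a minimal sketch, assuming simulated data for three hypothetical groups; it uses the Shapiro-Wilk test on the residuals and Levene's test across the groups, both from `scipy.stats`. The group means, sizes, and seed are illustrative choices, not values from this assignment.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal populations with equal variance
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (5.0, 6.0, 4.0)]

# Normality: Shapiro-Wilk test on the residuals (each value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test is evidence against the corresponding assumption. A large p-value does not prove that the assumption holds, so plots of the residuals are still worth inspecting.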

Q2. What are the three types of ANOVA, and in what situations would each be used?

ANOVA, or Analysis of Variance, is a statistical method for comparing the means of several groups, typically three or more. The three types of ANOVA are:
- One-way ANOVA: Used when there is one independent variable (factor). For example, testing whether shoe brand is related to race finish times in a marathon.
- Two-way ANOVA: Used when there are two independent variables. For example, determining the effect of two factors, such as product and gender, on a dependent variable like sales revenue.
- Repeated measures ANOVA: Used when the same subjects are measured multiple times. For example, recording the same patients' blood pressure before, during, and after a treatment.


Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA is a systematic procedure that splits the total variation in a set of data into non-overlapping components. This partitioning is based on the law of total variance, which states that the observed variance in a variable can be split into components that can be attributed to different sources of variation.  
In a one-way ANOVA the partition involves three sums of squares: the total, the treatment (between-group), and the error (within-group) sum of squares. Because the total equals the treatment plus the error sum of squares, only two of the three need to be calculated directly; the third follows by addition or subtraction.  
ANOVA is a collection of statistical models and estimation procedures that analyze the differences between means. It can be used to determine whether the means of different groups differ. For example, ANOVA can help businesses decide which alternative to choose among many possible options.  
Understanding the partition matters because the F-test is built from it: ANOVA compares the variance among the groups with the variance within the groups. In general terms, a large difference in means combined with small variances within the groups signals a greater difference between the groups.  
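
The partition and the resulting F-test can be reproduced by hand. The following is a minimal sketch on a small hypothetical dataset (three groups of three values, chosen only for illustration). It computes the between-group, within-group, and total sums of squares with NumPy and derives the F-statistic from them. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of three observations each
groups = [np.array([2.0, 4.0, 6.0]),
          np.array([5.0, 7.0, 9.0]),
          np.array([8.0, 10.0, 12.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
n_total = all_values.size
k = len(groups)

# Between-group (treatment) sum of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group (error) sum of squares
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total sum of squares
ss_total = ((all_values - grand_mean) ** 2).sum()

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

# F-statistic and its upper-tail p-value
f_stat = ms_between / ms_within
p_value = stats.f.sf(f_stat, k - 1, n_total - k)

print("SS total:", ss_total, "| SS between + SS within:", ss_between + ss_within)
print("F:", f_stat, "| p-value:", p_value)
```

The two printed sums of squares agree, which is the partition in action, and the F-statistic matches what `scipy.stats.f_oneway` returns for the same groups.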

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
data = {'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(data)

# Fit the ANOVA model
model = ols('value ~ group', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=1)

# Calculate the sums of squares, using the abbreviations from the question:
# SST = total, SSE = explained (between groups), SSR = residual (within groups)
grand_mean = df['value'].mean()
SST = np.sum((df['value'] - grand_mean)**2)
SSE = np.sum((model.fittedvalues - grand_mean)**2)
SSR = np.sum(model.resid**2)

print("SST:", SST)
print("SSE:", SSE)
print("SSR:", SSR)
print(anova_table)


SST: 60.0
SSE: 54.000000000000014
SSR: 6.0
           df  sum_sq  mean_sq     F  PR(>F)
group     2.0    54.0     27.0  27.0   0.001
Residual  6.0     6.0      1.0   NaN     NaN


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
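# Importing libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe
# Caveat: Fertilizer and Watering are built by the same np.repeat call, so the
# two columns are identical and the factors are completely confounded. The main
# effects and the interaction cannot be separated with this data; a real
# analysis needs a crossed design in which every Fertilizer level occurs with
# every Watering level.
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
                          'Watering': np.repeat(['daily', 'weekly'], 15),
                          'height': [14, 16, 15, 15, 16, 13, 12, 11, 14,
                                     15, 16, 16, 17, 18, 14, 13, 14, 14,
                                     14, 15, 16, 16, 17, 18, 14, 13, 14,
                                     14, 14, 15]})

# Performing two-way ANOVA: two main effects plus their interaction
model = ols('height ~ C(Fertilizer) + C(Watering) + C(Fertilizer):C(Watering)', data=dataframe).fit()
# typ=1 gives the sequential (Type I) table; pass typ=2 for Type II sums of squares
result = sm.stats.anova_lm(model, typ=1)

# Print the result
print(result)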

                             df     sum_sq   mean_sq         F    PR(>F)
C(Fertilizer)               1.0   0.033333  0.033333  0.012069  0.913305
C(Watering)                 1.0   0.000369  0.000369  0.000133  0.990865
C(Fertilizer):C(Watering)   1.0   0.040866  0.040866  0.014796  0.904053
Residual                   28.0  77.333333  2.761905       NaN       NaN


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

The F-statistic is the ratio of the variation between group means to the variation within groups (the mean square between divided by the mean square within). An F-statistic of 5.23 indicates that the between-group variation is about five times as large as the within-group variation.  
The p-value is the probability of obtaining an F-statistic at least as large as the one observed if the null hypothesis (all group means are equal) were true. A p-value of 0.02 means there is only a 2% probability of seeing results this extreme under the null hypothesis.  
Because the p-value (0.02) is below the conventional significance level of 0.05, we reject the null hypothesis and conclude that at least one group mean differs significantly from the others. The ANOVA itself does not show which groups differ; a post-hoc test is needed for that.  
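
The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not state. The following is a minimal sketch, assuming a hypothetical study with 3 groups and 17 observations (degrees of freedom 2 and 14); with those assumed values the upper tail of the F distribution gives a p-value close to the reported 0.02.

```python
from scipy import stats

f_statistic = 5.23
df_between = 2    # assumed: 3 groups - 1
df_within = 14    # assumed: 17 observations - 3 groups

# Survival function = P(F >= f_statistic) under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value that F must exceed to be significant at alpha = 0.05
f_critical = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = 0.05: {f_critical:.2f}")
```

Comparing the p-value with alpha and comparing the F-statistic with the critical value are two views of the same decision rule.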

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Complete Case Analysis (CCA): In CCA, any subject with a missing value is excluded from the analysis entirely. The method is simple, but it reduces the sample size and statistical power, and it can lead to biased estimates unless the data are missing completely at random (MCAR), meaning that the probability of missingness is unrelated to both the observed and the unobserved data.

Mean Imputation: In mean imputation, missing values are replaced with the mean of the observed values for that variable. While this method is easy to implement, it can underestimate variability and distort relationships between variables. It also assumes that the missing values have the same mean as the observed values, which may not be true.

Last Observation Carried Forward (LOCF): In LOCF, missing values are replaced with the last observed value. This method assumes that the missing values remain constant over time, which may not be valid in longitudinal studies.

Linear Interpolation: Linear interpolation involves estimating missing values based on the values of neighboring time points. While this method can preserve the overall trend of the data, it assumes linear relationships between adjacent time points and may not accurately capture nonlinear patterns.

Multiple Imputation: Multiple imputation involves generating multiple plausible values for each missing data point based on the observed data and imputing them separately. The results from multiple imputed datasets are then combined using appropriate statistical techniques. Multiple imputation provides more accurate estimates and valid standard errors compared to single imputation methods. However, it requires more computational resources and may be complex to implement.

Maximum Likelihood Estimation (MLE): MLE is a statistical technique that estimates model parameters by maximizing the likelihood function. In the context of repeated measures ANOVA, MLE can be used to estimate parameters while accounting for missing data. It provides valid parameter estimates under the missing at random (MAR) assumption, where the probability of missingness depends on the observed data but not on the missing data.
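
The simpler methods above can be sketched with pandas. The following is a minimal illustration, assuming a hypothetical wide-format table with one row per subject and one column per time point; the subject labels, column names, and scores are made up for the example. Multiple imputation and maximum likelihood need dedicated modelling tools and are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame(
    {'t1': [10.0, 12.0, 11.0, 13.0],
     't2': [11.0, np.nan, 12.0, 15.0],
     't3': [13.0, 14.0, np.nan, 17.0]},
    index=['s1', 's2', 's3', 's4'],
)

# Complete case analysis: drop every subject with at least one missing value
complete_cases = scores.dropna()

# Mean imputation: replace a missing value with the mean of its time point
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each subject's last observed value forward along the time axis
locf = scores.ffill(axis=1)

# Linear interpolation between a subject's neighbouring time points
interpolated = scores.interpolate(axis=1)

print(complete_cases)
print(mean_imputed)
print(locf)
print(interpolated)
```

Complete case analysis keeps only two of the four subjects here, which shows how quickly it can shrink a repeated measures dataset.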

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Tukey's Honestly Significant Difference (HSD): Tukey's HSD test is used to determine which specific groups differ significantly from each other following a significant ANOVA result. It controls for the familywise error rate, making it suitable for situations where multiple pairwise comparisons are conducted. Tukey's HSD is appropriate when you have equal group sizes and homogeneous variances.

Bonferroni Correction: Bonferroni correction divides the significance level by the number of comparisons to keep the overall (familywise) alpha level at the desired value. It is a more conservative approach than Tukey's HSD and is best suited to a small number of planned comparisons; as the number of comparisons grows, it becomes increasingly conservative and loses power relative to other post-hoc tests.

Sidak Correction: Similar to Bonferroni correction, Sidak correction adjusts the significance level for multiple comparisons. However, Sidak correction is slightly less conservative than Bonferroni correction, which can increase power while still controlling the familywise error rate.

Dunnett's Test: Dunnett's test is used when comparing multiple treatment groups to a single control group. It adjusts for multiple comparisons while focusing on differences between treatment groups and a control group.

Scheffé's Test: Scheffé's test is a conservative post-hoc test that covers all possible contrasts among the group means, not only pairwise comparisons, and it can be used when group sizes are unequal. It still assumes homogeneity of variances. It produces a wider confidence interval for each comparison, making it less likely to incorrectly reject the null hypothesis.

Fisher's LSD (Least Significant Difference): Fisher's LSD test is the least conservative post-hoc test because it applies no adjustment for multiple comparisons beyond requiring a significant overall F-test. It is appropriate when the assumptions of equal variances and equal group sizes are met. However, it is not recommended when many comparisons are made, because the familywise Type I error rate becomes inflated.
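
A post-hoc test becomes necessary when, for instance, a one-way ANOVA on three groups is significant and the next question is which pairs differ. The following is a minimal sketch of how the Bonferroni and Sidak corrections set the per-comparison significance level in that situation, assuming a hypothetical study with three groups and a familywise alpha of 0.05.

```python
from math import comb

alpha = 0.05        # desired familywise error rate
n_groups = 3        # hypothetical number of groups
n_comparisons = comb(n_groups, 2)   # number of pairwise comparisons

# Bonferroni: divide alpha by the number of comparisons
alpha_bonferroni = alpha / n_comparisons

# Sidak: solve 1 - (1 - a)^m = alpha for the per-comparison level a
alpha_sidak = 1 - (1 - alpha) ** (1 / n_comparisons)

print(f"Bonferroni per-comparison alpha: {alpha_bonferroni:.5f}")
print(f"Sidak per-comparison alpha:      {alpha_sidak:.5f}")
```

The Sidak threshold is slightly larger than the Bonferroni threshold, which is why Sidak is described as slightly less conservative.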

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [3]:
import numpy as np
from scipy.stats import f_oneway

# Set random seed for reproducibility
np.random.seed(42)

# Generate random weight loss data for three diets: A, B, and C
# (simulated data; this assumes 50 participants per diet, 150 in total)
weight_loss_A = np.random.normal(loc=5, scale=2, size=50)  # Mean=5, SD=2
weight_loss_B = np.random.normal(loc=6, scale=2, size=50)  # Mean=6, SD=2
weight_loss_C = np.random.normal(loc=4, scale=2, size=50)  # Mean=4, SD=2

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(weight_loss_A, weight_loss_B, weight_loss_C)

# Print F-statistic and p-value
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
if p_value < 0.05:
    print("There is a significant difference between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")


F-statistic: 16.574213049400626
p-value: 3.2283781469409867e-07
There is a significant difference between the mean weight loss of the three diets.


Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [4]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Set random seed for reproducibility
np.random.seed(42)

# Generate random data (simulated; no real effect is built into the Time values)
n_employees = 30

# Create a DataFrame for the data
data = pd.DataFrame({
    'Software': np.random.choice(['A', 'B', 'C'], size=n_employees),
    'Experience': np.random.choice(['Novice', 'Experienced'], size=n_employees),
    'Time': np.random.normal(loc=10, scale=2, size=n_employees)  # Mean=10, SD=2
})

# Fit the ANOVA model
model = ols('Time ~ Software + Experience + Software:Experience', data=data).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

# Interpret the results
print("\nInterpretation:")
if anova_table['PR(>F)']['Software'] < 0.05:
    print("There is a significant main effect of software programs on task completion time.")
else:
    print("There is no significant main effect of software programs on task completion time.")

if anova_table['PR(>F)']['Experience'] < 0.05:
    print("There is a significant main effect of employee experience level on task completion time.")
else:
    print("There is no significant main effect of employee experience level on task completion time.")

if anova_table['PR(>F)']['Software:Experience'] < 0.05:
    print("There is a significant interaction effect between software programs and employee experience level.")
else:
    print("There is no significant interaction effect between software programs and employee experience level.")

                        sum_sq    df         F    PR(>F)
Software              1.035327   2.0  0.136986  0.872659
Experience            0.521940   1.0  0.138118  0.713420
Software:Experience   2.683910   2.0  0.355113  0.704716
Residual             90.694755  24.0       NaN       NaN

Interpretation:
There is no significant main effect of software programs on task completion time.
There is no significant main effect of employee experience level on task completion time.
There is no significant interaction effect between software programs and employee experience level.


Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [6]:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Set random seed for reproducibility
np.random.seed(42)

# Generate random test scores for the control and experimental groups
control_group_scores = np.random.normal(loc=70, scale=10, size=50)  # Mean=70, SD=10
experimental_group_scores = np.random.normal(loc=75, scale=10, size=50)  # Mean=75, SD=10

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group_scores, experimental_group_scores)

# Print t-statistic and p-value
print("Two-sample t-test results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

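# Perform post-hoc test (Tukey's HSD) if the results are significant
# With only two groups there is a single pairwise comparison, so Tukey's HSD
# leads to the same conclusion as the t-test above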
if p_value < 0.05:
    print("\nPost-hoc test (Tukey's HSD):")
    all_scores = np.concatenate([control_group_scores, experimental_group_scores])
    group_labels = ['Control'] * len(control_group_scores) + ['Experimental'] * len(experimental_group_scores)
    tukey_results = pairwise_tukeyhsd(all_scores, group_labels, alpha=0.05)
    print(tukey_results)
else:
    print("\nNo significant differences found, post-hoc test not required.")


Two-sample t-test results:
t-statistic: -4.108723928204809
p-value: 8.261945608702611e-05

Post-hoc test (Tukey's HSD):
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
Control Experimental   7.4325 0.0001 3.8427 11.0224   True
----------------------------------------------------------


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [7]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Set random seed for reproducibility
np.random.seed(42)

# Generate random daily sales data for three stores for 30 days
n_days = 30
n_stores = 3

# Create a DataFrame for the data
data = pd.DataFrame({
    'Day': np.repeat(np.arange(1, n_days+1), n_stores),
    'Store': np.tile(['A', 'B', 'C'], n_days),
    'Sales': np.random.normal(loc=5000, scale=1000, size=n_days * n_stores)  # Mean=5000, SD=1000
})

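# Fit the model
# Caveat: this formula fits an ordinary one-way (between-groups) ANOVA. It does
# not use Day as the repeated-measures unit, so the correlation between the
# three stores' sales on the same day is ignored. A true repeated measures
# ANOVA would treat Day as the subject, e.g. with statsmodels' AnovaRM.
model = ols('Sales ~ C(Store)', data=data).fit()

# Compute the ANOVA table (Type II sums of squares)
anova_table = sm.stats.anova_lm(model, typ=2)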

# Print the ANOVA table
print("Repeated Measures ANOVA results:")
print(anova_table)

# Perform post-hoc test (Tukey's HSD) if the results are significant
if anova_table['PR(>F)']['C(Store)'] < 0.05:
    print("\nPost-hoc test (Tukey's HSD):")
    tukey_results = sm.stats.multicomp.pairwise_tukeyhsd(data['Sales'], data['Store'])
    print(tukey_results)
else:
    print("\nNo significant differences found, post-hoc test not required.")


Repeated Measures ANOVA results:
                sum_sq    df         F    PR(>F)
C(Store)  3.596734e+05   2.0  0.202043  0.817442
Residual  7.743782e+07  87.0       NaN       NaN

No significant differences found, post-hoc test not required.
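
The model above is fitted with the formula `Sales ~ C(Store)`, which does not use `Day` as the repeated unit. The following is a minimal sketch of a repeated measures ANOVA with `AnovaRM` from `statsmodels`, assuming that each day acts as the "subject" and that `Store` is the within-subject factor. The data are simulated with a different random generator, so the numbers will not match the table above; names such as `sales` and `rm_table` are illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: every Day has exactly one Sales value per Store
rng = np.random.default_rng(0)
n_days = 30
sales = pd.DataFrame({
    'Day': np.repeat(np.arange(1, n_days + 1), 3),
    'Store': np.tile(['A', 'B', 'C'], n_days),
    'Sales': rng.normal(loc=5000, scale=1000, size=n_days * 3),  # simulated
})

# Day plays the role of the subject; Store is the within-subject factor
rm_results = AnovaRM(data=sales, depvar='Sales', subject='Day', within=['Store']).fit()
rm_table = rm_results.anova_table
print(rm_table)
```

With 30 days and 3 stores, the F-test for `Store` has 2 and 58 degrees of freedom, compared with 2 and 87 in the between-groups table, because the day-to-day variation is removed from the error term.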
