### Problem_1: Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

ANOVA makes some key assumptions about your data to ensure accurate results. Here are the main ones:

  - Normality: The data within each group (equivalently, the model residuals) should be approximately normally distributed (bell-shaped curve). Violation: If your data is heavily skewed or has outliers, the results may not be reliable, especially with small samples.

  - Homogeneity of variance: The variances (spread) of the data should be equal across all groups. Violation: If one group is much more variable than the others, particularly when group sizes are unequal, the F-test's false-positive rate can drift away from the nominal level, leading to misleading conclusions.

  - Independence: Observations within each group should be independent and not influence each other. Violation: If observations are related (e.g., repeated measures on same subjects), ANOVA may not be appropriate.
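
The first two assumptions can be screened with standard tests. The sketch below is a minimal illustration, assuming hypothetical simulated data for three groups (the group names, sizes, and seed are not from the text above). It uses `scipy.stats.shapiro` for normality within each group and `scipy.stats.levene` for equality of variances.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three groups of 20 measurements each (assumption)
rng = np.random.default_rng(42)
groups = {
    "A": rng.normal(loc=10, scale=2, size=20),
    "B": rng.normal(loc=12, scale=2, size=20),
    "C": rng.normal(loc=11, scale=2, size=20),
}

# Normality: Shapiro-Wilk test within each group (null hypothesis: data are normal)
shapiro_p = {}
for name, values in groups.items():
    stat, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk, group {name}: W = {stat:.3f}, p = {p:.3f}")

# Homogeneity of variance: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A small p-value in either test signals a possible violation. Independence cannot be checked this way; it has to follow from how the data were collected.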

### Problem_2: What are the three types of ANOVA, and in what situations would each be used?

There are three main types of ANOVA, used for analyzing the impact of categorical variables on a continuous outcome:

  - One-Way ANOVA: This tests whether the mean of a dependent variable differs across the levels of one independent variable (factor), typically with three or more levels. Use it to compare means across groups, like plant growth under different fertilizer types.

  - Two-Way ANOVA: This analyzes the effects of two independent variables (each with multiple levels) on a dependent variable. It considers both individual and interaction effects. Use it to see if fertilizer type and watering frequency together affect plant growth.

  - N-Way ANOVA (General ANOVA): This extends the concept to analyze more than two independent variables simultaneously. It's less common but can handle complex multi-factor experiments.

### Problem_3: What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Partitioning variance in ANOVA is all about breaking down the total variability in your data into different sources. Imagine you're analyzing plant heights. The total variation includes differences due to the experiment (fertilizer types) and random error (natural variations among plants).

Understanding partitioning is important because:

  - Identifies key sources of variation: It helps you see how much of the total difference is explained by your experiment (fertilizer effect) and how much is just random noise.

  - ANOVA Test Statistic: This partitioning is used to calculate the F-statistic, which is the core of ANOVA. The F-statistic tells you if the differences between groups (fertilizer effect) are statistically significant compared to random variation.
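
For a one-way design, the partition can be written out explicitly. The notation below is an assumption made for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $y_{ij}$ for observation $j$ in group $i$, $\bar{y}_i$ for the group mean, and $\bar{y}$ for the overall mean. The labels SST, SSE (explained), and SSR (residual) follow the naming used in Problem_4.

```latex
\begin{aligned}
SST &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( y_{ij} - \bar{y} \right)^2 \\
SSE &= \sum_{i=1}^{k} n_i \left( \bar{y}_i - \bar{y} \right)^2 \\
SSR &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( y_{ij} - \bar{y}_i \right)^2 \\
SST &= SSE + SSR \\
F &= \frac{SSE / (k - 1)}{SSR / (N - k)}
\end{aligned}
```

The F-statistic is the ratio of the explained variation per degree of freedom to the residual variation per degree of freedom, which is why the partition has to be computed first.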

### Problem_4: How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [2]:
import numpy as np
import pandas as pd

# Example data: creating a DataFrame with 'group' and 'value' columns
data = {
    'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [10, 12, 15, 8, 9, 11, 13, 14, 16]
}
df = pd.DataFrame(data)

# Calculate overall mean
overall_mean = df['value'].mean()

# Calculate SST
sst = np.sum((df['value'] - overall_mean) ** 2)

# Calculate group means
group_means = df.groupby('group')['value'].mean()

# Calculate SSE
sse = np.sum((group_means - overall_mean) ** 2 * df.groupby('group').size())

# Calculate SSR
ssr = sst - sse

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)



Total Sum of Squares (SST): 60.0
Explained Sum of Squares (SSE): 38.0
Residual Sum of Squares (SSR): 22.0
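
These sums of squares feed directly into the F-statistic. The sketch below reuses the example values from the cell above, computes the explained and residual sums of squares separately rather than by subtraction, and cross-checks the hand-computed F against `scipy.stats.f_oneway`. The variable names are chosen for illustration.

```python
import numpy as np
from scipy import stats

# Same example values as the cell above, one array per group
group_a = np.array([10, 12, 15])
group_b = np.array([8, 9, 11])
group_c = np.array([13, 14, 16])
groups = [group_a, group_b, group_c]

all_values = np.concatenate(groups)
overall_mean = all_values.mean()

# Explained (between-group) and residual (within-group) sums of squares
ss_explained = 0.0
ss_residual = 0.0
for g in groups:
    ss_explained += len(g) * (g.mean() - overall_mean) ** 2
    ss_residual += ((g - g.mean()) ** 2).sum()

# Degrees of freedom: k - 1 between groups, N - k within groups
df_explained = len(groups) - 1
df_residual = len(all_values) - len(groups)

# F is the ratio of the two mean squares; the p-value is the upper tail of the F distribution
f_manual = (ss_explained / df_explained) / (ss_residual / df_residual)
p_manual = stats.f.sf(f_manual, df_explained, df_residual)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)
print("Manual F:", f_manual, "p:", p_manual)
print("SciPy  F:", f_scipy, "p:", p_scipy)
```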


### Problem_5: In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [5]:
import numpy as np
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Sample data (Example)
data = {'factor1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'factor2': [1, 2, 3, 1, 2, 3, 1, 2, 3],
        'response': [10, 12, 8, 14, 16, 11, 13, 15, 9]}

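# Define the formula for the two-way ANOVA model.
# Note: factor2 is numeric, so it enters the model as a continuous covariate (1 df).
# Wrapping it as C(factor2) would make it categorical, but with only one observation
# per cell the interaction model would then leave no residual degrees of freedom.
formula = 'response ~ factor1 + factor2 + factor1:factor2'  # interaction term included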

# Fit the model
model = ols(formula, data).fit()

# Perform ANOVA analysis
anova_results = anova_lm(model)

# Print main effects and interaction effect
print("ANOVA Results:")
print(anova_results)

ANOVA Results:
                  df     sum_sq    mean_sq         F    PR(>F)
factor1          2.0  20.666667  10.333333  1.248322  0.403214
factor2          1.0  13.500000  13.500000  1.630872  0.291457
factor1:factor2  2.0   1.000000   0.500000  0.060403  0.942501
Residual         3.0  24.833333   8.277778       NaN       NaN
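
When both factors should be treated as categorical, each cell needs more than one observation so that the interaction can be separated from the residual. The sketch below is one way to set that up, assuming a hypothetical balanced 3 x 3 design with two observations per cell (the response values are made up for illustration). It uses `C(...)` to mark both factors as categorical and `typ=2` in `anova_lm`.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical balanced design: 3 x 3 cells, 2 observations per cell (assumption)
df = pd.DataFrame({
    "factor1": ["A"] * 6 + ["B"] * 6 + ["C"] * 6,
    "factor2": [1, 1, 2, 2, 3, 3] * 3,
    "response": [10, 11, 12, 13, 8, 9,
                 14, 15, 16, 17, 11, 12,
                 13, 12, 15, 16, 9, 10],
})

# C(...) marks a factor as categorical; '*' expands to both main effects plus the interaction
model = ols("response ~ C(factor1) * C(factor2)", data=df).fit()

# Rows: the two main effects, the interaction, and the residual
table = anova_lm(model, typ=2)
print(table)
```

With 18 observations and 9 cells, the residual keeps 9 degrees of freedom, so every F-test in the table is defined.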


### Problem_6: Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

  - In a one-way ANOVA, the F-statistic tests the null hypothesis that the means of all groups are equal against the alternative hypothesis that at least one group mean is different. The p-value associated with the F-statistic indicates the probability of observing such an extreme result (or more extreme) under the assumption that the null hypothesis is true.
  
In this case:
  - F-statistic = 5.23
  - p-value = 0.02
  
  - Given that the p-value is less than the significance level (usually 0.05), we reject the null hypothesis. This means that there is sufficient evidence to conclude that at least one of the group means is different from the others.
  
Interpretation:
  - The differences between the groups are statistically significant.
  - The factor (grouping variable) being compared has a statistically significant effect on the dependent variable.
  - It's not possible to determine from the ANOVA alone which specific group(s) differ from each other, only that there is a difference somewhere.
  - Post-hoc tests (e.g., Tukey's HSD, Bonferroni, etc.) can be conducted to determine pairwise differences between groups if needed.
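
The decision rule can be sketched in code. The problem does not state the degrees of freedom, so the group count and sample size below are hypothetical values chosen only for illustration; with them, the p-value from the F distribution comes out close to the 0.02 quoted in the problem.

```python
from scipy import stats

f_statistic = 5.23
alpha = 0.05

# Assumption: 3 groups and 18 observations in total (not given in the problem)
n_groups, n_obs = 3, 18
df_between = n_groups - 1
df_within = n_obs - n_groups

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Equivalent check: compare F against the critical value at the chosen alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, critical F = {f_critical:.2f}, reject H0: {reject_null}")
```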

### Problem_7:  In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling Missing Data:
  - Listwise Deletion (Least Preferred): This removes a participant entirely if any of their data points are missing. It discards observed data and reduces power, especially when many participants are each missing only a few values.
  - Mean/Median Imputation: Missing values are replaced with the mean or median of the participant's other measurements. This assumes the data are missing completely at random (MCAR) and can underestimate variability.
  - Model-based Methods (Preferred): Mixed-effects models use all available observations while respecting the repeated measures structure. They remain valid when data are missing at random (MAR), a weaker assumption than MCAR.
  
Consequences of Different Methods:
  - Listwise Deletion: Loss of power, biased estimates if missingness is related to the outcome.
  - Mean/Median Imputation: May underestimate variability, might not be suitable for skewed data.
  - Mixed-effects Models: More complex but statistically sound, requires careful model assumptions check.
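
The three approaches can be compared on a small simulated data set. The sketch below assumes hypothetical data (20 subjects, 3 time points, missing values introduced by hand) and uses a random-intercept model from `statsmodels.formula.api.mixedlm` as one possible mixed-effects specification, not the only one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: 20 subjects measured at 3 time points, wide format (assumption)
rng = np.random.default_rng(0)
n_subjects = 20
subject_effect = rng.normal(0, 5, size=n_subjects)
wide = pd.DataFrame({
    "t1": 50 + subject_effect + rng.normal(0, 2, size=n_subjects),
    "t2": 52 + subject_effect + rng.normal(0, 2, size=n_subjects),
    "t3": 54 + subject_effect + rng.normal(0, 2, size=n_subjects),
})
wide.index.name = "subject"

# Make the last measurement missing for the first 6 subjects
affected = wide.index[:6]
wide.loc[affected, "t3"] = np.nan

# 1) Listwise deletion: any subject with a missing value is dropped entirely
complete_cases = wide.dropna()

# 2) Per-subject mean imputation: fill each gap with that subject's own mean
subject_means = wide.mean(axis=1)
imputed = wide.copy()
for col in imputed.columns:
    imputed[col] = imputed[col].fillna(subject_means)

# Within-subject variance shrinks for the imputed subjects
var_before = wide.loc[affected].var(axis=1)
var_after = imputed.loc[affected].var(axis=1)

# 3) Mixed-effects model: long format, dropping only the missing rows
long_df = (
    wide.reset_index()
    .melt(id_vars="subject", var_name="time", value_name="score")
    .dropna()
)
model = smf.mixedlm("score ~ C(time)", data=long_df, groups=long_df["subject"])
result = model.fit()

print("Subjects kept by listwise deletion:", len(complete_cases))
print("Observations used by the mixed model:", len(long_df))
print(result.params)
```

Listwise deletion keeps only the complete subjects, while the mixed model still uses the two observed measurements from each affected subject.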

### Problem_8:  What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Here are some common post-hoc tests used after ANOVA and when to use them:
  - Tukey's Honestly Significant Difference (HSD): This test compares all possible pairs of means while controlling the family-wise error rate (FWER). It assumes equal variances and works best when group sizes are equal or similar.

  - Games-Howell test: This is a more robust alternative to Tukey's HSD when variances are unequal between groups.

  - Scheffe's test: This is a very conservative test that covers all possible contrasts, not just pairwise comparisons, and works with unequal group sizes. It assumes equal variances and is generally less powerful than Tukey's HSD for pairwise comparisons.

  - Fisher's Least Significant Difference (LSD): This is less conservative than Tukey's HSD because it does not adjust for multiple comparisons, so it has a higher chance of false positives (Type I errors).

Choosing the Right Test:
  - Use Tukey's HSD when variances are equal and Games-Howell when they are not.
  - Use Scheffe's test if you need a very conservative test or want to examine complex contrasts beyond pairwise comparisons.
  - Use Fisher's LSD with caution due to the higher risk of false positives.
  
Example:
  - Imagine you're studying the effects of different fertilizers (3 groups) on plant growth. ANOVA reveals a significant overall effect. You can't tell which specific fertilizers differ from each other. Here, a post-hoc test like Tukey's HSD would be necessary to determine which fertilizer types lead to significantly different plant growth.
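
As a complement to Tukey's HSD, the fertilizer example can also be followed up with Bonferroni-corrected pairwise tests. The sketch below assumes hypothetical growth measurements and uses Welch's t-test (`equal_var=False`) for each pair, with `multipletests` from statsmodels applying the Bonferroni correction. This is a simple stand-in for situations with unequal variances, not an implementation of Games-Howell.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical plant growth (cm) under three fertilizers (assumption)
growth = {
    "F1": np.array([20.1, 21.3, 19.8, 22.0, 20.7]),
    "F2": np.array([23.5, 24.1, 22.8, 25.0, 23.9]),
    "F3": np.array([20.5, 21.0, 19.9, 21.8, 20.2]),
}

# All pairwise comparisons: (F1, F2), (F1, F3), (F2, F3)
pairs = list(combinations(growth, 2))
raw_p = []
for a, b in pairs:
    # equal_var=False selects Welch's t-test, which does not assume equal variances
    t_stat, p = stats.ttest_ind(growth[a], growth[b], equal_var=False)
    raw_p.append(p)

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, reject H0: {r}")
```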

### Problem_9:  A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [6]:
import numpy as np
from scipy.stats import f_oneway

# Sample data (small example dataset: 5 participants per diet, not the 50 from the problem)
weight_loss = {
    "Diet A": [2.5, 3.1, 1.8, 2.7, 4.2],
    "Diet B": [1.9, 2.3, 3.0, 2.8, 1.5],
    "Diet C": [3.4, 2.9, 1.7, 4.1, 3.8]
}

# Extract data into separate arrays
diet_A_data = np.array(weight_loss["Diet A"])
diet_B_data = np.array(weight_loss["Diet B"])
diet_C_data = np.array(weight_loss["Diet C"])

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(diet_A_data, diet_B_data, diet_C_data)

# Print results and interpretation
print("F-statistic:", f_statistic)
print("p-value:", p_value)

if p_value < 0.05:
  print("There is statistically significant evidence (p < 0.05) to reject the null hypothesis of equal mean weight loss across all diets. This suggests at least one diet leads to a different average weight loss compared to the others.")
else:
  print("We fail to reject the null hypothesis (p >= 0.05). There is not enough evidence to conclude that the mean weight loss differs significantly between the three diets.")


F-statistic: 1.4481751824817515
p-value: 0.27328022439387284
We fail to reject the null hypothesis (p >= 0.05). There is not enough evidence to conclude that the mean weight loss differs significantly between the three diets.


### Problem_10:  A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [14]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

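# Example data: create a DataFrame with 'software', 'experience', and 'time' columns.
# Note: these are 90 synthetic rows (not the 30 employees from the problem); the times
# repeat in a fixed pattern tied to software, while experience simply alternates.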
data = {
    'software': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'] * 10,
    'experience': ['novice', 'experienced'] * 45,
    'time': [12, 14, 15, 10, 11, 13, 16, 18, 20] * 10
}
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('time ~ C(software) + C(experience) + C(software):C(experience)', data=df).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract F-statistics and p-values
f_software = anova_table.loc['C(software)', 'F']
p_software = anova_table.loc['C(software)', 'PR(>F)']

f_experience = anova_table.loc['C(experience)', 'F']
p_experience = anova_table.loc['C(experience)', 'PR(>F)']

f_interaction = anova_table.loc['C(software):C(experience)', 'F']
p_interaction = anova_table.loc['C(software):C(experience)', 'PR(>F)']

print("Main Effect for Software: F =", f_software, "p =", p_software)
print("Main Effect for Experience: F =", f_experience, "p =", p_experience)
print("Interaction Effect: F =", f_interaction, "p =", p_interaction)

Main Effect for Software: F = 166.38461538461763 p = 6.082929487455021e-30
Main Effect for Experience: F = 3.1216690576613713e-27 p = 1.0
Interaction Effect: F = 2.119184536691426e-28 p = 1.0


Interpretation (using a significance level of 0.05):
  - Software: F ≈ 166.38 with p ≈ 6.1e-30, far below 0.05, so there is a significant main effect of software on the time taken to complete the task.
  - Experience: F is effectively 0 with p = 1.0, so there is no evidence of a main effect of experience.
  - Interaction: F is effectively 0 with p = 1.0, so there is no evidence that the effect of software depends on experience level.
  - These values follow from how the example data were constructed (times repeat in a fixed pattern tied to software, while experience simply alternates), so they illustrate the procedure rather than a real finding.

### Problem_11:  An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [17]:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data: test scores for control and experimental groups
control_scores = np.array([75, 80, 85, 70, 78, 82, 79, 81, 77, 76])
experimental_scores = np.array([85, 88, 90, 92, 78, 86, 83, 89, 87, 91])

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_scores, experimental_scores)

print("Two-sample t-test results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

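# Perform post-hoc Tukey's HSD test if the t-test is significant.
# Note: with only two groups there is a single pairwise comparison, so Tukey's HSD
# repeats the t-test result; it is run here only because the problem asks for it.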
if p_value < 0.05:
    # Combine scores and labels
    scores = np.concatenate([control_scores, experimental_scores])
    labels = ['Control'] * len(control_scores) + ['Experimental'] * len(experimental_scores)
    
    # Perform Tukey's HSD test
    tukey_results = pairwise_tukeyhsd(scores, labels, alpha=0.05)
    print("\nTukey's HSD test results:")
    print(tukey_results)
else:
    print("\nNo significant differences found, post-hoc test not conducted.")


Two-sample t-test results:
t-statistic: -4.611556534850402
p-value: 0.0002166740355855245

Tukey's HSD test results:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower upper  reject
--------------------------------------------------------
Control Experimental      8.6 0.0002 4.682 12.518   True
--------------------------------------------------------


### Problem_12:  A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [18]:
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data: sales for Store A, Store B, and Store C for 30 days
np.random.seed(0)  # For reproducibility
store_a_sales = np.random.normal(loc=100, scale=20, size=30)
store_b_sales = np.random.normal(loc=110, scale=25, size=30)
store_c_sales = np.random.normal(loc=120, scale=30, size=30)

# Combine the sales data into a DataFrame
data = {
    'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    'Sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales])
}
df = pd.DataFrame(data)

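# Perform one-way ANOVA.
# Note: f_oneway treats the three stores as independent groups and ignores the pairing
# of sales by day, so this is not a repeated measures analysis.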
f_statistic, p_value = f_oneway(store_a_sales, store_b_sales, store_c_sales)

print("One-way ANOVA results:")
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Perform post-hoc Tukey's HSD test if the ANOVA results are significant
if p_value < 0.05:
    tukey_results = pairwise_tukeyhsd(df['Sales'], df['Store'])
    print("\nTukey's HSD test results:")
    print(tukey_results)
else:
    print("\nNo significant differences found, post-hoc test not conducted.")


One-way ANOVA results:
F-statistic: 2.1379139395433953
p-value: 0.12405407287312517

No significant differences found, post-hoc test not conducted.
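
Because the same 30 days are observed for every store, the design calls for a repeated measures analysis with day as the "subject". The sketch below is one way to do this with `AnovaRM` from statsmodels, reusing the simulated sales from the cell above; the `Day` column is an addition made for illustration. The simulated values are drawn independently, so they contain no real day-to-day pairing, and the example shows only the mechanics.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Same simulated sales as the cell above
np.random.seed(0)
store_a_sales = np.random.normal(loc=100, scale=20, size=30)
store_b_sales = np.random.normal(loc=110, scale=25, size=30)
store_c_sales = np.random.normal(loc=120, scale=30, size=30)

# Long format: one row per (day, store); 'Day' identifies the repeated unit (assumption)
n_days = 30
long_df = pd.DataFrame({
    "Day": np.tile(np.arange(1, n_days + 1), 3),
    "Store": np.repeat(["A", "B", "C"], n_days),
    "Sales": np.concatenate([store_a_sales, store_b_sales, store_c_sales]),
})

# Repeated measures ANOVA: Store is the within-subject factor, Day is the subject
rm_results = AnovaRM(data=long_df, depvar="Sales", subject="Day", within=["Store"]).fit()
print(rm_results.anova_table)
```

If the store effect is significant, paired comparisons between stores (for example, paired t-tests with a Bonferroni correction) would be the matching follow-up, since Tukey's HSD assumes independent groups.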
