#### Number 4
How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [9]:
import numpy as np
import scipy.stats as stats

group1 = np.array([3, 4, 5, 6, 7])
group2 = np.array([2, 3, 4, 5, 6])
group3 = np.array([1, 2, 3, 4, 5])
data = np.concatenate((group1, group2, group3))

groups = np.array(['Group 1', 'Group 2', 'Group 3'])

# Calculate the total sum of squares (SST)
sst = np.sum((data - np.mean(data)) ** 2)

# Calculate the explained (between-group) sum of squares (SSE):
# each group's squared deviation from the grand mean is weighted by that group's size
group_means = np.array([np.mean(group1), np.mean(group2), np.mean(group3)])
group_sizes = np.array([len(group1), len(group2), len(group3)])
sse = np.sum(group_sizes * (group_means - np.mean(data)) ** 2)

# Calculate the residual (within-group) sum of squares (SSR)
ssr = sst - sse

# Degrees of freedom
n_groups = len(groups)
n_total = len(data)
df_total = n_total - 1
df_groups = n_groups - 1
df_residual = df_total - df_groups

# Mean squares (MS)
ms_groups = sse / df_groups
ms_residual = ssr / df_residual

# F-statistic
f_statistic = ms_groups / ms_residual

# p-value (upper tail of the F distribution)
p_value = stats.f.sf(f_statistic, df_groups, df_residual)

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)
print("Degrees of Freedom (Groups, Residual):", df_groups, df_residual)
print("F-Statistic:", f_statistic)
print("p-value:", round(p_value, 6))


Total Sum of Squares (SST): 40.0
Explained Sum of Squares (SSE): 10.0
Residual Sum of Squares (SSR): 30.0
Degrees of Freedom (Groups, Residual): 2 12
F-Statistic: 2.0
p-value: 0.177979
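
A quick way to sanity-check a hand-computed decomposition is to compute the within-group sum of squares directly (rather than by subtraction) and to compare the resulting F-statistic with `scipy.stats.f_oneway`. The sketch below assumes the same three groups as above; the names `ss_between`, `ss_within`, and `ss_total` are illustrative and not part of the original cell. If the decomposition is right, `ss_between + ss_within` equals `ss_total` and the two F values agree.

```python
import numpy as np
from scipy import stats

group1 = np.array([3, 4, 5, 6, 7])
group2 = np.array([2, 3, 4, 5, 6])
group3 = np.array([1, 2, 3, 4, 5])
samples = [group1, group2, group3]

all_values = np.concatenate(samples)
grand_mean = np.mean(all_values)

# Between-group SS: group size times squared distance of each group mean from the grand mean
ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in samples)
# Within-group SS: squared distances of observations from their own group mean
ss_within = sum(np.sum((g - np.mean(g)) ** 2) for g in samples)
# Total SS: squared distances of all observations from the grand mean
ss_total = np.sum((all_values - grand_mean) ** 2)

df_between = len(samples) - 1
df_within = len(all_values) - len(samples)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Reference computation from SciPy
f_scipy, p_scipy = stats.f_oneway(*samples)

print("Between + within == total:", np.isclose(ss_between + ss_within, ss_total))
print("Manual F:", f_manual, "| f_oneway F:", f_scipy, "| p:", p_scipy)
```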


#### Number 11
An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [8]:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate random test scores for the control and experimental groups
np.random.seed(0)  # Set the random seed for reproducibility
control_scores = np.random.normal(70, 10, size=100)
experimental_scores = np.random.normal(75, 12, size=100)

# Perform the two-sample t-test
t_stat, p_value = ttest_ind(control_scores, experimental_scores)

# Print the t-statistic and p-value
print("Two-sample t-test results:")
print("t-statistic:", t_stat)
print("p-value:", p_value)

# Perform post-hoc test (Tukey's HSD test).
# With only two groups there is a single pairwise comparison, so this repeats the t-test above.
all_scores = np.concatenate((control_scores, experimental_scores))
groups = np.array(['Control'] * len(control_scores) + ['Experimental'] * len(experimental_scores))

posthoc = pairwise_tukeyhsd(all_scores, groups)

# Print the post-hoc test results
print("\nPost-hoc test (Tukey's HSD) results:")
print(posthoc)


Two-sample t-test results:
t-statistic: -3.3511267852812807
p-value: 0.0009638719426795379

Post-hoc test (Tukey's HSD) results:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   5.3861 0.001 2.2166 8.5556   True
--------------------------------------------------------
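
With only two groups there is just one pairwise comparison, so a post-hoc test cannot say more than the t-test already does. A related fact worth seeing in code: the pooled-variance two-sample t-test and a one-way ANOVA on the same two groups are the same test, with F equal to t squared and identical p-values. The sketch below uses freshly generated illustrative data (not the scores above), and the sample size of 50 per group is an assumption.

```python
import numpy as np
from scipy.stats import ttest_ind, f_oneway

# Illustrative data only; the seed and group sizes are assumptions
rng = np.random.default_rng(0)
control = rng.normal(70, 10, size=50)
experimental = rng.normal(75, 12, size=50)

# Pooled-variance (Student) two-sample t-test
t_stat, p_t = ttest_ind(control, experimental)

# One-way ANOVA on the same two groups
f_stat, p_f = f_oneway(control, experimental)

print("t^2:", t_stat ** 2, "| F:", f_stat)
print("t-test p:", p_t, "| ANOVA p:", p_f)
```

Post-hoc procedures become useful once there are three or more groups, because only then does a significant overall test leave open which pairs differ.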


In [None]:
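import numpy as np
from scipy import stats

# Small hand-entered samples of test scores for each group
control_scores = np.array([80, 85, 90, 92, 87, 83, 78, 79, 81, 84])
experimental_scores = np.array([75, 78, 82, 79, 81, 84, 88, 89, 86, 90])

# Two-sample t-test (pooled variance, two-sided by default)
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)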

print("Two-Sample T-Test Results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)
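
If the two groups cannot be assumed to have equal variances, Welch's t-test is a common alternative; in SciPy it is the same `ttest_ind` call with `equal_var=False`. The sketch below applies it to the same ten scores per group and also reports Cohen's d as a rough effect size. The helper name `cohens_d` is hypothetical and written here for illustration only.

```python
import numpy as np
from scipy import stats

control_scores = np.array([80, 85, 90, 92, 87, 83, 78, 79, 81, 84])
experimental_scores = np.array([75, 78, 82, 79, 81, 84, 88, 89, 86, 90])

# Welch's t-test: does not assume equal population variances
t_welch, p_welch = stats.ttest_ind(control_scores, experimental_scores, equal_var=False)


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * np.var(a, ddof=1) + (n_b - 1) * np.var(b, ddof=1)) / (n_a + n_b - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)


d = cohens_d(control_scores, experimental_scores)

print("Welch t-statistic:", t_welch)
print("Welch p-value:", p_welch)
print("Cohen's d:", d)
```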

#### Number 10
A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [2]:
import statsmodels.api as sm
import pandas as pd
from statsmodels.formula.api import ols
data = pd.DataFrame({
    'Software': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 
                 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'A'],
    'Experience': ['Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice',
                   'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Experienced',
                   'Experienced', 'Experienced', 'Experienced', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice',
                   'Novice', 'Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced'],
    'Time': [10, 11, 9, 12, 13, 11, 15, 16, 14, 9, 10, 8, 11, 12, 10, 14, 15, 13, 8, 9, 7, 10, 11, 9, 13, 14, 12,
             8, 9, 7]
})

model = ols('Time ~ Software + Experience + Software:Experience', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table

| Source | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| Software | 137.071429 | 2.0 | 41.641953 | 1.570797e-08 |
| Experience | 0.321429 | 1.0 | 0.195298 | 0.6624995 |
| Software:Experience | 0.428571 | 2.0 | 0.130199 | 0.8785366 |
| Residual | 39.5 | 24.0 | | |


In [3]:
# Extract F-statistics and p-values
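# Look rows up by label; integer positions on a label-indexed Series are deprecated in pandas
f_stat_software = anova_table.loc['Software', 'F']
p_value_software = anova_table.loc['Software', 'PR(>F)']
f_stat_experience = anova_table.loc['Experience', 'F']
p_value_experience = anova_table.loc['Experience', 'PR(>F)']
f_stat_interaction = anova_table.loc['Software:Experience', 'F']
p_value_interaction = anova_table.loc['Software:Experience', 'PR(>F)']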

# Print the results
print("Two-Way ANOVA Results:")
print("--------------------------------------------------------")
print("Main Effect of Software:")
print("F-Statistic:", f_stat_software)
print("p-value:", p_value_software)
print()
print("Main Effect of Experience:")
print("F-Statistic:", f_stat_experience)
print("p-value:", p_value_experience)
print()
print("Interaction Effect:")
print("F-Statistic:", f_stat_interaction)
print("p-value:", p_value_interaction)

Two-Way ANOVA Results:
--------------------------------------------------------
Main Effect of Software:
F-Statistic: 41.64195298372507
p-value: 1.5707966943612276e-08

Main Effect of Experience:
F-Statistic: 0.1952983725135577
p-value: 0.6624994739968593

Interaction Effect:
F-Statistic: 0.13019891500903766
p-value: 0.8785366445470055
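
The question also asks for an interpretation. Read at a significance level of 0.05 (an assumed threshold, not stated in the question), the results above indicate a significant main effect of Software (F ≈ 41.64, p ≈ 1.6e-08), no significant main effect of Experience (p ≈ 0.66), and no significant Software × Experience interaction (p ≈ 0.88). The sketch below makes that reading explicit and adds partial eta squared as a rough effect size; the numbers are copied from the ANOVA table above, and the effect-size step is an addition for illustration rather than part of the original analysis.

```python
# Sums of squares and p-values copied from the ANOVA table above
anova_rows = {
    "Software": {"sum_sq": 137.071429, "p": 1.570797e-08},
    "Experience": {"sum_sq": 0.321429, "p": 0.6624995},
    "Software:Experience": {"sum_sq": 0.428571, "p": 0.8785366},
}
ss_residual = 39.5
alpha = 0.05  # assumed significance level

summary = {}
for effect, row in anova_rows.items():
    # Partial eta squared: effect SS relative to effect SS plus residual SS
    partial_eta_sq = row["sum_sq"] / (row["sum_sq"] + ss_residual)
    significant = row["p"] < alpha
    summary[effect] = (significant, partial_eta_sq)
    verdict = "significant" if significant else "not significant"
    print(f"{effect}: {verdict} at alpha={alpha}, partial eta^2 = {partial_eta_sq:.3f}")
```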


#### Number 12
A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [10]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate random sales data for three stores over 30 days
np.random.seed(0)  # Set the random seed for reproducibility
store_a_sales = np.random.normal(1000, 200, size=30)
store_b_sales = np.random.normal(950, 180, size=30)
store_c_sales = np.random.normal(1100, 220, size=30)

# Create a DataFrame with the sales data
data = pd.DataFrame({
    'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    'Day': list(range(1, 31)) * 3,
    'Sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales])
})

# Perform repeated measures ANOVA
model = AnovaRM(data, 'Sales', 'Day', within=['Store']).fit()

# Print the ANOVA summary
print(model.summary())

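# Perform post-hoc test (Tukey's HSD test).
# Note: Tukey's HSD treats the three stores as independent samples and ignores the pairing by day.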
posthoc = pairwise_tukeyhsd(data['Sales'], data['Store'])

# Print the post-hoc test results
print("\nPost-hoc test (Tukey's HSD) results:")
print(posthoc)


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  7.9348 2.0000 58.0000 0.0009


Post-hoc test (Tukey's HSD) results:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
group1 group2  meandiff p-adj    lower    upper   reject
--------------------------------------------------------
     A      B -190.6852 0.0012 -314.1191 -67.2513   True
     A      C  -17.9899 0.9356 -141.4238  105.444  False
     B      C  172.6953 0.0035   49.2614 296.1292   True
--------------------------------------------------------
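
Because the same 30 days are measured at every store, a follow-up that respects the pairing may be preferable to one that treats the stores as independent samples. One common option is paired t-tests on each pair of stores with a Bonferroni-adjusted threshold. The sketch below is a minimal illustration on freshly generated data (not the sales figures above); the seed, the dictionary layout, and the 0.05 family-wise level are assumptions.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_rel

# Illustrative data only: one value per store for each of 30 days
rng = np.random.default_rng(0)
sales = {
    "A": rng.normal(1000, 200, size=30),
    "B": rng.normal(950, 180, size=30),
    "C": rng.normal(1100, 220, size=30),
}

pairs = list(combinations(sales, 2))  # all store pairs: (A, B), (A, C), (B, C)
alpha = 0.05                          # assumed family-wise significance level
alpha_per_test = alpha / len(pairs)   # Bonferroni-adjusted threshold per comparison

results = {}
for store_1, store_2 in pairs:
    # Paired t-test: compares the two stores day by day
    t_stat, p_val = ttest_rel(sales[store_1], sales[store_2])
    results[(store_1, store_2)] = (t_stat, p_val, p_val < alpha_per_test)
    print(f"{store_1} vs {store_2}: t = {t_stat:.3f}, p = {p_val:.4f}, "
          f"reject at adjusted alpha: {p_val < alpha_per_test}")
```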


#### Number 8
What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

After conducting an analysis of variance (ANOVA) and finding a significant overall effect, post-hoc tests are used to determine which specific groups or conditions differ significantly from each other. Some common post-hoc tests include:   

Tukey's Honestly Significant Difference (HSD) Test: This test is commonly used when comparing all possible pairs of groups in a study. It controls the family-wise error rate, which is the probability of making at least one Type I error across all comparisons.   
Example: Suppose you conducted an experiment with three different treatment conditions, and the ANOVA results showed a significant overall effect. To determine which specific treatment conditions differ significantly from each other, you can apply Tukey's HSD test.   

Bonferroni Correction: This correction is used to adjust the significance level of individual comparisons when performing multiple pairwise comparisons. It divides the desired significance level by the number of comparisons to control the family-wise error rate.   
Example: If you conduct multiple t-tests or pairwise comparisons after an ANOVA with a significance level of 0.05, the Bonferroni correction would adjust the individual significance level to 0.05 divided by the number of comparisons.   

Scheffe's Test: This test is more conservative than Tukey's HSD test when only pairwise comparisons are of interest. It is suited to testing complex contrasts (for example, the average of two groups against a third), including contrasts chosen after inspecting the data, and it remains valid when group sizes are unequal.
Example: In a study with four treatment conditions, the ANOVA shows a significant effect. Scheffe's test can be applied to determine which specific groups differ significantly from each other.   

Dunnett's Test: This test is used when there is a control group or reference condition against which other groups are compared. It controls the overall significance level while focusing on the comparisons between the control group and other groups.
Example: Suppose you have a control group and multiple experimental groups in a study. After conducting an ANOVA, Dunnett's test can be used to compare each experimental group with the control group.   

Fisher's Least Significant Difference (LSD) Test: This test is less conservative than Tukey's HSD test because it makes no adjustment for multiple comparisons, so it does not control the family-wise error rate. It is best reserved for a small number of comparisons planned in advance (a priori hypotheses).
Example: In a study with three treatment conditions, the ANOVA shows a significant effect. Fisher's LSD test can be used to compare pairs of groups based on specific hypotheses or research questions.   

Post-hoc tests are used when there is a significant overall effect in ANOVA and researchers want to determine the specific pairwise differences between groups. Each post-hoc test has its own assumptions and advantages, so the choice of which test to use depends on the research question, the number of groups, the nature of the data, and the desired control of Type I errors.
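
To make the motivation for these corrections concrete, the sketch below counts the comparisons each strategy implies and shows how quickly the family-wise error rate grows when nothing is adjusted. It assumes independent tests for the unadjusted rate, which is a simplification; the function names are illustrative.

```python
def n_pairwise(k):
    """Number of all-pairs comparisons among k groups (Tukey, Bonferroni on all pairs)."""
    return k * (k - 1) // 2


def n_vs_control(k):
    """Number of comparisons when k - 1 groups are each compared to one control (Dunnett)."""
    return k - 1


def unadjusted_fwer(alpha, m):
    """Probability of at least one Type I error across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-comparison significance level under the Bonferroni correction."""
    return alpha / m


alpha = 0.05
for k in (3, 4, 5):
    m = n_pairwise(k)
    print(f"{k} groups: {m} pairwise comparisons, "
          f"{n_vs_control(k)} comparisons against a control, "
          f"unadjusted FWER = {unadjusted_fwer(alpha, m):.3f}, "
          f"Bonferroni per-test alpha = {bonferroni_alpha(alpha, m):.4f}")
```

Comparing only against a control needs fewer comparisons than testing all pairs, which is one reason Dunnett's test can be more powerful than an all-pairs procedure when a control group is the natural reference.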