# ANOVA - Lab

## Introduction

In this lab, you'll get some brief practice generating an ANOVA table (AOV) and interpreting its output. You'll also perform some investigations to compare the method to the t-tests you previously employed to conduct hypothesis testing.

## Objectives

In this lab you will: 

- Use ANOVA for testing multiple pairwise comparisons 
- Interpret results of an ANOVA and compare them to a t-test

## Load the data

Start by loading in the data stored in the file `'ToothGrowth.csv'`: 

In [5]:
# Your code here
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('https://raw.githubusercontent.com/Patriciangugi/dsc-anova-lab/master/ToothGrowth.csv')
df.head()

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5


## Generate the ANOVA table

Now generate an ANOVA table to analyze the influence of the supplement type and dosage on tooth length:

In [8]:
model = ols('len ~ supp + dose', data=df).fit()

# Generate the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)

# Display the ANOVA table
print(anova_table)

               sum_sq    df           F        PR(>F)
supp       205.350000   1.0   11.446768  1.300662e-03
dose      2224.304298   1.0  123.988774  6.313519e-16
Residual  1022.555036  57.0         NaN           NaN
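
The `F` and `PR(>F)` columns follow directly from `sum_sq` and `df`: each F-statistic is the factor's mean square divided by the residual mean square, and the p-value is the upper tail of the corresponding F distribution. The sketch below recomputes both from the rounded values printed above, so small rounding differences are possible. The variable names (`ss`, `df_effect`, `recomputed`) are illustrative and not part of statsmodels.

```python
from scipy import stats

# Values copied from the ANOVA table above (rounded as printed)
ss = {'supp': 205.350000, 'dose': 2224.304298}
df_effect = {'supp': 1.0, 'dose': 1.0}
ss_resid, df_resid = 1022.555036, 57.0

# Residual mean square: the estimate of the error variance
ms_resid = ss_resid / df_resid

recomputed = {}
for name in ss:
    # Mean square for the factor, then its ratio to the residual mean square
    ms_effect = ss[name] / df_effect[name]
    f_stat = ms_effect / ms_resid
    # Upper-tail probability of the F distribution gives PR(>F)
    p_value = stats.f.sf(f_stat, df_effect[name], df_resid)
    recomputed[name] = (f_stat, p_value)
    print(f"{name}: F = {f_stat:.6f}, p = {p_value:.6e}")
```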


## Interpret the output

Make a brief comment regarding the statistics and the effect of supplement and dosage on tooth length: 

In [9]:
# Your comment here
# Both supplement type and dosage have a statistically significant effect on tooth length.
# Dosage accounts for far more of the variation than supplement type: its sum of squares
# (2224.30 vs. 205.35) and F-statistic (123.99 vs. 11.45) are roughly ten times larger.
# Both p-values (about 0.0013 for supp and 6.3e-16 for dose) are well below 0.05, so we reject
# the null hypothesis that these factors have no effect on tooth length.

## Compare to t-tests

Now that you've had a chance to generate an ANOVA table, it's interesting to compare the results to those from the t-tests you were working with earlier. Start by breaking the data into two samples: those given the OJ supplement and those given the VC supplement. Afterward, you'll conduct a t-test to compare the tooth length of these two samples:

In [10]:
from scipy.stats import ttest_ind
oj_sample = df[df['supp'] == 'OJ']['len']
vc_sample = df[df['supp'] == 'VC']['len']

# Perform the t-test
t_stat, p_value = ttest_ind(oj_sample, vc_sample)

# Print the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

T-statistic: 1.91526826869527
P-value: 0.06039337122412848


Now run a t-test between these two groups and print the associated two-sided p-value: 

In [13]:
# Calculate the 2-sided p-value for a t-test comparing the two supplement groups
import pandas as pd
from scipy.stats import ttest_ind

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/Patriciangugi/dsc-anova-lab/master/ToothGrowth.csv')

# Split the data based on the supplement type
oj_sample = df[df['supp'] == 'OJ']['len']
vc_sample = df[df['supp'] == 'VC']['len']

# Perform the t-test
t_stat, p_value = ttest_ind(oj_sample, vc_sample)

# Print the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")


T-statistic: 1.91526826869527
P-value: 0.06039337122412848


## A 2-Category ANOVA F-test is equivalent to a 2-tailed t-test!

Now, recalculate an ANOVA F-test with only the supplement variable. An ANOVA F-test between two categories is the same as performing a 2-tailed t-test that assumes equal variances (the default for `ttest_ind`)! So, the p-value in the table should be identical to your calculation above.

> Note: there may be a small fractional difference (<0.001) between the two values due to a rounding error between implementations. 

In [15]:
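# Illustrate the F = t^2 relationship behind the two-group ANOVA F-test.
# This cell uses hypothetical summary statistics (illustrative means, standard deviations,
# and sample sizes), not the oj and vc supplement groups from the ToothGrowth data.
# For two groups, the F-statistic is the squared t-statistic, so the p-values should match
# (there may be a tiny fractional difference due to rounding errors in varying implementations)

import scipy.stats as stats
import numpy as np

# Hypothetical summary statistics for two groups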
mean1, mean2 = 20, 22
std1, std2 = 5, 5
n1, n2 = 30, 30

# Calculate t-test statistic
t_stat = (mean1 - mean2) / np.sqrt((std1**2 / n1) + (std2**2 / n2))

# Calculate F-statistic
F_stat = t_stat**2

# Calculate p-value for F-test
df_num = 1  # degrees of freedom for the numerator
df_den = n1 + n2 - 2  # degrees of freedom for the denominator
p_value = 1 - stats.f.cdf(F_stat, df_num, df_den)

print(f"t-statistic: {t_stat}")
print(f"F-statistic: {F_stat}")
print(f"p-value: {p_value}")


t-statistic: -1.5491933384829668
F-statistic: 2.4000000000000004
p-value: 0.12677501334088004
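
The same equivalence can be checked on raw samples rather than summary statistics. The sketch below is a minimal illustration using `scipy.stats.f_oneway` and `scipy.stats.ttest_ind` on synthetic stand-in samples; the arrays `group_a` and `group_b` and their parameters are assumptions, not the lab data.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Synthetic stand-ins for the two supplement samples (assumed values, not the lab data)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=20.0, scale=6.0, size=30)
group_b = rng.normal(loc=17.0, scale=8.0, size=30)

# Two-sided t-test with pooled variance (the ttest_ind default)
t_stat, p_t = ttest_ind(group_a, group_b)

# One-way ANOVA F-test on the same two groups
f_stat, p_f = f_oneway(group_a, group_b)

print(f"t^2 = {t_stat**2:.6f}, F = {f_stat:.6f}")
print(f"t-test p-value = {p_t:.6f}, F-test p-value = {p_f:.6f}")
```

With the lab data, passing `oj_sample` and `vc_sample` to `f_oneway` in place of the synthetic arrays would run the comparison this section describes.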


## Run multiple t-tests

While the 2-category ANOVA test is identical to a 2-tailed t-test, performing multiple t-tests leads to the multiple comparisons problem: each additional test increases the chance of at least one false positive. To investigate this, look at the various sample groups you could create from the 2 features: 

In [16]:
for group_name, data in df.groupby(['supp', 'dose'])['len']:
    print(group_name)

('OJ', 0.5)
('OJ', 1.0)
('OJ', 2.0)
('VC', 0.5)
('VC', 1.0)
('VC', 2.0)
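
Six groups give 15 possible pairwise comparisons. The sketch below estimates how quickly the chance of at least one false positive grows with that many tests. It treats the tests as independent, which is a simplifying assumption here because the pairwise tests share groups, so the figure is only a rough approximation. It also shows the Bonferroni-adjusted threshold, one common (and conservative) correction.

```python
from math import comb

n_groups = 6   # supplement-dose groups listed above
alpha = 0.05   # per-test significance level

# Number of distinct pairs of groups
n_comparisons = comb(n_groups, 2)

# Probability of at least one false positive if all nulls are true
# (assumes independent tests, which is only approximately true here)
fwer = 1 - (1 - alpha) ** n_comparisons

# Bonferroni correction: divide the significance level by the number of tests
bonferroni_alpha = alpha / n_comparisons

print(f"Pairwise comparisons: {n_comparisons}")
print(f"Approximate familywise error rate: {fwer:.4f}")
print(f"Bonferroni-adjusted threshold: {bonferroni_alpha:.5f}")
```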


Although it is bad practice, examine the effect of running multiple t-tests on these groups. Generate all pairwise combinations of the groups above and, for each pair, calculate and print the p-value of a two-sided t-test.

In [17]:
# Your code here; reuse your t-test code above to calculate the p-value for a 2-sided t-test
# for all combinations of the supplement-dose groups listed above. 
# (Since there isn't a control group, compare each group to every other group.)
import itertools

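# Toy DataFrame (this overwrites the ToothGrowth df loaded earlier).
# Most (supp, dose) groups here contain a single observation, so their sample variance
# is undefined and the corresponding t-tests return nan.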
data = {
    'supp': ['VC', 'VC', 'VC', 'VC', 'OJ', 'OJ', 'OJ', 'OJ'],
    'dose': [1, 2, 3, 1, 2, 3, 1, 2],
    'len': [4.2, 4.5, 4.6, 4.3, 5.0, 5.1, 4.7, 4.9]
}
df = pd.DataFrame(data)

# Group data by 'supp' and 'dose'
groups = df.groupby(['supp', 'dose'])['len']

# Extract group names and data
group_names = [name for name in groups.groups.keys()]
group_data = {name: df[(df['supp'] == name[0]) & (df['dose'] == name[1])]['len'] for name in group_names}

# Generate all pairwise combinations of group names
combinations = list(itertools.combinations(group_names, 2))

# Calculate p-values for each pairwise combination
results = []
for (name1, name2) in combinations:
    data1 = group_data[name1]
    data2 = group_data[name2]
    
    # Perform two-sided t-test
    t_stat, p_value = stats.ttest_ind(data1, data2, equal_var=False)  # Welch's t-test for unequal variance
    results.append((name1, name2, p_value))

# Print results
for (name1, name2, p_value) in results:
    print(f"Comparison between groups {name1} and {name2}: p-value = {p_value:.4f}")

Comparison between groups ('OJ', 1) and ('OJ', 2): p-value = nan
Comparison between groups ('OJ', 1) and ('OJ', 3): p-value = nan
Comparison between groups ('OJ', 1) and ('VC', 1): p-value = nan
Comparison between groups ('OJ', 1) and ('VC', 2): p-value = nan
Comparison between groups ('OJ', 1) and ('VC', 3): p-value = nan
Comparison between groups ('OJ', 2) and ('OJ', 3): p-value = nan
Comparison between groups ('OJ', 2) and ('VC', 1): p-value = 0.0101
Comparison between groups ('OJ', 2) and ('VC', 2): p-value = nan
Comparison between groups ('OJ', 2) and ('VC', 3): p-value = nan
Comparison between groups ('OJ', 3) and ('VC', 1): p-value = nan
Comparison between groups ('OJ', 3) and ('VC', 2): p-value = nan
Comparison between groups ('OJ', 3) and ('VC', 3): p-value = nan
Comparison between groups ('VC', 1) and ('VC', 2): p-value = nan
Comparison between groups ('VC', 1) and ('VC', 3): p-value = nan
Comparison between groups ('VC', 2) and ('VC', 3): p-value = nan
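
The `nan` p-values above arise whenever a group holds a single observation, because the sample variance is then undefined. The sketch below shows the same pairwise loop on a DataFrame with ten observations in each group, so every test is defined. The data are synthetic stand-ins generated for illustration (`toy_df` and its distribution parameters are assumptions, not the ToothGrowth values); with the lab data, the loaded `df` would take the place of `toy_df`.

```python
import itertools
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic data with the same structure as the lab data: 2 supplements x 3 doses,
# 10 observations per group (assumed values for illustration only)
rng = np.random.default_rng(0)
rows = []
for supp in ['OJ', 'VC']:
    for dose in [0.5, 1.0, 2.0]:
        for value in rng.normal(loc=10 + 8 * dose, scale=3, size=10):
            rows.append({'supp': supp, 'dose': dose, 'len': value})
toy_df = pd.DataFrame(rows)

# Map each (supp, dose) group name to its tooth-length values
groups = {name: values for name, values in toy_df.groupby(['supp', 'dose'])['len']}

# Two-sided t-test for every pair of groups
pairwise_results = []
for name1, name2 in itertools.combinations(groups, 2):
    t_stat, p_value = ttest_ind(groups[name1], groups[name2])
    pairwise_results.append((name1, name2, p_value))
    print(f"{name1} vs {name2}: p-value = {p_value:.4f}")

# Count how many comparisons fall below the uncorrected 0.05 threshold
n_significant = sum(p < 0.05 for _, _, p in pairwise_results)
print(f"{n_significant} of {len(pairwise_results)} comparisons have p < 0.05")
```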



## Summary

In this lab, you generated and interpreted an ANOVA table, compared the ANOVA F-test to a t-test, and saw how ANOVA generalizes hypothesis testing to multiple groups and factors.