<div style="text-align: center;"> <h3>Statistical Theory</h3>
<h5>Formative Assessment 8</h5>
<h5><u>By Romand Lansangan</u></h5>
    </div>
    
---

## Introduction
The PlantGrowth dataset is a classic dataset in R to practice the applications of different statistical technique, in our case ANOVA. 

The PlantGrowth dataset is a simulated data of weight of plants (dependent variable) subjected to different treatments/groups(independent variable), namely the controlled (ctrl), first treatment (trt1), second treatment (trt2). Trt1 and trt2 could represent different intervention such as fertilizers, growth enhancers, location, etc. The point of conducting an ANOVA is to determine whether these intervention poses different result (which means they have an effect). 

## Methodology
**Null Hypothesis ($H_0$)**: There is no significant difference in the average plant weights accross all treatment groups ($\mu_1 = \mu_2 = \mu_3$). Meaning, the interventions induce no significant effect on the plant weights.  

**Alternative Hypothesis ($H_1$)**: At least one treatment group has a significantly different average plant weight from the others. Meaning, at least one intervention induce an significant effect on the plant weights.

We ought to test the null hypothesis at a 0.05 significance level. In other words, we ought to reject the null hypothesis if and only if p-value < 0.05. But it is also worth noting the choosing a 0.05 level of significance poses a risk of commiting a type I error (false positive; rejecting null hypothesis when it should be accepted) 5% of the time.

---

In [45]:
import pandas as pd
from scipy.stats import shapiro
from scipy.stats import levene
from scipy.stats import f_oneway
from scipy.stats import ttest_ind

In [18]:
df = pd.read_csv('PlantGrowth.csv')
df.head()

Unnamed: 0,weight,group
0,4.17,ctrl
1,5.58,ctrl
2,5.18,ctrl
3,6.11,ctrl
4,4.5,ctrl


In [19]:
df_ctrl = df[df['group'] == 'ctrl']['weight']
df_trt1 = df[df['group'] == 'trt1']['weight']
df_trt2 = df[df['group'] == 'trt2']['weight']

## Checking for assumptions


### Assumption 1: You have one dependent variable that is measured at the continuous level.
As was stated in introduction, the plant weight is our dependent variable which is measured at a continuous level.

### Assumption 2: You have one independent variable that consists of three or more categorical, independent groups. 
The indendent variable is the group/treatment to plants and it has three categorical and independent groups (because no plant could be subjected to two treatments), namely trt1, trt2, and ctrl.

### Assumption 3: You should have independence of observations, which means that there is no relationship between the observations in each group of the independent variable or among the groups themselves. 
There's no clear documentation to the research design of the PlantGrowth dataset. But we could still assume that the dataset meets this criteria for no plants are involved in more than one treatment group and a plants measurements were only taken once (or each plant has only one corresponding weight). 

### Assumption 4: There should be no significant outliers in the three or more groups of your independent variable in terms of the dependent variable.
For this particular assumption, let us use the IQR method to detect outliers.

In [20]:
df_ctrl_quant = df_ctrl.quantile([0.25, 0.75])
df_trt1_quant = df_trt1.quantile([0.25, 0.75])
df_trt2_quant = df_trt2.quantile([0.25, 0.75])

In [21]:
label_group = ['ctrl', 'trt1', 'trt2']
lower_quart = [df_ctrl_quant[0.25], df_trt1_quant[0.25], df_trt2_quant[0.25]] 
upper_quart = [df_ctrl_quant[0.75], df_trt1_quant[0.75], df_trt2_quant[0.75]] 

df_data = pd.DataFrame({'group': label_group, 'lower quartile': lower_quart, 'upper quartile': upper_quart})
df_data['iqr'] = df_data['upper quartile'] - df_data['lower quartile']
df_data['lower_limit'] = df_data['lower quartile'] - (1.5 * df_data['iqr'])
df_data['upper_limit'] = df_data['upper quartile'] + (1.5 * df_data['iqr'])
df_data

Unnamed: 0,group,lower quartile,upper quartile,iqr,lower_limit,upper_limit
0,ctrl,4.55,5.2925,0.7425,3.43625,6.40625
1,trt1,4.2075,4.87,0.6625,3.21375,5.86375
2,trt2,5.2675,5.735,0.4675,4.56625,6.43625


In [22]:
filtered_df_ctrl = df_ctrl[
    (df_ctrl < df_data.loc[df_data['group'] == 'ctrl', 'lower_limit'].values[0]) |
    (df_ctrl > df_data.loc[df_data['group'] == 'ctrl', 'upper_limit'].values[0])
]
filtered_df_ctrl

Series([], Name: weight, dtype: float64)

No outlier detected on controlled group.

In [23]:
filtered_df_trt1 = df_trt1[
    (df_trt1 < df_data.loc[df_data['group'] == 'trt1', 'lower_limit'].values[0]) |
    (df_trt1 > df_data.loc[df_data['group'] == 'trt1', 'upper_limit'].values[0])
]
filtered_df_trt1

14    5.87
16    6.03
Name: weight, dtype: float64

Two outliers are detected in trt1 group. But it is not clear whether they are significant outlier or not. Let us just retain the original data frame and also create a new dataframe for trt1 group without outliers. Then we shall see if the outliers are significant or not.

In [34]:
filtered_df_trt1_without_outliers = df_trt1[
    (df_trt1 > df_data.loc[df_data['group'] == 'trt1', 'lower_limit'].values[0]) &
    (df_trt1 < df_data.loc[df_data['group'] == 'trt1', 'upper_limit'].values[0])
]
print(f'Length of original: {len(df_trt1)}')
print(f'Length of without outliers: {len(filtered_df_trt1_without_outliers)}')

Length of original: 10
Length of without outliers: 8


In [25]:
filtered_df_trtb2 = df_trt2[
    (df_trt2 < df_data.loc[df_data['group'] == 'trt2', 'lower_limit'].values[0]) |
    (df_trt2 > df_data.loc[df_data['group'] == 'trt2', 'upper_limit'].values[0])
]
filtered_df_trt2

Series([], Name: weight, dtype: float64)

No outliers in trt2

### Assumption 5: Your dependent variable should be approximately normally distributed for each group of the independent variable.
Let us use Shapiro-Wilk test of normality for checking. 

In [26]:
stats_crtl, p_val_crtl = shapiro(df_ctrl)
stats_trt1, p_val_trt1 = shapiro(df_ctrl)
stats_trt2, p_val_trt2 = shapiro(df_trt2)
stats_trt1_wo, p_val_trt1_wo = shapiro(filtered_df_trt1_without_outliers)

trt1_wo_data = {'group': 'trt1(without outlier)', 'shapiro_stat': stats_trt1_wo, 'shapiro_p_value': p_val_trt1_wo}

stats_list = [stats_crtl, stats_trt1, stats_trt2]
p_val_list = [p_val_crtl, p_val_trt1, p_val_trt2]
df_data['shapiro_stat'] = stats_list
df_data['shapiro_p_value'] = p_val_list
df_data = pd.concat([df_data, pd.DataFrame([trt1_wo_data])], ignore_index=True)
df_data[['group', 'shapiro_stat', 'shapiro_p_value']]

Unnamed: 0,group,shapiro_stat,shapiro_p_value
0,ctrl,0.956681,0.747473
1,trt1,0.930411,0.451944
2,trt2,0.941005,0.564252
3,trt1(without outlier),0.946654,0.677479


All are normally distributed.

### Assumption 5: You have homogeneity of variances.
Let us use Levenes test for equality of variances. 

In [27]:
levenes_test = levene(df_ctrl, df_trt1, df_trt2)
levenes_test

LeveneResult(statistic=1.1191856948703909, pvalue=0.3412266241254737)

Passed the test with raw trt1

In [28]:
levenes_test = levene(df_ctrl, filtered_df_trt1_without_outliers, df_trt2)
levenes_test

LeveneResult(statistic=0.3181677818182353, pvalue=0.7303831803655645)

Passed the test with the variation of trt1 that has no outliers

## One-Way Anova

In [29]:
one_way_anova = f_oneway(df_ctrl, df_trt1, df_trt2)
one_way_anova

F_onewayResult(statistic=4.846087862380136, pvalue=0.0159099583256229)

In [30]:
one_way_anova = f_oneway(df_ctrl, filtered_df_trt1_without_outliers, df_trt2)
one_way_anova

F_onewayResult(statistic=12.394319269011733, pvalue=0.0001820212004664887)

The result from both the levene's test of homogeneity and one-way anova differs significantly between the dataset that has outlier and that dataset that doesn't have any outlier. Although that calculation results differs significantly from two dataset, both one-way anova results poses a significant statistical evidence to reject the null hypothesis. Based on the consistency of signficiant difference, we ough to use the dataset that has no outlier for the post-hoc analysis, despite the two posing simillar conclusion.

In [47]:
df1 = pd.DataFrame({'value': df_ctrl, 'group': 'ctrl'})
df2 = pd.DataFrame({'value': df_trt1, 'group': 'trt1'})
df3 = pd.DataFrame({'value': filtered_df_trt1_without_outliers, 'group': 'trt2'})

combined_df = pd.concat([df1, df2, df3], ignore_index=True)
combined_df.head()

Unnamed: 0,value,group
0,4.17,ctrl
1,5.58,ctrl
2,5.18,ctrl
3,6.11,ctrl
4,4.5,ctrl


In [50]:
grouped = combined_df.groupby('group')['value']

group_names = grouped.groups.keys()
results = []
for g1 in group_names:
    for g2 in group_names:
        groups_comb = [g1, g2]
        if groups_comb in [['ctrl', 'trt1'], ['ctrl', 'trt2'], ['trt1', 'trt2']]:
            if g1 != g2 and (g2, g1) not in [(r['Group 2'], r['Group 1']) for r in results]:
                stat, pval = ttest_ind(grouped.get_group(g1), grouped.get_group(g2), equal_var=True)  # Equal variance assumed
                results.append({'Group 1': g1, 'Group 2': g2, 't-statistic': stat, 'p-value': pval})

pairwise_results = pd.DataFrame(results)

pairwise_results

Unnamed: 0,Group 1,Group 2,t-statistic,p-value
0,ctrl,trt1,1.19126,0.249023
1,ctrl,trt2,2.736845,0.014624
2,trt1,trt2,1.014711,0.325343
