In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind, shapiro, levene, f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [7]:
#Load the data
df = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/PlantGrowth.csv")

df.to_csv("PlantGrowth.csv", index=False)

print(df.head())

print(df.describe())

print(df["group"].value_counts())

   rownames  weight group
0         1    4.17  ctrl
1         2    5.58  ctrl
2         3    5.18  ctrl
3         4    6.11  ctrl
4         5    4.50  ctrl
        rownames     weight
count  30.000000  30.000000
mean   15.500000   5.073000
std     8.803408   0.701192
min     1.000000   3.590000
25%     8.250000   4.550000
50%    15.500000   5.155000
75%    22.750000   5.530000
max    30.000000   6.310000
group
ctrl    10
trt1    10
trt2    10
Name: count, dtype: int64


This dataset has 30 observations, 10 in each group (ctrl, trt1, trt2).
The mean plant weight is 5.07, with a standard deviation of 0.70.

A t-test is a statistical test used to determine whether there is a significant difference between the means of two groups.
It is well suited to comparing the means of two small samples.

The null hypothesis: there is no difference between the means of the two groups.
The alternative hypothesis: there is a difference between the means of the two groups.
The test statistic is the difference between the two sample means divided by the standard error of that difference.
The result is compared against the critical value determined by the degrees of freedom and the significance level.
The p-value is then used to interpret the result. A value less than the chosen cut-off (e.g. 0.05) indicates statistical significance.
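
The calculation described above can be sketched by hand for the equal-variance (pooled) case. This is a minimal illustration using only the standard library; the function `pooled_t` and the two sample lists are hypothetical and are not part of the PlantGrowth data.

```python
import math
import statistics


def pooled_t(a, b):
    """Return (t, df) for an independent two-sample t-test with pooled variance."""
    na, nb = len(a), len(b)
    # Sample variances (n - 1 in the denominator).
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    # Pooled variance: each variance weighted by its degrees of freedom.
    df = na + nb - 2
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / df
    # Standard error of the difference between the two means.
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return t, df


# Hypothetical samples, for illustration only.
sample_a = [1, 2, 3, 4, 5]
sample_b = [2, 4, 6, 8, 10]
t_stat, dof = pooled_t(sample_a, sample_b)
print(f"t = {t_stat:.4f}, df = {dof}")  # t = -1.8974, df = 8
```

For the same inputs, `scipy.stats.ttest_ind` with `equal_var=True` should report the same statistic, along with the p-value that this sketch does not compute.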


The type of t-test we will carry out is an independent-samples t-test. This is because the trt1 and trt2 groups consist of separate sets of plants that are not related to each other, so there are no paired observations.

A t-test works with the following assumptions:
Normality:
The data in each group are approximately normally distributed.

Independent Samples:
Observations in one group are not related to those in the other.

Homogeneity of Variances:
The variances of the two groups are similar.

Scale of Measurement:
The dependent variable is measured on a continuous scale.
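
If the equal-variance assumption looked doubtful, one common alternative is Welch's t-test, which does not pool the variances. The sketch below is an illustration only: `welch_t` and the sample lists are hypothetical, and Welch's test is a possible fallback rather than the method used in this notebook.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean.
    se2_a = statistics.variance(a) / na
    se2_b = statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Hypothetical samples with visibly different spreads.
narrow = [1, 2, 3, 4, 5]
wide = [2, 4, 6, 8, 10]
t_welch, df_welch = welch_t(narrow, wide)
print(f"t = {t_welch:.4f}, df = {df_welch:.2f}")
```

In SciPy, the corresponding call is `ttest_ind(a, b, equal_var=False)`.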

In [16]:
trt1 = df[df['group'] == 'trt1']['weight']
trt2 = df[df['group'] == 'trt2']['weight']

#Shapiro-Wilk test
shapiro_trt1 = shapiro(trt1)
shapiro_trt2 = shapiro(trt2)

#Levene's test
levene_result = levene(trt1, trt2)

#Perform the independent samples t-test
t_test_result = ttest_ind(trt1, trt2, equal_var=True)

#Print results.
print("Shapiro-Wilk Test")
print(f"trt1: Statistic = {shapiro_trt1.statistic:.4f}, p-value = {shapiro_trt1.pvalue:.4f}")
print(f"trt2: Statistic = {shapiro_trt2.statistic:.4f}, p-value = {shapiro_trt2.pvalue:.4f}")

print("Levene's Test")
print(f"Statistic = {levene_result.statistic:.4f}, p-value = {levene_result.pvalue:.4f}")

print("Independent T-Test")
print(f"Statistic = {t_test_result.statistic:.4f}, p-value = {t_test_result.pvalue:.4f}, df = {t_test_result.df:.0f}")



Shapiro-Wilk Test
trt1: Statistic = 0.9304, p-value = 0.4519
trt2: Statistic = 0.9410, p-value = 0.5643
Levene's Test
Statistic = 2.1042, p-value = 0.1641
Independent T-Test
Statistic = -3.0101, p-value = 0.0075, df = 18


The Shapiro-Wilk test was used to check whether the data in each group are normally distributed, which is one of the assumptions of the t-test.
Both groups have a p-value above 0.05, so there is no evidence against normality and the data are suitable.

Levene's test was used to check whether the variances of the two groups are equal.
Again, the p-value was above 0.05, so there is no evidence that the variances differ and the assumption is treated as met.

The independent t-test was used to determine whether there is a significant difference between the mean plant weights of the two groups.
The p-value (0.0075) is less than 0.05, suggesting the two treatments had different effects on plant weight.

In [18]:
ctrl = df[df['group'] == 'ctrl']['weight']

# Perform one-way ANOVA
anova_result = f_oneway(ctrl, trt1, trt2)

# Display the results
print("ANOVA Result:")
print(f"F-statistic = {anova_result.statistic:.4f}, p-value = {anova_result.pvalue:.4f}")



ANOVA Result:
F-statistic = 4.8461, p-value = 0.0159


The one-way ANOVA yielded an F-statistic of 4.8461 with a p-value of 0.0159. This indicates a statistically significant difference in mean plant weight among the three groups (ctrl, trt1, and trt2) at the 5% significance level. ANOVA shows only that at least one group mean differs from the others; it does not identify which groups differ.
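
To make the F-statistic less of a black box, here is a minimal sketch of how it is built from between-group and within-group variation. The function `one_way_f` and the three small groups are hypothetical and are not the PlantGrowth values.

```python
import statistics


def one_way_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    # F is the ratio of the two mean squares.
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical groups, for illustration only.
f_value, df_b, df_w = one_way_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(f"F = {f_value:.4f}, df = ({df_b}, {df_w})")
```

A large F means the group means are spread out relative to the scatter inside each group, which is what `f_oneway` summarises with its p-value.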

The ANOVA test is more appropriate than multiple t-tests because when analyzing more than two groups, multiple t-tests increase the risk of Type I error (false positives) due to repeated testing. ANOVA addresses this by evaluating all group means simultaneously, maintaining the overall significance level and providing a single statistical test for differences across all groups.
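
The inflation of the Type I error rate can be illustrated with a quick calculation. This sketch assumes the pairwise tests are independent, which is only an approximation when the comparisons share groups; `familywise_error_rate` is a hypothetical helper name.

```python
from math import comb


def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


n_groups = 3
# Number of pairwise comparisons among the groups: 3 choose 2 = 3.
n_pairs = comb(n_groups, 2)
fwer = familywise_error_rate(0.05, n_pairs)
print(f"{n_pairs} pairwise tests at alpha = 0.05 -> FWER = {fwer:.4f}")
# 3 pairwise tests at alpha = 0.05 -> FWER = 0.1426
```

Under that assumption, three separate t-tests at the 5% level carry roughly a 14% chance of at least one false positive, whereas a single ANOVA keeps the overall rate at 5%.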