## The Plant Growth dataset contains three columns:

rownames: The row identifiers.

weight: The weight of the plants.

group: The treatment group, which includes three levels: ctrl, trt1, and trt2.

In [14]:
import pandas as pd

# Load the dataset to inspect its contents
file_path = 'plantgrowth.csv'
plant_growth_data = pd.read_csv(file_path)

# Displaying the first few rows of the dataset
plant_growth_data.head()


| | rownames | weight | group |
|---|---|---|---|
| 0 | 1 | 4.17 | ctrl |
| 1 | 2 | 5.58 | ctrl |
| 2 | 3 | 5.18 | ctrl |
| 3 | 4 | 6.11 | ctrl |
| 4 | 5 | 4.50 | ctrl |


The “PlantGrowth” dataset consists of measurements of plant weights (in grams), each assigned to one of three treatment groups: ‘ctrl’ (control), ‘trt1’ (treatment 1), and ‘trt2’ (treatment 2). It can be used to analyse the effects of different treatments on plant growth. It contains two primary variables:

Weight: A numeric variable representing the weight of each plant.

Group: A categorical variable representing the treatment group.

The t-test is a statistical test that compares the means of two groups to determine whether they differ significantly. It calculates the t-statistic, which expresses the difference between the two group means relative to the variability in the sample data (the standard error of that difference).

## Key Assumptions of the t-test:

Independence: The observations in each group are independent of each other.

Normality: The data in each group is approximately normally distributed.

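Homogeneity of Variances: The variances of the two groups are equal. This applies to Student's two-sample t-test, which is what ttest_ind runs by default; Welch's variant (equal_var=False) relaxes it.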
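The normality and equal-variance assumptions can be checked informally before running the test. The sketch below uses the Shapiro-Wilk and Levene tests from SciPy. It is an illustration rather than part of the original analysis, and the weights are typed in from R's built-in PlantGrowth data on the assumption that plantgrowth.csv holds the same values. Independence cannot be tested this way; it depends on how the experiment was designed.

```python
from scipy.stats import levene, shapiro

# Assumption: values copied from R's built-in PlantGrowth data; they are
# expected to match the 'trt1' and 'trt2' rows of plantgrowth.csv.
trt1 = [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69]
trt2 = [6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26]

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed
normality_p = {}
for name, sample in (("trt1", trt1), ("trt2", trt2)):
    w_stat, p_value = shapiro(sample)
    normality_p[name] = p_value
    print(f"Shapiro-Wilk {name}: W = {w_stat:.3f}, p = {p_value:.3f}")

# Levene: the null hypothesis is that the groups have equal variances
levene_stat, levene_p = levene(trt1, trt2)
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A small p-value is evidence against the corresponding assumption. With only ten observations per group these checks have limited power, so they are best read alongside a plot of the data.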

In [15]:
from scipy.stats import ttest_ind
import statsmodels.api as sm            # used by the ANOVA cell below
from statsmodels.formula.api import ols  # used by the ANOVA cell below

# Extract the weights for the two groups being compared (trt1 vs trt2)
trt1_weights = plant_growth_data.loc[plant_growth_data['group'] == 'trt1', 'weight']
trt2_weights = plant_growth_data.loc[plant_growth_data['group'] == 'trt2', 'weight']

# Perform an independent two-sample t-test
# (Student's version: equal_var=True is the default)
t_test_result = ttest_ind(trt1_weights, trt2_weights)

# Show the result
t_test_result



TtestResult(statistic=np.float64(-3.0100985421243616), pvalue=np.float64(0.0075184261182198574), df=np.float64(18.0))

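Since the p-value (about 0.0075) is less than 0.05, we reject the null hypothesis that the two group means are equal. There is a significant difference in mean plant weight between treatment groups trt1 and trt2; the negative t-statistic indicates that the trt1 mean is lower than the trt2 mean.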
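To see where the statistic comes from, the pooled-variance calculation can be written out with the standard library alone. This is a minimal sketch: the helper name pooled_t_statistic is made up for illustration, and the weights are typed in from R's built-in PlantGrowth data on the assumption that plantgrowth.csv holds the same values.

```python
import math
from statistics import mean, variance

# Assumption: values copied from R's built-in PlantGrowth data; they are
# expected to match the 'trt1' and 'trt2' rows of plantgrowth.csv.
trt1 = [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69]
trt2 = [6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26]


def pooled_t_statistic(a, b):
    """Return (t, df) for Student's two-sample t-test with pooled variance."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    # Standard error of the difference between the two sample means
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err, df


t_stat, df = pooled_t_statistic(trt1, trt2)
print(f"t = {t_stat:.4f}, df = {df}")
```

If the typed-in values match the CSV, the printed statistic and degrees of freedom should agree with the TtestResult shown above.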

ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups. It tests the null hypothesis that all group means are equal against the alternative that at least one group mean differs.

## Key Assumptions of ANOVA:

Independence: Observations are independent.

Normality: Residuals (errors) are approximately normally distributed.

Homogeneity of Variances: Variance among the groups is approximately equal.

In [16]:
# Prepare data for ANOVA
anova_model = ols('weight ~ C(group)', data=plant_growth_data).fit()
anova_result = sm.stats.anova_lm(anova_model, typ=2)

# Display results
anova_result

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(group) | 3.76634 | 2.0 | 4.846088 | 0.01591 |
| Residual | 10.49209 | 27.0 | NaN | NaN |


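The p-value (about 0.016) is less than 0.05, so we reject the null hypothesis that all three group means are equal (F = 4.85 on 2 and 27 degrees of freedom). There is a significant difference in mean plant weight among the three groups (ctrl, trt1, and trt2), although the ANOVA alone does not indicate which groups differ.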
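The sums of squares and the F statistic in the table can be reproduced by hand, which makes it clearer what the ANOVA compares: variation between the group means against variation within the groups. The sketch below uses only the standard library. The function name one_way_anova is hypothetical, and the weights are typed in from R's built-in PlantGrowth data on the assumption that plantgrowth.csv holds the same values.

```python
from statistics import mean

# Assumption: values copied from R's built-in PlantGrowth data; they are
# expected to match the rows of plantgrowth.csv.
groups = {
    "ctrl": [4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14],
    "trt1": [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69],
    "trt2": [6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26],
}


def one_way_anova(samples):
    """Return (ss_between, ss_within, df_between, df_within, f_stat)."""
    all_values = [x for sample in samples for x in sample]
    grand_mean = mean(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for sample in samples:
        group_mean = mean(sample)
        # Between-group part: how far each group mean sits from the grand mean
        ss_between += len(sample) * (group_mean - grand_mean) ** 2
        # Within-group part: spread of observations around their own group mean
        ss_within += sum((x - group_mean) ** 2 for x in sample)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


ss_between, ss_within, df_between, df_within, f_stat = one_way_anova(list(groups.values()))
print(f"C(group): sum_sq = {ss_between:.5f}, df = {df_between}, F = {f_stat:.6f}")
print(f"Residual: sum_sq = {ss_within:.5f}, df = {df_within}")
```

If the typed-in values match the CSV, these figures should line up with the sum_sq, df, and F columns reported by anova_lm.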

## Why ANOVA Instead of Multiple t-tests?
Using multiple t-tests to compare more than two groups increases the risk of Type I error (false positives). Each t-test run at a 5% significance level has a 5% chance of incorrectly rejecting the null hypothesis when that hypothesis is true. When several t-tests are performed, these individual error rates compound: comparing three groups takes three pairwise tests, and if the tests were independent the chance of at least one false positive would be 1 - 0.95^3, or about 14%.
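The compounding can be worked out directly. The sketch below computes the chance of at least one false positive when every pair of groups is compared at the 5% level. It assumes the tests are independent, which is only an approximation because pairwise comparisons share groups, and the function name familywise_error_rate is made up for illustration.

```python
from math import comb


def familywise_error_rate(n_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes independent tests, each run at significance level alpha.
    """
    n_tests = comb(n_groups, 2)  # one t-test per pair of groups
    return n_tests, 1 - (1 - alpha) ** n_tests


for n_groups in (3, 4, 5):
    n_tests, fwer = familywise_error_rate(n_groups)
    print(f"{n_groups} groups -> {n_tests} tests, family-wise error rate = {fwer:.3f}")
```

Under this assumption, three groups give a rate of roughly 14%, and the rate climbs quickly as more groups are added.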

ANOVA addresses this issue by comparing all group means simultaneously under a single null hypothesis, which keeps the overall Type I error rate at the chosen significance level. If ANOVA detects significant differences, post-hoc tests (e.g., Tukey’s HSD) can be applied to identify which specific groups differ while still controlling the family-wise error rate.
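As an illustration of such a post-hoc step, the sketch below runs Tukey's HSD on the three groups with scipy.stats.tukey_hsd. It is not part of the original analysis. It assumes SciPy 1.8 or newer, where this function is available, and the weights are typed in from R's built-in PlantGrowth data on the assumption that plantgrowth.csv holds the same values.

```python
from scipy.stats import tukey_hsd

# Assumption: values copied from R's built-in PlantGrowth data; they are
# expected to match the rows of plantgrowth.csv.
ctrl = [4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14]
trt1 = [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69]
trt2 = [6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26]

labels = ["ctrl", "trt1", "trt2"]
result = tukey_hsd(ctrl, trt1, trt2)

# result.statistic[i, j] is mean(group i) - mean(group j);
# result.pvalue[i, j] is the p-value adjusted for all pairwise comparisons.
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        diff = result.statistic[i, j]
        p_adj = result.pvalue[i, j]
        print(f"{labels[i]} vs {labels[j]}: mean difference = {diff:.3f}, adjusted p = {p_adj:.4f}")
```

Pairs whose adjusted p-value falls below 0.05 are the ones that differ significantly once all three comparisons are taken into account.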