### Project: Analysis of the PlantGrowth R dataset 

In [None]:
# Import necessary libraries

# Data analysis and manipulation
# https://pandas.pydata.org/docs/user_guide/index.html
import pandas as pd

# Mathematical functions from the standard library.
# https://docs.python.org/3/library/math.html
import math

# Permutations and combinations.
# https://docs.python.org/3/library/itertools.html
import itertools

# Numerical structures and operations.
# https://numpy.org/doc/stable/reference/index.html#reference
import numpy as np

# Plotting.
# https://matplotlib.org/stable/contents.html
import matplotlib.pyplot as plt

# Statistical tests
# https://docs.scipy.org/doc/scipy/reference/stats.html
import scipy.stats as stats

# Statistical data visualization.
# https://seaborn.pydata.org/#seaborn-statistical-data-visualization
import seaborn as sns

#### 1. Load and read the dataset.

In [None]:
# Load the dataset from a local CSV file (pd.read_csv also accepts a URL)
path = "PlantGrowth.csv"
data = pd.read_csv(path)

# Display the first few rows of the dataset
data.head()

In [None]:
# Get basic information about the dataset
data.info()

#### 2. Describe the data set.

This dataset consists of 30 observations with three columns.

**Description of columns**
- rownames: Identifier for each row, numbered sequentially from 1 to 30. It serves as a unique index for each observation.
- weight: A continuous numerical variable that represents the measured weights. Values range from 3.59 to 6.31.
- group: A categorical variable representing the group membership of each observation. There are three groups, each with 10 observations:
  - ctrl: Control group
  - trt1: Treatment group 1
  - trt2: Treatment group 2

**Dataset breakdown**
- Group "ctrl" (Control): Rows 1–10, with weights ranging from 4.17 to 6.11.
- Group "trt1" (Treatment 1): Rows 11–20, with weights ranging from 3.59 to 6.03.
- Group "trt2" (Treatment 2): Rows 21–30, with weights ranging from 4.92 to 6.31.

**Summary**  
This dataset is suitable for comparing weights across the three groups (control and two treatments). It can be analyzed using statistical methods such as:
- Descriptive statistics (mean, standard deviation) for each group.
- Visualization: boxplots or histograms to compare weight distributions.
- Inferential statistics:
  - ANOVA (to test for overall differences among groups).
  - Post-hoc tests (if ANOVA shows significant differences).
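
The per-group descriptive statistics listed above could be computed with a pandas `groupby`. The sketch below uses a small hypothetical frame (`sample`, with made-up weights) that only mirrors the `group` and `weight` column names, so the numbers are illustrative and are not the PlantGrowth values.

```python
import pandas as pd

# Hypothetical miniature of the dataset: same column names, made-up weights.
sample = pd.DataFrame({
    "group": ["ctrl", "ctrl", "ctrl", "trt1", "trt1", "trt1", "trt2", "trt2", "trt2"],
    "weight": [4.0, 5.0, 6.0, 3.5, 4.5, 5.5, 5.0, 5.5, 6.0],
})

# Count, mean, sample standard deviation (ddof=1), minimum and maximum per group.
summary = sample.groupby("group")["weight"].agg(["count", "mean", "std", "min", "max"])
print(summary)
```

With the notebook's own frame, the same call would be applied to `data` instead of `sample`.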

In [None]:


# Creating a figure with multiple plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Boxplot
sns.boxplot(x="group", y="weight", data=data, ax=axes[0])
axes[0].set_title("Boxplot of Weight by Group")
axes[0].set_xlabel("Group")
axes[0].set_ylabel("Weight")

# Histogram
sns.histplot(data, x="weight", hue="group", kde=True, element="step", ax=axes[1])
axes[1].set_title("Histogram of Weight")
axes[1].set_xlabel("Weight")
axes[1].set_ylabel("Count")

# Scatter plot
sns.scatterplot(x="rownames", y="weight", hue="group", style="group", data=data, ax=axes[2])
axes[2].set_title("Scatter Plot of Weight by Row Index")
axes[2].set_xlabel("Row Index")
axes[2].set_ylabel("Weight")

# Show the plots
plt.tight_layout()
plt.show()

#### 3. Describe what a t-test is, how it works, and what the assumptions are.

A t-test is a statistical method used to determine whether there is a significant difference between the means of two groups or between a sample mean and a known population mean. It is commonly used when working with small sample sizes and when the population standard deviation is unknown.  
**Types of t-tests**  
- One-sample t-test: Compares the mean of a single sample to a known or hypothesized population mean.  
- Independent two-sample t-test: Compares the means of two independent groups to see if they are significantly different.  
- Paired-sample t-test: Compares the means of two related groups (e.g., pre-test and post-test scores).  

**How it works**  
The t-test calculates a t-statistic, which reflects how far the observed sample mean (or difference between means) is from the null hypothesis in terms of standard error. It then compares this t-statistic to a critical value from the t-distribution.

**Steps of the t-test:**
1. State the null hypothesis  
$H_0$: There is no difference between the means.  
and the alternative hypothesis  
$H_a$: There is a difference between the means.

2. Calculate the test statistic:

- For an independent two-sample t-test, the formula is: $t = \frac{\bar{X}_1 - \bar{X}_2}{SE}$  
Where:  
$\bar{X}_1$ and $\bar{X}_2$ are the sample means,  
$SE$ is the standard error of the difference between means, which depends on the sample sizes and standard deviations. Assuming equal variances, $SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, where $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ is the pooled variance.

- For a one-sample t-test, the formula is: $t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$  
Where $\bar{X}$ is the sample mean,  
$\mu$ is the hypothesized population mean,  
$s$ is the sample standard deviation,  
$n$ is the sample size.  
3. Determine the degrees of freedom (df):  
- For a one-sample t-test: $df = n - 1$  
- For an independent two-sample t-test: $df = n_1 + n_2 - 2$ when equal variances are assumed; Welch's t-test instead uses an approximate $df$ that also depends on the sample variances.  
4. Compare the t-statistic to the critical t-value from the t-distribution based on the desired significance level ($\alpha$, typically 0.05) and the degrees of freedom.  
5. Decision:  
- If $|t| > t_{\text{critical}}$, reject the null hypothesis.  
- If $|t| \le t_{\text{critical}}$, fail to reject the null hypothesis.  
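
The five steps above can be traced by hand for the pooled (equal-variance) two-sample case. This is a minimal sketch: `sample_a` and `sample_b` are hypothetical values chosen only to illustrate the arithmetic, and `scipy.stats.t.ppf` is used just to look up the two-sided critical value.

```python
import math
import statistics

from scipy import stats

# Hypothetical samples (not taken from the PlantGrowth data).
sample_a = [4.8, 4.2, 4.4, 3.6, 5.9]
sample_b = [6.3, 5.1, 5.5, 5.5, 5.4]

n1, n2 = len(sample_a), len(sample_b)
mean1, mean2 = statistics.mean(sample_a), statistics.mean(sample_b)
var1, var2 = statistics.variance(sample_a), statistics.variance(sample_b)  # ddof = 1

# Step 2: pooled variance and standard error of the difference in means.
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean1 - mean2) / se

# Step 3: degrees of freedom for the pooled test.
df = n1 + n2 - 2

# Step 4: two-sided critical value at significance level alpha.
alpha = 0.05
t_critical = stats.t.ppf(1 - alpha / 2, df)

# Step 5: decision rule.
reject_null = abs(t_stat) > t_critical
print(f"t = {t_stat:.3f}, df = {df}, critical = {t_critical:.3f}, reject H0: {reject_null}")
```

The hand-computed `t_stat` should agree with `stats.ttest_ind(sample_a, sample_b).statistic`, which uses the same pooled formula by default.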
**Assumptions of the t-test**  
For valid results, the t-test relies on the following assumptions:  
1. Independence of observations: the data points within and between the groups should be independent of each other.  
2. Normality:  
- The data in each group should follow a roughly normal distribution.  
- For small sample sizes (𝑛 < 30), this assumption is crucial.  
- For larger sample sizes, the Central Limit Theorem allows the t-test to be robust to slight deviations from normality.  
3. Equal variances (homogeneity of variance):  
- For the independent two-sample t-test, the variances of the two groups should be approximately equal.  
- If this assumption is violated, a variation of the test, known as Welch’s t-test, can be used.
4. Scale of measurement: the dependent variable (outcome being measured) must be on an interval or ratio scale (quantitative data).  
**Summary**  
The t-test is a versatile statistical test used for comparing means. It works by calculating the t-statistic, which determines how extreme the observed difference is under the null hypothesis. To ensure valid conclusions, the assumptions of independence, normality, and equal variances (in some cases) must be met.
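
The normality and equal-variance assumptions listed above are often checked before running the test. One possible approach, sketched here as an assumption rather than a required procedure, is the Shapiro-Wilk test (`stats.shapiro`) for normality and Levene's test (`stats.levene`) for equal variances. The arrays `group_a` and `group_b` are simulated, hypothetical data.

```python
import numpy as np
from scipy import stats

# Simulated, hypothetical samples drawn from normal distributions.
rng = np.random.default_rng(0)
samples = {
    "group_a": rng.normal(loc=5.0, scale=0.6, size=10),
    "group_b": rng.normal(loc=5.5, scale=0.6, size=10),
}

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution.
normality_p = {}
for name, values in samples.items():
    w_stat, p_value = stats.shapiro(values)
    normality_p[name] = p_value
    print(f"{name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")

# Levene: the null hypothesis is that the groups have equal variances.
levene_stat, levene_p = stats.levene(samples["group_a"], samples["group_b"])
print(f"Levene statistic = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A small p-value in either test is evidence against the corresponding assumption; a large p-value does not prove the assumption holds, especially with only 10 observations per group.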

#### 4. Perform a t-test to determine whether there is a significant difference between the two treatment groups trt1 and trt2.

In [None]:
# Filter the data for the two treatment groups
trt1 = data[data['group'] == 'trt1']['weight']
trt2 = data[data['group'] == 'trt2']['weight']

# Perform an independent two-sample t-test: trt1 and trt2 contain
# different plants, so the observations are not paired
t_statistic_scipy, p_value_scipy = stats.ttest_ind(trt1, trt2)

# Output results
t_statistic_scipy, p_value_scipy

The p-value is less than 0.05, so we reject the null hypothesis of equal means: the mean weights of the two treatment groups trt1 and trt2 differ significantly at the 5% level.
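
Section 3 notes that Welch's t-test can be used when the equal-variance assumption is doubtful. In SciPy this corresponds to passing `equal_var=False` to `stats.ttest_ind`. The sketch below uses hypothetical samples `a` and `b`, not the notebook's data.

```python
from scipy import stats

# Hypothetical samples of equal size.
a = [4.8, 4.2, 4.4, 3.6, 5.9, 3.8]
b = [6.3, 5.1, 5.5, 5.5, 5.4, 5.3]

# Student's t-test: pools the two sample variances (equal_var=True is the default).
student = stats.ttest_ind(a, b)

# Welch's t-test: keeps the variances separate and adjusts the degrees of freedom.
welch = stats.ttest_ind(a, b, equal_var=False)

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```

With equal group sizes, as here, both variants give the same t-statistic; they differ only in the degrees of freedom, and therefore in the p-value.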

#### 5. Perform ANOVA to determine whether there is a significant difference between the three treatment groups ctrl, trt1, and trt2.

In [None]:
# Filter data for each group
ctrl = data[data['group'] == 'ctrl']['weight']
trt1 = data[data['group'] == 'trt1']['weight']
trt2 = data[data['group'] == 'trt2']['weight']

# Perform one-way ANOVA
f_stat, p_value = stats.f_oneway(ctrl, trt1, trt2)

# Print results
print("F-statistic:", f_stat)
print("p-value:", p_value)

The F-statistic measures the ratio of variance between the groups to the variance within the groups. In this case, an F-statistic of 4.85 suggests that the variability between the groups is relatively large compared to the variability within the groups.  
The p-value of 0.0159 is less than the typical significance level ($\alpha = 0.05$), so we reject the null hypothesis that all three group means are equal.  

**Conclusion**  
There is sufficient evidence to conclude that at least one of the three groups (ctrl, trt1, or trt2) has a mean weight that differs from the others. ANOVA alone does not identify which groups differ.
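
To make the "between-group versus within-group variance" ratio concrete, the F-statistic can be computed by hand and compared with `stats.f_oneway`. This is a sketch on hypothetical groups `a`, `b`, and `c` (made-up values), so the resulting numbers are unrelated to the 4.85 reported above.

```python
import statistics

from scipy import stats

# Hypothetical groups with made-up values.
groups = {
    "a": [4.0, 5.0, 6.0, 5.0],
    "b": [3.5, 4.5, 5.5, 4.5],
    "c": [5.5, 6.0, 6.5, 6.0],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = statistics.mean(all_values)
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group sum of squares: spread of the group means around the grand mean.
ss_between = sum(
    len(values) * (statistics.mean(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(
    sum((x - statistics.mean(values)) ** 2 for x in values)
    for values in groups.values()
)

# Mean squares and their ratio, the F-statistic.
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups.values())
print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```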

#### 6. Explain why it is more appropriate to apply ANOVA rather than several t-tests when analyzing more than two groups.

When analyzing more than two groups, it is generally more appropriate to use ANOVA (Analysis of Variance) instead of performing multiple t-tests for the following reasons:  
1. Control of Type I Error Rate  
- Each time a t-test is performed, there is a chance of making a Type I error (rejecting the null hypothesis when it is actually true). The probability of at least one Type I error increases as you perform more tests.  
- For example, with a significance level ($\alpha$) of 0.05, the chance of making at least one Type I error when comparing multiple pairs of groups grows quickly:  
Overall Type I error = $1 - (1 - \alpha)^k$  
where $k$ is the number of tests, assumed independent. For 3 groups (comparing all pairs), you would need 3 t-tests, giving an overall error rate of about $1 - 0.95^3 \approx 0.143$, almost three times the nominal 0.05.  
- ANOVA avoids this issue by testing all group means simultaneously under one overall null hypothesis:  
$H_0$: All group means are equal.  
and the alternative hypothesis  
$H_a$: At least one group mean is different.  
By using a single test, ANOVA controls the Type I error rate at the desired significance level.  
2. Efficiency  
- Performing multiple t-tests involves redundant comparisons and more calculations. ANOVA is a more efficient, streamlined way to analyze multiple groups in a single test.  
- ANOVA provides a single overall F-statistic that determines whether group means are significantly different without requiring pairwise testing.  
3. Interpretation of Results  
- ANOVA provides a clear framework to test for overall differences among groups, rather than focusing on individual pairwise comparisons.  
- If the ANOVA result is significant (rejecting the null hypothesis), post-hoc tests (e.g., Tukey's HSD or Bonferroni correction) can be used to identify which specific group means are different while still controlling for Type I error.
4. Assumptions Are the Same  
ANOVA and t-tests share similar assumptions:  
- Independence of observations  
- Normally distributed data within groups  
- Equal variances among groups (homogeneity of variance)  
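
The formula in point 1 is simple to evaluate for different numbers of groups. The sketch assumes the pairwise tests are independent, which is only an approximation when the same groups are reused across comparisons; `familywise_error` is a helper name introduced here for illustration.

```python
import math


def familywise_error(alpha: float, k: int) -> float:
    """Probability of at least one Type I error across k independent tests."""
    return 1 - (1 - alpha) ** k


alpha = 0.05
for n_groups in (3, 4, 5):
    n_tests = math.comb(n_groups, 2)  # number of distinct pairs of groups
    rate = familywise_error(alpha, n_tests)
    print(f"{n_groups} groups -> {n_tests} pairwise tests -> overall error ~ {rate:.3f}")
```
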
**Conclusion**  
Using ANOVA instead of multiple t-tests prevents the inflation of Type I error, improves computational efficiency, and simplifies interpretation. It tests for overall differences among groups in a single test and allows for follow-up comparisons (post-hoc tests) only if necessary. This makes ANOVA the preferred method when comparing more than two groups.
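
As one illustration of the post-hoc step, pairwise t-tests can be combined with a Bonferroni correction, which divides the significance level by the number of comparisons. The sketch below is an assumption about how this might be done, not the notebook's own analysis: the `groups` dictionary holds hypothetical values, and `itertools.combinations` enumerates the pairs.

```python
import itertools

from scipy import stats

# Hypothetical groups with made-up values.
groups = {
    "ctrl": [5.0, 5.4, 4.6, 5.2, 4.9],
    "trt1": [4.4, 4.1, 4.9, 4.6, 4.3],
    "trt2": [5.6, 5.9, 5.3, 5.7, 5.5],
}

pairs = list(itertools.combinations(groups, 2))

# Bonferroni correction: split the overall alpha across all comparisons.
alpha = 0.05
adjusted_alpha = alpha / len(pairs)

results = {}
for name_a, name_b in pairs:
    t_stat, p_value = stats.ttest_ind(groups[name_a], groups[name_b])
    results[(name_a, name_b)] = (p_value, p_value < adjusted_alpha)
    print(f"{name_a} vs {name_b}: t = {t_stat:.3f}, p = {p_value:.4f}, "
          f"significant at adjusted alpha: {p_value < adjusted_alpha}")
```

Bonferroni is conservative; Tukey's HSD is a common alternative designed specifically for all pairwise comparisons after ANOVA.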