## Programming for Data Analysis
**Course: HDip in Computing in Data Analytics**  
**Module: Applied Statistics**  
**Author: Stefania Verduga**  

***

## Table of Contents
1. Introduction
    - 1.1. Description of the Project
    - 1.2. Objectives of the Project
    - 1.3. Technology and Libraries used for this project
2. T-Test Statistic
3. ANOVA
4. Analysis
5. Conclusion
6. References
***

## 1. Introduction

**1.1. Description of the Project**
The purpose of this project is to analyze the PlantGrowth R dataset[01]. The PlantGrowth data are the results from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions. This dataset contains two main variables, a treatment group and the weight of plants within those groups. 

To analyze this dataset, I will follow these steps:
1. Download and save the dataset to the repository.
2. Describe the data set in the notebook.
3. Describe what a t-test is, how it works, and what the assumptions are.
4. Perform a t-test to determine whether there is a significant difference between the two treatment groups `trt1` and `trt2`.
5. Perform ANOVA to determine whether there is a significant difference between the three treatment groups `ctrl`, `trt1` and `trt2`.
6. Explain why it is more appropriate to apply ANOVA rather than several t-tests when analyzing more than two groups.

**1.2. Objectives of the Project**
The aim of this project is to carry out a statistical analysis using the t-test and ANOVA to determine whether the control and the two treatment conditions differ in their effect on plant growth.

The dataset consists of 30 observations of 2 variables, `weight` and `group`.
The levels of group are `ctrl`, `trt1` and `trt2`.

**1.3. Technology and Libraries used for this project**
This project was developed using Python [02] and the following packages:

- **Pandas**: Used to perform data manipulation and analysis. Pandas is a Python library for data analysis. It is built on top of NumPy and integrates with matplotlib for plotting, so many NumPy and matplotlib operations can be accessed through pandas with less code. [03]
- **NumPy**: Used to perform a wide variety of mathematical operations on arrays. NumPy is the fundamental package for scientific computing in Python. It provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, and shape-manipulation operations. [04]
- **Matplotlib**: Used for data visualization and graphical plotting. Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. [05]
- **Seaborn**: Used for statistical plots. Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. [06]
- **SciPy**: Used for the statistical tests. SciPy is an open-source Python library for solving scientific and mathematical problems. It provides algorithms for optimization, integration, algebraic equations, differential equations, statistics and many other classes of problems. [07]
***

## 2. T-Test Statistic
The t-Test is a statistical hypothesis test used to determine whether there is a significant difference between the means of two groups or conditions. Specifically, the paired t-Test checks whether the mean difference between paired observations (e.g., before and after measurements) is significantly different from zero. It is often used to determine whether a process or treatment actually has an effect on the population of interest [08].

### How does the t-Test work?
The t-Test estimates the difference between two group means as the ratio of that difference to its standard error. For two independent groups, this is the pooled standard error of both groups; for paired observations, the statistic is computed from the per-pair differences, as in the formula below.
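
For two independent groups, the ratio of the difference in means to the pooled standard error can be sketched as follows. This is a minimal illustration, not the project's analysis: the arrays `group_a` and `group_b` are hypothetical values invented for the example, and equal variances in both groups are assumed.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical measurements for two independent groups (illustrative only,
# not taken from the PlantGrowth dataset).
group_a = np.array([4.8, 4.2, 4.4, 3.6, 5.9, 3.8])
group_b = np.array([6.3, 5.1, 5.5, 5.5, 5.4, 5.3])

n_a, n_b = len(group_a), len(group_b)

# Pooled variance: the two sample variances weighted by their degrees of freedom.
pooled_var = (
    (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
) / (n_a + n_b - 2)

# Pooled standard error of the difference between the two means.
pooled_se = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# t-statistic: difference in group means over the pooled standard error.
t_manual = (group_a.mean() - group_b.mean()) / pooled_se

# SciPy's equal-variance two-sample t-test computes the same statistic.
result = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"manual t = {t_manual:.4f}, scipy t = {result.statistic:.4f}")
```

The manual value and the value returned by `stats.ttest_ind` should agree, which is a quick way to check the formula.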

1. #### Formula for the t-Test statistic.
The t-Test statistic is calculated using the formula:

$t = \frac{\bar{d}}{s_d / \sqrt{n}}$

Where:
- ${\bar{d}}$: Mean of the differences between paired samples.
- ${s_d}$ : Sample standard deviation of the differences.
- ${n}$ : Number of paired observations.

A larger absolute value of $t$ means that the mean difference is large relative to its standard error, which is stronger evidence of a real difference between the groups.
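
The paired formula above can be checked numerically with a short sketch. The `before` and `after` arrays are hypothetical values made up for illustration; the manual calculation is then compared against `scipy.stats.ttest_rel`.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical before/after measurements on the same six subjects
# (illustrative values only).
before = np.array([5.1, 4.8, 6.0, 5.5, 4.9, 5.3])
after = np.array([5.6, 5.0, 6.1, 6.2, 5.4, 5.2])

# Differences between paired samples.
d = after - before
n = len(d)

d_bar = d.mean()       # mean of the differences
s_d = d.std(ddof=1)    # sample standard deviation of the differences

# t = d_bar / (s_d / sqrt(n))
t_manual = d_bar / (s_d / np.sqrt(n))

# ttest_rel(a, b) works on the differences a - b, so the order matters.
result = stats.ttest_rel(after, before)
print(f"manual t = {t_manual:.4f}, scipy t = {result.statistic:.4f}")
```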

2. #### Hypothesis
In the paired t-Test, the null hypothesis and the alternative hypothesis are:
- Null Hypothesis ($H_0$): The mean difference between the paired samples is zero ($\mu_d = 0$).
- Alternative Hypothesis ($H_a$): The mean difference is not zero ($\mu_d \neq 0$).

3. #### Distribution
- The t-Test statistic follows a t-distribution with $n-1$ degrees of freedom ($df = n-1$) when the null hypothesis is true.
- The shape of the t-distribution depends on the sample size:
    - For small samples ($n<30$), it is wider and has heavier tails than the normal distribution.
    - For large samples, it approaches the normal distribution.
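
One way to see the heavier tails is to compare two-sided 5% critical values of the t-distribution with the normal value. The degrees of freedom below are arbitrary choices for illustration.

```python
import scipy.stats as stats

# Critical value of the standard normal for a two-sided 5% test.
normal_crit = stats.norm.ppf(0.975)

# Critical values of the t-distribution for a few (arbitrary) degrees of freedom.
t_crits = {df: stats.t.ppf(0.975, df) for df in (5, 29, 1000)}

for df, t_crit in t_crits.items():
    print(f"df = {df:>4}: t critical = {t_crit:.3f} (normal = {normal_crit:.3f})")
```

The critical value shrinks toward the normal value as the degrees of freedom grow, which reflects the tails becoming lighter.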

4. #### P-value
- The p-value is the probability of observing a t-statistic at least as extreme as the calculated one if the null hypothesis is true.
- A small p-value (e.g., $p<0.05$) suggests rejecting the null hypothesis, indicating a statistically significant difference.
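
As a rough sketch of how a two-sided p-value follows from a t-statistic, the example below uses the survival function of the t-distribution. The values of `t_value`, `df` and `alpha` are hypothetical and chosen only for illustration.

```python
import scipy.stats as stats

t_value = 2.5   # hypothetical t-statistic
df = 9          # degrees of freedom, e.g. n - 1 for n = 10 paired observations
alpha = 0.05    # assumed significance level

# Two-sided p-value: probability of a |t| at least this large under H0.
p_value = 2 * stats.t.sf(abs(t_value), df)

reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, reject H0 at alpha = {alpha}: {reject_null}")
```

Doubling the upper-tail probability accounts for extreme values in both directions, which matches the two-sided alternative hypothesis stated earlier.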

In [1]:
# Imports.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [2]:
# Load the dataset.
file_path = '/Users/stefania/Applied Statistics/applied-statistics/PlantGrowth.csv'
data = pd.read_csv(file_path)

# Initial exploration
data.info()
print(data.head())
print(data.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   rownames  30 non-null     int64  
 1   weight    30 non-null     float64
 2   group     30 non-null     object 
dtypes: float64(1), int64(1), object(1)
memory usage: 848.0+ bytes
   rownames  weight group
0         1    4.17  ctrl
1         2    5.58  ctrl
2         3    5.18  ctrl
3         4    6.11  ctrl
4         5    4.50  ctrl
        rownames     weight
count  30.000000  30.000000
mean   15.500000   5.073000
std     8.803408   0.701192
min     1.000000   3.590000
25%     8.250000   4.550000
50%    15.500000   5.155000
75%    22.750000   5.530000
max    30.000000   6.310000
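
The output above shows that `rownames` is only a row counter and that `describe()` pools all three groups. A possible next step, sketched here on a small hypothetical frame with the same column names (the values are invented, not the PlantGrowth data), is to drop the counter and summarize `weight` per group.

```python
import pandas as pd

# Small hypothetical frame with the same columns as the dataset
# (values are illustrative only).
example = pd.DataFrame({
    "rownames": [1, 2, 3, 4, 5, 6],
    "weight": [4.2, 5.6, 4.8, 4.4, 6.3, 5.1],
    "group": ["ctrl", "ctrl", "trt1", "trt1", "trt2", "trt2"],
})

# The row counter carries no information for the analysis.
tidy = example.drop(columns="rownames")

# Per-group sample size, mean and standard deviation of the weights.
summary = tidy.groupby("group")["weight"].agg(["count", "mean", "std"])
print(summary)
```

Applied to the real `data` frame, the same two calls would give one row per treatment group.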


## 6. References
[01] [Dobson, A. J. (1983). An Introduction to Statistical Modelling. London: Chapman and Hall.](https://vincentarelbundock.github.io/Rdatasets/articles/data.html)  
[02] [Python Software Foundation. Python. (2024).](https://www.python.org/)  
[03] [Pandas via NumFOCUS. Pandas. (2024).](https://pandas.pydata.org/)  
[04] [NumPy team. NumPy. (2024).](https://numpy.org/)  
[05] [Matplotlib development team. Matplotlib: Visualization with Python. (2012 - 2024).](https://matplotlib.org/)  
[06] [Seaborn Development Team. Seaborn: Statistical Data Visualization. (2012 - 2024).](https://seaborn.pydata.org/)  
[07] [The SciPy community. SciPy: Fundamental algorithms for scientific computing in Python. (2008 - 2024).](https://docs.scipy.org/doc/scipy/index.html)  
[08] [Scribbr. An Introduction to t Tests. (2024).](https://www.scribbr.com/statistics/t-test/)