# Applied Statistics Project

**Francesco Troja**

***

### Table of Contents
1. [Python Libraries](#Python_Libraries)
2. [Problem Statement](#problem_statement)
3. [Introduction](#Introduction)
4. [Importing the Dataset](#Importing_the_Dataset)

### 1. Python Libraries <a class="anchor" id="Python_Libraries"></a>

This notebook relies on a small set of Python libraries, each chosen for the specific functionality it brings to the analysis.

In [2]:
import pandas as pd

### 2. Problem Statement <a class="anchor" id="problem_statement"></a>

> In this project, you will analyze the PlantGrowth R dataset. You will find a short description of it on Vincent Arel-Bundock's Rdatasets page. The dataset contains two main variables, a treatment group and the weight of plants within those groups. Your task is to perform t-tests and ANOVA on this dataset while describing the dataset and explaining your work. In doing this you should:
>1. Download and save the dataset to your repository.
>2. Describe the data set in your notebook.
>3. Describe what a t-test is, how it works, and what the assumptions are.
>4. Perform a t-test to determine whether there is a significant difference between the two treatment groups trt1 and trt2.
>5. Perform ANOVA to determine whether there is a significant difference between the three treatment groups ctrl, trt1, and trt2.
>6. Explain why it is more appropriate to apply ANOVA rather than several t-tests when analyzing more than two groups.

### 3. Introduction <a class="anchor" id="Introduction"></a>

The **PlantGrowth dataset** originates from an experiment aimed at *evaluating the effects of different conditions on plant yields*, specifically *measuring the dried weight of plants*. The dataset comprises observations from three groups: a **control group** and **two treatment groups**, each containing *ten plants*. To ensure fairness and reduce variability, the experiment employed genetically similar seeds, which were randomly assigned to either a nutritionally enriched environment (treatment groups) or standard growing conditions (control group). This setup reflects a completely randomized experimental design. Following a set growing period, the plants were harvested, dried, and weighed, with their weights recorded in grams. The dataset offers a valuable resource for analyzing and comparing the impact of nutritional enhancements on plant growth, serving as a foundational tool for studying treatment efficacy in controlled agricultural experiments $^1$.

### 4. Importing the Dataset <a class="anchor" id="Importing_the_Dataset"></a>

The **PlantGrowth dataset** was downloaded from [Vincent Arel-Bundock's Rdatasets page](https://vincentarelbundock.github.io/Rdatasets/articles/data.html), which offers a curated collection of datasets in CSV format, and saved to the repository.

To begin the project, the first step is to **import the dataset**. A widely used Python library for importing and analyzing datasets is *pandas*, which offers an array of tools for handling, manipulating, and analyzing data. In this case, the *dataset* is stored in a CSV file, which can be loaded with the `read_csv()` function from pandas. This function reads a CSV file into a pandas DataFrame, a versatile structure that simplifies data exploration, manipulation, and analysis. The file path is passed as an argument to specify the file's location on the system; `read_csv()` then reads the data from that file and loads it into a DataFrame, ready for further processing and analysis $^2$.

In [5]:
# read the csv file
df = pd.read_csv('PlantGrowth.csv')
print("Original Dataframe:")
df

Original Dataframe:


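| | rownames | weight | group |
|---|---|---|---|
| 0 | 1 | 4.17 | ctrl |
| 1 | 2 | 5.58 | ctrl |
| 2 | 3 | 5.18 | ctrl |
| 3 | 4 | 6.11 | ctrl |
| 4 | 5 | 4.5 | ctrl |
| 5 | 6 | 4.61 | ctrl |
| 6 | 7 | 5.17 | ctrl |
| 7 | 8 | 4.53 | ctrl |
| 8 | 9 | 5.33 | ctrl |
| 9 | 10 | 5.14 | ctrl |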


Once the dataset is imported and displayed, it is apparent that the **PlantGrowth dataset** is small, with *just 30 entries*. pandas provides the `head()` and `tail()` methods, which by default preview the first and last 5 rows of a DataFrame. Given the small size of this dataset, they are not needed here. A quick look at the structure reveals *three columns*: **rownames** (a row identifier), **weight**, and **group**.
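For reference, the inspection methods mentioned above can be sketched as follows. This is a minimal illustration, not part of the original analysis: to stay self-contained it does not read `PlantGrowth.csv` but builds a stand-in DataFrame from the ten `ctrl` rows displayed earlier, and the names `csv_text` and `sample` are hypothetical.

```python
import io

import pandas as pd

# Assumption: a stand-in for PlantGrowth.csv containing only the ten
# ctrl rows shown in the output above (the full file has 30 rows).
csv_text = """rownames,weight,group
1,4.17,ctrl
2,5.58,ctrl
3,5.18,ctrl
4,6.11,ctrl
5,4.5,ctrl
6,4.61,ctrl
7,5.17,ctrl
8,4.53,ctrl
9,5.33,ctrl
10,5.14,ctrl
"""

# read_csv() accepts a file-like object as well as a file path.
sample = pd.read_csv(io.StringIO(csv_text))

# head() and tail() return the first and last 5 rows by default.
first_rows = sample.head()
last_rows = sample.tail()

# shape is (number of rows, number of columns).
n_rows, n_cols = sample.shape
column_names = list(sample.columns)

# Number of observations in each treatment group.
group_counts = sample["group"].value_counts()

print(first_rows)
print(last_rows)
print(n_rows, n_cols, column_names)
print(group_counts)
```

Running the same calls on the full `df` would report 30 rows, and `value_counts()` on the `group` column is a quick way to confirm that each group contains ten plants.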

### References

$^1$ Dobson, A. J. (1983). "*An Introduction to Statistical Modelling*". Chapman and Hall, Third Edition (2008), p. 40

$^2$ Geeks for Geeks (2024). "*Pandas Read CSV in Python*". [Geeks for Geeks](https://www.geeksforgeeks.org/python-read-csv-using-pandas-read_csv/)

***
End