## Programming for Data Analysis
**Course: HDip in Computing in Data Analytics**  
**Module: Applied Statistics**  
**Author: Stefania Verduga**  

***

## Table of Contents
1. Introduction
    - 1.1. Description of the Project
    - 1.2. Objectives of the Project
    - 1.3. Technology and Libraries used for this project
2. Analysis
3. Conclusion
4. References
***

## 1. Introduction

**1.1. Description of the Project**
The purpose of this project is to analyze the PlantGrowth R dataset[01]. The PlantGrowth data are the results from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions. This dataset contains two main variables, a treatment group and the weight of plants within those groups. 

To analyze this dataset, I will take the following steps:
1. Download and save the dataset to the repository.
2. Describe the data set in the notebook.
3. Describe what a t-test is, how it works, and what the assumptions are.
4. Perform a t-test to determine whether there is a significant difference between the two treatment groups `trt1` and `trt2`.
5. Perform ANOVA to determine whether there is a significant difference between the three groups `ctrl`, `trt1` and `trt2`.
6. Explain why it is more appropriate to apply ANOVA rather than several t-tests when analyzing more than two groups.
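
As a rough preview of steps 4 to 6, the sketch below runs both tests with `scipy.stats` on the weights from the PlantGrowth data. It rests on a few assumptions that are not fixed by the plan above: the t-test is the equal-variance (Student) version, which is the default of `ttest_ind`; the significance level `alpha` is 0.05; and the familywise error rate formula treats the three pairwise comparisons as independent, which is only an approximation.

```python
import scipy.stats as stats

# Plant weights per group, ten observations each (PlantGrowth dataset).
ctrl = [4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14]
trt1 = [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69]
trt2 = [6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26]

# Step 4: independent two-sample t-test between the two treatments.
# Assumption: equal variances (the scipy default, equal_var=True).
t_stat, t_p = stats.ttest_ind(trt1, trt2)

# Step 5: one-way ANOVA across all three groups.
f_stat, f_p = stats.f_oneway(ctrl, trt1, trt2)

# Step 6: chance of at least one false positive when running several
# tests at level alpha, assuming the tests are independent.
alpha = 0.05   # assumed significance level
n_pairs = 3    # ctrl-trt1, ctrl-trt2, trt1-trt2
fwer = 1 - (1 - alpha) ** n_pairs

print(f"t-test trt1 vs trt2: t = {t_stat:.3f}, p = {t_p:.4f}")
print(f"one-way ANOVA:       F = {f_stat:.3f}, p = {f_p:.4f}")
print(f"familywise error rate for {n_pairs} t-tests: {fwer:.4f}")
```

With three pairwise t-tests at `alpha = 0.05`, the familywise error rate under this approximation is about 0.14, almost three times the nominal level. A single ANOVA keeps the overall error rate at `alpha`.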

**1.2. Objectives of the Project**
The aim of this project is to carry out a statistical analysis using the t-test and ANOVA to determine whether plant growth differs between the control and the two treatment conditions.

The dataset consists of 30 cases on 2 variables, `weight` and `group`.
The levels of `group` are `ctrl`, `trt1` and `trt2`.

**1.3. Technology and Libraries used for this project**
This project was developed using Python [02] and the following packages:

- **Pandas**: Used for data manipulation and analysis. Pandas builds on NumPy for numerical operations and integrates with matplotlib for plotting, so many NumPy and matplotlib features are available through pandas with less code. [03]
- **NumPy**: Used to perform a wide variety of mathematical operations on arrays. NumPy is the fundamental package for scientific computing in Python. It provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, and shape-manipulation operations. [04]
- **Matplotlib**: Used for data visualization and plotting. Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. [05]
- **Seaborn**: Used for statistical data visualization. Seaborn is based on matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics. [06]
- **SciPy**: An open-source Python library for solving scientific and mathematical problems. SciPy provides algorithms for optimization, integration, algebraic equations, differential equations, statistics and many other classes of problems. [07]
***

In [4]:
# Imports.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [6]:
# Load the dataset.
file_path = '/Users/stefania/Applied Statistics/applied-statistics/PlantGrowth.csv'
data = pd.read_csv(file_path)

print(data)

    rownames  weight group
0          1    4.17  ctrl
1          2    5.58  ctrl
2          3    5.18  ctrl
3          4    6.11  ctrl
4          5    4.50  ctrl
5          6    4.61  ctrl
6          7    5.17  ctrl
7          8    4.53  ctrl
8          9    5.33  ctrl
9         10    5.14  ctrl
10        11    4.81  trt1
11        12    4.17  trt1
12        13    4.41  trt1
13        14    3.59  trt1
14        15    5.87  trt1
15        16    3.83  trt1
16        17    6.03  trt1
17        18    4.89  trt1
18        19    4.32  trt1
19        20    4.69  trt1
20        21    6.31  trt2
21        22    5.12  trt2
22        23    5.54  trt2
23        24    5.50  trt2
24        25    5.37  trt2
25        26    5.29  trt2
26        27    4.92  trt2
27        28    6.15  trt2
28        29    5.80  trt2
29        30    5.26  trt2
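
The printed frame has three columns, because `rownames` only repeats the row number and is not one of the two variables described in section 1.2. The sketch below rebuilds the frame from the values printed above so that it runs without the CSV file, drops `rownames`, and summarises `weight` per group. The names `weights`, `plant` and `summary` are illustrative and are not part of the original notebook.

```python
import pandas as pd

# Weights copied from the printed output above, ten per group.
weights = {
    "ctrl": [4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14],
    "trt1": [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69],
    "trt2": [6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26],
}

# Rebuild a frame with the same three columns as the CSV file.
data = pd.DataFrame({
    "rownames": range(1, 31),
    "weight": [w for ws in weights.values() for w in ws],
    "group": [g for g, ws in weights.items() for _ in ws],
})

# Drop the redundant row counter, leaving the two variables of interest.
plant = data.drop(columns="rownames")

# Number of observations, mean and standard deviation of weight per group.
summary = plant.groupby("group")["weight"].agg(["count", "mean", "std"])

print(plant.shape)
print(summary.round(3))
```

With the real file, the same effect can be had by calling `data.drop(columns="rownames")` on the frame returned by `pd.read_csv`.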


## References
[01] [Dobson, A. J. (1983) An Introduction to Statistical Modelling. London: Chapman and Hall.](https://vincentarelbundock.github.io/Rdatasets/articles/data.html)  
[02] [Python Software Foundation. Python. (2024).](https://www.python.org/)  
[03] [Pandas via NumFOCUS. Pandas. (2024).](https://pandas.pydata.org/)  
[04] [NumPy Developers. NumPy: The fundamental package for scientific computing with Python.](https://numpy.org/)  
[05] [Matplotlib development team. Matplotlib: Visualization with Python. (2012 - 2024).](https://matplotlib.org/)  
[06] [Seaborn Development Team. Seaborn: Statistical Data Visualization. (2012 - 2024).](https://seaborn.pydata.org/)  
[07] [The SciPy community. SciPy: Fundamental algorithms for scientific computing in Python. (2008 - 2024).](https://docs.scipy.org/doc/scipy/index.html)  