## Fundamentals of Data Analysis Project

**Francesco Troja**

***

#### Project

>Create a notebook investigating the variables and data points within the well-known iris flower data set associated with Ronald A Fisher.
>
>• In the notebook, you should discuss the classification of each variable within the data set according to common variable types and scales of measurement in mathematics, statistics, and Python.
>
>• Select, demonstrate, and explain the most appropriate summary statistics to describe each variable.
>
>• Select, demonstrate, and explain the most appropriate plot(s) for each variable.
>
>• The notebook should follow a cohesive narrative about the dataset.

#### Installations

To execute this project, several Python libraries have been utilized. These libraries were chosen for their specific functionalities and capabilities, tailored to the requirements of the project:
1. `pandas`: The library's data structures, DataFrame and Series, organize data efficiently and make operations such as filtering, grouping, and aggregating straightforward. pandas also offers a wide range of functions for data cleaning and preparation, which makes it well suited to real-world data [1].

In [None]:
import pandas as pd
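
As a minimal sketch of the operations mentioned above (filtering, grouping, and aggregating), the following uses a tiny hand-made DataFrame. The column names and values are illustrative assumptions, not taken from the Iris dataset.

```python
import pandas as pd

# Illustrative values only; this tiny frame is not the Iris dataset
df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "length": [1.0, 2.0, 5.0, 7.0],
})

# Filtering: keep only the rows whose length exceeds 1.5
long_rows = df[df["length"] > 1.5]

# Grouping and aggregating: mean length per species
mean_by_species = df.groupby("species")["length"].mean()

print(long_rows)
print(mean_by_species)
```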

#### Fisher's Iris Data set History

The **Iris flower** dataset, also known as *Fisher's Iris* dataset, was introduced by British biologist and statistician Sir **Ronald Aylmer Fisher** in his 1936 article titled "*The Use of Multiple Measurements in Taxonomic Problems*". Additionally, the data set is sometimes referred to as *Anderson's Iris dataset*, as **Edgar Anderson** collected the data to quantify the morphological variation among *Iris flowers* from three related species. The collection process involved gathering two of the three species in the *Gaspé Peninsula* [2], 

>all from the same pasture, on the same day, and measuring them simultaneously using the same apparatus by the same person (ANDERSON, 1935, p. 1306),

while for the *virginica* species:

> Some of my earliest experiences with species iris were with Iris setosa. Before setosa, though, was the "versicolor," growing in our front yard when we moved here almost 40 years ago (we later learned this was I. virginica var. Shrevei, and thanks to the daughter who dug me a clump for Mother's Day before the low area was filled, I still have that one) (ANDERSON, 1935, p. 1312).

In his paper, Fisher introduced **linear discriminant analysis** (LDA), a statistical method used for dimensionality reduction and classification. LDA looks for a linear combination of features that maximizes the separation between classes while minimizing the variation within each class. To illustrate the approach, Fisher used the Iris dataset, which comprises measurements of `sepal length`, `sepal width`, `petal length`, and `petal width` for three species of iris flowers: `setosa`, `versicolor`, and `virginica` (see the image below). By analyzing the linear discriminants, he aimed to identify the features that contribute most to separating the species [3].

<center>

<img src=https://miro.medium.com/v2/resize:fit:1400/1*f6KbPXwksAliMIsibFyGJw.png width="500">

</center>
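
The idea behind LDA described above can be sketched for the simplest case of two classes. The example below is an illustration under stated assumptions, not Fisher's original computation: it uses NumPy (which this notebook does not otherwise import), a hypothetical helper named `fisher_direction`, and hand-made toy values rather than iris measurements. It computes the direction proportional to the inverse within-class scatter matrix times the difference of the class means, then projects the samples onto it.

```python
import numpy as np

def fisher_direction(X0, X1):
    """Unit vector proportional to Sw^{-1} (m1 - m0) for two classes."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: summed outer products of deviations from each class mean
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

# Toy two-feature data (hypothetical values, not iris measurements)
X0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]])
X1 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0], [7.0, 6.0]])

w = fisher_direction(X0, X1)

# Projecting onto w reduces each sample to a single number
z0, z1 = X0 @ w, X1 @ w
print("direction:", w)
print("class 0 projections:", z0)
print("class 1 projections:", z1)
```

On this toy data the projected values of the two classes do not overlap, which is the kind of separation the method aims for.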

#### Importing the data


When working with *datasets* in Python, there are several ways to import data. One common approach is to use the *Pandas library* to read the dataset directly from its online source by passing the *URL* to `pandas.read_csv()`. Alternatively, the dataset can be downloaded, *stored locally* as a *CSV file*, and read by passing its local file path to the same function. The choice between the two depends on factors such as network connectivity and the need for offline access. Here, the `csv_url` variable stores the URL of the Iris dataset, which is available from the [**UC Irvine Machine Learning Repository**](https://archive.ics.uci.edu/dataset/53/iris) [4].

In [None]:
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'


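# iris.data has no header row: pass header=None and name the columns (names chosen here)
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv(csv_url, header=None, names=columns)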
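
The two import strategies described above (reading from the URL or from a local copy) can be combined. The sketch below assumes a hypothetical helper named `load_iris` and a hypothetical local file name. The column names are also an assumption, since the raw `iris.data` file has no header row. The helper reads the local copy when it exists and otherwise reads from the URL and saves a copy for offline use.

```python
from pathlib import Path

import pandas as pd

# Assumed column names: the raw file does not define any
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

def load_iris(local_path, url):
    """Read the local CSV copy if it exists; otherwise read from the URL and cache it."""
    path = Path(local_path)
    if path.exists():
        return pd.read_csv(path, header=None, names=COLUMNS)
    df = pd.read_csv(url, header=None, names=COLUMNS)
    # Save in the same header-less layout so later reads behave identically
    df.to_csv(path, header=False, index=False)
    return df
```

A call such as `load_iris("iris.data", csv_url)` would then touch the network only the first time it runs.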

### References

[1]: Chugh V., (2023). "*Python pandas tutorial: The ultimate guide for beginners*". [DataCamp](https://www.datacamp.com/tutorial/pandas)

[2]: Papers with code, (n.d.). "*iris*". [Papers with code](https://paperswithcode.com/dataset/iris-1#:~:text=for%20Spectral%20Clustering-,The%20Iris%20flower%20data%20set%20or%20Fisher's%20Iris%20data%20set,example%20of%20linear%20discriminant%20analysis.)

[3]: Unzueta D., (2021). "*Fisher’s Linear Discriminant: Intuitively Explained*". [Towards Data Science](https://towardsdatascience.com/fishers-linear-discriminant-intuitively-explained-52a1ba79e1bb)

[4]: Hadzhiev B., (2023). "*How to read a CSV file from a URL using Python [4 Ways]*". [bobbyhadz](https://bobbyhadz.com/blog/read-csv-file-from-url-using-python)

### Additional readings

- Anderson E., (1935). "*The Irises of the Gaspé Peninsula*". From AIS Bulletin #59, Missouri Botanical Garden.
- R. A. Fisher (1936). "*The use of multiple measurements in taxonomic problems*". Annals of Eugenics. 7 (2): 179–188.

***
End