## Fundamentals of Data Analysis Project

**Francesco Troja**

***

#### Project

>Create a notebook investigating the variables and data points within the well-known iris flower data set associated with Ronald A Fisher.
>
>• In the notebook, you should discuss the classification of each variable within the data set according to common variable types and scales of measurement in mathematics, statistics, and Python.
>
>• Select, demonstrate, and explain the most appropriate summary statistics to describe each variable.
>
>• Select, demonstrate, and explain the most appropriate plot(s) for each variable.
>
>• The notebook should follow a cohesive narrative about the dataset.

#### Import Python Libraries

To execute this project, several Python libraries have been utilized. These libraries were chosen for their specific functionalities and capabilities, tailored to the requirements of the project:
1. `pandas`: The library's data structures, DataFrame and Series, allow data to be organized and structured efficiently, which makes operations such as filtering, grouping, and aggregating straightforward. pandas also offers a wide range of functions for data cleaning and preparation, making it well suited to real-world data challenges[1].

In [6]:
import pandas as pd

#### Fisher's Iris Data set History

The **Iris flower** dataset, also known as *Fisher's Iris* dataset, was introduced by British biologist and statistician Sir **Ronald Aylmer Fisher** in his 1936 article titled "*The Use of Multiple Measurements in Taxonomic Problems*". Additionally, the data set is sometimes referred to as *Anderson's Iris dataset*, as **Edgar Anderson** collected the data to quantify the morphological variation among *Iris flowers* from three related species. The collection process involved gathering two of the three species in the *Gaspé Peninsula* [2], 

>all from the same pasture, on the same day, and measuring them simultaneously using the same apparatus by the same person (ANDERSON, 1935, p. 1306),

while for the *virginica* species:

> Some of my earliest experiences with species iris were with Iris setosa. Before setosa, though, was the "versicolor," growing in our front yard when we moved here almost 40 years ago (we later learned this was I. virginica var. Shrevei, and thanks to the daughter who dug me a clump for Mother's Day before the low area was filled, I still have that one) (ANDERSON, 1935, p. 1312).

In his paper, Fisher introduces the concept of **linear discriminant analysis** (LDA), a statistical method used for dimensionality reduction and classification. The objective of LDA is to find a linear combination of features that maximizes the separation between classes while minimizing the variation within each class. To illustrate the approach, Fisher used the Iris dataset, which comprises measurements of `sepal length`, `sepal width`, `petal length`, and `petal width` for three species of iris flowers: `setosa`, `versicolor`, and `virginica` (see the image below). He applied LDA to these measurements to differentiate between the species and, by analyzing the linear discriminants, to identify the features that contribute most to their separation[3].

<center>

<img src=https://miro.medium.com/v2/resize:fit:1400/1*f6KbPXwksAliMIsibFyGJw.png width="500">

</center>
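The idea behind Fisher's discriminant can be sketched in a few lines of NumPy. This is a minimal two-class illustration, not a reproduction of Fisher's original calculation: it assumes only two species and two features (petal length and petal width), and it uses just the five *setosa* and five *virginica* rows that appear in the `head()` and `tail()` outputs further down in this notebook. The helper name `within_class_scatter` is hypothetical and introduced only for this example.

```python
import numpy as np

# Petal length and petal width (cm) for five setosa and five virginica flowers,
# copied from the rows displayed in this notebook (an assumption: a tiny subset,
# chosen only to keep the example self-contained).
setosa = np.array([[1.4, 0.2], [1.4, 0.2], [1.3, 0.2], [1.5, 0.2], [1.4, 0.2]])
virginica = np.array([[5.6, 2.4], [5.1, 2.3], [5.1, 1.9], [5.9, 2.3], [5.7, 2.5]])


def within_class_scatter(samples):
    """Sum of outer products of the deviations from the class mean."""
    centered = samples - samples.mean(axis=0)
    return centered.T @ centered


mean_setosa = setosa.mean(axis=0)
mean_virginica = virginica.mean(axis=0)

# Pooled within-class scatter matrix S_w.
s_within = within_class_scatter(setosa) + within_class_scatter(virginica)

# Fisher's direction is proportional to S_w^{-1} (m_2 - m_1).
direction = np.linalg.solve(s_within, mean_virginica - mean_setosa)
direction /= np.linalg.norm(direction)  # scale to unit length for readability

# Project every flower onto the discriminant direction.
proj_setosa = setosa @ direction
proj_virginica = virginica @ direction

print("direction:", direction.round(3))
print("setosa projections:", proj_setosa.round(2))
print("virginica projections:", proj_virginica.round(2))
```

On this small subset the two groups of projected values do not overlap, which is the kind of separation the discriminant is designed to achieve.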

#### Reading Dataset


When dealing with *datasets* in Python, there are multiple methods for importing data, providing a flexible approach. One common strategy is utilizing the *Pandas library* to directly import the dataset from its online source. This involves employing the `pandas.read_csv()` function to read the data directly from the provided *URL*. Alternatively, the dataset can be downloaded and *stored locally* as a *CSV file*. Subsequently, the data can be read from the CSV file by specifying its local file path. Both methods facilitate convenient access to the dataset. The choice between these approaches may depend on factors such as network connectivity and the requirement for offline access to the dataset. In this context, the **csv_url** variable is employed to store the URL of the Iris dataset, which is accessible from the [**UC Irvine Machine Learning Repository**](https://archive.ics.uci.edu/dataset/53/iris) [4].

In [7]:
# store the dataset link in a variable
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = pd.read_csv(csv_url)

print('Find below the Iris Dataset:\n')
iris

Find below the Iris Dataset:



| | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 144 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 145 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 146 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 147 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
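The second approach mentioned above, reading from a locally stored file, might look like the following sketch. To keep it runnable without a network connection, it writes three sample rows to a temporary file instead of downloading the real dataset; the file name `iris_sample.data` and the variable names are assumptions made for this example only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Three rows in the same comma-separated layout as iris.data (no header line).
sample_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "6.7,3.0,5.2,2.3,Iris-virginica\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical local copy of the dataset.
    local_path = Path(tmp_dir) / "iris_sample.data"
    local_path.write_text(sample_text)

    # read_csv accepts a local path in the same way it accepts a URL.
    # header=None tells pandas the file has no header row, so every line is data.
    local_iris = pd.read_csv(local_path, header=None)

print(local_iris.shape)
```

The call is the same as for the URL; only the location string changes, so switching between online and offline access requires editing a single variable.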


#### Analyzing the Data


The *Iris dataset* displayed above reveals a notable issue: the attribute names are not included in the main data file from the *UCI Machine Learning Repository*, so `read_csv()` has treated the first observation as the header row. The names are instead stored in a separate file, **[iris.names](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names)**. This file describes the four measurement attributes (**sepal length in cm**, **sepal width in cm**, **petal length in cm**, **petal width in cm**) and the three distinct classes (**Iris Setosa**, **Iris Versicolor**, and **Iris Virginica**). This information is essential for interpreting the dataset correctly, since it gives each column its meaning. To enhance accessibility and readability, the file containing the attribute names has been added to the repository.

To overcome the issue of missing attribute names in the dataset, the `read_csv()` function provides an argument called `names`, which allows users to specify a list of names to be used for the columns in the DataFrame. In the case of the present dataset, a list named **attribute_names** has been created, containing the attribute names. This list is then passed as an argument to the `names` parameter of the `read_csv()` function, ensuring that the specified names are used for the DataFrame columns instead of relying on the first row in the CSV file as the header row[5].

In [8]:
# Define the names of the columns in the dataset
attribute_names = ['Sepal_Length (cm)', 'Sepal_Width (cm)', 'Petal_Length (cm)', 'Petal_Width (cm)', 'Class']
iris = pd.read_csv(csv_url, names = attribute_names)

print('Modifying the attribute names in the dataset:\n')
iris


Modifying the attribute names in the dataset:



| | Sepal_Length (cm) | Sepal_Width (cm) | Petal_Length (cm) | Petal_Width (cm) | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
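The effect of the `names` argument can be checked on a small in-memory sample. The sketch below assumes three rows copied from the dataset and uses `io.StringIO` in place of the URL so that it runs offline; the variable names are illustrative.

```python
import io

import pandas as pd

# Three rows in the same layout as iris.data (no header line).
raw_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)
column_names = [
    "Sepal_Length (cm)",
    "Sepal_Width (cm)",
    "Petal_Length (cm)",
    "Petal_Width (cm)",
    "Class",
]

# Default behaviour: the first line is consumed as the header,
# so one observation is lost and its values become the column labels.
without_names = pd.read_csv(io.StringIO(raw_text))

# With names=..., every line is treated as data.
with_names = pd.read_csv(io.StringIO(raw_text), names=column_names)

print(len(without_names), list(without_names.columns))
print(len(with_names), list(with_names.columns))
```

Comparing the two row counts shows why the first read of the full dataset appeared to be one observation short.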


The dataset's structure and characteristics can be explored through statistical analysis. Insights into the data, including the number of rows and columns, values, data types, and missing values, are crucial for a comprehensive understanding. Utilizing the Pandas `head()` method provides a view of the top rows (default is 5), while the `tail()` method showcases the bottom rows (default is 5), aiding in initial exploration and assessment[6].

In [9]:
print('the first 5 rows of the dataset:')
iris.head()

the first 5 rows of the dataset:


| | Sepal_Length (cm) | Sepal_Width (cm) | Petal_Length (cm) | Petal_Width (cm) | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [10]:
print('the last 10 rows of the dataset:')
iris.tail(10)

the last 10 rows of the dataset:


| | Sepal_Length (cm) | Sepal_Width (cm) | Petal_Length (cm) | Petal_Width (cm) | Class |
|---|---|---|---|---|---|
| 140 | 6.7 | 3.1 | 5.6 | 2.4 | Iris-virginica |
| 141 | 6.9 | 3.1 | 5.1 | 2.3 | Iris-virginica |
| 142 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
| 143 | 6.8 | 3.2 | 5.9 | 2.3 | Iris-virginica |
| 144 | 6.7 | 3.3 | 5.7 | 2.5 | Iris-virginica |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
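Data types and missing values, also mentioned above, can be inspected with `dtypes`, `isna()`, and `info()`. The following sketch assumes a four-row sample copied from the outputs above, read from memory so that it runs offline; on the full dataset the same calls would be made on `iris`.

```python
import io

import pandas as pd

# Four rows copied from the outputs above (an assumption: a tiny sample only).
sample_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "6.2,3.4,5.4,2.3,Iris-virginica\n"
    "5.9,3.0,5.1,1.8,Iris-virginica\n"
)
column_names = [
    "Sepal_Length (cm)",
    "Sepal_Width (cm)",
    "Petal_Length (cm)",
    "Petal_Width (cm)",
    "Class",
]
sample = pd.read_csv(io.StringIO(sample_text), names=column_names)

# Data type that pandas inferred for each column.
print(sample.dtypes)

# Number of missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Compact summary: column names, non-null counts, dtypes, and memory usage.
sample.info()
```

The four measurement columns are read as floating-point numbers, while the class column holds text labels.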


As the output above shows, the dataset comprises **150 rows** and **5 columns**. The dimensions of the dataset can be confirmed using the pandas `shape` attribute, which returns a tuple whose first element is the number of rows (observations) and whose second element is the number of columns (variables)[7].

In [11]:
print(f'The dimensions of the dataset are: {iris.shape}')
print(f'The number of rows is: {iris.shape[0]}')
print(f'The number of attributes is: {iris.shape[1]}')

The dimensions of the dataset are: (150, 5)
The number of rows is: 150
The number of attributes is: 5
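Because `shape` is a plain tuple, it can also be unpacked directly into two variables. The sketch below assumes a small hand-built DataFrame rather than the full dataset, and cross-checks the result against `len()`.

```python
import pandas as pd

# Small illustrative frame (an assumption: three rows, three columns).
sample = pd.DataFrame(
    {
        "Petal_Length (cm)": [1.4, 1.4, 5.2],
        "Petal_Width (cm)": [0.2, 0.2, 2.3],
        "Class": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
    }
)

# Unpack the (rows, columns) tuple in one step.
n_rows, n_cols = sample.shape
print(f"rows: {n_rows}, columns: {n_cols}")

# len() of a DataFrame is its row count; len() of .columns is its column count.
print(len(sample) == n_rows)
print(len(sample.columns) == n_cols)
```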


### References

[1]: Chugh V., (2023). "*Python pandas tutorial: The ultimate guide for beginners*". [Datacamp](https://www.datacamp.com/tutorial/pandas)

[2]: Papers with code, (n.d.). "*iris*". [Papers with code](https://paperswithcode.com/dataset/iris-1#:~:text=for%20Spectral%20Clustering-,The%20Iris%20flower%20data%20set%20or%20Fisher's%20Iris%20data%20set,example%20of%20linear%20discriminant%20analysis.)

[3]: Unzueta D., (2021). "*Fisher’s Linear Discriminant: Intuitively Explained*". [Towards Data Science](https://towardsdatascience.com/fishers-linear-discriminant-intuitively-explained-52a1ba79e1bb)

[4]: Hadzhiev B., (2023). "*How to read a CSV file from a URL using Python [4 Ways]*". [bobbyhadz](https://bobbyhadz.com/blog/read-csv-file-from-url-using-python)

[5]: Zach, (2023). "*Pandas: Set Column Names when Importing CSV File*". [Statology](https://www.statology.org/pandas-read-csv-column-name/)

[6]: Shazra H., (2023). "*head () and tail () Functions Explained with Examples and Codes*". [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2023/07/head-and-tail-functions/)

[7]: Pandas, (n.d.). "*pandas.DataFrame.shape*". [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html)


### Additional readings

- Anderson E., (1935). "*The Irises of the Gaspe Peninsula*". From AIS Bulletin #59, Missouri Botanical Garden.
- R. A. Fisher (1936). "*The use of multiple measurements in taxonomic problems*". Annals of Eugenics. 7 (2): 179–188.

***
End