#  **Iris Data Set**
## **An investigation of its variables and data points**

***
#### **Author:** &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Daniel Mc Donagh
#### **Student No:**&nbsp;&nbsp;&nbsp;&nbsp;G00410864
#### **Module:** &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Fundamentals in Data Analysis
***

### **Background of Iris flower data set**

The Iris flower data set is one of the best-known and most widely used data sets in statistics and machine learning. It was introduced by the British biologist and statistician Ronald A. Fisher in his 1936 article ["The use of multiple measurements in taxonomic problems"](https://onlinelibrary.wiley.com/doi/10.1111/j.1469-1809.1936.tb02137.x). It has been widely used for teaching classification, clustering and pattern recognition.

##### **Origin**

The data were originally collected by Edgar Shannon Anderson, an American botanist, during the early 1930s on the Gaspé Peninsula in Quebec, Canada. He took measurements on three different species of Iris flowers:
- Iris Setosa
- Iris Virginica
- Iris Versicolor

![Species of Iris in Dataset](img/irises.png)


Fisher used the dataset to discriminate between the three species of iris based on features such as petal and sepal length and width. The statistical technique he introduced for this, now known as linear discriminant analysis, looks for a combination of the measurements that best separates the species.

##### **Source**

The dataset is publicly available to download from the [**UC Irvine** Machine Learning Repository](https://archive.ics.uci.edu/dataset/53/iris). The downloaded zip file contains the data in CSV (comma-separated values) format. It can be viewed by opening the `iris.data` file in software such as Notepad. The zip folder also includes an `iris.names` file, which provides useful information about the dataset as well as the attribute names for each of the comma-separated columns.

![CSV format file of Iris dataset shown in Notepad](img/csv+attributes.png)

However, the raw file has no headings, so it needs to be imported into Python as a table with column names to make it more readable and functional.

In [10]:
# Imports the pandas library into python 
import pandas as pd

In [11]:
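# Reads in the CSV file - header set to None so pandas does not take the 1st row as column names
# A forward slash is used in the path because '\i' is an invalid escape sequence in a Python string
dataset = pd.read_csv('data/iris.data', header=None)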

# adds the column names
dataset.columns = ['sepal_length','sepal_width','petal_length','petal_width','class']

# displays the first n rows of the table
dataset.head(n=7)


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
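
As an alternative to assigning `dataset.columns` after loading, `read_csv` also accepts a `names` argument alongside `header=None`. The sketch below is a minimal illustration only. It assumes a small in-memory sample (the first three rows shown above) in place of the `iris.data` file, and the `sample`, `column_names` and `sample_df` variables are introduced here purely for the example.

```python
import io
import pandas as pd

# First three rows of iris.data held in memory so the example needs no file (assumption)
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# header=None stops pandas treating the first row as headings; names supplies them instead
sample_df = pd.read_csv(sample, header=None, names=column_names)

print(sample_df)
```

Passing the names at read time keeps the loading step in a single call, which may be convenient when the same file is read in several places.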


In [12]:
# shows top and bottom 5 rows of table
dataset

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


#### **Alternate Source**
Pandas can also read a CSV file directly from a URL. The [University of Illinois](https://github.com/illinois-cse/data-fa14/blob/gh-pages/data/iris.csv) hosts a GitHub page with the Iris CSV data file, which GitHub renders as an HTML table and which already includes column headings. We will link to the raw version of this file to read the Iris data into the Jupyter Notebook.

In [13]:
# Imports the pandas library into python 
import pandas as pd

In [15]:
# Reads in the CSV file from the URL. This version of the CSV already has column names in its first row, so pandas uses them automatically.
df = pd.read_csv('https://raw.githubusercontent.com/illinois-cse/data-fa14/gh-pages/data/iris.csv')

# shows df (dataframe) , top and bottom 5 rows.
df


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
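
The two sources differ slightly: the UCI table labels its last column `class` and prefixes each label with `Iris-`, while this version uses `species` and the bare species name. A minimal sketch of bringing the first table in line with the second is given below. It assumes a three-row stand-in (`uci_sample`, a name introduced here) rather than the full `dataset` table.

```python
import pandas as pd

# Three-row stand-in for the UCI table (assumption: values copied from the output shown earlier)
uci_sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.7],
    'sepal_width': [3.5, 3.0, 3.0],
    'petal_length': [1.4, 1.4, 5.2],
    'petal_width': [0.2, 0.2, 2.3],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica'],
})

# Rename the label column, then strip the 'Iris-' prefix so both sources use the same labels
harmonised = uci_sample.rename(columns={'class': 'species'})
harmonised['species'] = harmonised['species'].str.replace('Iris-', '', regex=False)

print(harmonised['species'].tolist())
```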


##### **Structure**

To understand the structure of the dataset, the cells below summarise the table as a whole and by flower class.

In [19]:
import seaborn as sns

pd.set_option("display.precision", 2)                       # displays table values to two decimal places

iris_data = dataset                                         # assumption: iris_data refers to the UCI table loaded above

# assumption: the summary is written to a text file; the file name is illustrative
with open("iris_summary.txt", "w") as iris_summary:
    print(iris_data.describe(), file=iris_summary)          # count, mean, standard deviation, minimum, quartiles and maximum
    print("\n", file=iris_summary)                          # adds a line break

    print("The statistical summary of the flower classes", file=iris_summary)
    print(iris_data.describe(include=['object', 'bool']), file=iris_summary)   # statistics for the class of flower
    print("\n", file=iris_summary)                          # adds a line break

    print("The total number of records for each class", file=iris_summary)
    print(iris_data['class'].value_counts(), file=iris_summary)                # counts the number of records for each class
    print("\n", file=iris_summary)                          # adds a line break

    print("Summary of each variable grouped by the flower class:", file=iris_summary)
    print(iris_data.groupby('class').describe(), file=iris_summary)            # summary of variables grouped by class

font1 = {'family': 'oswald', 'color': 'darkred', 'size': 16}   # font1 to be used in plots
font2 = {'family': 'oswald', 'color': 'black', 'size': 14}     # font2 to be used in plots

# data split by flower class for plotting
iris_virginica = iris_data[iris_data["class"] == "Iris-virginica"]
iris_versicolor = iris_data[iris_data["class"] == "Iris-versicolor"]
iris_setosa = iris_data[iris_data["class"] == "Iris-setosa"]

sns.set(style="darkgrid")                                   # sets the background style of the plots

In [None]:
# loc (label-based) and iloc (position-based) extract data that matches criteria; loc is used here with a boolean mask
df.loc[df.loc[:,'species'] == 'setosa']
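
The cell above uses `loc` only. As a rough sketch of how `loc` and `iloc` differ, the example below selects by label and mask with `loc` and by integer position with `iloc`. It assumes a four-row stand-in for `df` (the `sample` table and the result names are introduced here for illustration).

```python
import pandas as pd

# Small stand-in for df (assumption: four rows copied from the table shown earlier)
sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.7, 6.3],
    'petal_length': [1.4, 1.4, 5.2, 5.0],
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
})

# loc: label-based selection, here a boolean mask for the rows and a column label
virginica_petals = sample.loc[sample['species'] == 'virginica', 'petal_length']

# iloc: position-based selection, here the first two rows and the first two columns
top_left = sample.iloc[:2, :2]

print(virginica_petals.tolist())
print(top_left)
```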


In [None]:
# Returns the top 5 rows by default of your dataset
df.head()

In [None]:
# df.tail will give you back the last 5 rows by default

df.tail()

In [None]:
# summarises data
x = df.loc[df.loc[:,'species'] == 'setosa']
x.describe()


In [None]:
df.describe()
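
Alongside `describe()`, the shape and column data types give a quick view of how the table is structured. The sketch below is illustrative only and assumes a three-row stand-in for `df`; the `sample` and `numeric_columns` names are not part of the original notebook.

```python
import pandas as pd

# Small stand-in for df (assumption: rows copied from the table shown earlier)
sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.7],
    'sepal_width': [3.5, 3.0, 3.0],
    'petal_length': [1.4, 1.4, 5.2],
    'petal_width': [0.2, 0.2, 2.3],
    'species': ['setosa', 'setosa', 'virginica'],
})

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # data type of each column

# Columns holding numbers, as opposed to the text label column
numeric_columns = sample.select_dtypes(include='number').columns.tolist()
print(numeric_columns)
```

On the full table the same calls would be expected to report 150 rows and 5 columns.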


## Variables
The dataset consists of a total of 150 samples of iris flowers, 50 from each of three separate species:
- Setosa
- Versicolor
- Virginica

The four numeric variables measured for each flower, in centimeters, are:
- Sepal Width 
- Sepal Length
- Petal Width
- Petal Length

The sepal is the outermost part of the flower, which protects it when it is a bud. The petal is the inner part of the flower, which surrounds the reproductive organs.

### Categorical Variable
The species of Iris flower is a categorical variable. It can take only one of three values: Setosa, Versicolor or Virginica.
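
One way to make this explicit in pandas is to store the labels with the `category` dtype. The sketch below is a hedged illustration using four hypothetical labels rather than the full `species` column. The `species` and `species_cat` names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample of species labels standing in for the full column
species = pd.Series(['setosa', 'versicolor', 'virginica', 'setosa'])

# Converting to the category dtype records the fixed set of allowed values
species_cat = species.astype('category')

print(species_cat.cat.categories.tolist())   # the distinct categories
print(species_cat.cat.codes.tolist())        # integer code assigned to each label
print(species_cat.value_counts())            # number of records per category
```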

### Numeric Variable
The lengths and widths of the sepals and petals are numeric values measured in centimeters. Each measurement is spread around a mean value, and that distribution differs between species. Visualising the data shows how the petal and sepal measurements are correlated within a particular species of iris.
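
The difference between species can also be checked numerically. The sketch below computes a mean per species and a Pearson correlation between petal length and petal width. It assumes only the eleven setosa and virginica rows shown in the earlier outputs, so the figures it prints are illustrative and are not the statistics of the full dataset.

```python
import pandas as pd

# Petal measurements from the setosa and virginica rows shown earlier (assumption: a subset only)
sample = pd.DataFrame({
    'petal_length': [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 5.2, 5.0, 5.2, 5.4],
    'petal_width': [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 2.3, 1.9, 2.0, 2.3],
    'species': ['setosa'] * 7 + ['virginica'] * 4,
})

# Mean of each measurement within each species
species_means = sample.groupby('species')[['petal_length', 'petal_width']].mean()
print(species_means)

# Pearson correlation between the two petal measurements across all sample rows
overall_corr = sample['petal_length'].corr(sample['petal_width'])
print(overall_corr)
```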


## Data Analysis

The Python program below uses seaborn to create scatter plots of each variable against every other variable. These plots help to visualise trends and patterns in the data. From a machine learning perspective, a clear separation between species on the plots is very useful, because it suggests that the species can be determined solely from the numeric variables.


In [None]:
import seaborn as sns

In [None]:
# plots each column against all other columns
# on the diagonal each variable is plotted against itself as a distribution;
# with hue set, seaborn draws a density curve per species by default (diag_kind='hist' gives histograms)
sns.pairplot(df, hue='species')
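
To illustrate how a visible separation on a plot can be turned into a rule, the sketch below labels a flower as setosa when its petal length falls under a cut-off. This is not Fisher's method. It is a simple threshold check, and the 2.5 cm cut-off is a hypothetical value chosen only because it lies between the setosa and virginica petal lengths shown in the earlier outputs.

```python
import pandas as pd

# Six rows taken from the tables shown earlier (assumption: a small subset, not the full data)
sample = pd.DataFrame({
    'petal_length': [1.4, 1.3, 1.7, 5.2, 5.0, 5.4],
    'species': ['setosa', 'setosa', 'setosa', 'virginica', 'virginica', 'virginica'],
})

PETAL_LENGTH_CUTOFF = 2.5   # hypothetical threshold in centimeters, for illustration only

# Predict setosa when the petal is shorter than the cut-off
predicted_setosa = sample['petal_length'] < PETAL_LENGTH_CUTOFF
actual_setosa = sample['species'] == 'setosa'

# Share of sample rows where the rule agrees with the recorded species
accuracy = (predicted_setosa == actual_setosa).mean()
print(accuracy)
```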


[Iris Pictures Reference](https://en.wikipedia.org/wiki/Iris_flower_data_set)

**Iris Species**

![Iris Setosa](img/iris_setosa.jpg)
![Iris Versicolor](img/iris_versicolor.jpg)
![Iris Virginica](img/iris_virginica.jpg)

$$ f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2} \big( \frac{x - \mu}{\sigma} \big)^2} $$


$$ f(x) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}x^2} $$
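
The two formulas above are the normal density with mean $\mu$ and standard deviation $\sigma$, and its standard form with $\mu = 0$ and $\sigma = 1$. A minimal sketch of evaluating them is given below. The function name `normal_pdf` and the example values of `mu` and `sigma` are assumptions made for illustration and are not estimates from the Iris data.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma   # standardised value
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Standard normal density at its centre, equal to 1 / sqrt(2 * pi)
print(normal_pdf(0.0))

# General form with hypothetical values mu = 5.0 and sigma = 0.5
print(normal_pdf(5.0, mu=5.0, sigma=0.5))
```

With the defaults the function reduces to the second formula, and a smaller `sigma` gives a taller, narrower curve around `mu`.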

### **References**

**Iris CSV file** University of Illinois, GitHub
https://github.com/illinois-cse/data-fa14/blob/gh-pages/data/iris.csv

**Pandas read_csv** pandas.pydata.org (Pandas, 2023)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas-read-csv

**Seaborn pairplot** seaborn.pydata.org (Seaborn, 2012-2023)
https://seaborn.pydata.org/generated/seaborn.pairplot.html

***

**END**