# Iris Dataset

This notebook describes the Iris Dataset, with examples, and shows how it is used within the world of machine learning.

![Image of 3 Iris flowers](images/iris-machinelearning.png)

## What is the Iris Dataset?

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper _The use of multiple measurements in taxonomic problems_, where he used it as an example of linear discriminant analysis.

The data set consists of 50 samples from each of three species of Iris:
 - Iris setosa
 - Iris virginica
 - Iris versicolor

The dataset focuses on four features which were measured from each sample:
 - The length of the sepals
 - The width of the sepals
 - The length of the petals
 - The width of the petals
 
All measurements are in centimeters. Using a combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
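To make the idea of a linear discriminant model concrete, here is a minimal sketch. It assumes the scikit-learn library is installed (it is not used elsewhere in this notebook) and relies on its bundled copy of the Iris data and its `LinearDiscriminantAnalysis` class. This is a modern library implementation, not a reproduction of Fisher's original calculation.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumption: scikit-learn's bundled copy of the Iris data (150 rows, 4 features)
iris = load_iris()
X, y = iris.data, iris.target

# Fit a linear discriminant model on all four measurements
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Fraction of the training samples assigned to the correct species
train_accuracy = lda.score(X, y)
print(f"Training accuracy: {train_accuracy:.2f}")
```

The score is measured on the same data the model was fitted to, so it only shows how separable the species are, not how well the model would generalise to new flowers.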

## Use of the Iris Dataset

To demonstrate the dataset, we will use the __pandas__ library for Python. Pandas is an open source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools.

We will load the dataset from an external source. To do so, we import the pandas library along with the __seaborn__ library, which allows us to create more visually pleasing graphs.

In [1]:
# Imports
import pandas as pd
import seaborn as sb

In [2]:
# Retrieve dataset from external URL
df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

In [4]:
# Output dataset
print(df)

# Class distribution of the data set
print(df.groupby('species').size())

     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
5             5.4          3.9           1.7          0.4     setosa
6             4.6          3.4           1.4          0.3     setosa
7             5.0          3.4           1.5          0.2     setosa
8             4.4          2.9           1.4          0.2     setosa
9             4.9          3.1           1.5          0.1     setosa
10            5.4          3.7           1.5          0.2     setosa
11            4.8          3.4           1.6          0.2     setosa
12            4.8          3.0           1.4          0.1     setosa
13            4.3          3.0    

Within this dataset, there are 50 samples of each species of iris. Each row contains one numerical value for each of the four features described above, plus the species name.
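The class balance can also be checked programmatically. The sketch below does not download anything: it builds a hypothetical stand-in frame (`labels`) containing only a species column with 50 labels per species, then counts them with `value_counts`. On the real `df` the same call would be `df['species'].value_counts()`.

```python
import pandas as pd

# Hypothetical stand-in frame: only the species column, 50 labels per species
labels = pd.DataFrame({
    "species": ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
})

# Count how many rows belong to each species
counts = labels["species"].value_counts()

# The classes are balanced when every species has the same number of rows
is_balanced = counts.nunique() == 1

print(counts)
print("Balanced:", is_balanced)
```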

## Selecting Specific Data

This dataset is small compared to many other datasets you can obtain, but at 150 rows it is still too large to read comfortably all at once. Using the pandas library, we can request specific data from our dataset. The examples below cover:
 - Selecting specific rows/columns
 - Selecting a certain 'snapshot' of data
 - The head function
 - The info function

### Rows/columns

We can select the columns we wish to see by passing a list of column names, as in the example below:

In [5]:
df[['petal_length', 'species']]

   petal_length species
0           1.4  setosa
1           1.4  setosa
2           1.3  setosa
3           1.5  setosa
4           1.4  setosa
5           1.7  setosa
6           1.4  setosa
7           1.5  setosa
8           1.4  setosa
9           1.5  setosa
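The example above selects columns only. Rows can be selected as well, and the following is a minimal sketch of two common ways to do it. It uses a small frame named `sample` (a name introduced here for illustration) built from the first ten rows printed earlier, so it runs without downloading the full dataset.

```python
import pandas as pd

# First ten rows of the dataset, copied from the output shown earlier
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1],
    "species": ["setosa"] * 10,
})

# Boolean mask: keep only the rows whose petal length exceeds 1.4 cm
long_petals = sample[sample["petal_length"] > 1.4]

# .loc picks rows by index label and columns by name in one step
subset = sample.loc[[0, 5], ["petal_length", "species"]]

print(long_petals)
print(subset)
```

The same expressions work on the full `df`, where the mask would also match rows from the other two species.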


### Certain snapshot

It is possible to examine a certain 'section' of the dataset by slicing it, as in the example below. The slice `4:10` retrieves the rows from position 4 up to, but not including, position 10.

In [8]:
df[4:10]

   sepal_length  sepal_width  petal_length  petal_width species
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
6           4.6          3.4           1.4          0.3  setosa
7           5.0          3.4           1.5          0.2  setosa
8           4.4          2.9           1.4          0.2  setosa
9           4.9          3.1           1.5          0.1  setosa
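The slice boundaries are easy to get wrong, so here is a small sketch comparing three ways of writing the same selection. It again uses a hypothetical `sample` frame built from the first ten rows printed earlier. Note that position-based slicing excludes the end, while label-based slicing with `.loc` includes it.

```python
import pandas as pd

# First ten rows of the dataset, copied from the output shown earlier
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1],
    "species": ["setosa"] * 10,
})

plain_slice = sample[4:10]        # positions 4 to 9, end excluded
by_position = sample.iloc[4:10]   # same selection, explicitly positional
by_label = sample.loc[4:9]        # index labels 4 to 9, end included

print(by_position)
```

With the default integer index all three return the same six rows. They would differ if the index labels were not simply 0, 1, 2, and so on.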


### Head function

This function is used to retrieve the first _n_ rows of the dataset. See example:

In [9]:
df.head(5)

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


Note: There is also a `.tail` function that is used to retrieve the last _n_ rows of the dataset. It is used in the same way.
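A short sketch of both functions follows, using a hypothetical `sample` frame built from the first ten rows printed earlier so that it runs without the download. When no argument is given, both `head` and `tail` return 5 rows.

```python
import pandas as pd

# First ten rows of the dataset, copied from the output shown earlier
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1],
    "species": ["setosa"] * 10,
})

first_rows = sample.head()   # no argument: defaults to the first 5 rows
last_rows = sample.tail(3)   # the last 3 rows of the frame

print(first_rows)
print(last_rows)
```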

### Info function

This function is useful as it prints a summary of the entire dataset: the number of rows, the name, non-null count and data type of each column, and the memory usage. See below:

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
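`info` prints its summary rather than returning it. When the individual pieces are needed as values, they can be read from the frame directly. The sketch below does this on a hypothetical `sample` frame built from the first ten rows printed earlier, so its numbers describe that sample and not the full 150-row dataset.

```python
import pandas as pd

# First ten rows of the dataset, copied from the output shown earlier
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1],
    "species": ["setosa"] * 10,
})

n_rows, n_cols = sample.shape     # number of rows and columns
dtypes = sample.dtypes            # data type of each column
non_null = sample.notna().sum()   # non-null count for each column

print(n_rows, n_cols)
print(dtypes)
print(non_null)
```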
