# IRIS Dataset

Iris is a flower classification dataset covering 3 species of iris. Each species has 50 samples, and each sample records the sepal and petal measurements of one flower.

![Iris Dataset Description](../assets/iris-dataset-desc.png)

The dataset is used to classify flowers by species using the length and width of their sepals and petals. The species are as follows:

![Iris Flower Species in the Iris dataset](../assets/iris-species.png)

To use the dataset, we can rely on the scikit-learn library, which provides an easy-to-use data loader for the Iris dataset.

# Loading dataset

In [1]:
from sklearn.datasets import load_iris

iris = load_iris()

In [2]:
iris['data'].shape

(150, 4)

The contents of the iris dataset are as follows:

In [3]:
dir(iris)

['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']

The samples of the dataset can be accessed with `iris['data']`, which contains 4 columns. The names of these columns can be accessed with `iris['feature_names']`.

In [4]:
iris["data"].shape

(150, 4)

In [5]:
iris["feature_names"]

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
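To see how the columns of `iris['data']` line up with `iris['feature_names']`, the following minimal sketch pairs each feature name with the matching value of the first sample. The variable name `first_sample` is only illustrative and is not part of the original notebook.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Pair each feature name with the corresponding value of the first sample (row 0).
first_sample = dict(zip(iris["feature_names"], iris["data"][0]))

for name, value in first_sample.items():
    print(f"{name}: {value}")
```

Each row of `iris['data']` can be read the same way: the column order always follows the order of `iris['feature_names']`.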

The labels (flower species) of the dataset can be accessed with `iris['target']`; they are already encoded as integers.

In [6]:
iris["target"]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

The names corresponding to these integer labels are:

In [7]:
iris["target_names"]

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
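Because the labels are integers that index into `iris['target_names']`, they can be turned back into species names with NumPy indexing. The sketch below does that and counts the samples per species. The names `species` and `class_counts` are illustrative and do not appear in the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Index the name array with the integer labels to get one species name per sample.
species = iris["target_names"][iris["target"]]

# Count how many samples belong to each species.
names, counts = np.unique(species, return_counts=True)
class_counts = {str(n): int(c) for n, c in zip(names, counts)}

print(class_counts)
# {'setosa': 50, 'versicolor': 50, 'virginica': 50}
```

The counts confirm that the dataset is balanced, with 50 samples per species.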

# Creating training and test set

In machine learning applications, we need at least two different sets of data: a training set and a test set. The training set is used to train a machine learning model, whereas the test set is used to measure its performance.

![Train test set splitting](../assets/train-test-set-splitting.png)

To split this dataset into training and test sets, we can use scikit-learn's `train_test_split` helper function.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], test_size=0.20
)

In [9]:
X_train.shape, X_test.shape

((120, 4), (30, 4))

As you can see, 80% of the data (120 samples) went to the training set and the remaining 20% (30 samples) to the test set.
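The split above is random, so the rows that land in each set change from run to run, and the species are not guaranteed to be equally represented in the test set. As an optional variation (not part of the original notebook), the sketch below passes `random_state` to make the split reproducible and `stratify` to keep the species proportions the same in both sets. The seed value `42` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(
    iris["data"],
    iris["target"],
    test_size=0.20,
    random_state=42,          # arbitrary seed, fixed so the split is reproducible
    stratify=iris["target"],  # keep the species proportions in both sets
)

# Count the samples of each species (labels 0, 1, 2) in both sets.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print(train_counts, test_counts)
# [40 40 40] [10 10 10]
```

With stratification, each species contributes 40 samples to the training set and 10 to the test set, which makes the measured performance less sensitive to an unlucky split.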