# An introduction to visualisation with the Iris dataset
A staple of entry-level machine learning is the Iris dataset: 150 samples from three species of iris plant, each described by four measurements (sepal length, sepal width, petal length and petal width). It is well suited to statistical analysis, decision trees and clustering. However, before we tackle more advanced techniques, we need to be able to present and visualise our data.

As Python on its own doesn't have the functionality to easily plot data, we need to augment the language with two libraries: `matplotlib` for plotting, and `sklearn` (scikit-learn) for our data. First, we're going to import them.

In [4]:
import matplotlib.pyplot as plt
from sklearn import datasets

%matplotlib notebook

### Preparing our data
With our libraries loaded in, we can load in our data. Usually, we'd need to read in a flat file, but for today we can skip that step, as the Iris dataset is so commonly used that the `sklearn` library can load it for us automatically.
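For comparison, here is a minimal sketch of what reading a flat file might look like using only the standard library. The column names and the in-memory `flat_file` are assumptions made for illustration; a real file would be opened by its path instead.

```python
import csv
import io

# Hypothetical flat file contents, held in memory so the sketch is self-contained.
flat_file = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,0\n"
    "4.9,3.0,1.4,0.2,0\n"
    "4.7,3.2,1.3,0.2,0\n"
)

reader = csv.reader(flat_file)
header = next(reader)  # the first row holds the column names

# Convert every remaining row from text to numbers.
rows = [[float(value) for value in row] for row in reader]

measurements = [row[:4] for row in rows]   # first four columns: the measurements
species = [int(row[4]) for row in rows]    # last column: the class label

print(header)
print(measurements[0], species[0])
```

This is the parsing and type conversion that `datasets.load_iris()` saves us from doing by hand.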

Once we have it loaded, let's take a look at our dataset.

In [3]:
iris_dataset = datasets.load_iris()

iris_dataset

 'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

Our data comes back as a dict-like object (scikit-learn calls it a `Bunch`): a set of key-value pairs, some of which hold arrays. We don't need the descriptive entries, only the data section and the identifying labels, so we'll slice what we need out into two objects, `data` and `labels`. The slice `[:, :2]` keeps every row but only the first two columns, sepal length and sepal width.

In [5]:
data = iris_dataset.data[:, :2]
labels = iris_dataset.target
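To check what the slice kept, it can help to look at the keys, shapes and names stored alongside the arrays. The sketch below reloads the dataset so that it runs on its own; the variable names match the cells above.

```python
from sklearn import datasets

iris_dataset = datasets.load_iris()

# The returned object behaves like a dict, so we can list its keys.
print(iris_dataset.keys())

# 150 samples, 4 measurements each.
print(iris_dataset.data.shape)

# Column names for the data array, and species names for the integer labels.
print(iris_dataset.feature_names)
print(iris_dataset.target_names)

# Keep all rows, but only the first two columns (sepal length and sepal width).
data = iris_dataset.data[:, :2]
labels = iris_dataset.target
print(data.shape, labels.shape)
```

The labels are the integers 0, 1 and 2, one per sample, and `target_names` tells us which species each integer stands for.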

### Visualisation

We want to create a simple scatterplot to explore the distribution of the data.

Because our graph is simple, we can use the `plt` interface directly, rather than needing to split it into separate `fig` and `ax` objects. Calling `plt.scatter()` configures our plot as a scatterplot.

Then all we need to do is plug in our inputs and parameters. The first two arguments are always the X and Y data (which we slice from the `data` object). We can then use the `c=` parameter to specify the colour of each point from our labels. We are going to use a preset colour scheme, `plt.cm.Set1`, and a black (`k`) rim around each dot.
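To see roughly how numeric labels become colours, the sketch below reproduces the default behaviour by hand: the values are scaled linearly onto the range 0 to 1 and then looked up in the colour map. The three-element `labels` array is a stand-in for the real labels, used here only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

# Stand-in for the real labels: one entry per class.
labels = np.array([0, 1, 2])

# Scale the labels linearly so the smallest maps to 0.0 and the largest to 1.0.
norm = Normalize(vmin=labels.min(), vmax=labels.max())

# Look each scaled value up in the colour map; the result is one RGBA row per label.
rgba = plt.cm.Set1(norm(labels))

print(rgba.shape)
print(rgba)
```

Each class ends up with its own colour, which is what lets us tell the three species apart in the plot.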

Finally, we can set our title and labels, and the object will automatically plot in the notebook environment!

In [7]:
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap=plt.cm.Set1,
            edgecolor='k')
plt.title('Iris dataset')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')


Text(0,0.5,'Sepal width')
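For comparison, here is a sketch of the same plot written with explicit `fig` and `ax` objects, with a legend added so the colours can be read back as species. The legend is an addition for illustration and is not part of the plot above; the `matplotlib.use("Agg")` line is an assumption so that the sketch runs as a plain script, and should be left out inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for scripts; omit in a notebook
import matplotlib.pyplot as plt
from sklearn import datasets

iris_dataset = datasets.load_iris()
data = iris_dataset.data[:, :2]
labels = iris_dataset.target

# Create the figure and a single set of axes explicitly.
fig, ax = plt.subplots()

# Same scatter call as before, but made on the axes object.
scatter = ax.scatter(data[:, 0], data[:, 1], c=labels, cmap=plt.cm.Set1,
                     edgecolor='k')

# The plt.title / plt.xlabel / plt.ylabel calls become set_* methods on the axes.
ax.set_title('Iris dataset')
ax.set_xlabel('Sepal length')
ax.set_ylabel('Sepal width')

# Build one legend entry per class and label it with the species name.
handles, _ = scatter.legend_elements()
ax.legend(handles, iris_dataset.target_names, title='Species')
```

The explicit form is more verbose, but it becomes useful once a figure holds more than one set of axes.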