# Getting started in scikit-learn with the famous iris dataset
https://www.youtube.com/watch?v=hd1W4CyPX58

## Agenda
* What is the iris dataset?
* How do we load the dataset?
* How do we describe a dataset using ML terminology?
* What are scikit-learn's four key requirements for working with data?

In [1]:
from IPython.display import HTML
HTML('<iframe src=http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data width=300 height=200></iframe>')



## Machine learning on the iris dataset
* Framed as a **supervised learning** problem: predict the species of an iris from its measurements
* Famous dataset for machine learning because prediction is **easy**

## Loading the iris dataset into scikit-learn

In [2]:
from sklearn.datasets import load_iris

`load_iris()` returns a special container object called a `Bunch`:

In [3]:
iris = load_iris()
type(iris)

sklearn.utils.Bunch
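
As a rough illustration of what a `Bunch` offers, the sketch below lists its keys and checks that dictionary-style and attribute-style access give the same array. The variable names `keys` and `same_data` are chosen for this example only, and the exact set of keys may differ between scikit-learn versions.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dictionary whose keys are also available as attributes.
keys = list(iris.keys())
print(keys)

# Dictionary-style and attribute-style access return the same values.
same_data = bool((iris["data"] == iris.data).all())
print(same_data)
```

Attribute access (`iris.data`) is what the rest of this notebook uses, but `iris["data"]` works equally well.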

In [5]:
print(iris.data)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

## Machine learning terminology
* Each row is an **observation** (aka: sample, example, instance, record)
* Each column is a **feature** (aka: predictor, attribute, independent variable, input, regressor, covariate)

In [6]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
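
To make the row/column terminology concrete, here is a small sketch that pairs the first observation (row 0) with the feature names, giving one value per feature. The name `first_observation` is introduced only for this illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# One observation = one row; each entry in the row belongs to one feature (column).
first_observation = {
    name: float(value)
    for name, value in zip(iris.feature_names, iris.data[0])
}

for name, value in first_observation.items():
    print(f"{name}: {value}")
```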


In [7]:
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [8]:
print(iris.target_names)

['setosa' 'versicolor' 'virginica']
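
The integer codes in `iris.target` are indices into `iris.target_names`. A minimal sketch of decoding them and counting observations per species follows; `species` and `class_counts` are names made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Use the integer codes as indices to recover the species name of every observation.
species = iris.target_names[iris.target]
print(species[:3])

# Count how many observations belong to each class.
codes, counts = np.unique(iris.target, return_counts=True)
class_counts = {str(iris.target_names[c]): int(n) for c, n in zip(codes, counts)}
print(class_counts)
```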


* Each value we are predicting is the **response** (aka: target, outcome, label, dependent variable)
* **Classification** is supervised learning in which the response is **categorical**
* **Regression** is supervised learning in which the response is **ordered and continuous**

Because the iris response (the species) is categorical, we'll use classification techniques for this dataset.

## Requirements for working with data in scikit-learn
1) Features and response are **separate** objects  
2) Features and response should be **numeric**  
3) Features and response should be **NumPy arrays**  
4) Features and response should have **specific shapes**

In [15]:
print("Features:", type(iris.data))
print("Response:", type(iris.target))

Features: <class 'numpy.ndarray'>
Response: <class 'numpy.ndarray'>


#### Note: in scikit-learn, the response object should always be numeric, regardless of regression or classification.
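
If a response arrives as text labels rather than numbers, one possible way to obtain a numeric response is scikit-learn's `LabelEncoder`. The sketch below uses a hypothetical list of labels (not taken from the iris `Bunch`) purely for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, used only to illustrate the encoding step.
labels = ["setosa", "virginica", "versicolor", "setosa"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # classes are sorted, then numbered from 0
print(encoded)
print(encoder.classes_)

# The mapping is reversible, so predictions can be turned back into names.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```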

The feature object should have 2 dimensions, in which the 1st dimension (represented by rows) is the number of observations and the 2nd dimension (represented by columns) is the number of features. 

In [13]:
# 1st dimension = number of obs; 2nd dimension = number of features
print(iris.data.shape)

(150, 4)
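
The two-dimensional requirement also applies when only one feature is used. As a sketch, selecting a single column yields a one-dimensional array, which can be reshaped into a 150-by-1 feature object. The variable names here are illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Selecting one column drops the second dimension.
petal_length_1d = iris.data[:, 2]
print(petal_length_1d.shape)

# reshape(-1, 1) restores it: one row per observation, one column for the single feature.
petal_length_2d = petal_length_1d.reshape(-1, 1)
print(petal_length_2d.shape)
```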


The response object is expected to have a single dimension, whose length should equal the 1st dimension of the feature object (the number of observations).

In [14]:
# single dimension matching number of obs
print(iris.target.shape)

(150,)


Yay, the iris dataset meets all the requirements.
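
The four requirements can also be checked programmatically. The helper below, `meets_requirements`, is a hypothetical function written for this example (it is not part of scikit-learn) and is only a rough sketch of the checks described above.

```python
import numpy as np
from sklearn.datasets import load_iris


def meets_requirements(X, y):
    """Hypothetical helper: check the four requirements listed above."""
    # Requirement 3: both objects are NumPy arrays.
    if not (isinstance(X, np.ndarray) and isinstance(y, np.ndarray)):
        return False
    # Requirement 1: features and response are separate objects.
    separate = X is not y
    # Requirement 2: both hold numeric values.
    numeric = bool(np.issubdtype(X.dtype, np.number) and np.issubdtype(y.dtype, np.number))
    # Requirement 4: X is 2-D, y is 1-D, and they agree on the number of observations.
    shapes_ok = X.ndim == 2 and y.ndim == 1 and X.shape[0] == y.shape[0]
    return separate and numeric and shapes_ok


iris = load_iris()
print(meets_requirements(iris.data, iris.target))
```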

In [16]:
X = iris.data
y = iris.target

Note: by convention, `X` is capitalized because it represents a matrix (two dimensions), and `y` is lowercase because it represents a vector (one dimension).
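
As an alternative sketch, `load_iris` accepts a `return_X_y=True` argument that returns the feature matrix and response vector directly, skipping the `Bunch`:

```python
from sklearn.datasets import load_iris

# Returns (data, target) as a tuple instead of a Bunch.
X, y = load_iris(return_X_y=True)

print(X.shape)
print(y.shape)
```

This is convenient when only the arrays are needed, though it drops the metadata such as `feature_names` and `target_names`.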