# Iris Dataset

We take the **[Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set)** as an example. It was introduced by the statistician
Ronald Fisher in 1936 and has been used ever since as an instructive use case for classification.
The data consists of
* measurements of the length and width of both the sepals (the leaf-like parts enclosing the bud) and the petals (the colored leaves of the blossom)
* the Iris species of each flower
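
To make this structure concrete, here is a minimal sketch that builds the same kind of table from the copy of the data bundled with scikit-learn. Using `load_iris` instead of seaborn and the name `iris_sk` are assumptions made for illustration (the bundled copy needs no download); the column names are chosen to mirror the seaborn layout used in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# scikit-learn's bundled copy of the data (assumption: used here instead of seaborn)
raw = load_iris()
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# one row per flower: four measurements plus the species name
iris_sk = pd.DataFrame(raw.data, columns=feature_cols)
iris_sk["species"] = raw.target_names[raw.target]

print(iris_sk.shape)                      # number of rows and columns
print(iris_sk["species"].value_counts())  # samples per species
```

The table has four numerical feature columns and one label column, which is the layout the later cells rely on.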



In [None]:
# the usual setup: 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# seaborn provides an easy way to import the iris dataset as a pandas DataFrame
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()

## Data visualization
The first step should always be an investigation of the data properties, i.e.
* basic statistical properties
* visualization of distributions
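
Since the goal is classification, it can also help to look at the basic statistics separately per species. The following is a sketch under the assumption that the data is taken from scikit-learn's bundled copy; `iris_sk` and `per_species` are illustrative names that do not appear elsewhere in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

raw = load_iris()
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris_sk = pd.DataFrame(raw.data, columns=feature_cols)
iris_sk["species"] = raw.target_names[raw.target]

# mean and standard deviation of every feature, separately for each species
per_species = iris_sk.groupby("species")[feature_cols].agg(["mean", "std"])
print(per_species.round(2))
```

Features whose per-species means differ by much more than their spread are good candidates for separating the classes.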


In [None]:
# basic statistics with pandas
iris.describe()

In [None]:
# distribution of single feature
sns.histplot(data=iris,x='sepal_length',hue='species')

In [None]:
# combined plot of 2 features
sns.jointplot(data=iris,x='sepal_length',y='sepal_width', hue='species')

In [None]:
# combined plot matrix of all features in the dataframe
#
# provides scatter plots of all combinations of numerical columns in the dataframe
# passing the target (= species) as hue colors the points by class
sns.pairplot(iris, hue='species', diag_kind='hist', height=2.0)

***

## Dimensionality Reduction 
The Iris data is also a good showcase for **dimensionality reduction**, i.e. checking whether there is a lower-dimensional representation which retains the essential features.
* In the case of the Iris data there are four feature dimensions
* the scatter plots showed clear correlations between features
  * an indication that fewer dimensions might be sufficient
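
The correlations seen in the scatter plots can also be quantified. This is a small sketch using the pandas `corr` method on the four feature columns; loading the data via scikit-learn and the name `features_sk` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

raw = load_iris()
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
features_sk = pd.DataFrame(raw.data, columns=feature_cols)

# Pearson correlation coefficient for every pair of features
corr = features_sk.corr()
print(corr.round(2))
```

Entries close to +1 or -1 mean that one feature carries nearly the same information as another, which is what a linear method like PCA can exploit.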
  
One standard method is principal component analysis (PCA), which can be applied in case of (reasonably) linear correlations.
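
To illustrate what PCA computes, here is a rough NumPy sketch: center the data, diagonalize the covariance matrix and project onto the eigenvectors sorted by decreasing eigenvalue. It is meant as an illustration and not as a description of scikit-learn's internal implementation; all variable names are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_demo = load_iris().data                 # shape (150, 4)

# 1. center each feature
X_centered = X_demo - X_demo.mean(axis=0)
# 2. covariance matrix of the features (4 x 4)
cov = np.cov(X_centered, rowvar=False)
# 3. eigen-decomposition; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 4. fraction of the total variance along each principal axis
ratio_manual = eigvals / eigvals.sum()
# 5. coordinates of the samples in the new basis
scores_manual = X_centered @ eigvecs

# comparison with scikit-learn (individual axes may differ by an overall sign)
pca_ref = PCA().fit(X_demo)
print(ratio_manual.round(4))
print(pca_ref.explained_variance_ratio_.round(4))
```

Both lines should print the same variance fractions; the projected coordinates agree up to the sign of each axis.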

As before we have to do the usual scikit-learn steps:
* Set up the PCA model
* fit/train
* get the reduced dimensions as output of `transform`

In [None]:
# load the iris dataset again and store the feature matrix
import seaborn as sns
iris = sns.load_dataset('iris')
# feature matrix
X=iris.loc[:,'sepal_length':'petal_width']
X.shape

**Optional scaling**

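The next code cell scales the features to a common mean and spread (zero mean and unit variance). This can be important for PCA, because features with a larger numerical spread otherwise dominate the principal components.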


In [None]:
from sklearn.preprocessing import StandardScaler
Xn = X                       # default: use the unscaled features
scaler = StandardScaler()
scaler.fit(X)                # learn mean and standard deviation per feature
# uncomment next line to apply the scaling
#Xn = scaler.transform(X)
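
The effect of the scaling can be checked directly. The sketch below (with illustrative variable names, and the data loaded from scikit-learn as an assumption) verifies that the scaled features have zero mean and unit standard deviation, and compares the explained variance of a PCA with and without scaling.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_raw = load_iris().data
X_std = StandardScaler().fit_transform(X_raw)

# after scaling every feature has mean 0 and standard deviation 1
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))

# explained variance fractions without and with scaling
ratio_raw = PCA().fit(X_raw).explained_variance_ratio_
ratio_std = PCA().fit(X_std).explained_variance_ratio_
print("unscaled:", ratio_raw.round(3))
print("scaled:  ", ratio_std.round(3))
```

Without scaling, the feature with the largest spread contributes most to the first component, so the first component appears to cover more of the variance than it does after scaling.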

**Setup and fit model**

In [None]:
from sklearn.decomposition import PCA  # 1. Choose the model class
#model = PCA(n_components=2)           # 2. Instantiate the model, keep 2 components
model = PCA()                          # 2. Instantiate -- don't restrict the number of components
#model = PCA(n_components=0.9)         # 2. Instantiate -- keep as many components as needed to reach 90% of the variance
model.fit(Xn)                          # 3. Fit to data. Notice y is not specified!
X_2D = model.transform(Xn)             # 4. Transform the data to principal components (all 4 here; the first two are plotted below)
X_2D.shape
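
The three ways of setting `n_components` shown in the cell above can be compared via the fitted attribute `n_components_`, which reports how many components the model actually kept. This is a sketch on the unscaled Iris features; `kept` and `X_demo` are names introduced only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_demo = load_iris().data

kept = []
for setting in (2, None, 0.9, 0.95):
    pca = PCA(n_components=setting).fit(X_demo)
    kept.append((setting, pca.n_components_))
    print(f"n_components={setting!s:>5} -> {pca.n_components_} component(s) kept")
```

An integer fixes the number of components, `None` keeps all of them, and a float between 0 and 1 keeps as many components as are needed to exceed that fraction of the total variance.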

**Visualize transformed data:**

In [None]:
iris['PCA1'] = X_2D[:, 0]
iris['PCA2'] = X_2D[:, 1]
sns.lmplot(x="PCA1", y="PCA2", hue='species', data=iris, fit_reg=False);

We can display the coefficients of the PCA transformation using the `model.components_` property:

In [None]:
model.components_

In [None]:
plt.matshow(model.components_)
plt.colorbar()
plt.xticks(range(len(X.columns)), X.columns)
plt.xlabel("Feature")
plt.ylabel("Principal components");
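
The rows of `components_` are the directions onto which the centered data is projected. As a sketch of this relation (assuming the default setting without whitening; the variable names are illustrative), the output of `transform` can be reproduced by hand:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_demo = load_iris().data
pca = PCA(n_components=2).fit(X_demo)

# each principal component is a linear combination of the centered features
manual = (X_demo - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(X_demo)))

# the component vectors are orthonormal
gram = pca.components_ @ pca.components_.T
print(gram.round(6))
```

A large absolute coefficient in a row therefore means that the corresponding feature contributes strongly to that principal component.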

Or we plot the correlation like we did before:

In [None]:
sns.lmplot(x="PCA1", y="petal_width", hue='species', data=iris, fit_reg=False);

**The model provides information on the variance covered by each component**

In [None]:
plt.bar(np.arange(len(model.explained_variance_ratio_))+1,model.explained_variance_ratio_)
plt.title('PCA explained variance')

In [None]:
model.explained_variance_ratio_*100,np.cumsum(model.explained_variance_ratio_*100)

### Exercises
* Redo PCA
  * no constraint on n_components --> 4 components
  * n_components = 0.95 --> the model uses as many components as needed to obtain 95% variance coverage
  * repeat with scaled X features

* Do some basic classification (e.g. kNN, logistic regression) using the 2 PCA Iris components and compare with the original kNN using all 4 Iris features

## Clustering

Of course we can also try our clustering methods on the Iris dataset.


### k-Means model
A very simple model is ...
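
As a possible starting point, here is a hedged sketch of k-means clustering on the four Iris features with scikit-learn's `KMeans`. The choice of three clusters, `n_init=10` and `random_state=0`, as well as the comparison via the adjusted Rand index, are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

raw = load_iris()
X_demo = raw.data
species = raw.target_names[raw.target]

# k-means needs the number of clusters up front; 3 is an assumption here
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels_km = kmeans.fit_predict(X_demo)

# how do the clusters (found without labels) line up with the species?
print(pd.crosstab(species, labels_km, rownames=["species"], colnames=["cluster"]))
ari_km = adjusted_rand_score(raw.target, labels_km)
print("adjusted Rand index:", round(ari_km, 3))
```

The cluster numbers are arbitrary, so only the grouping matters; an adjusted Rand index of 1 would mean perfect agreement with the species.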

### Gaussian mixture model
One powerful method is the Gaussian mixture model (GMM) *(for details see Data Science Handbook: 05.12-Gaussian-Mixtures.ipynb)*  
A GMM attempts to model the data as a collection of Gaussian blobs.

We can fit the Gaussian mixture model as follows:

In [None]:
from sklearn.mixture import GaussianMixture       # 1. Choose the model class
#
model =  GaussianMixture(n_components=3,
                         covariance_type='full')  # 2. Instantiate the model with hyperparameters

model.fit(X)                                      # 3. Fit to data. Notice y is not specified!
y_gmm = model.predict(X)                          # 4. Determine cluster labels
# alternative: use the (optionally scaled) features Xn from the scaling cell above
#model.fit(Xn)                                     # 3. Fit to data. Notice y is not specified!
#y_gmm = model.predict(Xn)                         # 4. Determine cluster labels

In [None]:
iris['cluster'] = y_gmm
sns.lmplot(x="PCA1", y="PCA2", data=iris, hue='species',
           col='cluster', fit_reg=False);

##### Plot PCA data for each identified cluster  
The plot indicates good clustering: the identified clusters largely coincide with the species.
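
The agreement between clusters and species can also be put into numbers. The following sketch refits a Gaussian mixture on the Iris features and compares the result with the true labels; `n_init=5`, `random_state=0` and all variable names are assumptions for this example, and the exact counts can vary with the initialization.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

raw = load_iris()
X_demo = raw.data
species = raw.target_names[raw.target]

gmm = GaussianMixture(n_components=3, covariance_type="full",
                      n_init=5, random_state=0).fit(X_demo)
labels_gmm = gmm.predict(X_demo)          # hard assignment: most probable blob
probs_gmm = gmm.predict_proba(X_demo)     # soft assignment: probability per blob

table = pd.crosstab(species, labels_gmm, rownames=["species"], colnames=["cluster"])
print(table)
ari_gmm = adjusted_rand_score(raw.target, labels_gmm)
print("adjusted Rand index:", round(ari_gmm, 3))
```

In contrast to k-means, the mixture model also yields a membership probability for each sample, which shows how confident the assignment is for flowers near a cluster boundary.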


***
# PCA applied to digit data

Another classic example case for ML is handwritten digit data.

A suitable dataset is included with sklearn. First we take a look at it:


In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape

In [None]:
type(digits)

In [None]:
digits?

In [None]:
print(digits.DESCR)

The data comes in an sklearn-specific container (a `Bunch`), basically a list of 8x8 pixel images.
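
A short sketch of how this container can be inspected: it behaves like a dictionary whose entries are also accessible as attributes. The name `digits_demo` is used only in this example.

```python
from sklearn.datasets import load_digits

digits_demo = load_digits()

print(type(digits_demo).__name__)        # name of the container class
print(sorted(digits_demo.keys()))        # available entries
print(digits_demo.images.shape)          # (n_samples, 8, 8) pixel images
print(digits_demo.data.shape)            # (n_samples, 64) flattened images
print(digits_demo.target[:10])           # the labels of the first ten images
```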

We plot a sub-set:

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')

The plot shows each pixel image together with its label (in green).

* Some images are obvious
* Others seem difficult 

In [None]:
# Look at data from 1st image --> 2D table resembles 0
print(digits.images[0])

In [None]:
digits.images[0].shape

## Image data with sklearn:
To use the data with sklearn as before we need a 2D structure: `samples x features`, i.e. each 8x8 image should be transformed into a flat array of 64 values.

This is already provided in the dataset as the element `data`:

In [None]:
print (digits.data[0])
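
For image data that does not come with a flattened version, the same structure can be obtained with a NumPy `reshape`. The sketch below shows this for the digits and checks that it matches the provided `data` element; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits_demo = load_digits()

# flatten every 8x8 image into one row of 64 pixel values
n_samples = len(digits_demo.images)
flat = digits_demo.images.reshape(n_samples, -1)

print(flat.shape)
print(np.array_equal(flat, digits_demo.data))   # same content as the provided element
```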

In [None]:
# to use as before just re-label
X = digits.data
y = digits.target

### PCA

In [None]:
# first try PCA
from sklearn.decomposition import PCA  # 1. Choose the model class
#model = PCA(n_components=2)           # 2. Instantiate the model with hyperparameters -- 2 components
model = PCA(n_components=0.9)          # 2. Instantiate the model with hyperparameters -- as many components as needed for 90% variance
model.fit(X)                           # 3. Fit to data. Notice y is not specified!
X_2D = model.transform(X)              # 4. Transform the data to the reduced dimensions (more than two here)



**Check covered variance**

In [None]:
model.explained_variance_ratio_*100

In [None]:
np.cumsum(model.explained_variance_ratio_*100)

In [None]:
#plt.bar(np.arange(len(model.explained_variance_ratio_))+1,model.explained_variance_ratio_)
plt.bar(np.arange(len(model.explained_variance_ratio_))+1,np.cumsum(model.explained_variance_ratio_))
plt.title('PCA cumulative explained variance')
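
The cumulative curve can also be read off numerically. As a sketch, the following counts how many components are needed to reach a few variance thresholds and compares the 90% case with what `PCA(n_components=0.9)` selects. The thresholds and variable names are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data

# fit all 64 components once and accumulate the variance fractions
cumulative = np.cumsum(PCA().fit(X_digits).explained_variance_ratio_)

needed = {}
for threshold in (0.5, 0.9, 0.99):
    # index of the first component at which the threshold is reached, plus one
    needed[threshold] = int(np.searchsorted(cumulative, threshold) + 1)
    print(f"{threshold:.0%} of the variance: {needed[threshold]} components")

n_auto = PCA(n_components=0.9).fit(X_digits).n_components_
print("selected by PCA(n_components=0.9):", n_auto)
```

This shows how strongly the 64 pixel dimensions can be compressed for a given loss of variance.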

---

**Now reduce the 64 dimensions to 2 and visualize the result**

In [None]:
# 
from sklearn.decomposition import PCA  # 1. Choose the model class
model = PCA(n_components=2)            # 2. Instantiate the model with hyperparameters -- 2 pars
model.fit(X)                           # 3. Fit to data. Notice y is not specified!
X_2D = model.transform(X)              # 4. Transform the data to two dimensions

In [None]:
xout=pd.DataFrame()
xout['tag']=y
xout['PCA1'] = X_2D[:, 0]
xout['PCA2'] = X_2D[:, 1]
sns.lmplot(x="PCA1", y="PCA2", hue='tag', data=xout, fit_reg=False, markers='.');


Some digits are nicely isolated, others less so.

Think about it: which digits tend to look similar?

We can also have a look at the *principal components* that the PCA has extracted from the digits dataset:

In [None]:
model.components_.shape

In [None]:
# plot PCA components of digits
fig, ax = plt.subplots(1, 2, subplot_kw={'xticks': (), 'yticks': ()})
for idx, comp in enumerate(model.components_[:2]):
    img = comp.reshape(8,8)
    ax.ravel()[idx].imshow(img, cmap="coolwarm")

The left image shows the most important component, the right image the second-most important component extracted by the PCA.
Compare this to the previous plot to see that this actually makes sense:
* If you focus on the blue ("negative") pixels in the left image, they resemble the digit "3". Indeed, the previous plot shows that the 3s cluster at low values of PCA1 (and around 0 for PCA2, i.e. they typically contain little of the second component shown in the right image).
* The red pixels in the left image could be part of a "4", which indeed has high values for PCA1.
* Similarly, the red pixels in the right image roughly outline a "0", which has large positive values for PCA2 (and small absolute values for PCA1).

Can you find the digit "1"?
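
One way to check such statements is to compute the average position of every digit in the PCA1/PCA2 plane. This is a sketch with illustrative names (`centers`, `X_digits`, `y_digits`); it only summarizes the scatter plot above and makes no claim beyond it.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits_demo = load_digits()
X_digits, y_digits = digits_demo.data, digits_demo.target

# project all images onto the first two principal components
scores = PCA(n_components=2).fit_transform(X_digits)

# mean PCA1/PCA2 coordinates per digit class
centers = (pd.DataFrame(scores, columns=["PCA1", "PCA2"])
             .groupby(y_digits)
             .mean())
print(centers.round(1))
```

Digits whose mean positions lie close together in this table are the ones that overlap in the two-dimensional plot.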