<img align="right" width="120" height="120" style='padding: 0px 30px;' src="images/RR-logo.png"/>

# Introduction to Machine Learning with scikit-learn





*Click into a cell and hit "Shift+Enter" to execute it!*

In [None]:
# Let's apply some Dartmouth-style colors to our plots

import matplotlib as mpl

mpl.rcParams.update({
    'figure.facecolor': "#EBF3EF",
    'figure.figsize': [7.50, 3.50],
    'axes.prop_cycle': mpl.cycler(color=["#00693E", "#12312B", "#C3DD88", "#6EAA8D", "#797979", "#EBF3EF"]),
    'axes.facecolor': "#FFFFFF",
    'axes.labelcolor': '#12312B',
    'text.color': '#12312B'
})

## Loading a dataset

In this notebook, we will work with the famous *Iris* flower dataset. It is a multivariate dataset of 150 samples: 50 from each of three *Iris* species (*setosa*, *versicolor*, and *virginica*). Each sample is described by four features: the length and width of its sepals and of its petals, all measured in centimeters.

<style type="text/css" >
table {
    border-style: hidden;
    border-collapse: collapse;
    text-align: center;
    border-top: 3px solid;
    border-bottom: 3px solid;
}

tr, td, th {
    border-bottom: none !important;
    border-left: none !important;
    border-right: none !important;
}

</style>

<table>
  <tr>
    <th>Iris setosa</th>
    <th>Iris versicolor</th>
    <th>Iris virginica</th>    
  </tr>
  <tr>
    <td><img align="center" width="200" height="200" src="images/iris_setosa.jpg"/></td>
    <td><img align="center" width="200" height="200" src="images/iris_versicolor.jpg"/></td>
    <td><img align="center" width="200" height="200" src="images/iris_virginica.jpg"/></td>
  </tr>
  <tr>
    <td><a href="https://commons.wikimedia.org/wiki/File:Iris_setosa_2.jpg" target="_blank">"Iris setosa"</a> by <a href="https://commons.wikimedia.org/wiki/User:Kulmalukko" target="_blank">Tiia Monto</a><br> is licensed under <a href="http://creativecommons.org/licenses/by-sa/4.0" target="_blank">CC BY-SA 4.0</a></td>
    <td><a href="https://commons.wikimedia.org/wiki/File:Blue_Flag_Iris_(15246206044).jpg" target="_blank">"Blue Flag Iris"</a> by <a href="https://www.flickr.com/people/49208525@N08" target="_blank">USFWSmidwest</a><br> is licensed under <a href="http://creativecommons.org/licenses/by/2.0" target="_blank">CC BY 2.0</a></td>
    <td><a href="https://commons.wikimedia.org/wiki/File:Iris_virginica_L_JdP_2013-05-28_n01.jpg" target="_blank">"Virginia Iris"</a> by <a>Marie-Lan Nguyen</a><br> is licensed under <a href="http://creativecommons.org/licenses/by/2.5" target="_blank">CC BY 2.5</a></td>    
  </tr>
  <tr>
    <td>Class 0</td>
    <td>Class 1</td>
    <td>Class 2</td>    
  </tr>
</table>

In [None]:
from sklearn.datasets import load_iris

dataset = load_iris()

X = dataset['data']
y = dataset['target']
feature_names = dataset['feature_names']
target_names = dataset['target_names']

target_names
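
As a quick sanity check on the description above, the following sketch confirms the dataset's dimensions. It reloads the data so that it runs on its own; the names `n_samples`, `n_features`, and `class_counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

dataset = load_iris()
X, y = dataset['data'], dataset['target']

# Rows are samples, columns are features
n_samples, n_features = X.shape
# Number of samples per class label (0, 1, 2)
class_counts = np.bincount(y)

print(n_samples, n_features)  # 150 4
for name, count in zip(dataset['target_names'], class_counts):
    print(name, count)
```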

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.bar(x=target_names, height=np.bincount(y))
plt.title('Class distribution')

In [None]:
import matplotlib.pyplot as plt
from matplotlib.offsetbox import OffsetImage, AnnotationBbox

def get_image(path):
    """Load an image file and wrap it as a small plot marker."""
    return OffsetImage(plt.imread(path, format="png"), zoom=.05)

fig = plt.figure(figsize=(5, 2.5), dpi=300)
ax = fig.add_subplot()
# Invisible scatter plot: it only makes the axes scale to the data
ax.scatter(X[:, 1], X[:, 2], alpha=0.0)
# Draw every sample as a thumbnail of its species
for x0, y0, target in zip(X[:, 1], X[:, 2], y):
    img_path = 'images/iris_' + target_names[target] + '.png'
    ab = AnnotationBbox(get_image(img_path), (x0, y0), frameon=False)
    ax.add_artist(ab)
plt.xlabel(feature_names[1])
plt.ylabel(feature_names[2])
plt.grid(True)
plt.xlim([1.5, 4.5])


## Preprocessing

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
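
To see what `StandardScaler` did, one can check the per-feature mean and standard deviation of the scaled data. This is a minimal, self-contained sketch; `col_means` and `col_stds` are illustrative names that do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris()['data']
X_scaled = StandardScaler().fit_transform(X)

# After standardization every feature should have mean 0 and std 1
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)  # population std (ddof=0), which the scaler uses

print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```

Putting all features on the same scale matters for distance-based models such as k-nearest neighbors, where a feature with a large numeric range would otherwise dominate the distance.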

## Dimensionality reduction

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_scaled)

X_pc = pca.transform(X_scaled)
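
Because `PCA()` is created without `n_components`, it keeps all four components, so `X_pc` is a rotated version of the data rather than a reduced one. A sketch like the following, assuming the same scaling as above, shows how much variance each component explains and can help decide how many to keep. The names `ratios` and `cumulative` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris()['data'])
pca = PCA().fit(X_scaled)

# Fraction of the total variance captured by each principal component
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} (cumulative {c:.3f})")
```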

## Training an estimator

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

knn.fit(X_pc, y)
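
A fitted classifier can be scored right away, but note that the sketch below evaluates it on the very data it was trained on. The resulting accuracy is therefore likely to be optimistic, which is why a held-out test set is usually preferred. `train_accuracy` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_pc = PCA().fit_transform(StandardScaler().fit_transform(X))

knn = KNeighborsClassifier()  # default: n_neighbors=5
knn.fit(X_pc, y)

# score() returns the mean accuracy on the given data and labels
train_accuracy = knn.score(X_pc, y)
print(f"Accuracy on the training data: {train_accuracy:.3f}")
```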

## Testing an estimator

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hold out 20% of the samples for testing; stratify keeps the class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=8)

# Fit the scaler and PCA on the training data only
X_train_scaled = scaler.fit_transform(X_train)
X_train_pc = pca.fit_transform(X_train_scaled)

knn.fit(X_train_pc, y_train)

# Apply the already fitted transformations to the test data (no refitting)
X_test_scaled = scaler.transform(X_test)
X_test_pc = pca.transform(X_test_scaled)

y_pred = knn.predict(X_test_pc)

print(classification_report(y_test, y_pred))
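
The classification report summarizes precision and recall per class. A confusion matrix is a complementary view that shows which classes get mistaken for which. The sketch below repeats the split and preprocessing from the cell above so that it runs on its own; `cm` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=8)

scaler, pca, knn = StandardScaler(), PCA(), KNeighborsClassifier()
# Fit all steps on the training data only
knn.fit(pca.fit_transform(scaler.fit_transform(X_train)), y_train)
# Reuse the fitted steps on the test data
y_pred = knn.predict(pca.transform(scaler.transform(X_test)))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Correct predictions lie on the diagonal; any off-diagonal entry counts samples of the row's class that were predicted as the column's class.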


## Hyperparameter tuning

In [None]:
from sklearn.model_selection import GridSearchCV

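param_grid = {
    'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'weights': ['uniform', 'distance'],
    # 'algorithm' and 'leaf_size' control how neighbors are searched for,
    # so they mainly affect speed and memory rather than the predictions
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [10, 20, 30, 40, 50, 60, 70, 80, 90],
    # Power of the Minkowski distance: 1 = Manhattan, 2 = Euclidean
    'p': [1, 2, 3]
}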

search = GridSearchCV(KNeighborsClassifier(), param_grid, verbose=1, n_jobs=-1)
search.fit(X_train_pc, y_train)
search.best_params_
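
Before launching a grid search, it can be useful to estimate how much work it involves. The sketch below counts the candidates in the grid above with `ParameterGrid`. It assumes the default of 5 cross-validation folds that recent scikit-learn versions use when `cv` is not given; `n_candidates`, `n_folds`, and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [10, 20, 30, 40, 50, 60, 70, 80, 90],
    'p': [1, 2, 3]
}

# Every combination of values is one candidate: 10 * 2 * 4 * 9 * 3
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5  # assumed default for GridSearchCV when cv is not set
n_fits = n_candidates * n_folds  # cross-validation fits, excluding the final refit

print(n_candidates, n_fits)  # 2160 10800
```

The number of candidates grows multiplicatively with every added parameter, so trimming lists that barely influence the result can shorten the search considerably.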

In [None]:
print(classification_report(y_test, search.predict(X_test_pc)))

## Putting it all together

In [None]:
from sklearn.pipeline import Pipeline

steps = [
    ('scaling', StandardScaler()),
    ('pca', PCA()),
    ('knn', KNeighborsClassifier())
]

pipeline = Pipeline(steps)


param_grid = {
    'scaling__with_mean': [True, False],
    'scaling__with_std': [True, False],
    'pca__n_components': [1, 2, 3, 4],
    'knn__n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'knn__weights': ['uniform', 'distance'],
    'knn__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'knn__leaf_size': [10, 20, 30, 40, 50, 60, 70, 80, 90],
    'knn__p': [1, 2, 3]    
}

find_best = GridSearchCV(pipeline, param_grid, verbose=1, n_jobs=-1)

find_best.fit(X_train, y_train)
find_best.best_params_
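
The keys in the grid above follow the pipeline's `<step name>__<parameter>` naming convention. A minimal sketch of how to list these names and set them by hand, using the same step names as in the cell above (`tunable` is an illustrative name):

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaling', StandardScaler()),
    ('pca', PCA()),
    ('knn', KNeighborsClassifier())
])

# Nested parameters are exposed as '<step name>__<parameter>'
tunable = sorted(name for name in pipeline.get_params() if '__' in name)
print(tunable)

# The same names work with set_params, which is what GridSearchCV relies on
pipeline.set_params(knn__n_neighbors=3, pca__n_components=2)
print(pipeline.named_steps['knn'].n_neighbors)
```

Searching over the whole pipeline also means that the scaler and PCA are refitted inside each cross-validation fold, so the validation part of a fold does not influence the preprocessing.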

In [None]:
import pandas as pd
results = pd.DataFrame(find_best.cv_results_)
results.sort_values('rank_test_score')

In [None]:
print(classification_report(y_test, find_best.predict(X_test)))