## Lab 10a
## Contents
* Part 1: Scikit-learn overview
* Part 2: Getting started with the iris dataset
* Part 3: Training a machine learning model with scikit-learn
* Part 4: Comparing machine learning models in scikit-learn

## Part 1: Scikit-learn Overview
* What are the benefits and drawbacks of scikit-learn?
* How is scikit-learn organized?
* What methods should be used for a particular problem?
* Read more about [how scikit-learn is organized](http://scikit-learn.org/stable/index.html)


**Benefits:**
* Consistent interface to machine learning models
* Provides many tuning parameters but with sensible defaults
* Exceptional documentation
* Rich set of functionality for companion tasks
* Active community for development and support

**Potential drawbacks:**
* Harder (than R) to get started with machine learning
* Less emphasis (than R) on model interpretability

## Part 2: Getting started with the Iris Dataset
* 50 samples of 3 different species of iris (150 samples total)
* Measurements: sepal length, sepal width, petal length, petal width
* Famous dataset for machine learning because prediction is easy
* Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)

### Import required modules and load data file

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# import iris dataset from sklearn
# Can also use pd.read_table to load another dataset of your choice
iris = load_iris()

In [None]:
# create X (features) and y (response)
X = iris.data  # all four features, shape (150, 4)
y = iris.target

# create pandas DataFrames from the NumPy arrays
pd_X = pd.DataFrame(X)
pd_y = pd.DataFrame(y)

# you can view the Pandas DataFrame of pd_X and pd_y with head()

# pair each sample's feature vector with its class label, as a lookup table for later testing
lookup_iris_name = list(zip(X, y))
lookup_iris_name


The iris dataset contains sepal and petal length and width measurements for three species of iris (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray: one row per sample, one column per feature.

[Scikit-learn documentation](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html)
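
The structure described above can be checked directly. The following is a minimal sketch that reloads the data and inspects it; the variable name `class_counts` is introduced here for illustration and is not part of the lab.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (150, 4): 150 samples, 4 features
print(iris.feature_names)  # names of the four measurement columns
print(iris.target_names)   # species names, indexed by class label 0, 1, 2

# count how many samples carry each class label (hypothetical helper variable)
class_counts = np.bincount(y)
print({str(name): int(n) for name, n in zip(iris.target_names, class_counts)})
```

Each species should appear 50 times, matching the dataset description in Part 2.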

### Plot the Iris Dataset

In [None]:
# Choose two features from X and plot in a scatter plot 
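
One possible way to draw such a plot is sketched below. The choice of features 0 and 2 (sepal length and petal length) is arbitrary and only an assumption for illustration; any pair of columns works the same way.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# hypothetical choice: column 0 (sepal length) against column 2 (petal length)
i, j = 0, 2

fig, ax = plt.subplots()
points = ax.scatter(X[:, i], X[:, j], c=y)  # color each point by its class label
ax.set_xlabel(iris.feature_names[i])
ax.set_ylabel(iris.feature_names[j])
ax.set_title("Iris: two features, colored by species")
# in a notebook the figure renders inline; in a script, call plt.show()
```

Trying other feature pairs shows which measurements separate the species most cleanly.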

### Split dataset into testing and training data

In [None]:
# Use train_test_split for X_train, X_test, y_train, y_test
# YOUR CODE HERE
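
A minimal sketch of the split, assuming a 25% test set and a fixed `random_state=0` so the result is reproducible; both values are illustrative choices, not requirements of the lab.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# hold out 25% of the samples for testing (assumed proportion)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```

Without `random_state`, each run shuffles the data differently, so accuracy figures computed later will vary from run to run.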

## Part 3: Training a machine learning model with scikit-learn
### KNeighbors from sklearn in 4 steps

In [None]:
# Step 1: Import the estimator class you plan to use
from sklearn.neighbors import KNeighborsClassifier

# Step 2: Instantiate the estimator KNeighborsClassifier
# Optional: specify tuning parameters (aka "hyperparameters") during this step
# YOUR CODE HERE

# Step 3: Train the classifier (fit the estimator) with the training data
# YOUR CODE HERE

# Step 4: Estimate the accuracy of the classifier on future data, using the test data
# YOUR CODE HERE
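
The four steps can be sketched end to end as follows. This is one possible solution, not the only one; `n_neighbors=5` and `random_state=0` are assumed values, and the block redoes the train/test split so it runs on its own.

```python
# Step 1: import the estimator class
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

# Step 2: instantiate the estimator (k=5 is an assumed hyperparameter value)
knn = KNeighborsClassifier(n_neighbors=5)

# Step 3: fit the estimator on the training data
knn.fit(X_train, y_train)

# Step 4: estimate accuracy on unseen data using the held-out test set
accuracy = knn.score(X_test, y_test)
print(accuracy)
```

For classifiers, `score` returns the mean accuracy on the data it is given.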

### Use the trained k-NN classifier model to classify new, previously unseen objects

In [None]:
# predict the species of two made-up samples, each giving values for the four Iris features
# (sepal length, sepal width, petal length, petal width); feel free to try changing the values
iris_prediction = knn.predict([[20, 4.3, 1, 2], [1, 4.3, 1, 2]])
iris_prediction
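
`predict` returns integer class labels. A small sketch of turning them into species names follows; the two sample rows are hypothetical measurements chosen for illustration, and the classifier is fit on the full dataset here only to keep the block self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(iris.data, iris.target)

# hypothetical samples: [sepal length, sepal width, petal length, petal width] in cm
new_samples = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]]
labels = knn.predict(new_samples)

# index the array of species names with the predicted integer labels
names = iris.target_names[labels]
print(labels, names)
```

k-NN returns a label for any input, including measurements far outside the range of the data such as a 20 cm sepal, so a prediction by itself says nothing about whether the input is plausible.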

### How sensitive is k-NN classification accuracy to the choice of the 'k' parameter?

In [None]:
k_range = range(1, 21)  # range excludes its end point, so this covers k = 1..20
# plot the accuracy score for each 'k' neighbors from 1 to 20 
# YOUR CODE HERE
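
One way to approach this is to refit the classifier for every k and record the test accuracy. The sketch below assumes k from 1 to 20 and a single fixed split (`random_state=0`); the list name `scores` is introduced for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

k_range = range(1, 21)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))  # test accuracy for this k

fig, ax = plt.subplots()
ax.plot(list(k_range), scores, marker="o")
ax.set_xlabel("k (number of neighbors)")
ax.set_ylabel("test accuracy")
# in a notebook the figure renders inline; in a script, call plt.show()
```

Because the test set is small, the curve depends on the particular split, so it is worth rerunning with a different `random_state` before drawing conclusions.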


### Checkpoint: How sensitive is k-NN classification accuracy to the train/test split proportion?

In [None]:
tt_split_proportions = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

knn = KNeighborsClassifier(n_neighbors=5)
# plot the accuracy score for each split_proportion from 0.2 to 0.8
# YOUR CODE HERE
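
A possible sketch for this checkpoint is shown below. It assumes each proportion is the fraction of data used for training (the lab does not say whether it means the training or the test fraction) and uses one fixed split per proportion.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

tt_split_proportions = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
scores = []
for p in tt_split_proportions:
    # assumption: p is the training fraction; the remainder is the test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=p, random_state=0
    )
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))

fig, ax = plt.subplots()
ax.plot(tt_split_proportions, scores, marker="o")
ax.set_xlabel("training set proportion")
ax.set_ylabel("test accuracy")
# in a notebook the figure renders inline; in a script, call plt.show()
```

A single split per proportion is noisy. Averaging the accuracy over several values of `random_state` for each proportion gives a steadier picture.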

## Part 4: Comparing machine learning models in scikit-learn
The same four-step pattern applies to another machine learning model: the Logistic Regression (aka logit, MaxEnt) classifier. The first two steps are already done for you.

Please fill in the next two steps.

Please check the [reference for Logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) and [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)


In [None]:
# import the class
from sklearn.linear_model import LogisticRegression
# Note: despite its name, LogisticRegression is used for classification

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
# YOUR CODE HERE

# predict the response for new observations
# YOUR CODE HERE
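
The remaining two steps might look like the sketch below. It departs from the cell above in one respect: it sets `max_iter=1000`, an assumed value, because the default iteration limit can produce a convergence warning on unscaled data. The split parameters are likewise illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

# max_iter=1000 is an assumption made here to give the solver room to converge
logreg = LogisticRegression(max_iter=1000)

# fit the model with the training data
logreg.fit(X_train, y_train)

# predict the response for observations the model has not seen
y_pred = logreg.predict(X_test)
accuracy = logreg.score(X_test, y_test)
print(accuracy)
```

The calls are the same `fit`, `predict`, and `score` used with `KNeighborsClassifier`, which is the consistent interface listed among the benefits in Part 1.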

Now you might have the following questions:
* How do I choose which model to use for my supervised learning task?
* How do I choose the best tuning parameters for that model?
* How do I estimate the likely performance of my model on out-of-sample data?

Let's review what we've learned so far in this lab:
* Classification task: Predicting the species of an unknown iris
* Used three classification models: KNN (K=1), KNN (K=5), logistic regression
* Need a way to choose between the models

So our next topic is: Model evaluation procedures. It's a [big topic](http://scikit-learn.org/stable/model_selection.html#model-selection), but we'll start at the beginning.


### 4.1 Evaluation procedure #1: Train and test on the entire dataset
1. Train the model on the entire dataset.

2. Test the model on the same dataset, and evaluate how well we did by comparing the predicted response values with the true response values.

In [None]:
assert (X == iris.data).all()  # if these assertions fail, rerun your notebook
assert (y == iris.target).all()

In [None]:
# Logistic regression

# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X, y)

# predict the response values for the observations in X
logreg.predict(X)

In [None]:
# store the predicted response values
y_pred = logreg.predict(X)

# check how many predictions were generated
len(y_pred)

Classification accuracy:

* Proportion of correct predictions
* Common evaluation metric for classification problems

In [None]:
# Compute the accuracy of your logreg classifier on the training data
# YOUR CODE HERE
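
Since accuracy is the proportion of correct predictions, it can be computed by comparing the predicted and true labels element by element. The sketch below is one way to do so; `max_iter=1000` is an assumed setting to help the solver converge, and `training_accuracy` is a name introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

logreg = LogisticRegression(max_iter=1000)  # max_iter is an assumed setting
logreg.fit(X, y)
y_pred = logreg.predict(X)

# (y_pred == y) is a boolean array; its mean is the fraction of correct predictions
training_accuracy = np.mean(y_pred == y)
print(training_accuracy)

# the estimator's own score method should report the same number
print(logreg.score(X, y))
```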

This quantity is known as training accuracy.

Computing accuracy and other machine learning metrics is such a common task that scikit-learn has a whole module, sklearn.metrics, dedicated to it. Below is an example of using it to compute the accuracy of y_pred versus y.

In [None]:
# compute classification accuracy for the logistic regression model
from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))

Below we compute the training accuracy of both the K=5 and K=1 KNN classifiers discussed earlier.



In [None]:
# KNN (K=5)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
y_pred = knn.predict(X)
print(metrics.accuracy_score(y, y_pred))

In [None]:
# KNN (K=1)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
y_pred = knn.predict(X)
print(metrics.accuracy_score(y, y_pred))

In [None]:
# Explain why K=1 has an accuracy of 1.0.
# YOUR CODE HERE
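
A hint toward the explanation: when the model is tested on the data it was trained on, the nearest neighbor of every point is a stored point at distance zero. The sketch below checks this with `kneighbors`; it is offered as an illustration, not as the required answer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

# for each training point, find its single nearest neighbor among the stored points
distances, indices = knn.kneighbors(X)

print(distances.max())  # largest nearest-neighbor distance over all points
print(knn.score(X, y))  # training accuracy
```

With K=1 the classifier effectively memorizes the training set, so training accuracy says little about performance on new data. That is the motivation for evaluating on a held-out test set instead.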

Congrats, you're done!