# Table of Contents
* [Learning Objectives](#Learning-Objectives)
* [Importing our libraries](#Importing-our-libraries)
	* [Some Simple Data](#Some-Simple-Data)
	* [A Simple kNN Classifier](#A-Simple-kNN-Classifier)
	* [Simple Evaluation](#Simple-Evaluation)
	* [Visualization using two features](#Visualization-using-two-features)
	* [Exercise (exploring grid_step and number of neighbors)](#Exercise-%28exploring-grid_step-and-number-of-neighbors%29)
* [Simple Comparison](#Simple-Comparison)
* [Synthetic Datasets](#Synthetic-Datasets)
	* [make_blobs](#make_blobs)
	* [make_classification](#make_classification)
* [Downloading Common Datasets](#Downloading-Common-Datasets)
	* [Exercise](#Exercise)


# Learning Objectives:

After completion of this module, learners should be able to:

* Explain what KNN classification and logistic regression are.
* Apply the KNN classifier.
* Develop training/testing sets and perform model validation.
* Work with principal component analysis and support vector machines.
* Compare optimization and curve fitting techniques.

K-nearest neighbor (KNN) algorithms can be used for both regression and classification.  In classification, a KNN method assigns a point to the class that receives the most votes among its K nearest neighbors.  A model with K=1 considers only the single nearest neighbor.
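
The voting rule can be sketched in a few lines of NumPy. This is a minimal illustration rather than how scikit-learn implements it: the helper `knn_predict` and the toy points are invented here, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration: two small clusters
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

near_origin = knn_predict(toy_points, toy_labels, np.array([0.5, 0.5]), k=3)
near_far = knn_predict(toy_points, toy_labels, np.array([5.5, 5.5]), k=3)
k1_result = knn_predict(toy_points, toy_labels, np.array([4.0, 4.0]), k=1)
print(near_origin, near_far, k1_result)
```

With `k=1` the query at (4, 4) simply takes the label of its single closest point, (5, 5).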

Logistic regression fits a logistic (sigmoid) function of the input features to model the probability of a binomial or multinomial response.  An example, described [here](https://en.wikipedia.org/wiki/Logistic_regression), is a logistic regression that predicts the probability of passing or failing an exam from the number of hours studied, given past observations of hours studied and pass/fail outcomes.
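
A rough sketch of the exam example, using scikit-learn's `LogisticRegression`. The hours and pass/fail values below are made up for illustration and are not the figures from the linked article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up observations for illustration only:
# hours studied and whether the student passed (1) or failed (0)
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

exam_model = LogisticRegression()
exam_model.fit(hours, passed)

# predict_proba returns one column per class; column 1 is P(pass)
query_hours = np.array([[1.0], [3.0], [5.0]])
pass_probability = exam_model.predict_proba(query_hours)[:, 1]
print(pass_probability)
```

The predicted probability of passing should rise with the number of hours studied, which is the S-shaped relationship the logistic function encodes.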
 

# Importing our libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
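# cross_validation and grid_search were merged into model_selection
# (added in scikit-learn 0.18) and removed in 0.20
from sklearn import (datasets,
                     decomposition,
                     linear_model, model_selection,
                     neighbors, metrics)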
%matplotlib inline

## Some Simple Data

In [None]:
iris = datasets.load_iris()
examples = iris.data
classes  = iris.target
print(iris.DESCR)

In [None]:
# Let's take a look at the "shape" of the data
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['species'] = iris.target
df_iris['species_name'] = [iris.target_names[i] for i in iris.target]
df_iris

In [None]:
# Create a training and a testing set from this data by choosing indices
# (wait a few cells for a better API)

# Random order of indices
n_examples = len(examples)
shuffled_indices = np.random.permutation(n_examples)

# Pick a training/testing split
train_pct = 0.8
train_ct  = int(n_examples * train_pct)

# Select indices for training and testing
train_idx, test_idx = shuffled_indices[:train_ct], shuffled_indices[train_ct:]
train_idx, test_idx
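
For comparison, scikit-learn's `train_test_split` performs the same shuffle-and-slice in one call. The sketch below assumes a recent scikit-learn where it lives in `sklearn.model_selection`; the `random_state` value is an arbitrary choice made here so the split is repeatable.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# test_size=0.2 mirrors the 80/20 split above; random_state is an arbitrary
# seed chosen here so the split is reproducible
train_X, test_X, train_y, test_y = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

print(train_X.shape, test_X.shape)
```

The function returns the split arrays directly, so there is no need to keep index arrays around.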

## A Simple kNN Classifier

In [None]:
knn5 = neighbors.KNeighborsClassifier(n_neighbors=5)

## Simple Evaluation

In [None]:
knn5.fit(examples[train_idx], classes[train_idx])
predictions = knn5.predict(examples[test_idx])
# accuracy_score expects (y_true, y_pred)
print(metrics.accuracy_score(classes[test_idx], predictions))
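
Accuracy is a single number; a confusion matrix shows which classes are being mixed up. The sketch below is self-contained, so it rebuilds the split with `train_test_split` and an arbitrary seed instead of reusing `train_idx` and `test_idx`.

```python
import numpy as np
from sklearn import datasets, metrics, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# random_state is an arbitrary seed, used only to make this sketch repeatable
train_X, test_X, train_y, test_y = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(train_X, train_y)
test_predictions = knn.predict(test_X)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(test_y, test_predictions)
print(cm)

# Correct predictions sit on the diagonal, so accuracy is trace / total
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm, metrics.accuracy_score(test_y, test_predictions))
```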

## Visualization using two features

In [None]:
datasets.make_classification?

In [None]:
# the punch line is to predict for a large grid of data points
# http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
def KNN_2D_map(twodim):
    grid_step = 0.1
    knn5 = neighbors.KNeighborsClassifier(n_neighbors=5)
    knn5.fit(twodim, classes)

    # create testing data points on the standard 
    # Cartesian grid (over our data range)
    # to color the background
    maxes = np.max(twodim, axis=0) + 2*grid_step
    mins  = np.min(twodim, axis=0) -   grid_step

    xs,ys = np.mgrid[mins[0]:maxes[0]:grid_step, 
                     mins[1]:maxes[1]:grid_step]
    grid_points = np.c_[xs.ravel(), ys.ravel()]

    p = knn5.predict(grid_points)

    # plot the predictions at the grid points
    fig = plt.figure(figsize=(10,5))
    ax = fig.gca()
    ax.pcolormesh(xs,ys,p.reshape(xs.shape))

    ax.set_xlim(mins[0], maxes[0]-grid_step)
    ax.set_ylim(mins[1], maxes[1]-grid_step)
    
twodim = examples[:,:2] # select first two features
KNN_2D_map(twodim)

In [None]:
twodim2 = examples[:,2:]  # choose different features
KNN_2D_map(twodim2)

## Exercise (exploring grid_step and number of neighbors)

Quick question: why did we add an extra `grid_step` value to the maxes, above?

Investigate what happens to the decision boundary as we raise or lower the number of neighbors.  You could start by trying a range of neighbor values:  $k=3,5,10,15$.  Could the `grid_step` parameter mislead us if we aren't paying close attention?
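
One possible starting point, sketched without plotting: the hypothetical helper `knn_grid_predictions` below takes the number of neighbors and `grid_step` as parameters and returns the predicted class at every grid point, so maps for different values of $k$ can be compared cell by cell. The choice of $k=5$ as the baseline is arbitrary.

```python
import numpy as np
from sklearn import datasets, neighbors

iris = datasets.load_iris()
twodim = iris.data[:, :2]   # first two features, as in the map above
classes = iris.target


def knn_grid_predictions(points, labels, n_neighbors=5, grid_step=0.1):
    """Fit kNN on 2-D points and predict a class for every point on a grid."""
    knn = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(points, labels)
    mins = points.min(axis=0) - grid_step
    maxes = points.max(axis=0) + 2 * grid_step
    xs, ys = np.mgrid[mins[0]:maxes[0]:grid_step,
                      mins[1]:maxes[1]:grid_step]
    grid_points = np.c_[xs.ravel(), ys.ravel()]
    return knn.predict(grid_points).reshape(xs.shape)


# Same grid for each k, so the prediction maps can be compared cell by cell
baseline = knn_grid_predictions(twodim, classes, n_neighbors=5)
changed_fraction = {}
for k in [3, 5, 10, 15]:
    grid_k = knn_grid_predictions(twodim, classes, n_neighbors=k)
    changed_fraction[k] = float(np.mean(grid_k != baseline))
    print(k, changed_fraction[k])
```

Passing a different `grid_step` to the same helper gives a quick way to see how much detail a coarse grid hides.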

# Simple Comparison

We'll learn about a more efficient comparison method in the next section.

In [None]:
knn5 = neighbors.KNeighborsClassifier(n_neighbors=5)
logreg = linear_model.LogisticRegression()

In [None]:
knn5.fit(examples[train_idx], classes[train_idx])
logreg.fit(examples[train_idx], classes[train_idx])

lr_preds = logreg.predict(examples[test_idx])
knn5_preds = knn5.predict(examples[test_idx])

# accuracy_score expects (y_true, y_pred); logistic regression is printed first
for preds in [lr_preds, knn5_preds]:
    print(metrics.accuracy_score(classes[test_idx], preds))

# Synthetic Datasets

## make_blobs

`sklearn.datasets.make_blobs(n_samples=100, 
                             n_features=2,
                             centers=3,         # number of classes
                             cluster_std=1.0)   # shared -or- class-by-class`
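
A small sketch of the class-by-class form of `cluster_std`. The spread values and `random_state` below are arbitrary choices for illustration.

```python
from sklearn import datasets

# centers=3 creates three classes; a list for cluster_std sets the spread
# of each class separately (a single float would be shared by all classes)
blob_x, blob_y = datasets.make_blobs(n_samples=150,
                                     n_features=2,
                                     centers=3,
                                     cluster_std=[0.5, 1.0, 2.5],
                                     random_state=0)
print(blob_x.shape, blob_y.shape)
print(sorted(set(blob_y.tolist())))
```

Plotting `blob_x` colored by `blob_y`, as in the next cell, should show one tight cluster and one much more diffuse cluster.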

In [None]:
x, y = datasets.make_blobs(n_samples=50)
plt.scatter(x[:,0], x[:,1], c=y, s=50)

## make_classification

`sklearn.datasets.make_classification()`

Many, many arguments.  See: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html

For examples, see:  http://scikit-learn.org/stable/auto_examples/datasets/plot_random_dataset.html

In [None]:
x,y = datasets.make_classification(n_features=2, n_redundant=0, n_informative=2,
                                   n_clusters_per_class=1, n_classes=3)
plt.scatter(x[:, 0], x[:, 1], c=y, s=50);

# Downloading Common Datasets

In [None]:
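# fetch_mldata was removed in scikit-learn 0.22 (mldata.org is offline);
# fetch_openml downloads datasets from openml.org instead
iris_dwn_1 = datasets.fetch_openml('iris', version=1, data_home="./data")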
print(iris_dwn_1.data.shape)
print(iris_dwn_1.target.shape)

In [None]:
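# fetch_mldata no longer exists in scikit-learn; with fetch_openml the target
# is chosen via target_column (assumed here to be named 'class' on OpenML)
iris_dwn_2 = datasets.fetch_openml('iris', version=1,
                                   target_column='class',
                                   data_home="./data")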
print(iris_dwn_2.data.shape)
print(iris_dwn_2.target.shape)