Summary
===============

### Topics covered:
- Obtain unbiased estimates of a model's performance
- Diagnose the common problems of machine learning algorithms 
- Fine-tune machine learning models
- Evaluate predictive models using different performance metrics
_______________________________________________

### Streamlining workflows with pipelines
- **The Pipeline class** in scikit-learn allows us to fit a model including an arbitrary number of transformation steps and apply it to make predictions about new data.
_________________________________________________

### Loading the Breast Cancer Wisconsin dataset
- Contains 569 samples of malignant and benign tumor cells
- The first two columns store the unique ID numbers of the samples and the corresponding diagnosis (M = malignant, B = benign), respectively.
- Columns 3-32 contain 30 real-valued features computed from digitized images of the cell nuclei, which can be used to build a model that predicts whether a tumor is benign or malignant.


In [3]:
# Reading in the dataset directly from the UCI website using pandas:
import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', 
                 header=None)

In [4]:
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [5]:
# Assign the 30 features to a NumPy array X. Using LabelEncoder,
# transform the class labels from their original string representation
# (M and B) into integers

from sklearn.preprocessing import LabelEncoder
X = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)

In [6]:
X[0]

array([  1.79900000e+01,   1.03800000e+01,   1.22800000e+02,
         1.00100000e+03,   1.18400000e-01,   2.77600000e-01,
         3.00100000e-01,   1.47100000e-01,   2.41900000e-01,
         7.87100000e-02,   1.09500000e+00,   9.05300000e-01,
         8.58900000e+00,   1.53400000e+02,   6.39900000e-03,
         4.90400000e-02,   5.37300000e-02,   1.58700000e-02,
         3.00300000e-02,   6.19300000e-03,   2.53800000e+01,
         1.73300000e+01,   1.84600000e+02,   2.01900000e+03,
         1.62200000e-01,   6.65600000e-01,   7.11900000e-01,
         2.65400000e-01,   4.60100000e-01,   1.18900000e-01])

In [7]:
y[0]

1

In [8]:
# After encoding the class labels (diagnosis) in the array y, the malignant
# tumors are represented as class 1 and the benign tumors as class 0
le.transform(['M', 'B'])

array([1, 0], dtype=int64)

In [9]:
# Divide the dataset into separate training (80%) and test (20%) datasets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                   random_state=1)

### Combining transformers and estimators in a pipeline
- Many learning algorithms require input features on the same scale for optimal performance
- We need to standardize the columns in the Breast Cancer Wisconsin dataset before we can feed them to a linear classifier, such as logistic regression
- Suppose we also want to compress our data from the initial 30 dimensions onto a lower 2-dimensional subspace via **Principal Component Analysis (PCA)**. Instead of going through the fitting and transformation steps for the training and test datasets separately, we can chain the **StandardScaler**, **PCA** and **LogisticRegression** objects in a pipeline.
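
As a rough illustration of the scaling step, here is a minimal pure-Python sketch of standardization on a single feature. The helper names `fit_standardizer` and `standardize` and the toy values are assumptions made for this example; inside the pipeline this role is played by scikit-learn's `StandardScaler`. The point it tries to show is that the mean and standard deviation are estimated from the training data only and then reused on the test data.

```python
import math

def fit_standardizer(column):
    """Estimate mean and population standard deviation from training values."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    return mean, math.sqrt(variance)

def standardize(column, mean, std):
    """Rescale values using parameters that were estimated beforehand."""
    return [(v - mean) / std for v in column]

# Toy values (assumed for illustration, not taken from the dataset)
train_feature = [10.0, 20.0, 30.0, 40.0]
test_feature = [25.0, 50.0]

mean, std = fit_standardizer(train_feature)           # fit on training data only
train_scaled = standardize(train_feature, mean, std)
test_scaled = standardize(test_feature, mean, std)    # reuse training parameters
```

After this step the training feature has zero mean and unit variance, while the test feature is expressed on the same scale without its own statistics leaking into the transformation.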


In [10]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

In [11]:
pipe_lr = Pipeline([('scl', StandardScaler()),
                   ('pca', PCA(n_components=2)),
                   ('clf', LogisticRegression(random_state=1))])
pipe_lr.fit(X_train, y_train)
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))

Test Accuracy: 0.947


- The Pipeline object takes a list of tuples as input
 - The first value in each tuple is an arbitrary identifier string that we can use to access the individual elements in the pipeline
 - The second element in every tuple is a scikit-learn transformer or estimator
 
- The intermediate steps in a pipeline must be scikit-learn transformers
- The last step is an estimator

#### The concept of how pipelines work is summarized in the following figure
![](assets/pipeline.JPG)
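
The fit and predict flow described above can be sketched with a few toy classes. This is a simplified stand-in for the idea, not scikit-learn's implementation: `TinyPipeline`, `Doubler` and `ThresholdClassifier` are hypothetical names invented for this example.

```python
class Doubler:
    """Toy transformer (hypothetical): learns nothing, doubles each value."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [2 * x for x in X]


class ThresholdClassifier:
    """Toy estimator (hypothetical): predicts 1 above the training mean."""
    def fit(self, X, y):
        self.threshold_ = sum(X) / len(X)
        return self

    def predict(self, X):
        return [1 if x > self.threshold_ else 0 for x in X]


class TinyPipeline:
    """Simplified stand-in for the pipeline idea, not scikit-learn's code."""
    def __init__(self, steps):
        self.steps = steps                # list of (name, object) tuples
        self.named_steps = dict(steps)    # look up a step by its identifier

    def fit(self, X, y):
        for _, transformer in self.steps[:-1]:
            # Intermediate steps: fit, then pass the transformed data on
            X = transformer.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)       # last step: fit the estimator
        return self

    def predict(self, X):
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)  # transform only, never refit
        return self.steps[-1][1].predict(X)


pipe = TinyPipeline([('dbl', Doubler()), ('clf', ThresholdClassifier())])
pipe.fit([1, 2, 3, 4], [0, 0, 1, 1])
predictions = pipe.predict([1, 4])
```

Note that `predict` only calls `transform` on the intermediate steps, so new data is processed with the parameters learned during `fit`.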

### Using k-fold cross-validation to assess model performance
One of the key steps in building a machine learning model is to estimate its performance on data that the model hasn't seen before. To achieve this, we use two techniques:
- **Holdout** cross-validation
- **K-fold** cross-validation

The Holdout method
-------------------

- Split the initial dataset into separate training and test datasets.
- The former is used for model training, and the latter is used to estimate its performance.
- However, in typical ML applications, we are also interested in tuning and comparing different parameter settings to further improve the performance for making predictions on unseen data. This process is called **model selection**.
![](assets/holdout.JPG)

**The disadvantage**
The performance estimate is sensitive to how we partition the training set into the training and validation subsets; the estimate will vary for different samples of data.
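
A minimal sketch of a holdout split into training, validation and test indices, using only the standard library. The function name `holdout_split`, the 20% fractions and the seeds are assumptions chosen for illustration; only the sample count of 569 comes from the dataset described earlier. Running it with two different seeds shows how the partition, and therefore the performance estimate, depends on the random split.

```python
import random

def holdout_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=1):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

# Two partitions of the same 569 samples, differing only in the seed
train_idx, val_idx, test_idx = holdout_split(569)
other_train_idx, other_val_idx, _ = holdout_split(569, seed=2)
```

The validation subset would be used for model selection, and the test subset is kept aside for the final performance estimate.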

K-fold cross-validation
-----------------------
In k-fold cross-validation, we randomly split the training dataset into k folds without replacement, where k-1 folds are used for the model training and one fold is used for testing. This procedure is repeated k times so that we obtain k models and performance estimates.

Since k-fold cross-validation is a resampling technique without replacement, each sample point is used for validation exactly once and for training in the other k-1 iterations, which yields a lower-variance estimate of the model performance than the holdout method.
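
To make the fold mechanics concrete, here is a small pure-Python sketch of plain (non-stratified) k-fold index generation. The generator name `kfold_indices` and the toy sizes are assumptions for this example; it is not how scikit-learn implements its splitters. The counter at the end checks that every sample is held out exactly once.

```python
import random

def kfold_indices(n_samples, k, seed=1):
    """Yield (train, validation) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random split without replacement
    folds = [indices[i::k] for i in range(k)]   # k disjoint, near-equal folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Count how often each of 10 toy samples lands in the held-out fold
validation_counts = [0] * 10
for train, validation in kfold_indices(n_samples=10, k=5):
    for idx in validation:
        validation_counts[idx] += 1
```

With 10 samples and k=5, each round trains on 8 samples and validates on the remaining 2.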

### StratifiedKFold iterator in scikit-learn

In [12]:
import numpy as np
from sklearn.model_selection import StratifiedKFold

In [28]:
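# Note: random_state only takes effect when shuffle=True; recent scikit-learn
# releases raise an error if it is set while shuffle is False.
kfold = StratifiedKFold(n_splits=10, random_state=1)
kfold.get_n_splits(X, y)
scores = []
print(kfold)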


StratifiedKFold(n_splits=10, random_state=1, shuffle=False)


In [31]:
# enumerate yields (fold index, (train indices, test indices)); the folds are
# built from the training data because they index into X_train and y_train
for k, (train, test) in enumerate(kfold.split(X_train, y_train)):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print('Fold: %s, Class dist: %s, Acc: %.3f' % (k + 1,
                                                   np.bincount(y_train[train]),
                                                   score))