# Doing Machine Learning in scikit-learn

Now we'll take a look at how to run the whole machine learning pipeline in Python with scikit-learn and pandas.

We'll use Homework 3 from 15.680 as the guideline for this section.

## Classification: MAGIC Gamma Telescope
https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope

The data represents the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. A Cherenkov gamma telescope observes high-energy gamma rays, taking advantage of the radiation emitted by charged particles produced inside the electromagnetic showers that are initiated by the gammas and develop in the atmosphere. This Cherenkov radiation (of visible to UV wavelengths) leaks through the atmosphere and gets recorded in the detector, allowing reconstruction of the shower parameters. The available information consists of pulses left by the incoming Cherenkov photons on the photomultiplier tubes, arranged in a plane, the camera. Depending on the energy of the primary gamma, a total of a few hundred to some 10,000 Cherenkov photons get collected, in patterns (called the shower image) that make it possible to discriminate statistically between those caused by primary gammas (signal) and the images of hadronic showers initiated by cosmic rays in the upper atmosphere (background). The aim is to build a model that can distinguish between the signal and background cases.

Attribute Information:
1. fLength: continuous; major axis of ellipse [mm]
2. fWidth: continuous; minor axis of ellipse [mm]
3. fSize: continuous; 10-log of sum of content of all pixels [in #phot]
4. fConc: continuous; ratio of sum of two highest pixels over fSize [ratio]
5. fConc1: continuous; ratio of highest pixel over fSize [ratio]
6. fAsym: continuous; distance from highest pixel to center, projected onto major axis [mm] 
7. fM3Long: continuous; 3rd root of third moment along major axis [mm]
8. fM3Trans: continuous; 3rd root of third moment along minor axis [mm]
9. fAlpha: continuous; angle of major axis with vector to origin [deg]
10. fDist: continuous; distance from origin to center of ellipse [mm] 
11. class: g,h; gamma (signal), hadron (background)


First, let's load in the data:

In [4]:
import pandas as pd
df = pd.read_csv("magic04.csv", header=None)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.0110,-8.2027,40.0920,81.8828,g
1,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.2610,g
2,162.0520,136.0310,4.0612,0.0374,0.0187,116.7410,-64.8580,-45.2160,76.9600,256.7880,g
3,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.4490,116.7370,g
4,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.6480,356.4620,g
5,51.6240,21.1502,2.9085,0.2420,0.1340,50.8761,43.1887,9.8145,3.6130,238.0980,g
6,48.2468,17.3565,3.0332,0.2529,0.1515,8.5730,38.0957,10.5868,4.7920,219.0870,g
7,26.7897,13.7595,2.5521,0.4236,0.2174,29.6339,20.4560,-2.9292,0.8120,237.1340,g
8,96.2327,46.5165,4.1540,0.0779,0.0390,110.3550,85.0486,43.1844,4.8540,248.2260,g
9,46.7619,15.1993,2.5786,0.3377,0.1913,24.7548,43.8771,-6.6812,7.8750,102.2510,g
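
Because the file has no header row, the columns come out numbered 0 to 10. As a minimal sketch (not part of the original analysis), the names from the attribute list can be attached at load time with the `names` argument. The two rows below are copied from the output above and stand in for `magic04.csv`; the variable names `columns`, `sample`, and `named_df` are only for illustration.

```python
import io
import pandas as pd

# Column names taken from the attribute list above.
columns = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
           "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

# Two rows copied from the output above stand in for magic04.csv.
sample = io.StringIO(
    "28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.0110,-8.2027,40.0920,81.8828,g\n"
    "31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.2610,g\n"
)
named_df = pd.read_csv(sample, header=None, names=columns)
print(named_df[["fLength", "fAlpha", "class"]])
```

The rest of this section keeps the numeric column labels, so the positional `iloc` indexing used below works either way.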


Extract the X and y

In [16]:
X = df.iloc[:, :-1]
X

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.0110,-8.2027,40.0920,81.8828
1,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.2610
2,162.0520,136.0310,4.0612,0.0374,0.0187,116.7410,-64.8580,-45.2160,76.9600,256.7880
3,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.4490,116.7370
4,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.6480,356.4620
5,51.6240,21.1502,2.9085,0.2420,0.1340,50.8761,43.1887,9.8145,3.6130,238.0980
6,48.2468,17.3565,3.0332,0.2529,0.1515,8.5730,38.0957,10.5868,4.7920,219.0870
7,26.7897,13.7595,2.5521,0.4236,0.2174,29.6339,20.4560,-2.9292,0.8120,237.1340
8,96.2327,46.5165,4.1540,0.0779,0.0390,110.3550,85.0486,43.1844,4.8540,248.2260
9,46.7619,15.1993,2.5786,0.3377,0.1913,24.7548,43.8771,-6.6812,7.8750,102.2510


In [23]:
y_orig = df.iloc[:, -1]
y_orig

0        g
1        g
2        g
3        g
4        g
5        g
6        g
7        g
8        g
9        g
10       g
11       g
12       g
13       g
14       g
15       g
16       g
17       g
18       g
19       g
20       g
21       g
22       g
23       g
24       g
25       g
26       g
27       g
28       g
29       g
        ..
18990    h
18991    h
18992    h
18993    h
18994    h
18995    h
18996    h
18997    h
18998    h
18999    h
19000    h
19001    h
19002    h
19003    h
19004    h
19005    h
19006    h
19007    h
19008    h
19009    h
19010    h
19011    h
19012    h
19013    h
19014    h
19015    h
19016    h
19017    h
19018    h
19019    h
Name: 10, Length: 19020, dtype: object

The labels in `y` are `g` and `h`. Let's transform them to 0/1 labels. Luckily, sklearn has a simple utility for this:

In [24]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(y_orig)
le.transform(y_orig)

array([0, 0, 0, ..., 1, 1, 1])

In [25]:
y = le.transform(y_orig)
y

array([0, 0, 0, ..., 1, 1, 1])
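
To see which label ended up as which integer, the fitted encoder can be inspected. The sketch below uses a short toy list rather than the real column, and the names `labels`, `encoder`, and `codes` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["g", "h", "h", "g"]  # toy labels, not the real data
encoder = LabelEncoder()
codes = encoder.fit_transform(labels)  # fit and transform in one step

print(encoder.classes_)                  # classes in sorted order; position = integer code
print(codes)
print(encoder.inverse_transform(codes))  # map the codes back to the original labels
```

In the data above, `g` (gamma, signal) maps to 0 and `h` (hadron, background) maps to 1, so `predict_proba(...)[:, 1]` in the later cells is the predicted probability of the hadron class.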

Next, we need to split the data into training, validation, and test sets. Again, sklearn has a function for this; it's basically the same as `splitobs` from MLDataUtils.jl:

In [37]:
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25)

In [38]:
train_X.shape

(14265, 10)

In [39]:
test_X.shape

(4755, 10)

We also need to split the training data to get the actual training set and the validation set:

In [40]:
tr_X, vl_X, tr_y, vl_y = train_test_split(train_X, train_y, test_size=0.33)
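
The two calls together give roughly a 50/25/25 split into training, validation, and test data. The sketch below checks this on synthetic data. The `random_state` and `stratify` arguments are additions that the cells above do not use: the first makes the split reproducible, the second keeps the class proportions similar in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 10 features, binary labels.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 10))
y_demo = rng.randint(0, 2, size=1000)

# First split off the test set, then carve a validation set out of the rest.
train_X_d, test_X_d, train_y_d, test_y_d = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)
tr_X_d, vl_X_d, tr_y_d, vl_y_d = train_test_split(
    train_X_d, train_y_d, test_size=0.33, random_state=42, stratify=train_y_d)

# Report the size of each part and its share of the full dataset.
for name, part in [("train", tr_X_d), ("validation", vl_X_d), ("test", test_X_d)]:
    print(name, len(part), round(len(part) / len(X_demo), 2))
```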

Now we can start running our methods!

### Logistic Regression

In [41]:
from sklearn.linear_model import LogisticRegression
m = LogisticRegression()
m.fit(tr_X, tr_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [47]:
from sklearn.metrics import accuracy_score, roc_auc_score
print(accuracy_score(vl_y, m.predict(vl_X)))
print(roc_auc_score(vl_y, m.predict_proba(vl_X)[:, 1]))

0.78950722175
0.837642002012


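This logistic regression is regularized, so let's try it out with different regularization types and amounts. In sklearn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization: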

In [62]:
best_score = 0
for penalty in ['l1', 'l2']:
    for C in [1e-2, 1e0, 1e2, 1e4, 1e6, 1e8]:
        # liblinear (the default solver in the output above) supports both penalties
        m = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
        m.fit(tr_X, tr_y)
        score = roc_auc_score(vl_y, m.predict_proba(vl_X)[:, 1])
        print(penalty, '\t', str(C).rjust(12), '\t', score)
        if score > best_score:
            best_score = score
            best_penalty = penalty
            best_C = C
print(best_penalty, best_C)

l1 	        0.01 	0.821503547344
l1 	         1.0 	0.838295977831
l1 	       100.0 	0.838374067447
l1 	     10000.0 	0.838356472571
l1 	   1000000.0 	0.838383359123
l1 	 100000000.0 	0.838346785505
l2 	        0.01 	0.822480359473
l2 	         1.0 	0.837654852202
l2 	       100.0 	0.838627315036
l2 	     10000.0 	0.838640956007
l2 	   1000000.0 	0.838348169372
l2 	 100000000.0 	0.838642735264
l2 100000000.0


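Now train the final model on the full training set using the best parameters:

In [64]:
# Use the selected parameters, not the last values of the loop variables
m = LogisticRegression(penalty=best_penalty, C=best_C, solver='liblinear')
m.fit(train_X, train_y)
print(accuracy_score(test_y, m.predict(test_X)))
print(roc_auc_score(test_y, m.predict_proba(test_X)[:, 1]))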

0.78738170347
0.829876376149


We can also try cross validation to select the parameters. There's a really nice interface for this in scikit-learn:

In [70]:
from sklearn.model_selection import GridSearchCV
params = {
    'penalty': ['l1', 'l2'],
    'C': [1e-2, 1e0, 1e2, 1e4, 1e6, 1e8]
}
# liblinear handles both the l1 and l2 penalties in the grid
model = GridSearchCV(LogisticRegression(solver='liblinear'), params, cv=5, scoring='roc_auc')
model.fit(train_X, train_y)

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.01, 1.0, 100.0, 10000.0, 1000000.0, 100000000.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='roc_auc', verbose=0)

In [71]:
model.best_params_

{'C': 10000.0, 'penalty': 'l1'}
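
Beyond `best_params_`, the fitted search object also records the cross-validated score of every combination it tried. The following sketch runs a smaller grid on synthetic data from `make_classification`, so it is self-contained. The data and the names `X_demo`, `y_demo`, and `search` are assumptions for illustration, and the numbers it prints will not match the MAGIC results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data so the snippet runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

params = {'penalty': ['l1', 'l2'], 'C': [1e-2, 1e0, 1e2]}
search = GridSearchCV(LogisticRegression(solver='liblinear'), params,
                      cv=5, scoring='roc_auc')
search.fit(X_demo, y_demo)

# Mean cross-validated AUC for every parameter combination.
results = search.cv_results_
for p, s in zip(results['params'], results['mean_test_score']):
    print(p, round(s, 4))

# The winning combination and its mean cross-validated AUC.
print(search.best_params_, round(search.best_score_, 4))
```

Since `refit=True` by default, the search object retrains the best combination on all of the data passed to `fit`, which is why it can be used directly for prediction.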

And now we can get performance on the test set:

In [73]:
print(accuracy_score(test_y, model.predict(test_X)))
print(roc_auc_score(test_y, model.predict_proba(test_X)[:, 1]))

0.788433228181
0.829887201693


#### Tuning via regularization paths

You've probably found that tuning the regularization penalties in our various regression models is tedious: we don't know where to start the values, how far apart to space them, and so on.

There's another way of tuning these parameters that [some methods support](http://scikit-learn.org/stable/modules/grid_search.html#model-specific-cross-validation). Where possible it's usually easier to use these approaches.

The core idea is that it turns out to be possible to train the model for **all** values of the regularization penalty at once, for not much more cost than training the model once normally (or sometimes even less cost than training it once!). This is called finding the regularization path of the model, and from there we are able to identify the best parameter value without having to worry about specifying a range of possible values.

Let's see how it works!

In [77]:
from sklearn.linear_model import LogisticRegressionCV
m = LogisticRegressionCV(scoring="roc_auc")
m.fit(train_X, train_y)

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring='roc_auc', solver='lbfgs', tol=0.0001,
           verbose=0)

We can then inspect the best parameter and the validation scores, and calculate the test performance:

In [84]:
# The best value of C
m.C_

array([ 2.7825594])

In [85]:
# The different values tried for C
m.Cs_

array([  1.00000000e-04,   7.74263683e-04,   5.99484250e-03,
         4.64158883e-02,   3.59381366e-01,   2.78255940e+00,
         2.15443469e+01,   1.66810054e+02,   1.29154967e+03,
         1.00000000e+04])

In [86]:
# The scores for C in each fold
m.scores_

{1: array([[ 0.82440731,  0.82488045,  0.82749616,  0.83552776,  0.84123524,
          0.84123504,  0.84123427,  0.84123621,  0.84123465,  0.8412366 ],
        [ 0.83124579,  0.83184104,  0.83454616,  0.8425252 ,  0.84874083,
          0.84874103,  0.84874064,  0.84874083,  0.84874025,  0.84874064],
        [ 0.81853581,  0.81895535,  0.82150081,  0.82896987,  0.83440785,
          0.83529961,  0.83516785,  0.83516824,  0.83516747,  0.83516805]])}
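
The three outputs above fit together: with an integer `Cs`, the candidate grid is logarithmically spaced between 1e-4 and 1e4, and with `refit=True` the reported `C_` is the candidate with the highest score averaged over the folds. The sketch below reproduces that selection with plain NumPy, using the per-fold scores copied from the `m.scores_` output above. The names `Cs`, `fold_scores`, and `best` are illustrative.

```python
import numpy as np

# With Cs=10, the candidate grid is log-spaced between 1e-4 and 1e4.
Cs = np.logspace(-4, 4, 10)

# Per-fold AUC scores copied from the m.scores_ output above (3 folds x 10 Cs).
fold_scores = np.array([
    [0.82440731, 0.82488045, 0.82749616, 0.83552776, 0.84123524,
     0.84123504, 0.84123427, 0.84123621, 0.84123465, 0.8412366],
    [0.83124579, 0.83184104, 0.83454616, 0.8425252, 0.84874083,
     0.84874103, 0.84874064, 0.84874083, 0.84874025, 0.84874064],
    [0.81853581, 0.81895535, 0.82150081, 0.82896987, 0.83440785,
     0.83529961, 0.83516785, 0.83516824, 0.83516747, 0.83516805],
])

mean_scores = fold_scores.mean(axis=0)  # average each candidate over the folds
best = mean_scores.argmax()             # position of the best candidate
print(best, Cs[best])
```

The selected value agrees with the `m.C_` output shown earlier.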

In [79]:
print(accuracy_score(test_y, m.predict(test_X)))
print(roc_auc_score(test_y, m.predict_proba(test_X)[:, 1]))

0.786330178759
0.829772566911


## Feature normalization

We've seen that feature normalization can help, especially for methods based on linear regression. This is easy to do in sklearn by using a `Pipeline` to compose steps in the model-building process.

We'll use it to build a model that standardizes the data before fitting (zeroing the mean and setting unit variance). The main advantage of the pipeline approach is that we don't have to worry about applying the same transformation to our datasets, or de-transforming results.

In [90]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(StandardScaler(), LogisticRegressionCV(scoring="roc_auc"))
pipeline.fit(train_X, train_y)

Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregressioncv', LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring='roc_auc', solver='lbfgs', tol=0.0001,
           verbose=0))])

In [91]:
print(accuracy_score(test_y, pipeline.predict(test_X)))
print(roc_auc_score(test_y, pipeline.predict_proba(test_X)[:, 1]))

0.788643533123
0.830109318667
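
To make concrete what the pipeline takes care of, the sketch below compares it with the equivalent manual steps: the scaler is fitted on the training data only, and the same training statistics are then applied to the test data. It uses synthetic data and a plain `LogisticRegression` instead of `LogisticRegressionCV` to keep it quick; all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Pipeline version: scaling and fitting happen in a single call.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# Manual version: fit the scaler on the training data only, then reuse
# the training mean and variance when transforming the test data.
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression().fit(scaler.transform(X_tr), y_tr)

same = np.allclose(pipe.predict_proba(X_te),
                   clf.predict_proba(scaler.transform(X_te)))
print(same)
```

The manual version is easy to get wrong, for example by refitting the scaler on the test data, which is the bookkeeping the pipeline removes.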


## Other methods

We can use the same generic approach to fit our other classification models.

### CART

In [96]:
from sklearn.tree import DecisionTreeClassifier
params = {'min_samples_leaf': [10, 50, 100], 'max_depth': range(1, 11)}
model = GridSearchCV(DecisionTreeClassifier(), params)
model.fit(train_X, train_y)
print(accuracy_score(test_y, model.predict(test_X)))
print(roc_auc_score(test_y, model.predict_proba(test_X)[:, 1]))

0.840378548896
0.876541335215
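
Note that the grid search above does not pass `scoring`, so candidates are ranked by the estimator's own `score` method, which is accuracy for classifiers, rather than by AUC as in the logistic regression search. A minimal sketch of the difference on synthetic data follows. The smaller grid, `cv=5`, and `random_state=0` are choices made for this example and are not taken from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
params = {'min_samples_leaf': [10, 50], 'max_depth': [2, 4, 6]}

# Default: candidates are ranked by the classifier's own score (accuracy).
by_accuracy = GridSearchCV(DecisionTreeClassifier(random_state=0), params, cv=5)
by_accuracy.fit(X_demo, y_demo)

# Explicit scoring: candidates are ranked by cross-validated AUC instead.
by_auc = GridSearchCV(DecisionTreeClassifier(random_state=0), params,
                      cv=5, scoring='roc_auc')
by_auc.fit(X_demo, y_demo)

print(by_accuracy.best_params_, by_accuracy.best_score_)
print(by_auc.best_params_, by_auc.best_score_)
```

The two searches may or may not pick the same parameters; which metric to tune on depends on which one matters for the final comparison.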


### Random Forests

In [97]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=1000)
model.fit(train_X, train_y)
print(accuracy_score(test_y, model.predict(test_X)))
print(roc_auc_score(test_y, model.predict_proba(test_X)[:, 1]))

0.874658254469
0.926227781053


### Boosting

In [98]:
from sklearn.ensemble import GradientBoostingClassifier
params = {'learning_rate': [0.1, 0.2, 0.3], 'n_estimators': [50, 100, 500]}
model = GridSearchCV(GradientBoostingClassifier(), params)
model.fit(train_X, train_y)
print(accuracy_score(test_y, model.predict(test_X)))
print(roc_auc_score(test_y, model.predict_proba(test_X)[:, 1]))

0.871924290221
0.923541886158


## Regression: Parkinsons Telemonitoring
https://archive.ics.uci.edu/ml/datasets/parkinsons+telemonitoring

This dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson’s disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patients’ homes.

The columns in the table are subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recordings from these individuals. The main aim is to predict the motor UPDRS scores (`motor_UPDRS`) from the 16 voice measures.

### Exercise

Conduct the same comparison of methods on the Parkinsons dataset. You can and should refer to the examples above, and to your code from Homework 3 if needed.

#### Read in and prepare the data

In [99]:
df = pd.read_csv("parkinsons_updrs.csv")
df

Unnamed: 0,subject#,age,sex,test_time,motor_UPDRS,total_UPDRS,Jitter(%),Jitter(Abs),Jitter:RAP,Jitter:PPQ5,...,Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,Shimmer:APQ11,Shimmer:DDA,NHR,HNR,RPDE,DFA,PPE
0,1,72,0,5.64310,28.199,34.398,0.00662,0.000034,0.00401,0.00317,...,0.230,0.01438,0.01309,0.01662,0.04314,0.014290,21.640,0.41888,0.54842,0.160060
1,1,72,0,12.66600,28.447,34.894,0.00300,0.000017,0.00132,0.00150,...,0.179,0.00994,0.01072,0.01689,0.02982,0.011112,27.183,0.43493,0.56477,0.108100
2,1,72,0,19.68100,28.695,35.389,0.00481,0.000025,0.00205,0.00208,...,0.181,0.00734,0.00844,0.01458,0.02202,0.020220,23.047,0.46222,0.54405,0.210140
3,1,72,0,25.64700,28.905,35.810,0.00528,0.000027,0.00191,0.00264,...,0.327,0.01106,0.01265,0.01963,0.03317,0.027837,24.445,0.48730,0.57794,0.332770
4,1,72,0,33.64200,29.187,36.375,0.00335,0.000020,0.00093,0.00130,...,0.176,0.00679,0.00929,0.01819,0.02036,0.011625,26.126,0.47188,0.56122,0.193610
5,1,72,0,40.65200,29.435,36.870,0.00353,0.000023,0.00119,0.00159,...,0.214,0.01006,0.01337,0.02263,0.03019,0.009438,22.946,0.53949,0.57243,0.195000
6,1,72,0,47.64900,29.682,37.363,0.00422,0.000024,0.00212,0.00221,...,0.445,0.02376,0.02621,0.03488,0.07128,0.013260,22.506,0.49250,0.54779,0.175630
7,1,72,0,54.64000,29.928,37.857,0.00476,0.000025,0.00226,0.00259,...,0.212,0.00979,0.01462,0.01911,0.02937,0.027969,22.929,0.47712,0.54234,0.238440
8,1,72,0,61.66900,30.177,38.353,0.00432,0.000029,0.00156,0.00207,...,0.371,0.01774,0.02134,0.03451,0.05323,0.013381,22.078,0.51563,0.61864,0.200370
9,1,72,0,68.68800,30.424,38.849,0.00496,0.000027,0.00258,0.00253,...,0.310,0.02030,0.01970,0.02569,0.06089,0.018021,22.606,0.50032,0.58673,0.201170


In [102]:
X = df.iloc[:, 6:]
X

Unnamed: 0,Jitter(%),Jitter(Abs),Jitter:RAP,Jitter:PPQ5,Jitter:DDP,Shimmer,Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,Shimmer:APQ11,Shimmer:DDA,NHR,HNR,RPDE,DFA,PPE
0,0.00662,0.000034,0.00401,0.00317,0.01204,0.02565,0.230,0.01438,0.01309,0.01662,0.04314,0.014290,21.640,0.41888,0.54842,0.160060
1,0.00300,0.000017,0.00132,0.00150,0.00395,0.02024,0.179,0.00994,0.01072,0.01689,0.02982,0.011112,27.183,0.43493,0.56477,0.108100
2,0.00481,0.000025,0.00205,0.00208,0.00616,0.01675,0.181,0.00734,0.00844,0.01458,0.02202,0.020220,23.047,0.46222,0.54405,0.210140
3,0.00528,0.000027,0.00191,0.00264,0.00573,0.02309,0.327,0.01106,0.01265,0.01963,0.03317,0.027837,24.445,0.48730,0.57794,0.332770
4,0.00335,0.000020,0.00093,0.00130,0.00278,0.01703,0.176,0.00679,0.00929,0.01819,0.02036,0.011625,26.126,0.47188,0.56122,0.193610
5,0.00353,0.000023,0.00119,0.00159,0.00357,0.02227,0.214,0.01006,0.01337,0.02263,0.03019,0.009438,22.946,0.53949,0.57243,0.195000
6,0.00422,0.000024,0.00212,0.00221,0.00637,0.04352,0.445,0.02376,0.02621,0.03488,0.07128,0.013260,22.506,0.49250,0.54779,0.175630
7,0.00476,0.000025,0.00226,0.00259,0.00678,0.02191,0.212,0.00979,0.01462,0.01911,0.02937,0.027969,22.929,0.47712,0.54234,0.238440
8,0.00432,0.000029,0.00156,0.00207,0.00468,0.04296,0.371,0.01774,0.02134,0.03451,0.05323,0.013381,22.078,0.51563,0.61864,0.200370
9,0.00496,0.000027,0.00258,0.00253,0.00773,0.03610,0.310,0.02030,0.01970,0.02569,0.06089,0.018021,22.606,0.50032,0.58673,0.201170


In [105]:
y = df.loc[:, 'motor_UPDRS']
y

0       28.199
1       28.447
2       28.695
3       28.905
4       29.187
5       29.435
6       29.682
7       29.928
8       30.177
9       30.424
10      30.670
11      30.917
12      31.309
13      31.776
14      32.243
15      32.710
16      33.178
17      33.643
18      34.109
19      34.646
20      35.043
21      35.509
22      35.976
23      36.977
24      28.199
25      28.447
26      28.695
27      28.905
28      29.187
29      29.435
         ...  
5845    22.485
5846    21.988
5847    21.495
5848    21.007
5849    20.513
5850    19.725
5851    20.026
5852    20.627
5853    21.078
5854    21.533
5855    21.977
5856    22.437
5857    22.880
5858    23.339
5859    23.791
5860    24.242
5861    25.147
5862    25.598
5863    25.961
5864    25.452
5865    25.029
5866    24.401
5867    23.979
5868    23.482
5869    22.908
5870    22.485
5871    21.988
5872    21.495
5873    21.007
5874    20.513
Name: motor_UPDRS, Length: 5875, dtype: float64

In [106]:
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25)

#### Linear regression

In [108]:
from sklearn.linear_model import LinearRegression
m = LinearRegression()
m.fit(train_X, train_y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [109]:
from sklearn.metrics import r2_score
r2_score(test_y, m.predict(test_X))

0.098600732668098989
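
As a reminder of what this number means, `r2_score` computes 1 minus the ratio of the residual sum of squares to the total sum of squares, so a value near 0.1 says the model explains only about 10% of the variance in the test targets. The sketch below checks the formula against sklearn on a few toy values that are not taken from the Parkinsons data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values, not taken from the Parkinsons data.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
manual_r2 = 1 - ss_res / ss_tot

# Both values should agree.
print(manual_r2, r2_score(y_true, y_pred))
```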

Let's look at the coefficients

In [110]:
m.coef_

array([  4.02078143e+02,  -4.59155891e+04,  -3.27941989e+04,
        -2.25909059e+02,   1.08972966e+04,   1.33486557e+02,
        -3.89474932e+00,  -2.74772299e+04,  -1.96108601e+02,
         9.64863006e+01,   9.12272768e+03,  -2.12301869e+01,
        -4.18096428e-01,   8.12305406e-01,  -2.98617050e+01,
         2.09281027e+01])

#### Ridge regression

In [1]:
from sklearn.linear_model import RidgeCV
m = RidgeCV()
m.fit(train_X, train_y)
print(r2_score(test_y, m.predict(test_X)))
print(m.coef_)
print(m.alpha_)


#### Lasso regression

In [115]:
from sklearn.linear_model import LassoCV
m = LassoCV()
m.fit(train_X, train_y)
print(r2_score(test_y, m.predict(test_X)))
print(m.coef_)
print(m.alpha_)

0.0755436411871
[  0.          -0.          -0.          -0.          -0.          -0.          -0.
  -0.          -0.           0.          -7.59851766 -19.55508163
  -0.3779478    0.         -28.58519583  18.11505882]
0.00551437772584
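
The lasso sets most coefficients to exactly zero, so it doubles as a feature selector. The sketch below pairs the coefficients printed above with the column names of `X` to list the features that were kept. The values are copied from the outputs in this section, and the names `feature_names`, `coef`, and `selected` are illustrative.

```python
import numpy as np

# Column names of X, in order, as shown in the table above.
feature_names = ["Jitter(%)", "Jitter(Abs)", "Jitter:RAP", "Jitter:PPQ5",
                 "Jitter:DDP", "Shimmer", "Shimmer(dB)", "Shimmer:APQ3",
                 "Shimmer:APQ5", "Shimmer:APQ11", "Shimmer:DDA", "NHR",
                 "HNR", "RPDE", "DFA", "PPE"]

# Coefficients copied from the m.coef_ output above.
coef = np.array([0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0,
                 -0.0, -0.0, 0.0, -7.59851766, -19.55508163,
                 -0.3779478, 0.0, -28.58519583, 18.11505882])

# Keep only the features whose coefficient is not exactly zero.
selected = [name for name, c in zip(feature_names, coef) if c != 0]
print(len(selected), selected)
```

With a fitted model in hand, the same listing can be produced from `m.coef_` and `train_X.columns` instead of the hard-coded values.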


#### CART

In [116]:
from sklearn.tree import DecisionTreeRegressor
params = {'min_samples_leaf': [10, 50, 100], 'max_depth': range(1, 11)}
model = GridSearchCV(DecisionTreeRegressor(), params)
model.fit(train_X, train_y)
print(r2_score(test_y, model.predict(test_X)))

0.175867737764


#### Random Forests

In [117]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=1000)
model.fit(train_X, train_y)
print(r2_score(test_y, model.predict(test_X)))

0.355436431572


#### Boosting

In [118]:
from sklearn.ensemble import GradientBoostingRegressor
params = {'learning_rate': [0.1, 0.2, 0.3], 'n_estimators': [50, 100, 500]}
model = GridSearchCV(GradientBoostingRegressor(), params)
model.fit(train_X, train_y)
print(r2_score(test_y, model.predict(test_X)))

0.297177207245
