# Preprocessing Continuous Variables

This tutorial presents several methods for preprocessing continuous variables.

## Recap on UCI Breast Cancer Dataset (breast.data)

* Easy dataset to start off with
* Dataset contains all continuous variables, except one ID column and one label (M, B) column
    * The continuous variables are statistics collected from a tumor's biopsy
    * More information can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names)
* Goal of the dataset is to classify whether a tumor is malignant (M) or benign (B)

In [1]:
prefix = "../datasets/"
import pandas as pd

df = pd.read_csv(prefix + "breast.data", header=None)

In [2]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [3]:
# Drop the ID column (column 0); it is an identifier, not a feature
df.drop(0, axis=1, inplace=True)

# Shuffling dataset
import numpy as np
perm = np.random.permutation(len(df))
df = df.iloc[perm]

# Creating features and response variable set
y = df[1]
X = df.drop(1, axis=1)

In [4]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [5]:
# Note: scikit-learn 0.22+ defaults to gamma="scale"; the output below came from
# an older release, so pass gamma="auto" to SVC to mimic that default
predictions = SVC().fit(X_train, y_train).predict(X_test)
non_scale_accuracy = accuracy_score(y_test, predictions)
print("Accuracy of SVM: ", non_scale_accuracy)

predictions = LogisticRegression().fit(X_train, y_train).predict(X_test)
non_scale_accuracy = accuracy_score(y_test, predictions)
print("Accuracy of Logistic Regression: ", non_scale_accuracy)

Accuracy of SVM:  0.713286713287
Accuracy of Logistic Regression:  0.944055944056


## Improving the classification rate by feature engineering

The support vector machine is not giving us much bang for our buck: it trails logistic regression by a wide margin (about 71% versus 94% accuracy). The reason is that we have not preprocessed the continuous features.

There are several ways to improve model accuracy with continuous features:

* Feature scaling
    * Standard scaling: transform each continuous feature so that its mean is $\mu = 0$ and its standard deviation is $\sigma = 1$
    * Simple (min-max) scaling: scale each continuous feature into the range $[0, 1]$ or $[-1, 1]$
* Univariate feature selection
    * Univariate feature selection using $p$-values
    * Correlation-based feature selection using Spearman's rho or Kendall's tau

## Part 1: Feature scaling

* Continuous features can take values over very different ranges, so we need a way to shrink (or inflate) them onto a common scale.
* Scaling reduces the variation in magnitude between the features of the dataset.
* **Standard scaling** applies the following formula to transform a feature into a space with mean 0 and standard deviation 1. This is also called "standardizing" the feature.

    Given the $i$th continuous feature $X_i$, we apply the following formula for each $x \in X_i$:
    $$x' = \frac{x - \bar{X_i}}{\sigma_{X_i}}$$
    where $\bar{X_i}$ is the mean of feature $X_i$ and $\sigma_{X_i}$ is its standard deviation. The transformed feature, made up of the values $x'$, has mean 0 and standard deviation 1.
* **Min-max scaling** applies the following formula to shrink (or inflate) a feature into a given interval. To make a feature lie within the interval $[0, 1]$, use:
    $$x' = \frac{x - \min(X_i)}{\max(X_i) - \min(X_i)}$$

* More information on [Wikipedia](https://en.wikipedia.org/wiki/Feature_scaling)
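
Both formulas above are easy to check by hand. The following is a minimal sketch, assuming a small made-up feature matrix; the values and the names `X_toy`, `standard_scale`, and `min_max_scale` are illustrative and are not part of the breast cancer dataset. It uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows), 2 features (columns) on very different scales.
X_toy = np.array([
    [10.0, 0.1],
    [20.0, 0.2],
    [30.0, 0.4],
    [40.0, 0.5],
])

def standard_scale(X):
    """Apply x' = (x - mean) / std to each column."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    return (X - mean) / std

def min_max_scale(X):
    """Apply x' = (x - min) / (max - min) to each column, mapping it onto [0, 1]."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

X_std = standard_scale(X_toy)
X_mm = min_max_scale(X_toy)

print("Column means after standard scaling:", X_std.mean(axis=0))
print("Column std devs after standard scaling:", X_std.std(axis=0))
print("Column minima after min-max scaling:", X_mm.min(axis=0))
print("Column maxima after min-max scaling:", X_mm.max(axis=0))
```

Neither helper guards against a constant column, where the denominator would be zero; a real implementation would need to handle that case.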

In [6]:
X.head()

Unnamed: 0,2,3,4,5,6,7,8,9,10,11,...,22,23,24,25,26,27,28,29,30,31
152,9.731,15.34,63.78,300.2,0.1072,0.1599,0.4108,0.07857,0.2548,0.09296,...,11.02,19.49,71.04,380.5,0.1292,0.2772,0.8216,0.1571,0.3108,0.1259
26,14.58,21.53,97.41,644.8,0.1054,0.1868,0.1425,0.08783,0.2252,0.06924,...,17.62,33.21,122.4,896.9,0.1525,0.6643,0.5539,0.2701,0.4264,0.1275
184,15.28,22.41,98.92,710.6,0.09057,0.1052,0.05375,0.03263,0.1727,0.06317,...,17.8,28.03,113.8,973.1,0.1301,0.3299,0.363,0.1226,0.3175,0.09772
38,14.99,25.2,95.54,698.8,0.09387,0.05131,0.02398,0.02899,0.1565,0.05504,...,14.99,25.2,95.54,698.8,0.09387,0.05131,0.02398,0.02899,0.1565,0.05504
482,13.47,14.06,87.32,546.3,0.1071,0.1155,0.05786,0.05266,0.1779,0.06639,...,14.83,18.32,94.94,660.2,0.1393,0.2499,0.1848,0.1335,0.3227,0.09326


In [7]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# I want to show that SVMs are sensitive to feature scaling.
# In particular, sklearn.svm.SVC uses the RBF kernel by default, and this
# kernel is sensitive to scaling.
#
# More information on how to properly train an SVM is here:
#    http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

for Scaler in [StandardScaler, MinMaxScaler]:
    
    # "Scaler" is the class itself; calling it creates a new scaler instance
    scaler = Scaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    
    svm = SVC().fit(X_train_scaled, y_train)
    
    X_test_scaled = scaler.transform(X_test)  # Note we don't "refit" for testing data
    predictions = svm.predict(X_test_scaled)
    print("Accuracy of SVM using {0}: {1}".format(Scaler.__name__, accuracy_score(y_test, predictions)))
    

Accuracy of SVM using StandardScaler: 0.986013986014
Accuracy of SVM using MinMaxScaler: 0.958041958042
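
To see why the RBF kernel reacts so strongly to feature scale, recall that it compares two samples through their squared Euclidean distance, $k(a, b) = \exp(-\gamma \lVert a - b \rVert^2)$. A feature measured in the hundreds or thousands (like some of the tumor statistics above) dominates that distance, so small-valued features barely influence the kernel. The sketch below uses three made-up samples, an arbitrary `gamma`, and assumed feature ranges; the numbers and the helpers `rbf_kernel_value` and `scale` are illustrative assumptions, not values from the dataset or scikit-learn internals.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """RBF kernel: exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Hypothetical samples with two features: a large-scale one and a small-scale one.
a = np.array([1000.0, 0.10])
b = np.array([1000.0, 0.90])   # differs from a only in the small-scale feature
c = np.array([1200.0, 0.10])   # differs from a only in the large-scale feature

gamma = 0.5  # arbitrary value chosen for illustration

# Per-feature contribution to the squared distance.
print("Squared differences a vs b:", (a - b) ** 2)
print("Squared differences a vs c:", (a - c) ** 2)

print("raw    k(a, b) =", rbf_kernel_value(a, b, gamma))
print("raw    k(a, c) =", rbf_kernel_value(a, c, gamma))

# Assumed feature ranges, used to min-max scale each feature onto [0, 1].
feature_min = np.array([0.0, 0.0])
feature_max = np.array([2000.0, 1.0])

def scale(x):
    """Min-max scale a sample using the assumed feature ranges."""
    return (x - feature_min) / (feature_max - feature_min)

print("scaled k(a, b) =", rbf_kernel_value(scale(a), scale(b), gamma))
print("scaled k(a, c) =", rbf_kernel_value(scale(a), scale(c), gamma))
```

Before scaling, `a` and `c` look completely dissimilar to the kernel even though they differ by only a tenth of the first feature's assumed range, while `b` looks close to `a` despite differing across most of the second feature's range. After scaling, the ordering flips and both features contribute in proportion to their ranges.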


## Part 2: Feature Selection
    
We have all of these continuous attributes, but who is to say that all of them are useful? Feature selection aims to keep only the informative ones.
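
One way to probe that question is the correlation-based selection mentioned earlier: compute a rank correlation such as Spearman's rho between each feature and the class label, then rank features by its absolute value. The following is a rough sketch on synthetic data, assuming SciPy is installed and that the two class labels have been encoded as 0 and 1; the names `X_demo`, `y_demo`, and `ranking` are made up for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n_samples = 200
# Hypothetical binary labels (0/1), standing in for classes such as R and M.
y_demo = rng.integers(0, 2, size=n_samples)

# Feature 0 tracks the label plus noise; features 1 and 2 are pure noise.
X_demo = np.column_stack([
    y_demo + rng.normal(scale=0.5, size=n_samples),
    rng.normal(size=n_samples),
    rng.normal(size=n_samples),
])

# Spearman's rho between each feature column and the label.
rhos = []
for j in range(X_demo.shape[1]):
    rho, p_value = spearmanr(X_demo[:, j], y_demo)
    rhos.append(rho)
    print("feature {0}: rho = {1:.3f}, p-value = {2:.3g}".format(j, rho, p_value))

# Rank features from strongest to weakest absolute correlation.
ranking = np.argsort(-np.abs(rhos))
print("Features ranked by |rho|:", ranking)
```

A score like this looks at one feature at a time, so it can miss features that only help in combination with others.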

The full Scikit-Learn module on feature selection is presented [here](http://scikit-learn.org/stable/modules/feature_selection.html).
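
As a starting point, here is a hedged sketch of univariate feature selection using `SelectKBest` with the ANOVA F-test (`f_classif`) from that module. It runs on synthetic data from `make_classification` rather than on the tutorial's datasets, and the choice of `k=10` is arbitrary; treat the variable names and parameters as assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 60 features, only 5 of which carry signal.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=60, n_informative=5, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Score each feature independently with an ANOVA F-test and keep the top 10.
selector = SelectKBest(score_func=f_classif, k=10)
selector.fit(X_tr, y_tr)  # fit on the training data only

X_tr_selected = selector.transform(X_tr)
X_te_selected = selector.transform(X_te)  # reuse the columns chosen on the training data

print("Shape before selection:", X_tr.shape)
print("Shape after selection: ", X_tr_selected.shape)
print("Indices of kept features:", selector.get_support(indices=True))
print("Five smallest p-values:", sorted(selector.pvalues_)[:5])
```

As with the scalers, the selector is fit on the training split and only applied to the test split, so the test labels never influence which features are kept.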

## UCI Sonar Dataset

* The task is to train a classifier to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. (From website.)
* More dataset description [here](https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.names)
* In general, this is one of my favorite datasets because the classification task is difficult

In [60]:
df = pd.read_csv(prefix + "sonar.data", header=None)

In [61]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,51,52,53,54,55,56,57,58,59,60
0,0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,...,0.0027,0.0065,0.0159,0.0072,0.0167,0.018,0.0084,0.009,0.0032,R
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044,R
2,0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,...,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078,R
3,0.01,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,...,0.0121,0.0036,0.015,0.0085,0.0073,0.005,0.0044,0.004,0.0117,R
4,0.0762,0.0666,0.0481,0.0394,0.059,0.0649,0.1209,0.2467,0.3564,0.4459,...,0.0031,0.0054,0.0105,0.011,0.0015,0.0072,0.0048,0.0107,0.0094,R


In [62]:
# The rows are ordered by class label (R, M), so we have to shuffle them
import numpy as np
perm = np.random.permutation(len(df))
df = df.iloc[perm]

In [63]:
X = df.drop(60, axis=1)
y = df[60]  # Rock or mine class label

In [64]:
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
1,0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,...,0.0125,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044
89,0.0235,0.0291,0.0749,0.0519,0.0227,0.0834,0.0677,0.2002,0.2876,0.3674,...,0.0242,0.0083,0.0037,0.0095,0.0105,0.003,0.0132,0.0068,0.0108,0.009
131,0.115,0.1163,0.0866,0.0358,0.0232,0.1267,0.2417,0.2661,0.4346,0.5378,...,0.0228,0.0099,0.0065,0.0085,0.0166,0.011,0.019,0.0141,0.0068,0.0086
171,0.0179,0.0136,0.0408,0.0633,0.0596,0.0808,0.209,0.3465,0.5276,0.5965,...,0.0086,0.0123,0.006,0.0187,0.0111,0.0126,0.0081,0.0155,0.016,0.0085
27,0.0177,0.03,0.0288,0.0394,0.063,0.0526,0.0688,0.0633,0.0624,0.0613,...,0.0168,0.0102,0.0122,0.0044,0.0075,0.0124,0.0099,0.0057,0.0032,0.0019


In [65]:
# As a baseline, let's classify this with Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [66]:
X_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
194,0.0392,0.0108,0.0267,0.0257,0.041,0.0491,0.1053,0.169,0.2105,0.2471,...,0.0089,0.0083,0.008,0.0026,0.0079,0.0042,0.0071,0.0044,0.0022,0.0014
53,0.0293,0.0378,0.0257,0.0062,0.013,0.0612,0.0895,0.1107,0.0973,0.0751,...,0.0076,0.0065,0.0072,0.0108,0.0051,0.0102,0.0041,0.0055,0.005,0.0087
110,0.021,0.0121,0.0203,0.1036,0.1675,0.0418,0.0723,0.0828,0.0494,0.0686,...,0.0104,0.0117,0.0101,0.0061,0.0031,0.0099,0.008,0.0107,0.0161,0.0133
158,0.0107,0.0453,0.0289,0.0713,0.1075,0.1019,0.1606,0.2119,0.3061,0.2936,...,0.0079,0.0164,0.012,0.0113,0.0021,0.0097,0.0072,0.006,0.0017,0.0036
161,0.0305,0.0363,0.0214,0.0227,0.0456,0.0665,0.0939,0.0972,0.2535,0.3127,...,0.0271,0.02,0.007,0.007,0.0086,0.0089,0.0074,0.0042,0.0055,0.0021


In [67]:
predictions = LogisticRegression().fit(X_train, y_train).predict(X_test)
print("Accuracy of Logistic Regression: ", accuracy_score(y_test, predictions))

Accuracy of Logistic Regression:  0.711538461538


In [68]:
# Try scaling the features?

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Rebuild DataFrames from the scaled arrays (keeping the index and columns);
# assigning through .loc on a split result triggers a SettingWithCopyWarning
X_train = pd.DataFrame(scaler.fit_transform(X_train), index=X_train.index, columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)


In [69]:
X_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
194,0.435043,-0.819749,-0.439553,-0.565471,-0.574866,-0.893864,-0.265333,0.403488,0.28831,0.312113,...,-0.553657,-0.520404,-0.402723,-1.21003,-0.224223,-0.702133,-0.080804,-0.569264,-0.902686,-0.953516
53,0.011811,-0.017746,-0.466582,-0.987722,-1.094262,-0.688821,-0.5163,-0.27963,-0.690867,-0.97149,...,-0.657459,-0.696062,-0.518369,0.007615,-0.59381,0.385761,-0.593913,-0.407372,-0.474758,0.38211
110,-0.34302,-0.781134,-0.612539,1.121365,1.771687,-1.017567,-0.789503,-0.606543,-1.1052,-1.019999,...,-0.433887,-0.188607,-0.099152,-0.690303,-0.857801,0.331366,0.073129,0.357936,1.221673,1.223737
158,-0.783352,0.205033,-0.380089,0.421945,0.658697,0.000869,0.613049,0.90616,1.115247,0.659133,...,-0.633504,0.270055,0.175508,0.081862,-0.989796,0.295103,-0.0637,-0.333784,-0.979102,-0.550998
161,0.063112,-0.062302,-0.582807,-0.630433,-0.489537,-0.599009,-0.44641,-0.437814,0.660259,0.801673,...,0.899559,0.62137,-0.547281,-0.556659,-0.131826,0.15005,-0.029493,-0.598699,-0.398342,-0.825442


In [70]:
X_test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,50,51,52,53,54,55,56,57,58,59
13,-0.856028,-0.956387,-0.477394,-0.063102,0.885005,0.966774,0.273133,-0.420238,-0.706437,-0.117745,...,0.228844,-0.754614,-0.185886,1.284657,-0.211023,1.292339,1.407212,-0.436807,1.649602,0.656554
32,-0.407146,-0.507859,-1.004462,-0.710552,-0.74367,-0.758298,-0.343165,-0.794019,-0.934796,-1.351347,...,0.819712,0.201743,-0.489458,2.428055,1.412519,0.15005,0.329683,0.181327,-0.91797,-0.239962
81,-0.813277,-0.564296,-0.742279,-0.063102,0.220921,-0.016077,0.646405,0.849917,0.799524,0.699433,...,-0.290162,-0.061743,-0.503913,-0.452714,-0.277021,-0.375765,0.073129,-0.9372,-0.428908,-0.862035
33,0.648797,0.276322,-1.028788,0.136114,-0.819724,-0.576979,0.705176,0.169142,-0.690002,-0.585663,...,0.412492,0.660405,1.563262,0.408547,-0.541012,-0.140055,0.073129,0.328501,-0.337209,0.711442
183,-0.830378,0.059484,0.682157,0.36781,0.309961,-0.146559,-0.420996,0.930766,0.669774,0.671074,...,1.210962,0.982443,-0.431635,0.542191,0.976935,-1.246081,0.569134,-0.687003,-0.58174,-0.880331


In [71]:
predictions = LogisticRegression().fit(X_train, y_train).predict(X_test)
print("Accuracy of Logistic Regression: ", accuracy_score(y_test, predictions))

Accuracy of Logistic Regression:  0.769230769231
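
A single random train/test split gives a noisy accuracy estimate, so the size of the gain from scaling can vary from run to run. One possible way to make the comparison steadier is to wrap the scaler and the classifier in a pipeline and cross-validate it. The sketch below assumes synthetic data from `make_classification` as a stand-in for a small dataset with 60 continuous features; the 5-fold setting and `max_iter=1000` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small dataset with 60 continuous features.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=60, n_informative=10, random_state=0
)

# The pipeline refits the scaler on the training folds of every split,
# so the held-out fold never influences the scaling statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean accuracy: {0:.3f}".format(scores.mean()))
```

This mirrors the earlier rule of fitting the scaler on training data only, applied automatically inside each fold.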
