# Preprocessing Continuous Variables

This tutorial presents several methods for preprocessing continuous variables.

## Recap on UCI Breast Cancer Dataset (breast.data)

* Easy dataset to start off with
* Dataset contains all continuous variables, except one ID column, and one label (M, B) column
    * The continuous variables are just statistics collected from a tumor's biopsy
    * More information can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names)
* Goal of the dataset is to classify whether a tumor is malignant (M) or benign (B)

In [None]:
prefix = "../datasets/"
import pandas as pd

df = pd.read_csv(prefix + "breast.data", header=None)

In [None]:
df.head()

In [None]:
df.drop(0, axis=1, inplace=True)

# Creating features and response variable set
y = df[1]
X = df.drop(1, axis=1)

In [None]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
predictions = SVC().fit(X_train, y_train).predict(X_test)
non_scale_accuracy = accuracy_score(y_test, predictions)
print("Accuracy of SVM: {0}".format(non_scale_accuracy))

predictions = LogisticRegression().fit(X_train, y_train).predict(X_test)
non_scale_accuracy = accuracy_score(y_test, predictions)
print("Accuracy of Logistic Regression: {0}".format(non_scale_accuracy))

## Improving the classification rate by feature engineering

We're not getting much bang for our buck from the support vector machine, and that's because we're not preprocessing the continuous features correctly.

There are a variety of ways to improve model accuracy with continuous features:

* Feature scaling
    * Standard scaling: For each continuous feature, $\mu = 0$ and $\sigma = 1$
    * Min-max scaling: Scale all continuous features between the range $[0, 1]$ or $[-1, 1]$.
* Univariate feature selection
    * Using the Chi-squared statistic to improve classification
* Discretization of continuous features
    * No real example, just a survey of techniques

## Part 1: Feature scaling

* Idea is that continuous features can take values anywhere in a certain range, and those ranges can differ widely between features; we need a way to shrink (or inflate) everything onto a common scale
* Scaling reduces the variation in magnitude across features.
* **Standard scaling** applies the following formula to transform a feature into a space with mean 0 and standard deviation 1. This is also called "standardization" (plain "recentering" would only subtract the mean).

    Given the $i$th continuous feature $X_i$, we apply the following formula for each $x \in X_i$:
    $$x' = \frac{x - \bar{X_i}}{\sigma_{X_i}}$$
    where $\bar{X_i}$ is the mean of feature $X_i$ and $\sigma_{X_i}$ is its standard deviation. Our new dataset composed of $x'$ will have mean 0 and standard deviation 1.
* **Min-max scaling** applies the following formula to shrink (or inflate) features into a given interval. If we want our features to lie within the interval [0, 1], the following formula would work.
    $$x' = \frac{x - \min(X_i)}{\max(X_i) - \min(X_i)}$$

* More information on [Wikipedia](https://en.wikipedia.org/wiki/Feature_scaling)
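
As a minimal sketch of the two formulas above, the following applies them column by column with NumPy. The toy matrix and the variable names (`X_demo`, `X_standard`, `X_minmax`) are made up for illustration; this is not meant to mirror how scikit-learn implements its scalers.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features on very different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 700.0]])

# Standard scaling: subtract each column's mean, divide by its standard deviation.
# np.std defaults to the population standard deviation (ddof=0).
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)
X_standard = (X_demo - mean) / std

# Min-max scaling to [0, 1]: subtract each column's minimum, divide by its range.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_minmax = (X_demo - col_min) / (col_max - col_min)

print(X_standard.mean(axis=0), X_standard.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

Both transforms divide by zero on a constant column (zero standard deviation or zero range), so a real implementation needs to guard against that case.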

In [None]:
X.head()

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# I want to show that SVMs are sensitive to feature scaling.
# In particular, sklearn.svm.SVC uses the RBF kernel by default,
# and this kernel is sensitive to scaling.
#
# More information on how to properly train an SVM is here:
#    http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

for Scaler in [StandardScaler, MinMaxScaler]:
    
    # "Scaler" is a class object whose constructor and attributes we can call
    scaler = Scaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    
    svm = SVC().fit(X_train_scaled, y_train)
    
    X_test_scaled = scaler.transform(X_test)  # Note we don't "refit" for testing data
    predictions = svm.predict(X_test_scaled)
    print("Accuracy of SVM using {0}: {1}".format(Scaler.__name__, accuracy_score(y_test, predictions)))
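
To see why the RBF kernel cares about scale, recall that it is $K(a, b) = \exp(-\gamma \lVert a - b \rVert^2)$, so a feature with a large numeric range dominates the squared distance. The sketch below uses two made-up samples and made-up feature ranges (none of these numbers come from the breast cancer data); `rbf_kernel_value` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """RBF kernel exp(-gamma * ||a - b||^2) between two vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

# Hypothetical samples: feature 0 is assumed to live in [0, 2000], feature 1 in [0, 1].
a = np.array([1000.0, 0.1])
b = np.array([1010.0, 0.9])

# Unscaled: the gap of 10 in feature 0 contributes 100 to the squared distance,
# while the gap of 0.8 in feature 1 contributes only 0.64.
k_raw = rbf_kernel_value(a, b, gamma=0.5)

# Min-max scale both features to [0, 1] using the assumed ranges above.
feature_min = np.array([0.0, 0.0])
feature_max = np.array([2000.0, 1.0])
a_scaled = (a - feature_min) / (feature_max - feature_min)
b_scaled = (b - feature_min) / (feature_max - feature_min)
k_scaled = rbf_kernel_value(a_scaled, b_scaled, gamma=0.5)

print("unscaled kernel value: {0}".format(k_raw))
print("scaled kernel value:   {0}".format(k_scaled))
```

Without scaling, the kernel value is driven almost entirely by feature 0, even though the two samples are close in that feature relative to its range. After scaling, feature 1 gets a say.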
    

## Part 2: Univariate Feature Selection
    
The idea: we have all of these continuous attributes, but who is to say that any of them are useful?

The full Scikit-Learn module on feature selection is presented [here](http://scikit-learn.org/stable/modules/feature_selection.html).

## UCI Sonar Dataset

* The task is to train a classifier to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. (From website.)
* More dataset description [here](https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.names). The dataset was originally used for neural network classification tasks.
* In general, this is one of my favorite datasets because the classification task is difficult

In [None]:
df = pd.read_csv(prefix + "sonar.data", header=None)

In [None]:
df.head()

In [None]:
X = df.drop(60, axis=1)
y = df[60]  # Rock or mine class label

In [None]:
# As a baseline, let's classify this with Logistic Regression, no scaling

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
X_train.head()

In [None]:
predictions = LogisticRegression().fit(X_train, y_train).predict(X_test)
baseline_acc = accuracy_score(y_test, predictions)
print("Accuracy of Logistic Regression: {0}".format(baseline_acc))

### Univariate feature selection definition

* Not all features in your training set help predict the output
* **Univariate feature selection** is a technique to weed out features that are uninformative.
* Using a *scoring function*, we rank variables by their scores, and discard variables that yield a low score
    * For classification: [chi-squared](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2) or [ANOVA F-value](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif)
    * In Scikit-Learn, can either select variables using percentiles ([SelectPercentile](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile))  or choose a fixed $k$ variables ([SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html))
    
References
* [Scikit-Learn user guide](http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection) (describes what univariate feature selection is, and how to use it)
* [Scikit-Learn example using SVMs and iris dataset](http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html)
* [Another Scikit-Learn example using SVMs](http://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html)
* [Wikipedia's example about the Chi-squared statistic](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test#Examples)

## Mathematical procedure using the chi-squared statistic

Let 
* $C_j$ be the random variable representing the *class label* of sample $j$. The observed value of $C_j$ is also called the **response**.
* $X$ be the random variable representing the value of some feature in the dataset. The chi-squared test is defined for counts or frequencies, so we assume that $X$ is non-negative (scikit-learn's `chi2` requires this).

Our null hypothesis is that $X$ and $C_j$ are independent, that is:

$$P(C_j \,\vert\, X) = P(C_j).$$

In other words, there is *no* relationship between our feature and the response.

The chi-squared test assigns a score, the **$\chi^2$ statistic**, to each of our features. Depending on the threshold we set, we will either reject or fail to reject our null hypothesis.
* If we *fail to reject* our null hypothesis, then we have no evidence that the response depends on the feature, and we can throw the feature away.
* If we *reject* our null hypothesis, then the response is dependent on the feature, and we should keep this feature.
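
The statistic itself can be sketched in a few lines of NumPy. The version below treats each non-negative feature value as a frequency, sums it per class to get observed counts, and compares those against the counts expected under independence: $\chi^2 = \sum (O - E)^2 / E$. The function name `chi2_scores` and the toy data are made up for illustration, and it is only my understanding that scikit-learn's `chi2` follows the same recipe, so treat that as an assumption.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared statistic of each non-negative feature against the class label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot encode the labels: shape (n_samples, n_classes).
    Y = (y[:, None] == classes[None, :]).astype(float)
    # Observed "counts": total feature value accumulated in each class.
    observed = np.dot(Y.T, X)                      # (n_classes, n_features)
    # Expected counts if feature and class were independent.
    feature_total = X.sum(axis=0)                  # (n_features,)
    class_prob = Y.mean(axis=0)                    # (n_classes,)
    expected = np.outer(class_prob, feature_total)
    # Note: a feature that sums to zero would divide by zero here.
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Hypothetical data: feature 0 is constant, feature 1 is large only for class 0.
X_toy = np.array([[1, 5],
                  [1, 6],
                  [1, 0],
                  [1, 1]])
y_toy = np.array([0, 0, 1, 1])

scores = chi2_scores(X_toy, y_toy)
print(scores)
```

The constant feature scores zero because its observed and expected counts match in every class, while the feature concentrated in class 0 receives a much larger score.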

### Reference
* [Chi-squared feature selection in document classification](http://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html)

In [None]:
from sklearn.feature_selection import chi2
statistics, pvalues = chi2(X, y)

In [None]:
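# Sort features by chi-squared statistic in ascending order,
# so the 20 lowest-scoring features are printed.
features = sorted(enumerate(statistics), key=lambda pair: pair[1])
for (index, value) in features[:20]:
    print("Feature {0} with chi-squared {1}".format(index, value))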

### Example using SelectKBest and the Chi-squared statistic

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

n_features = X.shape[1]

from sklearn.feature_selection import SelectKBest, chi2

accuracies = list()

selector = SelectKBest(score_func=chi2)
for k_chosen in range(1, n_features, 1):
    selector.set_params(k=k_chosen)
    
    X_altered = selector.fit_transform(X, y)
    
    predictions = LogisticRegression().fit(X_altered, y).predict(X_altered)
    
    accuracy = accuracy_score(predictions, y)
    accuracies.append(accuracy)

%matplotlib inline
import matplotlib.pyplot as plt

features = range(1, n_features, 1)
plt.plot(features, accuracies, label="Logistic Regression")
plt.title("Performance of Logistic Regression using univariate feature selection\nwith respect to the chi-squared statistic")
plt.xlabel("Number of features selected")
plt.ylabel("Accuracy Score")
plt.ylim(0.65, 1)

# Plot the baseline
plt.plot(features, [baseline_acc] * len(features), label="Baseline LR")
plt.legend()
plt.show()

# accuracies[i] corresponds to k = i + 1 selected features
index, acc = max(enumerate(accuracies), key=lambda pair: pair[1])
print("Best accuracy is when using {0} features".format(index + 1))

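### Warning: Don't overfit

The loop above fit and scored the model on the same data, so its accuracies are optimistic. Below, the selector and the classifier are fit on the training split only and scored on the held-out test split.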

In [None]:
n_features = X_train.shape[1]

from sklearn.feature_selection import SelectKBest, chi2

accuracies2 = list()

selector = SelectKBest(score_func=chi2)
for k_chosen in range(1, n_features, 1):
    selector.set_params(k=k_chosen)
    
    X_train_altered = selector.fit_transform(X_train, y_train)
    X_test_altered = selector.transform(X_test)
    
    predictions = LogisticRegression().fit(X_train_altered, y_train).predict(X_test_altered)
    
    accuracy = accuracy_score(predictions, y_test)
    accuracies2.append(accuracy)

plt.title("Performance of Logistic Regression using univariate feature selection\nwith respect to the chi-squared statistic")
plt.xlabel("Number of features selected")
plt.ylabel("Accuracy Score")
plt.ylim(0.65, 1)
    
    
features = range(1, n_features, 1)
plt.plot(features, accuracies, label="Overfit LR")
plt.plot(features, accuracies2, label="Not-overfit LR")

# Plot the baseline
plt.plot(features, [baseline_acc] * len(features), label="Baseline LR")
plt.legend()
plt.show()

#index, acc = max(enumerate(accuracies), key=lambda (i, acc): acc)
#print "Best accuracy is when using {0} features".format(index)
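
One way to avoid scoring a model on data it has already seen is to put the selector and the classifier in a single pipeline and score it with cross-validation, so that feature selection is refit on each training fold only. This is a sketch on synthetic data, not the sonar dataset; the data, `k=5`, and `cv=5` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic, non-negative data (chi2 requires non-negative features).
# Only the first two of the 20 features carry signal about the label.
rng = np.random.RandomState(0)
X_synth = rng.rand(200, 20)
y_synth = (X_synth[:, 0] + X_synth[:, 1] > 1.0).astype(int)

# The selector is refit inside every training fold, so the held-out fold
# never influences which features are kept.
model = make_pipeline(SelectKBest(score_func=chi2, k=5), LogisticRegression())
scores = cross_val_score(model, X_synth, y_synth, cv=5)

print("Cross-validated accuracy per fold: {0}".format(scores))
print("Mean accuracy: {0:.3f}".format(scores.mean()))
```

Choosing `k` itself should also happen inside this kind of loop (for example with a grid search over the pipeline), otherwise the test scores are again used to tune the model.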

### More resources on feature selection

* [Quora answer by Olivier Grisel](https://www.quora.com/How-do-I-perform-feature-selection)
* [Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. J. Mach. Learn. Res. 3 (March 2003), 1157-1182.](http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf)

## Part 3: Discretization of continuous features

* Typically, a large value for a continuous feature is taken to mean a tendency to be of class $A$ instead of class $B$.
* This view doesn't take into account that the values of a continuous feature could be "clustered"
    * *Example*: Suppose a continuous feature $x$ for some sample $i$ takes values in $[0, 1]$. We want to classify $i$ as either class $A$, $B$, or $C$
        * $x \in [0, 0.33)$ implies $i$ tends to be in class $A$
        * $x \in [0.33, 0.66)$ implies $i$ tends to be in class $B$
        * $x \in [0.66, 1]$ implies $i$ tends to be in class $C$
* Treated as a raw number, the continuous feature doesn't cleanly distinguish the three classes. But there are several **discretization** (i.e. *binning*) methods which convert the continuous feature into a categorical one
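
The toy example above can be sketched with `np.digitize`, which maps each value to the index of the interval it falls in. The cut points 0.33 and 0.66 come from the example; the sample values and the name `class_tendency` are made up.

```python
import numpy as np

# Hypothetical feature values in [0, 1].
x = np.array([0.10, 0.33, 0.50, 0.66, 1.00])
cut_points = [0.33, 0.66]

# Index 0 for x < 0.33, 1 for 0.33 <= x < 0.66, 2 for x >= 0.66.
bin_index = np.digitize(x, cut_points)

# Turn the bin index into a categorical value.
labels = np.array(["A", "B", "C"])
class_tendency = labels[bin_index]
print(list(class_tendency))
```

The resulting categorical column can then be one-hot encoded, which lets a linear model assign a separate weight to each interval.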

Historically, this was an active research problem. For decision trees, it was non-trivial to figure out where to split a continuous interval. Ross Quinlan's [C4.5](https://en.wikipedia.org/wiki/C4.5_algorithm#Improvements_from_ID.3_algorithm) algorithm improved on his earlier [ID3](https://en.wikipedia.org/wiki/ID3_algorithm) algorithm, in part by handling continuous attributes.

With all of this said, I don't think discretization is a popular machine learning problem anymore. (Citation needed.)

### Glossary of techniques
* Unsupervised binning
    * **Simple uniform binning**: Split an interval into $k$ evenly sized intervals.
    * **Percentile / quartile based binning**: Split an interval based on percentiles or quartiles
    * **Random forest embedding**: Implementation [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomTreesEmbedding.html#sklearn.ensemble.RandomTreesEmbedding)
* Supervised binning
    * **Minimum description length principle.** Implementation [here](https://github.com/UIUC-data-mining/mdlp-discretization), paper [here](http://ijcai.org/Past%20Proceedings/IJCAI-93-VOL2/PDF/022.pdf)
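
The two simplest unsupervised schemes can be sketched with NumPy alone. The data and the helper names `uniform_bins` and `quantile_bins` are made up for illustration; the skewed sample is chosen to show how the two schemes differ.

```python
import numpy as np

def uniform_bins(x, k):
    """Assign each value to one of k equal-width bins spanning [min(x), max(x)]."""
    edges = np.linspace(x.min(), x.max(), k + 1)
    # Use only the interior edges so the indices run from 0 to k - 1.
    return np.digitize(x, edges[1:-1])

def quantile_bins(x, k):
    """Assign each value to one of k bins holding roughly equal numbers of samples."""
    interior = np.percentile(x, np.linspace(0, 100, k + 1)[1:-1])
    return np.digitize(x, interior)

# Hypothetical skewed sample with one large outlier.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 100.0])
uniform_result = uniform_bins(x, 4)
quantile_result = quantile_bins(x, 4)
print(uniform_result)
print(quantile_result)
```

On this skewed sample, equal-width binning puts every value except the outlier into the first bin, while quantile binning spreads the samples across all four bins.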

A survey of discretization methods applied to decision tree learning is provided [here](http://link.springer.com/article/10.1023/A:1016304305535).

![](images/discretization.png)