# Feature scaling

Different features can be measured on different scales. For example:
 * height can be measured in centimeters
 * weight in kilograms
 * blood pressure in mmHg
 * etc. 

Some classifiers combine and compare feature values, for example by computing the Euclidean distance between samples. The problem is that if one feature has a much broader range of values than the others, the distance will be governed by that feature! For example, consider these two features:
* the unemployment rate of a city, expressed as a fraction from 0.0 to 1.0
* the population of the city, which can range up to 500,000

In this example, the unemployment rate will be swamped by the population.
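To make this concrete, here is a small sketch using three made-up cities (the numbers are hypothetical and chosen only for illustration). It computes plain Euclidean distances on the unscaled values:

```python
import math

# Hypothetical cities: (unemployment rate as a fraction, population)
city_a = (0.05, 100_000)
city_b = (0.95, 100_500)  # very different unemployment, similar population
city_c = (0.05, 150_000)  # identical unemployment, different population

# Euclidean distance on the raw, unscaled features
dist_ab = math.dist(city_a, city_b)
dist_ac = math.dist(city_a, city_c)

print(f"A to B: {dist_ab:.2f}")
print(f"A to C: {dist_ac:.2f}")
```

City B is at the opposite end of the unemployment range from city A, yet it comes out far "closer" to A than city C does, because the population difference contributes almost all of the distance.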

<b>Feature scaling</b> transforms the data so that the features have a more or less uniform range. <i>The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges.</i>

As we'll see later, scaling features can lead to a faster optimization process and better results. However, some algorithms, such as the Decision Tree Classifier, are scale-invariant, so scaling is *not* necessary for them.


### Wine Dataset
About the dataset: `wine.csv`

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine.

The data consists of:
- 13 features (see datasets/wine.names)
- 3 classes - types of wine (I'm not sure what these types particularly are. In the dataset, they're just denoted as 1, 2, and 3.) 

See more detail here: https://archive.ics.uci.edu/ml/datasets/wine

## Let's get started
Let's start by importing scikit-learn's data preprocessing module, `sklearn.preprocessing`.

In [53]:
from sklearn import preprocessing

<b>1.)</b> Import all the libraries that we've used in previous sessions. 

We'll be plotting some things so let's go ahead and import `matplotlib`.

In [54]:
import matplotlib.pyplot as plt

<b>2.)</b> Load the dataset `wine.csv` located in the /datasets folder. Set `header=None` since the dataset has no headers.

In [55]:
filename = None
df = None

The headers are actually stored in the file /datasets/wine.names. We load this file as follows:

In [56]:
headers = []
with open('../datasets/wine.names','r') as file:
    for line in file:
        headers.append(line.strip()) # appends each line to the 'headers' list. 

The method `strip()` removes leading and trailing whitespace, including the line break (`\n`), from each line. We then set the headers as follows:

In [57]:
## Uncomment this to set the headers
# df.columns = headers #Sets the headers
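As a rough sketch of the loading and header steps, the snippet below reads a headerless CSV from an in-memory string instead of the real file. The two rows and the three column names are made up for illustration; only `pd.read_csv(..., header=None)` and the `columns` assignment mirror the steps above.

```python
import io
import pandas as pd

# Two made-up rows standing in for a headerless CSV file
csv_text = "1,14.23,1.71\n2,12.37,0.94\n"

# header=None tells pandas that the first row is data, not column names
df_demo = pd.read_csv(io.StringIO(csv_text), header=None)
print(df_demo.columns.tolist())  # pandas assigns integer column labels

# Hypothetical header names, assigned the same way as `df.columns = headers`
df_demo.columns = ["Class", "Alcohol", "Malic acid"]
print(df_demo.head())
```

Without `header=None`, pandas would treat the first data row as the header and silently drop one sample.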

<b>3.)</b> Explore and inspect your data. Use the methods and techniques you've learned in the section `Exploratory Analysis`.

<b>4.)</b> Create matrix `X` containing the features and target vector `y` containing the classes.

In [58]:
X = None
y = None

<b> 8.)</b> Split the dataset into training and testing data.
* Set the train size to 80%, and hold back 20% for testing 
* For random_state, set the seed to 42 (so that we all get the same results :D).

In [59]:
X_train, X_test, y_train, y_test = None, None, None, None
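For reference, here is a minimal sketch of an 80/20 split on toy data, assuming scikit-learn's `train_test_split` from `sklearn.model_selection`. The arrays `X_demo` and `y_demo` are made up; in the exercise you would pass your own `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 made-up samples with 2 features each, and alternating class labels
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Keep 80% for training, hold back 20% for testing; fix the seed for repeatability
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

print(X_tr.shape, X_te.shape)
```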

## Standardization 
Some machine learning algorithms behave badly when their individual features don't look more or less like standard normally distributed data (i.e. a Gaussian distribution with zero mean and unit standard deviation). Standardizing the features so that they are centered around 0 with a standard deviation of 1 is important when comparing measurements that have different units, and it is also a general requirement for many machine learning algorithms.

<b>Statistics Review</b>: 
* Remember that the <b>mean</b> is simply the *average* of the values. 
* The <b>standard deviation</b> is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. ([Calculating Standard Deviation](https://www.khanacademy.org/math/probability/data-distributions-a1/summarizing-spread-distributions/a/calculating-standard-deviation-step-by-step))
* (Source: Understanding [Mean](https://en.wikipedia.org/wiki/Arithmetic_mean) and [Standard Deviation](https://en.wikipedia.org/wiki/Standard_deviation))
* [Understanding Normal Distribution.](https://www.mathsisfun.com/data/standard-normal-distribution.html)

![alt](images/gaussian_.png)

The result of <b>standardization</b> (or <b>Z-score normalization</b>) is that each feature is rescaled to have zero mean and unit standard deviation, like a standard normal distribution. The shape of each feature's distribution is otherwise unchanged. <b>Standard scores</b> (also called <b>z scores</b>) of the samples are calculated as follows:
![alt](images/z_score_.png)
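As a sketch of the calculation done by hand, assuming the usual z-score definition `z = (x - mean) / std`, the snippet below standardizes a tiny made-up training matrix with NumPy and then reuses the training mean and standard deviation for a test row.

```python
import numpy as np

# Made-up data: 3 training samples and 1 test sample, 2 features each
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[4.0, 400.0]])

# Per-feature statistics, computed on the training data only
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)

# z = (x - mean) / std, applied column by column
Z_train = (X_train_demo - mu) / sigma
Z_test = (X_test_demo - mu) / sigma  # test data reuses the training statistics

print(Z_train.mean(axis=0), Z_train.std(axis=0))
print(Z_test)
```

After this step both columns are on the same scale even though the raw values differed by a factor of 100.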

One way to do this is to use `StandardScaler` as follows:
```python
standard_scaler = preprocessing.StandardScaler().fit(X_train)
X_train_standard = standard_scaler.transform(X_train)
X_test_standard = standard_scaler.transform(X_test)
```
If a later step raises an error because it expects a DataFrame (for example, selecting a column by name), cast `X_train_standard` back into a DataFrame. This assumes `pandas` has been imported as `pd`.
```python
X_train_standard = pd.DataFrame(X_train_standard, columns = X.columns)
```
This is because `StandardScaler` returns a NumPy array. We want to preserve `X`'s `DataFrame` structure.
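The snippet below sketches that round trip on a small made-up DataFrame. Passing `index=` as well as `columns=` is my own addition, not something the tutorial requires; it keeps the original row labels, which are shuffled after a train/test split.

```python
import pandas as pd
from sklearn import preprocessing

# Made-up training data with non-consecutive row labels, as after a shuffle
X_train_demo = pd.DataFrame(
    {"Alcohol": [13.0, 12.0, 14.0], "Malic acid": [1.5, 2.5, 2.0]},
    index=[7, 2, 5],
)

scaler = preprocessing.StandardScaler().fit(X_train_demo)
scaled = scaler.transform(X_train_demo)  # a NumPy array by default
print(type(scaled))

# Wrap the array again, restoring both column names and row labels
scaled_df = pd.DataFrame(
    scaled, columns=X_train_demo.columns, index=X_train_demo.index
)
print(scaled_df)
```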

<b>5.)</b> Scale `X_train` using standardization:

In [60]:
X_train_standard = None

<b>6.)</b> To see how our data has transformed:
* Inspect `X_train_standard` using `describe()`
* Print the first 10 instances of `X_train_standard`

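Each feature should have a mean of approximately 0 and a variance of approximately 1. Note that `describe()` reports the sample standard deviation (dividing by n - 1), while `StandardScaler` divides by n, so the `std` row will be slightly above 1.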

## Scaling Features to a range
For some classifiers, it's useful to scale the features down to a value within the range 0 to 1.

Min-max scaling is typically done via the following equation:
![alt](images/min_max_.png)
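Here is a sketch of the same calculation by hand, assuming the usual min-max definition `x' = (x - min) / (max - min)`. The data is made up, and the minimum and maximum are taken from the training rows only.

```python
import numpy as np

# Made-up data: 3 training samples and 1 test sample, 2 features each
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[4.0, 50.0]])

# Per-feature minimum and maximum, computed on the training data only
x_min = X_train_demo.min(axis=0)
x_max = X_train_demo.max(axis=0)

# x' = (x - min) / (max - min), applied column by column
X_train_mm = (X_train_demo - x_min) / (x_max - x_min)
X_test_mm = (X_test_demo - x_min) / (x_max - x_min)

print(X_train_mm)
print(X_test_mm)
```

The training values land exactly in the range 0 to 1. The test row does not, because its values lie outside the range seen during training; this is expected when the scaler is fitted on the training data alone.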

In scikit-learn, we use `MinMaxScaler` to scale the matrix to a `[0,1]` range:
```python 
minmax_scaler = preprocessing.MinMaxScaler().fit(X_train)
X_train_minmax = minmax_scaler.transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)
```

Again, if a later step raises an error because it expects a DataFrame, cast `X_train_minmax` back into a DataFrame.
```python
X_train_minmax = pd.DataFrame(X_train_minmax, columns = X.columns)
```

Scale `X_train` so that it is within the range 0 to 1.

In [61]:
X_train_minmax = None

<b>7.)</b> To see how our data has transformed:
* Inspect `X_train_minmax` using `describe()`
* Print the first 10 instances of `X_train_minmax`

The lowest value (min) in each feature vector should be 0. The highest value (max) should be 1.

We'll now plot the data to see (visually) how it has been transformed. Here, we use only the first two features: Alcohol and Malic acid.

In [62]:
def plot():
    plt.figure(figsize=(10, 10))
    feature1 = headers[1]
    feature2 = headers[2]

    plt.scatter(X[feature1], X[feature2], color='green', label='raw input scale', alpha=0.5)
    plt.scatter(X_train_standard[feature1], X_train_standard[feature2], color='red', label='Standardized (mean centered on 0, variance = 1)', alpha=0.5)
    plt.scatter(X_train_minmax[feature1], X_train_minmax[feature2], color='blue', label='Scaled between 0 to 1 (MinMaxScaler)', alpha=0.5)
    plt.xlabel(feature1)
    plt.ylabel(feature2)
    plt.legend(loc='upper left')
    plt.show()

# Uncomment this to see the plot
#plot()

Alright! It's time to compare the performances of the classifiers for the scaled and unscaled data. 

Note that to compare the performances after applying the different feature scaling techniques, we use cross validation rather than evaluating the test set. This is because if we choose a feature scaling technique by examining the test set, then we may end up choosing a technique that works specifically for that test set; thus the accuracy score derived from the test set will no longer be a good estimate of how well it generalizes to new examples. 
* k-fold cross-validation cheat sheet: See [(Slide Set 2) K-Nearest Neighbor.pdf](https://github.com/wwcodemanila/WWCodeManila-ML.AI/tree/master/slides)
* [More on Cross Validation here.](https://en.wikipedia.org/wiki/Cross-validation_(statistics))
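A minimal sketch of scoring one classifier with and without standardization is shown below. It makes a few assumptions that are not part of this tutorial: it uses scikit-learn's bundled copy of the wine data (`load_wine`) instead of `wine.csv`, it picks K-Nearest Neighbor with default settings, and it wraps the scaler and classifier in a pipeline so that the scaler is refitted inside each fold.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for wine.csv: scikit-learn's bundled wine dataset
X_demo, y_demo = load_wine(return_X_y=True)

# Same classifier, once on raw features and once behind a StandardScaler
knn_raw = KNeighborsClassifier()
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())

# 5-fold cross-validation; each call returns one accuracy score per fold
scores_raw = cross_val_score(knn_raw, X_demo, y_demo, cv=5)
scores_scaled = cross_val_score(knn_scaled, X_demo, y_demo, cv=5)

print("KNN, not scaled:   %.3f" % scores_raw.mean())
print("KNN, standardized: %.3f" % scores_scaled.mean())
```

The pipeline is a design choice rather than a requirement: scaling the whole training set once and then cross-validating, as the `validate` calls in this notebook do, lets each validation fold influence the scaling statistics slightly.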

In [66]:
def validate(string, X_train, y_train):
    print('\n' + string)

    # As we've done before, let's apply our machine learning algorithms!
    # 9. Train X using various classifiers of your own choosing.
    #    I suggest trying K-Nearest Neighbor (KNN), DecisionTreeClassifier... try others!
    #    For each classifier used, calculate and print the cross validation score
    #    (using scikit-learn's cross_val_score) along with the classifier's name. 
    pass  # placeholder so the empty stub is valid Python; replace with your code

In [67]:
validate('Not scaled', X_train, y_train)
validate('Standardized', X_train_standard, y_train)
validate('Scaled to range 0-1', X_train_minmax, y_train)


Not scaled

Standardized

Scaled to range 0-1


Using the <b>classifier</b> and <b>feature scaling technique</b> that yields the best cross validation score, predict `X_test`. 
* Make sure you scale `X_test` and `X_train` first using the best feature scaling technique from the previous step.
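The final evaluation might look roughly like the sketch below. The choices here are placeholders, not the answer to the exercise: it assumes standardization and K-Nearest Neighbor won the comparison, and it again uses scikit-learn's bundled `load_wine` data as a stand-in for `wine.csv`.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in for wine.csv, split 80/20 with the same seed as the exercise
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Train on the scaled training set and score once on the scaled test set
clf_demo = KNeighborsClassifier().fit(X_tr_scaled, y_tr)
accuracy_demo = clf_demo.score(X_te_scaled, y_te)
print(accuracy_demo)
```

Fitting the scaler on the training data alone keeps information about the test set out of the model until the final score is computed.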

In [68]:
X_test_scaled = None
clf = None # Best classifier
accuracy = None 

print(accuracy)

None


### Share your results! :D

This tutorial is based on http://sebastianraschka.com/Articles/2014_about_feature_scaling.html. 
* This also shows you how to implement scaling manually (equations are given).
* If you want a more detailed explanation, check it out!

### FAQs

1.) Is it always a good idea to scale/normalize our data? Which technique should you use to rescale your data?
Answer: 
- [This answer here is pretty extensive.](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html)
- [Sebastian Raschka's answer](http://sebastianraschka.com/Articles/2014_about_feature_scaling.html#z-score-standardization-or-min-max-scaling)