# Feature scaling

Different features can be measured on different scales. For example:
 * height can be measured in centimeters
 * weight in kilograms
 * blood pressure in mmHg
 * etc. 

Some classifiers combine and compare feature values, for example by computing the Euclidean distance between samples. The problem is that if one feature has a much broader range of values than the others, the distance will be governed by that particular feature! For example, consider these two features:
* the unemployment rate of a city, expressed as a fraction from 0.0 to 1.0
* the population of the city - can range up to 500,000

In this example, the unemployment rate will be swamped by the population.

<b>Feature scaling</b> transforms the data so that the features have, more or less, a uniform range. <i>The main advantage of scaling is that it keeps attributes with larger numeric ranges from dominating those with smaller numeric ranges.</i>
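To make the city example concrete, here is a minimal sketch using two made-up cities. The values, the `rescale` helper, and the choice to simply divide the population by an assumed maximum of 500,000 are all illustrative assumptions, not part of the dataset used later.

```python
import math

# Two hypothetical cities: (unemployment rate, population)
city_a = (0.05, 100_000)
city_b = (0.25, 101_000)

def euclidean(p, q):
    """Plain Euclidean distance between two equally long tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def rescale(city, max_population=500_000):
    """Bring the population into the 0-1 range; the rate is already there."""
    rate, population = city
    return (rate, population / max_population)

raw_distance = euclidean(city_a, city_b)
scaled_distance = euclidean(rescale(city_a), rescale(city_b))

print(raw_distance)     # about 1000: almost entirely the population gap
print(scaled_distance)  # about 0.2: now the unemployment gap is visible
```

Before rescaling, the large difference in unemployment barely changes the distance. After rescaling, it is the main contributor.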

As we'll see later, scaling features can lead to a faster optimization process and better results. However, some algorithms, such as the Decision Tree Classifier, do *not* need scaling (i.e. they are scale-invariant).


### Wine Dataset
About the dataset: `wine.csv`

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

The data consists of:
- 13 features (see datasets/wine.names)
- 3 classes - types of wine (I'm not sure what these types particularly are. In the dataset, they're just denoted as 1, 2, and 3.) 

See more detail here: https://archive.ics.uci.edu/ml/datasets/wine

## Let's get started
Scikit-learn comes with a package for data preprocessing, `sklearn.preprocessing`. Let's start by importing that.

In [230]:
from sklearn import preprocessing

<b>1.)</b> Also import all the libraries that we've used in the previous sessions. We'll be plotting some things too, so remember to import `matplotlib.pyplot as plt`.

<b>2.)</b> Load the dataset `wine.csv` located in the /datasets folder. Set `header=None` since the dataset has no headers.

In [231]:
filename = None
df = None
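If you're unsure what `header=None` does, here is a small sketch on a made-up, in-memory CSV (the two rows and the name `demo_df` are placeholders, not the real wine data). Without a header row, pandas numbers the columns 0, 1, 2, and so on.

```python
import io
import pandas as pd

# Two made-up rows standing in for a CSV file that has no header line
csv_text = "1,14.23,1.71\n2,12.37,0.94\n"

# header=None tells pandas that the first line is data, not column names
demo_df = pd.read_csv(io.StringIO(csv_text), header=None)

print(demo_df.columns.tolist())  # [0, 1, 2]
print(demo_df.shape)             # (2, 3)
```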

The headers are actually stored in the file /datasets/wine.names. We load this file as follows:

In [232]:
headers = []
with open('../datasets/wine.names','r') as file:
    for line in file:
        headers.append(line.strip()) # appends each line to the 'headers' list. 

`strip()` removes leading and trailing whitespace from each line, including the line break (\n).
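A quick illustration on a made-up line (the string is just an example, not taken from the headers file):

```python
line = "  Alcohol \n"

# strip() trims whitespace on both ends, including the newline
print(repr(line.strip()))   # 'Alcohol'

# rstrip() would only trim the right-hand side
print(repr(line.rstrip()))  # '  Alcohol'
```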

We can then set the headers as follows:

In [233]:
df.columns = headers #Sets the headers

Note: if you see `AttributeError: 'NoneType' object has no attribute 'columns'`, then `df` is still `None`. Complete step 2 first.

<b>3.)</b> Inspect your data using methods found in the following tutorial: [Understand your Data with Descriptive Statistics](http://machinelearningmastery.com/understand-machine-learning-data-descriptive-statistics-python/)

In [None]:
#print()

<b>4.)</b> Create matrix `X` containing the features and target vector `y` containing the classes.

Note that with `DataFrames`, you can access certain features using their column names/headers. This will be useful for us later.

In [None]:
X = None
y = None
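As a hint, here is one way to separate features from classes by column name, shown on a tiny made-up frame. The column name `Class` is an assumption for illustration; check your own headers for the actual name of the class column.

```python
import pandas as pd

# Tiny made-up frame; 'Class' is a hypothetical name for the target column
demo_df = pd.DataFrame({
    "Class": [1, 2, 3],
    "Alcohol": [14.23, 12.37, 12.86],
    "Malic acid": [1.71, 0.94, 1.35],
})

y_demo = demo_df["Class"]                 # target vector (a Series)
X_demo = demo_df.drop(columns=["Class"])  # everything else is a feature

print(X_demo.columns.tolist())  # ['Alcohol', 'Malic acid']
print(y_demo.tolist())          # [1, 2, 3]
```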

## Standardization 
Some machine learning algorithms behave badly when their individual features don't look like standard normally distributed data (i.e. Gaussian distribution with zero mean and unit standard deviation). Standardizing the features so that they are centered around 0 with a standard deviation of 1 is not only important when we compare measurements that have different units, but is also a common requirement for many machine learning algorithms.

<b>Statistics Review</b>: 
* Remember that the <b>mean</b> is simply the *average* of the values. 
* The <b>standard deviation</b> is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. ([Calculating Standard Deviation](https://www.khanacademy.org/math/probability/data-distributions-a1/summarizing-spread-distributions/a/calculating-standard-deviation-step-by-step))
* (Source: Understanding [Mean](https://en.wikipedia.org/wiki/Arithmetic_mean) and [Standard Deviation](https://en.wikipedia.org/wiki/Standard_deviation))

![alt](images/gaussian_.png)

The result of <b>standardization</b> (or <b>Z-score normalization</b>) is that each feature is rescaled to have zero mean and unit standard deviation, like a standard normal distribution. Note that this only shifts and rescales the values; it does not change the shape of their distribution. <b>Standard scores</b> (also called <b>z scores</b>) of the samples are calculated as follows:
![alt](images/z_score_.png)
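In code, the z score is each value minus the mean, divided by the standard deviation. Here is a minimal sketch on made-up numbers, assuming the population standard deviation (dividing by the number of samples):

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample

mean = statistics.mean(values)   # 5.0
std = statistics.pstdev(values)  # population standard deviation: 2.0

# z = (x - mean) / std for every value
z_scores = [(x - mean) / std for x in values]

print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```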

The function `scale` provides a quick and easy way to perform this operation on a single array-like dataset:
```python 
X_standard = preprocessing.scale(X)
```
Another way to do this is to use `StandardScaler` as follows:
```python
scaler = preprocessing.StandardScaler().fit(X)
X_standard = scaler.transform(X)
```
If you get an error, simply cast `X_standard` back into a DataFrame.
```python
X_standard = pd.DataFrame(X_standard, columns = X.columns)
```
This is because both `scale` and `StandardScaler` return a NumPy array. We want to preserve `X`'s `DataFrame` structure, since the cells below access features by column name.
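The sketch below checks, on a made-up two-column frame, that both routes give the same numbers and that the result is a plain array until it is wrapped again. The values and the `_demo` names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Made-up values for two of the wine features
X_demo = pd.DataFrame({
    "Alcohol": [14.23, 12.37, 12.86, 13.05],
    "Malic acid": [1.71, 0.94, 1.35, 2.31],
})

via_scale = preprocessing.scale(X_demo)
scaler = preprocessing.StandardScaler().fit(X_demo)
via_scaler = scaler.transform(X_demo)

print(type(via_scaler))                    # a NumPy array, not a DataFrame
print(np.allclose(via_scale, via_scaler))  # True: both routes agree

# Wrap the array again to get the column names back
X_demo_standard = pd.DataFrame(via_scaler, columns=X_demo.columns)
print(X_demo_standard.columns.tolist())    # ['Alcohol', 'Malic acid']
```

One reason to prefer `StandardScaler` is that the fitted `scaler` remembers the mean and standard deviation, so the same transformation can be applied to other data later.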

<b>5.)</b> Scale `X` using standardization.

In [None]:
X_standard = None

In [None]:
print('Standardization')
for feature in X_standard:
    mean = X_standard[feature].mean() # get mean
    std = X_standard[feature].std() # get standard deviation
    print(feature + ': mean = {:.2f}, std = {:.2f}'.format(mean, std))

Each feature should have a mean of 0 and a standard deviation of 1 (so its variance is 1 as well).
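One detail when checking these numbers: pandas' `Series.std()` computes the sample standard deviation (`ddof=1`) by default, while `StandardScaler` divides by the population standard deviation (`ddof=0`). The reported value can therefore sit slightly above 1. The sketch below shows the gap on a deliberately tiny made-up sample, using only the standard library. With more rows the gap shrinks, so it will usually disappear at two decimal places.

```python
import math
import statistics

values = [1.0, 2.0, 3.0, 4.0]  # made-up sample, kept tiny to exaggerate the effect

mean = statistics.mean(values)
pop_std = statistics.pstdev(values)  # population std, as StandardScaler uses
z = [(x - mean) / pop_std for x in values]

print(statistics.pstdev(z))  # 1.0 (up to rounding): population std of the z scores
print(statistics.stdev(z))   # about 1.155: sample std, the pandas default
```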

<b>6.)</b> You may also inspect your data using `describe()`; then print the first 10 instances of `X_standard` (using `head()`, again) to see how our data has transformed

In [None]:
print()

## Scaling Features to a range
For some classifiers, it's useful to scale the features down to a value within the range 0 to 1.

Min-max scaling is typically done via the following equation:
![alt](images/min_max_.png)
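Written out by hand, and assuming the usual min-max formula (subtract the minimum, then divide by the range), the computation looks like this on a few made-up values:

```python
values = [11.0, 12.5, 13.0, 14.0, 15.0]  # made-up feature values

lo, hi = min(values), max(values)

# (x - min) / (max - min) maps the smallest value to 0 and the largest to 1
scaled = [(x - lo) / (hi - lo) for x in values]

print(scaled)  # [0.0, 0.375, 0.5, 0.75, 1.0]
```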

In scikit-learn, we use `MinMaxScaler` to scale the matrix to a `[0,1]` range:
```python 
scaler = preprocessing.MinMaxScaler()
X_minmax = scaler.fit_transform(X)
```

Again, if you get an error, simply cast `X_minmax` back into a DataFrame.
```python
X_minmax = pd.DataFrame(X_minmax, columns = X.columns)
```

Scale X so that it is within the range 0 to 1.

In [None]:
X_minmax = None

In [None]:
print('\n\nMin-max Scaling')
for feature in X_minmax:
    min_ = X_minmax[feature].min()
    max_ = X_minmax[feature].max()
    print(feature + ': min = {:.2f}, max = {:.2f}'.format(min_, max_))

The lowest value (min) in each feature vector should be 0. The highest value (max) should be 1.

<b>7.)</b> Inspect your data using `describe()` and print the first 10 instances of `X_minmax` (using `head()`, again) to see how our data has been transformed.

In [None]:
print()

We'll now plot the data to see (visually) how it has been transformed. Here, we use only the first two attributes: Alcohol and Malic acid.

In [None]:
def plot():
    plt.figure(figsize=(10, 10))
    feature1 = 'Alcohol' #alternatively, you can call headers[1] to avoid hard-coding
    feature2 = 'Malic acid' #headers[2]

    plt.scatter(X[feature1], X[feature2], color='green', label='Raw input scale', alpha=0.5)
    plt.scatter(X_standard[feature1], X_standard[feature2], color='red', label='Standardized', alpha=0.5)
    plt.scatter(X_minmax[feature1], X_minmax[feature2], color='blue', label='Scaled between 0 to 1', alpha=0.5)
    plt.xlabel(feature1)
    plt.ylabel(feature2)
    plt.legend(loc='upper left')
    plt.show()
plot()

Alright! It's time to compare the performances of the classifiers for the scaled and unscaled data.

In [None]:
def classify(string, X):
    print('\n' + string)

    # As we've done before, let's apply our machine learning algorithms!

    # 8. Split the dataset into training and testing data
    #    Set the train size to 80%, and hold back 20% for testing 
    #    For random_state, set the seed to 7 (so that we all get the same results :D).

    # 9. Train different classifiers on the training data.
    #    I suggest trying K-Nearest Neighbor (KNN), DecisionTreeClassifier... try others!
    #    For each classifier used, calculate and print the accuracy along with the classifier's name. 

In [None]:
classify('Not scaled', X)
classify('Standardized', X_standard)
classify('Scaled to range 0-1', X_minmax)
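If you'd like a feel for what to expect before running this on the wine data, here is a rough sketch on synthetic data (not the wine dataset). Everything in it is an assumption made for illustration: one small-range feature that carries the class signal, one large-range feature that is pure noise, and the hypothetical helper `accuracies`. It mirrors the flow above by scaling first and splitting afterwards.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
n = 400
labels = rng.integers(0, 2, size=n)

informative = labels + rng.normal(0, 0.3, size=n)  # small range, carries the signal
noise = rng.normal(0, 1000, size=n)                # large range, no signal at all
X_demo = np.column_stack([informative, noise])

def accuracies(X, y):
    """Train each classifier on 80% of the data and score it on the other 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.8, random_state=7)
    results = {}
    for model in (KNeighborsClassifier(), DecisionTreeClassifier(random_state=7)):
        model.fit(X_train, y_train)
        results[type(model).__name__] = model.score(X_test, y_test)
    return results

raw_scores = accuracies(X_demo, labels)
scaled_scores = accuracies(StandardScaler().fit_transform(X_demo), labels)

print('Not scaled  :', raw_scores)
print('Standardized:', scaled_scores)
```

In this setup, KNN should benefit clearly from scaling, because the noisy large-range feature no longer dominates the distance. The decision tree should be largely unaffected. Your numbers on the wine data will differ.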

### Share your results! :D

This tutorial is based on http://sebastianraschka.com/Articles/2014_about_feature_scaling.html. 
* This also shows you how to implement scaling manually (equations are given).
* If you want a more detailed explanation, check it out!

### FAQs

1.) Is it always a good idea to scale/normalize our data? Which technique should you use to rescale your data?

Answer: 
- [This answer here is pretty extensive.](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html)
- [The answer here is also pretty insightful](http://sebastianraschka.com/Articles/2014_about_feature_scaling.html)