# Feature Scaling

While many algorithms (such as SVM, K-nearest neighbors, and logistic regression) require features to be normalized, **Principal Component Analysis (PCA)** is a prime example of when normalization matters. In PCA we are interested in the components that maximize the variance. **If one feature (e.g. human height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos)**, PCA might determine that the direction of maximal variance more closely corresponds with the ‘weight’ axis if those features are **not scaled**. As a change in height of one meter can be considered much more important than a change in weight of one kilogram, this is clearly **incorrect**.
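The effect described above can be illustrated with a minimal sketch. The height and weight values below are synthetic and made up for illustration (they are not part of this notebook's data), and the variable names are assumptions. The sketch compares the first principal component before and after standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, illustrative data: heights in meters, weights in kilograms.
rng = np.random.default_rng(0)
height_m = rng.normal(loc=1.70, scale=0.10, size=500)
# Weight is loosely tied to height, plus noise (an assumption for this sketch).
weight_kg = 70.0 + 50.0 * (height_m - 1.70) + rng.normal(scale=10.0, size=500)
X = np.column_stack([height_m, weight_kg])

# Without scaling, the first principal component follows the feature
# with the larger numeric spread (weight in kilograms).
pc1_raw = PCA(n_components=1).fit(X).components_[0]

# After standardization both features have unit variance,
# so neither dominates purely because of its units.
X_std = StandardScaler().fit_transform(X)
pc1_std = PCA(n_components=1).fit(X_std).components_[0]

print("PC1 loadings (height, weight), unscaled:    ", np.round(pc1_raw, 3))
print("PC1 loadings (height, weight), standardized:", np.round(pc1_std, 3))
```

On the unscaled data the loading on weight should be close to 1 in absolute value, while after standardization the two loadings have equal magnitude.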

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

%matplotlib inline

<br>We will use a small dataset that contains the Physics, Biology and Maths marks of a classroom of students.

In [None]:
df = pd.read_csv("https://archive.org/download/ml-fundamentals-data/machine-learning-fundamentals-data/grades.csv",index_col=0)

<br>Show the first 5 rows of data.

In [None]:
df.head()

<br>We can use a boxplot to visualize the data.

In [None]:
# Draw one box per column to compare the distribution of marks across subjects
df.boxplot()

Notice that the marks are spread across a range of roughly 1 to 100.

<br>We will use the scaling functions from scikit-learn to preprocess and scale our data.
<br>**Min-Max normalization** scales each feature to lie between a given minimum and maximum value, often zero and one.
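As a rough sketch of what this means, each column is rescaled with $x' = (x - x_{min}) / (x_{max} - x_{min})$. The small `marks` array below is hypothetical and only serves to compare a manual computation with scikit-learn's `MinMaxScaler` at its default `feature_range=(0, 1)`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical marks for three students in two subjects (illustrative values only).
marks = np.array([[40.0, 55.0],
                  [70.0, 65.0],
                  [100.0, 95.0]])

# Manual min-max: subtract each column's minimum, then divide by the column's range.
col_min = marks.min(axis=0)
col_max = marks.max(axis=0)
manual = (marks - col_min) / (col_max - col_min)

# The same computation with scikit-learn (default feature_range is (0, 1)).
sklearn_scaled = MinMaxScaler().fit_transform(marks)

print(manual)
print("Matches MinMaxScaler:", np.allclose(manual, sklearn_scaled))
```

Note that the scaling is done per column, so every feature ends up with its own minimum mapped to 0 and its own maximum mapped to 1.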

In [None]:
# Create a Min-Max scaler (scales each feature to the range 0 to 1 by default)
scaler = preprocessing.MinMaxScaler()

<br>We will use Min-Max Scaling to scale all the columns of data.

In [None]:
# Fit the scaler to the data and transform it in one step
data_scaled = scaler.fit_transform(df)

<br>Transform the numpy array containing our scaled data into a pandas data frame.

In [None]:
df_new = pd.DataFrame(data_scaled, index=df.index)

<br>Show the first 5 rows of the scaled data.

In [None]:
df_new.head()

As you can see, our values are scaled into a *range from 0 to 1*.

In [None]:
df_new.boxplot()

Another common scaling technique is the **Standard Scaler**.
<br>The Standard Scaler standardizes features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

$$z = \frac{x - u}{s}$$

where $u$ is the mean of the training samples or zero if `with_mean = False`, and $s$ is the standard deviation of the training samples or one if `with_std = False`.
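The formula can be checked by hand with a minimal sketch. The `values` array is made up for illustration. One detail worth noting is that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default for `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative single-feature data (not taken from the grades dataset).
values = np.array([[50.0], [60.0], [70.0], [80.0], [90.0]])

# Manual standard score: z = (x - u) / s
u = values.mean(axis=0)
s = values.std(axis=0)  # population standard deviation (ddof=0)
z_manual = (values - u) / s

# The same computation with scikit-learn's StandardScaler.
z_sklearn = StandardScaler().fit_transform(values)

print(z_manual.ravel())
print("Matches StandardScaler:", np.allclose(z_manual, z_sklearn))
```

After the transformation the column has a mean of 0 and a standard deviation of 1.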

**Standardization** of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

<br>Now use the standard scaler to scale the data

In [None]:
# Create a standard scaler (zero mean, unit variance per feature)
scaler = preprocessing.StandardScaler()

<br>Apply the scaler to the dataset and use `df_new.head()` to view the scaled data.

In [None]:
data_scaled = scaler.fit_transform(df)
df_new = pd.DataFrame(data_scaled, index=df.index)
df_new.head()

<br>
The boxplot shows that the range of the data has changed: each column is now centered on a mean of 0.

In [None]:
df_new.boxplot()

# Example of improved performance using Standard Scaling 

In this example, we will use the iris dataset from sklearn to see how feature scaling can improve performance.

In [None]:
iris=datasets.load_iris()

Assign the variables used for training:
<ul> 
    <li> iris.data = The features data (X) that are needed for training </li>
    <li> iris.target = The corresponding category (y) that the data belongs to </li>
</ul>

In [None]:
data = iris.data
target = iris.target

Splitting the data into training set and test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=123)

Training the model with Logistic Regression

In [None]:
model = LogisticRegression()
model.fit(X_train,y_train)

Get the prediction using the test set

In [None]:
predict = model.predict(X_test)

Calculating the accuracy score for the trained model

In [None]:
accuracy = metrics.accuracy_score(y_test, predict)  # argument order is (y_true, y_pred)
print(accuracy)

The accuracy should be around `0.933333`, but we may be able to improve it further with feature scaling.

Scale the training and test set data with Standard Scaler

In [None]:
scaler = preprocessing.StandardScaler()
# Fit the scaler on the training set only, then reuse its mean and standard deviation for the test set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Now let us train the model again with the scaled data and look at the output of the accuracy

In [None]:
model.fit(X_train,y_train)
predict = model.predict(X_test)
accuracy = metrics.accuracy_score(y_test, predict)  # argument order is (y_true, y_pred)
print(accuracy)

We can see that there is an improvement in the accuracy after we scaled the data.
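One possible way to make sure the scaler only ever learns its statistics from the training split is to chain it with the model using `make_pipeline`. This is a minimal sketch of that alternative, not the approach used in the cells above. It reloads the iris data and reuses the same split parameters so that it can run on its own. The name `pipeline_accuracy` is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the data and split it with the same parameters as above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# fit() learns the scaling statistics from X_train only;
# predict() applies those same statistics to X_test before classifying.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

pipeline_accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print(pipeline_accuracy)
```

Bundling the two steps this way also means the same object can be passed to cross-validation utilities without scaling the data by hand for each fold.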