## Importance of Normalization

In this assignment we study the effect of normalization. When the ranges of different dimensions differ by orders of magnitude, a learned model can perform much worse than expected. We start with a synthetic dataset with two dimensions. In the first dimension the data is uniform over $[-10^4, 10^4]$, and in the second dimension it is uniform over $[-1, 1]$. Labels are assigned according to the sign of the second dimension.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

n_train = 200
n_test = 200

# Train set
X_train = np.random.uniform(-1, 1, [n_train, 2])
X_train[:, 0] = 1e4 * X_train[:, 0]
Y_train = X_train[:, 1]>0

# Test set
X_test = np.random.uniform(-1, 1, [n_test, 2])
X_test[:, 0] = 1e4 * X_test[:, 0]
Y_test = X_test[:, 1]>0

### Problem 1

Train a linear SVM with $C=1$, a k-nearest neighbor classifier (with $k=5$), and a decision tree classifier using (X_train, Y_train), then test them on (X_test, Y_test). Print their test accuracies.

In [2]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# ====== Your code here ========
clf1 = SVC(C=1, kernel='linear')
clf1.fit(X_train, Y_train)
print("SVC test accuracy: ", accuracy_score(Y_test, clf1.predict(X_test)))
# ====== Your code here ========

SVC test accuracy:  0.93


In [3]:
from sklearn.neighbors import KNeighborsClassifier
# ====== Your code here ========
clf2 = KNeighborsClassifier(n_neighbors=5)
clf2.fit(X_train, Y_train)
print("KNN test accuracy: ", accuracy_score(Y_test, clf2.predict(X_test)))
# ====== Your code here ========

KNN test accuracy:  0.51


In [4]:
from sklearn import tree
# ====== Your code here ========
clf3 = tree.DecisionTreeClassifier()
clf3.fit(X_train, Y_train)
print("Decision tree test accuracy: ", accuracy_score(Y_test, clf3.predict(X_test)))
# ====== Your code here ========

Decision tree test accuracy:  1.0


### Problem 2

Now we implement the normalization step. You cannot use existing tools for normalization. 

First we compute the mean and standard deviation of each column in X_train: for dimension $i$ let $\mu_i$ and $\sigma_i$ be the mean and standard deviation respectively. Then we normalize both X_train and X_test using the mean and standard deviation we just obtained, meaning that for each row $x=(x_1, x_2)$ of X_train and X_test, we compute $x'=(x_1', x_2')$ by
$$
x_i' = \frac{x_i-\mu_i}{\sigma_i}, \quad i=1, 2
$$
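
As a small worked example of this formula (the column values $1, 2, 3$ are made up for illustration, and the standard deviation is taken as the population standard deviation, which is an assumption that matches the default of NumPy's `np.std`):

$$
\mu = \frac{1+2+3}{3} = 2, \qquad \sigma = \sqrt{\frac{(1-2)^2+(2-2)^2+(3-2)^2}{3}} = \sqrt{\tfrac{2}{3}} \approx 0.816
$$
$$
x' = \left(\frac{1-2}{\sigma},\ \frac{2-2}{\sigma},\ \frac{3-2}{\sigma}\right) \approx (-1.225,\ 0,\ 1.225)
$$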

In [5]:
mean = np.zeros((1, X_train.shape[1]))
std = np.ones((1, X_train.shape[1]))


# ====== Your code here ========

# Compute the mean and standard deviation
mean = np.mean(X_train, axis=0)
std = np.std(X_train, axis=0)

# ====== Your code here ========

def transform(X, mean, std):
    # X: n x d matrix
    # mean and std: per-feature statistics, shape (d,) or (1, d); broadcast across rows
    # X_out: n x d matrix
    X_out = np.zeros(X.shape)
    # ====== Your code here ========
   
    X_out = (X - mean) / std
   
    # ====== Your code here ========
    return X_out 

X_train_scaled = transform(X_train, mean, std)
X_test_scaled = transform(X_test, mean, std)

print(mean.shape)
print(std.shape)

(2,)
(2,)
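
One way to sanity-check the normalization step is to confirm that the scaled training columns have mean close to 0 and standard deviation close to 1. The sketch below regenerates data with the same recipe as the first cell so that it runs on its own. The seeded generator `rng` and the names `X_tr`, `X_te`, `mu`, and `sigma` are assumptions made for this illustration, not part of the assignment code.

```python
import numpy as np

# Assumption: a seeded generator, used only so this sketch is reproducible.
rng = np.random.default_rng(0)

# Same recipe as the assignment data:
# feature 0 spans [-1e4, 1e4], feature 1 spans [-1, 1].
X_tr = rng.uniform(-1, 1, size=(200, 2))
X_tr[:, 0] *= 1e4
X_te = rng.uniform(-1, 1, size=(200, 2))
X_te[:, 0] *= 1e4

# Statistics come from the training set only.
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)

X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma

# Training columns are standardized up to floating-point error.
print("train mean:", X_tr_scaled.mean(axis=0))
print("train std: ", X_tr_scaled.std(axis=0))

# Test columns are only approximately standardized,
# because they reuse the training statistics.
print("test mean: ", X_te_scaled.mean(axis=0))
print("test std:  ", X_te_scaled.std(axis=0))
```

Reusing the training statistics for the test set mirrors what `transform(X_test, mean, std)` does in the cell above: the test data is treated as unseen, so its own mean and standard deviation are never computed.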


Now train the same classifiers as in Problem 1, this time using X_train_scaled and Y_train. Test your models using X_test_scaled and Y_test, and compare the test accuracies with those from Problem 1.
1. How does normalization affect the accuracy of SVM? 
2. How does normalization affect the accuracy of k-NN?
3. How does normalization affect the accuracy of decision tree?
4. Are they affected similarly or differently? Briefly explain why the effects might be similar or different.

In [6]:
# ====== Your code here ========
clf1 = SVC(C=1, kernel='linear')
clf1.fit(X_train_scaled, Y_train)
print("SVC test accuracy: ", accuracy_score(Y_test, clf1.predict(X_test_scaled)))

clf2 = KNeighborsClassifier(n_neighbors=5)
clf2.fit(X_train_scaled, Y_train)
print("KNN test accuracy: ", accuracy_score(Y_test, clf2.predict(X_test_scaled)))

clf3 = tree.DecisionTreeClassifier()
clf3.fit(X_train_scaled, Y_train)
print("Decision tree test accuracy: ", accuracy_score(Y_test, clf3.predict(X_test_scaled)))
# ====== Your code here ========

SVC test accuracy:  0.975
KNN test accuracy:  0.955
Decision tree test accuracy:  1.0


### Your response:
1. Normalization raises the SVM test accuracy from 0.93 to 0.975. The gain looks modest, but it brings the accuracy close to perfect.
2. Normalization raises the k-NN test accuracy from 0.51 to 0.955. This is a large improvement: from roughly chance level on a binary problem to nearly perfect.
3. Normalization does not change the decision tree's test accuracy, which was already 1.0 without it.
4. SVM and k-NN are affected similarly: the test accuracy of both improves once the training and test features are normalized. The decision tree is affected differently: its test accuracy was already perfect before normalization and stays the same afterwards. Normalization is a form of feature scaling that brings all features to a comparable scale. This helps models such as SVM and k-NN, whose predictions depend on distances or similarities between points and are therefore sensitive to the scale of each feature. When one feature has a much larger range than the others, it can dominate the model. Here the first dimension, which carries no label information, swamps the second dimension, which determines the label, and the test predictions suffer. Putting the features on the same scale removes this effect, which is why the SVM and k-NN test accuracies improve. A decision tree, in contrast, only compares one feature at a time against a threshold as it moves down the tree, so rescaling a feature does not really affect the model, as our experiment shows. Nevertheless, it is usually a good idea to have the features on the same scale.
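
The scale argument in the response can be illustrated numerically. The sketch below is not part of the assignment solution: it regenerates data with the same recipe, using a seeded generator `rng` and a hypothetical helper `first_feature_share`, both introduced here as assumptions. It measures how much of the total pairwise squared Euclidean distance comes from the first feature, and checks that scaling preserves the ordering of values within each feature, which is all a threshold comparison depends on.

```python
import numpy as np

# Assumption: a seeded generator, used only so this sketch is reproducible.
rng = np.random.default_rng(0)

# Same recipe as the assignment data.
X = rng.uniform(-1, 1, size=(200, 2))
X[:, 0] *= 1e4
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)


def first_feature_share(X):
    """Fraction of the summed pairwise squared distance that comes from feature 0."""
    sq_diff = (X[:, None, :] - X[None, :, :]) ** 2  # shape (n, n, d)
    per_feature = sq_diff.sum(axis=(0, 1))          # shape (d,)
    return per_feature[0] / per_feature.sum()


raw_share = first_feature_share(X)
scaled_share = first_feature_share(X_scaled)
print("share of feature 0 before scaling:", raw_share)
print("share of feature 0 after scaling: ", scaled_share)

# A threshold split only depends on how the samples are ordered along a feature.
# Subtracting a constant and dividing by a positive constant keeps that ordering.
same_order = all(
    np.array_equal(
        np.argsort(X[:, j], kind="stable"),
        np.argsort(X_scaled[:, j], kind="stable"),
    )
    for j in range(X.shape[1])
)
print("ordering preserved in every feature:", same_order)
```

Before scaling, nearly all of the distance comes from the uninformative first feature, so nearest neighbors are chosen almost without regard to the feature that determines the label. After scaling, both features contribute equally, while the per-feature ordering that threshold splits rely on is unchanged.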