# Standardizing Data
* Standardization is a preprocessing method that transforms continuous data so that it looks closer to normally distributed
* In scikit-learn, this is often a necessary step, because many models (for example, distance-based models such as k-nearest neighbors, and linear models) implicitly assume that features are on comparable scales and roughly normally distributed.
* Two standardization methods: log normalization and feature scaling
* Standardization methods are applied to continuous numerical data

## When to standardize
* Data features have high variance
* Data features are continuous and on different scales
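
To see why features on different scales are a problem for a distance-based model, the sketch below measures how much each feature contributes to a squared Euclidean distance. The two sample points are made up for illustration and are not taken from the wine data.

```python
import numpy as np

# Two hypothetical samples: feature 0 is on a small scale, feature 1 on a large one.
a = np.array([1.0, 1000.0])
b = np.array([3.0, 1010.0])

# Per-feature contribution to the squared Euclidean distance.
squared_diffs = (a - b) ** 2
share = squared_diffs / squared_diffs.sum()

# Feature 1 accounts for most of the distance, even though feature 0
# changed far more in relative terms (1 -> 3 versus 1000 -> 1010).
print(share)
```

A k-nearest neighbors model would therefore pick neighbors almost entirely by the large-scale feature unless the data is standardized first.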

### Import the dataset

In [1]:
import pandas as pd
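# Load the wine data from a CSV file into a DataFrame
wine = pd.read_csv("wine.csv")
# Features: every column except the class label
X = wine.drop(["Type"], axis=1)
# Target: the wine class label
y = wine["Type"]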

In [8]:
print(X.shape)
print(y.value_counts())

(178, 13)
2    71
1    59
3    48
Name: Type, dtype: int64


In [21]:
from sklearn.model_selection import train_test_split
# No random_state is set, so the split (and the score below) changes between runs
X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)  
# Fit the k-nearest neighbors model to the training data
knn.fit(X_train,y_train)
# Score the model on the test data
print(knn.score(X_test,y_test))

0.7555555555555555


## 1. Log Normalization

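* Log normalization is a method for standardizing data when a particular column has high variance.
* It replaces each value with its natural logarithm, which compresses large values more than small ones, so it only applies to strictly positive values.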
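
As a minimal sketch of the effect, the block below applies `np.log` to a synthetic right-skewed column and compares the variance before and after. The column name `feature` and the lognormal parameters are assumptions chosen for illustration; this is not the wine data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed, strictly positive column (not the wine data).
df = pd.DataFrame({"feature": rng.lognormal(mean=6.0, sigma=0.5, size=500)})

# np.log is only defined for positive values, so check before transforming.
assert (df["feature"] > 0).all()
df["feature_log"] = np.log(df["feature"])

# The log-transformed column has a far smaller variance than the raw one.
var_before = df["feature"].var()
var_after = df["feature_log"].var()
print(var_before, var_after)
```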



In [13]:
X.var()

Alcohol                             0.659062
Malic acid                          1.248015
Ash                                 0.075265
Alcalinity of ash                  11.152686
Magnesium                         203.989335
Total phenols                       0.391690
Flavanoids                          0.997719
Nonflavanoid phenols                0.015489
Proanthocyanins                     0.327595
Color intensity                     5.374449
Hue                                 0.052245
OD280/OD315 of diluted wines        0.504086
Proline                         99166.717355
dtype: float64

In [47]:
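# Log normalization: take the natural log of the two highest-variance columns
import numpy as np
# Work on a copy so the original features stay unchanged
X_norm = X.copy()
X_norm['Proline'] = np.log(X_norm['Proline'])
X_norm['Magnesium'] = np.log(X_norm['Magnesium'])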

In [48]:
print(X_norm.var())

Alcohol                          0.659062
Malic acid                       1.248015
Ash                              0.075265
Alcalinity of ash               11.152686
Magnesium                        0.018667
Total phenols                    0.391690
Flavanoids                       0.997719
Nonflavanoid phenols             0.015489
Proanthocyanins                  0.327595
Color intensity                  5.374449
Hue                              0.052245
OD280/OD315 of diluted wines     0.504086
Proline                          0.172314
dtype: float64


In [49]:
X_train, X_test, y_train, y_test = train_test_split(X_norm, y)
knn = KNeighborsClassifier(n_neighbors=5)  

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train,y_train)

# Score the model on the test data
print(knn.score(X_test,y_test))

0.9777777777777777


## 2. Feature Scaling
* Scaling is a standardization method that is useful when your dataset contains continuous features that are on different scales
* Feature scaling transforms the features in your dataset so that each has a mean of zero and a variance of one. It changes the location and spread of each feature, not the shape of its distribution.
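
The transformation itself is the z-score, `z = (x - mean) / std`, computed per column. The sketch below reproduces it by hand on a small made-up array (`X_demo` is a hypothetical name, not part of the notebook) and checks that it matches `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales.
X_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

# z = (x - mean) / std, per column, using the population std (ddof=0).
z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# StandardScaler applies the same formula.
z_sklearn = StandardScaler().fit_transform(X_demo)
print(np.allclose(z_manual, z_sklearn))
```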

In [50]:
import warnings
warnings.filterwarnings("ignore")

# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

# Create the scaler
ss = StandardScaler()

# Take a subset of the DataFrame you want to scale 
wine_subset = wine[['Ash','Alcalinity of ash','Magnesium']]
print(wine_subset.describe())

# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)

print(pd.DataFrame(wine_subset_scaled).describe())

              Ash  Alcalinity of ash   Magnesium
count  178.000000         178.000000  178.000000
mean     2.366517          19.494944   99.741573
std      0.274344           3.339564   14.282484
min      1.360000          10.600000   70.000000
25%      2.210000          17.200000   88.000000
50%      2.360000          19.500000   98.000000
75%      2.557500          21.500000  107.000000
max      3.230000          30.000000  162.000000
                  0             1             2
count  1.780000e+02  1.780000e+02  1.780000e+02
mean  -8.657245e-16 -1.160121e-16 -1.995907e-17
std    1.002821e+00  1.002821e+00  1.002821e+00
min   -3.679162e+00 -2.671018e+00 -2.088255e+00
25%   -5.721225e-01 -6.891372e-01 -8.244151e-01
50%   -2.382132e-02  1.518295e-03 -1.222817e-01
75%    6.981085e-01  6.020883e-01  5.096384e-01
max    3.156325e+00  3.154511e+00  4.371372e+00
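
The scaled `std` above prints as 1.002821 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `describe()` reports the sample standard deviation (`ddof=1`). The sketch below checks this on synthetic data with the same row count; `demo` is a stand-in, not the wine subset. It also passes `columns=` so the scaled DataFrame keeps its column names instead of 0, 1, 2.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 178  # same row count as the wine data
demo = pd.DataFrame({"x": rng.normal(size=n)})  # synthetic stand-in column

# Keep the column names when wrapping the scaled array in a DataFrame.
scaled = pd.DataFrame(StandardScaler().fit_transform(demo), columns=demo.columns)

pop_std = scaled["x"].std(ddof=0)  # what StandardScaler normalizes to 1
sample_std = scaled["x"].std()     # pandas default, ddof=1

# The sample std equals sqrt(n / (n - 1)) once the population std is 1.
print(pop_std, sample_std, np.sqrt(n / (n - 1)))
```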


In [51]:
# Create the scaling method.
ss = StandardScaler()

# Apply the scaling method to the dataset used for modeling.
X_scaled = ss.fit_transform(X)


X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train,y_train)

# Score the model on the test data.
print(knn.score(X_test,y_test))

0.9333333333333333
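
The cell above fits the scaler on all of `X` before splitting, so statistics from the test rows influence the transform. One common alternative, sketched below, is to split first and put the scaler and the model in a pipeline, so the scaler is fit on the training rows only. This sketch assumes scikit-learn's bundled `load_wine` dataset as a stand-in for `wine.csv`, and the `random_state` and `stratify` settings are choices made here for reproducibility, not part of the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine dataset stands in for wine.csv.
X_demo, y_demo = load_wine(return_X_y=True)

# Split first, keeping class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training data only,
# then reuses those statistics to transform the test data.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

Fixing `random_state` also makes scores comparable across preprocessing choices, since every variant is evaluated on the same split.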
