# **Normalization**

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This idea underlies the Vector Space Model often used in text classification and clustering contexts.
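To make the link between unit norm and similarity concrete, here is a minimal sketch. It assumes two made-up sample vectors `a` and `b` (they are not part of the dataset used later) and checks that the dot product of the L2-normalized vectors matches the cosine similarity of the raw vectors.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two illustrative sample vectors (hypothetical values, one row each).
a = np.array([[3.0, 4.0, 0.0]])
b = np.array([[6.0, 8.0, 5.0]])

# Scale each sample to unit L2 norm.
a_unit = normalize(a, norm='l2')
b_unit = normalize(b, norm='l2')

# Dot product of the two unit vectors.
dot_of_units = (a_unit @ b_unit.T).item()

# Cosine similarity computed directly from the raw vectors.
cosine = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot_of_units, cosine)
```

The two printed numbers should agree up to floating-point error, which is why a plain dot product is enough to compare samples once they have been normalized.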

The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, either using the L1, L2, or max norms.

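Because every sample ends up with the same norm, samples are compared by the direction of their feature vectors rather than by their overall magnitude, which can help models that rely on dot products or distances between samples.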


In [None]:
import numpy as np
from sklearn import preprocessing

In [None]:
import pandas as pd

# Create a DataFrame with three columns on different scales
df = pd.DataFrame({'small': [-5, -2, 0, 8, 12, 13],
                   'medium': [15, 18, 22, 23, 23, 25],
                   'big': [30, 35, 40, 45, 50, 55]})
df

|   | small | medium | big |
|---|-------|--------|-----|
| 0 | -5 | 15 | 30 |
| 1 | -2 | 18 | 35 |
| 2 | 0 | 22 | 40 |
| 3 | 8 | 23 | 45 |
| 4 | 12 | 23 | 50 |
| 5 | 13 | 25 | 55 |


**1) L1 Normalization**

The L1 norm is calculated as the sum of the absolute values of the vector's components. Each sample (row) is divided by this sum, so the absolute values in every row add up to 1.


In [None]:
#L1 
normalized = preprocessing.normalize(df, norm='l1')
normalized

array([[-0.1       ,  0.3       ,  0.6       ],
       [-0.03636364,  0.32727273,  0.63636364],
       [ 0.        ,  0.35483871,  0.64516129],
       [ 0.10526316,  0.30263158,  0.59210526],
       [ 0.14117647,  0.27058824,  0.58823529],
       [ 0.13978495,  0.2688172 ,  0.59139785]])

**2) L2 Normalization**

The L2 norm is calculated as the square root of the sum of the squared vector components. Each sample (row) is divided by this value, so every row has unit Euclidean length.

In [None]:
#L2 
normalized = preprocessing.normalize(df, norm='l2')
normalized

array([[-0.14744196,  0.44232587,  0.88465174],
       [-0.05075096,  0.45675865,  0.88814181],
       [ 0.        ,  0.48191875,  0.87621591],
       [ 0.15635262,  0.44951379,  0.87948349],
       [ 0.21303267,  0.40831262,  0.88763612],
       [ 0.2103626 ,  0.40454346,  0.8899956 ]])

**3) Max Normalization**

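The max norm is calculated as the maximum absolute value of the vector's components. Each sample (row) is divided by this value, so the largest-magnitude entry in every row becomes 1.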

In [None]:
#max
normalized = preprocessing.normalize(df, norm='max')
normalized

array([[-0.16666667,  0.5       ,  1.        ],
       [-0.05714286,  0.51428571,  1.        ],
       [ 0.        ,  0.55      ,  1.        ],
       [ 0.17777778,  0.51111111,  1.        ],
       [ 0.24      ,  0.46      ,  1.        ],
       [ 0.23636364,  0.45454545,  1.        ]])
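The three norms above can also be reproduced by hand with NumPy, which may help to see what `normalize` does row by row. This is a sketch under two assumptions: the array `X` simply repeats the values of the DataFrame above, and the max norm is taken as the largest absolute value in each row (for this data that is just the largest value).

```python
import numpy as np
from sklearn.preprocessing import normalize

# Same values as the DataFrame above; each row is one sample.
X = np.array([[-5, 15, 30],
              [-2, 18, 35],
              [ 0, 22, 40],
              [ 8, 23, 45],
              [12, 23, 50],
              [13, 25, 55]], dtype=float)

# Per-row norms; keepdims=True keeps a column shape so the division
# below broadcasts across each row.
l1_norms = np.abs(X).sum(axis=1, keepdims=True)
l2_norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
max_norms = np.abs(X).max(axis=1, keepdims=True)

manual_l1 = X / l1_norms
manual_l2 = X / l2_norms
manual_max = X / max_norms

# Compare the manual results with scikit-learn's normalize.
print(np.allclose(manual_l1, normalize(X, norm='l1')))
print(np.allclose(manual_l2, normalize(X, norm='l2')))
print(np.allclose(manual_max, normalize(X, norm='max')))
```

Note that the division is applied per row (per sample), not per column, which is what distinguishes normalization from the column-wise rescaling and standardization discussed elsewhere in this notebook.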

# **Rescaling Data**

When our data is comprised of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes to all have the same scale.
This is useful for optimization algorithms used at the core of machine learning methods, such as gradient descent.
It is also useful for algorithms that weight inputs, like regression and neural networks, and algorithms that use distance measures, like K-Nearest Neighbors.
We can rescale our data in scikit-learn using the MinMaxScaler class, which by default maps each attribute to the range 0 to 1.


In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
print(scaler.fit(df))

MinMaxScaler()


In [None]:
print(scaler.data_max_)

[13. 25. 55.]


In [None]:
print(scaler.transform(df))

[[0.         0.         0.        ]
 [0.16666667 0.3        0.2       ]
 [0.27777778 0.7        0.4       ]
 [0.72222222 0.8        0.6       ]
 [0.94444444 0.8        0.8       ]
 [1.         1.         1.        ]]
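The rescaled values above follow the usual min-max formula, (x - min) / (max - min), applied to each column separately. The sketch below recomputes it by hand; the array `X` is assumed to hold the same values as the three numeric columns of the DataFrame.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same values as the 'small', 'medium' and 'big' columns above.
X = np.array([[-5, 15, 30],
              [-2, 18, 35],
              [ 0, 22, 40],
              [ 8, 23, 45],
              [12, 23, 50],
              [13, 25, 55]], dtype=float)

# Column-wise minimum and maximum (axis=0 walks down each column).
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Min-max formula: shift by the minimum, divide by the range.
manual_scaled = (X - col_min) / (col_max - col_min)

# Compare with scikit-learn's default MinMaxScaler (range 0 to 1).
sklearn_scaled = MinMaxScaler().fit_transform(X)
print(np.allclose(manual_scaled, sklearn_scaled))
```

For example, the value -2 in the first column becomes (-2 - (-5)) / (13 - (-5)) = 3 / 18, which matches the 0.16666667 shown in the output above.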


# **Standardizing Data**

Standardization rescales each attribute to have a mean of 0 and a standard deviation of 1 by subtracting the attribute's mean and dividing by its standard deviation. It is most useful for attributes that are roughly Gaussian with differing means and standard deviations, which it maps to a standard Gaussian distribution. We can standardize data using scikit-learn with the StandardScaler class.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
print(scaler.fit(df))

StandardScaler()


In [None]:
print(scaler.mean_)

[ 4.33333333 21.         42.5       ]


In [None]:
print(scaler.transform(df))

[[-1.33484762 -1.75662013 -1.46385011]
 [-0.90578946 -0.87831007 -0.87831007]
 [-0.61975068  0.29277002 -0.29277002]
 [ 0.52440442  0.58554004  0.29277002]
 [ 1.09648198  0.58554004  0.87831007]
 [ 1.23950137  1.17108009  1.46385011]]
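The standardized values above can be recomputed by hand as (x - mean) / std for each column. The sketch below assumes the array `X` repeats the three numeric columns of the DataFrame, and it uses the population standard deviation (`ddof=0`), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same values as the 'small', 'medium' and 'big' columns above.
X = np.array([[-5, 15, 30],
              [-2, 18, 35],
              [ 0, 22, 40],
              [ 8, 23, 45],
              [12, 23, 50],
              [13, 25, 55]], dtype=float)

# Column-wise mean and population standard deviation (ddof=0).
col_mean = X.mean(axis=0)
col_std = X.std(axis=0, ddof=0)

# Standard score: subtract the mean, divide by the standard deviation.
manual_standardized = (X - col_mean) / col_std

# Compare with scikit-learn's StandardScaler.
sklearn_standardized = StandardScaler().fit_transform(X)
print(np.allclose(manual_standardized, sklearn_standardized))

# After the transform, every column has mean ~0 and standard deviation ~1.
print(manual_standardized.mean(axis=0))
print(manual_standardized.std(axis=0))
```

Using the sample standard deviation (`ddof=1`, the pandas default for `DataFrame.std`) would give slightly different numbers, so the choice of `ddof` matters when comparing results.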


# **Binarizing Data**

We can transform our data using a binary threshold. All values above the threshold are marked 1, and all values equal to or below it are marked 0. This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering, when you want to add new features that indicate something meaningful. We can create new binary attributes in Python using scikit-learn with the Binarizer class, whose default threshold is 0.0.

In [None]:
from sklearn.preprocessing import Binarizer

transformer = Binarizer().fit(df)  # fit does nothing.
transformer

Binarizer()

In [None]:
transformer.transform(df)

array([[0, 1, 1],
       [0, 1, 1],
       [0, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]])
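The example above uses the default threshold of 0.0. As a small sketch of the probability use case mentioned earlier, the snippet below applies a threshold of 0.5 to a made-up row of probabilities (the values in `probs` are hypothetical and chosen only to show the boundary behaviour).

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities; Binarizer expects a 2D array.
probs = np.array([[0.1, 0.5, 0.51, 0.9]])

# Values strictly greater than the threshold become 1, the rest become 0.
crisp = Binarizer(threshold=0.5).fit_transform(probs)
print(crisp)
```

A value exactly equal to the threshold (0.5 here) is mapped to 0, consistent with the "equal to or below" rule described above.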

# **Binning**

Data binning (or bucketing) groups data into bins (or buckets), in the sense that it replaces the values contained in a small interval with a single representative value for that interval. Sometimes binning improves accuracy in predictive models.

Data binning is a type of data preprocessing, a step which also includes handling missing values, formatting, normalization and standardization.

- Binning can be applied to convert numeric values to categorical values or to quantise (sample) numeric values.
- Converting numeric values to categorical ones includes binning by distance and binning by frequency.
- Reducing numeric values includes quantisation (or sampling).
- Binning is also a technique for data smoothing, which is used to remove noise from data. Three techniques for data smoothing are binning, regression, and outlier analysis.
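To illustrate the difference between binning by distance and binning by frequency mentioned in the list above, here is a minimal sketch. It assumes `pd.cut` as the equal-width (distance) variant and `pd.qcut` as the equal-frequency variant, applied to a standalone Series with the same values as the `small` column.

```python
import pandas as pd

# Same values as the 'small' column of the DataFrame.
small = pd.Series([-5, -2, 0, 8, 12, 13], name='small')

# Binning by distance: three intervals of equal width over the value range.
by_distance = pd.cut(small, bins=3)

# Binning by frequency: three intervals holding (roughly) equal numbers of values.
by_frequency = pd.qcut(small, q=3)

# Count how many values fall into each bin, in bin order.
print(by_distance.value_counts().sort_index())
print(by_frequency.value_counts().sort_index())
```

With equal-width bins the counts depend on how the values are spread, so some bins can end up empty, whereas quantile-based bins are built to hold about the same number of values each.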


In [None]:
df['SmallBin'] = pd.qcut(df['small'], q=3)
df

|   | small | medium | big | SmallBin |
|---|-------|--------|-----|----------|
| 0 | -5 | 15 | 30 | (-5.001, -0.667] |
| 1 | -2 | 18 | 35 | (-5.001, -0.667] |
| 2 | 0 | 22 | 40 | (-0.667, 9.333] |
| 3 | 8 | 23 | 45 | (-0.667, 9.333] |
| 4 | 12 | 23 | 50 | (9.333, 13.0] |
| 5 | 13 | 25 | 55 | (9.333, 13.0] |


In [None]:
#count frequency of each bin
df['SmallBin'].value_counts()

(-5.001, -0.667]    2
(-0.667, 9.333]     2
(9.333, 13.0]       2
Name: SmallBin, dtype: int64

In [None]:
# Binning with specific quantile edges (0%, 40%, 80%, 100%)
df['small_bin'] = pd.qcut(df['small'], q=[0, .4, .8, 1])
df

|   | small | medium | big | SmallBin | small_bin |
|---|-------|--------|-----|----------|-----------|
| 0 | -5 | 15 | 30 | (-5.001, -0.667] | (-5.001, 0.0] |
| 1 | -2 | 18 | 35 | (-5.001, -0.667] | (-5.001, 0.0] |
| 2 | 0 | 22 | 40 | (-0.667, 9.333] | (-5.001, 0.0] |
| 3 | 8 | 23 | 45 | (-0.667, 9.333] | (0.0, 12.0] |
| 4 | 12 | 23 | 50 | (9.333, 13.0] | (0.0, 12.0] |
| 5 | 13 | 25 | 55 | (9.333, 13.0] | (12.0, 13.0] |


In [None]:
# Binning with specific quantile edges and a label for each bin
df['small_bin_label'] = pd.qcut(df['small'], q=[0, .4, .8, 1], labels=['A', 'B', 'C'])
df

|   | small | medium | big | SmallBin | small_bin | small_bin_label |
|---|-------|--------|-----|----------|-----------|-----------------|
| 0 | -5 | 15 | 30 | (-5.001, -0.667] | (-5.001, 0.0] | A |
| 1 | -2 | 18 | 35 | (-5.001, -0.667] | (-5.001, 0.0] | A |
| 2 | 0 | 22 | 40 | (-0.667, 9.333] | (-5.001, 0.0] | A |
| 3 | 8 | 23 | 45 | (-0.667, 9.333] | (0.0, 12.0] | B |
| 4 | 12 | 23 | 50 | (9.333, 13.0] | (0.0, 12.0] | B |
| 5 | 13 | 25 | 55 | (9.333, 13.0] | (12.0, 13.0] | C |
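The idea of replacing the values in each bin with a single representative value, used for data smoothing, can be sketched as follows. This is only one possible approach, assuming the bin mean as the representative value and three equal-frequency bins on a standalone copy of the `small` column.

```python
import pandas as pd

# Same values as the 'small' column of the DataFrame.
small = pd.Series([-5, -2, 0, 8, 12, 13], name='small')

# Assign each value to one of three equal-frequency bins.
bins = pd.qcut(small, q=3)

# Smoothing by bin means: replace every value with the mean of its bin.
smoothed = small.groupby(bins, observed=True).transform('mean')

print(pd.DataFrame({'small': small, 'bin': bins, 'smoothed': smoothed}))
```

Other representative values, such as the bin median or the nearest bin boundary, could be swapped in by changing the aggregation passed to `transform`.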
