# Data Preprocessing: Data Normalization

Normalization is a data preprocessing technique that rescales the numeric columns of a dataset to a common scale while preserving the relative differences between values within each column. It puts all features on an equal footing, so that no feature dominates simply because it is measured in larger units.
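As a minimal sketch of the idea, min-max normalization can be written by hand with NumPy. The array `toy` and its values are hypothetical and chosen only for illustration; they are not part of the dataset used in this notebook.

```python
import numpy as np

# Hypothetical toy data: two features measured on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

# Column-wise minimum and maximum (axis=0 means "per feature").
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)

# Min-max normalization: shift each column to start at 0, then divide by its range.
toy_scaled = (toy - col_min) / (col_max - col_min)

print(toy_scaled)  # each column now spans 0 to 1
```

After scaling, both columns cover the same interval even though the raw values differed by two orders of magnitude.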

In [1]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a dataset

In [2]:
data = make_classification(n_samples=10, n_features=2, n_informative=2, n_redundant=0, n_classes=2, random_state=42)

In [3]:
type(data)

tuple

In [4]:
len(data)

2

In [145]:
data

(array([[ 1.06833894, -0.97007347],
        [-1.14021544, -0.83879234],
        [-2.8953973 ,  1.97686236],
        [-0.72063436, -0.96059253],
        [-1.96287438, -0.99225135],
        [-0.9382051 , -0.54304815],
        [ 1.72725924, -1.18582677],
        [ 1.77736657,  1.51157598],
        [ 1.89969252,  0.83444483],
        [-0.58723065, -1.97171753]]),
 array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0]))

# Splitting the Tuple into Features and Labels

## Extracting Features

In [5]:
X = data[0]

In [6]:
X

array([[ 1.06833894, -0.97007347],
       [-1.14021544, -0.83879234],
       [-2.8953973 ,  1.97686236],
       [-0.72063436, -0.96059253],
       [-1.96287438, -0.99225135],
       [-0.9382051 , -0.54304815],
       [ 1.72725924, -1.18582677],
       [ 1.77736657,  1.51157598],
       [ 1.89969252,  0.83444483],
       [-0.58723065, -1.97171753]])

In [7]:
type(X)

numpy.ndarray

In [8]:
X.shape

(10, 2)

## Extracting Labels

In [9]:
y = data[1]

In [10]:
y

array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])
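The two indexing steps above can also be written as a single tuple-unpacking assignment. The sketch below rebuilds the same dataset so that it runs on its own; it assumes the same `make_classification` arguments used earlier in this notebook.

```python
from sklearn.datasets import make_classification

# Same arguments as the dataset created earlier in the notebook.
data = make_classification(n_samples=10, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

# Tuple unpacking: the first element is the feature matrix, the second the labels.
X, y = data

print(X.shape, y.shape)  # (10, 2) (10,)
```

Unpacking is equivalent to `X = data[0]` followed by `y = data[1]`; it simply does both in one line.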

# Performing a train test split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [13]:
X_train.shape

(8, 2)

# Performing MinMax Scaling (range = (0, 1), i.e., Data Normalization)

In [14]:
from sklearn.preprocessing import MinMaxScaler

In [15]:
sc = MinMaxScaler()

In [16]:
help(MinMaxScaler)

Help on class MinMaxScaler in module sklearn.preprocessing._data:

class MinMaxScaler(sklearn.base.OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)
 |  MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)
 |
 |  Transform features by scaling each feature to a given range.
 |
 |  This estimator scales and translates each feature individually such
 |  that it is in the given range on the training set, e.g. between
 |  zero and one.
 |
 |  The transformation is given by::
 |
 |      X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
 |      X_scaled = X_std * (max - min) + min
 |
 |  where min, max = feature_range.
 |
 |  This transformation is often used as an alternative to zero mean,
 |  unit variance scaling.
 |
 |  `MinMaxScaler` doesn't reduce the effect of outliers, but it linearly
 |  scales them down into a fixed range, where the largest occurring data point
 |  corresponds to the maximum value and the smallest one corresponds to

## Fit and transform the training data

In [17]:
sc.fit(X_train)  # fit computes the per-feature minimum (X_min) and maximum (X_max) used for later scaling

In [18]:
X_train_Scaled = sc.transform(X_train)  # scale the data using the fitted minimum and maximum values

In [19]:
X_train_Scaled

array([[0.41885108, 0.36181853],
       [0.84826376, 0.25367198],
       [1.        , 0.88216362],
       [0.        , 1.        ],
       [0.49396176, 0.        ],
       [0.1995656 , 0.2480553 ],
       [0.46541255, 0.25607308],
       [0.98927673, 0.19903124]])

In [20]:
X_train_Scaled.max()

np.float64(1.0000000000000002)

In [21]:
X_train_Scaled.min()

np.float64(0.0)
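To connect the scaler to the formula quoted in the help text, the sketch below recomputes the scaled training data by hand from the fitted `data_min_` and `data_max_` attributes and compares it with `transform`. It rebuilds the dataset and split so that it runs on its own; the variable name `manual` is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Recreate the dataset and split used in this notebook.
X, y = make_classification(n_samples=10, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

sc = MinMaxScaler().fit(X_train)

# Apply the documented formula using the per-feature min and max learned by fit.
manual = (X_train - sc.data_min_) / (sc.data_max_ - sc.data_min_)

# Compare with a tolerance rather than with ==, because of floating-point rounding.
print(np.allclose(manual, sc.transform(X_train)))
print(np.isclose(sc.transform(X_train).max(), 1.0))
```

The maximum printed above is `1.0000000000000002` rather than exactly `1.0`. This is most likely floating-point rounding inside the scaler, which is why the comparison uses `np.allclose` and `np.isclose` instead of exact equality.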

## Using fit_transform in a single line of code

In [22]:
X_train_Scaled_2 = sc.fit_transform(X_train)

In [23]:
X_train_Scaled_2

array([[0.41885108, 0.36181853],
       [0.84826376, 0.25367198],
       [1.        , 0.88216362],
       [0.        , 1.        ],
       [0.49396176, 0.        ],
       [0.1995656 , 0.2480553 ],
       [0.46541255, 0.25607308],
       [0.98927673, 0.19903124]])
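The identical output suggests that `fit_transform` is a shortcut for calling `fit` and then `transform` on the same data. A small check of that equivalence, using a hypothetical array `sample` rather than the notebook's dataset:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample data, used only to compare the two call styles.
sample = np.array([[1.0, -3.0],
                   [4.0,  0.0],
                   [2.0,  6.0]])

# Two steps: learn min/max, then apply the scaling.
two_step = MinMaxScaler().fit(sample).transform(sample)

# One step: learn and apply in a single call.
one_step = MinMaxScaler().fit_transform(sample)

print(np.allclose(two_step, one_step))  # True
```

Note that `fit_transform` refits the scaler, so it is meant for the training data; the test data should only be passed to `transform`.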

# Transforming testing data

In [24]:
X_test_Scaled = sc.transform(X_test)

In [25]:
X_test_Scaled

array([[1.0261785 , 0.71067636],
       [0.37561963, 0.28691966]])
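The first test value, `1.0261785`, lies outside [0, 1]. That is expected: the scaler reuses the minimum and maximum learned from the training data, so a test value larger than the training maximum maps above 1. The `clip` parameter shown in the `MinMaxScaler` signature above can force transformed values back into the range. The sketch below uses hypothetical arrays `train` and `test` to show the difference.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature data: the test set contains values outside the training range.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[12.0], [-2.0], [4.0]])

plain = MinMaxScaler().fit(train)             # default: clip=False
clipped = MinMaxScaler(clip=True).fit(train)  # clip transformed values to feature_range

print(plain.transform(test).ravel())    # values can fall below 0 or above 1
print(clipped.transform(test).ravel())  # values are limited to [0, 1]
```

Whether to clip is a judgment call: clipping keeps inputs inside the expected range, but it also hides how far a test point lies beyond the training data.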

# Performing MinMax Scaling (range = (-1, 1))

In [166]:
sc2 = MinMaxScaler(feature_range=(-1, 1))

In [167]:
X_train_new = sc2.fit_transform(X_train)

In [168]:
X_train_new

array([[-0.16229784, -0.27636293],
       [ 0.69652751, -0.49265605],
       [ 1.        ,  0.76432723],
       [-1.        ,  1.        ],
       [-0.01207648, -1.        ],
       [-0.6008688 , -0.50388939],
       [-0.0691749 , -0.48785385],
       [ 0.97855345, -0.60193751]])

In [169]:
X_train_new.max()

1.0000000000000004

In [170]:
X_train_new.min()

-1.0

In [171]:
X_test_new = sc2.transform(X_test)

In [172]:
X_test_new

array([[ 1.052357  ,  0.42135271],
       [-0.24876073, -0.42616069]])
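The (-1, 1) results are a linear rescaling of the (0, 1) results, following the second line of the formula in the help text: `X_scaled = X_std * (max - min) + min`. For example, the first training value 0.41885108 becomes 0.41885108 * 2 - 1 = -0.16229784. A short check of this relationship on a hypothetical array `sample`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample data, used only to compare the two feature ranges.
sample = np.array([[1.0, 10.0],
                   [2.0, 30.0],
                   [4.0, 20.0]])

unit = MinMaxScaler().fit_transform(sample)                       # range (0, 1)
sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(sample)   # range (-1, 1)

# Apply the second step of the documented formula to the (0, 1) result.
lo, hi = -1, 1
rescaled = unit * (hi - lo) + lo

print(np.allclose(sym, rescaled))  # True
```

The same relationship should hold for any `feature_range`, since only the final stretch and shift change.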