# normalizing & standardizing data

### normalization: 

rescales the data so that every value falls between 0 and 1: the smallest value maps to 0 and the largest maps to 1.

you must know or be able to estimate the expected min and max values for your data.
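
as a rough sketch of what this means, min-max normalization subtracts the minimum and divides by the range (max minus min). the sample values below are made up for illustration, and the min and max are taken from the sample itself rather than from prior knowledge of the data.

```python
import numpy as np

# hypothetical sample values, chosen only for illustration
x = np.array([2.0, 4.0, 6.0, 10.0])

# estimate the min and max from the sample itself
x_min, x_max = x.min(), x.max()

# min-max normalization: subtract the min, divide by the range
x_normalized = (x - x_min) / (x_max - x_min)

print(x_normalized)
```

here 2 maps to 0, 10 maps to 1, and the values in between land at 0.25 and 0.5.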

### standardization:

rescales the data distribution so that the mean of all observations is 0, and the standard deviation is 1.

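you must know or be able to estimate your mean and standard deviation.

standardization works best when your data roughly fits a gaussian distribution. it can still be applied to other distributions, but the result may be less meaningful.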
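
a minimal sketch of the same idea in plain numpy: subtract the mean, then divide by the standard deviation. the sample values are made up, and the population standard deviation (numpy's default, ddof=0) is an assumption of this sketch.

```python
import numpy as np

# hypothetical sample values, chosen only for illustration
x = np.array([2.0, 4.0, 6.0, 8.0])

# estimate the mean and standard deviation from the sample
mean = x.mean()
std = x.std()  # population standard deviation (ddof=0)

# standardization: subtract the mean, divide by the standard deviation
x_standardized = (x - mean) / std

# the mean should be (close to) 0 and the standard deviation (close to) 1
print(x_standardized.mean(), x_standardized.std())
```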


## normalization

scikit-learn's MinMaxScaler object works in two steps: calling .fit() estimates the min and max from the data, and calling .transform() then rescales the data using those values.

note: the MinMaxScaler object expects the data as a 2d matrix of shape (rows, columns). a 1d numpy array (like the one i use here) has to be reshaped with .reshape(rows, columns) before you fit the scaler.
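
a quick sketch of what that reshape does to the shape of an array. the data here is made up for illustration. passing -1 for the row count is a numpy shorthand that lets numpy work out the number of rows on its own.

```python
import numpy as np

# hypothetical 1d data, just for illustration
flat = np.arange(5, dtype=float)
print(flat.shape)          # (5,)

# one column, one row per observation
column = flat.reshape(len(flat), 1)
print(column.shape)        # (5, 1)

# -1 tells numpy to infer the number of rows
also_column = flat.reshape(-1, 1)
print(also_column.shape)   # (5, 1)
```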

In [37]:
# imports from standard python data processing libraries, numpy, pandas & sklearn

import numpy as np
from pandas import Series
from sklearn.preprocessing import MinMaxScaler

In [38]:
# create some data
# np.random.rand creates a specified number of data points drawn from a
# uniform distribution over [0, 1)

data = np.random.rand(10)

print(data)


[0.71322554 0.32331002 0.14940009 0.27190708 0.76416523 0.05511729
 0.54936922 0.12393065 0.861575   0.00812809]


In [39]:
# load as a pandas Series
# pandas series must be 1d

data_series = Series(data)

print(data_series)

0    0.713226
1    0.323310
2    0.149400
3    0.271907
4    0.764165
5    0.055117
6    0.549369
7    0.123931
8    0.861575
9    0.008128
dtype: float64


In [40]:
# prepare the data for the MinMaxScaler to fit
# get values, then reshape

series_values = data_series.values
series_values = series_values.reshape(len(series_values), 1)

In [41]:
# instantiate then fit scaler

scaler = MinMaxScaler(feature_range=(0,1))

scaler = scaler.fit(series_values)

In [42]:
# check results: min & max?
# data_min_ and data_max_ are arrays with one entry per column, so take element 0

print('min: %f, max: %f' % (scaler.data_min_[0], scaler.data_max_[0]))

min: 0.008128, max: 0.861575


In [44]:
# transform: where the actual normalization happens

normalized_data = scaler.transform(series_values)

print(normalized_data)

[[0.82617611]
 [0.36930467]
 [0.16553109]
 [0.30907486]
 [0.88586312]
 [0.05505814]
 [0.63418254]
 [0.13568807]
 [1.        ]
 [0.        ]]


In [45]:
# the process can be reversed: invert the transform to get the original values back

inversed_nrml = scaler.inverse_transform(normalized_data)

print(inversed_nrml)

[[0.71322554]
 [0.32331002]
 [0.14940009]
 [0.27190708]
 [0.76416523]
 [0.05511729]
 [0.54936922]
 [0.12393065]
 [0.861575  ]
 [0.00812809]]
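
the fit, transform and inverse steps above can also be sketched in one self-contained cell. this uses fit_transform(), which does the fit and the transform in a single call, and np.allclose to confirm the round trip. the seed and the variable names are arbitrary choices for this sketch, so the numbers will differ from the output shown above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical data: 10 uniform random values, reshaped to (rows, columns)
rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
values = rng.random(10).reshape(-1, 1)

scaler = MinMaxScaler(feature_range=(0, 1))

# fit and transform in one call
normalized = scaler.fit_transform(values)

# invert the transform to recover the original values
restored = scaler.inverse_transform(normalized)

# compare with a tolerance, since floating point round trips are not exact
print(np.allclose(restored, values))
```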
