# normalizing & standardizing data

#### 1) normalization: 

rescales the data so that every value falls between 0 and 1: the smallest value maps to 0 and the largest maps to 1.

you must know or be able to estimate the expected min and max values for your data.
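as a rough sketch of the arithmetic behind normalization, each value x becomes (x - min) / (max - min). the helper name `min_max_normalize` and its `data_min` / `data_max` arguments are made up for this illustration; they are not part of any library.

```python
import numpy as np


def min_max_normalize(values, data_min=None, data_max=None):
    """hypothetical helper: rescale values with (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    # fall back to the observed min / max when no estimates are supplied
    lo = values.min() if data_min is None else data_min
    hi = values.max() if data_max is None else data_max
    if hi == lo:
        raise ValueError("min and max are equal, so the data cannot be rescaled")
    return (values - lo) / (hi - lo)


sample = [100.0, 200.0, 300.0, 400.0, 500.0]
normalized_sample = min_max_normalize(sample)
print(normalized_sample)
```

passing your own `data_min` and `data_max` is how you would plug in expected limits rather than the limits that happen to appear in the sample.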

#### 2) standardization:

rescales the data distribution so that the mean of all observations is 0, and the standard deviation is 1.

you must know or be able to estimate your mean and standard dev.

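standardization assumes your data roughly fits a gaussian (bell-shaped) distribution. you can still standardize data that doesn't, but the results may be less reliable.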
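the arithmetic behind standardization can be sketched the same way: each value x becomes (x - mean) / std. the helper name `standardize` is made up for this illustration, and it assumes the population standard deviation (numpy's default, ddof=0).

```python
import numpy as np


def standardize(values):
    """hypothetical helper: rescale values with (x - mean) / std."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0)
    if std == 0:
        raise ValueError("standard deviation is 0, so the data cannot be rescaled")
    return (values - mean) / std


sample = [1.0, 2.0, 3.0, 4.0, 5.0]
standardized_sample = standardize(sample)

# after rescaling, the mean should be (very close to) 0 and the std 1
print(standardized_sample.mean(), standardized_sample.std())
```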


In [1]:
# imports from standard python data processing libraries, numpy, pandas & sklearn

import numpy as np
from pandas import Series
from sklearn.preprocessing import MinMaxScaler

## 1) normalization

scikit-learn's MinMaxScaler object fits the data in order to estimate the max and min, then scales the data when we call .transform()

note: the MinMaxScaler object expects data to be presented as a matrix of size (rows, columns), so even if you use a numpy array (as i do here), make sure your data is reshaped into a matrix using .reshape(rows, columns) before attempting to fit the scaler.
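a small sketch of what that reshape does, using made-up values. passing -1 for the row count is a numpy convenience that lets numpy work out the number of rows itself.

```python
import numpy as np

flat = np.array([100.0, 200.0, 300.0])  # 1d array, shape (3,)

# one column, and as many rows as needed (-1 tells numpy to infer the count)
column = flat.reshape(-1, 1)

print(flat.shape)    # 1d: a single axis
print(column.shape)  # 2d: (rows, columns)
```

this is equivalent to `.reshape(len(flat), 1)`, the form used in the cells of this notebook.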

### example 1:
normalization on some contrived data that makes it easy to see what's going on

In [5]:
# pretend time sequence

data = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0, 900.0]

In [6]:
data_series = Series(data)

print(data_series)

0    100.0
1    200.0
2    300.0
3    400.0
4    500.0
5    600.0
6    700.0
7    800.0
8    900.0
dtype: float64


#### useful objects: sklearn's MinMaxScaler & pandas' Series

to normalize this data we'll be using MinMaxScaler, a scikit-learn object. the MinMaxScaler object expects a 2d matrix with rows and columns as input; if you hand it a 1d array it raises an error. the pandas Series object has a number of useful attributes & methods, some of which come in handy here.

we'll use attributes and/or methods attached to both of these objects (sklearn's MinMaxScaler & pandas' Series) to normalize the data.

##### pandas' Series documentation: 

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

##### .reshape(): 

https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.reshape.html

https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html

##### more on using .reshape in pandas Series: 

https://stackoverflow.com/questions/14390224/reshape-of-pandas-series

##### sklearn's MinMaxScaler documentation: 

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html


In [24]:
# pandas Series attribute
# .values returns an ndarray type

data_values = data_series.values

# numpy method .reshape()
# call it on the ndarray returned by .values, not on the Series itself
# (Series.reshape is deprecated in pandas)

data_array = data_values.reshape((len(data_values),1))

#### training the normalizer

remember, a normalizer is itself a type of model. it has to be trained. the scikit-learn MinMaxScaler object follows the familiar scikit-learn model api.
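a minimal sketch of that api on made-up numbers: fit learns the min and max, transform applies them, and fit_transform does both in one call. the names `train`, `unseen` and `demo_scaler` are just for this illustration. it also shows why the min and max estimates matter: a value outside the range seen during fitting lands outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up data, already shaped as (rows, columns)
train = np.array([[100.0], [500.0], [900.0]])
unseen = np.array([[300.0], [1100.0]])

demo_scaler = MinMaxScaler(feature_range=(0, 1))

# fit_transform = fit (learn min & max) followed by transform (rescale)
train_scaled = demo_scaler.fit_transform(train)

# the fitted scaler reuses the min & max it learned from train
unseen_scaled = demo_scaler.transform(unseen)

print(train_scaled.ravel())
print(unseen_scaled.ravel())  # 1100 is above the fitted max, so it scales past 1
```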

In [25]:
# create a MinMaxScaler object
# feature_range will give all values between 0 and 1

data_scaler = MinMaxScaler(feature_range=(0,1))

In [26]:
# fit the normalizer to the data

data_scaler = data_scaler.fit(data_array)

In [27]:
# check
# data_min_ and data_max_ are arrays with one entry per column, so take element 0

original_max = data_scaler.data_max_[0]

original_min = data_scaler.data_min_[0]

print('min: %f' '\n' 'max: %f' % (original_min, original_max))

min: 100.000000
max: 900.000000


In [30]:
# transform
# feed the reshaped array to the fitted scaler

data_normalized = data_scaler.transform(data_array)

In [32]:
print(data_normalized)

[[0.   ]
 [0.125]
 [0.25 ]
 [0.375]
 [0.5  ]
 [0.625]
 [0.75 ]
 [0.875]
 [1.   ]]


#### inverse transform

once the data has been transformed, it's also possible to transform it back, using the inverse_transform() method.

In [33]:
data_inversed = data_scaler.inverse_transform(data_normalized)

print(data_inversed)

[[100.]
 [200.]
 [300.]
 [400.]
 [500.]
 [600.]
 [700.]
 [800.]
 [900.]]


### example 2:
some slightly random-er data

In [38]:
# create some data
# np.random.rand creates a specified number of data points drawn from a
# uniform distribution over [0, 1)

data = np.random.rand(10)

print(data)


[0.71322554 0.32331002 0.14940009 0.27190708 0.76416523 0.05511729
 0.54936922 0.12393065 0.861575   0.00812809]


In [39]:
# load as a pandas Series
# pandas series must be 1d

data_series = Series(data)

print(data_series)

0    0.713226
1    0.323310
2    0.149400
3    0.271907
4    0.764165
5    0.055117
6    0.549369
7    0.123931
8    0.861575
9    0.008128
dtype: float64


In [40]:
# prepare the data for the MinMaxScaler to fit
# get values, then reshape

series_values = data_series.values
series_values = series_values.reshape(len(series_values), 1)

In [41]:
# instantiate then fit scaler

scaler = MinMaxScaler(feature_range=(0,1))

scaler = scaler.fit(series_values)

In [42]:
# check results: min & max?

print('min: %f, max: %f' % (scaler.data_min_, scaler.data_max_))

min: 0.008128, max: 0.861575


In [44]:
# transform: where the actual normalization happens

normalized_data = scaler.transform(series_values)

print(normalized_data)

[[0.82617611]
 [0.36930467]
 [0.16553109]
 [0.30907486]
 [0.88586312]
 [0.05505814]
 [0.63418254]
 [0.13568807]
 [1.        ]
 [0.        ]]


In [45]:
# the process can be reversed: inverse the transform to get the original values back

inversed_nrml = scaler.inverse_transform(normalized_data)

print(inversed_nrml)

[[0.71322554]
 [0.32331002]
 [0.14940009]
 [0.27190708]
 [0.76416523]
 [0.05511729]
 [0.54936922]
 [0.12393065]
 [0.861575  ]
 [0.00812809]]


## 2) standardization

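standardization follows the same fit / transform pattern as normalization, but instead of squeezing the data into a fixed range it rescales it to a mean of 0 and a standard deviation of 1. scikit-learn's StandardScaler object handles this.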
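a minimal sketch of standardization with scikit-learn's StandardScaler, mirroring the normalization steps above. the data and the variable names (`std_data`, `std_scaler`, `standardized`, `restored`) are made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up data, reshaped into a matrix of size (rows, columns)
std_data = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)

# instantiate then fit: fitting estimates the mean and standard deviation
std_scaler = StandardScaler()
std_scaler = std_scaler.fit(std_data)

# mean_ and scale_ hold one entry per column
print('mean: %f, std: %f' % (std_scaler.mean_[0], std_scaler.scale_[0]))

# transform: where the actual standardization happens
standardized = std_scaler.transform(std_data)
print(standardized)

# the process can be reversed, just like with MinMaxScaler
restored = std_scaler.inverse_transform(standardized)
print(restored)
```

StandardScaler uses the population standard deviation (ddof=0), which matches numpy's default for `.std()`.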