# Machine Learning Methods: Data Standardization

### Data standardization is the conversion of data into a standard, uniform format, making it consistent across different data sets and easier for computer systems to work with. In machine learning it usually means rescaling each feature so that it has a mean of 0 and a standard deviation of 1. It's often performed when pre-processing data for input into machine learning or statistical models.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs

## Let's create some synthetic data to help us understand data standardization better!
## We will also be using a new tool for data standardization called StandardScaler

In [9]:
## First let's import our new tool!
## There are a few different types of scalers, but for this lesson we will be using StandardScaler

from sklearn.preprocessing import StandardScaler

In [11]:
## Let's create some synthetic data!
## 200 rows and 4 feature columns, with values drawn uniformly from the range [55, 155)

X = 100*np.random.rand(200,4)+55

In [15]:
## Now let's see the mean of each column of the new synthetic data!
## The mean is important because the scaler subtracts it from every value

X.mean(axis=0)

array([107.03910567, 104.11781393, 106.01212783, 104.58907429])

In [17]:
## Now let's see the std of each column of the new synthetic data!
## The std is important because the scaler divides every centered value by it

X.std(axis=0)

array([28.30753689, 28.78521087, 28.94033065, 29.17456443])
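
The mean and std matter because standardization is defined in terms of them. As a short reference (the symbols below are notation introduced here, not names from the notebook), each value $x_{ij}$ in column $j$ is replaced by its z-score:

$$
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
\qquad
\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij},
\qquad
\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_{ij} - \mu_j\right)^2}
$$

Here $n$ is the number of rows (200 in this example). The standard deviation divides by $n$, which matches the default behavior of `X.std(axis=0)` in NumPy.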

In [19]:
## Now let's use StandardScaler to create a new scaled variable
## fit_transform learns each column's mean and std from X, then applies the scaling

s = StandardScaler()
x_scaled = s.fit_transform(X)
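
To see what the scaler learned during fitting, it can help to inspect its fitted attributes. The sketch below is a minimal illustration on its own seeded data, so the names `rng`, `X_demo`, and `scaler_demo` are assumptions made for this example and are not part of the notebook above. It checks that `mean_` and `scale_` agree with the column means and stds computed by NumPy.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical demo data, seeded so the example is reproducible
rng = np.random.default_rng(0)
X_demo = 100 * rng.random((200, 4)) + 55

# Fit only: the scaler stores the per-column statistics it will use later
scaler_demo = StandardScaler()
scaler_demo.fit(X_demo)

# mean_ holds the per-column means, scale_ the per-column standard deviations
means_match = np.allclose(scaler_demo.mean_, X_demo.mean(axis=0))
stds_match = np.allclose(scaler_demo.scale_, X_demo.std(axis=0))

print("means match:", means_match)
print("stds match:", stds_match)
```

Because these statistics are stored on the fitted object, the same scaler can later be applied to other data with `transform`.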

In [25]:
## Now that we have applied our scaler, let's check the mean and std again for x_scaled
## Notice that each column's mean is now essentially 0 (values like 7e-16 are just floating-point rounding error) and each std is 1
## The scaled values are not boolean: they are continuous z-scores that tell us how many standard deviations each value lies from its column mean

x_scaled.mean(axis=0)

array([ 7.01660952e-16, -7.14983628e-16, -9.17044218e-16,  3.90798505e-16])

In [23]:
x_scaled.std(axis=0)

array([1., 1., 1., 1.])
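
As a cross-check, the same result can be reproduced by hand with NumPy. The following is a minimal sketch under stated assumptions: it generates its own seeded data, and the names `X_demo`, `z_manual`, and `z_sklearn` are hypothetical, chosen for this example only. It subtracts each column's mean, divides by each column's std, and compares the result with the output of `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical demo data, seeded so the example is reproducible
rng = np.random.default_rng(42)
X_demo = 100 * rng.random((200, 4)) + 55

# Manual standardization: subtract the column mean, divide by the column std
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)  # NumPy's default divides by n, like StandardScaler
z_manual = (X_demo - mu) / sigma

# The same transformation done by scikit-learn
z_sklearn = StandardScaler().fit_transform(X_demo)

same = np.allclose(z_manual, z_sklearn)
print("manual result matches StandardScaler:", same)

# The scaled values are continuous, not 0/1 flags
print("min:", z_manual.min(), "max:", z_manual.max())
```

The printed minimum and maximum show that the scaled data spans a continuous range on both sides of 0, while each column as a whole has mean 0 and std 1.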