In [2]:
import numpy as np
import sklearn.preprocessing

# A. Standard data format
Data can contain all sorts of different values. For example, Olympic 100m sprint times will range from 9.5 to 10.5 seconds, while calorie counts in large pepperoni pizzas can range from 1500 to 3000 calories. Even data measuring the exact same quantities can range in value (e.g. weight in kilograms vs. weight in pounds).

When data can take on any range of values, it becomes difficult to interpret and compare. Therefore, data scientists convert the data into a standard format to make it easier to understand. The standard format refers to data that has 0 mean and unit variance (i.e. standard deviation = 1), and the process of converting data into this format is called data standardization.

Data standardization is a relatively simple process. For each data value, x, we subtract the overall mean of the data, μ, then divide by the overall standard deviation, σ. The new value, z, represents the standardized data value. Thus, the formula for data standardization is:

$$z = \frac{x - \mu}{\sigma}$$
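
As a minimal sketch of this formula in plain NumPy, the snippet below standardizes the same set of weights expressed in kilograms and in pounds. The weights are made-up illustrative values, not data from this lesson, and `standardize` is a hypothetical helper name. Both versions should give the same z-scores, since the unit conversion factor cancels when dividing by the standard deviation.

```python
import numpy as np

def standardize(x):
    """Apply z = (x - mu) / sigma using the population standard deviation."""
    return (x - x.mean()) / x.std()

# Hypothetical weights, chosen only for illustration
weights_kg = np.array([50.0, 60.0, 70.0, 80.0, 90.0])
weights_lb = weights_kg * 2.20462  # approximate kg-to-lb conversion

z_kg = standardize(weights_kg)
z_lb = standardize(weights_lb)

print(repr(z_kg.round(3)))
# True if both unit systems give the same standardized values
print(np.allclose(z_kg, z_lb))
```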

# B. NumPy and scikit-learn
For most scikit-learn functions, the input data comes in the form of a NumPy array.

Note: The array’s rows represent individual data observations, while each column represents a particular feature of the data, i.e. the same format as a spreadsheet data table.

The scikit-learn data preprocessing module is called sklearn.preprocessing. One of the functions in this module, scale, applies data standardization to a given axis of a NumPy array.

In [3]:
# Quick check of what standardization actually does to the data
pizza_data = np.array([[2100, 10, 800],
                       [2500, 11, 850],
                       [2140, 10.8, 804],
                       [1800, 10, 760],
                       [2000, 12, 800],
                       [2300, 11, 810]])
first_col_mean = np.mean(pizza_data[:, 0])  # mean of the first column only
column_means = np.mean(pizza_data, axis=0)  # mean of each column
print(column_means)
from sklearn.preprocessing import scale
# Standardizing each column of pizza_data
col_standardized = scale(pizza_data)
print('{}\n'.format(repr(col_standardized)))

[2140.    10.8  804. ]
array([[-1.81319366e-01, -1.17108009e+00, -1.52646555e-01],
       [ 1.63187429e+00,  2.92770022e-01,  1.75543539e+00],
       [ 0.00000000e+00,  2.60032015e-15,  0.00000000e+00],
       [-1.54121461e+00, -1.17108009e+00, -1.67911211e+00],
       [-6.34617779e-01,  1.75662013e+00, -1.52646555e-01],
       [ 7.25277462e-01,  2.92770022e-01,  2.28969833e-01]])



Standardization shifts each column so that its mean becomes 0: a value above the column mean becomes positive, and a value below it becomes negative.
It also rescales each column so that its standard deviation, and therefore its variance, is 1. In the example above, the third row equals the column means, so it standardizes to (approximately) zero in every column.

In [4]:
pizza_data = np.array([[2100,   10,  800],
                       [2500,   11,  850],
                       [1800,   10,  760],
                       [2000,   12,  800],
                       [2300,   11,  810]])
mean = np.mean(pizza_data, axis=0)  # mean of each column
print(mean) 
print(repr(pizza_data))
from sklearn.preprocessing import scale
# Standardizing each column of pizza_data
col_standardized = scale(pizza_data)
print('{}\n'.format(repr(col_standardized)))

# Column means (rounded to nearest thousandth)
col_means = col_standardized.mean(axis=0).round(decimals=3)
print('{}\n'.format(repr(col_means)))

# Column standard deviations
col_stds = col_standardized.std(axis=0)
print('{}\n'.format(repr(col_stds)))


[2140.    10.8  804. ]
array([[2100,   10,  800],
       [2500,   11,  850],
       [1800,   10,  760],
       [2000,   12,  800],
       [2300,   11,  810]])
array([[-0.16552118, -1.06904497, -0.1393466 ],
       [ 1.4896906 ,  0.26726124,  1.60248593],
       [-1.40693001, -1.06904497, -1.53281263],
       [-0.57932412,  1.60356745, -0.1393466 ],
       [ 0.66208471,  0.26726124,  0.2090199 ]])

array([ 0., -0.,  0.])

array([1., 1., 1.])



We normally standardize the data independently across each feature of the data array. This way, we can see how many standard deviations a particular observation’s feature value is from the mean.

For example, the second data observation in pizza_data has a net weight (third column) about 1.6 standard deviations above the mean pizza weight in the dataset.
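
To see where the 1.6 figure comes from, the sketch below recomputes that single z-score by hand. It copies the third column of the five-row pizza_data array and assumes the population standard deviation (NumPy's default, `ddof=0`), which is consistent with the unit standard deviations printed above. The variable names are illustrative.

```python
import numpy as np

# Net weight column (third column) of the five-row pizza_data array
net_weights = np.array([800.0, 850.0, 760.0, 800.0, 810.0])

mu = net_weights.mean()     # column mean
sigma = net_weights.std()   # population standard deviation (ddof=0)

# z-score of the second observation: (x - mu) / sigma
z_second = (net_weights[1] - mu) / sigma
print(round(z_second, 3))
```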

If for some reason we need to standardize the data across rows, rather than columns, we can set the axis keyword argument in the scale function to 1. This may be the case when analyzing data within observations, rather than within a feature. An example of this would be analyzing a particular student's test scores in terms of standard deviations from that student's average test score.
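
A minimal sketch of row-wise standardization follows. The test scores are hypothetical values invented for illustration, with one row per student and one column per test. After scaling with `axis=1`, each row (rather than each column) should have a mean of 0 and a standard deviation of 1.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical test scores: one row per student, one column per test
test_scores = np.array([[70.0, 80.0, 90.0],
                        [55.0, 65.0, 60.0]])

# axis=1 standardizes each row using that row's own mean and standard deviation
row_standardized = scale(test_scores, axis=1)
print('{}\n'.format(repr(row_standardized)))

# Row means (rounded to nearest thousandth) and row standard deviations
print('{}\n'.format(repr(row_standardized.mean(axis=1).round(decimals=3))))
print('{}\n'.format(repr(row_standardized.std(axis=1))))
```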