# Data Centering

Data centering involves subtracting the mean of a data set from each data point so that the new mean is 0. Mathematically, this looks like:

$$X_{\text{centered},i} = X_i - \mu$$

where $X_i$ is a data point and the Greek letter $\mu$ is the mean of all the $X$ values.

For example, let’s take a look at a data set of ages for five individuals:

In [7]:
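import numpy as np

# Ages of five individuals, stored as an array so arithmetic is element-wise
ages = np.array([24, 40, 28, 22, 56])
ages_mean = np.mean(ages)
# Subtract the mean from every age
centered_ages = ages - ages_mean
print(f'Mean: {ages_mean}\nCentered Ages: {centered_ages}\nSum of Centered Ages: {sum(centered_ages)}')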

Mean: 34.0
Centered Ages: [-10.   6.  -6. -12.  22.]
Sum of Centered Ages: 0.0


This centered data is useful because it shows how far above or below the mean each data point is, which is hard to see at a glance in the initial data set.

# Data Scaling

A common task for data analysts and scientists is to find trends in data by comparing features of data points. However, this task is made difficult when the features are on drastically different scales.

For instance, let’s consider a data set containing two features, age and income.

A person’s age usually ranges from 0 to about 100 years. A person’s income, on the other hand, usually ranges from 0 to amounts measured in the thousands of dollars or more. Clearly, age and income are two features that have vastly different ranges.

This presents issues when trying to use many machine learning algorithms, particularly those that measure distances between data points, because they treat all dimensions equally regardless of their scale. A difference of one year of age is interpreted as exactly equal to a difference of one dollar of income. That makes no sense!
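
To make the problem concrete, here is a minimal sketch that measures the Euclidean distance between a few people described by age and income. The three people and their values are made up for illustration, and Euclidean distance is just one common way an algorithm might compare data points.

```python
import math

# Hypothetical people described as (age in years, income in dollars).
alice = (25, 50_000)
bob = (65, 50_500)    # 40 years older than Alice, similar income
carol = (26, 58_000)  # nearly Alice's age, noticeably higher income

# On the raw features, the distance is dominated by income, because income
# differences are numerically far larger than age differences.
dist_alice_bob = math.dist(alice, bob)
dist_alice_carol = math.dist(alice, carol)

print(f"Alice-Bob:   {dist_alice_bob:.1f}")
print(f"Alice-Carol: {dist_alice_carol:.1f}")
```

On these unscaled values, Bob comes out much closer to Alice than Carol does, even though Bob is 40 years older. The age difference barely registers next to the income difference.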

# Min-Max Normalization

Min-max normalization is one of the simplest and most common ways to scale data.

For every feature in a data set, the minimum value of that feature is transformed into 0, the maximum value is transformed into 1, and every other value is transformed into a decimal between 0 and 1.

$$X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

One downside of min-max normalization is that it does not handle outliers very well. For example, if you have 99 values between 0 and 20, and one value is 100, then the 99 values will all be transformed to a value between 0 and 0.2 while the outlier is transformed to 1.

In [8]:
def min_max_normalize(lst):
  minimum = min(lst)
  maximum = max(lst)
  normalized = []

  # Rescale each element relative to the range of the list
  for el in lst:
    normalized.append((el - minimum)/(maximum - minimum))

  return normalized

# Test the function with two small examples:
print(min_max_normalize([0, 25, 50, 75, 100]))
# should print [0.0, 0.25, 0.5, 0.75, 1.0]
print(min_max_normalize([10, 12, 14]))
# should print [0.0, 0.5, 1.0]

[0.0, 0.25, 0.5, 0.75, 1.0]
[0.0, 0.5, 1.0]
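
The outlier behavior described earlier can be checked with a short sketch. The data here is made up: 99 evenly spaced values between 0 and 20, plus a single outlier at 100. The function is repeated so the block runs on its own.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    minimum = min(values)
    maximum = max(values)
    return [(v - minimum) / (maximum - minimum) for v in values]

# Made-up data: 99 values spread evenly from 0 to 20, plus one outlier at 100.
typical = [20 * i / 98 for i in range(99)]
data = typical + [100]

normalized = min_max_normalize(data)

# The 99 typical values end up in the bottom fifth of the [0, 1] range.
print(f"Largest typical value after scaling: {max(normalized[:-1]):.2f}")
print(f"Outlier after scaling: {normalized[-1]:.2f}")
```

Note that this sketch divides by zero if every value in the list is identical, so a constant feature would need separate handling.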


# Standardization

Standardization (also known as Z-score normalization) is another common data scaling technique.

Standardization involves subtracting the mean of a feature from each observation and then dividing by the feature's standard deviation:

$$z = \frac{\text{value} - \text{mean}}{\text{stdev}}$$

Once standardization is complete, all the features will have a mean of zero, a standard deviation of one, and therefore, the same scale.

Unlike min-max normalization, standardization does not have a bounding range, so an outlier is not pinned to the end of a fixed interval. Outliers still pull the mean and inflate the standard deviation, so your standardized data is not completely unaffected by them. Even so, if your dataset has outliers, standardization is usually the preferred scaling technique.

In [9]:
def standardize(lst, mean, std_dev):
  standardized = []

  # Convert each element to its z-score
  for el in lst:
    standardized.append((el - mean)/std_dev)

  return standardized

# Test the standardize function with two small examples:
print(standardize([1, 2, 3, 4, 5], 3.0, 1.41))
# should print approximately [-1.418, -0.709, 0.0, 0.709, 1.418]
print(standardize([10, 15, 20], 15.0, 4.08))
# should print approximately [-1.225, 0.0, 1.225]

[-1.4184397163120568, -0.7092198581560284, 0.0, 0.7092198581560284, 1.4184397163120568]
[-1.2254901960784315, 0.0, 1.2254901960784315]
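
The function above takes the mean and standard deviation as arguments. As a sketch of how those could be computed from the data itself, the block below uses Python's `statistics` module and then checks that the result has a mean of about zero and a standard deviation of about one. The name `standardize_from_data` is hypothetical, and the use of the population standard deviation is an assumption; it appears to match the values 1.41 and 4.08 passed in the calls above.

```python
import statistics

def standardize_from_data(values):
    """Return z-scores, computing the mean and population std dev from values."""
    mean = statistics.fmean(values)
    std_dev = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / std_dev for v in values]

ages = [24, 40, 28, 22, 56]
z_scores = standardize_from_data(ages)

print([round(z, 3) for z in z_scores])
print(f"Mean of z-scores: {statistics.fmean(z_scores):.3f}")
print(f"Std dev of z-scores: {statistics.pstdev(z_scores):.3f}")
```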


## When to Normalize vs. Standardize?

Min-max normalization and standardization both have a similar goal of transforming features in data to have the same scale so that each feature is equally important. So when should you use min-max normalization vs. standardization?

There is not always a clear answer. Both normalization and standardization have their strengths as well as their drawbacks. For example, if you need your data to be on a 0-1 scale, then it makes sense to use min-max normalization. If you have outliers in your data, then standardization (Z-score normalization) is usually the better choice since it does not have a bounding range like min-max normalization does.
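
One way to get a feel for the difference is to scale the same small data set both ways. The sketch below uses made-up values with one outlier; the helper names are hypothetical, and the population standard deviation is an assumption.

```python
import statistics

def min_max_normalize(values):
    """Rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize_from_data(values):
    """Return z-scores using the mean and population std dev of values."""
    mean = statistics.fmean(values)
    std_dev = statistics.pstdev(values)
    return [(v - mean) / std_dev for v in values]

# Made-up data: five similar values and one outlier.
data = [10, 12, 14, 16, 18, 100]

min_max = min_max_normalize(data)
z_scores = standardize_from_data(data)

# Min-max output is always confined to [0, 1]; z-scores have no fixed bounds.
print("Min-max:", [round(v, 3) for v in min_max])
print("Z-score:", [round(v, 3) for v in z_scores])
```

Min-max normalization fixes the outlier at exactly 1, while its z-score is free to land wherever the data puts it. Both transforms are linear, so the outlier still shapes the result in each case.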

Keep in mind that not every data set requires normalization or standardization. If your data features do not have vastly different ranges, then scaling your data might not be necessary. 

## Python Implementation

As you saw, it is possible to implement min-max normalization and standardization by writing your own Python functions. However, in practice, most data analysts and scientists use popular libraries such as scikit-learn, which makes it very easy to scale your data.

For example, to normalize your data you can import MinMaxScaler from the sklearn.preprocessing package, create a scaler, and call its fit_transform method:

```python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
 
# read in data 
data = pd.read_csv('data.csv')
 
# normalize data (every column must be numeric; the result is a NumPy array by default)
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
```

Similarly, to standardize your data you can import StandardScaler and use it in the same way:

```python
from sklearn.preprocessing import StandardScaler
import pandas as pd
 
# read in data 
data = pd.read_csv('data.csv')
 
# standardize data (every column must be numeric; the result is a NumPy array by default)
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
```
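
Since `data.csv` is only a placeholder, here is a small sketch that runs both scalers on an in-memory array instead. The ages and incomes are made up for illustration. It shows that each scaler works column by column, so every feature is scaled independently of the others.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: each row is a person, columns are (age, income).
data = np.array([
    [24, 30_000],
    [40, 80_000],
    [56, 55_000],
], dtype=float)

# Each scaler is fitted and applied per column.
normalized = MinMaxScaler().fit_transform(data)
standardized = StandardScaler().fit_transform(data)

print(normalized)
print(standardized.mean(axis=0))  # close to 0 for each column
print(standardized.std(axis=0))   # close to 1 for each column
```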