# Chapter 2: Scale Machine Learning Data

Many machine learning algorithms expect data to be scaled consistently. There are two popular
methods that you should consider when scaling your data for machine learning. In this tutorial,
you will discover how you can rescale your data for machine learning. After reading this tutorial
you will know:

* How to normalize your data from scratch.
* How to standardize your data from scratch.
* When to normalize as opposed to standardize data.

Let’s get started.

## 2.1 Description

Many machine learning algorithms expect the scale of the input and even the output data to be
equivalent. It can help in methods that weight inputs in order to make a prediction, such as
in linear regression and logistic regression. It is practically required in methods that combine
weighted inputs in complex ways such as in artificial neural networks and deep learning.

### 2.1.1 Pima Indians Diabetes Dataset
In this tutorial we will use the Pima Indians Diabetes Dataset. This dataset involves the
prediction of the onset of diabetes within 5 years. The baseline performance on the problem is
approximately 65%. You can learn more about it in Appendix A, Section A.4. Download the dataset
and save it with the filename pima-indians-diabetes.csv in a data/ subdirectory of your current
working directory, which is the path the code cells in this tutorial load it from.

## 2.2 Tutorial

This tutorial is divided into 3 parts:
1. Normalize Data.
2. Standardize Data.
3. When to Normalize and Standardize.

These steps will provide the foundations you need to handle scaling your own data.

### 2.2.1 Normalize Data

Normalization can refer to different techniques depending on context. Here, we use normalization
to refer to rescaling an input variable to the range between 0 and 1. Normalization requires
that you know the minimum and maximum values for each attribute. These can be estimated from
training data or specified directly if you have deep knowledge of the problem domain. You can
estimate the minimum and maximum values for each attribute in a dataset by enumerating through
the values. The snippet of code below defines the dataset_minmax2() function, which calculates
the min and max value for each attribute in a dataset and returns them as two arrays: the
column minimums followed by the column maximums.

In [18]:
# Load libraries
use strict;
use warnings;
use Data::Dump qw(dump);
use List::Util qw(zip min max sum);
use sml;
use AI::MXNet qw(mx);   # import under the short name "mx", which the code below calls as mx->nd


In [20]:
# Function To Calculate the Min and Max Values For a Dataset.
# Find the min and max values for each column
use AI::MXNet qw(mx);
sub dataset_minmax2{
    my ($self, $dataset) = @_;
    my $mx_data = mx->nd->array($dataset);
    my $mins = $mx_data->min(axis => 0);
    my $maxs = $mx_data->max(axis => 0);
    return [$mins->asarray, $maxs->asarray];
}
sml->add_to_class('dataset_minmax2', \&dataset_minmax2);


We can test the function for calculating the min and max of each column with a small
contrived dataset.

In [21]:
# Contrive small dataset
my $dataset = [[50, 30], [20, 90]];
printf "%s\n", dump $dataset;
# Calculate min and max for each column
my $minmax = sml->dataset_minmax2($dataset);
printf "%s\n", dump $minmax;
# Expected output (this version returns [mins, maxs], one value per column):
# [[50, 30], [20, 90]]
# [[20, 30], [50, 90]]

[[50, 30], [20, 90]]
[[20, 30], [50, 90]]



Once we have estimates of the minimum and maximum allowed values for each column, we
can normalize the raw data to the range 0 to 1. The calculation to normalize a single
value for a column is:

<center>$scaled\ value = (value - min)\ /\ (max - min)$</center>  (2.1)

Below is an implementation of this in a function called normalize_dataset2() that normalizes
the values in each column of a provided dataset in place.

In [22]:
# Function To Normalize a Dataset.
# Rescale dataset columns to the range 0-1
sub normalize_dataset2 {
    my ($self, $dataset, $minmax) = @_;
    my $mx_data = mx->nd->array($dataset);
    my $mins = mx->nd->array($minmax->[0]);
    my $ranges = mx->nd->array($minmax->[1]) - $mins;
    
    my $normalized = ($mx_data - $mins) / $ranges;
    @$dataset = @{$normalized->asarray};
}

sml->add_to_class('normalize_dataset2', \&normalize_dataset2);



We can tie this function together with the dataset_minmax2() function and normalize the
contrived dataset.

In [23]:
# Contrive small dataset
$dataset = [[50, 30], [20, 90]];
print dump $dataset;

$minmax = sml->dataset_minmax2($dataset);
print "\n", dump $minmax;

sml->normalize_dataset2($dataset, $minmax);
print "\n", dump $dataset;


# Example Output of Normalizing the Contrived Dataset.
# [[50, 30], [20, 90]]
# [[20, 30], [50, 90]]
# [[1, 0], [0, 1]]

[[50, 30], [20, 90]]
[[20, 30], [50, 90]]
[[1, 0], [0, 1]]


We can combine this code with code for loading a CSV dataset and load and normalize the
Pima Indians Diabetes dataset. The example first loads the dataset and converts the values for
each column from string to floating point values. The minimum and maximum values for each
column are estimated from the dataset, and finally, the values in the dataset are normalized.

In [24]:
# Load pima-indians-diabetes dataset
my $filename = 'data/pima-indians-diabetes.csv';
$dataset = sml->load_csv($filename);
printf "Loaded data file %s with %d rows and %d columns.\n", $filename, scalar @$dataset, scalar @{$dataset->[0]}; 
print "[@{$dataset->[0]}]";

# convert string columns to float
for my $i (0 .. $#{$dataset->[0]}) {
    sml->str_column_to_float($dataset, $i);
}

print "\n[@{$dataset->[0]}]";

# Calculate min and max for each column
$minmax = sml->dataset_minmax($dataset);
sml->normalize_dataset($dataset, $minmax);
print "\n[@{$dataset->[0]}]";



# Example Output of Normalizing the Diabetes Dataset.
# Loaded data file pima-indians-diabetes.csv with 768 rows and 9 columns
# [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
# [0.35294117647058826, 0.7437185929648241, 0.5901639344262295, 0.35353535353535354, 0.0,
# 0.5007451564828614, 0.23441502988898377, 0.48333333333333334, 1.0]

Loaded data file data/pima-indians-diabetes.csv with 768 rows and 9 columns.
[6 148 72 35 0 33.6 0.627 50 1]
[6.0 148.0 72.0 35.0 0.0 33.6 0.6 50.0 1.0]
[0.352941176470588 0.743718592964824 0.590163934426229 0.353535353535354 0 0.500745156482861 0.217391304347826 0.483333333333333 1]
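
As a cross-check that does not depend on MXNet, the same min-max rescaling can be sketched in plain Perl. The helper names `plain_minmax` and `plain_normalize` are hypothetical and are not part of the `sml` module. Unlike the function above, this sketch returns a rescaled copy and maps a constant column (range 0) to 0 instead of dividing by zero, which is one possible convention among several.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: per-column minimums and maximums, returned as [mins, maxs].
sub plain_minmax {
    my ($rows) = @_;
    my $ncols = scalar @{ $rows->[0] };
    my (@mins, @maxs);
    for my $c (0 .. $ncols - 1) {
        my @column = map { $_->[$c] } @$rows;
        push @mins, min(@column);
        push @maxs, max(@column);
    }
    return [\@mins, \@maxs];
}

# Hypothetical helper: returns a rescaled copy instead of modifying the input.
sub plain_normalize {
    my ($rows, $minmax) = @_;
    my ($mins, $maxs) = @$minmax;
    return [
        map {
            my $row = $_;
            [
                map {
                    my $range = $maxs->[$_] - $mins->[$_];
                    # A constant column has range 0; map it to 0 rather than divide by zero.
                    $range == 0 ? 0 : ($row->[$_] - $mins->[$_]) / $range;
                } 0 .. $#$row
            ];
        } @$rows
    ];
}

my $rows   = [[50, 30], [20, 90]];
my $minmax = plain_minmax($rows);
my $scaled = plain_normalize($rows, $minmax);
print join(' ', @$_), "\n" for @$scaled;
# prints:
# 1 0
# 0 1
```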


### 2.2.2 Standardize Data

Standardization is a rescaling technique that centers the distribution of the data on the
value 0 and rescales it so that the standard deviation is 1. Together, the mean and the standard
deviation can be used to summarize a normal distribution, also called the Gaussian distribution
or bell curve.

Standardization requires that the mean and standard deviation of the values for each column be
known prior to scaling. As with normalizing above, we can estimate these values from training
data, or use domain knowledge to specify their values. Let’s start with creating functions to
estimate the mean and standard deviation statistics for each column from a dataset. The mean
describes the middle or central tendency for a collection of numbers. The mean for a column is
calculated as the sum of all values for a column divided by the total number of values.

<center>$mean = \sum_{i=1}^{n} values_i\ /\ count(values)$</center> (2.2)

The function below named column_means2() calculates the mean values for each column in
the dataset.

In [25]:
# Function To Calculate Means For Each Column in a Dataset.
# Calculate column means
my $column_means2 = sub {
    my ($self, $dataset) = @_;
    my $mx_data = mx->nd->array($dataset);
    return [$mx_data->mean(axis => 0)->asarray];
};

sml->add_to_class('column_means2', $column_means2);


The standard deviation describes the average spread of values from the mean. It can be
calculated as the square root of the sum of the squared differences between each value and the
mean, divided by the number of values minus 1.

<center>$standard\ deviation = \sqrt{\frac{\sum_{i=1}^{n} (values_i - mean)^2}{count(values) - 1}}$</center> (2.3)

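The function below named column_stdevs2() calculates the standard deviation of values for
each column in the dataset and assumes the means have already been calculated. Note that this
version takes the mean of the squared differences, so it divides by the number of values rather
than by the number of values minus 1 and returns the population standard deviation. This is why
its printed results differ slightly from the reference values quoted in the code comments.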

In [26]:
# Function To Calculate Standard Deviations For Each Column in a Dataset.
# Calculate column standard deviations

my $column_stdevs2 = sub {
    my ($self, $dataset, $means) = @_;
    my $mx_data = mx->nd->array($dataset);
    my $mx_means = mx->nd->array($means);
    
    my $variance = ($mx_data - $mx_means)->square->mean(axis => 0);
    return [$variance->sqrt->asarray];
};

sml->add_to_class('column_stdevs2', $column_stdevs2);


Using the contrived dataset, we can estimate the summary statistics.

In [27]:
# Estimate summary statistics for the contrived dataset
$dataset = [[50, 30], [20, 90], [30, 50]];
printf "%s\n", dump $dataset;

# Calculate the statistics
my $means = sml->column_means2($dataset);
my $stdevs = sml->column_stdevs2($dataset, $means);

printf "Means: %s\n", dump $means;
printf "Standard deviations: %s\n", dump $stdevs;


# Example Output From Calculating Statistics from the Contrived Dataset.
# [[50, 30], [20, 90], [30, 50]]
# [33.333333333333336, 56.666666666666664]
# [15.275252316519467, 30.550504633038933]

[[50, 30], [20, 90], [30, 50]]
Means: [[33.3333320617676, 56.6666679382324]]
Standard deviations: [[12.4721918106079, 24.9443836212158]]
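
The reference values in the comments (about 15.28 and 30.55) and the printed values (about 12.47 and 24.94) appear to differ only in the divisor used for the variance. The plain-Perl sketch below, which uses a hypothetical helper `column_stdev` that is not part of the `sml` module, computes both variants for the first column of the contrived dataset so the two results can be compared side by side.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: standard deviation of one column.
# $ddof = 1 divides by (n - 1) (sample); $ddof = 0 divides by n (population).
sub column_stdev {
    my ($values, $ddof) = @_;
    my $n    = scalar @$values;
    my $mean = sum(@$values) / $n;
    my $ss   = sum(map { ($_ - $mean) ** 2 } @$values);
    return sqrt($ss / ($n - $ddof));
}

my @first_column = (50, 20, 30);
printf "sample:     %.4f\n", column_stdev(\@first_column, 1);
printf "population: %.4f\n", column_stdev(\@first_column, 0);
# prints:
# sample:     15.2753
# population: 12.4722
```

Either divisor yields a valid rescaling; what matters is that the same choice is used when the statistics are estimated and when they are later applied.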



Once the summary statistics are calculated, we can easily standardize the values in each
column. The calculation to standardize a given value is as follows:

<center>$standardized\_value_i = (value_i - mean)\ /\ stdev$</center>  (2.4)

Below is a function named standardize_dataset2() that implements this equation.

In [29]:
# Function To Standardize a Dataset.
# Standardize dataset
sub standardize_dataset2{
    my ($self, $dataset, $means, $stdevs) = @_;
    my $mx_data = mx->nd->array($dataset);
    my $mx_means = mx->nd->array($means);
    my $mx_stdevs = mx->nd->array($stdevs);
    
    my $standardized = ($mx_data - $mx_means) / $mx_stdevs;
    @$dataset = @{$standardized->asarray};
}

sml->add_to_class('standardize_dataset2', \&standardize_dataset2);



Combining this with the functions to estimate the mean and standard deviation summary
statistics, we can standardize our contrived dataset.

In [30]:
printf "%s\n", dump $means;
printf "%s\n", dump $stdevs;

sml->standardize_dataset2($dataset, $means, $stdevs);
printf "%s\n", dump $dataset;

# Example Output From Standardizing the Contrived Dataset.
# [[1.0910894511799618, -0.8728715609439694], 
#  [-0.8728715609439697, 1.091089451179962],
#  [-0.21821789023599253, -0.2182178902359923]]

[[33.3333320617676, 56.6666679382324]]
[[12.4721918106079, 24.9443836212158]]
[
  [1.33630621433258, -1.06904494762421],
  [-1.06904482841492, 1.33630609512329],
  [-0.267261117696762, -0.267261296510696],
]



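Again, we can demonstrate the standardization of a machine learning dataset. The example
below loads the Pima Indians diabetes dataset from the same data/ path as the previous
normalization example, normalizes it, and then standardizes it. Because standardization
subtracts the mean and divides by the standard deviation, the preceding normalization step
does not change the standardized values beyond rounding.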

In [32]:
# Load pima-indians-diabetes dataset
$filename = 'data/pima-indians-diabetes.csv';
$dataset = sml->load_csv($filename);
printf "Loaded data file %s with %d rows and %d columns. \n", $filename, scalar @$dataset, scalar @{$dataset->[0]};
for my $i (0 .. $#{$dataset->[0]}){
    sml->str_column_to_float($dataset, $i);
}
printf "%s\n", dump $dataset->[0];

$minmax = sml->dataset_minmax2($dataset);

# 2. Normalize the dataset to the range [0, 1]
sml->normalize_dataset2($dataset, $minmax);

# 3. Calculate the mean and standard deviation for standardization
$means = sml->column_means2($dataset);
$stdevs = sml->column_stdevs2($dataset, $means);

# 4. Standardize the dataset (mean = 0, standard deviation = 1)
sml->standardize_dataset2($dataset, $means, $stdevs);

# 5. Show the first processed row
printf "%s\n", dump $dataset->[0];

# Example Output From Standardizing the Diabetes Dataset.
# Loaded data file pima-indians-diabetes.csv with 768 rows and 9 columns
# [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
# [0.6395304921176576, 0.8477713205896718, 0.14954329852954296, 0.9066790623472505,
# -0.692439324724129, 0.2038799072674717, 0.468186870229798, 1.4250667195933604,
# 1.3650063669598067]

Loaded data file data/pima-indians-diabetes.csv with 768 rows and 9 columns. 
["6.0", "148.0", "72.0", "35.0", "0.0", 33.6, 0.6, "50.0", "1.0"]
[
  0.639947295188904,
  0.84832364320755,
  0.149640902876854,
  0.907269954681396,
  -0.692890584468842,
  0.204012244939804,
  0.38508066534996,
  1.42599534988403,
  1.36589586734772,
]



### 2.2.3 When to Normalize and Standardize

Standardization is a scaling technique that assumes your data conforms to a normal distribution.
If a given data attribute is normal or close to normal, this is probably the scaling method to use.
It is good practice to record the summary statistics used in the standardization process so that
you can apply them when standardizing data in the future that you may want to use with your
model. Normalization is a scaling technique that does not assume any specific distribution.

If your data is not normally distributed, consider normalizing it prior to applying your
machine learning algorithm. It is good practice to record the minimum and maximum values
for each column used in the normalization process, again, in case you need to normalize new
data in the future to be used with your model.
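
The advice to record the scaling statistics can be sketched as two separate steps: estimate the statistics once from the training data, then reuse them for any new row. The helper names `fit_minmax` and `apply_minmax` and the new row `[35, 120]` are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: learn per-column min and max from training rows only.
sub fit_minmax {
    my ($rows) = @_;
    my @stats;
    for my $c (0 .. $#{ $rows->[0] }) {
        my @column = map { $_->[$c] } @$rows;
        push @stats, { min => min(@column), max => max(@column) };
    }
    return \@stats;
}

# Hypothetical helper: rescale one row with previously stored statistics.
sub apply_minmax {
    my ($row, $stats) = @_;
    return [
        map {
            my $range = $stats->[$_]{max} - $stats->[$_]{min};
            # A constant training column has range 0; map it to 0.
            $range == 0 ? 0 : ($row->[$_] - $stats->[$_]{min}) / $range;
        } 0 .. $#$row
    ];
}

my $train = [[50, 30], [20, 90], [30, 50]];
my $stats = fit_minmax($train);                 # record these alongside the model
my $new   = apply_minmax([35, 120], $stats);    # new data reuses the stored values
printf "%.2f %.2f\n", @$new;
# prints: 0.50 1.50
```

The second value falls outside the range 0 to 1 because 120 is larger than any value seen in training. Whether to clip such values or leave them as they are depends on the model and is left open here.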

## 2.3 Extensions

There are many other data transforms you could apply. The idea of data transforms is to best
expose the structure of your problem in your data to the learning algorithm. It may not be
clear what transforms are required upfront. A combination of trial and error and exploratory
data analysis (plots and stats) can help tease out what may work. Below are some additional
transforms you may want to consider researching and implementing:
* Normalization that permits a configurable range, such as -1 to 1 and more.
* Standardization that permits a configurable spread, such as 1, 2 or more standard deviations
from the mean.
* Exponential transforms such as logarithm, square root and exponents.
* Power transforms such as Box-Cox for reducing skew so that the data is closer to normally distributed.
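
As a starting point for the first extension, normalization to a configurable range can be sketched by stretching the 0 to 1 result onto a target interval. The function name `rescale_to_range` and its parameters are hypothetical, and returning the lower bound for a constant column is just one possible convention.

```perl
use strict;
use warnings;

# Hypothetical helper: rescale a value from [$min, $max] into [$low, $high].
sub rescale_to_range {
    my ($value, $min, $max, $low, $high) = @_;
    return $low if $max == $min;    # constant column: no spread to rescale
    return $low + ($value - $min) * ($high - $low) / ($max - $min);
}

# Rescale three values from a column with min 20 and max 50 into the range -1 to 1.
printf "%.2f\n", rescale_to_range($_, 20, 50, -1, 1) for (20, 35, 50);
# prints:
# -1.00
# 0.00
# 1.00
```

With `$low = 0` and `$high = 1` this reduces to the normalization in equation 2.1.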

## 2.4 Review

In this tutorial, you discovered how to rescale your data for machine learning from scratch.
Specifically, you learned:
* How to normalize data from scratch.
* How to standardize data from scratch.
* When to use normalization or standardization on your data.