# Min-Max Scaler
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

In [1]:
import azureml.dataprep as dprep

The min-max scaler scales all values in a column to a desired range (typically [0, 1]). This is also known as feature scaling or unity-based normalization. Min-max scaling is commonly used to normalize numeric columns in a data set for machine learning algorithms.
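
As a rough illustration of the arithmetic behind min-max scaling to [0, 1], here is a minimal pure-Python sketch. The function name `scale_to_unit` is hypothetical and is not part of `azureml.dataprep`; it only shows the formula `(value - min) / (max - min)`.

```python
def scale_to_unit(values):
    """Map values linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread, so the formula would divide by zero.
        raise ValueError("cannot scale a constant column: max equals min")
    return [(v - lo) / (hi - lo) for v in values]


scaled = scale_to_unit([2.0, 4.0, 6.0, 10.0])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```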

First, load a data set containing information about crime in Chicago. Keep only a few columns.

In [2]:
dflow = dprep.read_csv('../data/crime-spring.csv')
dflow = dflow.keep_columns(columns=['ID', 'District', 'FBI Code'])
dflow = dflow.to_number(columns=['District', 'FBI Code'])
dflow.head(5)

Unnamed: 0,ID,District,FBI Code
0,10498554,5.0,11.0
1,10516598,6.0,6.0
2,10519196,22.0,11.0
3,10519591,5.0,10.0
4,10534446,17.0,6.0


Use `get_profile()` to see summary statistics for each column, such as the minimum, maximum, count, and number of error values.

In [3]:
dflow.get_profile()

Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
ID,FieldType.STRING,10498554,10535059,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
District,FieldType.DECIMAL,5,24,10.0,0.0,10.0,0.0,0.0,0.0,5.0,5.0,5.0,6.0,13.0,19.0,24.0,24.0,24.0,13.5,6.94822,48.2778,0.0930109,-1.62325
FBI Code,FieldType.DECIMAL,6,11,10.0,0.0,10.0,0.0,0.0,0.0,6.0,6.0,6.0,6.0,11.0,11.0,11.0,11.0,11.0,9.4,2.36643,5.6,-0.702685,-1.59582


To apply min-max scaling, call `min_max_scale` on the Dataflow and specify the column name. This triggers a full data scan over the column to determine the min and max values, and then performs the scaling. Note that the min and max values of the column are preserved at this point. If the same dataflow steps are performed over a different dataset, the min-max scaler must be re-executed.

In [4]:
dflow_district = dflow.min_max_scale(column='District')
dflow_district.head(5)

Unnamed: 0,ID,District,FBI Code
0,10498554,0.0,11.0
1,10516598,0.052632,6.0
2,10519196,0.894737,11.0
3,10519591,0.0,10.0
4,10534446,0.631579,6.0


Look at the data profile to see that the "District" column is now scaled; the min is 0 and the max is 1. Any error values and missing values from the source column are preserved.

In [5]:
dflow_district.get_profile()

Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
ID,FieldType.STRING,10498554,10535059,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
District,FieldType.DECIMAL,0,1,10.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0526316,0.421053,0.736842,1.0,1.0,1.0,0.447368,0.365696,0.133733,0.0930109,-1.62325
FBI Code,FieldType.DECIMAL,6,11,10.0,0.0,10.0,0.0,0.0,0.0,6.0,6.0,6.0,6.0,11.0,11.0,11.0,11.0,11.0,9.4,2.36643,5.6,-0.702685,-1.59582
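
To make the idea of preserved missing values concrete, the sketch below passes missing entries through untouched while scaling the rest. It is an illustration only: `scale_keep_missing` is a hypothetical helper, and `None` is used as a stand-in for a missing value, which may differ from how the library represents missing or error values internally.

```python
def scale_keep_missing(values, data_min, data_max):
    """Scale to [0, 1], leaving None (a stand-in for a missing value) unchanged."""
    span = data_max - data_min
    return [None if v is None else (v - data_min) / span for v in values]


result = scale_keep_missing([5.0, None, 24.0], data_min=5.0, data_max=24.0)
print(result)  # [0.0, None, 1.0]
```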


You can also specify a custom range for the scaling. Instead of [0, 1], let's choose [-10, 10].

In [6]:
dflow_district_range = dflow.min_max_scale(column='District', range_min=-10, range_max=10)
dflow_district_range.head(5)

Unnamed: 0,ID,District,FBI Code
0,10498554,-10.0,11.0
1,10516598,-8.947368,6.0
2,10519196,7.894737,11.0
3,10519591,-10.0,10.0
4,10534446,2.631579,6.0
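
The custom-range output above can be checked by hand. The sketch below assumes the usual two-step formula (scale to [0, 1], then stretch and shift to the target range) and uses the `District` min and max reported by the profile (5 and 24). The helper `scale_to_range` is hypothetical, not a library function.

```python
def scale_to_range(values, data_min, data_max, range_min=0.0, range_max=1.0):
    """Linearly map [data_min, data_max] onto [range_min, range_max]."""
    unit = [(v - data_min) / (data_max - data_min) for v in values]
    return [range_min + u * (range_max - range_min) for u in unit]


# District values from the first five rows shown earlier.
district = [5.0, 6.0, 22.0, 5.0, 17.0]
scaled = scale_to_range(district, data_min=5.0, data_max=24.0,
                        range_min=-10.0, range_max=10.0)
print([round(s, 6) for s in scaled])
# [-10.0, -8.947368, 7.894737, -10.0, 2.631579]
```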


In some cases, you may want to manually provide the min and max of the data in the source column. For example, you may want to avoid a full data scan because the dataset is large and you already know the min and max. You can pass the known values to `min_max_scale`, and the column will be scaled using them. You can also provide only one of the two values. In the following example, the `FBI Code` column is scaled with 6 supplied as `data_min`, which becomes 0 (`range_min`); the data is still scanned to find `data_max`, which becomes 1 (`range_max`).

In [7]:
dflow_fbi = dflow.min_max_scale(column='FBI Code', data_min=6)
dflow_fbi.get_profile()

Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
ID,FieldType.STRING,10498554,10535059,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
District,FieldType.DECIMAL,5,24,10.0,0.0,10.0,0.0,0.0,0.0,5.0,5.0,5.0,6.0,13.0,19.0,24.0,24.0,24.0,13.5,6.94822,48.2778,0.0930109,-1.62325
FBI Code,FieldType.DECIMAL,0,1,10.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.68,0.473286,0.224,-0.702685,-1.59582
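
The behavior of supplying only one bound can be sketched as follows. This is an assumption-labeled illustration rather than the library's implementation: `scale_with_known_bounds` is a hypothetical helper that scans the data only for a bound the caller did not provide.

```python
def scale_with_known_bounds(values, data_min=None, data_max=None):
    """Scale to [0, 1], scanning the data only for bounds that were not supplied."""
    if data_min is None:
        data_min = min(values)
    if data_max is None:
        data_max = max(values)
    return [(v - data_min) / (data_max - data_min) for v in values]


# FBI Code values from the first five rows shown earlier.
fbi_code = [11.0, 6.0, 11.0, 10.0, 6.0]
scaled = scale_with_known_bounds(fbi_code, data_min=6.0)  # data_max is found by scanning
print(scaled)  # [1.0, 0.0, 1.0, 0.8, 0.0]
```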


## Using a Min-Max Scaler builder

For more flexibility when constructing the arguments for the min-max scaling, you can use a Min-Max Scaler builder.

In [8]:
builder = dflow.builders.min_max_scale(column='District')
builder

MinMaxScalerBuilder
    column: 'District'
    range_min: 0
    range_max: 1
    data_min: None
    data_max: None

Calling `builder.learn()` will trigger a full data scan to see what `data_min` and `data_max` are. You can choose whether to use these values or set custom values.

In [9]:
builder.learn()
builder

MinMaxScalerBuilder
    column: 'District'
    range_min: 0
    range_max: 1
    data_min: 5.0
    data_max: 24.0

If you want to provide custom values for any of the arguments, you can update the builder object.

In [10]:
builder.range_max = 10
builder.data_min = 6
builder

MinMaxScalerBuilder
    column: 'District'
    range_min: 0
    range_max: 10
    data_min: 6
    data_max: 24.0

When you are satisfied with the arguments, call `builder.to_dataflow()` to get the result. Note that the min and max values of the source column are preserved by the builder at this point. If you need to get the true `data_min` and `data_max` values again, set those arguments on the builder to `None` and then call `builder.learn()` again.

In [11]:
dflow_builder = builder.to_dataflow()
dflow_builder.head(5)

Unnamed: 0,ID,District,FBI Code
0,10498554,-0.555556,11.0
1,10516598,0.0,6.0
2,10519196,8.888889,11.0
3,10519591,-0.555556,10.0
4,10534446,6.111111,6.0
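
Note that the first row scales to a value below `range_min`: because `data_min` was overridden to 6, the source value 5 falls outside the assumed data range and is extrapolated rather than clipped, judging from the output above. The sketch below mimics the learn-then-override workflow in plain Python. `ScalerSettings` is a hypothetical stand-in, not the library's `MinMaxScalerBuilder`, and it assumes that `learn()` only fills in bounds that are still `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScalerSettings:
    """Hypothetical stand-in for the builder's arguments; not the library class."""
    range_min: float = 0.0
    range_max: float = 1.0
    data_min: Optional[float] = None
    data_max: Optional[float] = None

    def learn(self, values: List[float]) -> None:
        # Assumption: only bounds that are still None are filled from the data.
        if self.data_min is None:
            self.data_min = min(values)
        if self.data_max is None:
            self.data_max = max(values)

    def apply(self, values: List[float]) -> List[float]:
        if self.data_min is None or self.data_max is None:
            raise ValueError("call learn() or set data_min and data_max first")
        span = self.data_max - self.data_min
        width = self.range_max - self.range_min
        return [self.range_min + (v - self.data_min) / span * width for v in values]


district = [5.0, 6.0, 22.0, 5.0, 17.0]
settings = ScalerSettings()
settings.learn(district + [24.0])  # 24 is the column max reported by the profile
settings.range_max = 10.0          # same overrides as the builder cells above
settings.data_min = 6.0
print([round(v, 6) for v in settings.apply(district)])
# [-0.555556, 0.0, 8.888889, -0.555556, 6.111111]
```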
