# Quantile Transformation
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

DataPrep can apply a quantile transformation to a numeric column. The transformation maps the data to a normal or uniform distribution. When the transformation is applied, values that fall outside the learnt boundaries are clipped to those boundaries.
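
To make the idea concrete, here is a minimal plain-Python sketch of a uniform quantile transformation. It is an illustration of the general technique only, not DataPrep's implementation; the helper names `learn_boundaries` and `to_uniform` are hypothetical, and it assumes `quantiles_count` is at least 2.

```python
from bisect import bisect_right


def learn_boundaries(values, quantiles_count=5):
    """Learn boundaries at evenly spaced quantiles of the training values."""
    data = sorted(values)
    n = len(data)
    probs = [i / (quantiles_count - 1) for i in range(quantiles_count)]
    boundaries = []
    for p in probs:
        pos = p * (n - 1)          # fractional index into the sorted data
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        # Linear interpolation between the two neighbouring data points.
        boundaries.append(data[lo] + (data[hi] - data[lo]) * (pos - lo))
    return probs, boundaries


def to_uniform(x, probs, boundaries):
    """Map x to [0, 1], clipping values outside the learnt boundaries."""
    x = min(max(x, boundaries[0]), boundaries[-1])   # clip to learnt range
    i = min(bisect_right(boundaries, x), len(boundaries) - 1)
    left, right = boundaries[i - 1], boundaries[i]
    if right == left:              # degenerate interval (repeated boundary)
        return probs[i]
    frac = (x - left) / (right - left)
    return probs[i - 1] + frac * (probs[i] - probs[i - 1])


# Hypothetical toy data, not the median income sample.
probs, boundaries = learn_boundaries([1.0, 2.0, 3.0, 4.0, 5.0])
below = to_uniform(0.0, probs, boundaries)    # below the learnt minimum
inside = to_uniform(2.5, probs, boundaries)   # inside the learnt range
above = to_uniform(10.0, probs, boundaries)   # above the learnt maximum
```

In this sketch, inputs outside the learnt range land exactly on 0 or 1, which is the clipping behaviour described above.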

Let's load a sample of the median income of California households in different suburbs from the 1990 census data. From the data profile, we can see that the minimum and maximum values are 0.9946 and 15 respectively.

In [1]:
import azureml.dataprep as dprep

dflow = dprep.read_csv(path='../data/median_income.csv').set_column_types(type_conversions={
    'median_income': dprep.TypeConverter(dprep.FieldType.DECIMAL)
})
dflow.get_profile()

Column,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
median_income,FieldType.DECIMAL,0.9946,15.0,250.0,0.0,250.0,0.0,0.0,0.0,0.9946,1.96745,1.9531,2.6998,3.6279,4.77335,8.3792,11.2866,15.0,4.007843,2.026679,4.10743,1.763205,4.703196


Let's now apply quantile transformation to `median_income` and see how that affects the data. We will apply the transformation twice: once mapping the data to a Uniform(0, 1) distribution and once mapping it to a Normal(0, 1) distribution.

From the data profile, we can see that the min and max of the uniform median income are 0 and 1, and that the mean and standard deviation of the normal median income are close to 0 and 1 respectively.

*Note: for the normal distribution, the values at the ends are clipped because the 0th and 100th percentiles map to -Inf and Inf respectively.*
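
The need for clipping can be seen in a small sketch that uses the standard library's `statistics.NormalDist`. The helper name `uniform_to_normal` and the margin `EPS` are assumptions made for illustration; the margin DataPrep actually uses is not stated here.

```python
from statistics import NormalDist

EPS = 1e-7  # illustrative clipping margin; not DataPrep's documented value


def uniform_to_normal(u, eps=EPS):
    """Map a probability u in [0, 1] to a standard normal value."""
    # The inverse normal CDF is unbounded at 0 and 1, so keep u
    # strictly inside (0, 1) before inverting.
    u = min(max(u, eps), 1.0 - eps)
    return NormalDist().inv_cdf(u)


low = uniform_to_normal(0.0)    # finite, large negative value
mid = uniform_to_normal(0.5)    # the median maps to (approximately) 0
high = uniform_to_normal(1.0)   # finite, large positive value
```

A smaller margin pushes the two extreme outputs further from zero, so the choice of margin determines the min and max seen in the transformed column.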

In [2]:
dflow = dflow.quantile_transform(source_column='median_income', new_column='median_income_uniform', quantiles_count=5)
dflow = dflow.quantile_transform(source_column='median_income', new_column='median_income_normal',
                                 quantiles_count=5, output_distribution="Normal")
dflow.get_profile()

Column,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
median_income,FieldType.DECIMAL,0.9946,15.0,250.0,0.0,250.0,0.0,0.0,0.0,0.9946,1.96745,1.9531,2.6998,3.6279,4.77335,8.3792,11.2866,15.0,4.007843,2.026679,4.10743,1.763205,4.703196
median_income_normal,FieldType.DECIMAL,-7.941345,7.941444,250.0,0.0,250.0,0.0,0.0,0.0,-7.941345,-1.068625,-1.077959,-0.67449,-0.000726,0.667828,0.986876,1.335982,7.941444,-0.062255,1.022674,1.045862,-0.00028,26.651015
median_income_uniform,FieldType.DECIMAL,0.0,1.0,250.0,0.0,250.0,0.0,0.0,0.0,0.0,0.14263,0.140526,0.25,0.49971,0.747867,0.838148,0.909222,1.0,0.484399,0.253287,0.064154,-0.079341,-1.335671


Let's now save the dataflow so that we can load it later in the operationalization notebook.

In [3]:
from tempfile import mkdtemp
from os import path

tmp_dir = mkdtemp()
pkg_path = path.join(tmp_dir, 'quantile_transform.dprep')
pkg = dprep.Package(arg=dflow)
pkg.save(pkg_path)
print('Package saved to: "{}"'.format(pkg_path))

Package saved to: "/tmp/tmpwt82axi5/quantile_transform.dprep"
