# Examples of Transformations

### An Automunge Demonstration

Automunge is available now for pip install:

In [1]:
# !pip install Automunge

Or to upgrade (we currently roll out upgrades pretty frequently):

In [2]:
# !pip install Automunge --upgrade

Once installed, run this in a local session to initialize:

In [3]:
from Automunge import Automunger
am = Automunger.AutoMunge()

Automunge is a platform for preparing tabular data for machine learning, where tabular data refers to two-dimensional tables of feature set columns and sampled rows, such as may be provided in the form of a pandas dataframe or numpy array.

The library has two master functions:
- automunge(.) for the initial preparation of "training" data
- postmunge(.) for subsequent preparation of "test" data on a consistent basis

Where training data refers to data that may then be applied to train a machine learning model, and test data refers to corresponding data that may be applied to generate predictions from that model.

The preparations performed by these functions include numerical encodings and missing data infill, and may also include feature engineering transformations, either sourced from an extensive built-in library or custom defined. The result is to make raw data suitable for the direct application of machine learning.

Feature engineering transformations, and in some cases sets of transformations which may include generations and branches of derivations, may be applied to a distinct column in the received set. In some cases, received feature sets may be returned in multiple configurations.

When the automunge(.) function is applied to a received tabular training set, in addition to applying data transformations, the function populates and returns a dictionary containing all of the steps and parameters of transformations, which may then be passed to the postmunge(.) function with a corresponding tabular test set for consistent processing on the training set basis.

For our demonstrations, we'll set a few parameters to make visualization easier:
- we'll turn off data set shuffling with the shuffletrain parameter
- we'll turn off printouts with the printstatus parameter

To demonstrate the types of data transformations available, let's create a toy data set with a few representative categories of data. We'll create a toy training set which includes a labels column and a toy test set without labels included.

In [4]:
import pandas as pd
import numpy as np

toy_df_train = \
pd.DataFrame({'numbers':[1.0, 5.2, -3, 0, 11, np.nan], \
              'strings':['square', 'square', 'circle', 'circle', 'triangle', np.nan], \
              'time'   :['1/01/2017 1:00am', 
                         '3/10/2018 6:30am', 
                         '5/31/2018 10:15am', 
                         '7/26/2019 1:40pm', 
                         '9/05/2019 6:45pm', 
                         '12/25/2019'], \
              'labels' :['yes', 'yes', 'yes', 'no', 'no', 'yes'],
             })


toy_df_test = \
pd.DataFrame({'numbers':[-2, 3.4, 0, 10, np.nan, 5], \
              'strings':['circle', 'square', 'square', 'triangle', np.nan, 'circle'], \
              'time'   :['3/03/2018 4:00am', 
                         '7/10/2017 6:30pm', 
                         '11/12/2018 6:15am', 
                         '4/13/2019 12:40pm', 
                         '8/25/2017 6:45pm', 
                         '1/02/2018'],
             })


In [5]:
toy_df_train

Unnamed: 0,numbers,strings,time,labels
0,1.0,square,1/01/2017 1:00am,yes
1,5.2,square,3/10/2018 6:30am,yes
2,-3.0,circle,5/31/2018 10:15am,yes
3,0.0,circle,7/26/2019 1:40pm,no
4,11.0,triangle,9/05/2019 6:45pm,no
5,,,12/25/2019,yes


In [6]:
toy_df_test

Unnamed: 0,numbers,strings,time
0,-2.0,circle,3/03/2018 4:00am
1,3.4,square,7/10/2017 6:30pm
2,0.0,square,11/12/2018 6:15am
3,10.0,triangle,4/13/2019 12:40pm
4,,,8/25/2017 6:45pm
5,5.0,circle,1/02/2018


Our demonstrations will illustrate different types of data transformations that may be applied to these toy data sets. Let's start with numerical encodings under automation.

When we run the automunge(.) function, it returns a series of 17 sets. The names given to the returned sets are optional; these demonstrations use the following convention:

- train: processed training data
- trainID: index column and carveout columns corresponding to train data
- labels: processed label sets corresponding to training data
- validation1, validationID1, validationlabels1: validation sets carved out from training data
- validation2, validationID2, validationlabels2: second validation sets carved out from training data
- test, testID, testlabels: processed test data if a test set was passed to function
- labelsencoding_dict, finalcolumns_train, finalcolumns_test, featureimportance: informational reports
- postprocess_dict: populated dictionary for consistently preparing additional data

In [7]:
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(toy_df_train, \
              labels_column = 'labels', \
              shuffletrain = False, \
              printstatus = False)

train

Unnamed: 0,numbers_nmbr,strings_1010_0,strings_1010_1,time_tmzn_year,time_tmzn_mdsn,time_tmzn_mdcs,time_tmzn_hmss,time_tmzn_hmsc,time_tmzn_bshr,time_tmzn_wkdy,time_tmzn_hldy
0,-0.379221,0,1,-1.632993,0.5145554,0.857457,0.258819,0.965926,0,0,0
1,0.486392,0,1,-0.408248,0.9857698,-0.168101,0.991445,-0.130526,0,0,0
2,-1.203615,0,0,-0.408248,1.224647e-16,-1.0,0.442289,-0.896873,1,1,0
3,-0.585319,0,0,0.816497,-0.8207635,-0.571268,-0.422618,-0.906308,1,1,0
4,1.681763,1,0,0.816497,-0.9961947,0.087156,-0.980785,0.19509,0,1,0
5,0.0,1,1,0.816497,0.4098203,0.912166,0.0,1.0,0,1,1


Note that if this training data is to be used to train a model, the returned postprocess_dict dictionary should be externally saved, such as with the pickle library.
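
As a minimal sketch of that save and restore step, the round trip with the standard library pickle module could look like the following. The small dictionary and the file location are stand-ins invented for this example; in practice the object saved would be the postprocess_dict returned by automunge(.).

```python
import os
import pickle
import tempfile

# Stand-in for the dictionary returned by automunge(.); the real one is
# much larger, this placeholder only illustrates the save/load round trip.
postprocess_dict_demo = {'column_map': {'numbers': ['numbers_nmbr']}}

# Hypothetical file location chosen for this example
path = os.path.join(tempfile.gettempdir(), 'postprocess_dict.pickle')

# Save after preparing the training data
with open(path, 'wb') as handle:
    pickle.dump(postprocess_dict_demo, handle)

# Restore later, for example in the session that prepares test data
with open(path, 'rb') as handle:
    restored = pickle.load(handle)
```

The restored dictionary can then be passed to postmunge(.) in place of the original.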

And using the returned dictionary from the automunge(.) call, we can then consistently process test data, returning the same type and order of columns with consistent basis derived from the training set.

In [8]:
test, testID, testlabels, \
labelsencoding_dict, postreports_dict = \
am.postmunge(postprocess_dict, toy_df_test, \
             printstatus = False)

test

Unnamed: 0,numbers_nmbr,strings_1010_0,strings_1010_1,time_tmzn_year,time_tmzn_mdsn,time_tmzn_mdcs,time_tmzn_hmss,time_tmzn_hmsc,time_tmzn_bshr,time_tmzn_wkdy,time_tmzn_hldy
0,-0.997516,0,0,-0.408248,0.998717,-0.050649,0.866025,0.5,0,0,0
1,0.115415,0,1,-1.632993,-0.638465,-0.769651,-0.991445,0.130526,0,1,0
2,-0.585319,0,1,-0.408248,-0.309017,0.951057,0.997859,-0.065403,0,1,1
3,1.475665,1,0,0.816497,0.731354,-0.681998,-0.173648,-0.984808,1,0,0
4,0.0,1,1,-1.632993,-0.994869,-0.101168,-0.980785,0.19509,0,1,0
5,0.445173,0,0,-0.408248,0.528964,0.848644,0.0,1.0,0,1,0


# ___________________

# numeric

Let's inspect each of the returned sets in detail. For visualization, we concatenate an input column with the corresponding returned set. The returned columns have the same header as the input column, with a suffix appended that identifies the transformation that was applied.

In [9]:
inputcolumn = 'numbers'

visualization = pd.concat([toy_df_train[inputcolumn], \
                           train[postprocess_dict['column_map'][inputcolumn]]], \
                           axis=1)
                           
visualization

Unnamed: 0,numbers,numbers_nmbr
0,1.0,-0.379221
1,5.2,0.486392
2,-3.0,-1.203615
3,0.0,-0.585319
4,11.0,1.681763
5,,0.0


Here we see that the input column 'numbers', containing numerical data, defaulted to a z-score normalization in which the data was centered and scaled to a mean of 0 and standard deviation of 1. The default missing data infill for this transform is imputation with the set's mean.
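
For intuition, the arithmetic behind this column can be sketched directly in pandas. This is not the library's implementation, just a reconstruction under two assumptions inferred from the printed values: the standard deviation is the population form (ddof=0), and missing entries are filled with the training mean before scaling. The same training statistics are then reused for the test column.

```python
import numpy as np
import pandas as pd

train_numbers = pd.Series([1.0, 5.2, -3, 0, 11, np.nan])
test_numbers = pd.Series([-2, 3.4, 0, 10, np.nan, 5])

# Statistics come from the training column only
# (ddof=0 is an assumption that reproduces the table above)
mean = train_numbers.mean()
std = train_numbers.std(ddof=0)

def zscore(column):
    # Mean infill first, so missing entries land at 0 after centering
    return (column.fillna(mean) - mean) / std

train_z = zscore(train_numbers)
test_z = zscore(test_numbers)
```

The first entry of train_z comes out near -0.379221 and the first entry of test_z near -0.997516, in line with the numbers_nmbr columns returned by automunge(.) and postmunge(.) above.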

# categoric

The default for categoric data applies a numeric encoding by way of a binary transform, in which distinct categoric values may be represented by zero, one, or more simultaneous activations in the returned set. The convention for missing data infill is a distinct set of activations.

In [10]:
inputcolumn = 'strings'

visualization = pd.concat([toy_df_train[inputcolumn], \
                           train[postprocess_dict['column_map'][inputcolumn]]], \
                           axis=1)
                           
visualization

Unnamed: 0,strings,strings_1010_0,strings_1010_1
0,square,0,1
1,square,0,1
2,circle,0,0
3,circle,0,0
4,triangle,1,0
5,,1,1
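
A rough sketch of how such a binary encoding can be derived by hand follows. The ordering rule (alphabetical, with missing data assigned the next code) is an assumption chosen because it reproduces the table above; the library may assign codes differently in other cases.

```python
import numpy as np
import pandas as pd

strings = pd.Series(['square', 'square', 'circle', 'circle', 'triangle', np.nan])

# Sorted unique categories get integer codes 0..n-1; missing data gets the next code
categories = sorted(strings.dropna().unique())
code_of = {value: i for i, value in enumerate(categories)}
missing_code = len(categories)
codes = strings.map(code_of).fillna(missing_code).astype(int)

# Number of bits needed to represent the largest code
width = max(1, missing_code.bit_length())

# Spell each code out in binary, most significant bit first
binary = pd.DataFrame(
    {f'strings_1010_{bit}': (codes // 2 ** (width - 1 - bit)) % 2
     for bit in range(width)}
)
```

Four distinct codes (three categories plus missing) fit in two columns, where one-hot encoding would need one column per category.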


# datetime

The datetime data has some more extensive transformations. Here the time stamps are segregated by time scale and subjected to both sin and cos transforms to allow a model to recognize periodicity. We also by default aggregate bin markers to designate time stamps that fall within business hours, weekdays, and holidays.

In [11]:
inputcolumn = 'time'

visualization = pd.concat([toy_df_train[inputcolumn], \
                           train[postprocess_dict['column_map'][inputcolumn]]], \
                           axis=1)
                           
visualization

Unnamed: 0,time,time_tmzn_year,time_tmzn_mdsn,time_tmzn_mdcs,time_tmzn_hmss,time_tmzn_hmsc,time_tmzn_bshr,time_tmzn_wkdy,time_tmzn_hldy
0,1/01/2017 1:00am,-1.632993,0.5145554,0.857457,0.258819,0.965926,0,0,0
1,3/10/2018 6:30am,-0.408248,0.9857698,-0.168101,0.991445,-0.130526,0,0,0
2,5/31/2018 10:15am,-0.408248,1.224647e-16,-1.0,0.442289,-0.896873,1,1,0
3,7/26/2019 1:40pm,0.816497,-0.8207635,-0.571268,-0.422618,-0.906308,1,1,0
4,9/05/2019 6:45pm,0.816497,-0.9961947,0.087156,-0.980785,0.19509,0,1,0
5,12/25/2019,0.816497,0.4098203,0.912166,0.0,1.0,0,1,1
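
To illustrate the periodic encoding, here is a sketch of the hour-minute-second pair (the hmss and hmsc columns) for the first three time stamps. The formula, sine and cosine of the fraction of a 24-hour day, is inferred from the printed values rather than taken from the library source.

```python
import numpy as np
import pandas as pd

times = pd.Series(pd.to_datetime(['2017-01-01 01:00',
                                  '2018-03-10 06:30',
                                  '2018-05-31 10:15']))

# Fraction of the 24-hour day elapsed at each time stamp
day_fraction = (times.dt.hour
                + times.dt.minute / 60
                + times.dt.second / 3600) / 24

# One full day corresponds to one full period of each wave
hmss = np.sin(2 * np.pi * day_fraction)
hmsc = np.cos(2 * np.pi * day_fraction)
```

Using both waves matters: a sine value alone is shared by two different times of day, while the sine and cosine pair identifies the time uniquely and keeps 11:59pm close to 12:00am.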


# labels

Our label set for categoric labels defaults to one-hot encoding, a different type of categoric encoding than the binary encoding demonstrated above. In one-hot encoding, each distinct categoric value is associated with activations in a distinct column.

In [12]:
inputcolumn = 'labels'

visualization = pd.concat([toy_df_train[inputcolumn], \
                           labels[postprocess_dict['column_map'][inputcolumn]]], \
                           axis=1)
                           
visualization

Unnamed: 0,labels,labels_no,labels_yes
0,yes,0,1
1,yes,0,1
2,yes,0,1
3,no,1,0
4,no,1,0
5,yes,0,1


# ___________________

# assigned transforms

A user does not have to defer to defaults. Automunge includes an extensive library of data transformations and data transformation sets that can be easily applied to a column. Let's demonstrate a few alternatives.

Assigning transformations to a column is performed by passing a dictionary called the assigncat.

# normalizations

Several types of normalization are available. Here, for example, is min-max scaling, which maps the training data onto the fixed range 0 to 1, in contrast to the unbounded output of the z-score normalization demonstrated above.

In [13]:
target_category = 'mnmx'
inputcolumn     = 'numbers'

assigncat = {target_category : inputcolumn}

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(toy_df_train, \
              labels_column = 'labels', \
              shuffletrain = False, \
              assigncat = assigncat, \
              printstatus = False)



visualization = pd.concat([toy_df_train[inputcolumn], \
                           train[postprocess_dict['column_map'][inputcolumn]]], \
                           axis=1)
                           
visualization

Unnamed: 0,numbers,numbers_mnmx
0,1.0,0.285714
1,5.2,0.585714
2,-3.0,0.0
3,0.0,0.214286
4,11.0,1.0
5,,0.417143
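
The values in this table can be reproduced with a few lines of pandas. This sketch assumes, based on the printed output, that the minimum and maximum come from the training column and that missing entries are filled with the training mean before scaling.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1.0, 5.2, -3, 0, 11, np.nan])

# Training-set statistics (NaN is skipped by default)
minimum, maximum, mean = numbers.min(), numbers.max(), numbers.mean()

# Mean infill, then shift and scale so the training range maps onto 0..1
mnmx = (numbers.fillna(mean) - minimum) / (maximum - minimum)
```

Because the scaling basis is fixed by the training data, test values outside the training range would fall below 0 or above 1.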


Another type of normalization available in the library retains the sign of the received data after scaling, which may benefit interpretability.

In [14]:
target_category = 'retn'
inputcolumn     = 'numbers'

assigncat = {target_category : inputcolumn}

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(toy_df_train, \
              labels_column = 'labels', \
              shuffletrain = False, \
              assigncat = assigncat, \
              printstatus = False)



visualization = pd.concat([toy_df_train[inputcolumn], \
                           train[postprocess_dict['column_map'][inputcolumn]]], \
                           axis=1)
                           
visualization

Unnamed: 0,numbers,numbers_retn
0,1.0,0.071429
1,5.2,0.371429
2,-3.0,-0.214286
3,0.0,0.0
4,11.0,0.785714
5,,0.202857
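
For this toy column, where the training data spans both negative and positive values, the printed results are consistent with dividing by the range without shifting. The sketch below covers only that mixed-sign case; how all-positive or all-negative columns are handled is not shown here and should be checked against the library documentation.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1.0, 5.2, -3, 0, 11, np.nan])

minimum, maximum, mean = numbers.min(), numbers.max(), numbers.mean()

# Assumption: minimum < 0 < maximum, as in the toy column.
# Dividing by the range without subtracting the minimum keeps
# negative inputs negative and zero at zero.
retn = numbers.fillna(mean) / (maximum - minimum)
```

Compared with min-max scaling, the spread of the output is the same, but zero stays at zero instead of the minimum being moved to zero.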


Normalizations are also available to be applied in sets, such as to present a received column in multiple configurations to a training operation. Here is an example in which min-max normalization is supplemented by a set of aggregated bins indicating number of standard deviations from the mean.

In [15]:
target_category = 'mnm7'
inputcolumn     = 'numbers'

assigncat = {target_category : inputcolumn}

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(toy_df_train, \
              labels_column = 'labels', \
              shuffletrain = False, \
              assigncat = assigncat, \
              printstatus = False)



visualization = pd.concat([toy_df_train[inputcolumn], \
                           train[postprocess_dict['column_map'][inputcolumn]]], \
                           axis=1)
                           
visualization

Unnamed: 0,numbers,numbers_mnmx,numbers_bins_0,numbers_bins_1,numbers_bins_2,numbers_bins_3,numbers_bins_4,numbers_bins_5
0,1.0,0.285714,0,0,1,0,0,0
1,5.2,0.585714,0,0,0,1,0,0
2,-3.0,0.0,0,1,0,0,0,0
3,0.0,0.214286,0,0,1,0,0,0
4,11.0,1.0,0,0,0,0,1,0
5,,0.417143,0,0,1,0,0,0
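
The bin columns can be approximated by cutting z-scores at whole standard deviations. In this sketch the bin edges, and the choice to close each interval on the right, are assumptions picked to match the table above (note the infilled row at exactly the mean landing in bins_2).

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1.0, 5.2, -3, 0, 11, np.nan])

# z-scores on the training basis; missing entries sit at the mean, so z = 0
z = ((numbers - numbers.mean()) / numbers.std(ddof=0)).fillna(0.0)

# Six bins: below -2, -2 to -1, -1 to 0, 0 to 1, 1 to 2, above 2
edges = [-np.inf, -2, -1, 0, 1, 2, np.inf]
bin_index = pd.cut(z, bins=edges, labels=False)

# One activation column per bin
bins = pd.DataFrame(
    {f'numbers_bins_{i}': (bin_index == i).astype(int) for i in range(6)}
)
```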


# categoric 

Similarly, there are several options available for categoric sets. Let's demonstrate a few. First, recall that the default for categoric data is binary encoding.

In [16]:
target_category = '1010'
inputcolumn     = 'strings'

assigncat = {target_category : inputcolumn}

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(toy_df_train, \
              labels_column = 'labels', \
              shuffletrain = False, \
              assigncat = assigncat, \
              printstatus = False)



visualization = pd.concat([toy_df_train[inputcolumn], \
                           train[postprocess_dict['column_map'][inputcolumn]]], \
                           axis=1)
                           
visualization

Unnamed: 0,strings,strings_1010_0,strings_1010_1
0,square,0,1
1,square,0,1
2,circle,0,0
3,circle,0,0
4,triangle,1,0
5,,1,1


Note that the defaults themselves are configurable.

Here is one-hot encoding, in which each category has a unique column for activations.

In [17]:
target_category = 'text'
inputcolumn     = 'strings'

assigncat = {target_category : inputcolumn}

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(toy_df_train, \
              labels_column = 'labels', \
              shuffletrain = False, \
              assigncat = assigncat, \
              printstatus = False)



visualization = pd.concat([toy_df_train[inputcolumn], \
                           train[postprocess_dict['column_map'][inputcolumn]]], \
                           axis=1)
                           
visualization

Unnamed: 0,strings,strings_circle,strings_square,strings_triangle
0,square,0,1,0
1,square,0,1,0
2,circle,1,0,0
3,circle,1,0,0
4,triangle,0,0,1
5,,0,0,0
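
For comparison, plain pandas produces the same layout with get_dummies. This is offered only as a point of reference, not as what the library does internally; the column prefix is chosen to mirror the headers above.

```python
import numpy as np
import pandas as pd

strings = pd.Series(['square', 'square', 'circle', 'circle', 'triangle', np.nan])

# One column per distinct category; missing entries activate no column
onehot = pd.get_dummies(strings, prefix='strings').astype(int)
```

As in the table above, the row with missing data is represented by all zeros.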


Another option is ordinal integer encoding, which is applied as a default when the number of unique values exceeds a configurable heuristic threshold.

In [18]:
target_category = 'ord3'
inputcolumn     = 'strings'

assigncat = {target_category : inputcolumn}

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(toy_df_train, \
              labels_column = 'labels', \
              shuffletrain = False, \
              assigncat = assigncat, \
              printstatus = False)



visualization = pd.concat([toy_df_train[inputcolumn], \
                           train[postprocess_dict['column_map'][inputcolumn]]], \
                           axis=1)
                           
visualization

Unnamed: 0,strings,strings_ord3
0,square,1
1,square,1
2,circle,0
3,circle,0
4,triangle,2
5,,3


Of course, it's not enough to just numerically encode training data. We also need to apply a consistent basis to test data, as demonstrated here with the postmunge(.) function.

In [19]:
test, testID, testlabels, \
labelsencoding_dict, postreports_dict = \
am.postmunge(postprocess_dict, toy_df_test, \
             printstatus = False)

visualization = pd.concat([toy_df_test[inputcolumn], \
                           test[postprocess_dict['column_map'][inputcolumn]]], \
                           axis=1)
                           
visualization

Unnamed: 0,strings,strings_ord3
0,circle,0
1,square,1
2,square,1
3,triangle,2
4,,3
5,circle,0
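
The idea of a consistent basis can be sketched as follows: the mapping from category to integer is derived once from the training column and then reused unchanged on the test column. The alphabetical ordering and the treatment of unseen values as missing are assumptions made for this example; they reproduce the two tables above, but the library may order categories differently, for instance by frequency.

```python
import numpy as np
import pandas as pd

train_strings = pd.Series(['square', 'square', 'circle', 'circle', 'triangle', np.nan])
test_strings = pd.Series(['circle', 'square', 'square', 'triangle', np.nan, 'circle'])

# Mapping is learned from the training data only
categories = sorted(train_strings.dropna().unique())
mapping = {value: i for i, value in enumerate(categories)}
missing_code = len(mapping)

def encode(column):
    # Values absent from the mapping (missing, or unseen in training)
    # receive the reserved missing-data code
    return column.map(mapping).fillna(missing_code).astype(int)

train_ord = encode(train_strings)
test_ord = encode(test_strings)
```

Re-deriving the mapping from the test column instead could assign different integers to the same category, which is why the training basis is carried forward.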


# transformation sets

In some cases, we may want to present our feature sets to machine learning in multiple configurations, for example when we are unsure how a feature should be interpreted. In Automunge, sets of transformations can be specified for application to a target column with a simple set of "family tree" primitives defined for a root category. Let's demonstrate here.

Custom defined family trees of transformations can be passed to a function call by way of the transformdict and processdict parameters. The transformdict parameter is used to populate a family tree, and the processdict parameter is used to populate a supporting data structure that defines transformation category properties.

Let's demonstrate a scenario to assemble a transformation set in which a normalization is supplemented by two types of bin aggregation. We'll create a new root category 'newt' and populate it with transformations pre-defined in the library. Here we'll apply an upstream retain normalization and power of ten bins, with standard deviation bins downstream of the retain normalization. We'll also include a NArw transformation, which designates markers for entries that were subject to infill based on values of the source column.

In [20]:
transformdict =  {'newt' : {'parents'       : ['newt'],
                            'siblings'      : [],
                            'auntsuncles'   : ['pwr2'],
                            'cousins'       : ['NArw'],
                            'children'      : [],
                            'niecesnephews' : [],
                            'coworkers'     : [],
                            'friends'       : ['bins']}}

The corresponding processdict will make use of transformation functions defined in the library for the retain normalization. Here NArowtype designates the types of entries from the source column that will be targets for infill, MLinfilltype designates the types of predictive models to be trained for ML infill, and labelctgy is a support entry for feature importance for cases where a label is returned in multiple configurations.

In [21]:
processdict    =  {'newt' : {'functionpointer' : 'retn', 
                             'NArowtype'       : 'numeric',
                             'MLinfilltype'    : 'numeric',
                             'labelctgy'       : 'newt'}}

We can then pass these populated structures to a function call and assign a column to the newly defined root category 'newt'. If we want we can also apply ML infill even on this custom defined transformation set.

In [22]:
target_category = 'newt'
inputcolumn     = 'numbers'

assigncat = {target_category : inputcolumn}

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(toy_df_train, \
              labels_column = 'labels', \
              shuffletrain = False, \
              assigncat = assigncat, \
              transformdict = transformdict, \
              processdict = processdict, \
              printstatus = False)



visualization = pd.concat([toy_df_train[inputcolumn], \
                           train[postprocess_dict['column_map'][inputcolumn]]], \
                           axis=1)
                           
visualization

Unnamed: 0,numbers,numbers_NArw,numbers_retn,numbers_retn_bins_0,numbers_retn_bins_1,numbers_retn_bins_2,numbers_retn_bins_3,numbers_retn_bins_4,numbers_retn_bins_5,numbers_-10^0,numbers_10^0,numbers_10^1
0,1.0,0,0.071429,0,0,1,0,0,0,0,1,0
1,5.2,0,0.371429,0,0,0,1,0,0,0,1,0
2,-3.0,0,-0.214286,0,1,0,0,0,0,1,0,0
3,0.0,0,0.0,0,0,1,0,0,0,0,0,0
4,11.0,0,0.785714,0,0,0,0,1,0,0,0,1
5,,1,0.202857,0,0,1,0,0,0,0,0,0
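
Two of the returned column groups have not appeared earlier: the NArw marker and the power of ten bins. A hand-rolled approximation of both is sketched below. The helper power_label and the rule it applies (floor of the base-10 logarithm of the absolute value, with the sign kept and no activation for zero or missing entries) are inferred from the table above, not taken from the library.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1.0, 5.2, -3, 0, 11, np.nan])

# NArw-style marker: 1 where the source entry was missing
narw = numbers.isna().astype(int)

def power_label(x):
    # Hypothetical helper: zero and missing values have no power of ten,
    # so they receive no label and therefore no activation
    if pd.isna(x) or x == 0:
        return None
    exponent = int(np.floor(np.log10(abs(x))))
    sign = '-' if x < 0 else ''
    return f'{sign}10^{exponent}'

labels = pd.Series([power_label(x) for x in numbers])

# One activation column per signed power of ten found in the data
power_bins = pd.get_dummies(labels, prefix='numbers').astype(int)
```

The remaining columns in the table, numbers_retn and its downstream standard deviation bins, follow the retain normalization and binning demonstrated earlier.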
