# Custom Transformations

This notebook is a companion to the essay [Custom Transformations with Automunge](https://medium.com/automunge/custom-transformations-with-automunge-ae694c635a7e), and demonstrates user-defined transformation functions for integration into the platform. We recommend reading this notebook after reading the essay.

The language presented will be similar to the essay, but will replace the code demonstrations with the full transformation function templates.

Automunge is available now for pip install:

In [1]:
# !pip install Automunge

Or to upgrade (we currently roll out upgrades pretty frequently):

In [2]:
# !pip install Automunge --upgrade

Once installed, run this in a local session to initialize:

In [3]:
from Automunge import *
am = AutoMunge()

We'll apply these transformations to the Titanic data set, a well-known tabular benchmark available on Kaggle.

In [4]:
import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


For the demonstrations below, we'll apply the custom transformations to the 'Fare' feature, which is a continuous numeric set.

In [5]:
targetcolumn = 'Fare'

The purpose of this notebook is to offer an introduction to a newly streamlined standard for Automunge custom transformation functions. To be a little more precise, custom transformation functions are user-defined operations that transform the entries found in a column of a tabular data set. They support basing those operations on properties of the entries in a designated “training set”, so that additional data can be prepared on a consistent basis by a separate corresponding “test set” custom transformation function. By implementing custom transformations through the platform, a user can integrate such operations within a set of transformations whose order of operations is defined by our family tree primitives, potentially mixed with transforms available in our internal library. The integration of custom transformations comes with built-in support for auto ML derived missing data infill and pushbutton aggregate inversions.

The new streamlined convention was rolled out in version 6.41 and abstracts away almost all of the complexity of prior conventions, where by complexity we mean accommodating and populating the various data structures passed alongside transformations. In the new convention, which we refer to in the documentation as the 'custom_train' convention, operations can be defined independently of those data structures, which are all managed separately in a wrapper function. A user simply defines a function that receives a training set dataframe (df), a target column (column), and a dictionary of received parameters (normalization_dict), and that returns the resulting transformed dataframe (df) along with the same dictionary (normalization_dict), now logging any properties from the train set needed to consistently prepare additional data. And since the function targets a column in a Pandas dataframe, it may include Pandas or NumPy operations, for instance.

In this example, we’ll define a custom_train_template that applies an operation similar to z-score normalization.

In [6]:
def custom_train_template(df, column, normalization_dict):
  """
  #Template for a custom_train transformation function to be applied to a train feature set.
  
  #Where if a custom_test entry is not defined then custom_train will be applied to any 
  #corresponding test feature sets as well (as may be ok when processing the feature in df_test 
  #doesn't require accessing any train data properties from the normalization_dict).

  #Receives a df as a pandas dataframe
  #Where df will generally be from df_train (or may also be from df_test when custom_test not specified)

  #column is the target column of transform
  #which will already have the suffix appender incorporated when this is applied

  #normalization_dict is a dictionary pre-populated with any parameters passed in assignparam
  #(and also parameters designated in any defaultparams for the associated processdict entry)

  #returns the resulting transformed dataframe as df

  #returns normalization_dict, which is a dictionary for storing properties derived from train data
  #that may then be accessed to consistently transform test data
  #note that any desired drift statistics can also be stored in normalization_dict
  #e.g. normalization_dict.update({'property' : property})

  #note that prior to this function call 
  #a datatype casting based on the NArowtype processdict entry may have been performed
  #as well as a default infill of adjinfill 
  #unless infill type otherwise specified in a defaultinfill processdict entry
  #note that this default infill is a precursor to ML infill
  
  #note that if this same custom_train is to be applied to both train and test data 
  #(when custom_test not defined) then the quantity, headers, and order of returned columns 
  #will need to be consistent independent of data properties
  
  #Note that the assumptions for data type of received data
  #Should align with the NArowtype specified in processdict
  
  #Note that the data types and quantity of returned columns 
  #Will need to align with the MLinfilltype specified in processdict
  
  #note that following this function call a dtype conversion will take place based on MLinfilltype
  #unless deactivated with a dtype_convert processdict entry 
  """

  #As an example, here is the application of z-score normalization 
  #derived based on the training set mean and standard deviation
  
  #which can accept any kind of numeric data 
  #so corresponding NArowtype processdict entry can be 'numeric'
  #and returns a single column of continuous numeric data 
  #so corresponding MLinfilltype processdict entry will need to be 'numeric'

  #where we'll include the option for a parameter 'multiplier'
  #which is an arbitrary example to demonstrate accessing parameters
  
  #basically we check if that parameter had been passed in assignparam or defaultparams
  if 'multiplier' in normalization_dict:
    multiplier = normalization_dict['multiplier']
    
  #or otherwise assign and save a default value
  else:
    multiplier = 1
    normalization_dict.update({'multiplier' : multiplier})

  #Now we measure any properties of the train data used for the transformation
  mean = df[column].mean()
  stdev = df[column].std()
  
  #It's good practice to ensure numbers used in derivation haven't been derived as nan
  #or would result in dividing by zero
  if mean != mean:
    mean = 0
  if stdev != stdev or stdev == 0:
    stdev = 1
    
  #In general if that same basis will be needed to process test data we'll store in normalization_dict
  normalization_dict.update({'mean' : mean,
                             'stdev': stdev})

  #Optionally we can measure additional drift stats for a postmunge driftreport
  #we will also save those in the normalization_dict
  minimum = df[column].min()
  maximum = df[column].max()
  normalization_dict.update({'minimum' : minimum,
                             'maximum' : maximum})

  #Now we can apply the transformation
  
  #The generic formula for z-score normalization is (x - mean) / stdev
  #here we incorporate an additional variable as the multiplier parameter (defaults to 1)
  df[column] = (df[column] - mean) * multiplier / stdev
  
  #A few clarifications on column management for reference:
  
  #Note that it is ok to return multiple columns
  #we recommend naming additional columns as a function of the received column header 
  #e.g. newcolumn = column + '_' + str(int)
  #returned column headers should be strings
  
  #when columns are conditionally created as a function of data properties 
  #will need to save headers for reference in custom_test
  # e.g. normalization_dict.update({'newcolumns_list' : [newcolumn]})
  
  #Note that it is ok to delete the received column from dataframe as part of transform if desired
  #If any other temporary columns were created as part of transform that aren't returned
  #their column headers should be logged as a normalization_dict entry under 'tempcolumns'
  # e.g. normalization_dict.update({'tempcolumns' : [tempcolumn]})

  return df, normalization_dict

For additional data, which we refer to as test data, a user can either allow the same function to be applied as was applied to the training set (as may be appropriate when operations are independent of training set properties), or define a corresponding custom transformation that transforms the test data column based on properties accessed from the corresponding training data column, which were logged in the returned dictionary.

This is an example of a corresponding operation using properties derived from the train data to conduct a form of z-score normalization on a consistent basis.

In [7]:
def custom_test_template(df, column, normalization_dict):
  """
  #This transform will be applied to a test data feature set
  #on a basis of a corresponding custom_train entry
  #Such as test data passed to either automunge(.) or postmunge(.)
  #Using properties from the train set basis stored in the normalization_dict

  #Note that when a custom_test entry is not defined, 
  #The custom_train entry will instead be applied to both train and test data

  #Receives df as a pandas dataframe of test data
  #and a string column header (column) 
  #which will correspond to the column (with suffix appender already included) 
  #that was passed to custom_train

  #Also receives a normalization_dict dictionary
  #Which will be the dictionary populated in and returned from custom_train

  #note that prior to this function call 
  #a datatype casting based on the NArowtype processdict entry may have been performed
  #as well as a default infill of adjinfill 
  #unless infill type otherwise specified in a defaultinfill processdict entry
  
  #where convention is that the quantity, headers, and order of returned columns
  #will need to match those returned from the corresponding custom_train
  """

  #As an example, here is the corresponding z-score normalization 
  #derived based on the training set mean and standard deviation
  #which was populated in a normalization_dict in the custom_train example given above

  #Basically the workflow is we access any values needed from the normalization_dict
  #apply the transform
  #and return the transformed dataframe

  #access the train set properties from normalization_dict
  mean = normalization_dict['mean']
  stdev = normalization_dict['stdev']
  multiplier = normalization_dict['multiplier']

  #then apply the transformation and return the dataframe
  df[column] = (df[column] - mean) * multiplier / stdev

  return df
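
For contrast with the train/test pair above, here is a minimal sketch of the other option described earlier: a transform that does not depend on any training set property, so a custom_test function could be omitted and the same function applied to both sets. The function name `custom_train_log1p`, the toy dataframe, and the column header `Fare_demo` are hypothetical and only for illustration; the function is called directly here rather than through the Automunge wrapper.

```python
import numpy as np
import pandas as pd

def custom_train_log1p(df, column, normalization_dict):
    # Hypothetical example: log(1 + x) needs no train set statistics,
    # so nothing has to be stored for later use on test data.
    # Negative entries are clipped to zero so the logarithm stays defined.
    df[column] = np.log1p(df[column].clip(lower=0))
    return df, normalization_dict

# Toy data standing in for a received column (illustrative header)
df_demo = pd.DataFrame({'Fare_demo': [0.0, 7.25, 71.2833]})
df_demo, norm_dict = custom_train_log1p(df_demo, 'Fare_demo', {})
print(df_demo)
```

Because the quantity, headers, and order of returned columns do not depend on the data, this kind of function satisfies the consistency note in the custom_train docstring.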

The custom_train convention has also streamlined the specification of a corresponding custom inversion operation, which may be defined with a similarly simple template or omitted when inversion support is not needed.

Here is an example of the corresponding inversion for the same z-score operation.

In [8]:
def custom_inversion_template(df, returnedcolumn_list, inputcolumn, normalization_dict):
  """
  #User also has the option to define a custom inversion function
  #Corresponding to custom_train and custom_test

  #Where the function receives a dataframe df 
  #Containing a post-transform configuration of one or more columns whose headers are 
  #recorded in returnedcolumn_list
  #And this function is for purposes of creating a new column with header inputcolumn
  #Which inverts that transformation originally applied to produce those 
  #columns in returnedcolumn_list

  #Here normalization_dict is the same as populated and returned from a corresponding custom_train
  #as applied to the train set

  #Returns the transformed dataframe df with the addition of a new column as df[inputcolumn]
  
  #Note that the returned dataframe should retain the columns in returnedcolumn_list
  #Whose retention will be managed elsewhere
  """

  #As an example, here we'll be inverting the z-score normalization 
  #derived based on the training set mean and standard deviation
  #which corresponds to the examples given above

  #Basically the workflow is we access any values needed from the normalization_dict
  #Initialize the new column inputcolumn
  #And use values in the set from returnedcolumn_list to recover values for inputcolumn

  #First let's access the values we'll need from the normalization_dict
  mean = normalization_dict['mean']
  stdev = normalization_dict['stdev']
  multiplier = normalization_dict['multiplier']

  #Now initialize the inputcolumn
  df[inputcolumn] = 0

  #So for the example of z-score normalization, we know returnedcolumn_list will only have one entry
  #In some other cases transforms may have returned multiple columns
  returnedcolumn = returnedcolumn_list[0]

  #now we perform the inversion
  df[inputcolumn] = (df[returnedcolumn] * stdev / multiplier) + mean

  return df
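
Before handing functions like these to the platform, it can be useful to check outside of Automunge that the train, test, and inversion steps agree with each other. The following is a rough sketch of such a round trip in plain pandas. The function names `zscore_train`, `zscore_test`, and `zscore_inversion` are hypothetical compact restatements of the three templates above, the train values are the first five 'Fare' entries shown earlier, and the test values are illustrative. The functions are called directly, so none of the wrapper behavior (suffix handling, infill, dtype conversion) is exercised.

```python
import pandas as pd

def zscore_train(df, column, normalization_dict):
    # Read the multiplier if it was passed, otherwise store a default of 1
    multiplier = normalization_dict.setdefault('multiplier', 1)
    mean = df[column].mean()
    stdev = df[column].std()
    # Guard against nan statistics and division by zero
    if mean != mean:
        mean = 0
    if stdev != stdev or stdev == 0:
        stdev = 1
    normalization_dict.update({'mean': mean, 'stdev': stdev})
    df[column] = (df[column] - mean) * multiplier / stdev
    return df, normalization_dict

def zscore_test(df, column, normalization_dict):
    nd = normalization_dict
    df[column] = (df[column] - nd['mean']) * nd['multiplier'] / nd['stdev']
    return df

def zscore_inversion(df, returnedcolumn_list, inputcolumn, normalization_dict):
    nd = normalization_dict
    returnedcolumn = returnedcolumn_list[0]
    df[inputcolumn] = df[returnedcolumn] * nd['stdev'] / nd['multiplier'] + nd['mean']
    return df

train_demo = pd.DataFrame({'Fare_newt': [7.25, 71.2833, 7.925, 53.1, 8.05]})
test_demo = pd.DataFrame({'Fare_newt': [7.8292, 7.0, 9.6875]})  # illustrative values
original_test = test_demo['Fare_newt'].copy()

# Fit on train, apply to test on the same basis, then invert the test column
train_demo, nd = zscore_train(train_demo, 'Fare_newt', {'multiplier': 10})
test_demo = zscore_test(test_demo, 'Fare_newt', nd)
test_demo = zscore_inversion(test_demo, ['Fare_newt'], 'Fare', nd)

# Largest absolute difference between recovered and original test values
max_error = (test_demo['Fare'] - original_test).abs().max()
print(max_error)
```

If the inversion is consistent with the forward transform, `max_error` should be at the level of floating point rounding.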

Having defined our custom transformation functions, we can then pass them to an automunge(.) call by way of populating two corresponding data structures, the processdict and the transformdict. These data structures are for defining properties of “transformation categories” which can then be assigned to a column to apply the transformation functions. Each of these data structures is addressed in some detail in the recent essay [Data Structure](https://medium.com/automunge/data-structure-59e52f141dd6).

The processdict is for defining properties associated with a “transformation category”, including various properties like what kind of received data is considered valid input, the form and structure of returned data from the transform (e.g. integers, floats, or boolean integers in one or multiple returned columns), as well as the associated transformation functions, which as used here will be our custom transformation functions defined in the “custom_train” convention. (The bulk of the internal library has transformations defined in an alternate convention.) 

Here we demonstrate populating a processdict entry for a new transformation category we’ll refer to as ‘newt’. This string, in addition to serving as a transformation category identifier, will also be included as a suffix appender on the returned column (thus an input column with header ‘targetcolumn’ would be returned as ‘targetcolumn_newt’). Note that if we’re not sure what processdict entries to apply for other properties, we can just copy entries from another category in the library. Here we’ll match our other entries to the ‘nmbr’ category by way of a functionpointer entry, which will populate entries corresponding to ‘nmbr’ for any processdict entries not already specified. (Note that if we omit the entry for ‘custom_test’ or designate it as None, the same custom_train_template will be applied to both training and test data.)

In [9]:
processdict = \
{'newt' :
 {'custom_train' : custom_train_template,
  'custom_test' : custom_test_template,
  'custom_inversion' : custom_inversion_template,
  'functionpointer' : 'nmbr',
 }}
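
As a variation on the entry above, the two properties called out in the custom_train docstring can be written out explicitly rather than inherited. This is only a sketch: it relies on the behavior described above, where functionpointer fills in entries that are not already specified, and it takes the values 'numeric' for NArowtype and MLinfilltype from the notes in the template. The names `placeholder_train` and `processdict_explicit` are hypothetical; in practice the function entry would be the custom_train_template defined earlier.

```python
def placeholder_train(df, column, normalization_dict):
    # Stand-in for the custom_train_template defined earlier in this notebook
    return df, normalization_dict

processdict_explicit = {
    'newt': {
        'custom_train': placeholder_train,
        # Accepts any numeric input, per the notes in the template
        'NArowtype': 'numeric',
        # Returns a single continuous numeric column, per the notes in the template
        'MLinfilltype': 'numeric',
        # Any entries not listed here are still populated from 'nmbr'
        'functionpointer': 'nmbr',
    }
}
```

Since no 'custom_test' entry is given in this variation, the custom_train function would be applied to both training and test data, as noted above.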

The transformdict is for defining sets of transformation categories to be associated with a root transformation category by populating as entries to the Automunge family tree primitives. Thus, when the root category is assigned to a column, the transformation functions associated with the transformation category entries to the family tree primitives will be applied to that column. 

Here we demonstrate populating a family tree for the root category ‘newt’ that we just specified in the processdict. Its family tree entries are the ‘newt’ category, which applies our custom transformation functions, and the ‘NArw’ category, which populates markers for missing data. (Using these same primitives, it is possible to define sets of transformations that include generations and branches of derivations.)

In [10]:
transformdict = \
{'newt' :
 {'parents'       : [],
  'siblings'      : [],
  'auntsuncles'   : ['newt'],
  'cousins'       : ['NArw'],
  'children'      : [],
  'niecesnephews' : [],
  'coworkers'     : [],
  'friends'       : [],
 }}

Now that we’ve defined transformation category properties for the new category 'newt', including our custom transformation functions, and populated a family tree for the use of 'newt' as a root category, we can assign the root category to a target column with the automunge(.) assigncat parameter. Note that when assigning the same root category to multiple target columns, the columns can be entered as a list of headers (using [list] brackets) instead of a single string value as shown here.

In [11]:
assigncat = \
{'newt' : targetcolumn}
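
For reference, the list form mentioned above would look something like the following. The variable names `assigncat_single` and `assigncat_multi`, and the choice of 'Age' as a second target column, are only for illustration.

```python
targetcolumn = 'Fare'

# Single target column passed as a string, as in the cell above
assigncat_single = {'newt': targetcolumn}

# Same root category assigned to several columns, passed as a list of headers
assigncat_multi = {'newt': ['Fare', 'Age']}
```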

Note that if we want to pass parameters to our custom transformation function we can do so with an additional parameter known as assignparam.

Here we'll demonstrate assigning a multiplier to the output with the 'multiplier' parameter defined in our custom_train_template function above.

In [12]:
assignparam = \
{'newt' : 
 {targetcolumn :
  {'multiplier' : 10}}}

Putting it all together for an automunge(.) call to prepare a training data set df_train would look something like this.

(Here we'll also turn off shuffling for visualization purposes and designate our labels_column since it won't be included in df_test.)

In [13]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(
  df_train,
  labels_column = 'Survived',
  shuffletrain=False,
  assigncat = assigncat,
  assignparam = assignparam,
  processdict = processdict,
  transformdict = transformdict)

labelctgy processdict entry wasn't provided for  newt
selecting arbitrary entry based on family tree
labelctgy selected as  newt

_______________
Begin Automunge

evaluating column:  PassengerId
processing column:  PassengerId
    root category:  nmbr
 returned columns:
['PassengerId_nmbr', 'PassengerId_NArw']

evaluating column:  Pclass
processing column:  Pclass
    root category:  nmbr
 returned columns:
['Pclass_nmbr', 'Pclass_NArw']

evaluating column:  Name
processing column:  Name
    root category:  hsh2
 returned columns:
['Name_hsh2', 'Name_NArw']

evaluating column:  Sex
processing column:  Sex
    root category:  bnry
 returned columns:
['Sex_bnry', 'Sex_NArw']

evaluating column:  Age
processing column:  Age
    root category:  nmbr
 returned columns:
['Age_nmbr', 'Age_NArw']

evaluating column:  SibSp
processing column:  SibSp
    root category:  nmbr
 returned columns:
['SibSp_nmbr', 'SibSp_NArw']

evaluating column:  Parch
processing column:  Parch
    root category:  n

In [14]:
train.head()

Unnamed: 0,PassengerId_nmbr,Pclass_nmbr,Name_hsh2,Sex_bnry,Age_nmbr,SibSp_nmbr,Parch_nmbr,Ticket_hsh2,Fare_newt,PassengerId_NArw,...,Cabin_1010_1,Cabin_1010_2,Cabin_1010_3,Cabin_1010_4,Cabin_1010_5,Cabin_1010_6,Cabin_1010_7,Embarked_NArw,Embarked_1010_0,Embarked_1010_1
0,-1.729137,0.826913,200,1,-0.530005,0.43255,-0.473408,403,-5.021631,0,...,0,0,1,0,0,0,0,0,1,1
1,-1.725251,-1.565228,164,0,0.57143,0.43255,-0.473408,611,7.864036,0,...,1,0,1,0,0,1,0,0,0,1
2,-1.721365,0.826913,962,0,-0.254646,-0.474279,-0.473408,310,-4.885798,0,...,0,0,0,0,0,0,0,0,1,1
3,-1.71748,-1.565228,796,0,0.364911,0.43255,-0.473408,761,4.204941,0,...,0,1,1,1,0,0,0,0,1,1
4,-1.713594,0.826913,894,1,0.364911,-0.474279,-0.473408,190,-4.860644,0,...,0,0,0,0,0,0,0,0,1,1


To view the form of the data returned from our custom transformation set as applied to the target feature 'Fare', we can use the column_map saved in the returned dictionary postprocess_dict to access from the returned dataframe train.

In [15]:
train[postprocess_dict['column_map']['Fare']].head()

Unnamed: 0,Fare_newt,Fare_NArw
0,-5.021631,0
1,7.864036,0
2,-4.885798,0
3,4.204941,0
4,-4.860644,0
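
As a quick sanity check that the multiplier of 10 was applied, the numbers printed in this notebook can be worked backwards. The sketch below takes two (raw 'Fare', returned 'Fare_newt') pairs from the tables above, solves the formula (x - mean) * multiplier / stdev for the implied mean and standard deviation, and then predicts a third row. All input values are copied from the displayed outputs, so the result is only as precise as the rounding in those tables.

```python
# (raw Fare, returned Fare_newt) pairs copied from the tables in this notebook
x0, y0 = 7.25, -5.021631
x1, y1 = 71.2833, 7.864036

multiplier = 10  # as passed through assignparam

# The transform is linear in x, so its slope equals multiplier / stdev
slope = (y1 - y0) / (x1 - x0)
stdev = multiplier / slope
mean = x0 - y0 / slope

# Predict the third row (raw Fare of 7.925) from the implied statistics
x2 = 7.925
y2_predicted = (x2 - mean) * multiplier / stdev
print(stdev, mean, y2_predicted)
```

The predicted value should land close to the -4.885798 shown for the third row, which is consistent with a z-score scaled by the multiplier.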


Having populated our postprocess_dict through the automunge call which logs all of the steps and parameters of transformations (you did remember to save it, right?), we can then prepare additional df_test data in a pushbutton operation with a postmunge(.) call which will apply transformations and imputations to corresponding data on a consistent basis. Note that we’ll need to initialize our custom functions again if this is taking place in a separate notebook.

In [16]:
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(
  postprocess_dict,
  df_test)

_______________
Begin Postmunge

______

processing column:  PassengerId
    root category:  nmbr

 returned columns:
['PassengerId_nmbr', 'PassengerId_NArw']

______

processing column:  Pclass
    root category:  nmbr

 returned columns:
['Pclass_nmbr', 'Pclass_NArw']

______

processing column:  Name
    root category:  hsh2

 returned columns:
['Name_hsh2', 'Name_NArw']

______

processing column:  Sex
    root category:  bnry

 returned columns:
['Sex_bnry', 'Sex_NArw']

______

processing column:  Age
    root category:  nmbr

 returned columns:
['Age_nmbr', 'Age_NArw']

______

processing column:  SibSp
    root category:  nmbr

 returned columns:
['SibSp_nmbr', 'SibSp_NArw']

______

processing column:  Parch
    root category:  nmbr

 returned columns:
['Parch_nmbr', 'Parch_NArw']

______

processing column:  Ticket
    root category:  hsh2

 returned columns:
['Ticket_hsh2', 'Ticket_NArw']

______

processing column:  Fare
    root category:  newt

 returned columns:
['Fare_n

In [17]:
test.head()

Unnamed: 0,PassengerId_nmbr,Pclass_nmbr,Name_hsh2,Sex_bnry,Age_nmbr,SibSp_nmbr,Parch_nmbr,Ticket_hsh2,Fare_newt,PassengerId_NArw,...,Cabin_1010_1,Cabin_1010_2,Cabin_1010_3,Cabin_1010_4,Cabin_1010_5,Cabin_1010_6,Cabin_1010_7,Embarked_NArw,Embarked_1010_0,Embarked_1010_1
0,1.733022,0.826913,927,1,0.330491,-0.474279,-0.473408,953,-4.905077,0,...,0,0,0,0,0,0,0,0,1,0
1,1.736908,0.826913,183,0,1.190988,0.43255,-0.473408,123,-5.07194,0,...,0,0,0,0,0,0,0,0,1,1
2,1.740794,-0.369157,846,1,2.223584,-0.474279,-0.473408,197,-4.531124,0,...,0,0,0,0,0,0,0,0,1,0
3,1.74468,0.826913,472,1,-0.185806,-0.474279,-0.473408,305,-4.737389,0,...,0,0,0,0,0,0,0,0,1,1
4,1.748565,0.826913,739,0,-0.530005,0.43255,0.767199,849,-4.007916,0,...,0,0,0,0,0,0,0,0,1,1
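
On the earlier reminder to save the postprocess_dict: one common approach is Python's built-in pickle module. The sketch below uses a small stand-in dictionary (`postprocess_dict_demo`, hypothetical, containing only the column_map entry displayed earlier) and a temporary file path, and assumes the real dictionary is picklable in your installed version. Note that pickle stores module-level functions by reference rather than by value, which is consistent with the note above that custom functions need to be initialized again in a separate notebook.

```python
import os
import pickle
import tempfile

# Stand-in for the dictionary returned from automunge(.)
postprocess_dict_demo = {'column_map': {'Fare': ['Fare_newt', 'Fare_NArw']}}

# Save to disk (a temporary directory is used here for illustration)
path = os.path.join(tempfile.mkdtemp(), 'postprocess_dict.pickle')
with open(path, 'wb') as handle:
    pickle.dump(postprocess_dict_demo, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back, e.g. in a later session before calling postmunge(.)
with open(path, 'rb') as handle:
    restored = pickle.load(handle)
```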


Similarly, we can use the postprocess_dict returned from automunge(.) to invert transformations. Although shown here for the test set that was just prepared (test), inversions could also be performed separately to invert predictions after an inference operation and recover the original form of labels.

In [18]:
df_invert, recovered_list, \
inversion_info_dict = \
am.postmunge(
  postprocess_dict,
  test,
  inversion='test')

_______________
Begin Postmunge

Evaluating inversion paths for columns derived from:  PassengerId
Inversion path selected based on returned column  PassengerId_nmbr
With full recovery.
Recovered source column:  PassengerId

Evaluating inversion paths for columns derived from:  Pclass
Inversion path selected based on returned column  Pclass_nmbr
With full recovery.
Recovered source column:  Pclass

Evaluating inversion paths for columns derived from:  Name
No inversion path available for source column:  Name

Evaluating inversion paths for columns derived from:  Sex
Inversion path selected based on returned column  Sex_bnry
With full recovery.
Recovered source column:  Sex

Evaluating inversion paths for columns derived from:  Age
Inversion path selected based on returned column  Age_nmbr
With full recovery.
Recovered source column:  Age

Evaluating inversion paths for columns derived from:  SibSp
Inversion path selected based on returned column  SibSp_nmbr
With full recovery.
Recovere

Note that an inversion will not recover input form for transforms that do not support inversion, such as the transforms applied to 'Name' and 'Ticket'.

In [19]:
df_invert.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,892.0,3.0,male,34.5,5.960464e-08,0.0,7.829203,,Q
1,893.0,3.0,female,47.0,1.0,0.0,7.0,,S
2,894.0,2.0,male,62.0,5.960464e-08,0.0,9.687502,,Q
3,895.0,3.0,male,27.0,5.960464e-08,0.0,8.6625,,S
4,896.0,3.0,female,22.0,1.0,1.0,12.2875,,S


You now have all you need to define custom transformations for integration into an Automunge data pipeline. Remember, with great power comes great responsibility. Don’t forget to have fun.