# pickle tutorial - security and alternate workflows

This notebook demonstrates some conventions for saving the "postprocess_dict" dictionary, which is the dictionary returned from an automunge(.) call and which may be used to productionize a model with preprocessing built through the Automunge library.

Note that our tutorials so far have relied on the pickle library, a native python module used to serialize and download python dictionaries. Here serialization means a non-encrypted binary encoding of the dictionary, and the same pickle module may then be used to upload and initialize that same dictionary in a new notebook or environment. In some cases the serialization of a postprocess_dict may be beneficial from a storage standpoint when preprocessing includes forms of ML infill, which may store in the dictionary a trained model specific to each tabular feature. We understand that there are other options available to download a python dictionary (e.g. formats like JSON, YAML, etc.), however the pickle module has support for serializing additional data types populated in a dictionary that may not be supported by formats like JSON. The most relevant data types that pickle may help with include the trained machine learning models returned by Automunge library components for e.g. ML infill, PCA, feature selection, etc., as well as any custom data transformation functions that may have been initialized by a user for purposes of custom data transformations integrated into the family tree primitives API.
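As a minimal sketch of that difference in data type support, the snippet below builds a small dictionary holding a set, a datetime, and a function reference, then tries both `json` and `pickle`. The dictionary and its keys are hypothetical stand-ins for the kinds of entries a postprocess_dict may hold, not actual Automunge fields.

```python
import json
import pickle
import statistics
from datetime import datetime

# Hypothetical stand-in for a dictionary holding data types JSON cannot encode.
record = {
    "columns": {"Age", "Fare"},         # set
    "fitted_at": datetime(2023, 1, 1),  # datetime
    "aggregate": statistics.mean,       # function (pickle stores a named reference)
}

# json has no default encoding for these types and raises TypeError.
try:
    json.dumps(record)
    json_error = None
except TypeError as error:
    json_error = str(error)

# pickle round-trips the same dictionary.
restored = pickle.loads(pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL))

print("json error:", json_error)
print("pickle round trip equal:", restored == record)
```

A JSON workflow could still be made to work by writing custom encoders for each type, which is the extra effort that pickle avoids.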

The pickle library, although widely used in data science and other circles, has a known security vulnerability: loading a serialized object that has been altered from its original intended form of distribution can result in the execution of arbitrary code. We note in our readme that python's documentation suggests a mitigation for this vulnerability that relies on a second python module called [hmac](https://docs.python.org/3/library/hmac.html#module-hmac), which derives a form of signature for an original downloaded pickle object. That signature can then be compared to one computed from an uploaded object received over a potentially insecure channel, in order to validate that the uploaded form matches the original intended basis. In the demonstrations of this notebook the hmac signature will be derived and then affixed to the pickled object for comparison prior to loading. An alternative and even more secure approach may be to share the signature through a separate, more secure channel as an additional means of redundancy. This notebook is the first demonstration in our documentation of incorporating the hmac tool into a workflow.
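The signature comparison described above can be sketched in a few lines using only the standard library. The key, the helper names `sign` and `verify`, and the sample dictionary are illustrative assumptions; the point is only that a single altered byte changes the computed signature.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # placeholder key for illustration

def sign(payload, key=SECRET_KEY):
    """Return the hex HMAC-SHA256 signature of a bytes payload."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify(payload, signature, key=SECRET_KEY):
    """Compare signatures in constant time."""
    return hmac.compare_digest(sign(payload, key), signature)

payload = pickle.dumps({"mean": 29.7, "stdev": 14.5})
signature = sign(payload)

# Flip one bit of the first byte to simulate tampering in transit.
tampered = bytes([payload[0] ^ 0x01]) + payload[1:]

original_ok = verify(payload, signature)
tampered_ok = verify(tampered, signature)
wrong_key_ok = verify(payload, signature, key=b"some-other-key")

print(original_ok, tampered_ok, wrong_key_ok)
```

Only a party holding the same key can produce a matching signature, so the check is only as strong as the secrecy of the key.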

One of the ways that pickle appears to limit aspects of this vulnerability is that when a dictionary entry is a custom defined function (as we noted above may be the case when the Automunge API is used for custom designed feature transformation sets), the serialized pickle object doesn't actually store and retain a full function definition. It merely saves a reference to the function by name, such that when a pickle object is uploaded and reinitialized into a separate notebook or environment, a user must first reinitialize any custom functions that were originally populated through an automunge(.) call.
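To see what "a reference by name" looks like in practice, the sketch below pickles a function and inspects the resulting bytes. A standard library function (`statistics.mean`) is used here as a stand-in for a user defined transformation function so that the snippet runs anywhere.

```python
import pickle
import statistics

# Pickle a function object; only its module and name are written out.
payload = pickle.dumps(statistics.mean)

names_present = b"statistics" in payload and b"mean" in payload

# Loading resolves the name back to the function already defined in this environment.
restored = pickle.loads(payload)
same_object = restored is statistics.mean

print(len(payload), names_present, same_object)
```

The payload is only a few dozen bytes because none of the function body is stored, which is why the receiving environment has to be able to resolve the same name.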

In the context of the Automunge workflow, there are definitely tradeoffs associated with relying on a user to reinitialize a custom function definition in a new production environment. Intuitively, for any scenario where the user training a machine learning model with preprocessing conducted through the Automunge API ("Alice") does not match the user running the model in a productionized environment ("Bob"), there may be a desire for Bob not to have easy access to the full function definitions, either from a privacy standpoint or, just as importantly, from an ease of distribution and initialization standpoint. There are alternatives to the pickle library that may have different treatment for the serialization of custom function definitions. Two examples of pickle alternatives include the [dill](https://pypi.org/project/dill/) and [joblib](https://pypi.org/project/joblib/) libraries. This notebook will mainly focus on conventions of the dill library as a starting point.

We understand the dill library to share several conventions with the pickle library; in fact, for our demonstrations below we merely replaced the import and accessing of "pickle" with "dill". A key difference is that when serializing a python dictionary with entries of custom defined functions, instead of serializing the function as a reference like pickle, the dill library serializes the entire function definition. From the standpoint of ease of distribution and integration into a new productionized environment this has obvious benefits (no need to separately distribute function definitions), however from a security standpoint this raises the stakes quite significantly for trust in a received serialized object. We suggest that in any scenario where a user is considering a library like dill for ease of distribution in the context of custom functions, they _always_ do so in conjunction with the hmac signature functionality, which we will also demonstrate below. Please note that while pickle is an internal module of the python language with all of the quality control this would imply, libraries like dill (or Automunge for that matter) are often open source projects from smaller development teams, and so a user may consider performing their own diligence in conjunction with a productionized implementation.

To restate an important distinction, please note that our demonstrations of the hmac signature affix a derived signature as the first line of the serialized package. We expect that an even more secure form of packaging could be implemented by distributing such a signature through some separate and more secure channel as an added bit of redundancy.
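A sketch of that separate-channel variant is shown below: the serialized bytes go to one file and the signature to another, so the two can be distributed independently. The function names, file names, and key are hypothetical, and a temporary directory stands in for the two channels.

```python
import hashlib
import hmac
import pickle
import tempfile
from pathlib import Path

def save_with_detached_signature(obj, data_path, sig_path, key):
    """Write the pickled bytes and their HMAC signature to two separate files."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    Path(data_path).write_bytes(payload)
    Path(sig_path).write_text(hmac.new(key, payload, hashlib.sha256).hexdigest())

def load_with_detached_signature(data_path, sig_path, key):
    """Unpickle only if the bytes match the separately delivered signature."""
    payload = Path(data_path).read_bytes()
    expected = Path(sig_path).read_text().strip()
    computed = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(computed, expected):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)

key = b"demo-key"  # placeholder key for illustration
with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "postprocess_dict.pickle"
    sig_path = Path(tmp) / "postprocess_dict.sig"

    save_with_detached_signature({"mean": 1.0}, data_path, sig_path, key)
    restored = load_with_detached_signature(data_path, sig_path, key)

    # Simulate tampering with the data file only.
    data_path.write_bytes(data_path.read_bytes() + b"\x00")
    try:
        load_with_detached_signature(data_path, sig_path, key)
        tamper_detected = False
    except ValueError:
        tamper_detected = True

print(restored, tamper_detected)
```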

__________

The Automunge library [readme](https://github.com/Automunge/AutoMunge/blob/master/README.md) contains a concise code sample for integrating the pickle module into a user workflow, which we provide again here:

-----

```
#Sample pickle code:

#sample code to download postprocess_dict dictionary returned from automunge(.)
import pickle
with open('filename.pickle', 'wb') as handle:
  pickle.dump(postprocess_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

#to upload for later use in postmunge(.) in another notebook
import pickle
with open('filename.pickle', 'rb') as handle:
  postprocess_dict = pickle.load(handle)

#Please note that if you included externally initialized functions in an automunge(.) call
#like for custom_train transformation functions or customML inference functions
#they will need to be reinitialized prior to uploading the postprocess_dict with pickle.
```
-----

In cases where we demonstrate alternate workflows to this sample code it will be presented in a form that may be directly substituted in place of the code sample above.

Our agenda for this notebook will be to present the following alternatives to the sample pickle workflow noted above:

**1. demonstration of the Automunge API in a workflow relying on the pickle code demonstration above**

**2. pickle module as used in conjunction with custom defined transformation functions in Automunge API**

**3. demonstration of the integration of the hmac signature into a pickle workflow**

**4. demonstration of the dill library (in conjunction with hmac signature) as an alternative to pickle**

**5. demonstration of the dill library (in conjunction with hmac signature) with custom defined transformation functions**

_____

Great, let's get started.
_____

# 1. demonstration of the Automunge API in a workflow relying on the pickle code demonstration above

Here are some basics of the workflow under automation. This is just meant to demonstrate the integration of the pickle download / upload.

Assume we are working with the Titanic data set as a simple common benchmark.

In [1]:
from Automunge import *
am = AutoMunge()

import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#titanic set
labels_column = 'Survived'
trainID_column = 'PassengerId'

#run automunge(.) to encode the dataframes and populate a postprocess_dict
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             labels_column = labels_column,
             trainID_column = trainID_column,
             ML_cmnd = {'stochastic_impute_numeric' : False,
                        'stochastic_impute_categoric' : False,
                       },
            )

_______________
Begin Automunge

______

versioning serial stamp:
_8.33_185349838215

Automunge returned train column set: 
['Pclass_nmbr', 'Sex_bnry', 'Age_nmbr', 'SibSp_nmbr', 'Parch_nmbr', 'Ticket_nmbr', 'Fare_nmbr', 'Pclass_NArw', 'Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13', 'Sex_NArw', 'Age_NArw', 'SibSp_NArw', 'Parch_NArw', 'Ticket_NArw', 'Fare_NArw', 'Cabin_NArw', 'Cabin_1010_0', 'Cabin_1010_1', 'Cabin_1010_2', 'Cabin_1010_3', 'Cabin_1010_4', 'Cabin_1010_5', 'Cabin_1010_6', 'Cabin_1010_7', 'Embarked_NArw', 'Embarked_1010_0', 'Embarked_1010_1']

Automunge returned ID column set: 
['PassengerId', 'Automunge_index']

Automunge returned label column set: 
['Survived_lbbn']

_______________
Automunge Complete



In [2]:
#sample code to download postprocess_dict dictionary returned from automunge(.)
import pickle
with open('filename.pickle', 'wb') as handle:
  pickle.dump(postprocess_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

Now if we want to simulate uploading the serialized pickle dictionary in a separate notebook we can restart the jupyter notebook kernel, which can usually be done with this keyboard shortcut.
```
#now apply a Kernel restart
#Esc + 0 0
```

Then in the reset notebook we can re-apply imports and upload the pickled postprocess_dict, which is then used to consistently encode a test data set.

In [3]:
#to upload for later use in postmunge(.) in another notebook
import pickle
with open('filename.pickle', 'rb') as handle:
  postprocess_dict = pickle.load(handle)

In [4]:
#to upload for later use in postmunge(.) in another notebook
import pickle
with open('filename.pickle', 'rb') as handle:
  postprocess_dict = pickle.load(handle)


from Automunge import *
am = AutoMunge()

import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#titanic set
labels_column = 'Survived'
trainID_column = 'PassengerId'


# df_test


#the uploaded postprocess_dict is used as basis to encode any additional test data
test, test_ID, test_labels, \
postreports_dict \
= am.postmunge(postprocess_dict, 
               df_test,
               printstatus=False,
              testID_column=trainID_column)

# 2. pickle module as used in conjunction with custom defined transformation functions in Automunge API

We refer the reader to the tutorial notebook on github "11 - Custom Transformations" for a full demonstration of defining custom transformation functions for integration into the Automunge API for encoding dataframes. As a clarification to this workflow, it should be noted that when relying on the pickle library to download a serialized postprocess_dict dictionary for use in a separate notebook, any custom transformation functions that were integrated into the automunge(.) call will need to be reinitialized in the separate notebook prior to uploading the postprocess_dict with pickle. Should a user not desire that the functions be visible for reinitialization in a separate notebook, we noted above that libraries like dill may offer the potential to package full function definitions into a serialized postprocess_dict. That practice is outside the scope of this section and is instead the subject of agenda item 5.

We present here a demonstration of custom transformation function definition for integration into an automunge(.) call, followed by pickle download, reset of the notebook to simulate a new environment, reinitialization of the custom functions, pickle upload, and then the postmunge(.) call for processing additional data.
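The reinitialization requirement can be reproduced in isolation, without Automunge, as in the sketch below. A dynamically created module named `alice_notebook` (a hypothetical name) stands in for the notebook namespace, and `custom_scale` is a hypothetical custom function. Deleting the function before loading simulates the fresh environment; redefining it simulates the reinitialization step.

```python
import pickle
import sys
import types

SOURCE = "def custom_scale(x):\n    return x * 10\n"

# Hypothetical stand-in for the notebook namespace where the function is defined.
alice = types.ModuleType("alice_notebook")
sys.modules["alice_notebook"] = alice
exec(SOURCE, alice.__dict__)

# Serialize a dictionary that holds the custom function.
payload = pickle.dumps({"custom_train": alice.custom_scale})

# Simulate a new environment in which the function has not been defined yet.
del alice.custom_scale
try:
    pickle.loads(payload)
    load_failed = False
except AttributeError:
    load_failed = True

# Reinitialize the function, after which the same bytes load successfully.
exec(SOURCE, alice.__dict__)
restored = pickle.loads(payload)
result = restored["custom_train"](3)

# Clean up the stand-in module.
sys.modules.pop("alice_notebook", None)

print(load_failed, result)
```

Note that pickle resolves whatever function currently carries that name, so the reinitialized definition needs to match the one used in training for the encodings to stay consistent.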

In [5]:
from Automunge import *
am = AutoMunge()

import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#titanic set
labels_column = 'Survived'
trainID_column = 'PassengerId'

In [6]:
#now we define our custom transformation functions
#here we will define three functions for training data, test data, and an inversion function
#these are consistent with the separate tutorial notebook "Custom Transformations"


def custom_train_template(df, column, normalization_dict):
  """
  #Template for a custom_train transformation function to be applied to a train feature set.
  #further detail on conventions provided in readme file
  """

  #As an example, here is the application of z-score normalization 
  #derived based on the training set mean and standard deviation
  
  #which can accept any kind of numeric data 
  #so corresponding NArowtype processdict entry can be 'numeric'
  #and returns a single column of continuous numeric data 
  #so corresponding MLinfilltype processdict entry will need to be 'numeric'

  #where we'll include the option for a parameter 'multiplier'
  #which is an arbitrary example to demonstrate accessing parameters
  
  #basically we check if that parameter had been passed in assignparam or defaultparams
  if 'multiplier' in normalization_dict:
    multiplier = normalization_dict['multiplier']
    
  #or otherwise assign and save a default value
  else:
    multiplier = 1
    normalization_dict.update({'multiplier' : multiplier})

  #Now we measure any properties of the train data used for the transformation
  mean = df[column].mean()
  stdev = df[column].std()
  
  #It's good practice to ensure numbers used in derivation haven't been derived as nan
  #or would result in dividing by zero
  if mean != mean:
    mean = 0
  if stdev != stdev or stdev == 0:
    stdev = 1
    
  #In general if that same basis will be needed to process test data we'll store in normalization_dict
  normalization_dict.update({'mean' : mean,
                             'stdev': stdev})

  #Optionally we can measure additional drift stats for a postmunge driftreport
  #we will also save those in the normalization_dict
  minimum = df[column].min()
  maximum = df[column].max()
  normalization_dict.update({'minimum' : minimum,
                             'maximum' : maximum})

  #Now we can apply the transformation
  
  #The generic formula for z-score normalization is (x - mean) / stdev
  #here we incorporate an additional variable as the multiplier parameter (defaults to 1)
  df[column] = (df[column] - mean) * multiplier / stdev

  return df, normalization_dict

def custom_test_template(df, column, normalization_dict):
  """
  #This transform will be applied to a test data feature set
  #on a basis of a corresponding custom_train entry
  #Such as test data passed to either automunge(.) or postmunge(.)
  #Using properties from the train set basis stored in the normalization_dict

  #Further detail on conventions provided in readme file
  """

  #As an example, here is the corresponding z-score normalization 
  #derived based on the training set mean and standard deviation
  #which was populated in a normalization_dict in the custom_train example given above

  #Basically the workflow is we access any values needed from the normalization_dict
  #apply the transform
  #and return the transformed dataframe

  #access the train set properties from normalization_dict
  mean = normalization_dict['mean']
  stdev = normalization_dict['stdev']
  multiplier = normalization_dict['multiplier']

  #then apply the transformation and return the dataframe
  df[column] = (df[column] - mean) * multiplier / stdev

  return df


def custom_inversion_template(df, returnedcolumn_list, inputcolumn, normalization_dict):
  """
  #User also has the option to define a custom inversion function
  #Corresponding to custom_train and custom_test
  #further detail on conventions provided in readme file

  """

  #As an example, here we'll be inverting the z-score normalization 
  #derived based on the training set mean and standard deviation
  #which corresponds to the examples given above

  #Basically the workflow is we access any values needed from the normalization_dict
  #Initialize the new column inputcolumn
  #And use values in the set from returnedcolumn_list to recover values for inputcolumn

  #First let's access the values we'll need from the normalization_dict
  mean = normalization_dict['mean']
  stdev = normalization_dict['stdev']
  multiplier = normalization_dict['multiplier']

  #Now initialize the inputcolumn
  df[inputcolumn] = 0

  #So for the example of z-score normalization, we know returnedcolumn_list will only have one entry
  #In some other cases transforms may have returned multiple columns
  returnedcolumn = returnedcolumn_list[0]

  #now we perform the inversion
  df[inputcolumn] = (df[returnedcolumn] * stdev / multiplier) + mean

  return df
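
The arithmetic in the three template functions can be checked outside of Automunge. The sketch below is plain Python on a hypothetical list of ages, mirroring the forward transform `(x - mean) * multiplier / stdev` and its inversion `(z * stdev / multiplier) + mean`. It uses `statistics.stdev`, which computes the sample standard deviation, the same convention as the pandas `Series.std()` default used in the templates.

```python
import statistics

ages = [22.0, 38.0, 26.0, 35.0, 54.0]  # hypothetical sample values
multiplier = 10

# Properties measured from the training data.
mean = statistics.mean(ages)
stdev = statistics.stdev(ages)

# Forward transform as in custom_train_template / custom_test_template.
encoded = [(x - mean) * multiplier / stdev for x in ages]

# Inversion as in custom_inversion_template.
decoded = [(z * stdev / multiplier) + mean for z in encoded]

max_error = max(abs(a - b) for a, b in zip(ages, decoded))
print(max_error < 1e-9)
```

With a multiplier of 10 the encoded values have a standard deviation of 10 rather than 1, which is consistent with the wider range of the `Age_newt` values printed in this notebook.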


In [7]:
#having defined the functions, in order to pass them through an automunge(.) call
#we then incorporate them into a simple data structure
#these are consistent with the separate tutorial notebook "Custom Transformations"

#these are the transformation functions for a new transformation category we'll call 'newt'
processdict = \
{'newt' :
 {'custom_train' : custom_train_template,
  'custom_test' : custom_test_template,
  'custom_inversion' : custom_inversion_template,
  'functionpointer' : 'nmbr',
 }}

#here we define a family tree for use of 'newt' as a root category
transformdict = \
{'newt' :
 {'parents'       : [],
  'siblings'      : [],
  'auntsuncles'   : ['newt'],
  'cousins'       : ['NArw'],
  'children'      : [],
  'niecesnephews' : [],
  'coworkers'     : [],
  'friends'       : [],
 }}

#now we assign 'newt' to a column
targetcolumn = 'Age'
assigncat = {'newt' : targetcolumn}

#optionally, if we want to pass parameters to the functions, we can use assignparam
assignparam = \
{'newt' : 
 {targetcolumn :
  {'multiplier' : 10}}}


#we can then put all together in an automunge(.) call
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(
  df_train,
  labels_column = 'Survived',
  trainID_column = trainID_column,
  shuffletrain=False,
  assigncat = assigncat,
  assignparam = assignparam,
  processdict = processdict,
  transformdict = transformdict)

_______________
Begin Automunge

______

versioning serial stamp:
_8.33_735411028851

Automunge returned train column set: 
['Pclass_nmbr', 'Sex_bnry', 'Age_newt', 'SibSp_nmbr', 'Parch_nmbr', 'Ticket_nmbr', 'Fare_nmbr', 'Pclass_NArw', 'Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13', 'Sex_NArw', 'Age_NArw', 'SibSp_NArw', 'Parch_NArw', 'Ticket_NArw', 'Fare_NArw', 'Cabin_NArw', 'Cabin_1010_0', 'Cabin_1010_1', 'Cabin_1010_2', 'Cabin_1010_3', 'Cabin_1010_4', 'Cabin_1010_5', 'Cabin_1010_6', 'Cabin_1010_7', 'Embarked_NArw', 'Embarked_1010_0', 'Embarked_1010_1']

Automunge returned ID column set: 
['PassengerId', 'Automunge_index']

Automunge returned label column set: 
['Survived_lbbn']

_______________
Automunge Complete



In [8]:
#here is the column that was target for the custom transformation
train['Age_newt']

0     -1.022554
1      8.070384
2      1.250681
3      6.365458
4      6.365458
         ...   
886    1.818989
887   -2.727479
888   -8.853846
889    1.250681
890    4.660532
Name: Age_newt, Length: 891, dtype: float32

We can then download the saved postprocess_dict which can be later used to consistently prepare additional data in a new notebook.

In [9]:
#sample code to download postprocess_dict dictionary returned from automunge(.)
import pickle
with open('filename.pickle', 'wb') as handle:
  pickle.dump(postprocess_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

### To simulate a new notebook we can now reset this session with 'Esc 00'.

In [10]:
#now the purpose of this section is to demonstrate that in a new notebook,
#because we applied custom defined transformation functions in our automunge(.) call,
#we will need to reinitialize those functions prior to uploading with pickle
#here we reinitialize those custom functions
#to avoid the need for reinitializing such functions, 
#refer to the dill demonstration provided further below

from Automunge import *
am = AutoMunge()

import pandas as pd

#now we define our custom transformation functions
#here we will define three functions for training data, test data, and an inversion function
#these are consistent with the separate tutorial notebook "Custom Transformations"


def custom_train_template(df, column, normalization_dict):
  """
  #Template for a custom_train transformation function to be applied to a train feature set.
  #further detail on conventions provided in readme file
  """

  #As an example, here is the application of z-score normalization 
  #derived based on the training set mean and standard deviation
  
  #which can accept any kind of numeric data 
  #so corresponding NArowtype processdict entry can be 'numeric'
  #and returns a single column of continuous numeric data 
  #so corresponding MLinfilltype processdict entry will need to be 'numeric'

  #where we'll include the option for a parameter 'multiplier'
  #which is an arbitrary example to demonstrate accessing parameters
  
  #basically we check if that parameter had been passed in assignparam or defaultparams
  if 'multiplier' in normalization_dict:
    multiplier = normalization_dict['multiplier']
    
  #or otherwise assign and save a default value
  else:
    multiplier = 1
    normalization_dict.update({'multiplier' : multiplier})

  #Now we measure any properties of the train data used for the transformation
  mean = df[column].mean()
  stdev = df[column].std()
  
  #It's good practice to ensure numbers used in derivation haven't been derived as nan
  #or would result in dividing by zero
  if mean != mean:
    mean = 0
  if stdev != stdev or stdev == 0:
    stdev = 1
    
  #In general if that same basis will be needed to process test data we'll store in normalization_dict
  normalization_dict.update({'mean' : mean,
                             'stdev': stdev})

  #Optionally we can measure additional drift stats for a postmunge driftreport
  #we will also save those in the normalization_dict
  minimum = df[column].min()
  maximum = df[column].max()
  normalization_dict.update({'minimum' : minimum,
                             'maximum' : maximum})

  #Now we can apply the transformation
  
  #The generic formula for z-score normalization is (x - mean) / stdev
  #here we incorporate an additional variable as the multiplier parameter (defaults to 1)
  df[column] = (df[column] - mean) * multiplier / stdev

  return df, normalization_dict

def custom_test_template(df, column, normalization_dict):
  """
  #This transform will be applied to a test data feature set
  #on a basis of a corresponding custom_train entry
  #Such as test data passed to either automunge(.) or postmunge(.)
  #Using properties from the train set basis stored in the normalization_dict

  #Further detail on conventions provided in readme file
  """

  #As an example, here is the corresponding z-score normalization 
  #derived based on the training set mean and standard deviation
  #which was populated in a normalization_dict in the custom_train example given above

  #Basically the workflow is we access any values needed from the normalization_dict
  #apply the transform
  #and return the transformed dataframe

  #access the train set properties from normalization_dict
  mean = normalization_dict['mean']
  stdev = normalization_dict['stdev']
  multiplier = normalization_dict['multiplier']

  #then apply the transformation and return the dataframe
  df[column] = (df[column] - mean) * multiplier / stdev

  return df


def custom_inversion_template(df, returnedcolumn_list, inputcolumn, normalization_dict):
  """
  #User also has the option to define a custom inversion function
  #Corresponding to custom_train and custom_test
  #further detail on conventions provided in readme file

  """

  #As an example, here we'll be inverting the z-score normalization 
  #derived based on the training set mean and standard deviation
  #which corresponds to the examples given above

  #Basically the workflow is we access any values needed from the normalization_dict
  #Initialize the new column inputcolumn
  #And use values in the set from returnedcolumn_list to recover values for inputcolumn

  #First let's access the values we'll need from the normalization_dict
  mean = normalization_dict['mean']
  stdev = normalization_dict['stdev']
  multiplier = normalization_dict['multiplier']

  #Now initialize the inputcolumn
  df[inputcolumn] = 0

  #So for the example of z-score normalization, we know returnedcolumn_list will only have one entry
  #In some other cases transforms may have returned multiple columns
  returnedcolumn = returnedcolumn_list[0]

  #now we perform the inversion
  df[inputcolumn] = (df[returnedcolumn] * stdev / multiplier) + mean

  return df


In [11]:
#now that the functions are reinitialized,
#we can upload our serialized postprocess_dict with pickle
#note that if we had attempted this prior to initializing the functions
#the pickle operation wouldn't be able to run

#to upload for later use in postmunge(.) in another notebook
import pickle
with open('filename.pickle', 'rb') as handle:
  postprocess_dict = pickle.load(handle)


In [12]:
#we can then prepare additional data in the new notebook with postmunge


#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#titanic set
labels_column = 'Survived'
trainID_column = 'PassengerId'

test, test_ID, test_labels, \
postreports_dict \
= am.postmunge(postprocess_dict, 
               df_test,
               printstatus=False,
              testID_column=trainID_column)

In [13]:
#here is the column with custom transform in the test data
test['Age_newt']

0       6.081304
1      13.185161
2      21.709789
3       1.818989
4      -1.022554
         ...    
413     2.654403
414     8.638692
415     8.354538
416     5.831248
417   -11.599174
Name: Age_newt, Length: 418, dtype: float32

# 3. demonstration of the integration of the hmac signature into a pickle workflow

In [14]:
#here we revisit the workflow from example 1
#but with the added integration of the derivation of an hmac signature with pickling
#the purpose of the hmac signature is so that a different user
#looking to upload a serialized postprocess_dict in a separate notebook
#can verify that the pickled object they have received is identical to the original source
#which may have security benefits
#note that for this operation to be secure
#the derived hmac signature may also need to be transmitted to the second user
#in some more secure channel than with the serialized postprocess_dict
#with corresponding extension of this demonstration for a simple comparison verification

In [15]:
from Automunge import *
am = AutoMunge()

import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#titanic set
labels_column = 'Survived'
trainID_column = 'PassengerId'

#run automunge(.) to encode the dataframes and populate a postprocess_dict
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             labels_column = labels_column,
             trainID_column = trainID_column,
             ML_cmnd = {'stochastic_impute_numeric' : False,
                        'stochastic_impute_categoric' : False,
                       },
            )

_______________
Begin Automunge

______

versioning serial stamp:
_8.33_918885590673

Automunge returned train column set: 
['Pclass_nmbr', 'Sex_bnry', 'Age_nmbr', 'SibSp_nmbr', 'Parch_nmbr', 'Ticket_nmbr', 'Fare_nmbr', 'Pclass_NArw', 'Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13', 'Sex_NArw', 'Age_NArw', 'SibSp_NArw', 'Parch_NArw', 'Ticket_NArw', 'Fare_NArw', 'Cabin_NArw', 'Cabin_1010_0', 'Cabin_1010_1', 'Cabin_1010_2', 'Cabin_1010_3', 'Cabin_1010_4', 'Cabin_1010_5', 'Cabin_1010_6', 'Cabin_1010_7', 'Embarked_NArw', 'Embarked_1010_0', 'Embarked_1010_1']

Automunge returned ID column set: 
['PassengerId', 'Automunge_index']

Automunge returned label column set: 
['Survived_lbbn']

_______________
Automunge Complete



In [16]:
import pickle
import hmac

#drafting of this function supported by Bard LLM service
def download_postprocess_dict_with_hmac(postprocess_dict, filename):
    # Create a new HMAC object
    hmac_object = hmac.new(b"secret_key", digestmod="sha256")

    # Serialize the object
    serialized_obj = pickle.dumps(postprocess_dict)

    # Compute the signature of the serialized object
    hmac_object.update(serialized_obj)
    signature = hmac_object.hexdigest()

    # Write the signature and the serialized object to the file
    with open(filename, "wb") as f:
        f.write(signature.encode("ascii"))
        f.write(serialized_obj)
        
    return signature

signature = download_postprocess_dict_with_hmac(postprocess_dict, "filename.pickle")

### To simulate a new notebook we can now reset this session with 'Esc 00'.

In [17]:

import pickle
import hmac

#drafting of this function supported by Bard LLM service
def upload_postprocess_dict_with_hmac_verification(filename):
    # Load the serialized object and the signature
    with open(filename, "rb") as f:
        signature = f.read(64).decode("ascii")  # a sha256 hexdigest is 64 ASCII characters
        serialized_obj = f.read()

    # Compute the signature of the serialized object
    hmac_object = hmac.new(b"secret_key", digestmod="sha256")
    hmac_object.update(serialized_obj)
    computed_signature = hmac_object.hexdigest()

    # Verify the signature with a constant-time comparison
    if not hmac.compare_digest(signature, computed_signature):
        raise Exception("The serialized object has been tampered with!")

    # Deserialize the object
    postprocess_dict = pickle.loads(serialized_obj)

    return postprocess_dict


# Reinitialize the dictionary with HMAC signature inspection
postprocess_dict = upload_postprocess_dict_with_hmac_verification("filename.pickle")
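
Both helper functions in this section hard-code the key as `b"secret_key"`, which keeps the demonstration short but means anyone reading the notebook could produce a valid signature. One possible refinement, sketched below under the assumption that the key is supplied through an environment variable (the variable name `POSTPROCESS_HMAC_KEY` and the helper `load_signing_key` are hypothetical), is to keep the key out of the source entirely.

```python
import hashlib
import hmac
import os
import secrets

ENV_VAR = "POSTPROCESS_HMAC_KEY"  # hypothetical variable name

def load_signing_key(env_var=ENV_VAR):
    """Read the HMAC key from the environment instead of the source code."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"environment variable {env_var} is not set")
    return value.encode("utf-8")

def sign(payload, key):
    """Return the hex HMAC-SHA256 signature of a bytes payload."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

# For demonstration only: generate a random key and place it in the environment.
os.environ[ENV_VAR] = secrets.token_hex(32)

key = load_signing_key()
signature = sign(b"serialized bytes", key)

# A missing variable fails loudly rather than falling back to a default key.
del os.environ[ENV_VAR]
try:
    load_signing_key()
    missing_key_rejected = False
except RuntimeError:
    missing_key_rejected = True

print(len(signature), missing_key_rejected)
```

The returned key could then be passed to `hmac.new` in place of the literal in the download and upload functions.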


In [18]:
from Automunge import *
am = AutoMunge()

import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#titanic set
labels_column = 'Survived'
trainID_column = 'PassengerId'


# df_test


#the uploaded postprocess_dict is used as basis to encode any additional test data
test, test_ID, test_labels, \
postreports_dict \
= am.postmunge(postprocess_dict, 
               df_test,
               printstatus=False,
              testID_column=trainID_column)

# 4. demonstration of the dill library (in conjunction with hmac signature) as an alternative to pickle

We will find that the dill library can be very easily substituted for the pickle implementation by simply replacing calls to pickle with dill. Although dill can read .pickle formatted files, it is probably better practice to save with a .dill extension for clarity.
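As a small illustration of that substitution, the sketch below binds whichever module is available to a single name and uses the shared `dumps` / `loads` interface. The fallback to pickle is our own addition so that the snippet still runs where dill is not installed, and the sample dictionary is hypothetical.

```python
import pickle

try:
    import dill as serializer  # third-party package, installed separately
    extension = ".dill"
except ImportError:
    serializer = pickle        # fallback so the sketch runs without dill
    extension = ".pickle"

record = {"mean": 29.7, "stdev": 14.5, "columns": ["Age", "Fare"]}

# The same two calls work with either module.
payload = serializer.dumps(record)
restored = serializer.loads(payload)

# Bytes written by pickle can also be read back through the chosen serializer.
from_pickle = serializer.loads(pickle.dumps(record))

print(extension, restored == record, from_pickle == record)
```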

In [19]:
from Automunge import *
am = AutoMunge()

import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#titanic set
labels_column = 'Survived'
trainID_column = 'PassengerId'

#run automunge(.) to encode the dataframes and populate a postprocess_dict
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             labels_column = labels_column,
             trainID_column = trainID_column,
             ML_cmnd = {'stochastic_impute_numeric' : False,
                        'stochastic_impute_categoric' : False,
                       },
            )

_______________
Begin Automunge

______

versioning serial stamp:
_8.33_208496506434

Automunge returned train column set: 
['Pclass_nmbr', 'Sex_bnry', 'Age_nmbr', 'SibSp_nmbr', 'Parch_nmbr', 'Ticket_nmbr', 'Fare_nmbr', 'Pclass_NArw', 'Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13', 'Sex_NArw', 'Age_NArw', 'SibSp_NArw', 'Parch_NArw', 'Ticket_NArw', 'Fare_NArw', 'Cabin_NArw', 'Cabin_1010_0', 'Cabin_1010_1', 'Cabin_1010_2', 'Cabin_1010_3', 'Cabin_1010_4', 'Cabin_1010_5', 'Cabin_1010_6', 'Cabin_1010_7', 'Embarked_NArw', 'Embarked_1010_0', 'Embarked_1010_1']

Automunge returned ID column set: 
['PassengerId', 'Automunge_index']

Automunge returned label column set: 
['Survived_lbbn']

_______________
Automunge Complete



In [20]:
import dill
import hmac

#drafting of this function supported by Bard LLM service
def download_postprocess_dict_with_hmac(postprocess_dict, filename):
    # Create a new HMAC object
    hmac_object = hmac.new(b"secret_key", digestmod="sha256")

    # Serialize the object
    serialized_obj = dill.dumps(postprocess_dict)

    # Compute the signature of the serialized object
    hmac_object.update(serialized_obj)
    signature = hmac_object.hexdigest()

    # Write the signature and the serialized object to the file
    with open(filename, "wb") as f:
        f.write(signature.encode("ascii"))
        f.write(serialized_obj)
        
    return signature

signature = download_postprocess_dict_with_hmac(postprocess_dict, "filename.dill")
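
The download and upload cells use the literal `b"secret_key"`, which is fine for a demonstration, but the signature only protects against alteration if the key is unknown to whoever might alter the file. One possible arrangement, sketched here under the assumption that the key is supplied through an environment variable, is shown below. The variable name `POSTPROCESS_HMAC_KEY` and the helper `get_hmac_key` are hypothetical.

```python
import hashlib
import hmac
import os
import secrets

# Hypothetical variable name chosen for this sketch
ENV_VAR = "POSTPROCESS_HMAC_KEY"

def get_hmac_key() -> bytes:
    """Return the key from the environment, where it is stored as hex."""
    value = os.environ.get(ENV_VAR)
    if value is None:
        raise RuntimeError(f"{ENV_VAR} is not set")
    return bytes.fromhex(value)

# One-time key generation: 32 random bytes, hex-encoded for storage.
# Set in-process here only so that the sketch runs on its own.
os.environ[ENV_VAR] = secrets.token_hex(32)

key = get_hmac_key()
signature = hmac.new(key, b"serialized bytes", hashlib.sha256).hexdigest()
```

In the functions of this notebook, `get_hmac_key()` would take the place of `b"secret_key"` in both the download and the upload step, since both sides need the same key.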


### To simulate a new notebook we can now reset this session with 'Esc 00'.

In [21]:

import dill
import hmac

#drafting of this function supported by Bard LLM service
def upload_postprocess_dict_with_hmac_verification(filename):
    # Load the signature (64 hex characters) and the serialized object
    with open(filename, "rb") as f:
        signature = f.read(64).decode("ascii")
        serialized_obj = f.read()

    # Compute the signature of the serialized object
    hmac_object = hmac.new(b"secret_key", digestmod="sha256")
    hmac_object.update(serialized_obj)
    computed_signature = hmac_object.hexdigest()

    # Verify the signature with a constant-time comparison
    if not hmac.compare_digest(signature, computed_signature):
        raise Exception("The serialized object has been tampered with!")

    # Deserialize the object
    postprocess_dict = dill.loads(serialized_obj)

    return postprocess_dict


# Reinitialize the dictionary with HMAC signature inspection
postprocess_dict = upload_postprocess_dict_with_hmac_verification("filename.dill")


In [22]:
#to upload for later use in postmunge(.) in another notebook


from Automunge import *
am = AutoMunge()

import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#titanic set
labels_column = 'Survived'
trainID_column = 'PassengerId'


# df_test


#the uploaded postprocess_dict is used as basis to encode any additional test data
test, test_ID, test_labels, \
postreports_dict \
= am.postmunge(postprocess_dict, 
               df_test,
               printstatus=False,
              testID_column=trainID_column)

# 5. demonstration of the dill library (in conjunction with hmac signature) with custom defined transformation functions

This demonstration is similar to that in section 3. of this notebook. The purpose is to demonstrate that by using the dill library instead of pickle, we gain the ability to upload the serialized postprocess_dict dictionary without having to re-initialize any custom transformation functions. With dill, functions defined in the notebook session have their full definitions saved in the encoded dictionary, as opposed to pickle, which only stores pointers (the module and name under which each function can be found). (This is why it may be more important to incorporate an hmac signature into this workflow.)
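
The "pointer" that pickle stores for a function is its module and name, which are resolved by import at load time. This can be seen with a standard library function, used here as a stand-in for a custom transformation function:

```python
import json
import pickle

# Pickling a function records where to import it from, not its code
payload = pickle.dumps(json.dumps)

# The module and function names are present in the pickled bytes
has_module = b"json" in payload
has_name = b"dumps" in payload

# Loading performs the lookup and returns the very same function object
restored = pickle.loads(payload)
same_object = restored is json.dumps
```

If a function of that name cannot be found in the loading session, `pickle.loads` raises an error, which is why custom transformation functions have to be re-initialized before loading a pickled postprocess_dict.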

In [23]:
from Automunge import *
am = AutoMunge()

import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#titanic set
labels_column = 'Survived'
trainID_column = 'PassengerId'

In [24]:
#now we define our custom transformation functions
#here we will define three functions for training data, test data, and an inversion function
#these are consistent with the separate tutorial notebook "Custom Transformations"


def custom_train_template(df, column, normalization_dict):
  """
  #Template for a custom_train transformation function to be applied to a train feature set.
  #further detail on conventions provided in readme file
  """

  #As an example, here is the application of z-score normalization 
  #derived based on the training set mean and standard deviation
  
  #which can accept any kind of numeric data 
  #so corresponding NArowtype processdict entry can be 'numeric'
  #and returns a single column of continuous numeric data 
  #so corresponding MLinfilltype processdict entry will need to be 'numeric'

  #where we'll include the option for a parameter 'multiplier'
  #which is an arbitrary example to demonstrate accessing parameters
  
  #basically we check if that parameter had been passed in assignparam or defaultparams
  if 'multiplier' in normalization_dict:
    multiplier = normalization_dict['multiplier']
    
  #or otherwise assign and save a default value
  else:
    multiplier = 1
    normalization_dict.update({'multiplier' : multiplier})

  #Now we measure any properties of the train data used for the transformation
  mean = df[column].mean()
  stdev = df[column].std()
  
  #It's good practice to ensure numbers used in derivation haven't been derived as nan
  #or would result in dividing by zero
  if mean != mean:
    mean = 0
  if stdev != stdev or stdev == 0:
    stdev = 1
    
  #In general if that same basis will be needed to process test data we'll store in normalization_dict
  normalization_dict.update({'mean' : mean,
                             'stdev': stdev})

  #Optionally we can measure additional drift stats for a postmunge driftreport
  #we will also save those in the normalization_dict
  minimum = df[column].min()
  maximum = df[column].max()
  normalization_dict.update({'minimum' : minimum,
                             'maximum' : maximum})

  #Now we can apply the transformation
  
  #The generic formula for z-score normalization is (x - mean) / stdev
  #here we incorporate an additional variable as the multiplier parameter (defaults to 1)
  df[column] = (df[column] - mean) * multiplier / stdev

  return df, normalization_dict

def custom_test_template(df, column, normalization_dict):
  """
  #This transform will be applied to a test data feature set
  #on a basis of a corresponding custom_train entry
  #Such as test data passed to either automunge(.) or postmunge(.)
  #Using properties from the train set basis stored in the normalization_dict

  #Further detail on conventions provided in readme file
  """

  #As an example, here is the corresponding z-score normalization 
  #derived based on the training set mean and standard deviation
  #which was populated in a normalization_dict in the custom_train example given above

  #Basically the workflow is we access any values needed from the normalization_dict
  #apply the transform
  #and return the transformed dataframe

  #access the train set properties from normalization_dict
  mean = normalization_dict['mean']
  stdev = normalization_dict['stdev']
  multiplier = normalization_dict['multiplier']

  #then apply the transformation and return the dataframe
  df[column] = (df[column] - mean) * multiplier / stdev

  return df


def custom_inversion_template(df, returnedcolumn_list, inputcolumn, normalization_dict):
  """
  #User also has the option to define a custom inversion function
  #Corresponding to custom_train and custom_test
  #further detail on conventions provided in readme file

  """

  #As an example, here we'll be inverting the z-score normalization 
  #derived based on the training set mean and standard deviation
  #which corresponds to the examples given above

  #Basically the workflow is we access any values needed from the normalization_dict
  #Initialize the new column inputcolumn
  #And use values in the set from returnedcolumn_list to recover values for inputcolumn

  #First let's access the values we'll need from the normalization_dict
  mean = normalization_dict['mean']
  stdev = normalization_dict['stdev']
  multiplier = normalization_dict['multiplier']

  #Now initialize the inputcolumn
  df[inputcolumn] = 0

  #So for the example of z-score normalization, we know returnedcolumn_list will only have one entry
  #In some other cases transforms may have returned multiple columns
  returnedcolumn = returnedcolumn_list[0]

  #now we perform the inversion
  df[inputcolumn] = (df[returnedcolumn] * stdev / multiplier) + mean

  return df
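
The arithmetic of the three templates can be checked independently of Automunge. The sketch below applies the same forward formula, `(x - mean) * multiplier / stdev`, and the same inverse, `y * stdev / multiplier + mean`, to a plain Python list standing in for a dataframe column. The input values are illustrative only.

```python
import statistics

ages = [22.0, 38.0, 26.0, 35.0]  # illustrative values only
multiplier = 10

# Train-set properties; statistics.stdev is the sample standard deviation,
# which matches the pandas Series.std() default (ddof=1)
mean = statistics.mean(ages)
stdev = statistics.stdev(ages)

# Forward transform, as in custom_train_template / custom_test_template
encoded = [(x - mean) * multiplier / stdev for x in ages]

# Inverse transform, as in custom_inversion_template
recovered = [y * stdev / multiplier + mean for y in encoded]
```

Up to floating point rounding, `recovered` equals `ages`, and `encoded` is centered on zero with its spread scaled by `multiplier`.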


In [25]:
#having defined the functions, in order to pass them through an automunge(.) call
#we then incorporate them into a simple data structure
#these are consistent with the separate tutorial notebook "Custom Transformations"
#further detail on conventions provided in readme file

#these are the transformation functions for a new transformation category we'll call 'newt'
processdict = \
{'newt' :
 {'custom_train' : custom_train_template,
  'custom_test' : custom_test_template,
  'custom_inversion' : custom_inversion_template,
  'functionpointer' : 'nmbr',
 }}

#here we define a family tree for use of 'newt' as a root category
transformdict = \
{'newt' :
 {'parents'       : [],
  'siblings'      : [],
  'auntsuncles'   : ['newt'],
  'cousins'       : ['NArw'],
  'children'      : [],
  'niecesnephews' : [],
  'coworkers'     : [],
  'friends'       : [],
 }}

#now we assign 'newt' to a column
targetcolumn = 'Age'
assigncat = {'newt' : targetcolumn}

#optionally, if we want to pass parameters to the functions, we can use assignparam
assignparam = \
{'newt' : 
 {targetcolumn :
  {'multiplier' : 10}}}


#we can then put all together in an automunge(.) call
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(
  df_train,
  labels_column = 'Survived',
  trainID_column = trainID_column,
  shuffletrain=False,
  assigncat = assigncat,
  assignparam = assignparam,
  processdict = processdict,
  transformdict = transformdict)

_______________
Begin Automunge

______

versioning serial stamp:
_8.33_531125869863

Automunge returned train column set: 
['Pclass_nmbr', 'Sex_bnry', 'Age_newt', 'SibSp_nmbr', 'Parch_nmbr', 'Ticket_nmbr', 'Fare_nmbr', 'Pclass_NArw', 'Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13', 'Sex_NArw', 'Age_NArw', 'SibSp_NArw', 'Parch_NArw', 'Ticket_NArw', 'Fare_NArw', 'Cabin_NArw', 'Cabin_1010_0', 'Cabin_1010_1', 'Cabin_1010_2', 'Cabin_1010_3', 'Cabin_1010_4', 'Cabin_1010_5', 'Cabin_1010_6', 'Cabin_1010_7', 'Embarked_NArw', 'Embarked_1010_0', 'Embarked_1010_1']

Automunge returned ID column set: 
['PassengerId', 'Automunge_index']

Automunge returned label column set: 
['Survived_lbbn']

_______________
Automunge Complete



In [26]:
import dill
import hmac

#drafting of this function supported by Bard LLM service
def download_postprocess_dict_with_hmac(postprocess_dict, filename):
    # Create a new HMAC object
    hmac_object = hmac.new(b"secret_key", digestmod="sha256")

    # Serialize the object
    serialized_obj = dill.dumps(postprocess_dict)

    # Compute the signature of the serialized object
    hmac_object.update(serialized_obj)
    signature = hmac_object.hexdigest()

    # Write the signature and the serialized object to the file
    with open(filename, "wb") as f:
        f.write(signature.encode("ascii"))
        f.write(serialized_obj)
        
    return signature

signature = download_postprocess_dict_with_hmac(postprocess_dict, "filename.dill")

### To simulate a new notebook we can now reset this session with 'Esc 00'.

In [27]:

import dill
import hmac

#drafting of this function supported by Bard LLM service
def upload_postprocess_dict_with_hmac_verification(filename):
    # Load the signature (64 hex characters) and the serialized object
    with open(filename, "rb") as f:
        signature = f.read(64).decode("ascii")
        serialized_obj = f.read()

    # Compute the signature of the serialized object
    hmac_object = hmac.new(b"secret_key", digestmod="sha256")
    hmac_object.update(serialized_obj)
    computed_signature = hmac_object.hexdigest()

    # Verify the signature with a constant-time comparison
    if not hmac.compare_digest(signature, computed_signature):
        raise Exception("The serialized object has been tampered with!")

    # Deserialize the object
    postprocess_dict = dill.loads(serialized_obj)

    return postprocess_dict


# Reinitialize the dictionary with HMAC signature inspection
postprocess_dict = upload_postprocess_dict_with_hmac_verification("filename.dill")
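
The introduction to this notebook mentions sharing the signature through a separate channel instead of affixing it to the file. A minimal sketch of that variant follows, assuming hypothetical helper names `serialize_detached` and `deserialize_detached`. It uses pickle so that it runs with only the standard library; `dill.dumps` and `dill.loads` could be substituted in the same places.

```python
import hashlib
import hmac
import pickle

KEY = b"secret_key"  # placeholder key

def serialize_detached(obj):
    """Return (serialized bytes, signature) as two separate values."""
    data = pickle.dumps(obj)
    signature = hmac.new(KEY, data, hashlib.sha256).hexdigest()
    return data, signature

def deserialize_detached(data, signature):
    """Deserialize only if the separately delivered signature matches."""
    expected = hmac.new(KEY, data, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError("signature does not match the serialized object")
    return pickle.loads(data)

data, signature = serialize_detached({"columns": ["Age_newt"]})
restored = deserialize_detached(data, signature)

# A signature that does not belong to this data is rejected
try:
    deserialize_detached(data, "0" * 64)
    rejected = False
except ValueError:
    rejected = True
```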


In [28]:
#to upload for later use in postmunge(.) in another notebook


from Automunge import *
am = AutoMunge()

import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#titanic set
labels_column = 'Survived'
trainID_column = 'PassengerId'


# df_test


#the uploaded postprocess_dict is used as basis to encode any additional test data
test, test_ID, test_labels, \
postreports_dict \
= am.postmunge(postprocess_dict, 
               df_test,
               printstatus=False,
              testID_column=trainID_column)

In [29]:
# Voila