### Automunge Noise Injection Demonstrations

Automunge has a suite of options for injecting noise into tabular features. We expect noise injections may serve several potential benefits such as a resource for data augmentation, bias mitigation, differential privacy, model perturbation for aggregation of ensembles, and non-determinism. Noise injections were introduced in the paper [Numeric Encoding Options with Automunge](https://medium.com/automunge/a-numbers-game-b68ac261c40d) and discussed in more depth in the essay [Noise Injections with Automunge](https://medium.com/automunge/noise-injections-with-automunge-7ebb672216e2).

In [1]:
from Automunge import *
am = AutoMunge()

We'll demonstrate noise injection on the [Titanic](https://www.kaggle.com/c/titanic/data) set, a common benchmark.

In [2]:
import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#label and ID columns for the titanic set
labels_column = 'Survived'
trainID_column = 'PassengerId'

______

### DP transformation categories

The DP family of transforms is surveyed in the README's library of transformations as [Differential Privacy Noise Injections](https://github.com/Automunge/AutoMunge/blob/master/README.md#differential-privacy-noise-injections).

The noise injections are performed in conjunction with numeric normalizations or categoric encodings. Here is a quick survey of options available as of version 7.59:

### Numeric Injections:
- **DPnb**: z-score normalization with Gaussian noise
- **DPmm**: min-max scaling with scaled Gaussian noise
- **DPrt**: retain normalization with scaled Gaussian noise
- **DLnb / DLmm / DLrt**: same as the three preceding, but defaulting to Laplace distributed noise instead of Gaussian
- **DPqt**: distribution conversion by quantile transform with Gaussian noise
- **DPbx**: distribution conversion by box-cox transform with Gaussian noise
- **DPns**: z-score normalization with swap-noise injection
- **DPne**: pass-through as numeric with noise, no normalization or infill

### Categoric Injections:
- **DPbn**: boolean integer encoding with weighted activation flips
- **DPod**: ordinal encoding with weighted activation flips
- **DPoh**: one hot encoding with weighted activation flips
- **DP10**: binarization with weighted activation set flips
- **DP1s**: binarization with swap-noise injection
- **DPhs**: multi column hash encoding (like hash) with weighted activation flips
- **DPh2**: single column hash encoding (like hsh2) with weighted activation flips
- **DPh1**: multi column hash binarization (like hs10) with weighted activation set flips
- **DPse**: pass-through as categoric with swap noise, no encoding or infill

Here is an example of assigning these root categories to received features with headers 'inputcolumn1', 'inputcolumn2', 'inputcolumn3'.
```
assigncat = {
    'DLnb' : 'inputcolumn1',
    'DPod' : ['inputcolumn2', 'inputcolumn3'],
}
```
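When the assignments start out as a per-column plan, it can be convenient to invert them into the category-keyed layout shown above. Here is a minimal sketch of that inversion. The helper name `build_assigncat` is hypothetical (it is not part of the library), and it always emits lists, which is one of the two forms used in the example above.

```python
from collections import defaultdict

def build_assigncat(column_to_category):
    """Invert {column: category} into the {category: [columns]} layout.

    Hypothetical convenience helper, not part of the Automunge API.
    """
    grouped = defaultdict(list)
    for column, category in column_to_category.items():
        grouped[category].append(column)
    return dict(grouped)

# same assignments as the example above, expressed per column
assigncat = build_assigncat({
    'inputcolumn1': 'DLnb',
    'inputcolumn2': 'DPod',
    'inputcolumn3': 'DPod',
})
```

The resulting dictionary can then be passed to the automunge(.) assigncat parameter in the same way as a hand-written one.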


_______

### DP parameters

Each of these transformations accepts optional parameter specifications to vary from the defaults. Parameters are passed to transformations through the automunge(.) assignparam parameter. 
```
assignparam = {
    '(category)' : {'(column)'   : {'(parameter)' : 42}},
    'default_assignparam' : {'(category)' : {'(parameter)' : 42}},
    'global_assignparam'  : {'(parameter)': 42},
}
```
In order of precedence, parameter assignments may be designated targeting a transformation category as applied to a specific column header with suffix appenders, a transformation category as applied to an input column header (which may include multiple instances), all instances of a specific transformation category, all transformation categories, or may be initialized as default parameters when defining a transformation category.
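To make the order of precedence concrete, here is a simplified sketch of how a single parameter value could be resolved from an assignparam dictionary. This is an illustration of the precedence described above and not the library's internal logic. The function name `resolve_param` is hypothetical, and the sketch collapses the first two levels (header with suffix appenders vs. input column header) into a single column lookup.

```python
def resolve_param(assignparam, category, column, parameter, default=None):
    """Return the winning value for (category, column, parameter).

    Hypothetical illustration of assignparam precedence, checked from
    most specific to least specific.
    """
    # 1. category as applied to a specific column
    by_column = assignparam.get(category, {}).get(column, {})
    if parameter in by_column:
        return by_column[parameter]
    # 2. all instances of the category (default_assignparam)
    by_category = assignparam.get('default_assignparam', {}).get(category, {})
    if parameter in by_category:
        return by_category[parameter]
    # 3. all categories (global_assignparam)
    global_params = assignparam.get('global_assignparam', {})
    if parameter in global_params:
        return global_params[parameter]
    # 4. fall back to the default from the category definition
    return default

example_assignparam = {
    'DPmm': {'inputcolumn': {'sigma': 0.1}},
    'default_assignparam': {'DPmm': {'sigma': 0.05, 'flip_prob': 0.2}},
    'global_assignparam': {'testnoise': True},
}

column_level = resolve_param(example_assignparam, 'DPmm', 'inputcolumn', 'sigma')
category_level = resolve_param(example_assignparam, 'DPmm', 'othercolumn', 'sigma')
global_level = resolve_param(example_assignparam, 'DPnb', 'othercolumn', 'testnoise')
fallback = resolve_param(example_assignparam, 'DPnb', 'othercolumn', 'sigma', default=0.06)
```

Here the column-specific sigma of 0.1 wins for 'inputcolumn', while other DPmm columns pick up 0.05 from default_assignparam.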

Note that in most cases the noise injection is performed in a different category than the encoding. You can inspect the returned column header suffix appenders to identify which category to target with an assignparam specification; generally the noise injection is the final category. The column header specified should be the header serving as input to the transform, i.e. prior to that category's suffix appender. (Note that DPhs has a special convention for passing parameters to noise injection due to its configuration, as noted in the README.)

As further illustrated in Table 2 of [Noise Injections with Automunge](https://medium.com/automunge/noise-injections-with-automunge-7ebb672216e2), the trainnoise and testnoise parameters can be used in conjunction with the postmunge(.) traindata parameter to target noise towards train and/or test data.

### Numeric Parameters:
- **trainnoise**: activates noise injection to train data, defaults to True
- **testnoise**: activates noise injection to test data, defaults to False (note that postmunge(.) has the traindata parameter to treat test data as train data, if you prefer to select between the two in different postmunge calls)
- **sigma**: scale of train noise distribution, defaults to 0.06 for DPnb, 0.03 for DPmm/DPrt
- **test_sigma**: scale of test noise distribution, defaults to 0.03 for DPnb, 0.02 for DPmm/DPrt
- **mu**: mean of train noise distribution, defaults to 0.
- **test_mu**: mean of test noise distribution, defaults to 0.
- **flip_prob**: ratio of train entries receiving injection, defaults to 0.03
- **test_flip_prob**: ratio of test entries receiving injection, defaults to 0.03
- **noisedistribution**: accepts one of {'normal', 'laplace'}, DP categories default to normal, DL to laplace. Also accepts one of {'abs_normal', 'negabs_normal', 'abs_laplace', 'negabs_laplace'} for all positive signed noise (abs) or all negative signed noise (negabs). Note that when applying all positive or all negative noisedistribution scenarios the noise_scaling_bias_offset option should be deactivated.
- **noise_scaling_bias_offset**: defaults to True, results in the mean of the noise receiving an offset to closer resemble zero mean after scaling (not supported with z-score DPnb since not needed). 
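To illustrate how sigma, mu, flip_prob, and noisedistribution interact, here is a standard-library sketch of a noise injection applied to already z-score normalized values. It is a simplified stand-in and not the library's implementation: the function name `inject_numeric_noise` is hypothetical, sigma is assumed to act as the scale of the Laplace distribution, and the range-preserving scaling used by DPmm/DPrt is not modeled.

```python
import random

def inject_numeric_noise(values, sigma=0.06, mu=0.0, flip_prob=0.03,
                         noisedistribution='normal', seed=None):
    """Add sampled noise to roughly a flip_prob share of the entries.

    Hypothetical sketch; assumes values are already z-score normalized.
    """
    rng = random.Random(seed)
    # 'abs_normal' -> sign 'abs', family 'normal'; 'normal' -> sign ''
    sign, _, family = noisedistribution.rpartition('_')
    noisy = []
    for value in values:
        if rng.random() < flip_prob:
            if family == 'normal':
                noise = rng.gauss(mu, sigma)
            elif family == 'laplace':
                # difference of two exponentials is Laplace with scale sigma
                noise = mu + rng.expovariate(1 / sigma) - rng.expovariate(1 / sigma)
            else:
                raise ValueError('unsupported noisedistribution')
            if sign == 'abs':
                noise = abs(noise)       # all positive noise
            elif sign == 'negabs':
                noise = -abs(noise)      # all negative noise
            value = value + noise
        noisy.append(value)
    return noisy

zscored = [0.0, 1.0, -1.0, 0.5]
untouched = inject_numeric_noise(zscored, flip_prob=0.0, seed=1)
positive_only = inject_numeric_noise(zscored, flip_prob=1.0,
                                     noisedistribution='abs_normal', seed=1)
```

With flip_prob=0.0 nothing is changed, and with 'abs_normal' every injected entry can only move upward.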

### Categoric and Swap Parameters:
- **trainnoise**: activates noise injection to train data, defaults to True
- **testnoise**: activates noise injection to test data, defaults to False (note that postmunge(.) has the traindata parameter to treat test data as train data, if you prefer to select between the two in different postmunge calls)
- **flip_prob**: ratio of train entries receiving injection, defaults to 0.03
- **test_flip_prob**: ratio of test entries receiving injection, defaults to 0.01
- **weighted**: selects between weighted vs uniform sampling from set of alternate activations, defaults to True (not supported with swap noise, which already resembles a weighted sampling)
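As a rough picture of what a weighted activation flip looks like, here is a standard-library sketch operating on a list of categoric entries. The function name `flip_activations` is hypothetical, and the details (deriving weights from the observed frequency of each activation, and excluding the current activation from the candidates) are assumptions for illustration rather than a description of the library's internals.

```python
import random
from collections import Counter

def flip_activations(values, flip_prob=0.03, weighted=True, seed=None):
    """Replace roughly a flip_prob share of entries with an alternate activation.

    Hypothetical sketch: weighted=True samples alternates in proportion to
    how often they occur in values, weighted=False samples them uniformly.
    """
    rng = random.Random(seed)
    counts = Counter(values)
    flipped = []
    for value in values:
        alternates = [a for a in counts if a != value]
        if alternates and rng.random() < flip_prob:
            weights = [counts[a] for a in alternates] if weighted else None
            value = rng.choices(alternates, weights=weights, k=1)[0]
        flipped.append(value)
    return flipped

embarked = ['S', 'C', 'Q', 'S', 'S']
all_flipped = flip_activations(embarked, flip_prob=1.0, seed=3)
none_flipped = flip_activations(embarked, flip_prob=0.0, seed=3)
```

With flip_prob=1.0 every entry is moved to a different activation, which makes the effect easy to see; at the default of 0.03 only a small share of entries would change.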

Here is an example of assignparam specification to: 
- set an all positive noise distribution for category DPmm as applied to an input column with header 'inputcolumn', noting that for scaled noise like DPmm all positive or all negative noise should be performed with a deactivated noise_scaling_bias_offset.
- update the flip_prob parameter to 0.1 for all cases of DPnb injections via default_assignparam
- target testnoise injections to all injections via global_assignparam
```
assignparam = {
    'DPmm' : {'inputcolumn': {'noisedistribution'         : 'abs_normal',
                              'noise_scaling_bias_offset' : False}},
    'default_assignparam' : {'DPnb' : {'flip_prob' : 0.1}},
    'global_assignparam'  : {'testnoise': True},
}
```


_______

### Noise injection under automation

The automunge(.) powertransform parameter can be used to select between alternate sets of default transformations applied under automation. We currently have two scenarios for default encodings with noise, including powertransform passed as one of {'DP1', 'DP2'}. DP2 differs from DP1 in that numerical defaults to retain normalization instead of z-score and categoric defaults to ordinal instead of binarization.

**'DP1'** 
- numerical receives DPnb
- categoric receives DP10
- binary receives DPbn
- hash receives DPhs
- hsh2 receives DPh2
- (labels do not receive noise)

**'DP2'**
- numerical receives DPrt
- categoric receives DPod
- binary receives DPbn
- hash receives DPhs
- hsh2 receives DPh2
- (labels do not receive noise)

An example specification:
```
powertransform = 'DP2'
```

Otherwise noise can just be manually assigned in the assigncat parameter as demonstrated above; those specifications will take precedence over what would otherwise be performed under automation.

_______

### Data Augmentation with Noise

Data augmentation refers to increasing the size of a training set with manipulations to increase variety. In the image modality it is common to achieve data augmentation by way of adjustments like image cropping, rotations, color shift, etc. Here we are simply adding noise for similar effect. In a deep learning benchmark performed in our paper [Numeric Encoding Options with Automunge](https://medium.com/automunge/a-numbers-game-b68ac261c40d) we found that this type of data augmentation was fairly benign with a fully represented data set, but was increasingly beneficial with underserved training data.

Data augmentation can be realized by assigning noise transforms in conjunction with the automunge(.) noise_augment parameter, which accepts an integer number of additional duplicates to prepare (e.g. noise_augment=1 would double the size of the training set returned from automunge(.)). For cases where too much duplication starts to run into memory constraints, additional duplicates can also be prepared with postmunge(.), which also has a noise_augment parameter option and accepts the traindata parameter to distinguish whether a data set is to be treated as train or test data.

Under the default configuration, when noise_augment is received as an integer dtype, one of the duplicates will be prepared without noise. If noise_augment is received as a float dtype with an integer value (e.g. 2.0), all of the duplicates will be prepared with noise.

Here is an example of preparing data augmentation for the Titanic set loaded earlier.

In [3]:
#number of rows in original training data
df_train.shape[0]

891

In [4]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             powertransform = 'DP2',
             noise_augment = 2.0,
             printstatus = False)

In [5]:
#number of rows in training data after noise_augment
train.shape[0]

2673
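The row count above follows directly from the noise_augment arithmetic: the original rows plus two additional duplicates gives 891 * 3 = 2673. Here is a small sketch of that bookkeeping, including the integer vs. float distinction described earlier. The function name `augmented_shape` is hypothetical, and the sketch assumes noise_augment is at least 1 and that no rows are carved off for validation.

```python
def augmented_shape(n_rows, noise_augment):
    """Return (total rows, number of copies prepared with noise).

    Hypothetical helper mirroring the description of noise_augment:
    an int leaves one copy without noise, a float adds noise to all copies.
    """
    if noise_augment < 1:
        raise ValueError('this sketch assumes noise_augment >= 1')
    copies = int(noise_augment) + 1          # original set plus duplicates
    if isinstance(noise_augment, float):
        noisy_copies = copies                # all copies receive noise
    else:
        noisy_copies = copies - 1            # one copy is left without noise
    return n_rows * copies, noisy_copies

float_case = augmented_shape(891, 2.0)   # the call demonstrated above
int_case = augmented_shape(891, 1)       # doubles the set, one clean copy
```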

_______

### Alternate random samplers

The random sampling for noise injection defaults to numpy's PCG64, which is based on the [PCG pseudo random number generator](https://www.pcg-random.org/index.html). On its own this generator is not truly random; it relies on seedings of entropy provided by the operating system, which are then enhanced through use.

To support integration of enhanced randomness profiles, both automunge(.) and postmunge(.) accept parameters for entropy_seeds and random_generator.

entropy_seeds accepts an integer or list of integers which may serve as a supplemental source of entropy for the numpy.random generator to enhance randomness properties.

random_generator accepts input of a numpy.random.Generator formatted random sampler. An example could be numpy.random.MT19937 for Mersenne Twister, or even a numpy.random formatted generator from an external library, such as one that samples with the support of quantum circuits.

If an alternate library does not have a numpy.random formatted generator, their output can be channeled to entropy_seeds for similar benefit.

Specifications of entropy_seeds and random_generator are specific to an automunge(.) or postmunge(.) call, in other words they are not returned in the populated postprocess_dict.

The two parameters can also be passed in tandem, for sampling with a custom generator with custom supplemental entropy seeds.

Here is an example of specifying an alternate generator and supplemental entropy seedings.
```
import numpy

random_generator = numpy.random.MT19937

entropy_seeds = [4,5,6]
```
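For readers less familiar with numpy's random API, here is a sketch of how a list of entropy seeds and an alternate bit generator fit together in plain numpy. This only shows the numpy building blocks and is not a description of how the library wires the two parameters internally; in particular, Automunge treats entropy_seeds as supplemental entropy, whereas in this sketch they are the sole seed so the result is reproducible.

```python
import numpy as np

entropy_seeds = [4, 5, 6]

# SeedSequence mixes a list of integers into high quality seed material
seed_sequence = np.random.SeedSequence(entropy_seeds)

# MT19937 is a bit generator; wrapping it in Generator exposes the samplers
rng = np.random.Generator(np.random.MT19937(seed_sequence))
noise = rng.normal(loc=0.0, scale=0.06, size=5)

# the same seeds and bit generator reproduce the same samples
rng_repeat = np.random.Generator(
    np.random.MT19937(np.random.SeedSequence(entropy_seeds)))
noise_repeat = rng_repeat.normal(loc=0.0, scale=0.06, size=5)
```

Output from an external entropy source that lacks a numpy.random formatted generator could be converted to integers and supplied in place of the hard-coded list.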

_______

### All Together Now

Let's do a quick demonstration tying it all together.

Here we'll apply the 'DP2' powertransform option for noise under automation, override a few of the default transforms with assigncat, assign a few deviations to transformation parameters via assignparam, add some additional entropy seeds from some other resource, and prepare a few additional training data duplicates for data augmentation purposes.

In [6]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [7]:
powertransform = 'DP2'

assigncat = {'DPh2' : 'Name'}

noise_augment = 2.

entropy_seeds = [432,6,243,561232,89]

assignparam = {
    'DPrt' : {'Age': {'noisedistribution'         : 'abs_normal',
                      'noise_scaling_bias_offset' : False}},
    'default_assignparam' : {'DPrt' : {'flip_prob' : 0.1}},
    'global_assignparam'  : {'testnoise': True},
}

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             labels_column = labels_column,
             trainID_column = trainID_column,
             powertransform = powertransform,
             assigncat = assigncat,
             noise_augment = noise_augment,
             entropy_seeds = entropy_seeds,
             assignparam = assignparam,
             printstatus = False)

In [8]:
train.head()

Unnamed: 0,Pclass_NArw,Pclass_DPrt,Name_NArw,Name_DPh2_DPo7,Sex_NArw,Sex_DPb2_DPbn,Age_NArw,Age_DPrt,SibSp_NArw,SibSp_DPrt,...,Ticket_NArw,Ticket_DPhs_0_mlhs_DPod,Ticket_DPhs_1_mlhs_DPod,Ticket_DPhs_2_mlhs_DPod,Fare_NArw,Fare_DPrt,Cabin_NArw,Cabin_DPo4_DPod,Embarked_NArw,Embarked_DPo4_DPod
712,0,0.0,0,505,0,1,0,0.459663,0,0.125,...,0,265,0,0,0,0.103644,0,20,0,1
1032,0,1.0,0,112,0,0,0,0.271174,0,0.0,...,0,437,0,0,0,0.015127,1,8,0,1
883,0,0.0,0,298,0,1,1,0.646394,0,0.026208,...,0,470,0,0,0,0.050749,0,50,0,1
2263,0,0.5,0,803,0,1,1,0.337104,0,0.0,...,0,444,0,0,0,0.0,1,125,0,1
222,0,1.0,0,622,0,1,1,0.312921,0,0.0,...,0,411,0,0,0,0.01411,1,146,0,2


Similarly we can prepare additional test data in postmunge(.) using the postprocess_dict returned from automunge(.). Since we set testnoise as globally activated, noise will be injected even in the default traindata=False case.

In [9]:
entropy_seeds = [2345, 77887, 2342, 7878789]

traindata = False

test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, 
             df_test,
             entropy_seeds = entropy_seeds,
             traindata=traindata,
             printstatus=False,
            )

In [10]:
test.head()

Unnamed: 0,Pclass_NArw,Pclass_DPrt,Name_NArw,Name_DPh2_DPo7,Sex_NArw,Sex_DPb2_DPbn,Age_NArw,Age_DPrt,SibSp_NArw,SibSp_DPrt,...,Ticket_NArw,Ticket_DPhs_0_mlhs_DPod,Ticket_DPhs_1_mlhs_DPod,Ticket_DPhs_2_mlhs_DPod,Fare_NArw,Fare_DPrt,Cabin_NArw,Cabin_DPo4_DPod,Embarked_NArw,Embarked_DPo4_DPod
0,0,1.0,0,507,0,1,0,0.428248,0,0.0,...,0,552,0,0,0,0.019197,1,146,0,3
1,0,1.0,0,1010,0,0,0,1.0,0,0.125,...,0,186,0,0,0,0.013663,1,3,0,1
2,0,0.5,0,949,0,1,0,1.0,0,0.0,...,0,742,0,0,0,0.018909,1,5,0,3
3,0,1.0,0,682,0,1,0,0.334004,0,0.0,...,0,232,0,0,0,0.016908,1,45,0,1
4,0,1.0,0,525,0,0,0,0.271174,0,0.125,...,0,489,0,0,0,0.023984,1,3,0,1


_______

### Noise directed at existing data pipelines

When noise is intended for direction at an existing data pipeline, such as for incorporating noise into test data for an inference operation on a previously trained model, there may be a desire to inject noise without other edits to a dataframe. This is possible by passing the dataframe as a df_train to an automunge(.) call and assigning the various features to one of these three pass-through categories:
- DPne: pass-through numeric with gaussian (or laplace) noise, comparable parameter support to DPnb
- DPse: pass-through with swap noise (e.g. for categoric data), comparable parameter support to DPmc
- excl: pass-through without noise

Note that DPne will return entries as float data type, converting any non-numeric to NaN. The default noise scale for DPne (sigma=0.06 / test_sigma=0.03) is set to align with z-score normalized data. One way to adjust for differently scaled data could be to multiply the default by the standard deviation of the feature.

Note that DPse injects swap noise by accessing an alternate row entry for a target. This type of noise may not be suitable for test data injections in a scenario where inference may be run on a test set with one or very few samples.
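To make the single-sample caveat concrete, here is a standard-library sketch of swap noise on one column. The function name `swap_noise` is hypothetical, and drawing the replacement from a different row chosen uniformly at random is an assumption for illustration rather than a statement of the library's exact sampling.

```python
import random

def swap_noise(values, flip_prob=0.03, seed=None):
    """Replace roughly a flip_prob share of entries with another row's entry.

    Hypothetical sketch of swap noise for a single column.
    """
    rng = random.Random(seed)
    n = len(values)
    noisy = list(values)
    if n < 2:
        # with a single row there is no alternate row to swap from
        return noisy
    for i in range(n):
        if rng.random() < flip_prob:
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1                   # skip the entry's own row
            noisy[i] = values[j]
    return noisy

column = ['S', 'C', 'Q', 'S', 'S']
swapped = swap_noise(column, flip_prob=1.0, seed=5)
single_row = swap_noise(['S'], flip_prob=1.0, seed=5)
```

Because replacements are drawn from other rows, more common entries are more likely to be swapped in, and a one-row set comes back unchanged regardless of flip_prob.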

The order of columns in the returned dataframe will be retained. If retention of row order is desired, the shuffletrain parameter can be deactivated. Note that DPne and DPse will also adjust the column headers with suffix appenders. If recovery of the original column headers is desired, the following code snippet will apply, noting that it assumes only one returned column for each input column.
```
returned_headers = list(train)
#note this for loop not needed with automunge(.) setting excl_suffix=True
for excl_entry in postprocess_dict['excl_suffix_conversion_dict']:
    returned_headers[returned_headers.index(excl_entry)] = \
    postprocess_dict['excl_suffix_conversion_dict'][excl_entry]
orig_headers = [postprocess_dict['column_dict'][x]['origcolumn'] for x in returned_headers]
rename_dict = dict(zip(returned_headers, orig_headers))
train.rename(columns = rename_dict, inplace = True)
```

The convention in the library is that data is received in a tidy form (one column per feature and one row per observation), so ideally categoric features should be received in a single column configuration for targeting with DPse.

Here is an example of injecting noise to the Titanic's df_test.

In [11]:
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [12]:
numeric_features = \
['Age', 'Fare']

categoric_features = \
['Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked']

passthrough_features = \
['PassengerId']

#assign the features to one of DPne/DPse/excl
assigncat = \
{'DPne' : numeric_features, #numeric features receiving gaussian noise
 'DPse' : categoric_features,
 'excl' : passthrough_features,
}

#adjust the noise scale to account for unscaled data
#we'll use the default for test noise of sigma=0.03 as a starting point
Age_sigma = 0.03 * df_test['Age'].std()
Fare_sigma = 0.03 * df_test['Fare'].std()

#note that since we are passing the data to an automunge(.) call
#automunge will assume the data is training data for purposes of noise injection
#so any noise parameters should be passed to the train setting 
#(e.g. sigma instead of test_sigma)
#if laplace noise instead of gaussian desired the noisedistribution setting can be adjusted here

assignparam = \
{'DPne' : {'Age' : {'sigma' : Age_sigma},
           'Fare' : {'sigma' : Fare_sigma}}}

#We'll also deactivate shuffletrain to retain order of rows
shuffletrain = False

#note that the family trees for DPne/DPse/excl do not include NArw aggregation
#so no need to deactivate NArw_marker
#they are also already excluded from infill based on their process_dict specification
#so no need to deactivate MLinfill

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_test,
             assigncat = assigncat,
             assignparam = assignparam,
             shuffletrain = shuffletrain,
             printstatus = False,
            )

train.head()

Unnamed: 0,PassengerId,Pclass_DPse,Name_DPse,Sex_DPse,Age_DPne,SibSp_DPse,Parch_DPse,Ticket_DPse,Fare_DPne,Cabin_DPse,Embarked_DPse
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [13]:
#then to recover the original column headers

returned_headers = list(train)
#note this for loop not needed with automunge(.) setting excl_suffix=True
for excl_entry in postprocess_dict['excl_suffix_conversion_dict']:
    returned_headers[returned_headers.index(excl_entry)] = \
    postprocess_dict['excl_suffix_conversion_dict'][excl_entry]
orig_headers = [postprocess_dict['column_dict'][x]['origcolumn'] for x in returned_headers]
rename_dict = dict(zip(returned_headers, orig_headers))
train.rename(columns = rename_dict, inplace = True)

train.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


Then to consistently prepare additional test data, one can either pass the data to postmunge(.) with the populated postprocess_dict in conjunction with traindata=True, or just pass the additional data to automunge(.) with comparable parameter settings.

_______