### Automunge Noise Injection Demonstrations

Automunge has a suite of options for injecting noise into tabular features. Noise injections may serve several purposes, including data augmentation, bias mitigation, differential privacy, model perturbation for aggregation of ensembles, and non-determinism. Noise injections were introduced in the paper [Numeric Encoding Options with Automunge](https://medium.com/automunge/a-numbers-game-b68ac261c40d) and are discussed in more depth in the essay [Noise Injections with Automunge](https://medium.com/automunge/noise-injections-with-automunge-7ebb672216e2).

In [1]:
from Automunge import *
am = AutoMunge()

We'll demonstrate noise injection on the [Titanic](https://www.kaggle.com/c/titanic/data) set, a common benchmark.

In [2]:
import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#label and ID column headers for the titanic set
labels_column = 'Survived'
trainID_column = 'PassengerId'

______

### DP transformation categories

The DP family of transforms is surveyed in the README's library of transformations as [Differential Privacy Noise Injections](https://github.com/Automunge/AutoMunge/blob/master/README.md#differential-privacy-noise-injections).

The noise injections are performed in conjunction with numeric normalizations or categoric encodings. Here is a quick survey of options available as of version 7.59:

### Numeric Injections:
- **DPnb**: z-score normalization with Gaussian noise
- **DPmm**: min-max scaling with scaled Gaussian noise
- **DPrt**: retain normalization with scaled Gaussian noise
- **DLnb / DLmm / DLrt**: similar to three preceding with default to Laplace distributed noise instead of Gaussian
- **DPqt**: distribution conversion by quantile transform with Gaussian noise
- **DPbx**: distribution conversion by box-cox transform with Gaussian noise
- **DPns**: z-score normalization with swap-noise injection
- **DPne**: pass-through as numeric with noise, no normalization or infill
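
As a rough illustration of what a numeric injection such as DPnb does conceptually, the sketch below z-score normalizes a feature and then adds Gaussian noise to a randomly selected subset of entries. This is not Automunge's implementation: the function name `zscore_with_gaussian_noise` is hypothetical, and the `sigma` and `flip_prob` defaults are illustrative assumptions.

```python
import numpy as np

def zscore_with_gaussian_noise(values, sigma=0.06, flip_prob=0.03, rng=None):
    """Z-score normalize, then add Gaussian noise to a random subset of entries."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    normalized = (values - values.mean()) / values.std()
    # Bernoulli mask: each entry receives an injection with probability flip_prob
    mask = rng.random(values.shape) < flip_prob
    noise = rng.normal(loc=0.0, scale=sigma, size=values.shape)
    return normalized + mask * noise, mask

ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0])
clean = (ages - ages.mean()) / ages.std()
noisy, mask = zscore_with_gaussian_noise(
    ages, flip_prob=0.5, rng=np.random.default_rng(0))
```

Entries outside the mask are returned as plain z-scores, so `flip_prob` controls how many rows are perturbed while `sigma` controls how strongly.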

### Categoric Injections:
- **DPbn**: boolean integer encoding with weighted activation flips
- **DPod**: ordinal encoding with weighted activation flips
- **DPoh**: one hot encoding with weighted activation flips
- **DP10**: binarization with weighted activation set flips
- **DP1s**: binarization with swap-noise injection
- **DPhs**: multi column hash encoding (like hash) with weighted activation flips
- **DPh2**: single column hash encoding (like hsh2) with weighted activation flips
- **DPh1**: multi column hash binarization (like hs10) with weighted activation set flips
- **DPse**: pass-through as categoric with swap noise, no encoding or infill
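
To make "weighted activation flips" more concrete, here is a minimal sketch for an ordinal encoded column, assuming that a flipped entry is replaced by a category drawn either in proportion to observed frequency (weighted) or uniformly. The function name `flip_ordinal_activations` and the default `flip_prob` are hypothetical and this is not the library's code.

```python
import numpy as np

def flip_ordinal_activations(encoded, flip_prob=0.03, weighted=True, rng=None):
    """Replace a random subset of ordinal codes with resampled categories."""
    rng = np.random.default_rng() if rng is None else rng
    encoded = np.asarray(encoded)
    categories, counts = np.unique(encoded, return_counts=True)
    # weighted sampling follows the observed category frequencies,
    # otherwise every category is equally likely
    probs = counts / counts.sum() if weighted else None
    mask = rng.random(encoded.shape) < flip_prob
    replacements = rng.choice(categories, size=encoded.shape, p=probs)
    # note a replacement may coincide with the original value
    return np.where(mask, replacements, encoded), mask

encoded = np.array([0, 0, 0, 1, 1, 2, 0, 1, 0, 2])
flipped, mask = flip_ordinal_activations(
    encoded, flip_prob=0.5, rng=np.random.default_rng(0))
```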

Here is an example of assigning some of these root categories to received features with headers 'column1', 'column2', 'column3'. DTnb is z-score normalization with Gaussian noise to test data, shown here assigned to column1. DBod is ordinal encoding with weighted activation flips to both train and test data, shown here assigned to column2 and column3. (To just inject to train data the identifier string for that default configuration replaces the DT or DB prefix with DP.)
```
assigncat = {
  'DTnb' : 'column1',
  'DBod' : ['column2', 'column3'],
}
```


_______

### DP parameters

Each of these transformations accepts optional parameter specifications to vary from the defaults. Parameters are passed to transformations through the automunge(.) assignparam parameter. As we described in Appendix F, parameter assignments through assignparam can be conducted in three ways:

- **global_assignparam** passes the setting to every transform applied to every column
- **default_assignparam** passes the same setting to every instance of a specific transformation's tree category identifier applied to any column
- a **(category)** entry assigns a parameter setting to a specific transformation tree category identifier passed to a specific column (where that column may be an input column or a derived column with suffix appender passed to the transform)

Note that the difference between a tree category and a root category is that a root category is the identifier of the family tree of transformation categories assigned to a column in the assigncat parameter, and a tree category is an entry to one of those family tree primitives which is used to access the transformation function. To restate for clarity, the (column) string designates either the input column header (before suffixes are applied) or an intermediate column header with suffixes that serves as input to the target transformation.
```
assignparam = {
  'global_assignparam'  : {'(parameter)': 42},
  'default_assignparam' : {'(category)' : {'(parameter)' : 42}},
  '(category)' : {'(column)'   : {'(parameter)' : 42}},
}
```
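The three scopes can be thought of as layers that are merged for each (category, column) pair. The sketch below assumes that a column-specific entry overrides default_assignparam, which in turn overrides global_assignparam; that precedence order is an assumption here, and `resolve_params` is a hypothetical helper rather than a library function.

```python
def resolve_params(assignparam, category, column):
    """Merge the three assignparam scopes for one (category, column) pair."""
    # start from the broadest scope and let narrower scopes override it
    resolved = dict(assignparam.get('global_assignparam', {}))
    resolved.update(assignparam.get('default_assignparam', {}).get(category, {}))
    resolved.update(assignparam.get(category, {}).get(column, {}))
    return resolved

example_assignparam = {
    'global_assignparam': {'testnoise': True},
    'default_assignparam': {'DPnb': {'flip_prob': 0.1}},
    'DPnb': {'column1': {'flip_prob': 0.2, 'sigma': 0.1}},
}

column1_params = resolve_params(example_assignparam, 'DPnb', 'column1')
column2_params = resolve_params(example_assignparam, 'DPnb', 'column2')
```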
For noise injections that are performed in conjunction with a normalization or encoding, the noise transform is generally applied in a different tree category than the encoding transform, so to pass parameters to the encoding, assignparam will need to target a different tree category for the encoding than for the noise. Generally speaking, the noise transform family trees have been configured so that the noise tree category matches the root category, which was intentional for simplicity of parameter assignment (with an exception for DPhs for esoteric reasons). To inspect the encoding tree category, the full set of family trees associated with the various root categories is provided in the code repository as FamilyTrees.md.

As noted in Appendix D, for subsequent data passed to postmunge(.), the data can also be treated as test data or train data, and in both cases also have noise deactivated. The postmunge(.) traindata parameter defaults to False to prepare postmunge(.) as test data and accepts entries of {False, True, 'test_no_noise', 'train_no_noise'}.

Note that assignparam can also be used to deviate from the default train or test noise injection settings. As noted above, the convention for the string identifiers of noise root categories is that 'DP' injects noise to train and not test data, 'DT' injects noise to test and not train data, and 'DB' injects noise to both train and test data. These are the defaults, but each of these can be updated by parameter assignment with assignparam specification of 'trainnoise' or 'testnoise' parameters. 

As further illustrated in Table 2 of [Noise Injections with Automunge](https://medium.com/automunge/noise-injections-with-automunge-7ebb672216e2), the trainnoise and testnoise parameters can be used in conjunction with the postmunge(.) traindata parameter to target noise towards train and/or test data.
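
The interaction between the root category prefix, the trainnoise / testnoise parameters, and the postmunge(.) traindata parameter can be summarized as a small decision function. This is a sketch of the behavior as described in this section, not code from the library; `PREFIX_DEFAULTS` and `noise_is_injected` are hypothetical names.

```python
# default trainnoise / testnoise settings implied by each root category prefix
PREFIX_DEFAULTS = {
    'DP': {'trainnoise': True, 'testnoise': False},
    'DT': {'trainnoise': False, 'testnoise': True},
    'DB': {'trainnoise': True, 'testnoise': True},
}

def noise_is_injected(traindata, trainnoise, testnoise):
    """Return True if data prepared with this traindata setting receives noise."""
    if isinstance(traindata, str):
        if traindata in ('test_no_noise', 'train_no_noise'):
            return False  # noise explicitly deactivated
        raise ValueError(f"unsupported traindata entry: {traindata!r}")
    # traindata=True treats the data as train data, False as test data
    return bool(trainnoise) if traindata else bool(testnoise)

dp_on_test = noise_is_injected(False, **PREFIX_DEFAULTS['DP'])
dt_on_test = noise_is_injected(False, **PREFIX_DEFAULTS['DT'])
```

For instance, a DP category prepared with the postmunge(.) default of traindata=False receives no noise unless testnoise is activated through assignparam.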

Most of the noise injection transforms share common parameters between those targeting numeric or categoric entries. Here is a summary.

### Numeric Parameters:
- **trainnoise**: activates noise injection to train data (defaults True for DP or DB and False for DT)
- **testnoise**: activates noise injection to test data (defaults True for DT or DB and False for DP)
- **flip_prob**: ratio of train entries receiving injection
- **test_flip_prob**: ratio of test entries receiving injection
- **sigma**: scale of train noise distribution
- **test_sigma**: scale of test noise distribution
- **mu**: mean of train noise distribution (before any scaling)
- **test_mu**: mean of test noise distribution (before any scaling)
- **noisedistribution**: train noise distribution, defaults to 'normal' (Gaussian), accepts one of {'normal', 'laplace', 'uniform', 'abs_normal', 'abs_laplace', 'abs_uniform', 'negabs_normal', 'negabs_laplace', 'negabs_uniform'}, where abs refers to all positive signed noise and negabs refers to all negative signed noise
- **test_noisedistribution**:  test noise distribution, comparable options supported
- **rescale_sigmas**: for min-max normalization (DPmm) or retain normalization (DPrt), this activates the mean adjustment noted in Appendix H, defaults to True
- **retain_basis**: for cases where distribution parameters passed as list or distribution, activating retain_basis means the basis sampled in automunge is carried through to postmunge or the default of False means a unique basis is sampled in automunge and postmunge
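
The noisedistribution options can be illustrated with a small sampler. The sketch below is an assumption-laden illustration rather than the library's sampler: `sample_noise` is a hypothetical name, the `sigma` default is illustrative, and the uniform case is assumed here to span `[mu - sigma, mu + sigma]`.

```python
import numpy as np

def sample_noise(noisedistribution, size, mu=0.0, sigma=0.06, rng=None):
    """Sample noise for names like 'normal', 'abs_laplace' or 'negabs_uniform'."""
    rng = np.random.default_rng() if rng is None else rng
    # split an optional sign prefix from the base distribution name
    prefix, _, base = noisedistribution.rpartition('_')
    if base == 'normal':
        draws = rng.normal(mu, sigma, size)
    elif base == 'laplace':
        draws = rng.laplace(mu, sigma, size)
    elif base == 'uniform':
        draws = rng.uniform(mu - sigma, mu + sigma, size)  # assumed range
    else:
        raise ValueError(f"unsupported distribution: {noisedistribution}")
    if prefix == 'abs':
        return np.abs(draws)       # all positive signed noise
    if prefix == 'negabs':
        return -np.abs(draws)      # all negative signed noise
    if prefix == '':
        return draws
    raise ValueError(f"unsupported distribution: {noisedistribution}")

positive = sample_noise('abs_normal', 1000, rng=np.random.default_rng(0))
negative = sample_noise('negabs_laplace', 1000, rng=np.random.default_rng(0))
signed = sample_noise('normal', 1000, rng=np.random.default_rng(0))
```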

### Categoric and Swap Parameters:
- **trainnoise**: activates noise injection to train data (defaults True for DP or DB and False for DT)
- **testnoise**: activates noise injection to test data (defaults True for DT or DB and False for DP)
- **flip_prob**: ratio of train entries receiving injection
- **test_flip_prob**: ratio of test entries receiving injection
- **weighted**: weighted vs uniform sampling of activation flips to train data
- **test_weighted**: weighted vs uniform sampling of activation flips to test data
- **retain_basis**: for cases where distribution parameters passed as list or distribution, activating retain_basis means the basis sampled in automunge is carried through to postmunge or the default of False means a unique basis is sampled in automunge and postmunge

Here is an example of assignparam specification to: 
- set an all positive noise distribution for category DPmm as applied to an input column with header 'column1', noting that for scaled noise like DPmm all positive or all negative noise should be performed with a deactivated noise_scaling_bias_offset.
- update the flip_prob parameter to 0.1 for all cases of DPnb injections via default_assignparam
- apply testnoise injections to all noise transforms via global_assignparam
```
#assumes DPmm and DPnb have been assigned in assigncat
assignparam = {
  'DPmm' : {'column1': {'noisedistribution'         : 'abs_normal',
                        'noise_scaling_bias_offset' : False}},
  'default_assignparam' : {'DPnb' : {'flip_prob' : 0.1}},
  'global_assignparam'  : {'testnoise': True},
}
```


_______

### Noise injection under automation

The automunge(.) powertransform parameter can be used to select between alternate sets of default transformations applied under automation. We currently have two scenarios for default encodings with noise, selected by passing powertransform as one of {'DP1', 'DP2'}. DP2 differs from DP1 in that numerical defaults to retain normalization instead of z-score and categoric defaults to ordinal instead of binarization. (The DT and DB equivalents DT1/DT2/DB1/DB2 are available for different default train and test noise configurations.)

**'DP1'** 
- numerical receives DPnb
- categoric receives DP10
- binary receives DPbn
- hash receives DPhs
- hsh2 receives DPh2
- (labels do not receive noise)

**'DP2'**
- numerical receives DPrt
- categoric receives DPod
- binary receives DPbn
- hash receives DPhs
- hsh2 receives DPh2
- (labels do not receive noise)

An example specification:
```
powertransform = 'DP2'
```

Otherwise noise can just be manually assigned in the assigncat parameter as demonstrated above; those specifications take precedence over what would otherwise be performed under automation.

_______

### Data Augmentation with Noise

Data augmentation refers to increasing the size of a training set with manipulations to increase variety. In the image modality it is common to achieve data augmentation by way of adjustments like image cropping, rotations, color shift, etc. Here we are simply adding noise for similar effect. In a deep learning benchmark performed in our paper [Numeric Encoding Options with Automunge](https://medium.com/automunge/a-numbers-game-b68ac261c40d) we found that this type of data augmentation was fairly benign with a fully represented data set, but was increasingly beneficial with underserved training data.

Data augmentation can be realized by assigning noise transforms in conjunction with the automunge(.) noise_augment parameter, which accepts the number of additional duplicates to prepare, e.g. noise_augment=1 would double the size of the training set returned from automunge(.). Where too much duplication starts to run into memory constraints, additional duplicates can also be prepared with postmunge(.), which also has a noise_augment parameter and accepts the traindata parameter to distinguish whether a data set is to be treated as train or test data.

Under the default configuration, when noise_augment is received as an integer dtype (e.g. 2), one of the duplicates will be prepared without noise. If noise_augment is received as a float with an integer value (e.g. 2.0), all of the duplicates will be prepared with noise.
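
The row count arithmetic and the integer versus float distinction can be sketched with plain numpy. This is a simplified stand-in rather than the library's procedure: `augment_with_noise` is a hypothetical function, noise is added to every entry for brevity, and the noise-free copy in the integer case is assumed to be the first one.

```python
import numbers
import numpy as np

def augment_with_noise(features, noise_augment, sigma=0.06, rng=None):
    """Stack the original rows with noise_augment additional duplicates."""
    rng = np.random.default_rng() if rng is None else rng
    features = np.asarray(features, dtype=float)
    total_copies = int(noise_augment) + 1
    # integer input keeps one noise-free copy, float input makes every copy noisy
    keep_clean_copy = isinstance(noise_augment, numbers.Integral)
    copies = []
    for i in range(total_copies):
        if keep_clean_copy and i == 0:
            copies.append(features.copy())
        else:
            noise = rng.normal(0.0, sigma, size=features.shape)
            copies.append(features + noise)
    return np.concatenate(copies, axis=0)

rng = np.random.default_rng(0)
X = np.arange(8, dtype=float).reshape(4, 2)
aug_float = augment_with_noise(X, 2.0, rng=rng)  # three noisy copies
aug_int = augment_with_noise(X, 2, rng=rng)      # one clean copy, two noisy
```

In both cases four input rows become twelve, matching the tripling that noise_augment = 2 implies.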

Here is an example of preparing data augmentation for the Titanic set loaded earlier.

In [3]:
#number of rows in original training data
df_train.shape[0]

891

In [4]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             powertransform = 'DP2',
             noise_augment = 2.0,
             printstatus = False)

In [5]:
#number of rows in training data after noise_augment
train.shape[0]

2673

_______

### Alternate random samplers

The random sampling for noise injection defaults to numpy's PCG64, which is based on the [PCG pseudo random number generator](https://www.pcg-random.org/index.html). On its own this generator is not truly random; it relies on seedings of entropy provided by the operating system which are then enhanced through use. To support integration of enhanced randomness profiles, both automunge(.) and postmunge(.) accept parameters for entropy_seeds and random_generator.

entropy_seeds accepts an integer or list/array of integers which may serve as a supplemental source of entropy for the numpy.random generator to enhance randomness properties.

random_generator accepts input of a numpy.random.Generator formatted random sampler. An example could be numpy.random.MT19937 for Mersenne Twister, or could even be an external library with a numpy.random formatted generator, such as for example could be used to sample with the support of quantum circuits.

If an alternate library does not have a numpy.random formatted generator, their output can be channeled to entropy_seeds for similar benefit.

Specifications of entropy_seeds and random_generator are specific to an automunge(.) or postmunge(.) call, in other words they are not returned in the populated postprocess_dict.

The two parameters can also be passed in tandem, for sampling with a custom generator with custom supplemental entropy seeds.

Here is an example of specifying an alternate generator and supplemental entropy seedings.
```
import numpy
random_generator = numpy.random.MT19937
entropy_seeds = [4,5,6]
```
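For readers less familiar with numpy's sampling interface, the sketch below shows how numpy itself combines a list of integer entropy seeds with a bit generator such as MT19937 or PCG64. It only demonstrates the numpy API; how Automunge consumes entropy_seeds internally is not shown and may differ.

```python
import numpy as np

example_seeds = [4, 5, 6]

# SeedSequence mixes the integers into a high quality initial state
mt_generator = np.random.Generator(
    np.random.MT19937(np.random.SeedSequence(example_seeds)))
pcg_generator = np.random.Generator(
    np.random.PCG64(np.random.SeedSequence(example_seeds)))

# the same seeds and bit generator reproduce the same samples
mt_repeat = np.random.Generator(
    np.random.MT19937(np.random.SeedSequence(example_seeds)))

mt_draws = mt_generator.normal(size=3)
mt_repeat_draws = mt_repeat.normal(size=3)
pcg_draws = pcg_generator.normal(size=3)
```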
In the default case the same bank of entropy seeds is fed to each sampling operation with a shuffle. The library also supports different types of sampling scenarios that can be supported by entropy seedings. Alternate sampling scenarios can be specified to automunge(.) or postmunge(.) by the sampling_dict parameter. Here are a few scenarios to illustrate.

1) In one scenario, instead of passing externally sampled supplemental entropy seeds, a user can pass a custom generator for internal sampling of entropy seeds. Here is an example of using a custom generator to sample entropy seeds and the default generator PCG64 for sampling applied in the transformations. The sampling_type bulk_seeds means that a unique seed will be generated for each sampled entry.
```
random_generator = (custom numpy formatted generator)
entropy_seeds = False
sampling_dict = \
{'sampling_type' : 'bulk_seeds',
 'extra_seed_generator' : 'custom',
 'sampling_generator' : 'PCG64',
 }
```

2) In another scenario a user may want to reduce their sampling budget by only accessing one entropy seed for each set of entries. This is accessed with the sampling_type of sampling_seed.
```
random_generator = (custom numpy formatted generator)
entropy_seeds = False
sampling_dict = \
{'sampling_type' : 'sampling_seed',
 'extra_seed_generator' : 'custom',
 'sampling_generator' : 'PCG64',
 }
```

3) There may be a case where a source of supplemental entropy seeds isn't available as a numpy.random formatted generator. In this case, in order to apply one of the alternate sampling_type scenarios, a user may want to know the budget of how many externally sampled seeds need to be passed through the entropy_seeds parameter. This can be found by first running the automunge(.) call without entropy seeding specifications to generate the report returned as postprocess_dict['sampling_report_dict']. (Note that this isn't needed if sampling seeds internally with a custom generator.) The sampling_report_dict reports requirements separately for train and test data, and in the bulk_seeds case has a row count basis. (If not passing test data to automunge(.) the test budget can be omitted. For postmunge(.) the use of train or test budget should align with the postmunge(.) traindata parameter.) For example, if a user wishes to derive a set of entropy seeds to support a bulk_seeds sampling type, they can produce a report and derive the budget as follows:
```
#first run automunge(.) to populate postprocess_dict (not shown)
#using comparable category and parameter assignments
#we recommend running initially with default sampling_type
#which will populate sampling_report_dict for test data even if df_test not provided

#access the sampling_report_dict in the returned postprocess_dict
sampling_report_dict = postprocess_dict['sampling_report_dict']

#a bulk_seeds sampling_type budget will need to take account for row counts
rowcount_train = df_train.shape[0]
rowcount_test = df_test.shape[0]

#the budget can be derived as
train_budget = \
sampling_report_dict['bulk_seeds_total_train'] * rowcount_train \
/ sampling_report_dict['rowcount_basis_train']

test_budget = \
sampling_report_dict['bulk_seeds_total_test'] * rowcount_test \
/ sampling_report_dict['rowcount_basis_test']

#number of external seeds needed for bulk seeds case:
seed_count = \
train_budget + test_budget

#this number of seeds can then be passed to the entropy_seeds parameter

random_generator = False
entropy_seeds = externally_sampled_seeds_list
sampling_dict = \
{'sampling_type' : 'bulk_seeds',
 'extra_seed_generator' : 'PCG64',
 'sampling_generator' : 'PCG64',
 }
```
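
As a worked instance of the budget arithmetic, the same calculation can be wrapped in a small function and rounded up to a whole number of seeds. The report values below are made-up placeholders chosen only to show the arithmetic (they are not output from an actual automunge(.) run), and `bulk_seed_budget` is a hypothetical helper.

```python
import math

def bulk_seed_budget(sampling_report_dict, rowcount_train, rowcount_test=0):
    """Scale the reported bulk seed totals to the row counts being prepared."""
    train_budget = (sampling_report_dict['bulk_seeds_total_train'] * rowcount_train
                    / sampling_report_dict['rowcount_basis_train'])
    test_budget = (sampling_report_dict['bulk_seeds_total_test'] * rowcount_test
                   / sampling_report_dict['rowcount_basis_test'])
    # round up so a fractional result never leaves the sampler short of seeds
    return math.ceil(train_budget) + math.ceil(test_budget)

# placeholder report values for illustration only
mock_report = {
    'bulk_seeds_total_train': 40, 'rowcount_basis_train': 10,
    'bulk_seeds_total_test': 20, 'rowcount_basis_test': 10,
}

seed_count = bulk_seed_budget(mock_report, rowcount_train=891, rowcount_test=418)
train_only = bulk_seed_budget(mock_report, rowcount_train=891)
```

With these placeholder values the train budget is 40 * 891 / 10 = 3564 and the test budget is 20 * 418 / 10 = 836, for 4400 seeds in total.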

_______

### QRAND library sampling

To sample noise from a quantum circuit, a user can either pass externally sampled entropy_seeds or make use of an external library with a numpy.random formatted generator. Here's an example of using the [QRAND](https://github.com/pedrorrivero/qrand) library to sample from a quantum circuit, based on a tutorial provided in their README, which makes use of [Qiskit](https://www.ibm.com/quantum-computing/) for access to cloud resources.

```
from qrand import QuantumBitGenerator
from qrand.platforms import QiskitPlatform
from qrand.protocols import HadamardProtocol
from numpy.random import Generator
from qiskit import IBMQ

provider = IBMQ.load_account()
platform = QiskitPlatform(provider)
protocol = HadamardProtocol()
bitgen = QuantumBitGenerator(platform, protocol)
# gen = Generator(bitgen)

#then can initialize automunge(.) or postmunge(.) parameters as
random_generator = bitgen
entropy_seeds = False
sampling_dict = \
{'sampling_type' : 'default',
 'extra_seed_generator' : 'off',
 'sampling_generator' : 'custom',
 }
```

_______

### All Together Now

Let's do a quick demonstration tying it all together.

Here we'll apply the powertransform = 'DP2' option for noise under automation, override a few of the default transforms with assigncat, assign a few deviations to transformation parameters via assignparam, add some additional entropy seeds from some other resource, and prepare a few additional training data duplicates for data augmentation purposes.

In [6]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [7]:
powertransform = 'DP2'

assigncat = {'DPh2' : 'Name'}

noise_augment = 2.

entropy_seeds = [432,6,243,561232,89]

#(Age is a feature header in the Titanic data set)
assignparam = {
    'DPrt' : {'Age': {'noisedistribution'         : 'abs_normal',
                      'noise_scaling_bias_offset' : False}},
    'default_assignparam' : {'DPrt' : {'flip_prob' : 0.1}},
    'global_assignparam'  : {'testnoise': True},
}

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             labels_column = labels_column,
             trainID_column = trainID_column,
             powertransform = powertransform,
             assigncat = assigncat,
             noise_augment = noise_augment,
             entropy_seeds = entropy_seeds,
             assignparam = assignparam,
             printstatus = False)

In [8]:
train.head()

Unnamed: 0,Pclass_NArw,Pclass_DPrt,Name_NArw,Name_DPh2_DPo7,Sex_NArw,Sex_DPb2_DPbn,Age_NArw,Age_DPrt,SibSp_NArw,SibSp_DPrt,...,Ticket_NArw,Ticket_DPhs_0_mlhs_DPod,Ticket_DPhs_1_mlhs_DPod,Ticket_DPhs_2_mlhs_DPod,Fare_NArw,Fare_DPrt,Cabin_NArw,Cabin_DPo4_DPod,Embarked_NArw,Embarked_DPo4_DPod
25,0,1.0,0,19,0,0,0,0.132948,0,0.433644,...,0,119,0,0,0,0.061045,1,3,0,1
1789,0,1.0,0,806,0,1,0,0.019854,0,0.375,...,0,328,0,0,0,0.041136,1,7,0,1
1302,0,1.0,0,355,0,1,1,0.358235,0,0.0,...,0,870,0,0,0,0.013387,1,146,0,3
253,0,1.0,0,590,0,1,0,0.497361,0,0.125,...,0,915,0,0,0,0.054457,1,3,0,1
1775,0,1.0,0,961,0,1,0,0.308872,0,0.0,...,0,921,236,0,0,0.013761,1,129,0,1


Similarly we can prepare additional test data in postmunge(.) using the postprocess_dict returned from automunge(.), which since we set testnoise as globally activated will result in injected noise in the default traindata=False case.

In [9]:
entropy_seeds = [2345, 77887, 2342, 7878789]

traindata = False

test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, 
             df_test,
             entropy_seeds = entropy_seeds,
             traindata=traindata,
             printstatus=False,
            )

In [10]:
test.head()

Unnamed: 0,Pclass_NArw,Pclass_DPrt,Name_NArw,Name_DPh2_DPo7,Sex_NArw,Sex_DPb2_DPbn,Age_NArw,Age_DPrt,SibSp_NArw,SibSp_DPrt,...,Ticket_NArw,Ticket_DPhs_0_mlhs_DPod,Ticket_DPhs_1_mlhs_DPod,Ticket_DPhs_2_mlhs_DPod,Fare_NArw,Fare_DPrt,Cabin_NArw,Cabin_DPo4_DPod,Embarked_NArw,Embarked_DPo4_DPod
0,0,1.0,0,15,0,1,0,0.428248,0,0.0,...,0,739,0,0,0,0.015282,1,45,0,3
1,0,1.0,0,68,0,0,0,1.0,0,0.122006,...,0,789,0,0,0,0.013663,1,3,0,1
2,0,0.5,0,44,0,1,0,1.0,0,0.0,...,0,87,0,0,0,0.018256,1,5,0,3
3,0,1.0,0,310,0,1,0,0.334004,0,0.0,...,0,822,0,0,0,0.016908,1,45,0,1
4,0,1.0,0,311,0,0,0,0.271174,0,0.125,...,0,829,0,0,0,0.023984,1,3,0,1


_______

### Noise directed at existing data pipelines

One more thing. When noise is intended for direction at an existing data pipeline, such as for incorporation of noise into test data for an inference operation on a previously trained model, there may be desire to inject noise without other edits to a dataframe. This is possible by passing the dataframe as a df_train to an automunge(.) call and assigning the various features to one of these three pass-through categories:

- DPne: pass-through numeric with Gaussian (or Laplace) noise, comparable parameter support to DPnb
- DPse: pass-through with swap noise (e.g. for categoric data), comparable parameter support to DPmc
- excl: pass-through without noise

Note that DPse injects swap noise by accessing an alternate row entry for a target. This type of noise may not be suitable for test data injections in a scenario where inference may be run on a test set with one or very few samples. The convention in the library is that data is received in a tidy form (one column per feature and one row per observation), so ideally categoric features should be received in a single column configuration for targeting with DPse.
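
Swap noise can be sketched in a few lines: for a random subset of rows, the entry is replaced by the value found in another randomly chosen row of the same column. The function below is a hypothetical illustration (name and default `flip_prob` are assumptions), and it also shows why a single row sample cannot be meaningfully perturbed this way.

```python
import numpy as np

def swap_noise(column, flip_prob=0.03, rng=None):
    """Replace a random subset of entries with values from other rows."""
    rng = np.random.default_rng() if rng is None else rng
    column = np.asarray(column, dtype=object)
    n_rows = column.shape[0]
    mask = rng.random(n_rows) < flip_prob
    # each row draws a donor row from the same column
    donor_rows = rng.integers(0, n_rows, size=n_rows)
    return np.where(mask, column[donor_rows], column), mask

embarked = np.array(['S', 'C', 'S', 'Q', 'S', 'C'], dtype=object)
swapped, mask = swap_noise(embarked, flip_prob=0.5, rng=np.random.default_rng(0))

# with only one row the donor is always the row itself, so nothing changes
single, _ = swap_noise(np.array(['S'], dtype=object), flip_prob=1.0,
                       rng=np.random.default_rng(0))
```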

Note that DPne will return entries as float data type, converting any non-numeric to NaN. The default noise scale for DPne (sigma=0.06 / test_sigma=0.03) is set to align with z-score normalized data. For the DPne pass-through transform, since the feature may not be z-score normalized, the scaling is adjusted by multiplication with the evaluated standard deviation of the feature as found in the training data by use of the defaulted parameter rescale_sigmas = True. This adjustment factor is derived based on the training data used to fit the postprocess_dict, and that same basis is carried through to postmunge(.). If a user doesn't have access to the training data, they can fit the postprocess_dict to a representative df_test routed as the automunge(.) df_train.
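
The rescaling described above can be sketched as a fit step and an apply step. This is an illustration under stated assumptions, not the library's code: `fit_noise_scale` and `passthrough_noise` are hypothetical names, the population standard deviation is assumed, and the `flip_prob` default is illustrative.

```python
import numpy as np
import pandas as pd

def fit_noise_scale(train_column, sigma=0.06):
    """Scale sigma by the standard deviation of the fitting data."""
    numeric = pd.to_numeric(train_column, errors='coerce')
    return sigma * float(numeric.std(ddof=0))

def passthrough_noise(column, scale, flip_prob=0.03, rng=None):
    """Return the column as floats with noise added to a random subset."""
    rng = np.random.default_rng() if rng is None else rng
    # non-numeric entries become NaN and stay NaN after injection
    numeric = pd.to_numeric(column, errors='coerce').astype(float)
    mask = rng.random(len(numeric)) < flip_prob
    noise = rng.normal(0.0, scale, size=len(numeric))
    return numeric + mask * noise

fare = pd.Series([7.25, 71.2833, 'unknown', 53.1])
scale = fit_noise_scale(fare)  # basis fit once, then reused for later data
noisy_fare = passthrough_noise(fare, scale, flip_prob=0.5,
                               rng=np.random.default_rng(0))
```

Fitting the scale once and reusing it mirrors the way the basis fit in automunge(.) is carried through to postmunge(.).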

Here is an example of injecting noise to the Titanic's df_test.

In [11]:
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [12]:
numeric_features = \
['Age', 'Fare']

categoric_features = \
['Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked']

passthrough_features = \
['PassengerId']

#assign the features to one of DTne/DTse/excl
#The DT configuration defaults to injecting noise just to test data
assigncat = \
{'DTne' : numeric_features, #numeric features receiving gaussian noise
 'DTse' : categoric_features, #categoric features receiving swap noise
 'excl' : passthrough_features,
}

#if we want to update the noise parameters they can be applied in assignparam
#shown here are the defaults
assignparam = \
{'default_assignparam' : 
  {'DTne' : {'test_sigma' : 0.03,
             'rescale_sigmas' : True},
   'DTse' : {'test_flip_prob' : 0.01}}}

#We'll also deactivate shuffletrain to retain order of rows
shuffletrain = False

#note that the family trees for DPne/DPse/excl do not include NArw aggregation
#so no need to deactivate NArw_marker
#they are also already excluded from infill based on process_dict specification
#so no need to deactivate MLinfill

#the orig_headers parameter retains original column headers without suffix appenders
orig_headers = True

#this operation will fit the postprocess_dict to the df_test
#if the training set is accessible that could be used in its place 
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_test,
             assigncat = assigncat,
             assignparam = assignparam,
             shuffletrain = shuffletrain,
             orig_headers = orig_headers,
             printstatus = False,
            )
            
#we can then use the populated postprocess_dict to run postmunge(.)
#which has better latency than automunge(.)
#the entropy seeding parameters are shown with their defaults for reference

test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(
  postprocess_dict, 
  df_test,
  printstatus = False,
  random_generator = False,
  entropy_seeds = False,
  sampling_dict = {},
)

In [13]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


The returned dataframe test can then be passed to inference. The order of columns in the returned dataframe will be retained for these transforms, and the orig_headers parameter retains original column headers without suffix appenders.

The postmunge(.) call can then be repeated as additional inference data becomes available, and could be applied sequentially to streams of data in inference.

_______