# Data Augmentation Demonstration

Automunge is available now for pip install:

In [1]:
# !pip install Automunge

Or to upgrade (we currently roll out upgrades pretty frequently):

In [2]:
# !pip install Automunge --upgrade

Once installed, run this in a local session to initialize:

In [3]:
from Automunge import AutoMunge
am = AutoMunge()

We'll demonstrate data augmentation on the Titanic set, a common benchmark.

In [4]:
import pandas as pd

#titanic train and test sets
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#label column and ID column for the titanic set
labels_column = 'Survived'
trainID_column = 'PassengerId'

In [5]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
df_train.shape

(891, 12)

Data augmentation is applied by assigning target columns to a noise injection transformation category in assigncat. It may be targeted to numeric sets or to bounded categoric sets.

In [7]:
numeric_columns = ['Age', 'Parch', 'Fare']
categoric_columns = ['Pclass', 'Cabin', 'Embarked']

There are a few forms of augmentation to choose from. Each applies a different type of normalization or categoric encoding in addition to the noise injection.

Numeric
- DPnb: for z-score normalized numeric data
- DPmm: for min-max scaled numeric data
- DPrt: for retain normalized numeric data

Categoric
- DPbn: for binary categoric data (i.e. two value sets)
- DPod: for ordinal encoded categoric data
- DPoh: for one-hot encoded categoric data
- DP10: for binarized categoric data
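
As a rough intuition for the categoric case, noise injection can be pictured as replacing a random subset of entries with draws from the set of observed categories. The sketch below is a conceptual illustration only, not Automunge's implementation: the helper name `flip_categories` is hypothetical, and the uniform draw (which may land on the entry's current value) is an assumption. The `flip_prob` argument borrows the parameter name used later in this notebook.

```python
import numpy as np

def flip_categories(values, flip_prob, seed=None):
    """Hypothetical helper (not part of Automunge): replace a random subset
    of entries with uniform draws from the observed categories."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    categories = np.unique(values)
    # each entry is selected for replacement with probability flip_prob
    mask = rng.random(values.shape[0]) < flip_prob
    # assumption: replacements are drawn uniformly, so a draw can equal
    # the current value and leave that entry unchanged
    draws = rng.choice(categories, size=values.shape[0])
    return np.where(mask, draws, values)

embarked = np.array(['S', 'C', 'S', 'Q', 'S', 'C'] * 100)
noisy = flip_categories(embarked, flip_prob=0.5, seed=0)
changed = np.mean(noisy != embarked)
print(f"fraction of entries changed: {changed:.2f}")
```

Because a draw can return the value already present, the fraction of entries that actually change is somewhat lower than `flip_prob` in this sketch.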

Here we'll demonstrate by applying z-score normalization to the numeric sets and binarization to the bounded categoric sets, each with noise injection.

In [8]:
assigncat = {'DPnb' : numeric_columns, 
             'DP10' : categoric_columns}

Columns that are not explicitly assigned to transformation categories in assigncat are deferred to automation.

Note that the DP family of transforms injects noise into training data but does not inject noise into test data.
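
The train versus test distinction can be sketched on a single numeric column. This is a minimal illustration under stated assumptions, not the library's code: the function name `zscore_with_train_noise` is hypothetical, Gaussian noise is added to every training entry, and the normalization statistics are fit on the training column. The sample values are the first five Age entries shown above.

```python
import numpy as np

def zscore_with_train_noise(train_col, test_col, sigma, seed=None):
    """Hypothetical helper: z-score both columns on the train basis,
    then add Gaussian noise to the train copy only."""
    rng = np.random.default_rng(seed)
    train_col = np.asarray(train_col, dtype=float)
    test_col = np.asarray(test_col, dtype=float)
    # normalization parameters come from the training data
    mean, std = train_col.mean(), train_col.std()
    train_norm = (train_col - mean) / std
    test_norm = (test_col - mean) / std
    # noise is injected into the train copy; the test copy stays clean
    train_noisy = train_norm + rng.normal(0.0, sigma, size=train_norm.shape)
    return train_noisy, test_norm

ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0])
# passing the same column as both train and test mirrors the trick used below
train_out, test_out = zscore_with_train_noise(ages, ages, sigma=0.05, seed=0)
clean = (ages - ages.mean()) / ages.std()
print(np.allclose(test_out, clean))   # checks that the test copy carries no noise
print(np.allclose(train_out, clean))  # checks whether the train copy matches the clean values
```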

Here we'll demonstrate processing the same training data set twice, once with and once without noise injection, and concatenating the two results. This is done by passing the same training set df_train as both train and test data.

In [9]:
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train,
               df_test = df_train,
               labels_column = labels_column,
               trainID_column = trainID_column,
               assigncat = assigncat,
               printstatus = False)

train = pd.concat([train, test], axis=0, ignore_index=True)
trainID = pd.concat([trainID, testID], axis=0, ignore_index=True)
labels = pd.concat([labels,testlabels], axis=0, ignore_index=True)

In [10]:
train.shape

(1782, 35)
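
The 1782 rows are twice the 891 rows of df_train: one noisy copy and one clean copy. The same pattern can be sketched on a toy frame. The column values and noise scale here are made up for illustration, and the noise is added by hand rather than by Automunge.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# toy stand-in for an already normalized training set
clean = pd.DataFrame({'Age_norm': [-1.0, 0.0, 1.0], 'Sex_bnry': [1, 0, 0]})
labels_clean = pd.Series([0, 1, 1], name='Survived')

# noisy copy: same rows, with Gaussian noise added to the numeric column
noisy = clean.copy()
noisy['Age_norm'] = noisy['Age_norm'] + rng.normal(0.0, 0.05, size=len(noisy))

# stack noisy and clean copies; the labels are repeated in the same row order
augmented = pd.concat([noisy, clean], axis=0, ignore_index=True)
augmented_labels = pd.concat([labels_clean, labels_clean], axis=0, ignore_index=True)

print(augmented.shape)
```

Concatenating features, IDs, and labels in the same order keeps each row aligned with its label in the augmented set.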

In [11]:
train.head()

Unnamed: 0,Sex_bnry,SibSp_nmbr,Pclass_ord3_DPod_1010_0,Pclass_ord3_DPod_1010_1,Name_hash_0,Name_hash_1,Name_hash_2,Name_hash_3,Name_hash_4,Name_hash_5,...,Cabin_ord3_DPod_1010_1,Cabin_ord3_DPod_1010_2,Cabin_ord3_DPod_1010_3,Cabin_ord3_DPod_1010_4,Cabin_ord3_DPod_1010_5,Cabin_ord3_DPod_1010_6,Cabin_ord3_DPod_1010_7,Embarked_ord3_DPod_1010_0,Embarked_ord3_DPod_1010_1,Embarked_ord3_DPod_1010_2
0,1,0.43255,0,0,241,630,390,346,657,341,...,0,0,0,0,0,0,0,0,0,1
1,1,-0.474279,1,0,766,464,164,155,133,0,...,0,0,0,0,0,0,0,0,0,0
2,1,-0.474279,0,0,210,464,466,723,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,-0.474279,1,0,903,212,184,963,843,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0.43255,0,0,61,212,513,0,0,0,...,0,0,0,0,0,0,0,0,0,1


When it is time to process additional data on the same basis as the train set, we can do so with the postmunge(.) function, using the postprocess_dict returned from the corresponding automunge(.) call. By default postmunge treats data as test data and does not inject noise. If noise injection is desired on additional data, pass traindata=True to signal that the data is to be treated as training data.

In [12]:
test, testID, testlabels, \
labelsencoding_dict, postreports_dict \
= am.postmunge(postprocess_dict, df_test, 
               traindata=False,
               printstatus=False)

In [13]:
test.head()

Unnamed: 0,Sex_bnry,SibSp_nmbr,Pclass_ord3_DPod_1010_0,Pclass_ord3_DPod_1010_1,Name_hash_0,Name_hash_1,Name_hash_2,Name_hash_3,Name_hash_4,Name_hash_5,...,Cabin_ord3_DPod_1010_1,Cabin_ord3_DPod_1010_2,Cabin_ord3_DPod_1010_3,Cabin_ord3_DPod_1010_4,Cabin_ord3_DPod_1010_5,Cabin_ord3_DPod_1010_6,Cabin_ord3_DPod_1010_7,Embarked_ord3_DPod_1010_0,Embarked_ord3_DPod_1010_1,Embarked_ord3_DPod_1010_2
0,1,-0.474279,0,0,571,464,489,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,0.43255,0,0,29,260,489,203,116,0,...,0,0,0,0,0,0,0,0,0,0
2,1,-0.474279,1,0,610,464,110,843,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1,-0.474279,0,0,902,464,368,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0.43255,0,0,15,260,82,269,287,567,...,0,0,0,0,0,0,0,0,0,0


One more demonstration: note that the noise distribution profiles injected into any column can be custom configured by parameter. Automunge allows passing parameters to transformations through the assignparam dictionary. Available parameters for each transformation category and their defaults are documented in the library of transformations section of the [README](https://github.com/Automunge/AutoMunge/blob/master/README.md).

Here we'll demonstrate again applying noise injection to training data, but in this case will configure custom noise profiles.

In [14]:
#If we want to overwrite for the transformation category globally 
#can apply assignparam with a default assignparam entry.

#here sigma is the noise standard deviation and flip_prob refers to the ratio of entries receiving injection

assignparam = \
{'default_assignparam' :
   {'DPnb' : {'sigma' : 0.05, 'flip_prob' : 0.5},
    'DP10' : {'flip_prob' : 0.5}
} }

#Or we can overwrite for a specific column, 
#here we demonstrate applying scaled Laplace distributed noise instead of Gaussian 
#to the 'DPnb' transform application to column 'Fare'.

assignparam.update(
 {'DPnb' : {'Fare' : {'noisedistribution' : 'laplace'}}}
)
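
To make the roles of these parameters concrete, here is a conceptual sketch of how sigma, flip_prob, and noisedistribution could interact on normalized numeric data. The function `inject_numeric_noise` is hypothetical and is not Automunge's implementation. In particular, using sigma directly as the Laplace scale parameter is an assumption of this sketch.

```python
import numpy as np

def inject_numeric_noise(values, sigma, flip_prob,
                         noisedistribution='normal', seed=None):
    """Hypothetical helper mirroring the parameter names used above."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    if noisedistribution == 'normal':
        noise = rng.normal(0.0, sigma, size=values.shape)
    elif noisedistribution == 'laplace':
        # assumption: sigma serves as the Laplace scale parameter here,
        # so the resulting standard deviation is sqrt(2) * sigma
        noise = rng.laplace(0.0, sigma, size=values.shape)
    else:
        raise ValueError(f"unsupported noisedistribution: {noisedistribution}")
    # only a flip_prob share of entries receives noise
    mask = rng.random(values.shape) < flip_prob
    return values + mask * noise

normalized = np.zeros(10000)
gaussian_out = inject_numeric_noise(normalized, sigma=0.05, flip_prob=0.5, seed=0)
laplace_out = inject_numeric_noise(normalized, sigma=0.05, flip_prob=0.5,
                                   noisedistribution='laplace', seed=0)
print(f"share of entries with noise: {np.mean(gaussian_out != 0):.2f}")
```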


In [15]:

#and then we can similarly process the train data with and without noise injection and concatenate

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train,
               df_test = df_train,
               labels_column = labels_column,
               trainID_column = trainID_column,
               assigncat = assigncat,
               assignparam = assignparam,
               printstatus = False)

train = pd.concat([train, test], axis=0, ignore_index=True)
trainID = pd.concat([trainID, testID], axis=0, ignore_index=True)
labels = pd.concat([labels,testlabels], axis=0, ignore_index=True)

More information on tabular data augmentation with Automunge is available in the paper [A Numbers Game: Numeric Encoding Options with Automunge](https://medium.com/automunge/a-numbers-game-b68ac261c40d), particularly in Section 5 and Appendix D.