## Oversampling Demonstration

Automunge is available now for pip install:

In [1]:
# !pip install Automunge

Or to upgrade (we currently roll out upgrades pretty frequently):

In [2]:
# !pip install Automunge --upgrade

Once installed, run this in a local session to initialize:

In [3]:
from Automunge import Automunger
am = Automunger.AutoMunge()

To demonstrate oversampling, we'll create a small toy data set so the results are easy to inspect.

In [4]:
import pandas as pd

toy_df = \
pd.DataFrame({'feature1':[1,2,3,4,5,6], 
              'feature2':[7,6,5,4,3,2],
              'labels'  :[1,1,0,1,0,1]})

toy_df

   feature1  feature2  labels
0         1         7       1
1         2         6       1
2         3         5       0
3         4         4       1
4         5         3       0
5         6         2       1


By inspection, class 0 is underrepresented in the labels of our toy data set: it appears in two rows, versus four rows for class 1.
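On a larger data set we wouldn't want to count by eye, so here is a quick way to tally the label classes with pandas. This is just a sketch that rebuilds the toy data so the cell stands alone; `label_counts` is a name introduced here for illustration.

```python
import pandas as pd

# Same toy data as above, rebuilt so this cell runs on its own.
toy_df = pd.DataFrame({'feature1': [1, 2, 3, 4, 5, 6],
                       'feature2': [7, 6, 5, 4, 3, 2],
                       'labels':   [1, 1, 0, 1, 0, 1]})

# Rows per label class, most frequent class first.
label_counts = toy_df['labels'].value_counts()
print(label_counts)
```

Class 1 shows up four times and class 0 twice, so class 0 is the one that oversampling would need to top up.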

First let's process the data in automunge(.) without oversampling to serve as a basis for comparison. We'll turn off shuffling for clarity.

In [5]:
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(toy_df,
               labels_column = 'labels',
               shuffletrain = False, 
               printstatus = False)

print('train')
print(train)
print()
print('trainID')
print(trainID)
print()
print('labels')
print(labels)

train
   feature1_nmbr  feature2_nmbr  feature1_NArw  feature2_NArw
0      -1.336306       1.336306              0              0
1      -0.801784       0.801784              0              0
2      -0.267261       0.267261              0              0
3       0.267261      -0.267261              0              0
4       0.801784      -0.801784              0              0
5       1.336306      -1.336306              0              0

trainID
   Automunge_index
0                0
1                1
2                2
3                3
4                4
5                5

labels
   labels_0.0  labels_1.0
0           0           1
1           0           1
2           1           0
3           0           1
4           1           0
5           0           1


Now we'll run again, this time activating oversampling with the `TrainLabelFreqLevel` parameter.

In [6]:
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(toy_df,
               labels_column = 'labels',
               shuffletrain = False, 
               TrainLabelFreqLevel = True, 
               printstatus = False)

print('train')
print(train)
print()
print('trainID')
print(trainID)
print()
print('labels')
print(labels)

train
   feature1_nmbr  feature2_nmbr  feature1_NArw  feature2_NArw
0      -1.336306       1.336306              0              0
1      -0.801784       0.801784              0              0
2      -0.267261       0.267261              0              0
3       0.267261      -0.267261              0              0
4       0.801784      -0.801784              0              0
5       1.336306      -1.336306              0              0
2      -0.267261       0.267261              0              0
4       0.801784      -0.801784              0              0

trainID
   Automunge_index
0                0
1                1
2                2
3                3
4                4
5                5
2                2
4                4

labels
   labels_0.0  labels_1.0
0           0           1
1           0           1
2           1           0
3           0           1
4           1           0
5           0           1
2           1           0
4           1           0


Here we see that the rows at index 2 and 4 (the two class 0 rows) were duplicated in the returned sets. Because shuffling is turned off, the copies appear as the bottom two rows of each set.
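The duplication pattern above can be reproduced with a few lines of plain pandas. To be clear, this is only a sketch and not Automunge's implementation: `oversample_by_duplication` is a hypothetical helper, and the rounding rule for how many copies to add is an assumption chosen because it is consistent with the outputs shown in this notebook.

```python
import pandas as pd

def oversample_by_duplication(df, label_col):
    """Hypothetical helper: append whole copies of less frequent classes.

    Assumption (not taken from Automunge's source): each class is repeated
    a whole number of times, chosen so that its row count lands as close
    as possible to the count of the most frequent class.
    """
    counts = df[label_col].value_counts()
    max_count = counts.max()
    extras = []
    for value, count in counts.items():
        # Nearest whole multiple, halves rounding up: 4/2 -> 2, 3/2 -> 2.
        multiplier = int(max_count / count + 0.5)
        class_rows = df[df[label_col] == value]
        # multiplier - 1 extra copies, appended after the original rows.
        extras.extend([class_rows] * (multiplier - 1))
    return pd.concat([df] + extras)

toy_df = pd.DataFrame({'feature1': [1, 2, 3, 4, 5, 6],
                       'feature2': [7, 6, 5, 4, 3, 2],
                       'labels':   [1, 1, 0, 1, 0, 1]})

oversampled = oversample_by_duplication(toy_df, 'labels')
print(oversampled.index.tolist())  # [0, 1, 2, 3, 4, 5, 2, 4]
```

As in the automunge(.) output, the original index values are kept on the duplicated rows, so the copies can always be traced back to their source rows.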

Note that oversampling is also available for regression applications by preparing numeric labels with transformation sets that include aggregated bins. Let's create another toy data set to illustrate.

In [7]:
import pandas as pd

toy_df2 = \
pd.DataFrame({'feature1':[1,2,3,4,5,6], 
              'feature2':[7,6,5,4,3,2],
              'labels'  :[3,3,6,9,3,6]})

toy_df2

   feature1  feature2  labels
0         1         7       3
1         2         6       3
2         3         5       6
3         4         4       9
4         5         3       3
5         6         2       6


Now we can assign our numeric labels to a transform in assigncat. Here we'll apply 'exc3', which supplements pass-through data with aggregated standard deviation bins. Similarly, 'exc4' would prepare power of ten bins.
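To make "standard deviation bins" a little more concrete, here is a small pandas sketch that assigns each numeric label to a bin based on how many standard deviations it sits from the mean. The bin edges (whole standard deviations from -2 to +2, with open-ended tails) and the use of the sample standard deviation are assumptions for illustration, not taken from Automunge's source; `z` and `bin_id` are names introduced here.

```python
import numpy as np
import pandas as pd

labels = pd.Series([3, 3, 6, 9, 3, 6], name='labels')

# Distance from the mean, measured in standard deviations.
z = (labels - labels.mean()) / labels.std()

# Assumed edges: six bins, the outer two catching anything beyond 2 std.
edges = [-np.inf, -2, -1, 0, 1, 2, np.inf]
bin_id = pd.cut(z, bins=edges, labels=False)

print(bin_id.tolist())  # [2, 2, 3, 4, 2, 3]
```

With bins like these in place, a continuous label can be treated much like a class label for the purpose of counting how often each range of values occurs.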

In [8]:
assigncat = {'exc3':['labels']}

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(toy_df2,
               labels_column = 'labels',
               shuffletrain = False, 
               TrainLabelFreqLevel = True, 
               assigncat = assigncat, 
               printstatus = False)

print('train')
print(train.to_string())
print()
print('trainID')
print(trainID.to_string())


train
   feature1_nmbr  feature2_nmbr  feature1_NArw  feature2_NArw
0      -1.336306       1.336306              0              0
1      -0.801784       0.801784              0              0
2      -0.267261       0.267261              0              0
3       0.267261      -0.267261              0              0
4       0.801784      -0.801784              0              0
5       1.336306      -1.336306              0              0
2      -0.267261       0.267261              0              0
5       1.336306      -1.336306              0              0
3       0.267261      -0.267261              0              0
3       0.267261      -0.267261              0              0

trainID
   Automunge_index
0                0
1                1
2                2
3                3
4                4
5                5
2                2
5                5
3                3
3                3


In [9]:

pd.set_option('display.max_rows', labels.shape[0])

labels

   labels_exc2  labels_exc2_bins_0  labels_exc2_bins_1  labels_exc2_bins_2  labels_exc2_bins_3  labels_exc2_bins_4  labels_exc2_bins_5
0          3.0                   0                   0                   1                   0                   0                   0
1          3.0                   0                   0                   1                   0                   0                   0
2          6.0                   0                   0                   0                   1                   0                   0
3          9.0                   0                   0                   0                   0                   1                   0
4          3.0                   0                   0                   1                   0                   0                   0
5          6.0                   0                   0                   0                   1                   0                   0
2          6.0                   0                   0                   0                   1                   0                   0
5          6.0                   0                   0                   0                   1                   0                   0
3          9.0                   0                   0                   0                   0                   1                   0
3          9.0                   0                   0                   0                   0                   1                   0
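A quick tally of the bin columns shows how the rows ended up distributed. The sketch below copies the three populated bin columns from the labels output above into a small DataFrame so it runs on its own; `onehot` and `rows_per_bin` are names introduced here for illustration.

```python
import pandas as pd

# Populated one-hot bin columns copied from the labels output above
# (bins 0, 1 and 5 are all zeros and are left out for brevity).
onehot = pd.DataFrame(
    {'labels_exc2_bins_2': [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
     'labels_exc2_bins_3': [0, 0, 1, 0, 0, 1, 1, 1, 0, 0],
     'labels_exc2_bins_4': [0, 0, 0, 1, 0, 0, 0, 0, 1, 1]},
    index=[0, 1, 2, 3, 4, 5, 2, 5, 3, 3])

# Column sums of a one-hot encoding give the number of rows per bin.
rows_per_bin = onehot.sum()
print(rows_per_bin.tolist())  # [3, 4, 3]
```

Before oversampling the three bins held 3, 2 and 1 rows. Afterwards they hold 3, 4 and 3, so in this example the levelling is approximate rather than exact, since whole rows are duplicated. In a live session the same tally should be available with something like `labels.filter(like='_bins_').sum()`.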


Note that oversampling can also be applied to test data as long as labels are included. You can pass `TrainLabelFreqLevel = 'traintest'` to apply to both train and test data passed to automunge(.) or `TrainLabelFreqLevel = 'test'` to just apply to automunge(.) test data. Similarly you can apply oversampling in the postmunge function by passing `TrainLabelFreqLevel = True`.

In [10]:
test, testID, testlabels, \
labelsencoding_dict, postreports_dict \
= am.postmunge(postprocess_dict, 
               toy_df2,
               TrainLabelFreqLevel = True,
               printstatus = False,
              )

test

   feature1_nmbr  feature2_nmbr  feature1_NArw  feature2_NArw
0      -1.336306       1.336306              0              0
1      -0.801784       0.801784              0              0
2      -0.267261       0.267261              0              0
3       0.267261      -0.267261              0              0
4       0.801784      -0.801784              0              0
5       1.336306      -1.336306              0              0
2      -0.267261       0.267261              0              0
5       1.336306      -1.336306              0              0
3       0.267261      -0.267261              0              0
3       0.267261      -0.267261              0              0


Voila