# Data Augmentation Demonstration

Automunge is available now for pip install:

In [1]:
# !pip install Automunge

Or to upgrade (we currently roll out upgrades pretty frequently):

In [2]:
# !pip install Automunge --upgrade

Once installed, run this in a local session to initialize:

In [3]:
from Automunge import Automunger
am = Automunger.AutoMunge()

We'll demonstrate data augmentation on the Titanic set, a common benchmark. Importantly, in order to perform feature importance we'll need to include and designate a target label column.

In [4]:
import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#designate the label and ID columns
labels_column = 'Survived'
trainID_column = 'PassengerId'

In [5]:
df_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


We can then perform feature importance in automunge(.) via the featureselection parameter. The results are included about midway through the printouts and are also available for inspection in the returned featureimportance dictionary.

Note that automunge(.) performs feature importance by way of shuffle permutation, and relies on the same ML architecture used for ML infill, which in default configuration is Random Forest.
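To make the idea of shuffle permutation concrete, here is a minimal sketch written independently of Automunge. It assumes a toy nearest-centroid classifier as a dependency-free stand-in for Random Forest, uses synthetic data, and scores on the training rows for brevity. All function names here are hypothetical and are not part of the Automunge API.

```python
import random

def fit_centroids(rows, labels):
    """Toy stand-in model (assumption): mean feature vector per class."""
    sums, counts = {}, {}
    for row, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(centroids, row):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    return min(centroids,
               key=lambda y: sum((v - c) ** 2 for v, c in zip(row, centroids[y])))

def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(centroids, rows, labels, rng):
    """Return base accuracy and the accuracy drop after shuffling each column."""
    base = accuracy(centroids, rows, labels)
    drops = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between this column and the labels
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(base - accuracy(centroids, shuffled, labels))
    return base, drops

rng = random.Random(0)
labels = [i % 2 for i in range(200)]
# Synthetic data: column 0 tracks the label, column 1 is pure noise.
rows = [[y + rng.gauss(0, 0.3), rng.gauss(0, 1)] for y in labels]

model = fit_centroids(rows, labels)
base, drops = permutation_importance(model, rows, labels, rng)
print("base accuracy:", base)
print("accuracy drop per column:", drops)
```

The model is trained once and never refit. Only the evaluation data is perturbed, so a large accuracy drop indicates that the trained model relied on that column.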

In [6]:
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train,
               labels_column = labels_column,
               trainID_column = trainID_column,
               featureselection = True,
               printstatus = True)

_______________
Begin Feature Importance evaluation

_______________
Begin Automunge processing

evaluating column:  Pclass
processing column:  Pclass
    root category:  text
 returned columns:
['Pclass_1.0', 'Pclass_2.0', 'Pclass_3.0']

evaluating column:  Name
processing column:  Name
    root category:  hash
 returned columns:
['Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13']

evaluating column:  Sex
processing column:  Sex
    root category:  bnry
 returned columns:
['Sex_bnry']

evaluating column:  Age
processing column:  Age
    root category:  nmbr
 returned columns:
['Age_nmbr']

evaluating column:  SibSp
processing column:  SibSp
    root category:  nmbr
 returned columns:
['SibSp_nmbr']

evaluating column:  Parch
processing column:  Parch
    root category:  nmbr
 returned columns:
['Parch_nmbr']

evaluating colu

Base Accuracy of feature importance model:
0.8491620111731844

_______________
Evaluating feature importances

_______________
Feature Importance results:

Sex_bnry
metric =  0.08938547486033521
metric2 =  0.0

Age_nmbr
metric =  0.04469273743016755
metric2 =  0.0

SibSp_nmbr
metric =  0.011173184357541888
metric2 =  0.0

Parch_nmbr
metric =  0.0
metric2 =  0.0

Fare_nmbr
metric =  0.03910614525139666
metric2 =  0.0

Pclass_1.0
metric =  0.016759776536312887
metric2 =  0.011173184357541888

Pclass_2.0
metric =  0.016759776536312887
metric2 =  0.005586592178770888

Pclass_3.0
metric =  0.016759776536312887
metric2 =  -0.005586592178770999

Name_hash_0
metric =  0.08379888268156421
metric2 =  0.08379888268156421

Name_hash_1
metric =  0.08379888268156421
metric2 =  0.0

Name_hash_2
metric =  0.08379888268156421
metric2 =  0.07821229050279332

Name_hash_3
metric =  0.08379888268156421
metric2 =  0.07821229050279332

Name_hash_4
metric =  0.08379888268156421
metric2 =  0.07262569832402233


The results included in the printouts are first presented in order of features and then again sorted by the metric, from most important to least.

Here "metric" measures the importance of the source column and a higher value implies higher importance.

And "metric2" measures the relative importance of features derived from the same source column and a lower value implies higher relative importance.
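One way to read these two metrics is sketched below. It assumes that "metric" is the accuracy drop when every column derived from a source column is shuffled together, and that "metric2" is the drop when only the sibling columns are shuffled while the evaluated column is left intact. This reading is consistent with the printouts above, where single-column sources report metric2 = 0.0, but it is an assumption rather than a copy of the library's implementation. The helper names, the fixed prediction rule, and the data are all hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def shuffle_columns(rows, cols, rng):
    """Return a copy of rows with each listed column independently permuted."""
    out = [list(r) for r in rows]
    for j in cols:
        values = [r[j] for r in out]
        rng.shuffle(values)
        for r, v in zip(out, values):
            r[j] = v
    return out

def source_metrics(predict, rows, labels, groups, rng):
    """groups maps a source column name to the indices of its derived columns."""
    base = accuracy(predict, rows, labels)
    results = {}
    for source, cols in groups.items():
        # metric (assumed): shuffle every column derived from this source.
        metric = base - accuracy(predict, shuffle_columns(rows, cols, rng), labels)
        for col in cols:
            # metric2 (assumed): shuffle the siblings, leave this column intact.
            siblings = [c for c in cols if c != col]
            metric2 = base - accuracy(predict, shuffle_columns(rows, siblings, rng), labels)
            results[col] = {"source": source, "metric": metric, "metric2": metric2}
    return results

rng = random.Random(0)
# Hypothetical data: column 0 is numeric noise, columns 1 and 2 are a one-hot pair.
cats = [rng.choice([0, 1]) for _ in range(200)]
rows = [[rng.random(), c, 1 - c] for c in cats]
labels = cats

# A fixed rule that reads only column 1 stands in for a trained model.
results = source_metrics(lambda r: r[1], rows, labels, {"num": [0], "cat": [1, 2]}, rng)
for col, res in results.items():
    print(col, res)
```

In this sketch column 1 carries all of the signal, so shuffling only its sibling costs nothing and its metric2 is zero, while column 2 receives a large metric2 because shuffling column 1 destroys the predictions. That matches the rule that a lower metric2 implies higher relative importance.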

Keep in mind that feature importance is as much a measure of the model as it is of the feature.

We can also inspect results in the returned featureimportance dictionary.

Here featureimportance['FS_sorted'] holds the results sorted by the metric, and featureimportance['FScolumn_dict'] holds the raw unsorted data. Each aggregation includes the data in a few different formats.

In [9]:
#Here are the sorted metric results 

featureimportance['FS_sorted']['metric_key']

{0.08938547486033521: ['Sex'],
 0.08379888268156421: ['Name'],
 0.05027932960893855: ['Embarked'],
 0.04469273743016755: ['Age'],
 0.03910614525139666: ['Fare'],
 0.016759776536312887: ['Pclass', 'Ticket'],
 0.011173184357541888: ['SibSp', 'Cabin'],
 0.0: ['Parch']}

In [11]:
#or as an inversion of that presentation

featureimportance['FS_sorted']['column_key']

{'Sex': 0.08938547486033521,
 'Name': 0.08379888268156421,
 'Embarked': 0.05027932960893855,
 'Age': 0.04469273743016755,
 'Fare': 0.03910614525139666,
 'Pclass': 0.016759776536312887,
 'Ticket': 0.016759776536312887,
 'SibSp': 0.011173184357541888,
 'Cabin': 0.011173184357541888,
 'Parch': 0.0}
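As a sketch of how these results might be used downstream, the snippet below ranks source columns and filters them by a threshold. To stay self-contained it works on a literal dictionary copied by hand from the 'column_key' output above rather than on a live featureimportance object. The helper names and the 0.03 threshold are hypothetical choices for illustration.

```python
# Values copied from the featureimportance['FS_sorted']['column_key'] output above.
column_key = {
    'Sex': 0.08938547486033521,
    'Name': 0.08379888268156421,
    'Embarked': 0.05027932960893855,
    'Age': 0.04469273743016755,
    'Fare': 0.03910614525139666,
    'Pclass': 0.016759776536312887,
    'Ticket': 0.016759776536312887,
    'SibSp': 0.011173184357541888,
    'Cabin': 0.011173184357541888,
    'Parch': 0.0,
}

def top_sources(column_key, k):
    """Return the k source columns with the highest metric (ties keep dict order)."""
    ranked = sorted(column_key.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def sources_above(column_key, threshold):
    """Return source columns whose metric exceeds the threshold."""
    return [name for name, score in column_key.items() if score > threshold]

def invert_to_metric_key(column_key):
    """Group source columns by metric value, mirroring the 'metric_key' layout."""
    metric_key = {}
    for name, score in column_key.items():
        metric_key.setdefault(score, []).append(name)
    return metric_key

print(top_sources(column_key, 3))        # ['Sex', 'Name', 'Embarked']
print(sources_above(column_key, 0.03))   # ['Sex', 'Name', 'Embarked', 'Age', 'Fare']
print(invert_to_metric_key(column_key)[0.016759776536312887])  # ['Pclass', 'Ticket']
```

Any cutoff chosen this way is a judgment call: a column with a metric of 0.0, such as Parch here, had no measurable effect on this particular model's accuracy, which is not the same as carrying no information.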