# Inversion Demonstration

This notebook demonstrates a few variations on the inversion operation, in which data returned from Automunge is inverted to recover its original form prior to transformations. This type of operation may be useful, for instance, to recover the original form of labels after an inference operation, or to otherwise recover the original form of train or test data.

----

Automunge is available now for pip install:

In [1]:
# !pip install Automunge

Or to upgrade (we currently roll out upgrades pretty frequently):

In [2]:
# !pip install Automunge --upgrade

Once installed, run this in a local session to initialize:

In [3]:
from Automunge import AutoMunge
am = AutoMunge()

We'll demonstrate inversion on the Titanic set, a common benchmark.

In [4]:
import pandas as pd

#titanic set
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

#labels and ID columns
labels_column = 'Survived'
trainID_column = 'PassengerId'

In [5]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In order to demonstrate inversion, first we'll apply a forward pass of transformations.

In [6]:
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train,
               labels_column = labels_column,
               trainID_column = trainID_column,
               shuffletrain = False,
               printstatus = False)

In [7]:
train.head()

Unnamed: 0,Sex_bnry,Age_nmbr,SibSp_nmbr,Parch_nmbr,Fare_nmbr,Cabin_hash,Pclass_1.0,Pclass_2.0,Pclass_3.0,Name_hash_0,...,Name_hash_9,Name_hash_10,Name_hash_11,Name_hash_12,Name_hash_13,Ticket_hash_0,Ticket_hash_1,Ticket_hash_2,Embarked_1010_0,Embarked_1010_1
0,1,-0.592148,0.43255,-0.473408,-0.502163,117,0,0,1,91,...,0,0,0,0,0,568,721,0,1,0
1,0,0.63843,0.43255,-0.473408,0.786404,276,1,0,0,898,...,0,0,0,0,0,464,708,0,0,0
2,0,-0.284503,-0.474279,-0.473408,-0.48858,117,0,0,1,465,...,0,0,0,0,0,429,818,0,1,0
3,0,0.407697,0.43255,-0.473408,0.420494,281,1,0,0,990,...,0,0,0,0,0,988,0,0,1,0
4,1,0.407697,-0.474279,-0.473408,-0.486064,117,0,0,1,449,...,0,0,0,0,0,685,0,0,1,0


In [8]:
labels.head()

Unnamed: 0,Survived_0.0,Survived_1.0
0,1,0
1,0,1
2,0,1
3,0,1
4,1,0


Notice that hashing is performed for unbounded categoric sets, such as Name and Cabin. We'll see below that inversion is not supported for those specific transforms.
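
To see why a hashed column can't be inverted, here is a minimal sketch of the general hashing trick. It is not Automunge's actual implementation: the `hash_bucket` helper, the use of MD5, and the bucket count of 3 are all assumptions chosen for illustration. Because several distinct strings can land in the same bucket, the bucket number alone doesn't determine the original entry.

```python
import hashlib

def hash_bucket(value, n_buckets):
    """Map a string to an integer bucket (hypothetical helper, not Automunge's hash)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

#four names from the Titanic set, hashed into only three buckets
names = ['Braund, Mr. Owen Harris',
         'Heikkinen, Miss. Laina',
         'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
         'Allen, Mr. William Henry']
n_buckets = 3
buckets = [hash_bucket(name, n_buckets) for name in names]

#group the names by the bucket they landed in
grouped = {}
for name, bucket in zip(names, buckets):
    grouped.setdefault(bucket, []).append(name)

#with more names than buckets, at least one bucket must hold several names,
#so the bucket number can't be mapped back to a unique name
shared = {bucket: members for bucket, members in grouped.items() if len(members) > 1}
print(shared)
```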

When it comes time to invert, we pass the postprocess_dict together with the target set to postmunge and set the inversion parameter.

The inversion parameter can be passed as:
- False for no inversion
- 'test' to invert a train or test set
- 'labels' to invert a labels set
- 'denselabels' to redundantly invert all configurations of a labels set (this one is kind of esoteric)
- or a list of column headers to only invert a subset of columns

Note that, as kind of a quirk, postmunge returns only three sets when inversion is applied (named df_invert, recovered_list, and inversion_info_dict here).

In [9]:
#invert the train set using the postprocess_dict from the forward pass
df_invert, recovered_list, inversion_info_dict = \
am.postmunge(postprocess_dict, train,
             inversion='test',
             printstatus=True)

_______________
Begin Postmunge processing

Evaluating inversion paths for columns derived from:  Sex
Inversion path selected based on returned column  Sex_bnry
With full recovery.
Recovered source column:  Sex

Evaluating inversion paths for columns derived from:  Age
Inversion path selected based on returned column  Age_nmbr
With full recovery.
Recovered source column:  Age

Evaluating inversion paths for columns derived from:  SibSp
Inversion path selected based on returned column  SibSp_nmbr
With full recovery.
Recovered source column:  SibSp

Evaluating inversion paths for columns derived from:  Parch
Inversion path selected based on returned column  Parch_nmbr
With full recovery.
Recovered source column:  Parch

Evaluating inversion paths for columns derived from:  Fare
Inversion path selected based on returned column  Fare_nmbr
With full recovery.
Recovered source column:  Fare

Evaluating inversion paths for columns derived from:  Cabin
No inversion path available for source co

In [10]:
df_invert.head()

Unnamed: 0,Sex,Age,SibSp,Parch,Fare,Pclass,Embarked
0,male,22.0,1.0,0.0,7.250002,3.0,S
1,female,38.0,1.0,0.0,71.283295,1.0,C
2,female,26.0,5.960464e-08,0.0,7.925001,3.0,S
3,female,35.0,1.0,0.0,53.099998,1.0,S
4,male,35.0,5.960464e-08,0.0,8.050001,3.0,S


Note that the columns without successful inversion (such as 'Name' and 'Cabin') are not included in the returned set.

Similarly, we can invert a labels set. This might be useful for the sets returned from an inference operation. 

In [11]:
#invert the labels set using the same postprocess_dict
df_invert, recovered_list, inversion_info_dict = \
am.postmunge(postprocess_dict, labels,
             inversion='labels',
             printstatus=True)

_______________
Begin Postmunge processing

Performing inversion recovery of original columns for label set.

Evaluating inversion paths for columns derived from:  Survived
Inversion path selected based on returned column  Survived_0.0
With full recovery.
Recovered source column:  Survived

Inversion succeeded in recovering original form for columns:
['Survived']



In [12]:
df_invert.head()

Unnamed: 0,Survived
0,0.0
1,1.0
2,1.0
3,1.0
4,0.0
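
Conceptually, inverting a one-hot encoded label set means finding the activated column in each row and mapping it back to the category in its suffix. The sketch below illustrates that idea in plain Python. `invert_onehot` is a hypothetical helper, and it assumes the returned header is the source column name, an underscore, and the category value, as seen in the labels above. It isn't how postmunge implements inversion internally.

```python
label_columns = ['Survived_0.0', 'Survived_1.0']

#first five rows of the labels set shown above
label_rows = [[1, 0], [0, 1], [0, 1], [0, 1], [1, 0]]

def invert_onehot(row, columns, source_column):
    """Return the category encoded by the largest activation in a one-hot row."""
    #index of the activated column (taking the max also handles predicted probabilities)
    activated = max(range(len(row)), key=lambda i: row[i])
    #strip the 'Survived_' prefix to get the category, e.g. '1.0'
    category = columns[activated][len(source_column) + 1:]
    return float(category)

recovered = [invert_onehot(row, label_columns, 'Survived') for row in label_rows]
print(recovered)
```

Taking the largest activation rather than looking for an exact 1 means the same helper would also work on class probabilities returned from a model.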


We'll demonstrate two more quick things.

- When we only want to invert a subset of the columns we can pass those targets as a list.
- Note that the targets for inversion can also be passed as numpy arrays instead of dataframes. So column headers are not required, but the correct order of transformed columns must be intact.
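
Order matters because, without headers, each column can only be identified by its position. Here is a minimal sketch of that idea using plain Python lists. It is an illustration only: the three-column subset and its values are copied from the train.head() output above, and nothing here is Automunge code.

```python
columns = ['Sex_bnry', 'Age_nmbr', 'Fare_nmbr']
rows = [
    {'Sex_bnry': 1, 'Age_nmbr': -0.592148, 'Fare_nmbr': -0.502163},
    {'Sex_bnry': 0, 'Age_nmbr': 0.63843, 'Fare_nmbr': 0.786404},
]

#dropping the headers leaves only positional values, like a numpy array
array_rows = [[row[c] for c in columns] for row in rows]

#headers can be reattached by position as long as the order is unchanged
restored = [dict(zip(columns, values)) for values in array_rows]

#if the columns were reordered, the same positional mapping mislabels the values
shuffled_rows = [[values[2], values[1], values[0]] for values in array_rows]
mislabeled = [dict(zip(columns, values)) for values in shuffled_rows]

print(restored == rows)
print(mislabeled == rows)
```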

In [13]:
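#convert to a numpy array, which drops the column headers
np_train = train.to_numpy()

#invert only the columns derived from Age and Fare
df_invert, recovered_list, inversion_info_dict = \
am.postmunge(postprocess_dict, np_train,
             inversion=['Age', 'Fare'],
             printstatus=True)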

_______________
Begin Postmunge processing

Evaluating inversion paths for columns derived from:  Age
Inversion path selected based on returned column  Age_nmbr
With full recovery.
Recovered source column:  Age

Evaluating inversion paths for columns derived from:  Fare
Inversion path selected based on returned column  Fare_nmbr
With full recovery.
Recovered source column:  Fare

Inversion succeeded in recovering original form for columns:
['Age', 'Fare']



In [14]:
df_invert[['Age', 'Fare']].head()

Unnamed: 0,Age,Fare
0,22.0,7.250002
1,38.0,71.283295
2,26.0,7.925001
3,35.0,53.099998
4,35.0,8.050001


In [15]:
#For comparison this is the original data

df_train[['Age', 'Fare']].head()

Unnamed: 0,Age,Fare
0,22.0,7.25
1,38.0,71.2833
2,26.0,7.925
3,35.0,53.1
4,35.0,8.05
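
The recovered values match the originals up to small round-off differences (7.250002 versus 7.25). A plausible explanation, offered as an assumption since the notebook doesn't state the dtype or the formula, is that the nmbr transform is a z-score normalization and the normalized values are stored at 32-bit precision. The sketch below simulates that with the standard library. `to_float32` is a hypothetical helper, and the mean and standard deviation are computed from just these five fares rather than the full training set.

```python
import math
import statistics
import struct

def to_float32(x):
    """Round a Python float to the nearest 32-bit float (simulates float32 storage)."""
    return struct.unpack('f', struct.pack('f', x))[0]

#first five Fare values from the original data
fares = [7.25, 71.2833, 7.925, 53.1, 8.05]

mean = statistics.mean(fares)
std = statistics.pstdev(fares)

#forward pass: z-score normalization, stored at reduced precision
normalized = [to_float32((x - mean) / std) for x in fares]

#inversion: undo the scaling, then the shift
recovered = [z * std + mean for z in normalized]

#compare with a tolerance rather than exact equality
max_error = max(abs(r - x) for r, x in zip(recovered, fares))
all_close = all(math.isclose(r, x, abs_tol=1e-5) for r, x in zip(recovered, fares))
print(all_close, max_error)
```

When checking an inversion against the source data, a tolerance-based comparison like this is usually a safer choice than testing for exact equality.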
