# Transformation Specification Workflows

Automunge is available now for pip install:

In [1]:
# !pip install Automunge

Or to upgrade (we currently roll out upgrades pretty frequently):

In [2]:
# !pip install Automunge --upgrade

Once installed, run this in a local session to initialize:

In [3]:
from Automunge import Automunger
am = Automunger.AutoMunge()

This notebook will walk through a few workflow variations for specifying transformations and transformation sets to target features. We'll cover the following scenarios:
1. transformations under automation
2. mixed automation and specification
3. overwriting defaults under automation
4. excluding features from automation
5. custom transformation sets

To demonstrate, let's encode the Ames Housing set, a well known tabular benchmark:

In [4]:
import pandas as pd

#housing set
df_train = pd.read_csv('housing_train.csv')
df_test = pd.read_csv('housing_test.csv')

labels_column = 'SalePrice'
trainID_column = 'Id'

Here is what the data looks like in a raw form.

In [5]:
pd.set_option('display.max_columns', 300)
df_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


## 1. Automation

Under automation we can simply pass the train set (and if available also the test set) to the automunge(.) function.

This will result in the following:
- z-score normalization of numeric sets
- binarization of bounded categoric sets
- hashing of unbounded categoric sets
- encoding of date-time entries

Each of these transforms is fit to properties of the entries found in the train set, so that test data can be processed on a consistent basis, whether that test data is passed to the automunge(.) function or passed later to the postmunge(.) function.
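
As a rough illustration of what "fit to the train set" means for the z-score case, the sketch below derives a mean and standard deviation from a toy train column and reuses them on test data. This is a plain pandas approximation rather than Automunge's internal code; the helper names and toy values are assumptions for illustration.

```python
import pandas as pd

def fit_zscore(train_col):
    # learn normalization parameters from the train column only
    mean = train_col.mean()
    std = train_col.std()
    # guard against a constant column, where std would be 0
    if std == 0:
        std = 1.0
    return {'mean': mean, 'std': std}

def apply_zscore(col, params):
    # reuse the train-derived parameters on any later data
    return (col - params['mean']) / params['std']

toy_train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
toy_test = pd.Series([3.0, 6.0])

params = fit_zscore(toy_train)
train_z = apply_zscore(toy_train, params)
test_z = apply_zscore(toy_test, params)
```

Because the test column is scaled with the train parameters, a test value equal to the train mean maps to zero regardless of how the test set itself is distributed.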

In [6]:
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train,
               df_test = df_test,
               labels_column = labels_column,
               trainID_column = trainID_column,
               printstatus = False
              )

train.head()

Unnamed: 0,MSSubClass_nmbr,LotFrontage_nmbr,LotArea_nmbr,Street_bnry,Alley_bnry,Utilities_bnry,OverallQual_nmbr,OverallCond_nmbr,YearBuilt_nmbr,YearRemodAdd_nmbr,MasVnrArea_nmbr,BsmtFinSF1_nmbr,BsmtFinSF2_nmbr,BsmtUnfSF_nmbr,TotalBsmtSF_nmbr,CentralAir_bnry,1stFlrSF_nmbr,2ndFlrSF_nmbr,LowQualFinSF_nmbr,GrLivArea_nmbr,BsmtFullBath_nmbr,FullBath_nmbr,BedroomAbvGr_nmbr,KitchenAbvGr_nmbr,TotRmsAbvGrd_nmbr,Fireplaces_nmbr,GarageYrBlt_nmbr,GarageCars_nmbr,GarageArea_nmbr,WoodDeckSF_nmbr,OpenPorchSF_nmbr,EnclosedPorch_nmbr,3SsnPorch_nmbr,ScreenPorch_nmbr,PoolArea_nmbr,MiscVal_nmbr,MoSold_nmbr,YrSold_nmbr,MSZoning_1010_0,MSZoning_1010_1,MSZoning_1010_2,LotShape_1010_0,LotShape_1010_1,LotShape_1010_2,LandContour_1010_0,LandContour_1010_1,LandContour_1010_2,LotConfig_1010_0,LotConfig_1010_1,LotConfig_1010_2,LandSlope_1010_0,LandSlope_1010_1,Neighborhood_1010_0,Neighborhood_1010_1,Neighborhood_1010_2,Neighborhood_1010_3,Neighborhood_1010_4,Condition1_1010_0,Condition1_1010_1,Condition1_1010_2,Condition1_1010_3,Condition2_1010_0,Condition2_1010_1,Condition2_1010_2,Condition2_1010_3,BldgType_1010_0,BldgType_1010_1,BldgType_1010_2,HouseStyle_1010_0,HouseStyle_1010_1,HouseStyle_1010_2,HouseStyle_1010_3,RoofStyle_1010_0,RoofStyle_1010_1,RoofStyle_1010_2,RoofMatl_1010_0,RoofMatl_1010_1,RoofMatl_1010_2,RoofMatl_1010_3,Exterior1st_1010_0,Exterior1st_1010_1,Exterior1st_1010_2,Exterior1st_1010_3,Exterior2nd_1010_0,Exterior2nd_1010_1,Exterior2nd_1010_2,Exterior2nd_1010_3,Exterior2nd_1010_4,MasVnrType_1010_0,MasVnrType_1010_1,MasVnrType_1010_2,ExterQual_1010_0,ExterQual_1010_1,ExterQual_1010_2,ExterCond_1010_0,ExterCond_1010_1,ExterCond_1010_2,Foundation_1010_0,Foundation_1010_1,Foundation_1010_2,BsmtQual_1010_0,BsmtQual_1010_1,BsmtQual_1010_2,BsmtCond_1010_0,BsmtCond_1010_1,BsmtCond_1010_2,BsmtExposure_1010_0,BsmtExposure_1010_1,BsmtExposure_1010_2,BsmtFinType1_1010_0,BsmtFinType1_1010_1,BsmtFinType1_1010_2,BsmtFinType2_1010_0,BsmtFinType2_1010_1,BsmtFinType2_1010_2,Heating_1010_0,Heating_1010_1,Heating_1010_2,HeatingQC_1010_0,HeatingQC_1010_1,HeatingQC_1010_2,Electrical_1010_0,Electrical_1010_1,Electrical_1010_2,BsmtHalfBath_0.0,BsmtHalfBath_1.0,BsmtHalfBath_2.0,HalfBath_0.0,HalfBath_1.0,HalfBath_2.0,KitchenQual_1010_0,KitchenQual_1010_1,KitchenQual_1010_2,Functional_1010_0,Functional_1010_1,Functional_1010_2,FireplaceQu_1010_0,FireplaceQu_1010_1,FireplaceQu_1010_2,GarageType_1010_0,GarageType_1010_1,GarageType_1010_2,GarageFinish_1010_0,GarageFinish_1010_1,GarageQual_1010_0,GarageQual_1010_1,GarageQual_1010_2,GarageCond_1010_0,GarageCond_1010_1,GarageCond_1010_2,PavedDrive_1010_0,PavedDrive_1010_1,PoolQC_1010_0,PoolQC_1010_1,Fence_1010_0,Fence_1010_1,Fence_1010_2,MiscFeature_1010_0,MiscFeature_1010_1,MiscFeature_1010_2,SaleType_1010_0,SaleType_1010_1,SaleType_1010_2,SaleType_1010_3,SaleCondition_1010_0,SaleCondition_1010_1,SaleCondition_1010_2
1183,-0.63586,-0.456318,0.02837,1,1,1,-0.794879,0.381612,-1.697446,-1.68879,-0.574214,0.827366,-0.288554,-0.607062,0.142625,1,-0.084397,-0.794891,-0.120201,-0.733545,1.107431,-1.025689,-1.062101,-0.211381,-0.93381,0.600289,-0.354504,0.311618,1.155352,1.07513,-0.704242,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,-0.119069,-1.367186,0,1,1,0,1,1,0,1,1,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,0,0,0,1,1,0,1,1,0,1,1,0,0,0,1,0,0,1,1,1,0,0,0,0,0,0,1,1,0,1,1,0,1,1,1,0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,1,1,1,0,0,1,0,1,0,1,1,0,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,0,1,0,0
425,0.07335,-0.456318,-0.715223,1,1,1,0.651256,2.178881,-0.836602,0.345561,-0.574214,-0.972685,-0.288554,0.189558,-0.926429,1,-1.178586,0.767436,-0.120201,-0.240663,-0.819683,-1.025689,0.163723,-0.211381,-0.318574,2.151479,-1.313053,-1.026506,-1.089686,-0.751918,-0.704242,1.702345,-0.116299,-0.270116,-0.068668,-0.087658,0.990552,0.891688,1,0,0,0,1,1,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1,1,0,0,1,0,0,1,1,0,1,0,0,0,1,0,1,1,0,1,1,0,1,1,1,0,1,1,0,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,1,0,1,0,0,1,0,0,0,1,1,1,1,0,0,1,0,0,1,0,0,0,1,0,0
143,-0.872264,0.360971,-0.018217,1,1,1,0.651256,-0.517023,0.918196,0.684619,0.439249,0.51603,-0.288554,0.551658,0.986016,1,0.875282,-0.794891,-0.120201,-0.027525,1.107431,0.78947,0.163723,-0.211381,-0.318574,-0.950901,0.854103,0.311618,0.486518,0.396968,-0.266546,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,-0.119069,0.891688,0,1,1,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,1,1,0,0,0,1,1,0,1,0,0,1,0,1,0,1,0,0,0,1,0,0,1,0,0,1,1,0,0,1,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,1,1,0,1,0,1,0,0,1,0,1,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,0,1,0,0
770,0.664358,0.0,-0.327096,1,1,1,-0.794879,-0.517023,0.355336,-0.138808,-0.574214,0.529185,-0.288554,-0.892215,-0.454586,1,-0.787989,-0.794891,-0.120201,-1.251167,1.107431,-1.025689,-1.062101,-0.211381,-0.93381,-0.950901,0.187285,0.311618,0.481841,0.205487,-0.704242,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,-0.858816,0.891688,0,1,1,0,0,0,0,1,1,0,0,1,0,0,1,0,0,1,1,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,1,1,0,0,0,1,1,1,0,1,0,1,1,1,0,0,1,0,0,1,1,1,0,0,0,0,1,0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0,0,1,1,0,0,1,0,0,1,0,0,1,0,0,0,1,1,1,1,0,1,0,1,1,0,1,1,0,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,0,1,0,0
640,1.49177,-0.365508,0.216423,1,1,1,1.374324,-0.517023,1.050634,0.926804,2.039744,1.697793,-0.288554,-0.604798,1.04984,1,0.919257,-0.794891,-0.120201,0.004827,-0.819683,-1.025689,-2.287924,-0.211381,-0.318574,0.600289,1.020807,0.311618,0.537967,0.724081,1.408773,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,-0.858816,0.13873,0,1,1,0,0,0,0,1,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,1,0,0,1,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,1,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,1,0,0,0,1,0,1,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,0,1,0,0


For consistently processing subsequent test data in postmunge(.), we'll need the "postprocess_dict" dictionary returned from automunge(.), which can be saved externally such as with the pickle library (demonstrated in the read me). We can then pass test data to postmunge(.), in a form consistent with the original train data, to be consistently encoded.
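
A minimal sketch of saving and reloading such a dictionary with the standard library pickle module follows. To keep the example self-contained it uses a small stand-in dictionary and a temporary file; in practice the object would be the postprocess_dict returned from automunge(.), and the file location is up to you (both stand-ins here are assumptions).

```python
import os
import pickle
import tempfile

# stand-in for the postprocess_dict returned by am.automunge(...)
stand_in_dict = {'column_map': {'LotArea': ['LotArea_nmbr']}}

# illustrative file location, any writable path works
path = os.path.join(tempfile.gettempdir(), 'postprocess_dict.pickle')

# save the dictionary after the automunge(.) call
with open(path, 'wb') as handle:
    pickle.dump(stand_in_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

# later, in a fresh session, reload it before calling postmunge(.)
with open(path, 'rb') as handle:
    reloaded_dict = pickle.load(handle)
```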

Note that in general the postmunge(.) function will run much quicker than automunge(.) since it doesn't incur the overheads of column evaluations.

This same postmunge(.) function call to process subsequent data holds for all of the other demonstrations in this notebook, so we'll omit further repetitions below for brevity.

In [7]:
test, testID, testlabels, \
labelsencoding_dict, postreports_dict \
= am.postmunge(postprocess_dict, 
               df_test,
               printstatus = False)

test.head()

Unnamed: 0,MSSubClass_nmbr,LotFrontage_nmbr,LotArea_nmbr,Street_bnry,Alley_bnry,Utilities_bnry,OverallQual_nmbr,OverallCond_nmbr,YearBuilt_nmbr,YearRemodAdd_nmbr,MasVnrArea_nmbr,BsmtFinSF1_nmbr,BsmtFinSF2_nmbr,BsmtUnfSF_nmbr,TotalBsmtSF_nmbr,CentralAir_bnry,1stFlrSF_nmbr,2ndFlrSF_nmbr,LowQualFinSF_nmbr,GrLivArea_nmbr,BsmtFullBath_nmbr,FullBath_nmbr,BedroomAbvGr_nmbr,KitchenAbvGr_nmbr,TotRmsAbvGrd_nmbr,Fireplaces_nmbr,GarageYrBlt_nmbr,GarageCars_nmbr,GarageArea_nmbr,WoodDeckSF_nmbr,OpenPorchSF_nmbr,EnclosedPorch_nmbr,3SsnPorch_nmbr,ScreenPorch_nmbr,PoolArea_nmbr,MiscVal_nmbr,MoSold_nmbr,YrSold_nmbr,MSZoning_1010_0,MSZoning_1010_1,MSZoning_1010_2,LotShape_1010_0,LotShape_1010_1,LotShape_1010_2,LandContour_1010_0,LandContour_1010_1,LandContour_1010_2,LotConfig_1010_0,LotConfig_1010_1,LotConfig_1010_2,LandSlope_1010_0,LandSlope_1010_1,Neighborhood_1010_0,Neighborhood_1010_1,Neighborhood_1010_2,Neighborhood_1010_3,Neighborhood_1010_4,Condition1_1010_0,Condition1_1010_1,Condition1_1010_2,Condition1_1010_3,Condition2_1010_0,Condition2_1010_1,Condition2_1010_2,Condition2_1010_3,BldgType_1010_0,BldgType_1010_1,BldgType_1010_2,HouseStyle_1010_0,HouseStyle_1010_1,HouseStyle_1010_2,HouseStyle_1010_3,RoofStyle_1010_0,RoofStyle_1010_1,RoofStyle_1010_2,RoofMatl_1010_0,RoofMatl_1010_1,RoofMatl_1010_2,RoofMatl_1010_3,Exterior1st_1010_0,Exterior1st_1010_1,Exterior1st_1010_2,Exterior1st_1010_3,Exterior2nd_1010_0,Exterior2nd_1010_1,Exterior2nd_1010_2,Exterior2nd_1010_3,Exterior2nd_1010_4,MasVnrType_1010_0,MasVnrType_1010_1,MasVnrType_1010_2,ExterQual_1010_0,ExterQual_1010_1,ExterQual_1010_2,ExterCond_1010_0,ExterCond_1010_1,ExterCond_1010_2,Foundation_1010_0,Foundation_1010_1,Foundation_1010_2,BsmtQual_1010_0,BsmtQual_1010_1,BsmtQual_1010_2,BsmtCond_1010_0,BsmtCond_1010_1,BsmtCond_1010_2,BsmtExposure_1010_0,BsmtExposure_1010_1,BsmtExposure_1010_2,BsmtFinType1_1010_0,BsmtFinType1_1010_1,BsmtFinType1_1010_2,BsmtFinType2_1010_0,BsmtFinType2_1010_1,BsmtFinType2_1010_2,Heating_1010_0,Heating_1010_1,Heating_1010_2,HeatingQC_1010_0,HeatingQC_1010_1,HeatingQC_1010_2,Electrical_1010_0,Electrical_1010_1,Electrical_1010_2,BsmtHalfBath_0.0,BsmtHalfBath_1.0,BsmtHalfBath_2.0,HalfBath_0.0,HalfBath_1.0,HalfBath_2.0,KitchenQual_1010_0,KitchenQual_1010_1,KitchenQual_1010_2,Functional_1010_0,Functional_1010_1,Functional_1010_2,FireplaceQu_1010_0,FireplaceQu_1010_1,FireplaceQu_1010_2,GarageType_1010_0,GarageType_1010_1,GarageType_1010_2,GarageFinish_1010_0,GarageFinish_1010_1,GarageQual_1010_0,GarageQual_1010_1,GarageQual_1010_2,GarageCond_1010_0,GarageCond_1010_1,GarageCond_1010_2,PavedDrive_1010_0,PavedDrive_1010_1,PoolQC_1010_0,PoolQC_1010_1,Fence_1010_0,Fence_1010_1,Fence_1010_2,MiscFeature_1010_0,MiscFeature_1010_1,MiscFeature_1010_2,SaleType_1010_0,SaleType_1010_1,SaleType_1010_2,SaleType_1010_3,SaleCondition_1010_0,SaleCondition_1010_1,SaleCondition_1010_2
0,-0.872264,0.451781,0.110725,1,1,1,-0.794879,0.381612,-0.339961,-1.155984,-0.574214,0.05341,0.604086,-0.672692,-0.39988,1,-0.689693,-0.794891,-0.120201,-1.178852,-0.819683,-1.025689,-1.062101,-0.211381,-0.93381,-0.950901,-0.729588,-1.026506,1.202124,0.365054,-0.704242,-0.359202,-0.116299,1.882064,-0.068668,-0.087658,-0.119069,1.644646,0,1,0,0,1,1,0,1,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,1,1,0,0,0,1,1,0,1,0,1,0,0,1,1,1,0,0,0,0,1,0,1,1,0,1,1,0,1,1,1,0,0,0,1,1,0,0,1,1,0,0,1,0,0,1,0,0,1,0,0,0,1,1,1,1,0,1,0,1,0,0,1,1,0,1,0,0,1,0,0,1,0,1,1,0,1,0,1,0,0,1,0,0,0,1,0,0
1,-0.872264,0.497186,0.375721,1,1,1,-0.071812,0.381612,-0.439289,-1.301294,0.023895,1.051003,-0.288554,-0.364907,0.619027,1,0.430364,-0.794891,-0.120201,-0.354844,-0.819683,-1.025689,0.163723,-0.211381,-0.318574,-0.950901,-0.854616,-1.026506,-0.75293,2.383584,-0.160895,-0.359202,-0.116299,-0.270116,-0.068668,25.107706,-0.119069,1.644646,0,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,0,0,0,1,1,1,0,1,0,1,1,1,0,0,0,1,0,1,1,1,0,0,0,0,1,0,1,1,0,1,1,0,1,1,0,0,0,1,0,1,0,0,1,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,1,0,1,0,1,0,0,1,1,0,1,0,0,1,0,0,1,0,1,1,1,0,0,0,0,0,1,0,0,0,1,0,0
2,0.07335,0.179352,0.331939,1,1,1,-0.794879,-0.517023,0.851977,0.636182,-0.574214,0.761591,-0.288554,-0.973688,-0.295026,1,-0.606917,0.810961,-0.120201,0.216062,-0.819683,0.78947,0.163723,-0.211381,-0.318574,0.600289,0.77075,0.311618,0.042187,0.939497,-0.191081,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,-1.22869,1.644646,0,1,1,0,0,0,0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,1,1,0,0,0,1,1,0,1,0,1,0,0,1,1,1,0,0,0,1,0,0,1,0,0,1,1,0,1,1,0,1,0,1,0,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,1,1,1,1,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,1,0,1,0,1,0,0,1,0,0,0,1,0,0
3,0.07335,0.360971,-0.053984,1,1,1,-0.071812,0.381612,0.885087,0.636182,-0.463453,0.347207,-0.288554,-0.550483,-0.299585,1,-0.612091,0.758273,-0.120201,0.168486,-0.819683,0.78947,0.163723,-0.211381,0.296662,0.600289,0.812427,0.311618,-0.013939,2.120297,-0.160895,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,-0.119069,1.644646,0,1,1,0,0,0,0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,1,1,0,0,0,1,1,0,1,0,0,1,0,1,1,1,0,0,0,1,0,0,1,1,0,1,1,0,1,1,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,0,1,0,0
4,1.49177,-1.228202,-0.552217,1,1,1,1.374324,-0.517023,0.68643,0.345561,-0.574214,-0.396055,-0.288554,1.017862,0.507335,1,0.303614,-0.794891,-0.120201,-0.448092,-0.819683,0.78947,-1.062101,-0.211381,-0.93381,-0.950901,0.56237,0.311618,0.154439,-0.751918,0.533381,-0.359202,-0.116299,2.3125,-0.068668,-0.087658,-1.968437,1.644646,0,1,1,0,0,0,0,0,1,1,0,0,0,0,1,0,1,1,0,0,0,1,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,1,1,0,0,0,1,1,0,0,1,0,0,1,0,1,0,0,0,1,0,0,1,0,0,1,1,0,1,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,1,1,0,1,0,1,0,0,1,0,1,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,0,1,0,0


## 2. Mixed automation and specification

Note that there are many transformations available beyond the basic ones performed under automation. The recommended resource for navigating the extensive suite of options is the Library of Transformations section of the [read me](https://github.com/Automunge/AutoMunge/blob/master/README.md), which aggregates transformations into a few high level categories (e.g. normalizations, bins and grainings, categoric, etc). In general, each of these transformations will be fit to properties of a feature set found in the training data, enabling processing of subsequent data on a consistent basis.

Transformations to feature set columns may be mixed between those performed under automation and those assigned to designated target columns. Quite simply, any feature sets that are not explicitly assigned to a transformation category will defer to automation.

Assignment of feature sets to a transformation category takes place in the "assigncat" parameter, formatted as a dictionary with transformation categories as keys and associated target feature sets (or lists of target feature sets) as values, where those target feature sets are designated by their column headers. (Note that for cases where the data passed to automunge(.) are numpy arrays instead of pandas dataframes, the column headers can be replaced with the integer index of a column.)
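
Since assigncat values may be either a single header or a list of headers, a small check before calling automunge(.) can catch misspelled column names early. The helper below is hypothetical (it is not part of Automunge) and assumes only the dictionary format described above; the toy headers are for illustration.

```python
def missing_assigncat_columns(assigncat, columns):
    # returns {category: [headers not found in columns]}
    missing = {}
    for category, targets in assigncat.items():
        # accept a single header or a list of headers
        if not isinstance(targets, list):
            targets = [targets]
        not_found = [t for t in targets if t not in columns]
        if not_found:
            missing[category] = not_found
    return missing

toy_columns = ['Alley', 'LotArea', 'FullBath']
toy_assigncat = {'text': 'Alley', 'ordl': ['FullBath', 'HalfBath']}

# 'HalfBath' is absent from toy_columns, so it is reported under 'ordl'
problems = missing_assigncat_columns(toy_assigncat, toy_columns)
```

With a dataframe, the second argument could be `list(df_train.columns)`.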

The general convention is that transformation categories are represented by four character strings, partly because these strings will in general align with the suffix appenders on the returned columns, which log the steps of transformation.

Here we'll demonstrate applying a few different versions of assignments.

In [8]:
#A general weakness of two-value sets to which the bnry transform is applied
#is that if missing values are missing not at random, that third entry type
#will be masked in the returned data
#By inspection the 'Alley' feature may be such an example
#So we'll change the transformation category from the default to text
#which is one-hot encoding

assigncat = {'text' : 'Alley'}

#Another edge case is where integer sets may be treated with a z-score normalization
#when they would be better suited for ordinal encoding for categoric representation
#here ordl transformation has encodings sorted by order of entries

integersets = ['OverallCond', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 
               'BedroomAbvGr', 'KitchenAbvGr', 'Fireplaces', 'GarageCars', 'YrSold']

assigncat.update({'ordl' : integersets})

#We may have some sets that are power law distributed. This might be a good target 
#for bxcx transform which is a box-cox power law transform

assigncat.update({'bxcx' : 'LotArea'})
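
To make the first comment in the cell above concrete, the sketch below contrasts a two-value encoding with a one-hot encoding on a toy column containing missing entries. It uses plain pandas as a stand-in; the infill choice (most common value) is an assumption for illustration, and Automunge's own bnry and text transforms may differ in detail.

```python
import pandas as pd

toy = pd.Series(['Grvl', None, 'Pave', None, 'Grvl'], name='Alley')

# two-value encoding: missing entries must be infilled with one of the
# two values (here the most common), so they become indistinguishable
most_common = toy.mode().iloc[0]
binary = (toy.fillna(most_common) == 'Pave').astype(int)

# one-hot encoding: missing entries show up as rows of all zeros,
# so "missing" remains a distinct third state
onehot = pd.get_dummies(toy, prefix='Alley', dtype=int)
```

In the binary column the missing rows look identical to 'Grvl' rows, while in the one-hot frame they are the only rows with zeros in both columns.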


In [9]:
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train,
               df_test = df_test,
               labels_column = labels_column,
               trainID_column = trainID_column,
               assigncat = assigncat,
               printstatus = False
              )

train.head()

Unnamed: 0,MSSubClass_nmbr,LotFrontage_nmbr,Street_bnry,Utilities_bnry,OverallQual_nmbr,OverallCond_ordl,YearBuilt_nmbr,YearRemodAdd_nmbr,MasVnrArea_nmbr,BsmtFinSF1_nmbr,BsmtFinSF2_nmbr,BsmtUnfSF_nmbr,TotalBsmtSF_nmbr,CentralAir_bnry,1stFlrSF_nmbr,2ndFlrSF_nmbr,LowQualFinSF_nmbr,GrLivArea_nmbr,BsmtFullBath_ordl,BsmtHalfBath_ordl,FullBath_ordl,HalfBath_ordl,BedroomAbvGr_ordl,KitchenAbvGr_ordl,TotRmsAbvGrd_nmbr,Fireplaces_ordl,GarageYrBlt_nmbr,GarageCars_ordl,GarageArea_nmbr,WoodDeckSF_nmbr,OpenPorchSF_nmbr,EnclosedPorch_nmbr,3SsnPorch_nmbr,ScreenPorch_nmbr,PoolArea_nmbr,MiscVal_nmbr,MoSold_nmbr,YrSold_ordl,MSZoning_1010_0,MSZoning_1010_1,MSZoning_1010_2,LotArea_bxcx_nmbr,Alley_Grvl,Alley_Pave,LotShape_1010_0,LotShape_1010_1,LotShape_1010_2,LandContour_1010_0,LandContour_1010_1,LandContour_1010_2,LotConfig_1010_0,LotConfig_1010_1,LotConfig_1010_2,LandSlope_1010_0,LandSlope_1010_1,Neighborhood_1010_0,Neighborhood_1010_1,Neighborhood_1010_2,Neighborhood_1010_3,Neighborhood_1010_4,Condition1_1010_0,Condition1_1010_1,Condition1_1010_2,Condition1_1010_3,Condition2_1010_0,Condition2_1010_1,Condition2_1010_2,Condition2_1010_3,BldgType_1010_0,BldgType_1010_1,BldgType_1010_2,HouseStyle_1010_0,HouseStyle_1010_1,HouseStyle_1010_2,HouseStyle_1010_3,RoofStyle_1010_0,RoofStyle_1010_1,RoofStyle_1010_2,RoofMatl_1010_0,RoofMatl_1010_1,RoofMatl_1010_2,RoofMatl_1010_3,Exterior1st_1010_0,Exterior1st_1010_1,Exterior1st_1010_2,Exterior1st_1010_3,Exterior2nd_1010_0,Exterior2nd_1010_1,Exterior2nd_1010_2,Exterior2nd_1010_3,Exterior2nd_1010_4,MasVnrType_1010_0,MasVnrType_1010_1,MasVnrType_1010_2,ExterQual_1010_0,ExterQual_1010_1,ExterQual_1010_2,ExterCond_1010_0,ExterCond_1010_1,ExterCond_1010_2,Foundation_1010_0,Foundation_1010_1,Foundation_1010_2,BsmtQual_1010_0,BsmtQual_1010_1,BsmtQual_1010_2,BsmtCond_1010_0,BsmtCond_1010_1,BsmtCond_1010_2,BsmtExposure_1010_0,BsmtExposure_1010_1,BsmtExposure_1010_2,BsmtFinType1_1010_0,BsmtFinType1_1010_1,BsmtFinType1_1010_2,BsmtFinType2_1010_0,BsmtFinType2_1010_1,BsmtFinType2_1010_2,Heating_1010_0,Heating_1010_1,Heating_1010_2,HeatingQC_1010_0,HeatingQC_1010_1,HeatingQC_1010_2,Electrical_1010_0,Electrical_1010_1,Electrical_1010_2,KitchenQual_1010_0,KitchenQual_1010_1,KitchenQual_1010_2,Functional_1010_0,Functional_1010_1,Functional_1010_2,FireplaceQu_1010_0,FireplaceQu_1010_1,FireplaceQu_1010_2,GarageType_1010_0,GarageType_1010_1,GarageType_1010_2,GarageFinish_1010_0,GarageFinish_1010_1,GarageQual_1010_0,GarageQual_1010_1,GarageQual_1010_2,GarageCond_1010_0,GarageCond_1010_1,GarageCond_1010_2,PavedDrive_1010_0,PavedDrive_1010_1,PoolQC_1010_0,PoolQC_1010_1,Fence_1010_0,Fence_1010_1,Fence_1010_2,MiscFeature_1010_0,MiscFeature_1010_1,MiscFeature_1010_2,SaleType_1010_0,SaleType_1010_1,SaleType_1010_2,SaleType_1010_3,SaleCondition_1010_0,SaleCondition_1010_1,SaleCondition_1010_2
281,-0.872264,-0.456318,1,1,-0.071812,4,1.149962,1.023678,-0.197627,1.011537,-0.288554,-0.4758,0.466305,1,0.257052,-0.794891,-0.120201,-0.482347,0,0,2,0,2,1,-0.93381,0,1.145835,2,0.463132,-0.751918,1.106914,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,-0.488943,0,0,0,1,-0.449242,0,1,0,1,1,0,1,1,1,0,0,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,1,1,0,0,0,1,1,0,1,0,1,1,0,1,0,1,0,0,0,1,0,0,1,0,0,1,1,0,1,1,0,1,0,1,0,1,0,0,1,0,1,0,1,0,0,0,1,0,1,1,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,0,1,1,0,1,0,1
1182,0.07335,4.084178,1,1,2.820459,4,0.818868,0.539309,-0.574214,3.622818,-0.288554,-0.604798,3.051184,1,3.229211,3.935614,-0.120201,5.633962,1,0,3,1,4,1,2.142369,2,0.729074,3,1.590328,0.612384,0.473009,-0.359202,-0.116299,-0.270116,13.7451,-0.087658,0.250805,1,0,1,1,1.056067,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,1,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,1,0,1,1,0,0,0,1,1,1,0,1,0,0,1,1,1,0,1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0
178,-0.872264,-0.320103,1,1,2.097391,4,1.216181,1.168989,3.568244,3.201856,-0.288554,-0.577641,2.640886,1,2.771359,-0.794891,-0.120201,1.367389,1,0,2,0,1,1,1.527133,1,1.270863,3,3.241367,-0.751918,0.201336,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,0.250805,3,0,1,1,1.270877,0,0,0,0,0,0,1,1,0,0,1,0,0,1,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,0,0,0,1,1,1,0,0,0,1,1,0,1,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,1,1,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,0,1,1,0,1,0,1
760,-0.872264,-0.002268,1,1,-0.071812,5,-0.40618,-1.252858,-0.574214,0.369132,-0.288554,-0.713428,-0.44091,1,-0.772468,-0.794891,-0.120201,-1.239749,0,0,1,0,2,1,-0.93381,0,1.229187,1,-0.809056,-0.751918,-0.704242,-0.359202,-0.116299,-0.270116,-0.068668,0.819375,1.360426,3,0,1,1,0.002038,0,0,0,1,1,0,1,1,1,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,0,0,0,1,1,1,0,1,0,1,1,1,0,0,1,0,0,1,1,1,0,0,0,0,1,0,1,1,0,1,1,0,1,1,1,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,1,1,1,1,0,1,0,1,1,0,1,1,0,0,0,0,0,0,0,1,0,1,1,1,0,0,0,1,0,1,0,0,0,1,0,0
1211,-0.163054,3.720939,1,1,1.374324,6,0.553993,0.975241,-0.574214,-0.036483,-0.288554,-0.985003,-1.136137,1,-0.213733,0.744528,-0.120201,0.450133,0,0,2,0,4,1,0.911897,0,0.395666,2,0.088959,1.841054,-0.523126,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,-0.119069,4,0,1,1,0.560991,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,1,0,1,0,1,1,1,0,0,1,0,0,1,0,1,0,0,1,0,1,0,1,0,0,1,1,0,0,0,0,1,0,1,0,1,0,0,1,0,1,0,1,0,0,0,1,1,1,1,0,1,0,1,0,1,0,0,1,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,0,1,0,0


If we want to inspect the returned form of one of these assignments, we can make use of the column map saved in postprocess_dict, which converts received feature headers into returned feature headers with suffix appenders included.

In [10]:
postprocess_dict['column_map']['LotArea']

['LotArea_bxcx_nmbr']

In [11]:
train[postprocess_dict['column_map']['LotArea']].head()

Unnamed: 0,LotArea_bxcx_nmbr
281,-0.449242
1182,1.056067
178,1.270877
760,0.002038
1211,0.560991
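
The lookup can also be run in the other direction, from a returned header back to the received feature it was derived from. The helper below is a hypothetical convenience, not an Automunge function, and it assumes only that the column map is a dictionary of received headers to lists of returned headers, as displayed above. A stand-in dictionary is used so the sketch runs on its own.

```python
def invert_column_map(column_map):
    # build {returned header: received header}
    inverted = {}
    for source, returned_headers in column_map.items():
        for header in returned_headers:
            inverted[header] = source
    return inverted

# stand-in for postprocess_dict['column_map']
toy_column_map = {
    'LotArea': ['LotArea_bxcx_nmbr'],
    'Alley': ['Alley_Grvl', 'Alley_Pave'],
}

source_of = invert_column_map(toy_column_map)
```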


And of course the column assignments will be included as the basis for subsequent data processed in postmunge(.).

## 3. Overwriting defaults under automation

Once again, the defaults under automation are as follows, here listed with the associated root transformation category:

- 'nmbr': z-score normalization of numeric sets
- 'bnry': binarization of categoric sets with two unique entries
- '1010': binarization of bounded categoric sets
- 'hash': hashing of unbounded categoric sets
- 'dat6': aggregated encodings of date-time entries
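
For intuition on the '1010' column counts seen in the earlier output (e.g. three columns for MSZoning, five for Neighborhood), here is a sketch of binary encoding in plain Python: each distinct category gets an integer, written out over as many 0/1 columns as needed. The ordering of categories and the treatment of missing values are assumptions here and may not match Automunge's internal assignment.

```python
def binary_encode(values):
    # assign each distinct category an integer by sorted order
    categories = sorted(set(values))
    # number of 0/1 columns needed to distinguish all categories
    width = max(1, (len(categories) - 1).bit_length())
    codes = {cat: format(i, '0{}b'.format(width))
             for i, cat in enumerate(categories)}
    # one list of bits per input row
    rows = [[int(bit) for bit in codes[v]] for v in values]
    return rows, codes

# five distinct categories fit in three binary columns
rows, codes = binary_encode(['RL', 'RM', 'FV', 'RH', 'C (all)', 'RL'])
```

A one-hot encoding of the same five categories would need five columns, which is the trade-off between '1010' and 'text'.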

If we want to apply different transformations as our defaults under automation, we can achieve this by overwriting the "family trees" associated with these root categories and passing those trees to automunge(.) in the "transformdict" parameter.

Here we'll demonstrate applying defaults of min-max scaling to numerical data by 'mnmx' and one-hot encoding to categoric data via 'text'.
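
As a reminder of what the swap means numerically, min-max scaling maps the train set's minimum to 0 and its maximum to 1, in place of centering on the mean and dividing by the standard deviation. The sketch below uses plain pandas with toy values and hypothetical helper names; it is not Automunge's 'mnmx' implementation.

```python
import pandas as pd

def fit_minmax(train_col):
    # parameters come from the train column only
    return {'min': train_col.min(), 'max': train_col.max()}

def apply_minmax(col, params):
    span = params['max'] - params['min']
    # guard against a constant column
    if span == 0:
        span = 1.0
    return (col - params['min']) / span

toy_train = pd.Series([10.0, 20.0, 30.0])
toy_test = pd.Series([15.0, 40.0])

params = fit_minmax(toy_train)
train_scaled = apply_minmax(toy_train, params)
test_scaled = apply_minmax(toy_test, params)
```

Note that test values outside the train range land outside the 0 to 1 interval under this plain formulation.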

The family tree primitives were introduced in another tutorial. We are showing the full set here for presentation purposes; please note that primitives without entries can be omitted if preferred. Here, since we're just applying a single transform without offspring and replacing the input column, we can pass our entries to the 'auntsuncles' primitive.

In [12]:
transformdict = {}

#this overwrites the nmbr root category family tree 
#(the default numeric normalization under automation)
#to apply min-max scaling instead of z-score
transformdict.update({'nmbr' : {'parents'       : [],
                                'siblings'      : [],
                                'auntsuncles'   : ['mnmx'], 
                                'cousins'       : [],
                                'children'      : [],
                                'niecesnephews' : [],
                                'coworkers'     : [],
                                'friends'       : []}})

#an equivalent is this form, which omits primitives without entries
#list brackets can also be omitted for single entries
transformdict.update({'nmbr' : {'auntsuncles'   : 'mnmx'}})

#this overwrites the 1010 root category family tree
#(the default categoric binarization under automation)
#to apply one-hot encoding instead of binarization
transformdict.update({'1010' : {'parents'       : [],
                                'siblings'      : [],
                                'auntsuncles'   : ['text'], 
                                'cousins'       : [],
                                'children'      : [],
                                'niecesnephews' : [],
                                'coworkers'     : [],
                                'friends'       : []}})

#an equivalent is this form, which omits primitives without entries
transformdict.update({'1010' : {'auntsuncles'   : 'text'}})
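
The equivalence between the full and abbreviated forms can be sketched with a small normalizing helper. This illustrates the convention described in the comments above and is not Automunge's internal code; the helper name is hypothetical, and the primitive names are taken from the full-form example.

```python
PRIMITIVES = ['parents', 'siblings', 'auntsuncles', 'cousins',
              'children', 'niecesnephews', 'coworkers', 'friends']

def expand_family_tree(tree):
    # fill in omitted primitives and wrap single entries in lists
    expanded = {}
    for primitive in PRIMITIVES:
        entry = tree.get(primitive, [])
        if isinstance(entry, str):
            entry = [entry]
        expanded[primitive] = list(entry)
    return expanded

short_form = {'auntsuncles': 'mnmx'}
full_form = expand_family_tree(short_form)
```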

Now if we were defining new root categories from scratch, we would need to populate corresponding entries in the processdict parameter. Here, since we are only overwriting existing family trees using internally defined transformation categories, we only need to pass a transformdict.

The updated family trees can then be passed to automunge(.) for updates to the default transformations under automation.

In [13]:
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train,
               df_test = df_test,
               labels_column = labels_column,
               trainID_column = trainID_column,
               assigncat = assigncat,
               transformdict = transformdict,
               printstatus = False
              )

train.head()

Unnamed: 0,MSSubClass_mnmx,LotFrontage_mnmx,Street_bnry,Utilities_bnry,OverallQual_mnmx,OverallCond_ordl,YearBuilt_mnmx,YearRemodAdd_mnmx,MasVnrArea_mnmx,BsmtFinSF1_mnmx,BsmtFinSF2_mnmx,BsmtUnfSF_mnmx,TotalBsmtSF_mnmx,CentralAir_bnry,1stFlrSF_mnmx,2ndFlrSF_mnmx,LowQualFinSF_mnmx,GrLivArea_mnmx,BsmtFullBath_ordl,BsmtHalfBath_ordl,FullBath_ordl,HalfBath_ordl,BedroomAbvGr_ordl,KitchenAbvGr_ordl,TotRmsAbvGrd_mnmx,Fireplaces_ordl,GarageYrBlt_mnmx,GarageCars_ordl,GarageArea_mnmx,WoodDeckSF_mnmx,OpenPorchSF_mnmx,EnclosedPorch_mnmx,3SsnPorch_mnmx,ScreenPorch_mnmx,PoolArea_mnmx,MiscVal_mnmx,MoSold_mnmx,YrSold_ordl,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,LotArea_bxcx_nmbr,Alley_Grvl,Alley_Pave,LotShape_IR1,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_Bnk,LandContour_HLS,LandContour_Low,LandContour_Lvl,LotConfig_Corner,LotConfig_CulDSac,LotConfig_FR2,LotConfig_FR3,LotConfig_Inside,LandSlope_Gtl,LandSlope_Mod,LandSlope_Sev,Neighborhood_Blmngtn,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,Neighborhood_Edwards,Neighborhood_Gilbert,Neighborhood_IDOTRR,Neighborhood_MeadowV,Neighborhood_Mitchel,Neighborhood_NAmes,Neighborhood_NPkVill,Neighborhood_NWAmes,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,Condition1_Artery,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,Condition1_RRAe,Condition1_RRAn,Condition1_RRNe,Condition1_RRNn,Condition2_Artery,Condition2_Feedr,Condition2_Norm,Condition2_PosA,Condition2_PosN,Condition2_RRAe,Condition2_RRAn,Condition2_RRNn,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,RoofStyle_Flat,RoofStyle_Gable,
RoofStyle_Gambrel,RoofStyle_Hip,RoofStyle_Mansard,RoofStyle_Shed,RoofMatl_ClyTile,RoofMatl_CompShg,RoofMatl_Membran,RoofMatl_Metal,RoofMatl_Roll,RoofMatl_Tar&Grv,RoofMatl_WdShake,RoofMatl_WdShngl,Exterior1st_AsbShng,Exterior1st_AsphShn,Exterior1st_BrkComm,Exterior1st_BrkFace,Exterior1st_CBlock,Exterior1st_CemntBd,Exterior1st_HdBoard,Exterior1st_ImStucc,Exterior1st_MetalSd,Exterior1st_Plywood,Exterior1st_Stone,Exterior1st_Stucco,Exterior1st_VinylSd,Exterior1st_Wd Sdng,Exterior1st_WdShing,Exterior2nd_AsbShng,Exterior2nd_AsphShn,Exterior2nd_Brk Cmn,Exterior2nd_BrkFace,Exterior2nd_CBlock,Exterior2nd_CmentBd,Exterior2nd_HdBoard,Exterior2nd_ImStucc,Exterior2nd_MetalSd,Exterior2nd_Other,Exterior2nd_Plywood,Exterior2nd_Stone,Exterior2nd_Stucco,Exterior2nd_VinylSd,Exterior2nd_Wd Sdng,Exterior2nd_Wd Shng,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_None,MasVnrType_Stone,ExterQual_Ex,ExterQual_Fa,ExterQual_Gd,ExterQual_TA,ExterCond_Ex,ExterCond_Fa,ExterCond_Gd,ExterCond_Po,ExterCond_TA,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,BsmtQual_Ex,BsmtQual_Fa,BsmtQual_Gd,BsmtQual_TA,BsmtCond_Fa,BsmtCond_Gd,BsmtCond_Po,BsmtCond_TA,BsmtExposure_Av,BsmtExposure_Gd,BsmtExposure_Mn,BsmtExposure_No,BsmtFinType1_ALQ,BsmtFinType1_BLQ,BsmtFinType1_GLQ,BsmtFinType1_LwQ,BsmtFinType1_Rec,BsmtFinType1_Unf,BsmtFinType2_ALQ,BsmtFinType2_BLQ,BsmtFinType2_GLQ,BsmtFinType2_LwQ,BsmtFinType2_Rec,BsmtFinType2_Unf,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall,HeatingQC_Ex,HeatingQC_Fa,HeatingQC_Gd,HeatingQC_Po,HeatingQC_TA,Electrical_FuseA,Electrical_FuseF,Electrical_FuseP,Electrical_Mix,Electrical_SBrkr,KitchenQual_Ex,KitchenQual_Fa,KitchenQual_Gd,KitchenQual_TA,Functional_Maj1,Functional_Maj2,Functional_Min1,Functional_Min2,Functional_Mod,Functional_Sev,Functional_Typ,FireplaceQu_Ex,FireplaceQu_Fa,FireplaceQu_Gd,FireplaceQu_Po,FireplaceQu_TA,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_Buil
tIn,GarageType_CarPort,GarageType_Detchd,GarageFinish_Fin,GarageFinish_RFn,GarageFinish_Unf,GarageQual_Ex,GarageQual_Fa,GarageQual_Gd,GarageQual_Po,GarageQual_TA,GarageCond_Ex,GarageCond_Fa,GarageCond_Gd,GarageCond_Po,GarageCond_TA,PavedDrive_N,PavedDrive_P,PavedDrive_Y,PoolQC_Ex,PoolQC_Fa,PoolQC_Gd,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,MiscFeature_Gar2,MiscFeature_Othr,MiscFeature_Shed,MiscFeature_TenC,SaleType_COD,SaleType_CWD,SaleType_Con,SaleType_ConLD,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
286,0.176471,0.191781,1,1,0.555556,6,0.652174,0.516667,0.0,0.106308,0.0,0.133562,0.149264,1,0.172327,0.314286,0.0,0.263753,0,0,1,1,3,1,0.416667,1,0.563636,2,0.310296,0.0,0.0,0.0,0.0,0.266667,0.0,0.0,0.454545,0,0,0,0,1,0,0.142752,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
50,0.235294,0.167979,1,1,0.555556,5,0.905797,0.783333,0.0,0.032247,0.0,0.261986,0.129951,1,0.105553,0.327361,0.0,0.214017,0,1,2,0,3,1,0.333333,0,0.881818,2,0.273625,0.0,0.137112,0.0,0.0,0.0,0.0,0.0,0.545455,1,0,0,0,1,0,0.822304,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
439,0.176471,0.157534,1,1,0.555556,7,0.347826,0.833333,0.0,0.0,0.0,0.292808,0.111948,1,0.080312,0.247942,0.0,0.162396,0,0,1,0,3,1,0.416667,0,0.954545,2,0.372355,0.0,0.084095,0.0,0.0,0.0,0.0,0.051613,0.636364,3,0,0,0,1,0,0.596061,1,0,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0
886,0.411765,0.167808,1,1,0.444444,4,0.630435,0.916667,0.07625,0.093551,0.0,0.470034,0.266121,1,0.3162,0.0,0.0,0.259608,0,0,2,0,4,2,0.5,0,0.954545,2,0.414669,0.317386,0.09872,0.0,0.0,0.0,0.0,0.0,0.454545,0,0,0,0,1,0,-0.154177,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0
297,0.235294,0.15411,1,1,0.666667,4,0.905797,0.8,1.0,0.114989,0.0,0.139555,0.159574,1,0.147086,0.472155,0.0,0.304446,0,0,2,1,3,1,0.416667,1,0.881818,2,0.406206,0.0,0.018282,0.0,0.0,0.4125,0.0,0.0,0.454545,1,0,1,0,0,0,-0.396873,0,1,1,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0


Note that one-hot encoding results in a much higher dimensionality for the returned set. Also note the suffix convention for one-hot encoding via the 'text' transform: each suffix is the unique entry from the source column associated with that activation. (For privacy-preserving column headers with one-hot encoding we could have instead applied the 'onht' transform.)
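
To make the suffix convention concrete, here is a minimal plain-Python sketch of a one-hot encoding that names each returned column after the unique entry it activates for. This is only an illustration of the naming convention, not Automunge's implementation; the helper name `one_hot_with_value_suffixes` is made up for this example, and missing entries are assumed to be represented as `None`.

```python
def one_hot_with_value_suffixes(column, values):
    """Toy one-hot encoding: one 0/1 column per unique non-missing entry,
    named '<column>_<entry>'. Missing entries (None) get all zeros."""
    categories = sorted({v for v in values if v is not None})
    return {
        f"{column}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


encoded = one_hot_with_value_suffixes("Alley", ["Grvl", None, "Pave", None])
print(encoded)
# {'Alley_Grvl': [1, 0, 0, 0], 'Alley_Pave': [0, 0, 1, 0]}
```

Each distinct entry adds a column, which is why features with many unique entries (such as Neighborhood) account for most of the added dimensionality.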

## 4. Excluding features from automated encodings

Since the convention is that columns not explicitly assigned to a transformation category in assigncat are subjected to the default encodings under automation, a natural question is how we can selectively apply transformations and leave other columns unaltered. There are a few different ways to accomplish this; we'll walk through each.

First, the easiest way to turn off automated encodings for columns not assigned in assigncat is to pass the parameter powertransform = 'excl'.

In [14]:
#'Alley' will be the only column to receive an encoding, the others will be pass-through
assigncat = {'text' : 'Alley'}

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train,
               df_test = df_test,
               labels_column = labels_column,
               trainID_column = trainID_column,
               assigncat = assigncat,
               powertransform = 'excl',
               printstatus = False
              )

train.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,Alley_Grvl,Alley_Pave
1124,80,RL,,9125,Pave,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,SLvl,7,5,1992,1992,Gable,CompShg,HdBoard,HdBoard,BrkFace,170.0,TA,TA,PConc,Gd,TA,No,Unf,0,Unf,0,384,384,GasA,Gd,Y,SBrkr,812,670,0,1482,0,0,2,1,3,1,Gd,7,Typ,1,TA,Attchd,1992.0,Fin,2,392,TA,TA,Y,100,25,0,0,0,0,,,,0,7,2007,WD,Normal,0,0
237,60,RL,,9453,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,RRNe,Norm,1Fam,2Story,7,7,1993,2003,Gable,CompShg,HdBoard,HdBoard,,0.0,Gd,TA,PConc,Gd,TA,No,BLQ,402,Unf,0,594,996,GasA,Ex,Y,SBrkr,1014,730,0,1744,0,0,2,1,3,1,Gd,7,Typ,0,,Attchd,1993.0,RFn,2,457,TA,TA,Y,370,70,0,238,0,0,,,,0,2,2010,WD,Normal,0,0
208,60,RL,,14364,Pave,IR1,Low,AllPub,Inside,Mod,SawyerW,Norm,Norm,1Fam,2Story,7,5,1988,1989,Gable,CompShg,Plywood,Plywood,BrkFace,128.0,Gd,TA,CBlock,Gd,TA,Gd,GLQ,1065,Unf,0,92,1157,GasA,Ex,Y,SBrkr,1180,882,0,2062,1,0,2,1,3,1,TA,7,Typ,1,Gd,Attchd,1988.0,Fin,2,454,TA,TA,Y,60,55,0,0,154,0,,,,0,4,2007,WD,Normal,0,0
162,20,RL,95.0,12182,Pave,Reg,Lvl,AllPub,Corner,Gtl,NridgHt,Norm,Norm,1Fam,1Story,7,5,2005,2005,Gable,CompShg,VinylSd,VinylSd,BrkFace,226.0,Gd,TA,PConc,Gd,TA,Mn,BLQ,1201,Unf,0,340,1541,GasA,Ex,Y,SBrkr,1541,0,0,1541,0,0,2,0,3,1,Gd,7,Typ,1,Gd,Attchd,2005.0,RFn,2,532,TA,TA,Y,0,70,0,0,0,0,,,,0,5,2010,New,Partial,0,0
1160,160,RL,24.0,2280,Pave,Reg,Lvl,AllPub,Inside,Gtl,NPkVill,Norm,Norm,Twnhs,2Story,6,5,1978,1978,Gable,CompShg,Plywood,Brk Cmn,,0.0,TA,TA,CBlock,Gd,TA,No,ALQ,311,Unf,0,544,855,GasA,Fa,Y,SBrkr,855,601,0,1456,0,0,2,1,3,1,TA,7,Typ,1,TA,Attchd,1978.0,Unf,2,440,TA,TA,Y,26,0,0,0,0,0,,,,0,7,2010,WD,Normal,0,0


Some caution is warranted for application of pass-through. There are several methods in the library, such as ML infill and feature selection, that rely on all data being numerically encoded. 
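
Before relying on methods that expect numeric data, it can help to check which columns still hold non-numeric or missing entries. The following is a rough plain-Python sketch of such a check. The helper `columns_not_fully_numeric` and the dictionary-of-lists stand-in for a dataframe are assumptions made for illustration; they are not part of the Automunge API.

```python
import math


def columns_not_fully_numeric(table):
    """Return names of columns holding any non-numeric or missing entry.

    `table` maps column name -> list of values (a stand-in for a dataframe).
    """
    flagged = []
    for name, values in table.items():
        for value in values:
            is_number = isinstance(value, (int, float))
            if not is_number or math.isnan(value):
                flagged.append(name)
                break
    return flagged


# A few values in the spirit of the pass-through output above.
sample = {
    "LotArea": [9125, 9453, 14364],
    "MSZoning": ["RL", "RL", "RL"],
    "LotFrontage": [float("nan"), 95.0, 24.0],
    "Alley_Grvl": [0, 0, 0],
}
print(columns_not_fully_numeric(sample))
# ['MSZoning', 'LotFrontage']
```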

For another (more roundabout) approach, if we still want to apply ML infill but only on the basis of a selection of the features, leaving the others as direct pass-through without transformations, we can carve out pass-through columns to be returned separately in the 'ID' sets (consistently partitioned and shuffled) by passing a list of pass-through columns to the trainID_column parameter.

Here we'll demonstrate carving out a selection of columns for pass-through via the ID sets.

In [15]:
passthrough_columns = ['Id', 'MSSubClass', 'MSZoning', 'LotFrontage']

trainID_column = passthrough_columns

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train,
               df_test = df_test,
               labels_column = labels_column,
               trainID_column = trainID_column,
               printstatus = False
              )

train.head()

Unnamed: 0,LotArea_nmbr,Street_bnry,Alley_bnry,Utilities_bnry,OverallQual_nmbr,OverallCond_nmbr,YearBuilt_nmbr,YearRemodAdd_nmbr,MasVnrArea_nmbr,BsmtFinSF1_nmbr,BsmtFinSF2_nmbr,BsmtUnfSF_nmbr,TotalBsmtSF_nmbr,CentralAir_bnry,1stFlrSF_nmbr,2ndFlrSF_nmbr,LowQualFinSF_nmbr,GrLivArea_nmbr,BsmtFullBath_nmbr,FullBath_nmbr,BedroomAbvGr_nmbr,KitchenAbvGr_nmbr,TotRmsAbvGrd_nmbr,Fireplaces_nmbr,GarageYrBlt_nmbr,GarageCars_nmbr,GarageArea_nmbr,WoodDeckSF_nmbr,OpenPorchSF_nmbr,EnclosedPorch_nmbr,3SsnPorch_nmbr,ScreenPorch_nmbr,PoolArea_nmbr,MiscVal_nmbr,MoSold_nmbr,YrSold_nmbr,LotShape_1010_0,LotShape_1010_1,LotShape_1010_2,LandContour_1010_0,LandContour_1010_1,LandContour_1010_2,LotConfig_1010_0,LotConfig_1010_1,LotConfig_1010_2,LandSlope_1010_0,LandSlope_1010_1,Neighborhood_1010_0,Neighborhood_1010_1,Neighborhood_1010_2,Neighborhood_1010_3,Neighborhood_1010_4,Condition1_1010_0,Condition1_1010_1,Condition1_1010_2,Condition1_1010_3,Condition2_1010_0,Condition2_1010_1,Condition2_1010_2,Condition2_1010_3,BldgType_1010_0,BldgType_1010_1,BldgType_1010_2,HouseStyle_1010_0,HouseStyle_1010_1,HouseStyle_1010_2,HouseStyle_1010_3,RoofStyle_1010_0,RoofStyle_1010_1,RoofStyle_1010_2,RoofMatl_1010_0,RoofMatl_1010_1,RoofMatl_1010_2,RoofMatl_1010_3,Exterior1st_1010_0,Exterior1st_1010_1,Exterior1st_1010_2,Exterior1st_1010_3,Exterior2nd_1010_0,Exterior2nd_1010_1,Exterior2nd_1010_2,Exterior2nd_1010_3,Exterior2nd_1010_4,MasVnrType_1010_0,MasVnrType_1010_1,MasVnrType_1010_2,ExterQual_1010_0,ExterQual_1010_1,ExterQual_1010_2,ExterCond_1010_0,ExterCond_1010_1,ExterCond_1010_2,Foundation_1010_0,Foundation_1010_1,Foundation_1010_2,BsmtQual_1010_0,BsmtQual_1010_1,BsmtQual_1010_2,BsmtCond_1010_0,BsmtCond_1010_1,BsmtCond_1010_2,BsmtExposure_1010_0,BsmtExposure_1010_1,BsmtExposure_1010_2,BsmtFinType1_1010_0,BsmtFinType1_1010_1,BsmtFinType1_1010_2,BsmtFinType2_1010_0,BsmtFinType2_1010_1,BsmtFinType2_1010_2,Heating_1010_0,Heating_1010_1,Heating_1010_2,HeatingQC_1010_0,HeatingQC_1010_1,HeatingQC_1010_2,Electr
ical_1010_0,Electrical_1010_1,Electrical_1010_2,BsmtHalfBath_0.0,BsmtHalfBath_1.0,BsmtHalfBath_2.0,HalfBath_0.0,HalfBath_1.0,HalfBath_2.0,KitchenQual_1010_0,KitchenQual_1010_1,KitchenQual_1010_2,Functional_1010_0,Functional_1010_1,Functional_1010_2,FireplaceQu_1010_0,FireplaceQu_1010_1,FireplaceQu_1010_2,GarageType_1010_0,GarageType_1010_1,GarageType_1010_2,GarageFinish_1010_0,GarageFinish_1010_1,GarageQual_1010_0,GarageQual_1010_1,GarageQual_1010_2,GarageCond_1010_0,GarageCond_1010_1,GarageCond_1010_2,PavedDrive_1010_0,PavedDrive_1010_1,PoolQC_1010_0,PoolQC_1010_1,Fence_1010_0,Fence_1010_1,Fence_1010_2,MiscFeature_1010_0,MiscFeature_1010_1,MiscFeature_1010_2,SaleType_1010_0,SaleType_1010_1,SaleType_1010_2,SaleType_1010_3,SaleCondition_1010_0,SaleCondition_1010_1,SaleCondition_1010_2
1220,-0.272193,1,1,1,-0.794879,-0.517023,-0.240633,-1.010673,-0.574214,-0.288622,3.430779,-1.283736,-0.331497,1,-0.648305,-0.794891,-0.120201,-1.148404,-0.819683,-1.025689,-1.062101,-0.211381,-0.93381,-0.950901,-0.60456,-1.026506,-0.865182,-0.751918,-0.704242,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,1.730299,-1.367186,0,0,0,0,1,1,1,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,1,1,0,1,0,1,1,1,0,0,1,0,0,1,1,1,0,0,0,0,1,0,1,1,0,1,1,0,1,1,1,0,0,0,1,1,0,0,1,1,0,0,1,0,0,1,0,0,1,0,0,0,1,1,1,1,0,1,0,1,0,0,1,1,0,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,0,0,0,0
63,-0.021724,1,1,1,0.651256,0.381612,-1.664337,-1.68879,-0.574214,-0.972685,-0.288554,0.019824,-1.097387,1,-0.674172,1.056077,-0.120201,0.370207,-0.819683,0.78947,0.163723,-0.211381,1.527133,-0.950901,0.479018,0.311618,0.032833,-0.656178,-0.538219,0.687933,-0.116299,-0.270116,-0.068668,-0.087658,-0.858816,1.644646,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,1,1,0,1,1,0,1,1,0,0,0,1,0,0,1,1,1,0,0,0,0,0,0,1,1,0,1,1,0,1,1,1,0,1,1,0,1,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,0,1,1,1,1,0,1,0,1,1,0,1,1,0,1,0,0,1,0,0,1,0,1,1,0,0,0,1,0,0,1,0,0,0,1,0,0
166,0.019153,1,1,1,-0.794879,-0.517023,-0.538617,0.393998,-0.574214,-0.141723,4.472191,-0.220067,1.275504,1,1.822027,-0.794891,-0.120201,0.668981,1.107431,-1.025689,-1.062101,-0.211381,0.296662,3.702669,-0.979645,-1.026506,-0.795025,3.045789,-0.704242,-0.359202,-0.116299,2.276631,-0.068668,-0.087658,1.730299,0.891688,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,0,0,0,1,1,1,0,1,0,1,1,1,0,0,1,0,0,1,0,1,0,0,0,0,1,0,1,1,0,1,1,0,1,1,0,1,1,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,1,1,1,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,1,0,0,1,1,0,0,0,0,0,0,1,0,0
1113,-0.159682,1,1,1,-0.794879,1.280247,-0.604836,1.023678,-0.574214,0.4371,-0.288554,-0.457695,-0.112671,1,-0.399978,-0.794891,-0.120201,-0.965714,1.107431,-1.025689,-1.062101,-0.211381,-0.318574,-0.950901,-1.062997,-1.026506,-1.089686,-0.751918,-0.432569,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,-0.488943,-0.614228,0,1,1,0,1,1,1,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,1,1,0,1,0,1,1,1,0,0,1,0,0,1,1,1,0,0,0,0,1,0,1,1,0,1,1,0,1,1,0,0,1,1,0,1,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,1,1,0,1,0,1,0,0,1,1,0,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,0,1,0,0
1259,-0.076827,1,1,1,-0.071812,2.178881,-0.075086,-0.768488,-0.574214,0.347207,2.426559,-1.252052,-0.007817,1,-0.280989,-0.794891,-0.120201,-0.878175,1.107431,-1.025689,0.163723,-0.211381,-0.318574,-0.950901,-0.39618,0.311618,-0.06071,0.684189,-0.704242,-0.359202,-0.116299,1.164671,-0.068668,-0.087658,0.250805,0.13873,0,1,1,0,1,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,1,1,0,0,0,1,1,0,0,1,0,0,1,1,1,0,0,0,0,1,0,1,0,0,1,1,0,1,1,0,0,0,0,1,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,1,1,1,1,0,1,0,1,0,0,1,1,0,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,0,1,0,0


The passthrough columns can then be found in the returned ID sets.

In the returned ID sets with the carveouts we'll also see that a new column was included, labeled 'Automunge_index_###', containing index values. The ### is a unique identifier associated with each automunge(.) call. Note that if our passed df_train set had included a non-range index, that index would also have been carved out and included in the returned ID sets.

In [16]:
trainID.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,Automunge_index_760368845356
1220,1221,20,RL,66.0,1220
63,64,70,RM,50.0,63
166,167,20,RL,,166
1113,1114,20,RL,66.0,1113
1259,1260,20,RL,65.0,1259
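
Because the ID sets are partitioned and shuffled consistently with the encoded sets, the two can be lined back up by their shared index values. Here is a small plain-Python sketch of that idea, using dictionaries keyed by index as stand-ins for a few rows of `train` and `trainID`. The helper `join_on_index` is hypothetical and only illustrates the alignment.

```python
def join_on_index(encoded_rows, id_rows):
    """Combine two row collections keyed by the same index values."""
    shared = [idx for idx in encoded_rows if idx in id_rows]
    return {idx: {**id_rows[idx], **encoded_rows[idx]} for idx in shared}


# Toy stand-ins for a few rows of `train` and `trainID`, keyed by index.
encoded_rows = {1220: {"Street_bnry": 1}, 63: {"Street_bnry": 1}}
id_rows = {63: {"Id": 64, "MSZoning": "RM"}, 1220: {"Id": 1221, "MSZoning": "RL"}}

joined = join_on_index(encoded_rows, id_rows)
print(joined[1220])
# {'Id': 1221, 'MSZoning': 'RL', 'Street_bnry': 1}
```

With the dataframes returned above, the pandas counterpart would be an index-based join such as `trainID.join(train)`.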


Another approach for excluding columns from processing can be applied when columns are assigned to transformation categories in assigncat. A few transformation categories of note that can be applied in assigncat follow:

- 'excl': columns assigned to excl are treated as direct pass-through with no processing or infill
- 'exc2': columns assigned to exc2 are also treated as pass-through but subjected to mode-infill to ensure returned data is numeric
- 'eval': columns assigned to eval are evaluated for the same processing that takes place under automation. This can be useful when unassigned columns are excluded with the powertransform = 'excl' option described above and there are still columns you'd like to encode under automation
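
As a rough picture of what mode infill means for a pass-through column, the plain-Python sketch below replaces missing entries with the most common non-missing entry. It is an illustration of the general technique only. The helper `mode_infill` is made up for this example, missing entries are assumed to be `None`, and any further handling that 'exc2' applies in the library is not modeled here.

```python
from collections import Counter


def mode_infill(values, missing=None):
    """Replace missing entries with the most common non-missing entry."""
    present = [v for v in values if v is not missing]
    if not present:
        # Nothing to infer a mode from, so return the values unchanged.
        return list(values)
    mode_value = Counter(present).most_common(1)[0][0]
    return [mode_value if v is missing else v for v in values]


print(mode_infill([66.0, 50.0, None, 66.0, 65.0]))
# [66.0, 50.0, 66.0, 66.0, 65.0]
```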

Let's demonstrate a few of these assignment operations in an assigncat.

In [17]:
passthrough_columns = ['MSSubClass', 'MSZoning']

passthrough_columns_with_infill = ['LotFrontage']

automated_encodings_columns = ['LotArea']

assigncat = \
{'excl' : passthrough_columns,
 'exc2' : passthrough_columns_with_infill,
 'eval' : automated_encodings_columns}

#revert trainID_column to original value
trainID_column = 'Id'

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train,
               df_test = df_test,
               labels_column = labels_column,
               trainID_column = trainID_column,
               assigncat = assigncat,
               printstatus = False
              )

train.head()


Unnamed: 0,MSSubClass,MSZoning,LotFrontage_exc2,LotArea_nmbr,Street_bnry,Alley_bnry,Utilities_bnry,OverallQual_nmbr,OverallCond_nmbr,YearBuilt_nmbr,YearRemodAdd_nmbr,MasVnrArea_nmbr,BsmtFinSF1_nmbr,BsmtFinSF2_nmbr,BsmtUnfSF_nmbr,TotalBsmtSF_nmbr,CentralAir_bnry,1stFlrSF_nmbr,2ndFlrSF_nmbr,LowQualFinSF_nmbr,GrLivArea_nmbr,BsmtFullBath_nmbr,FullBath_nmbr,BedroomAbvGr_nmbr,KitchenAbvGr_nmbr,TotRmsAbvGrd_nmbr,Fireplaces_nmbr,GarageYrBlt_nmbr,GarageCars_nmbr,GarageArea_nmbr,WoodDeckSF_nmbr,OpenPorchSF_nmbr,EnclosedPorch_nmbr,3SsnPorch_nmbr,ScreenPorch_nmbr,PoolArea_nmbr,MiscVal_nmbr,MoSold_nmbr,YrSold_nmbr,LotShape_1010_0,LotShape_1010_1,LotShape_1010_2,LandContour_1010_0,LandContour_1010_1,LandContour_1010_2,LotConfig_1010_0,LotConfig_1010_1,LotConfig_1010_2,LandSlope_1010_0,LandSlope_1010_1,Neighborhood_1010_0,Neighborhood_1010_1,Neighborhood_1010_2,Neighborhood_1010_3,Neighborhood_1010_4,Condition1_1010_0,Condition1_1010_1,Condition1_1010_2,Condition1_1010_3,Condition2_1010_0,Condition2_1010_1,Condition2_1010_2,Condition2_1010_3,BldgType_1010_0,BldgType_1010_1,BldgType_1010_2,HouseStyle_1010_0,HouseStyle_1010_1,HouseStyle_1010_2,HouseStyle_1010_3,RoofStyle_1010_0,RoofStyle_1010_1,RoofStyle_1010_2,RoofMatl_1010_0,RoofMatl_1010_1,RoofMatl_1010_2,RoofMatl_1010_3,Exterior1st_1010_0,Exterior1st_1010_1,Exterior1st_1010_2,Exterior1st_1010_3,Exterior2nd_1010_0,Exterior2nd_1010_1,Exterior2nd_1010_2,Exterior2nd_1010_3,Exterior2nd_1010_4,MasVnrType_1010_0,MasVnrType_1010_1,MasVnrType_1010_2,ExterQual_1010_0,ExterQual_1010_1,ExterQual_1010_2,ExterCond_1010_0,ExterCond_1010_1,ExterCond_1010_2,Foundation_1010_0,Foundation_1010_1,Foundation_1010_2,BsmtQual_1010_0,BsmtQual_1010_1,BsmtQual_1010_2,BsmtCond_1010_0,BsmtCond_1010_1,BsmtCond_1010_2,BsmtExposure_1010_0,BsmtExposure_1010_1,BsmtExposure_1010_2,BsmtFinType1_1010_0,BsmtFinType1_1010_1,BsmtFinType1_1010_2,BsmtFinType2_1010_0,BsmtFinType2_1010_1,BsmtFinType2_1010_2,Heating_1010_0,Heating_1010_1,Heating_1010_2,HeatingQC_1010_0,Hea
tingQC_1010_1,HeatingQC_1010_2,Electrical_1010_0,Electrical_1010_1,Electrical_1010_2,BsmtHalfBath_0.0,BsmtHalfBath_1.0,BsmtHalfBath_2.0,HalfBath_0.0,HalfBath_1.0,HalfBath_2.0,KitchenQual_1010_0,KitchenQual_1010_1,KitchenQual_1010_2,Functional_1010_0,Functional_1010_1,Functional_1010_2,FireplaceQu_1010_0,FireplaceQu_1010_1,FireplaceQu_1010_2,GarageType_1010_0,GarageType_1010_1,GarageType_1010_2,GarageFinish_1010_0,GarageFinish_1010_1,GarageQual_1010_0,GarageQual_1010_1,GarageQual_1010_2,GarageCond_1010_0,GarageCond_1010_1,GarageCond_1010_2,PavedDrive_1010_0,PavedDrive_1010_1,PoolQC_1010_0,PoolQC_1010_1,Fence_1010_0,Fence_1010_1,Fence_1010_2,MiscFeature_1010_0,MiscFeature_1010_1,MiscFeature_1010_2,SaleType_1010_0,SaleType_1010_1,SaleType_1010_2,SaleType_1010_3,SaleCondition_1010_0,SaleCondition_1010_1,SaleCondition_1010_2
1273,80,RL,124.0,0.099704,1,1,1,-0.071812,1.280247,-0.40618,1.023678,-0.109018,0.60373,-0.288554,-0.604798,-0.087597,1,0.502792,-0.794891,-0.120201,-0.30156,1.107431,-1.025689,-1.062101,-0.211381,-0.93381,0.600289,-0.81294,-1.026506,-0.75293,-0.751918,-0.704242,-0.359202,-0.116299,2.653262,-0.068668,-0.087658,-0.488943,0.13873,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,1,0,0,0,0,0,1,1,1,0,0,1,0,0,0,1,1,0,0,1,0,1,0,1,0,0,0,1,0,1,1,1,0,0,0,0,1,0,1,1,0,1,1,0,0,0,0,0,0,1,0,1,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,1,1,0,0,1,0,0,1,0,0,1,1,0,0,1,0,0,1,0,1,1,0,0,0,1,0,0,1,0,0,0,1,0,0
938,60,RL,73.0,-0.176013,1,1,1,0.651256,-0.517023,1.149962,1.023678,-0.574214,0.04464,-0.288554,0.814181,0.760352,1,0.590741,0.513157,-0.120201,0.849768,-0.819683,0.78947,0.163723,-0.211381,0.296662,-0.950901,1.145835,1.649742,1.847572,-0.751918,0.654125,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,0.620678,-1.367186,0,1,1,0,1,1,1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,1,1,1,0,0,0,1,1,0,1,0,1,0,0,1,0,1,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,1,0,1,0,1,0,0,1,0,1,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,0,1,1,0,1,0,1
1175,50,RL,85.0,0.016147,1,1,1,1.374324,-0.517023,0.68643,0.733056,1.292108,0.562073,-0.288554,0.940916,1.425947,1,2.499752,0.907175,-0.120201,2.581517,-0.819683,0.78947,1.389547,-0.211381,1.527133,0.600289,0.56237,0.311618,0.31814,-0.751918,-0.206174,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,-0.858816,-0.614228,0,1,1,0,1,1,1,0,0,0,0,0,1,1,1,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,1,1,0,0,0,1,1,0,0,0,1,0,1,0,1,0,0,0,1,0,0,1,0,0,1,1,0,1,1,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,1,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,0,1,0,0
490,160,RM,60.0,-0.786657,1,1,1,-0.794879,0.381612,0.15668,-0.42943,-0.574214,-0.972685,-0.288554,-0.686271,-1.80857,1,-1.413978,0.781181,-0.120201,-0.40242,-0.819683,-1.025689,0.163723,-0.211381,-1.549046,0.600289,-0.104447,-1.026506,-0.640678,0.373033,-0.342011,-0.359202,-0.116299,-0.270116,-0.068668,-0.087658,-0.119069,0.13873,0,1,1,0,1,1,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,1,0,0,1,0,1,0,1,0,0,1,1,1,0,0,0,1,0,0,1,0,0,1,1,0,1,0,1,0,1,1,0,1,0,0,1,1,0,0,1,0,0,1,0,0,0,1,0,0,1,1,1,1,0,0,1,0,0,1,1,0,0,1,0,0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,0,1,0,0
210,30,RL,67.0,-0.492205,1,1,1,-0.794879,0.381612,-1.531899,-1.68879,-0.574214,0.05341,-0.288554,-0.387538,-0.44091,0,-0.772468,-0.794891,-0.120201,-1.239749,1.107431,-1.025689,-1.062101,-0.211381,-0.93381,-0.950901,0.0,-2.36463,-2.212205,-0.751918,-0.704242,1.211501,-0.116299,-0.270116,-0.068668,-0.087658,-0.858816,0.13873,0,1,1,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,1,0,1,1,0,1,1,0,0,0,1,0,0,1,1,1,0,0,0,0,1,0,1,1,0,1,1,0,1,1,1,0,0,1,0,1,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,1,1,1,1,0,1,0,1,1,1,0,1,1,1,0,1,1,0,1,1,0,1,1,1,0,0,1,0,0,1,0,0,0,1,0,0


## 5. Custom transformation sets

We demonstrated above creating transformdict family trees to overwrite existing transformation categories found in the library. It's also possible to create entirely new root transformation categories and populate their family trees with transformation category entries assigned to the primitives. For new root categories there is an additional step: we need to populate a corresponding processdict entry, which specifies the transformation functions associated with a transformation category and defines some column properties. Not to worry, for most cases a processdict entry can just point to some other transformation category with comparable transformation functions, so specification is pretty simple.

Let's demonstrate defining a new root category 'newt' and populating a family tree and corresponding processdict pointer to duplicate another transformation category's transformation functions. For this purpose we'll populate an upstream UPCS (uppercase conversion) with a downstream '1010' binarization and also a downstream 'nmrc' for extraction of numeric entries.

Here the 'newt' root category will have a family tree defined, and 'newt' will also be included as an upstream primitive entry for the application of the upstream UPCS transformation. Since parents is a primitive with offspring, after the transformation functions associated with the 'newt' transformation category are applied, the 'newt' family tree will be inspected for downstream primitive entries, where we'll find '1010' and 'nmrc'. And since we don't want the intermediate uppercase conversion in the returned set, the downstream entries will be passed to coworkers, which is a replacement primitive without further offspring.

(Note that in an alternate configuration, if we wanted the nmrc numeric extraction to have a downstream normalization applied, we could pass nmrc to children, which would result in the downstream primitives of the nmrc family tree being inspected, where a nmbr z-score normalization would be found.)

In [18]:
transformdict = \
{'newt' : {'parents'       : ['newt'],
           'siblings'      : [],
           'auntsuncles'   : [], 
           'cousins'       : [],
           'children'      : [],
           'niecesnephews' : [],
           'coworkers'     : ['1010', 'nmrc'],
           'friends'       : []}}

#this is equivalent to defining as
transformdict = \
{'newt' : {'parents'       : 'newt',
           'coworkers'     : ['1010', 'nmrc']}}
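
The equivalence between the two forms above can be pictured as a small normalization step: omitted primitives are read as empty lists and a bare string is read as a single-entry list. The sketch below is our own illustration of that reading, not code from the library; `expand_family_tree` is a hypothetical helper name.

```python
PRIMITIVES = (
    "parents", "siblings", "auntsuncles", "cousins",
    "children", "niecesnephews", "coworkers", "friends",
)


def expand_family_tree(tree):
    """Fill omitted primitives with [] and wrap bare strings in a list."""
    expanded = {}
    for primitive in PRIMITIVES:
        entry = tree.get(primitive, [])
        expanded[primitive] = [entry] if isinstance(entry, str) else list(entry)
    return expanded


short_form = {"parents": "newt", "coworkers": ["1010", "nmrc"]}
full_form = expand_family_tree(short_form)
print(full_form["parents"], full_form["coworkers"], full_form["friends"])
# ['newt'] ['1010', 'nmrc'] []
```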

And for the corresponding processdict entry, since the upstream 'newt' category is intended for a UPCS transformation function, we can just apply a pointer to match the processdict entry associated with the UPCS transform.

In [19]:
processdict = \
{'newt' : {'functionpointer' : 'UPCS'}}
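
One way to think about a pointer entry is as a lookup that is followed until it reaches an entry that carries actual definitions. The toy resolver below sketches that idea in plain Python. Everything in it is an assumption for illustration: the keys 'transform' and 'note' are placeholders rather than real processdict keys, the rule that keys on the pointing entry take precedence is a choice made for this sketch, and `resolve_pointer` is not a library function.

```python
def resolve_pointer(category, processdict):
    """Follow 'functionpointer' links until an entry without one is reached.

    Keys set on the pointing entries take precedence over the target's keys.
    """
    merged = {}
    seen = set()
    while True:
        if category in seen:
            raise ValueError(f"circular functionpointer at {category!r}")
        seen.add(category)
        entry = processdict[category]
        for key, value in entry.items():
            if key != "functionpointer":
                merged.setdefault(key, value)
        if "functionpointer" not in entry:
            return merged
        category = entry["functionpointer"]


# Placeholder entries: 'transform' and 'note' are illustrative keys only.
toy_processdict = {
    "UPCS": {"transform": str.upper, "note": "uppercase conversion"},
    "newt": {"functionpointer": "UPCS", "note": "custom root category"},
}

resolved = resolve_pointer("newt", toy_processdict)
print(resolved["transform"]("NAmes"), "|", resolved["note"])
# NAMES | custom root category
```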

Great, now we can assign a column to this newly defined root category in assigncat.

In [20]:
assigncat = {'newt' : 'Neighborhood'}

In [21]:
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict \
= am.automunge(df_train,
               df_test = df_test,
               labels_column = labels_column,
               trainID_column = trainID_column,
               assigncat = assigncat,
               transformdict = transformdict,
               processdict = processdict,
               printstatus = False
              )

train[postprocess_dict['column_map']['Neighborhood']].head()

Unnamed: 0,Neighborhood_UPCS_1010_0,Neighborhood_UPCS_1010_1,Neighborhood_UPCS_1010_2,Neighborhood_UPCS_1010_3,Neighborhood_UPCS_1010_4,Neighborhood_UPCS_nmrc
1137,1,0,0,1,0,0.0
379,0,1,0,0,0,0.0
1285,0,0,0,1,1,0.0
1274,0,0,1,1,0,0.0
557,0,1,0,0,1,0.0
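
To trace how the returned columns follow from the family tree, here is a toy walk over the primitives in plain Python. It encodes a simplified reading of the primitives as described in this section: upstream entries in parents and siblings have their own trees inspected for downstream entries, children and niecesnephews are inspected in turn, coworkers and friends end the chain, and an intermediate column is dropped whenever a replacement primitive (children or coworkers) is populated beneath it. This is a sketch under those assumptions, not the library's traversal code, and the function names are made up.

```python
def downstream_chains(category, trees, prefix, depth=0):
    """Return the transformation chains produced beneath `category`."""
    if depth > 10:
        raise RecursionError("family tree nesting too deep")
    tree = trees.get(category, {})
    chains = []
    # Downstream entries with offspring: their own trees are inspected too.
    for primitive in ("children", "niecesnephews"):
        for entry in tree.get(primitive, []):
            chains += downstream_chains(entry, trees, prefix + [entry], depth + 1)
    # Downstream entries without offspring end the chain.
    for primitive in ("coworkers", "friends"):
        for entry in tree.get(primitive, []):
            chains.append(prefix + [entry])
    # Replacement primitives drop the intermediate column they were applied to.
    replaced = tree.get("children") or tree.get("coworkers")
    if not replaced:
        chains.insert(0, prefix)
    return chains


def returned_chains(root, trees):
    """List the chains of transformations returned for a root category."""
    tree = trees[root]
    chains = []
    for primitive in ("parents", "siblings"):
        for entry in tree.get(primitive, []):
            chains += downstream_chains(entry, trees, [entry])
    for primitive in ("auntsuncles", "cousins"):
        for entry in tree.get(primitive, []):
            chains.append([entry])
    return chains


transformdict = {"newt": {"parents": ["newt"], "coworkers": ["1010", "nmrc"]}}
chains = returned_chains("newt", transformdict)
print(chains)
# [['newt', '1010'], ['newt', 'nmrc']]
```

The two chains line up with the two groups of returned columns above, where the headers carry the suffix UPCS in place of newt.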


In [22]:
#voila