## String Parsing Validation Experiments

Here we validate string parsing on two example features from the IEEE-CIS Fraud Detection Kaggle competition. The data set is available [here](https://www.kaggle.com/c/ieee-fraud-detection).

The evaluation uses a subset of features: the top ten features from a feature importance evaluation, plus two additional features selected by inspection as viable string parsing targets.

The results of the validation are reported in the final cell of the notebook.

In [1]:
import pandas as pd
import numpy as np

from Automunge import AutoMunge
am = AutoMunge()


In [2]:
pd.set_option("display.max_columns", 200)

In [3]:
train_identity_path = 'train_identity.csv'
train_transaction_path = 'train_transaction.csv'
#test_identity_path = 'test_identity.csv'
#test_transaction_path = 'test_transaction.csv'

In [4]:
ID_column = 'TransactionID'
label_column = 'isFraud'

In [5]:
train_identity = pd.read_csv(train_identity_path, on_bad_lines="warn", index_col="TransactionID")
train_identity.head()

TransactionID,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,id_10,id_11,id_12,id_13,id_14,id_15,id_16,id_17,id_18,id_19,id_20,id_21,id_22,id_23,id_24,id_25,id_26,id_27,id_28,id_29,id_30,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
2987004,0.0,70787.0,,,,,,,,,100.0,NotFound,,-480.0,New,NotFound,166.0,,542.0,144.0,,,,,,,,New,NotFound,Android 7.0,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M
2987008,-5.0,98945.0,,,0.0,-5.0,,,,,100.0,NotFound,49.0,-300.0,New,NotFound,166.0,,621.0,500.0,,,,,,,,New,NotFound,iOS 11.1.2,mobile safari 11.0,32.0,1334x750,match_status:1,T,F,F,T,mobile,iOS Device
2987010,-5.0,191631.0,0.0,0.0,0.0,0.0,,,0.0,0.0,100.0,NotFound,52.0,,Found,Found,121.0,,410.0,142.0,,,,,,,,Found,Found,,chrome 62.0,,,,F,F,T,T,desktop,Windows
2987011,-5.0,221832.0,,,0.0,-6.0,,,,,100.0,NotFound,52.0,,New,NotFound,225.0,,176.0,507.0,,,,,,,,New,NotFound,,chrome 62.0,,,,F,F,T,T,desktop,
2987016,0.0,7460.0,0.0,0.0,1.0,0.0,,,0.0,0.0,100.0,NotFound,,-300.0,Found,Found,166.0,15.0,529.0,575.0,,,,,,,,Found,Found,Mac OS X 10_11_6,chrome 62.0,24.0,1280x800,match_status:2,T,F,T,T,desktop,MacOS


In [6]:
#upon inspection it appears that feature 'id_30' is a good candidate for string parsing

train_identity['id_30'].unique()

array(['Android 7.0', 'iOS 11.1.2', nan, 'Mac OS X 10_11_6', 'Windows 10',
       'Android', 'Linux', 'iOS 11.0.3', 'Mac OS X 10_7_5',
       'Mac OS X 10_12_6', 'Mac OS X 10_13_1', 'iOS 11.1.0',
       'Mac OS X 10_9_5', 'Windows 7', 'Windows 8.1', 'Mac', 'iOS 10.3.3',
       'Mac OS X 10.12', 'Mac OS X 10_10_5', 'Mac OS X 10_11_5',
       'iOS 9.3.5', 'Android 5.1.1', 'Android 7.1.1', 'Android 6.0',
       'iOS 10.3.1', 'Mac OS X 10.9', 'iOS 11.1.1', 'Windows Vista',
       'iOS 10.3.2', 'iOS 11.0.2', 'Mac OS X 10.11', 'Android 8.0.0',
       'iOS 10.2.0', 'iOS 10.2.1', 'iOS 11.0.0', 'Mac OS X 10.10',
       'Mac OS X 10_12_3', 'Mac OS X 10_12', 'Android 6.0.1', 'iOS',
       'Mac OS X 10.13', 'Mac OS X 10_12_5', 'Mac OS X 10_8_5',
       'iOS 11.0.1', 'iOS 10.0.2', 'Android 5.0.2', 'Windows XP',
       'iOS 11.2.0', 'Mac OS X 10.6', 'Windows 8', 'Mac OS X 10_6_8',
       'Mac OS X 10_11_4', 'Mac OS X 10_12_1', 'iOS 10.1.1',
       'Mac OS X 10_11_3', 'Mac OS X 10_12_4', 'Mac OS X 10

In [7]:
#as is feature 'id_31'

train_identity['id_31'].unique()

array(['samsung browser 6.2', 'mobile safari 11.0', 'chrome 62.0', nan,
       'chrome 62.0 for android', 'edge 15.0', 'mobile safari generic',
       'chrome 49.0', 'chrome 61.0', 'edge 16.0', 'safari generic',
       'edge 14.0', 'chrome 56.0 for android', 'firefox 57.0',
       'chrome 54.0 for android', 'mobile safari uiwebview', 'chrome',
       'chrome 62.0 for ios', 'firefox', 'chrome 60.0 for android',
       'mobile safari 10.0', 'chrome 61.0 for android',
       'ie 11.0 for desktop', 'ie 11.0 for tablet', 'mobile safari 9.0',
       'chrome generic', 'other', 'chrome 59.0 for android',
       'firefox 56.0', 'android webview 4.0', 'chrome 55.0', 'opera 49.0',
       'ie', 'chrome 55.0 for android', 'firefox 52.0',
       'chrome 57.0 for android', 'chrome 56.0',
       'chrome 46.0 for android', 'chrome 58.0', 'firefox 48.0',
       'chrome 59.0', 'samsung browser 4.0', 'edge 13.0',
       'chrome 53.0 for android', 'chrome 58.0 for android',
       'chrome 60.0', 'mobile sa

In [8]:
#the labels are found in the transaction set

train_transaction = pd.read_csv(train_transaction_path, on_bad_lines="warn", index_col="TransactionID")
train_transaction.head()

TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,dist1,dist2,P_emaildomain,R_emaildomain,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,D11,D12,D13,D14,D15,M1,M2,M3,M4,M5,M6,M7,M8,M9,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,V29,V30,V31,V32,V33,V34,V35,V36,V37,V38,V39,V40,V41,V42,V43,V44,V45,V46,...,V240,V241,V242,V243,V244,V245,V246,V247,V248,V249,V250,V251,V252,V253,V254,V255,V256,V257,V258,V259,V260,V261,V262,V263,V264,V265,V266,V267,V268,V269,V270,V271,V272,V273,V274,V275,V276,V277,V278,V279,V280,V281,V282,V283,V284,V285,V286,V287,V288,V289,V290,V291,V292,V293,V294,V295,V296,V297,V298,V299,V300,V301,V302,V303,V304,V305,V306,V307,V308,V309,V310,V311,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321,V322,V323,V324,V325,V326,V327,V328,V329,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,credit,315.0,87.0,19.0,,,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,14.0,,13.0,,,,,,,13.0,13.0,,,,0.0,T,T,T,M2,F,T,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,credit,325.0,87.0,,,gmail.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,0.0,,,,,0.0,,,,M0,T,T,,,,,,,,,,,,,,,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,debit,330.0,87.0,287.0,,outlook.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,0.0,315.0,,,,315.0,T,T,T,M0,F,F,F,F,F,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,debit,476.0,87.0,,,yahoo.com,,2.0,5.0,0.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,1.0,0.0,25.0,1.0,112.0,112.0,0.0,94.0,0.0,,,,,84.0,,,,,111.0,,,,M0,T,F,,,,,,,,,,,,,,,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,28.0,0.0,0.0,0.0,0.0,10.0,0.0,4.0,0.0,0.0,1.0,1.0,1.0,1.0,38.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,50.0,1758.0,925.0,0.0,354.0,0.0,135.0,0.0,0.0,0.0,50.0,1404.0,790.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,credit,420.0,87.0,,,gmail.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
#so we need to concatenate the identity and transaction sets based on TransactionID, which we set as our index column

train_identity = pd.read_csv(train_identity_path, on_bad_lines="warn", index_col="TransactionID")
train_transaction = pd.read_csv(train_transaction_path, on_bad_lines="warn", index_col="TransactionID")
#test_identity = pd.read_csv(test_identity_path, on_bad_lines="warn", index_col="TransactionID")
#test_transaction = pd.read_csv(test_transaction_path, on_bad_lines="warn", index_col="TransactionID")

df_train = pd.concat([train_transaction, train_identity], axis=1, sort=False)
#df_test = pd.concat([test_transaction, test_identity], axis=1, sort=False)
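
As a minimal sketch of why the index matters here, the toy frames below (hypothetical data, reusing a few values from the previews above) show that `pd.concat` with `axis=1` aligns rows on the shared `TransactionID` index and fills missing identity rows with NaN:

```python
import pandas as pd

# Toy stand-ins for the two sets; only one transaction has identity data.
transactions = pd.DataFrame(
    {"isFraud": [0, 0, 0], "TransactionAmt": [68.5, 29.0, 50.0]},
    index=pd.Index([2987000, 2987001, 2987004], name="TransactionID"),
)
identity = pd.DataFrame(
    {"id_30": ["Android 7.0"], "id_31": ["samsung browser 6.2"]},
    index=pd.Index([2987004], name="TransactionID"),
)

# axis=1 places the columns side by side, matching rows by index label.
combined = pd.concat([transactions, identity], axis=1, sort=False)
print(combined)
```

Transactions without a matching identity row keep their place and simply carry NaN in the identity columns, which is the pattern visible in the `df_train.head()` preview.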


In [10]:
df_train.shape

(590540, 433)

In [11]:
#as a contrivance to make the influence of our string parse target features more prominent
#we'll base our evaluation only on the two target features from the identity set
#and the top ten features from the transaction set,
#dropping all remaining features other than the labels
#we derived this top ten list from a feature importance evaluation by automunge(.) not shown

topten = ['card6', 'C13', 'C1', 'C14', 'V317', \
          'V318', 'P_emaildomain', 'TransactionAmt', 'C11', 'TransactionDT']

targets_for_stringparse = ['id_30', 'id_31']

labels = ['isFraud']

retainedcolumns = topten + targets_for_stringparse + labels

df_train = df_train[retainedcolumns]

df_train.head()


TransactionID,card6,C13,C1,C14,V317,V318,P_emaildomain,TransactionAmt,C11,TransactionDT,id_30,id_31,isFraud
2987000,credit,1.0,1.0,1.0,117.0,0.0,,68.5,2.0,86400,,,0
2987001,credit,1.0,1.0,1.0,0.0,0.0,gmail.com,29.0,1.0,86401,,,0
2987002,debit,1.0,1.0,1.0,0.0,0.0,outlook.com,59.0,1.0,86469,,,0
2987003,debit,25.0,2.0,1.0,1404.0,790.0,yahoo.com,50.0,1.0,86499,,,0
2987004,credit,1.0,1.0,1.0,0.0,0.0,gmail.com,50.0,1.0,86506,Android 7.0,samsung browser 6.2,0


## Scenario 1

### one-hot encoding

In [12]:
#we'll start by applying 'text'
#to our two target features id_30, id_31

#(text is one-hot encoding)

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(df_train, df_test = False, \
             labels_column = label_column, \
             randomseed = 42, eval_ratio = .0001, \
             pandasoutput = True, \
             featureselection = True, featuremethod = 'report', \
             ML_cmnd = {'autoML_type':'randomforest', \
                        'MLinfill_cmnd':{'RandomForestClassifier':{'n_estimators':222}}}, \
             assigncat = {'text':['id_30', 'id_31']}, \
             processdict = {}, transformdict = {}, \
             printstatus = False)

In [13]:
print("base accuracy")
print(featureimportance['FS_sorted']['baseaccuracy'])
print()
print("feature importance metric for 'id_30'")
print(featureimportance['FS_sorted']['column_key']['id_30'])
print()
print("feature importance metric for 'id_31'")
print(featureimportance['FS_sorted']['column_key']['id_31'])


base accuracy
0.9802892268093609

feature importance metric for 'id_30'
0.0013546923155078883

feature importance metric for 'id_31'
0.004902292816744036
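
For reference, one-hot encoding can be sketched in a few lines of plain Python. This is only an illustration of the idea on toy entries, not Automunge's `text` implementation; the helper name `one_hot` is hypothetical:

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per unique category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

entries = ["Windows 10", "iOS 11.1.2", "Windows 10", "Linux"]
categories, rows = one_hot(entries)

print(categories)
for entry, row in zip(entries, rows):
    print(entry, row)
```

Each unique string becomes its own column, so entries such as 'iOS 11.1.2' and 'iOS 11.1.1' share no information in this representation.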


## Scenario 2

### ordinal encoding

In [14]:
#now let's try running again and applying 'ord3'
#to our two target features id_30, id_31

#(ord3 is an ordinal encoding sorted by frequency)

#as in the other scenarios, we use the model trained
#as part of feature importance to measure the result

#featuremethod = 'report' only returns feature importance results,
#which saves the time of processing data

#eval_ratio is set to .0001 to speed it up a little

#and n_estimators is increased for the Random Forest call

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(df_train, df_test = False, \
             labels_column = label_column, \
             randomseed = 42, eval_ratio = .0001, \
             pandasoutput = True, \
             featureselection = True, featuremethod = 'report', \
             ML_cmnd = {'autoML_type':'randomforest', \
                        'MLinfill_cmnd':{'RandomForestClassifier':{'n_estimators':222}}}, \
             assigncat = {'ord3':['id_30', 'id_31']}, \
             processdict = {}, transformdict = {}, \
             printstatus = False)

In [15]:
print("base accuracy")
print(featureimportance['FS_sorted']['baseaccuracy'])
print()
print("feature importance metric for 'id_30'")
print(featureimportance['FS_sorted']['column_key']['id_30'])
print()
print("feature importance metric for 'id_31'")
print(featureimportance['FS_sorted']['column_key']['id_31'])


base accuracy
0.9803992955599959

feature importance metric for 'id_30'
0.0019304365495986797

feature importance metric for 'id_31'
0.005808243302739879
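
A frequency-sorted ordinal encoding can be sketched as below. This is an illustration only: the helper name is hypothetical, and the choice that the most frequent entry receives integer 0 is an assumption rather than a statement about how `ord3` assigns its integers.

```python
from collections import Counter

def ordinal_by_frequency(values):
    """Map each category to an integer, assuming most frequent first."""
    counts = Counter(values)
    mapping = {cat: i for i, (cat, _) in enumerate(counts.most_common())}
    return mapping, [mapping[v] for v in values]

entries = ["chrome 62.0", "edge 15.0", "chrome 62.0",
           "firefox", "chrome 62.0", "edge 15.0"]
mapping, codes = ordinal_by_frequency(entries)

print(mapping)
print(codes)
```

The whole feature is carried in a single integer column, in contrast to the one column per category of one-hot encoding.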


## Scenario 3

### binary encoding

In [16]:
#now let's try running again and applying '1010'
#to our two target features id_30, id_31

#(1010 is binary encoding)

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(df_train, df_test = False, \
             labels_column = label_column, \
             randomseed = 42, eval_ratio = .0001, \
             pandasoutput = True, \
             featureselection = True, featuremethod = 'report', \
             ML_cmnd = {'autoML_type':'randomforest', \
                        'MLinfill_cmnd':{'RandomForestClassifier':{'n_estimators':222}}}, \
             assigncat = {'1010':['id_30', 'id_31']}, \
             processdict = {}, transformdict = {}, \
             printstatus = False)

In [17]:
print("base accuracy")
print(featureimportance['FS_sorted']['baseaccuracy'])
print()
print("feature importance metric for 'id_30'")
print(featureimportance['FS_sorted']['column_key']['id_30'])
print()
print("feature importance metric for 'id_31'")
print(featureimportance['FS_sorted']['column_key']['id_31'])


base accuracy
0.9804500965218275

feature importance metric for 'id_30'
0.0024469129948859747

feature importance metric for 'id_31'
0.006985132251837278
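
Binary encoding writes each category's integer index as a fixed-width bit pattern, so n categories need only about log2(n) columns. The sketch below illustrates the idea with a hypothetical helper; it is not the `1010` implementation.

```python
def binary_encode(values):
    """Return (categories, width, rows) with each category as a bit pattern."""
    categories = sorted(set(values))
    # Number of bits needed to represent the largest index (at least one).
    width = max(1, (len(categories) - 1).bit_length())
    index = {c: i for i, c in enumerate(categories)}
    rows = [
        [(index[v] >> bit) & 1 for bit in reversed(range(width))]
        for v in values
    ]
    return categories, width, rows

entries = ["Android", "Linux", "Mac", "Windows 10", "iOS"]
categories, width, rows = binary_encode(entries)

for entry, row in zip(entries, rows):
    print(entry, row)
```

Unlike one-hot encoding, several columns may be active at once for a single entry, which is the property the string parsing transforms reuse when consolidating activations.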


## Scenario 4

### 'or19' string parsing applied to id_30, id_31

In [18]:
#now let's try running again and applying 'or19'
#to our two target features id_30, id_31

#(or19 was described in detail in the paper)

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(df_train, df_test = False, \
             labels_column = label_column, \
             randomseed = 42, eval_ratio = .0001, \
             pandasoutput = True, \
             featureselection = True, featuremethod = 'report', \
             ML_cmnd = {'autoML_type':'randomforest', \
                        'MLinfill_cmnd':{'RandomForestClassifier':{'n_estimators':222}}}, \
             assigncat = {'or19':['id_30', 'id_31']}, \
             processdict = {}, transformdict = {}, \
             printstatus = False)

In [19]:
print("base accuracy")
print(featureimportance['FS_sorted']['baseaccuracy'])
print()
print("feature importance metric for 'id_30'")
print(featureimportance['FS_sorted']['column_key']['id_30'])
print()
print("feature importance metric for 'id_31'")
print(featureimportance['FS_sorted']['column_key']['id_31'])


base accuracy
0.9808226369085922

feature importance metric for 'id_30'
0.0029464557862295404

feature importance metric for 'id_31'
0.009144173129677968


## Scenario 5

### 'or23' string parsing applied to id_30, id_31

In [20]:
#now let's try running again and applying 'or23'
#to our two target features id_30, id_31

#where or23 applies an upstream UPCS followed by sp19 supplemented with nmcm and ord3

#where sp19 is string parsing with concurrent activations consolidated into binary encoding
#nmrc extracts numeric portions of entries
#and ord3 is an ordinal encoding sorted by frequency

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(df_train, df_test = False, \
             labels_column = label_column, \
             randomseed = 42, eval_ratio = .0001, \
             pandasoutput = True, \
             featureselection = True, featuremethod = 'report', \
             ML_cmnd = {'autoML_type':'randomforest', \
                        'MLinfill_cmnd':{'RandomForestClassifier':{'n_estimators':222}}}, \
             assigncat = {'or23':['id_30', 'id_31']}, \
             printstatus = False)
                        

In [21]:
print("base accuracy")
print(featureimportance['FS_sorted']['baseaccuracy'])
print()
print("feature importance metric for 'id_30'")
print(featureimportance['FS_sorted']['column_key']['id_30'])
print()
print("feature importance metric for 'id_31'")
print(featureimportance['FS_sorted']['column_key']['id_31'])


base accuracy
0.980839570562536

feature importance metric for 'id_30'
0.002887187997426155

feature importance metric for 'id_31'
0.009135706302705993
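
To give a feel for what the string parsing transforms look for, here is a much simplified sketch. It is not the Automunge implementation: both helper names are hypothetical, the overlap check only tests whether one whole entry appears inside another, and the numeric extraction takes only the first number it finds.

```python
import re

def numeric_portion(entry):
    """Return the first number found in the entry as a float, else None."""
    match = re.search(r"\d+(?:\.\d+)?", entry)
    return float(match.group()) if match else None

def shared_substrings(values, min_len=3):
    """Map each unique entry to the other unique entries that contain it."""
    unique = sorted(set(values))
    return {
        short: [other for other in unique if other != short and short in other]
        for short in unique
        if len(short) >= min_len
    }

entries = ["chrome 62.0", "chrome 62.0 for android", "chrome",
           "firefox 57.0", "firefox"]

overlaps = shared_substrings(entries)
print(overlaps["chrome"])
print([numeric_portion(e) for e in entries])
```

Overlaps such as 'chrome' appearing inside 'chrome 62.0 for android' are the kind of shared structure that one-hot, ordinal, and binary encodings discard.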


## Results

The results of the evaluation are summarized here:

| scenario   | encoding            | base accuracy | id_30 metric | id_31 metric |
|------------|---------------------|---------------|--------------|--------------|
| scenario 1 | one-hot encoding    | 0.980289      | 0.001354     | 0.004902     |
| scenario 2 | ordinal encoding    | 0.980399      | 0.001930     | 0.005808     |
| scenario 3 | binary encoding     | 0.980450      | 0.002446     | 0.006985     |
| scenario 4 | or19 string parsing | 0.980822      | 0.002946     | 0.009144     |
| scenario 5 | or23 string parsing | 0.980839      | 0.002887     | 0.009135     |



Here the base accuracy represents the feature importance model
trained on the entire data set, and the metrics are derived in automunge(.) by shuffle permutation, in other words by
evaluating accuracy impact of shuffling the target feature set.
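
The shuffle permutation idea can be sketched with a toy model in plain Python. This is an illustration under stated assumptions, not the automunge(.) internals: the helper names are hypothetical, and reporting the metric as base accuracy minus shuffled accuracy is an assumed sign convention.

```python
import random

def accuracy(predict, rows, labels):
    correct = sum(predict(row) == label for row, label in zip(rows, labels))
    return correct / len(labels)

def permutation_metric(predict, rows, labels, column, seed=42):
    """Return (base accuracy, accuracy drop after shuffling one column)."""
    base = accuracy(predict, rows, labels)
    shuffled_values = [row[column] for row in rows]
    random.Random(seed).shuffle(shuffled_values)
    shuffled_rows = [
        row[:column] + [value] + row[column + 1:]
        for row, value in zip(rows, shuffled_values)
    ]
    return base, base - accuracy(predict, shuffled_rows, labels)

# Toy data: the label depends only on column 0; column 1 is noise.
rng = random.Random(0)
rows = [[rng.randint(0, 1), rng.randint(0, 1)] for _ in range(200)]
labels = [row[0] for row in rows]
predict = lambda row: row[0]  # stand-in for a trained model

base, drop_signal = permutation_metric(predict, rows, labels, column=0)
_, drop_noise = permutation_metric(predict, rows, labels, column=1)
print(base, drop_signal, drop_noise)
```

Shuffling a column the model relies on lowers accuracy, while shuffling an ignored column leaves it unchanged, so a larger metric indicates a more influential feature.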

Here we see that the base accuracy of the model benefited
from both types of string parsing, the or19 in scenario 4 and the
or23 in scenario 5 (in comparison to scenarios 1 through 3).

The larger feature importance metrics in scenarios 4 & 5 for the two features
also indicate that the string parsing operation had a positive influence.

By the per-feature metrics, the or19 version of string parsing from scenario 4
was slightly more beneficial than the or23 version from scenario 5,
although the or23 base accuracy was marginally higher.

More information on string parsed encodings is available in the paper [String Theory
Parsed Categoric Encodings with Automunge](https://medium.com/automunge/string-theory-acbd208eb8ca).