# Feature Engineering w/ h2o.ai | Word2Vec emp_title

Lending Club's loan data from 2015 and earlier includes a free-text emp_title field that plausibly carries predictive information about loan_status. This notebook uses H2O's Word2Vec algorithm to learn word embeddings from the employment titles, checks them by looking up synonyms, and averages the word vectors into one numeric feature vector per title.

Read more about Word2Vec here:

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/word2vec.html

In [1]:
!pip install requests
!pip install tabulate
!pip install scikit-learn
!pip install colorama
!pip install future



In [73]:
import h2o
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O cluster uptime:,50 mins 24 secs
H2O cluster version:,3.16.0.3
H2O cluster version age:,6 days
H2O cluster name:,H2O_from_python_darrenklee_zw6bx3
H2O cluster total nodes:,1
H2O cluster free memory:,1.591 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4
H2O cluster status:,"locked, healthy"
H2O connection url:,http://localhost:54321


In [74]:
import pandas as pd
import numpy as np

In [98]:
path = "/Users/darrenklee/Desktop/DS-Project/emp_title_loan_status_data.csv"

# Keep the emp_title and loan_status columns; ascharacter() casts both to strings.
df = h2o.import_file(path=path)[:,1:3].ascharacter()

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [99]:
df.head()

emp_title,loan_status
Emergency Department technician,0
Corporate Insurance,0
Teacher,0
Regional Sales Direcdtor,1
hvac technician,1
Lead manufacturing,0
Engineer,0
Contract Specialist,0
Systems Support Engineer,0
Application Analyst,0




In [129]:
df.describe()

Rows:183660
Cols:2




,emp_title,loan_status
type,string,string
mins,,
mean,,
maxs,,
sigma,,
zeros,0,0
missing,0,0
0,Emergency Department technician,0
1,Corporate Insurance,0
2,Teacher,0


In [101]:
import h2o
h2o.init()
from h2o.estimators.word2vec import H2OWord2vecEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator

Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O cluster uptime:,1 hour 13 mins
H2O cluster version:,3.16.0.3
H2O cluster version age:,6 days
H2O cluster name:,H2O_from_python_darrenklee_zw6bx3
H2O cluster total nodes:,1
H2O cluster free memory:,1.510 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4
H2O cluster status:,"locked, healthy"
H2O connection url:,http://localhost:54321


In [102]:
STOP_WORDS = ["ax","i","you","edu","s","t","m","subject","can","lines","re","what",
               "there","all","we","one","the","a","an","of","or","in","for","by","on",
               "but","is","in","a","not","with","as","was","if","they","are","this","and","it","have",
               "from","at","my","be","by","not","that","to","from","com","org","like","likes","so"]

In [103]:
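def tokenize(sentences, stop_word = STOP_WORDS):
    # Split each title on runs of non-word characters, then lowercase the tokens.
    tokenized = sentences.tokenize("\\W+")
    tokenized_lower = tokenized.tolower()
    # Keep tokens of at least two characters, plus the NA rows that separate titles.
    tokenized_filtered = tokenized_lower[(tokenized_lower.nchar() >= 2) | (tokenized_lower.isna()),:]
    # Drop tokens that contain a digit.
    tokenized_words = tokenized_filtered[tokenized_filtered.grep("[0-9]",invert=True,output_logical=True),:]
    # Drop stop words (using the stop_word argument, not the global), keeping the NA rows.
    tokenized_words = tokenized_words[(tokenized_words.isna()) | (~ tokenized_words.isin(stop_word)),:]
    return tokenized_words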
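
To see what these filters do to a single title, here is a minimal pure-Python sketch of the same four steps (split, lowercase, length filter, digit and stop-word filter). It is an illustration only, not H2O's implementation: `tokenize_title` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word list is shortened, and the NA rows that the H2O version keeps as title separators have no counterpart here.

```python
import re

# Hypothetical, shortened stop-word list used only for this illustration.
DEMO_STOP_WORDS = {"the", "a", "an", "of", "or", "in", "for", "and"}

def tokenize_title(title, stop_words=DEMO_STOP_WORDS):
    """Apply the same filters as tokenize() above to one plain string."""
    # Lowercase, then split on runs of non-word characters.
    tokens = re.split(r"\W+", title.lower())
    # Drop empty strings and one-character tokens.
    tokens = [t for t in tokens if len(t) >= 2]
    # Drop tokens that contain a digit.
    tokens = [t for t in tokens if not re.search(r"[0-9]", t)]
    # Drop stop words.
    return [t for t in tokens if t not in stop_words]

print(tokenize_title("Sr. Manager of the 2nd Shift"))
```

In this example "of" and "the" are removed as stop words and "2nd" is removed because it contains a digit, which leaves only the words that describe the job.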

In [104]:
def predict(df, w2v, gbm):
    # Tokenize the new titles, average their word vectors, then score them with the GBM.
    words = tokenize(h2o.H2OFrame(df).ascharacter())
    df_vec = w2v.transform(words, aggregate_method="AVERAGE")
    print(gbm.predict(test_data=df_vec))

In [105]:

print("Break job titles into sequence of words")
words = tokenize(df["emp_title"])

Break job titles into sequence of words


In [106]:
print("Build word2vec model")
w2v_model = H2OWord2vecEstimator(sent_sample_rate = 0.0, epochs = 10)
w2v_model.train(training_frame=words)

Build word2vec model
word2vec Model Build progress: |██████████████████████████████████████████| 100%


In [138]:
print("Sanity check - find synonyms for the word 'painter'")
w2v_model.find_synonyms("painter", count = 10)

Sanity check - find synonyms for the word 'painter'


OrderedDict([(u'glazier', 0.7945498824119568),
             (u'plumber', 0.7230222821235657),
             (u'carpenter', 0.6847513914108276),
             (u'blaster', 0.667998194694519),
             (u'electrician', 0.6575730443000793),
             (u'assembler', 0.6496596932411194),
             (u'pipefitter', 0.6402847766876221),
             (u'craftsman', 0.6321462392807007),
             (u'roofer', 0.6291592121124268),
             (u'working', 0.6070107817649841)])
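
The scores above read naturally as similarities between word vectors. The sketch below assumes they are cosine similarities (an assumption about `find_synonyms`, not something this notebook confirms) and ranks words against a query on made-up three-dimensional vectors. `toy_vectors`, `cosine_similarity` and `rank_synonyms` are hypothetical names, and the numbers are invented for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented 3-dimensional vectors; the real model uses 100 dimensions.
toy_vectors = {
    "painter": [1.0, 0.0, 1.0],
    "glazier": [1.0, 0.2, 0.8],
    "teacher": [-1.0, 1.0, 0.0],
}

def rank_synonyms(word, vectors, count=10):
    """Return up to `count` other words, most similar first."""
    target = vectors[word]
    scores = {
        other: cosine_similarity(target, vec)
        for other, vec in vectors.items()
        if other != word
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:count]

for other, score in rank_synonyms("painter", toy_vectors):
    print(other, round(score, 3))
```

Words that point in nearly the same direction as the query score close to 1, while unrelated words score near or below 0.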

In [125]:
print("Calculate a vector for each emp_title")
emp_title_vecs = w2v_model.transform(words, aggregate_method = "AVERAGE")

Calculate a vector for each emp_title
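
The `AVERAGE` aggregation collapses the word vectors of one title into a single vector. A minimal sketch of that idea follows, assuming that words missing from the vocabulary are skipped and that a title with no known words yields a missing value; both are assumptions about the transform, and `average_title_vector` and `toy_word_vectors` are hypothetical names with invented numbers.

```python
def average_title_vector(tokens, word_vectors):
    """Element-wise mean of the vectors of the known tokens, or None if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None  # stands in for the NA row that is filtered out later
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Invented 2-dimensional vectors; the real model produces 100 columns (C1..C100).
toy_word_vectors = {
    "systems": [1.0, 0.0],
    "engineer": [0.0, 1.0],
}

print(average_title_vector(["systems", "engineer"], toy_word_vectors))
print(average_title_vector(["direcdtor"], toy_word_vectors))
```

Under these assumptions a misspelled word simply drops out of the average, and only titles made up entirely of unknown words end up without a vector.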


In [152]:
data2 = df.cbind(emp_title_vecs)
data2.shape

(183660, 102)

In [126]:
print("Prepare training&validation data (keep only job titles made of known words)")
valid_emp_titles = ~ emp_title_vecs["C1"].isna()
data = df[valid_emp_titles,:].cbind(emp_title_vecs[valid_emp_titles,:])
data_split = data.split_frame(ratios=[0.8])

Prepare training&validation data (keep only job titles made of known words)
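
`split_frame` assigns rows to the subsets at random, so the 80/20 ratio is approximate rather than exact, and the call above passes no `seed`, so the split differs between runs. The sketch below mimics that behaviour on a plain list to show both effects; `split_rows` is a hypothetical helper, not an H2O function.

```python
import random

def split_rows(rows, ratio=0.8, seed=42):
    """Send each row to the training set with probability `ratio`, else to validation."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, valid = [], []
    for row in rows:
        if rng.random() < ratio:
            train.append(row)
        else:
            valid.append(row)
    return train, valid

rows = list(range(1000))
train, valid = split_rows(rows)
print(len(train), len(valid))
```

Passing a `seed` to `split_frame` in the same way keeps the training and validation sets stable when the notebook is re-run.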


In [144]:
data

emp_title,loan_status,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26,C27,C28,C29,C30,C31,C32,C33,C34,C35,C36,C37,C38,C39,C40,C41,C42,C43,C44,C45,C46,C47,C48,C49,C50,C51,C52,C53,C54,C55,C56,C57,C58,C59,C60,C61,C62,C63,C64,C65,C66,C67,C68,C69,C70,C71,C72,C73,C74,C75,C76,C77,C78,C79,C80,C81,C82,C83,C84,C85,C86,C87,C88,C89,C90,C91,C92,C93,C94,C95,C96,C97,C98,C99,C100
Emergency Department technician,0,-0.237751,-0.0729316,0.0894989,-0.139048,-0.182811,0.239338,0.00941441,0.0824646,-0.332516,-0.344786,0.227957,-0.193663,0.14162,0.0782099,-0.317593,0.100721,0.0503919,-0.101892,-0.380651,-0.155539,0.211975,-0.160445,0.0343821,0.114849,-0.266019,-0.17087,-0.0313839,0.0446568,-0.163996,0.0782537,0.160748,-0.0504089,-0.128587,0.406312,0.119234,-0.0112055,0.132531,0.0268563,-0.090749,0.220527,-0.00394953,0.443641,-0.169868,-0.156508,0.0884958,-0.0816054,0.189869,0.0377647,0.0448495,0.0268733,-0.0199931,-0.514803,0.117256,-0.0347837,-0.330134,0.241807,-0.00643388,-0.573299,-0.128931,0.0943115,-0.218511,0.3349,0.163035,-0.114219,0.121819,0.0171678,0.0486514,-0.173862,0.149217,0.356777,0.0638926,-0.0508236,-0.249609,-0.118606,0.268285,0.135556,-0.153362,-0.185457,0.129406,0.199583,-0.179948,-0.0128305,0.109334,0.0762421,-0.148976,0.162148,0.184708,0.0413629,-0.0431332,0.146621,0.0263277,-0.0819337,0.0930371,-0.240905,-0.0900113,0.0637557,-0.258116,-0.550784,-0.072068,0.314838
Corporate Insurance,0,0.136679,-0.144567,-0.219635,0.176496,0.0589623,0.115669,0.0956495,0.240872,-0.316507,-0.125302,0.225058,0.160342,-0.163577,0.0327881,-0.153917,-0.155174,0.0310965,-0.198121,0.23992,-0.1159,0.349041,-0.119669,-0.147016,0.345455,-0.477145,0.112978,0.45149,-0.3525,0.0841605,0.0959851,0.154836,0.151768,0.365497,-0.146185,0.241859,-0.0574159,0.219442,0.165528,-0.0866035,-0.408591,0.327945,0.772904,-0.00804143,-0.36964,0.229165,-0.0881138,0.115113,0.167157,0.128234,-0.280985,-0.213696,0.116153,-0.52627,0.225373,-0.333489,0.0850746,0.123853,-0.105959,0.343277,-0.0378136,0.306728,0.137913,0.130111,0.390109,-0.0933352,-0.00156192,-0.131371,-0.284189,0.190074,-0.178177,-0.35391,-0.0950503,0.353733,-0.348374,0.0182431,0.117761,0.14017,0.120819,0.275642,0.300094,0.273035,-0.416597,0.177683,-0.145328,0.0676711,-0.0101007,-0.0448569,-0.0682249,0.62681,-0.106503,0.00962499,-0.216761,-0.132466,0.0869347,0.115855,-0.268649,-0.298816,0.18687,0.140972,0.203839
Teacher,0,-0.420405,0.337638,-0.0418651,-0.713847,-0.55096,0.108866,0.171319,-0.330193,-0.873747,0.549869,0.0328137,-0.0576892,0.365431,-0.00504889,-0.421131,-0.0436886,-0.313658,0.309085,0.411751,0.0619689,0.0512558,-0.0219861,-0.416153,-0.897691,0.146338,-0.54097,-0.681878,-0.821806,0.421511,-0.0488635,0.644557,0.000863975,0.373951,-0.39089,-0.479522,-0.7585,0.164389,0.116286,-0.0350933,-0.486592,-0.781227,0.199507,-0.556219,-0.509122,0.499199,-0.143516,0.181543,-0.160599,0.295796,-0.523104,-0.217884,-0.264076,0.592124,-1.00842,-0.158945,0.221271,-0.50484,0.758098,-0.0792309,0.420502,-0.549695,-0.46556,-0.305701,-0.659274,-0.485684,-0.330519,0.266906,-0.631643,0.483587,0.674028,0.192482,-0.184162,-0.38473,0.122553,-0.371431,-0.415303,0.0378919,-0.00271226,-0.412394,0.0226645,-0.179652,-0.0218947,0.131175,0.730714,-0.695195,-0.416138,-0.524377,-0.219182,0.014905,-0.336285,0.520902,0.33412,-0.838555,0.173891,-0.727107,-0.246951,-0.474045,-0.0591594,0.952233,0.324939
Regional Sales Direcdtor,1,0.398176,0.0716235,-0.295504,0.082468,0.415381,-0.258422,0.184185,-0.0737997,0.490681,-0.207707,0.371796,-0.22867,-0.102725,-0.0743461,-0.311047,0.0111619,0.256179,0.104013,0.0728996,-0.0256856,0.0853533,-0.00739908,0.0527599,0.116054,-0.281037,-0.0478498,0.229697,-0.133704,0.105423,0.229776,-0.000675047,0.182249,0.144669,0.405736,0.358409,-0.125551,0.0174678,0.251763,-0.354572,-0.0957421,-0.24682,-0.0757709,-0.149326,-0.0373113,-0.0524477,0.0217119,0.289294,0.163319,0.177032,-0.066881,-0.06129,0.036511,-0.194871,0.0516559,0.0516741,0.481722,-0.269713,0.0924776,0.0516088,0.240429,0.262264,0.0758954,0.124252,-0.0293645,-0.13306,-0.013596,-0.0489953,-0.30802,0.108015,0.00581424,0.0800567,0.0812842,0.143877,0.0814111,0.145227,0.264934,-0.260739,0.1226,0.0762803,-0.0323156,0.0655467,-0.0504784,0.141709,0.0730022,0.126606,0.020015,-0.055066,-0.173311,0.37401,-0.0619877,-0.0926827,-0.258902,-0.0123698,-0.316318,0.367773,0.0356743,0.075773,0.139615,0.108686,0.422246
hvac technician,1,-0.212824,-0.113866,0.238909,0.357511,-0.287052,-0.00832681,0.352402,0.0994346,-0.37568,-0.141079,-0.141427,-0.147803,-0.00987143,0.0461876,-0.428041,0.405577,-0.268525,-0.069638,-0.221971,0.198546,0.101741,-0.085888,0.306565,0.204161,0.0575542,-0.0563551,-0.084572,0.446016,-0.0450018,-0.0933209,0.215862,-0.198794,-0.198299,0.509787,-0.232565,0.330041,-0.161348,-0.132127,-0.566176,0.270611,0.0947441,0.587291,-0.118882,-0.0981318,0.278887,0.152655,-0.13732,0.286077,-0.493493,-0.13369,0.163839,-0.529656,-0.341545,-0.178871,-0.0743705,0.3972,-0.273283,-0.204732,0.0520847,0.189519,-0.503988,0.259151,0.0996019,0.032785,0.0619611,0.409164,0.307695,-0.0623976,-0.219813,0.478638,0.0975299,-0.153238,-0.37579,-0.212904,0.113359,0.190329,-0.411136,0.0564018,-0.0763657,0.18912,-0.1551,-0.139509,-0.205682,-0.207922,0.0434312,0.243771,0.210408,-0.348253,0.0809473,0.503262,-0.184388,0.122519,0.265062,-0.325889,-0.0642895,0.019013,0.0932365,-0.25059,-0.0891076,0.0415649
Lead manufacturing,0,0.0680732,-0.159856,-0.0654807,0.188518,-0.0399866,-0.555612,0.204821,-0.130744,-0.0383909,-0.181632,0.114323,-0.169012,-0.256556,-0.0222704,-0.0245958,0.0258956,0.104345,-0.116415,-0.0895325,-0.0269214,0.177575,-0.140527,-0.0870932,0.263223,0.297084,-0.237988,-0.170505,0.213833,-0.102811,0.178548,0.179535,0.0621667,-0.308095,-0.0987662,0.0597309,-0.115749,-0.121424,0.285265,-0.00703202,-0.0987914,0.0529274,-0.264932,0.0826302,0.0116705,0.126316,0.072078,-0.105755,0.186419,0.0309582,-0.047685,0.111183,-0.485062,-0.0955053,-0.268279,0.10279,-0.114846,-0.177691,0.0327783,-0.193541,-0.0575981,0.0337624,0.249736,0.00107761,-0.0318207,-0.0111274,0.234071,0.131584,-0.173733,0.184368,0.127015,-0.12085,0.0125986,0.43148,-0.215699,0.0539528,0.189225,-0.111726,0.287554,0.0520448,0.265933,0.486666,-0.253434,-0.0454413,0.12524,-0.0710634,0.0109541,0.177445,-0.359155,0.0897875,0.305256,-0.109186,-0.00403148,-0.0712729,0.026817,0.182758,0.509435,0.186689,-0.488411,0.110295,-0.15809
Engineer,0,0.0432906,-0.280989,-0.26634,0.592988,0.175916,-0.22695,0.227675,-0.235035,0.327391,0.0547116,-0.111492,-0.413858,-0.432478,-0.0410874,0.552479,0.453357,-0.024748,0.0332593,-0.583987,-0.152869,0.568207,-0.067148,-0.16812,-0.144658,0.0173439,-0.0620938,0.196309,0.921263,-0.00786382,-0.0807914,-0.585245,-0.427552,-0.426994,-0.246288,-0.214203,-0.269895,-0.427617,-0.333581,0.174216,0.229601,0.636433,0.878366,0.244099,-0.121457,-0.163686,0.11907,-0.000216687,0.24091,-0.615335,-0.104063,-0.0217467,-0.072378,-0.451811,0.0361581,-0.315426,0.305117,-0.0902097,-0.0379103,-0.344062,0.5646,-0.369861,-0.321474,-0.125033,-0.697259,-0.380082,0.318876,0.515675,-0.440788,0.124273,-0.147038,0.269242,-0.390634,-0.000580059,0.0678039,0.0253659,0.480072,-0.413799,-0.026174,-0.398679,0.275476,0.417214,0.119687,-0.0534679,0.384587,-0.36488,-0.27703,0.210551,-0.320878,-0.220984,0.250062,0.13378,-0.329876,0.510862,-0.332151,-0.109394,-0.197412,-0.167017,0.220023,-0.0298212,-0.0992493
Contract Specialist,0,0.032775,-0.0320274,-0.00896089,0.0706747,-0.00265171,0.245287,0.271131,0.228285,-0.00566971,0.12842,-0.348535,-0.245882,0.238621,0.356025,0.0165622,-0.0307522,0.115744,-0.497511,0.014185,-0.0142435,0.370156,-0.148665,-0.105429,0.0133449,-0.163834,-0.259436,0.229749,-0.130434,-0.192055,-0.0907753,-0.0996999,-0.0204409,0.128266,-0.304653,-0.138106,0.0971431,0.105038,-0.0195391,0.296802,0.139417,0.389331,0.16439,-0.062506,0.0183705,0.125671,-0.255377,-0.104514,0.332355,0.120258,-0.0182978,-0.233178,-0.178254,-0.294359,0.31825,-0.145995,0.190266,-0.0750366,0.205487,0.102891,0.0932801,-0.113988,0.174835,-0.176222,0.0781593,-0.086329,-0.032235,0.186685,-0.371314,0.281652,0.107179,0.00993429,-0.164902,0.25046,0.219706,-0.176226,-0.127815,0.00600795,-0.050586,0.0525248,-0.0208949,0.154136,-0.184871,0.0965039,-0.0854458,-0.356269,0.15288,-0.123131,-0.0583567,0.492171,0.269765,0.0765704,-0.367327,0.143632,-0.0163462,0.0521141,0.0845637,-0.251717,0.210503,0.0495196,-0.0316647
Systems Support Engineer,0,-0.0276228,0.017621,-0.258733,0.121914,0.214669,-0.192732,0.301617,-0.0347885,0.150109,0.0497113,-0.0893598,-0.174545,-0.0969793,0.0167976,0.181608,0.0931504,0.115854,-0.11659,-0.368549,-0.229688,0.381052,-0.352805,-0.196527,0.0407665,0.0967526,-0.108001,0.0173934,0.239068,-0.0743567,-0.193971,-0.416381,0.0812096,-0.00896775,-0.166121,-0.0594778,-0.148953,-0.42846,-0.23101,0.0614749,0.12338,0.473516,0.343354,0.25497,-0.0678564,0.17556,0.157521,0.231403,0.360775,-0.0667373,0.0841115,0.136411,-0.162928,-0.295597,-0.0576928,-0.0688471,0.339954,-0.325464,0.0660646,-0.126977,0.179745,-0.278522,-0.237899,-0.0504514,-0.190617,-0.239967,0.274576,0.361448,-0.188177,0.453891,-0.0288194,-0.0244445,-0.210557,0.230709,0.235251,-0.090175,0.23518,-0.189493,0.0152205,-0.0466626,0.214356,0.110696,-0.067462,-0.261819,0.20054,-0.297887,-0.0878498,-0.0781877,-0.332607,-0.118286,0.333861,0.189326,0.0471708,0.319544,-0.255441,0.0804612,0.038272,-0.189709,-0.0868916,0.00664527,-0.0433095
Application Analyst,0,0.111489,-0.342902,-0.243788,0.100499,0.0674149,-0.480227,0.357206,0.132527,-0.0714135,0.323351,-0.224174,-0.277343,0.358129,0.0275049,0.275417,0.496461,-0.0266093,-0.303108,-0.0272624,0.0233991,0.560595,-0.0877066,0.00885085,-0.110022,-0.0532038,-0.025626,0.250557,0.158136,0.106147,-0.563872,-0.497721,0.076148,0.0448917,0.0655454,-0.341709,0.299274,-0.423372,-0.0699574,0.190981,-0.373467,0.32387,0.187745,-0.0411242,-0.252908,0.0864685,0.210877,-0.0101816,0.17926,0.35508,-0.0166486,0.0256083,-0.0974841,-0.63264,-0.0838824,-0.134492,-0.144004,-0.270256,0.0990946,0.0741996,0.352776,0.0345127,-0.000723168,-0.127703,-0.0482076,0.0142643,0.285632,0.199092,-0.200344,0.446485,0.154172,-0.0301932,-0.260289,0.288735,0.188451,0.154465,-0.317281,-0.275319,-0.0501562,-0.0859219,-0.0834627,0.147849,-0.213836,-0.2452,-0.0067181,-0.200165,-0.309984,-0.242736,-0.205803,0.258131,0.280163,0.379537,-0.343447,-0.139874,0.0231892,0.601618,-0.153325,0.179624,-0.354828,0.210252,-0.18189




In [145]:
# pandas and numpy for exporting the data to a csv file (the export call itself is not shown)
import pandas as pd
import numpy as np

In [147]:
data.shape

(180614, 102)

In [131]:
print("Build a basic GBM model")
gbm_model = H2OGradientBoostingEstimator()
gbm_model.train(x = emp_title_vecs.names,
                y="loan_status", 
                training_frame = data_split[0], 
                validation_frame = data_split[1])

Build a basic GBM model
gbm Model Build progress: | (failed)


EnvironmentError: Job with key $03017f00000132d4ffffffff$_b660ce92acd31ba66461f5324c7e4f12 failed with an exception: DistributedException from /127.0.0.1:54321: 'Operation not allowed on string vector.', caused by java.lang.IllegalArgumentException: Operation not allowed on string vector.
stacktrace: 
DistributedException from /127.0.0.1:54321: 'Operation not allowed on string vector.', caused by java.lang.IllegalArgumentException: Operation not allowed on string vector.
	at water.MRTask.getResult(MRTask.java:478)
	at water.MRTask.getResult(MRTask.java:486)
	at water.MRTask.doAll(MRTask.java:390)
	at water.MRTask.doAll(MRTask.java:377)
	at water.MRTask.doAll(MRTask.java:376)
	at hex.tree.SharedTree.getInitialValue(SharedTree.java:981)
	at hex.tree.gbm.GBM.access$1800(GBM.java:24)
	at hex.tree.gbm.GBM$GBMDriver.initializeModelSpecifics(GBM.java:177)
	at hex.tree.SharedTree$Driver.computeImpl(SharedTree.java:353)
	at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:206)
	at water.H2O$H2OCountedCompleter.compute(H2O.java:1263)
	at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
	at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
	at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
	at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
	at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.IllegalArgumentException: Operation not allowed on string vector.
	at water.fvec.CStrChunk.atd_impl(CStrChunk.java:90)
	at water.fvec.Chunk.atd(Chunk.java:260)
	at hex.tree.SharedTree$InitialValue.map(SharedTree.java:1005)
	at water.MRTask.compute2(MRTask.java:639)
	at water.MRTask.compute2(MRTask.java:591)
	at water.MRTask.compute2(MRTask.java:591)
	at water.MRTask.compute2(MRTask.java:591)
	at water.MRTask.compute2(MRTask.java:591)
	at water.H2O$H2OCountedCompleter.compute1(H2O.java:1266)
	at hex.tree.SharedTree$InitialValue$Icer.compute1(SharedTree$InitialValue$Icer.java)
	at water.H2O$H2OCountedCompleter.compute(H2O.java:1262)
	... 5 more
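
The trace fails inside `SharedTree.getInitialValue`, where GBM tries to read the response as a number and hits a string chunk. That matches the `describe()` output earlier, which lists `loan_status` with type `string` after the `.ascharacter()` call at import. The sketch below reproduces the same kind of failure in plain Python; a simple mean stands in for the initial value, and `initial_value` is a hypothetical helper, not H2O code.

```python
def initial_value(response):
    """Mean of the response, used here as a stand-in for a model's starting prediction."""
    return sum(response) / len(response)

# What the response looks like after being cast to strings.
string_labels = ["0", "0", "0", "1", "1"]

try:
    initial_value(string_labels)
except TypeError as err:
    print("string response fails:", err)

# Casting the labels back to a non-string type makes the computation possible.
numeric_labels = [int(label) for label in string_labels]
print(initial_value(numeric_labels))
```

In the notebook itself the likely remedy is to convert the response before splitting, for example with `data["loan_status"] = data["loan_status"].asfactor()` so that GBM treats it as a binary classification target. This has not been run against this dataset, so treat it as a suggestion rather than a verified fix.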
