# RPV Solo Disconnect Model

#### Garrett Lappe - September 2019

A very brief PoC for using machine learning to predict whether a phone number is disconnected.

##### Training data:

All archived RPV results as of 2019-09-25, the WSLive results for those numbers, and Kari's Disconnect Model results for those numbers.

##### Input: 
encoded RPV results (status, carrier, iscell) + area code + area_prefix. The pred_recall score from Kari's model is merged in below but is not part of the feature set in this version.

##### Target: 
WSLive COMMENTS = 'NOT IN SERVICE'

In [22]:
import pandas as pd
import pickle as pk
import numpy as np

from sklearn import metrics

### Getting all tested RPV numbers

In [5]:
archive_path = 'U:\\Source Files\\Data Analytics\\Data-Science\\Data\\RPV\\output\\_archive\\RPV_archive.csv'

rpv_archive = pd.read_csv(archive_path, dtype=object).drop_duplicates()  # some duplicates exist because I manually pasted in some
len(rpv_archive)

5076

In [6]:
rpv_archive.head()

Unnamed: 0,phone,status,error_text,iscell,carrier,date_checked
0,2012000318,connected,,N,Comcast of MD,2019-08-29
1,2012002626,connected,,N,MCImetro Former MCI,2019-08-29
2,2012040004,connected,,N,Level 3,2019-08-29
3,2012071052,connected,,Y,Verizon Wireless,2019-08-29
4,2012074846,connected,,Y,Verizon Wireless,2019-08-29


### Loading WSLive results from 2019

I previously saved off this data from Derek's wslive_results SAS table for 2019.

In [7]:
with open('wslive_2019.pk', 'rb') as f:  # close the file handle once the pickle is loaded
    wslive_2019 = pk.load(f)

##### Trimming...

In [8]:
wslive = wslive_2019[wslive_2019['WSLIVE_FILE_DT'] >= '2019-07-01']
len(wslive)

76676

In [None]:
keep = [
    'PHYSICIAN_ME_NUMBER',
    'OFFICE_TELEPHONE',
    'PHYSICIAN_FIRST_NAME',
    'PHYSICIAN_MIDDLE_NAME',
    'PHYSICIAN_LAST_NAME',
    'DEGREE',
    'OFFICE_ADDRESS_LINE_1',
    'OFFICE_ADDRESS_LINE_2',
    'OFFICE_ADDRESS_CITY',
    'OFFICE_ADDRESS_STATE',
    'OFFICE_ADDRESS_ZIP',
    'OFFICE_ADDRESS_VERIFIED_UPDATED',
    'OFFICE_PHONE_VERIFIED_UPDATED',
    'PRESENT_EMPLOYMENT_CODE',
    'PRESENT_EMPLOYMENT_UPDATED',
    'SPECIALTY',
    'SPECIALTY_UPDATED',
    'COMMENTS',
    'WSLIVE_FILE_DT'
]

wslive = wslive[keep]

### Loading scored numbers from Kari's Disconnect Model.

This file is a trimmed version of her model's output, containing only the phone number and the estimated probability.

In [9]:
scores = pd.read_csv('scores_recall.csv', dtype=object)

In [10]:
len(scores)  # number of records scored in PPD

507637

In [11]:
scores = scores.sort_values(by='pred_recall').groupby('ppd_telephone_number').first().reset_index()
len(scores)  # dropping duplicate numbers, keeping the minimum estimated probability.

262812
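
Because `scores` was read with `dtype=object`, `pred_recall` holds strings, so `sort_values` orders them lexicographically rather than numerically. A minimal sketch, on made-up rows, of converting to float before keeping the minimum score per number (the toy values and the `dedupe_min_score` helper name are assumptions, not part of the original notebook):

```python
import pandas as pd

def dedupe_min_score(scores, key='ppd_telephone_number', score='pred_recall'):
    """Keep one row per phone number: the one with the smallest numeric score."""
    out = scores.copy()
    out[score] = pd.to_numeric(out[score])  # strings -> floats so sorting is numeric
    out = out.sort_values(by=score).drop_duplicates(subset=key, keep='first')
    return out.reset_index(drop=True)

# Toy data (hypothetical values); as a string, '1e-05' sorts after '0.5'
toy = pd.DataFrame({
    'ppd_telephone_number': ['2012000318', '2012000318', '2012002626'],
    'pred_recall': ['0.5', '1e-05', '0.25'],
})
deduped = dedupe_min_score(toy)
print(deduped)
```

With string sorting, the row kept for the first number would be `'0.5'`; after the numeric conversion it is `1e-05`.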

### Merging WSLive and predicted probability

In [12]:
df = pd.merge(left=wslive, right=scores, left_on='OFFICE_TELEPHONE', right_on='ppd_telephone_number', how='inner')
len(df)

66401

In [13]:
df = df.sort_values(by='pred_recall').groupby('OFFICE_TELEPHONE').first().reset_index()
len(df)

57206

In [14]:
df.drop(columns='ppd_telephone_number', axis=1, inplace=True)

In [15]:
df.head()

Unnamed: 0,OFFICE_TELEPHONE,PHYSICIAN_ME_NUMBER,PHYSICIAN_FIRST_NAME,PHYSICIAN_MIDDLE_NAME,PHYSICIAN_LAST_NAME,SUFFIX,DEGREE,OFFICE_ADDRESS_LINE_1,OFFICE_ADDRESS_LINE_2,OFFICE_ADDRESS_CITY,...,COMMENTS,Source,WSLIVE_SOURCE,WSLIVE_FILE_DT,MATCH_ADDR,MATCH_PHONE,MATCH_ADDR_LONG,SPECIALTY,SPECIALTY_UPDATED,pred_recall
0,2012100200,12501790026,AMAL,,MEZHOUDI,,,,5301 BROADWAY,WEST NEW YORK,...,"MOVED, NO FORWARDING INFO",C,OTHERS,2019-08-02,1250179002530107093,12501790022012100200,,FM,1.0,0.60514144
1,2012160844,56108790011,VICTOR,,MARCHIONE,,,,600 PAVONIA AVE STE 5,JERSEY CITY,...,FAIL,C,FAIL,2019-08-30,561087900160007306,56108790012012160844,,,,0.555166029
2,2012162012,24316830265,XIAOLING,,LI,,,,241 ERIE ST,JERSEY CITY,...,FAIL,C,FAIL,2019-08-30,243168302624107310,24316830262012162012,,,,0.520751854
3,2012169791,49549820033,RAVI,,RATHI,,,,120 FRANKLIN ST,JERSEY CITY,...,FAIL,C,FAIL,2019-07-31,495498200312007307,49549820032012169791,,,,0.269932338
4,2012175600,30803930054,ELIZABETH,,RAMIREZ,,,,435 CENTRAL AVE,JERSEY CITY,...,FAIL,C,FAIL,2019-08-30,308039300543507307,30803930052012175600,,,,0.290065693


### Adding RPV results

In [16]:
df = df.merge(rpv_archive, left_on='OFFICE_TELEPHONE', right_on='phone', how='inner')
len(df)

3966
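
The inner join above keeps only numbers present in both tables, which is why the row count drops so sharply. A small sketch, on invented phone numbers, of using the `indicator` option of `DataFrame.merge` to count how many rows fall on each side before committing to an inner join (the toy frames stand in for `df` and `rpv_archive`):

```python
import pandas as pd

# Hypothetical stand-ins for df and rpv_archive
left = pd.DataFrame({'OFFICE_TELEPHONE': ['2010000001', '2010000002', '2010000003']})
right = pd.DataFrame({'phone': ['2010000002', '2010000003', '2010000004'],
                      'status': ['connected', 'disconnected', 'connected']})

# An outer join with indicator=True adds a '_merge' column: left_only / right_only / both
check = left.merge(right, left_on='OFFICE_TELEPHONE', right_on='phone',
                   how='outer', indicator=True)
match_counts = check['_merge'].value_counts()
print(match_counts)
```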

In [17]:
df.head()

Unnamed: 0,OFFICE_TELEPHONE,PHYSICIAN_ME_NUMBER,PHYSICIAN_FIRST_NAME,PHYSICIAN_MIDDLE_NAME,PHYSICIAN_LAST_NAME,SUFFIX,DEGREE,OFFICE_ADDRESS_LINE_1,OFFICE_ADDRESS_LINE_2,OFFICE_ADDRESS_CITY,...,MATCH_ADDR_LONG,SPECIALTY,SPECIALTY_UPDATED,pred_recall,phone,status,error_text,iscell,carrier,date_checked
0,2013422550,1201041607,HARSHPAL,,SINGH,,,,680 KINDERKAMACK RD STE 300,ORADELL,...,,NS,1.0,0.440718201,2013422550,connected,,N,Onvoy LLC former Neutral Tandem,2019-09-04
1,2013586776,13201810111,MIRTA,BEATRIZ,VEBER,,,,773 ROUTE 9W S,NYACK,...,,NPM,1.0,0.522008626,2013586776,connected,,N,Cablevision Corp,2019-09-04
2,2013639844,3508041801,DEAN,KIMTON,FONG,,,,185 BRIDGE PLZ N STE 206,FORT LEE,...,,,,0.522008626,2013639844,connected,,V,TimeWarnerCable,2019-09-04
3,2014181000,49528770044,AFZAL,J,SHEIKH,,,,308 WILLOW AVE,HOBOKEN,...,,AN,1.0,0.283836401,2014181000,connected,,N,Cablevision Corp,2019-09-25
4,2014386916,3306050973,ANDREW,HONGSIANG,THE,,,,130 ORIENT WAY STE BB,RUTHERFORD,...,,SME,1.0,0.555166029,2014386916,connected,,N,Comcast of MD,2019-09-04


### Encoding target

In [18]:
df['isDisconnected'] = df['COMMENTS'].apply(lambda x: 1 if x=='NOT IN SERVICE' else 0)

### Reducing data to RPV results + target

In [19]:
Xy = df[['phone','status','iscell','carrier','isDisconnected']]

### RPV's precision and recall by itself

In [20]:
Xy.head()

Unnamed: 0,phone,status,iscell,carrier,isDisconnected
0,2013422550,connected,N,Onvoy LLC former Neutral Tandem,0
1,2013586776,connected,N,Cablevision Corp,1
2,2013639844,connected,V,TimeWarnerCable,0
3,2014181000,connected,N,Cablevision Corp,0
4,2014386916,connected,N,Comcast of MD,0


In [23]:
rpv_pred = [1 if s in ['disconnected', 'disconnected-70'] else 0 for s in Xy['status'].values]
target = Xy['isDisconnected'].values

print('Precision:', round(metrics.precision_score(y_pred=rpv_pred, y_true=target),3))
print('Recall:', round(metrics.recall_score(y_pred=rpv_pred, y_true=target),3))


Precision: 0.341
Recall: 0.543
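
For reference, precision is the share of numbers flagged as disconnected that really are disconnected, and recall is the share of truly disconnected numbers that get flagged. A minimal sketch with made-up labels showing how both follow from the confusion-matrix counts:

```python
from sklearn import metrics

# Made-up labels: 1 = disconnected, 0 = in service
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()
precision = tp / (tp + fp)   # flagged numbers that are truly disconnected
recall = tp / (tp + fn)      # disconnected numbers that were flagged

print('Precision:', round(precision, 3))
print('Recall:', round(recall, 3))
```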


# Can we improve RPV's results with machine learning using just RPV results?

### Encoding RPV status

The RPV status is ordinal data: it can be ranked on a spectrum from connected to disconnected.

In [24]:
statuses = {
    'disconnected':4,
    'disconnected-70':3,
    'connected-75':2,
    'connected':1
}

def get_rank(status):
    if status in statuses:
        return statuses[status]
    return None

In [25]:
Xy['rpv_dc'] = Xy['status'].apply(lambda x: statuses[x] if x in statuses else 0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
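
The warning above is raised because `Xy` was created by selecting columns from `df`, so pandas cannot tell whether the assignment is meant to affect `df` as well. One common way to avoid it, sketched here on toy data, is to take an explicit `.copy()` before adding columns. `Series.map` followed by `fillna(0)` then gives the same ordinal encoding, with unrecognised statuses mapped to 0 (the toy rows, including the `'busy'` status, are assumptions):

```python
import pandas as pd

statuses = {'disconnected': 4, 'disconnected-70': 3, 'connected-75': 2, 'connected': 1}

df_toy = pd.DataFrame({
    'phone': ['2010000001', '2010000002', '2010000003'],
    'status': ['connected', 'disconnected-70', 'busy'],  # 'busy' is a made-up unknown status
    'extra': ['a', 'b', 'c'],
})

# .copy() makes Xy_toy an independent frame, so adding a column is unambiguous
Xy_toy = df_toy[['phone', 'status']].copy()
Xy_toy['rpv_dc'] = Xy_toy['status'].map(statuses).fillna(0).astype(int)
print(Xy_toy)
```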


In [26]:
Xy['rpv_dc'].drop_duplicates()

0      1
7      4
9      3
241    2
425    0
Name: rpv_dc, dtype: int64

In [27]:
len(Xy)

3966

In [28]:
Xy = Xy[np.isfinite(Xy['rpv_dc'])]  # no-op here: unknown statuses were mapped to 0 above, so 'rpv_dc' has no nulls to drop
len(Xy)

3966

In [30]:
Xy.head()

Unnamed: 0,phone,status,iscell,carrier,isDisconnected,rpv_dc
0,2013422550,connected,N,Onvoy LLC former Neutral Tandem,0,1
1,2013586776,connected,N,Cablevision Corp,1,1
2,2013639844,connected,V,TimeWarnerCable,0,1
3,2014181000,connected,N,Cablevision Corp,0,1
4,2014386916,connected,N,Comcast of MD,0,1


In [31]:
Xy.groupby('iscell').size()

iscell
        1
N    3721
V     228
Y      15
dtype: int64

In [32]:
Xy['isCell_Y'] = Xy['iscell'].apply(lambda x: 1 if x=='Y' else 0)
Xy['isCell_V'] = Xy['iscell'].apply(lambda x: 1 if x=='V' else 0)
Xy.drop(columns='iscell', axis=1, inplace=True)

In [33]:
Xy.dtypes

phone             object
status            object
carrier           object
isDisconnected     int64
rpv_dc             int64
isCell_Y           int64
isCell_V           int64
dtype: object

In [34]:
Xy['area'] = Xy['phone'].apply(lambda x: x[:3]).astype('category')
Xy['area_prefix'] = Xy['phone'].apply(lambda x: x[:6]).astype('category')
Xy['carrier'] = Xy['carrier'].astype('category')
Xy = Xy.set_index('phone')
Xy.drop(columns='status', axis=1, inplace=True)


In [35]:
Xy.head()

phone,carrier,isDisconnected,rpv_dc,isCell_Y,isCell_V,area,area_prefix
2013422550,Onvoy LLC former Neutral Tandem,0,1,0,0,201,201342
2013586776,Cablevision Corp,1,1,0,0,201,201358
2013639844,TimeWarnerCable,0,1,0,1,201,201363
2014181000,Cablevision Corp,0,1,0,0,201,201418
2014386916,Comcast of MD,0,1,0,0,201,201438


In [36]:
y = Xy['isDisconnected']
X = Xy.drop(columns='isDisconnected', axis=1)
len(X) == len(y)

True

In [37]:
X.dtypes

carrier        category
rpv_dc            int64
isCell_Y          int64
isCell_V          int64
area           category
area_prefix    category
dtype: object

### Viewing most common phone line carriers

In [91]:
cc = Xy.groupby('carrier').size().sort_values(ascending=False).reset_index().rename(columns={0: 'count'})
cc[:20]

Unnamed: 0,carrier,count
0,TCG,285
1,Verizon,274
2,Level 3,252
3,Pacific Bell - Nevada Bell,182
4,Qwest Communications,180
5,Comcast Phone,179
6,BellSouth,170
7,BANDWIDTH.COM,157
8,Ameritech,154
9,SouthWestern Bell,144


In [92]:
# map each carrier name to its row count
carrier_counts = dict(zip(cc['carrier'], cc['count']))
carrier_counts

{'TCG': 285,
 'Verizon': 274,
 'Level 3': 252,
 'Pacific Bell - Nevada Bell': 182,
 'Qwest Communications': 180,
 'Comcast Phone': 179,
 'BellSouth': 170,
 'BANDWIDTH.COM': 157,
 'Ameritech': 154,
 'SouthWestern Bell': 144,
 'TimeWarnerCable': 132,
 'PAETEC': 130,
 'Frontier Rochester': 116,
 'tw telecom': 78,
 'Onvoy LLC former Neutral Tandem': 70,
 'CenturyLink': 66,
 'MCImetro Former Worldcom': 61,
 'Comcast Phone LLC': 60,
 'Cox Communications': 57,
 'Charter Fiber': 44,
 'BHNIS - Florida, LLC': 41,
 '(UNKNOWN)': 41,
 'Windstream SL': 35,
 'Lightpath Cable': 34,
 'Comcast Phone FL': 34,
 'MCImetro Former MCI': 33,
 'TelePacific': 28,
 'Vonage': 28,
 'Qwest Comms Company, LLC': 27,
 'SBC IP': 23,
 'Cablevision-OptimumLightpath': 22,
 'Cincinnati Bell Tel': 22,
 'RingCentral': 21,
 'McLeod USA': 21,
 'NuVox Communications': 21,
 'Comcast Phone NER': 21,
 'Frontier Communications': 19,
 'Cablevision Corp': 18,
 'Comcast of MD': 17,
 'Broadview Networks': 16,
 'Peerless': 16,
 'Verizon

In [93]:
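# NOTE: every carrier present in the data has a count of at least 1, so this condition is always
# truthy and nothing is mapped to 'other'; pooling rare carriers would need an explicit count threshold
Xy['carrier'] = Xy['carrier'].apply(lambda x: x if carrier_counts[x] else 'other')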
Xy['carrier'] = Xy['carrier'].astype('category')
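
If the goal is to pool infrequent carriers into `'other'`, the comparison needs an explicit minimum count. A sketch of one way to do that (the `pool_rare` helper and the `min_count` value are assumptions; the notebook does not specify a threshold):

```python
import pandas as pd

def pool_rare(values, min_count):
    """Replace categories seen fewer than min_count times with 'other'."""
    values = values.astype(object)            # also accepts a categorical column
    counts = values.value_counts()
    keep = counts[counts >= min_count].index  # categories frequent enough to keep
    return values.where(values.isin(keep), 'other').astype('category')

# Toy carrier column; min_count=2 is an arbitrary, hypothetical threshold
carriers = pd.Series(['TCG', 'TCG', 'Verizon', 'Verizon', 'Vonage', 'Peerless'])
pooled = pool_rare(carriers, min_count=2)
print(pooled.value_counts())
```

Pooling rare values this way also cuts the number of one-hot columns produced later by `pd.get_dummies`.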


In [94]:
X = Xy.drop(columns='isDisconnected', axis=1)
y = Xy['isDisconnected']
len(X) == len(y)

True

### LightGBM

In [38]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split

In [64]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

train_dataset = lgb.Dataset(X_train, y_train)
valid_dataset = lgb.Dataset(X_test, y_test)


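# NOTE: `pos` is not defined in any cell shown here; it must come from kernel state that is not part of this notebook
print('POS:', pos)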


# experiment with different parameters
param = {
    'num_leaves': 31, 
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.15
}
n_iter = 500

bst = lgb.train(param, train_dataset, n_iter, valid_sets=valid_dataset, early_stopping_rounds=50)

y_pred = bst.predict(X_test)
y_pred_c = [1 if p >= 0.5 else 0 for p in y_pred]

print()
print('Precision:', metrics.precision_score(y_true=y_test, y_pred=y_pred_c))
print('Recall:', metrics.recall_score(y_true=y_test, y_pred=y_pred_c))

POS: -0.9122541603630863
[1]	valid_0's auc: 0.740126
Training until validation scores don't improve for 50 rounds.
[2]	valid_0's auc: 0.732165
[3]	valid_0's auc: 0.735361
[4]	valid_0's auc: 0.735966
[5]	valid_0's auc: 0.733502
[6]	valid_0's auc: 0.727027
[7]	valid_0's auc: 0.727932
[8]	valid_0's auc: 0.726862
[9]	valid_0's auc: 0.728496
[10]	valid_0's auc: 0.728413
[11]	valid_0's auc: 0.730156
[12]	valid_0's auc: 0.734026
[13]	valid_0's auc: 0.731288
[14]	valid_0's auc: 0.733315
[15]	valid_0's auc: 0.736077
[16]	valid_0's auc: 0.735875
[17]	valid_0's auc: 0.737375
[18]	valid_0's auc: 0.736793
[19]	valid_0's auc: 0.737662
[20]	valid_0's auc: 0.740617
[21]	valid_0's auc: 0.741915
[22]	valid_0's auc: 0.74094
[23]	valid_0's auc: 0.739466
[24]	valid_0's auc: 0.73882
[25]	valid_0's auc: 0.736322
[26]	valid_0's auc: 0.736478
[27]	valid_0's auc: 0.739112
[28]	valid_0's auc: 0.738373
[29]	valid_0's auc: 0.738256
[30]	valid_0's auc: 0.739348
[31]	valid_0's auc: 0.737972
[32]	valid_0's auc: 0.737
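
Disconnected numbers appear to be the minority class here, and the split above is unseeded, so the share of positives in the validation set can vary from run to run. A hedged sketch of a reproducible, stratified split with `train_test_split` (the toy data, the variable names, and the `random_state` value are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 10% positives (hypothetical proportions)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify keeps the positive rate the same in both parts; random_state fixes the shuffle
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=0)

print('fit positive rate:', y_fit.mean())
print('validation positive rate:', y_val.mean())
```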

### Testing different thresholds

In [65]:
t = 0.7

y_pred_c = [1 if p >= t else 0 for p in y_pred]

print()
print('Precision:', metrics.precision_score(y_true=y_test, y_pred=y_pred_c))
print('Recall:', metrics.recall_score(y_true=y_test, y_pred=y_pred_c))


Precision: 0.9090909090909091
Recall: 0.07518796992481203
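
Rather than trying thresholds one at a time, `metrics.precision_recall_curve` returns the precision and recall at every candidate cut-off. A minimal sketch with made-up scores (in the notebook, `y_test` and `y_pred` would take the place of the toy arrays):

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predicted probabilities
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.30, 0.35, 0.40, 0.55, 0.70, 0.75, 0.90])

precision, recall, thresholds = metrics.precision_recall_curve(y_true, y_score)

# precision and recall have one more entry than thresholds (the final point is P=1, R=0)
for p, r, t in zip(precision, recall, thresholds):
    print(f'threshold >= {t:.2f}: precision={p:.3f} recall={r:.3f}')
```

Scanning the printed rows makes the trade-off explicit: raising the threshold generally buys precision at the cost of recall, as in the 0.7 example above.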


### Neural Network

In [66]:
from keras import Sequential
from keras.utils import to_categorical
from keras.layers import Dense

Using TensorFlow backend.


In [67]:
Xy.dtypes

carrier           category
isDisconnected       int64
rpv_dc               int64
isCell_Y             int64
isCell_V             int64
area              category
area_prefix       category
dtype: object

In [68]:
categoricals = ['carrier', 'area', 'area_prefix']
dummies = pd.get_dummies(Xy[categoricals])
X_nocat = Xy.drop(columns=categoricals, axis=1)
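# NOTE: drop_duplicates() removes rows that are identical across every column (features and target),
# which is why X_nocat ends up with fewer rows than X
X_nocat = X_nocat.merge(dummies, left_index=True, right_index=True, how='inner').drop_duplicates()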

y = X_nocat['isDisconnected']
X_nocat.drop(columns='isDisconnected', axis=1, inplace=True)

X_nocat.head()

phone,rpv_dc,isCell_Y,isCell_V,carrier_,carrier_(UNKNOWN),carrier_123.Net,carrier_ACS,carrier_AT&T Wireless,carrier_ATI,carrier_Access Integrated Networks,...,area_prefix_985652,area_prefix_985764,area_prefix_985853,area_prefix_985872,area_prefix_985873,area_prefix_985893,area_prefix_989224,area_prefix_989583,area_prefix_989588,area_prefix_989799
2013422550,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2013586776,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2013639844,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2014181000,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2014386916,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [69]:
print(len(X))
print(len(X_nocat))
print(len(y))
assert len(X_nocat) == len(y)
X_train, X_test, y_train, y_test = train_test_split(X_nocat, y, test_size=0.4)

3966
3601
3601


In [71]:
X_nocat.shape

(3601, 3912)

In [88]:
model = Sequential()

model.add(Dense(500, activation='relu', input_dim=3))  # change this to 3912 if using categorical data
model.add(Dense(units=500, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))

#Compile
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [73]:
Xynocat = Xy.drop(columns=categoricals, axis=1)
Xy_target = Xynocat['isDisconnected']
Xynocat2 = Xynocat.drop(columns='isDisconnected', axis=1)
X_tr, X_t, y_tr, y_t = train_test_split(Xynocat2, Xy_target)

X_tr['y'] = y_tr

X_tr_dc = X_tr[X_tr['y']==1]
X_tr_c = X_tr[X_tr['y']==0].sample(500)



# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
X_tr = pd.concat([X_tr_dc, X_tr_c], ignore_index=True).sample(frac=1)

y_tr = X_tr['y']
X_tr.drop(columns='y', axis=1, inplace=True)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
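
The cell above balances the training set by keeping every disconnected row and a random sample of 500 connected rows. The same idea as a small reusable sketch, using `pd.concat` and a fixed seed (the `downsample_negatives` helper, its parameters, and the toy frame are assumptions for illustration):

```python
import pandas as pd

def downsample_negatives(frame, target, n_neg, seed=0):
    """Keep all positive rows plus a random sample of n_neg negative rows, shuffled."""
    pos = frame[frame[target] == 1]
    neg = frame[frame[target] == 0].sample(n=n_neg, random_state=seed)
    return pd.concat([pos, neg], ignore_index=True).sample(frac=1, random_state=seed)

# Toy training frame: 3 positives, 20 negatives
toy = pd.DataFrame({'rpv_dc': [4, 3, 4] + [1] * 20,
                    'y': [1, 1, 1] + [0] * 20})
balanced = downsample_negatives(toy, target='y', n_neg=5)
print(balanced['y'].value_counts())
```

Only the training rows are resampled; the held-out rows keep their natural class balance so that precision and recall stay comparable with the RPV baseline.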
  


In [75]:
history = model.fit(X_train, 
                    y_train,
                    epochs=10,  # NOTE: the log below reports 50 epochs, so it does not match this value
                    shuffle=True,
                    verbose=2
                   )

Epoch 1/50
 - 2s - loss: 0.3218 - acc: 0.9032
Epoch 2/50
 - 1s - loss: 0.1809 - acc: 0.9130
Epoch 3/50
 - 1s - loss: 0.0925 - acc: 0.9639
Epoch 4/50
 - 1s - loss: 0.0345 - acc: 0.9894
Epoch 5/50
 - 1s - loss: 0.0244 - acc: 0.9898
Epoch 6/50
 - 1s - loss: 0.0187 - acc: 0.9926
Epoch 7/50
 - 2s - loss: 0.0161 - acc: 0.9931
Epoch 8/50
 - 2s - loss: 0.0168 - acc: 0.9926
Epoch 9/50
 - 1s - loss: 0.0152 - acc: 0.9917
Epoch 10/50
 - 1s - loss: 0.0173 - acc: 0.9921
Epoch 11/50
 - 1s - loss: 0.0134 - acc: 0.9921
Epoch 12/50
 - 1s - loss: 0.0143 - acc: 0.9921
Epoch 13/50
 - 1s - loss: 0.0111 - acc: 0.9931
Epoch 14/50
 - 1s - loss: 0.0123 - acc: 0.9931
Epoch 15/50
 - 1s - loss: 0.0119 - acc: 0.9921
Epoch 16/50
 - 1s - loss: 0.0109 - acc: 0.9917
Epoch 17/50
 - 1s - loss: 0.0104 - acc: 0.9921
Epoch 18/50
 - 1s - loss: 0.0097 - acc: 0.9917
Epoch 19/50
 - 2s - loss: 0.0092 - acc: 0.9917
Epoch 20/50
 - 2s - loss: 0.0098 - acc: 0.9917
Epoch 21/50
 - 1s - l

In [89]:
# different version without categorical data, recompile model with input_dim = 3
history = model.fit(X_tr, 
                    y_tr,
                    epochs=50,
                    shuffle=True,
                    verbose=2
                   )

Epoch 1/50
 - 0s - loss: 0.6683 - acc: 0.7209
Epoch 2/50
 - 0s - loss: 0.5706 - acc: 0.7685
Epoch 3/50
 - 0s - loss: 0.5388 - acc: 0.7579
Epoch 4/50
 - 0s - loss: 0.5490 - acc: 0.7619
Epoch 5/50
 - 0s - loss: 0.5518 - acc: 0.7447
Epoch 6/50
 - 0s - loss: 0.5391 - acc: 0.7632
Epoch 7/50
 - 0s - loss: 0.5385 - acc: 0.7632
Epoch 8/50
 - 0s - loss: 0.5381 - acc: 0.7579
Epoch 9/50
 - 0s - loss: 0.5469 - acc: 0.7579
Epoch 10/50
 - 0s - loss: 0.5370 - acc: 0.7672
Epoch 11/50
 - 0s - loss: 0.5426 - acc: 0.7646
Epoch 12/50
 - 0s - loss: 0.5375 - acc: 0.7579
Epoch 13/50
 - 0s - loss: 0.5378 - acc: 0.7606
Epoch 14/50
 - 0s - loss: 0.5338 - acc: 0.7685
Epoch 15/50
 - 0s - loss: 0.5434 - acc: 0.7593
Epoch 16/50
 - 0s - loss: 0.5362 - acc: 0.7698
Epoch 17/50
 - 0s - loss: 0.5313 - acc: 0.7712
Epoch 18/50
 - 0s - loss: 0.5335 - acc: 0.7712
Epoch 19/50
 - 0s - loss: 0.5381 - acc: 0.7619
Epoch 20/50
 - 0s - loss: 0.5341 - acc: 0.7712
Epoch 21/50
 - 0s - loss: 0.5345 - acc: 0.7672
Epoch 22/50
 - 0s - lo

### Validation - No categorical

In [90]:
y_pred = model.predict(X_t)

y_predc = [1 if p >= 0.5 else 0 for p in y_pred]

print('Precision:', metrics.precision_score(y_true=y_t, 
                                            y_pred=y_predc))
print('Recall:', metrics.recall_score(y_true=y_t, 
                                      y_pred=y_predc))

Precision: 0.34965034965034963
Recall: 0.5434782608695652


### Validation - with categorical

In [84]:

y_pred = model.predict(X_test)

y_predc = [1 if p >= 0.5 else 0 for p in y_pred]

print('Precision:', metrics.precision_score(y_true=y_test, 
                                            y_pred=y_predc))
print('Recall:', metrics.recall_score(y_true=y_test, 
                                      y_pred=y_predc))

Precision: 0.3006134969325153
Recall: 0.33793103448275863


# Conclusion

More data needed.

Including all categoricals introduces too much variety for the model to handle with limited data.

Need more data.

Data.

Disregard results, acquire data

In [86]:
import keras

In [87]:
keras.backend.clear_session()  # clears the current Keras session so the model can be rebuilt from scratch