# How to use the model

To understand the model it would be convenient if you have gone through demo1 and 2, however you can learn how to use the model simply reading this notebook. 

I will use 3 examples to illustrate the different set-ups that can be used with this pytorch implementation of wide and deep.

### 0. Load the data

Note that, as long as your dataset is in a state similar to that of `adult_data.csv` below (remove NaN, impute missing values, etc..), you are "good to go".

In [1]:
from __future__ import print_function
import pandas as pd
import numpy as np

DF = pd.read_csv('data/adult_data.csv')

DF.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket,income_label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,0


## 1. Logistic regression with varying embedding dimensions, no dropout and Adam optimizer.

#### 1_1. Set the experiment

In [2]:
# Let's define a target for logistic regression:
DF['income_label'] = (DF["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)

# Experiment set up
wide_cols = ['age','hours_per_week','education', 'relationship','workclass',
             'occupation','native_country','gender']
crossed_cols = (['education', 'occupation'], ['native_country', 'occupation'])
embeddings_cols = [('education',10), ('relationship',8), ('workclass',10),
                    ('occupation',10),('native_country',10)]
continuous_cols = ["age","hours_per_week"]
target = 'income_label'
method = 'logistic'

#### 1_2. prepare the data

In [3]:
from wide_deep.data_utils import prepare_data

# just call prepare_data
wd_dataset = prepare_data(DF, wide_cols,crossed_cols,embeddings_cols,continuous_cols,target,scale=True)

#### 1_3. Build the model

In [4]:
# Network set up
wide_dim = wd_dataset['train_dataset'].wide.shape[1]
n_class=1 # for logistic and regression
deep_column_idx = wd_dataset['deep_column_idx']
embeddings_input= wd_dataset['embeddings_input']
encoding_dict   = wd_dataset['encoding_dict']
hidden_layers = [100,50]
dropout = None

# Build the model. Again you just need to call WideDeep
from wide_deep.torch_model import WideDeep
model = WideDeep(wide_dim,embeddings_input,continuous_cols,deep_column_idx,hidden_layers, dropout, encoding_dict,n_class)

# I have included a compile method if you want to change the fitting method or the optimizer
model.compile(method=method, optimizer="Adam")

let's have a look:

In [5]:
print(model)

WideDeep(
  (emb_layer_native_country): Embedding(42, 10)
  (emb_layer_relationship): Embedding(6, 8)
  (emb_layer_occupation): Embedding(15, 10)
  (emb_layer_education): Embedding(16, 10)
  (emb_layer_workclass): Embedding(9, 10)
  (linear_1): Linear(in_features=50, out_features=100, bias=True)
  (linear_2): Linear(in_features=100, out_features=50, bias=True)
  (output): Linear(in_features=848, out_features=1, bias=True)
)


#### 1_4. Fit and Predict

In [6]:
train_dataset = wd_dataset['train_dataset']
test_dataset  = wd_dataset['test_dataset']

# As your usual Sklearn model, simply call fit/predict
model.fit(dataset=train_dataset, n_epochs=10, batch_size=64)
pred = model.predict(dataset=test_dataset)

from sklearn.metrics import accuracy_score
print(accuracy_score(pred, test_dataset.labels))

Epoch 1 of 10, Loss: 0.136, accuracy: 0.8246
Epoch 2 of 10, Loss: 0.106, accuracy: 0.8392
Epoch 3 of 10, Loss: 0.513, accuracy: 0.8421
Epoch 4 of 10, Loss: 0.345, accuracy: 0.8414
Epoch 5 of 10, Loss: 0.29, accuracy: 0.843
Epoch 6 of 10, Loss: 0.227, accuracy: 0.8443
Epoch 7 of 10, Loss: 0.426, accuracy: 0.845
Epoch 8 of 10, Loss: 0.183, accuracy: 0.8454
Epoch 9 of 10, Loss: 0.322, accuracy: 0.8461
Epoch 10 of 10, Loss: 0.246, accuracy: 0.8469
0.8382583771241384


I have included a method to easily get the learned embeddings. This will return a dictionary where the keys are the column values and the values are the embeddings.

In [7]:
model.get_embeddings('education')

{'Bachelors': array([-1.1927266 ,  0.13337217,  0.751513  , -0.3854133 , -1.512503  ,
         0.43075648,  0.03185017,  0.2740599 , -1.3502986 , -0.51524764],
       dtype=float32),
 'HS-grad': array([ 0.01510752, -0.41036212, -1.2737428 , -0.03190449,  0.30465913,
        -0.4891645 , -0.35087353,  1.7667191 ,  0.90333945, -0.42637545],
       dtype=float32),
 '11th': array([-1.3361819 , -1.0304003 , -0.7671982 ,  1.1118906 ,  0.6290409 ,
         0.09973534, -0.41261104, -0.79101914,  1.2672484 ,  0.7189385 ],
       dtype=float32),
 'Masters': array([ 0.5837133 , -1.3451334 ,  0.9863935 ,  0.35932744, -0.13541682,
         0.34770364, -0.8982047 ,  0.4550249 , -1.326133  , -0.08214497],
       dtype=float32),
 '9th': array([ 0.00944321, -0.2883264 ,  1.1186845 ,  0.16699162,  0.20891678,
        -2.222243  ,  0.90257394, -2.499814  ,  0.32215422, -0.02830464],
       dtype=float32),
 'Some-college': array([ 0.11737815, -0.9354352 , -1.6950701 , -0.3879866 , -0.34800476,
         0.

## 2. Multiclass classification with fixed embedding dimensions (10), varying dropout and RMSProp. 

Let's first define a feature for multiclass classification. Note that **this is only for illustration purposes**. 

In [8]:
# Let's define age groups
age_groups = [0, 25, 50, 90]
age_labels = range(len(age_groups) - 1)
DF['age_group'] = pd.cut(DF['age'], age_groups, labels=age_labels)

# Set the experiment
wide_cols = ['hours_per_week','education', 'relationship','workclass',
             'occupation','native_country','gender']
crossed_cols = (['education', 'occupation'], ['native_country', 'occupation'])
embeddings_cols = ['education', 'relationship','workclass','occupation','native_country']
continuous_cols = ["hours_per_week"]
target = 'age_group'
method = 'multiclass'

wd_dataset = prepare_data(DF,wide_cols,crossed_cols,embeddings_cols,continuous_cols,target,scale=True,def_dim=10)

wide_dim = wd_dataset['train_dataset'].wide.shape[1]
n_class=3
deep_column_idx = wd_dataset['deep_column_idx']
embeddings_input= wd_dataset['embeddings_input']
encoding_dict   = wd_dataset['encoding_dict']
hidden_layers = [100,50]
dropout = [0.5, 0.2]

model = WideDeep(wide_dim,embeddings_input,continuous_cols,deep_column_idx,hidden_layers,dropout,encoding_dict,n_class)
model.compile(method=method, optimizer="RMSprop")

# Let's have a look to the model
print(model)

WideDeep(
  (emb_layer_native_country): Embedding(42, 10)
  (emb_layer_relationship): Embedding(6, 10)
  (emb_layer_occupation): Embedding(15, 10)
  (emb_layer_education): Embedding(16, 10)
  (emb_layer_workclass): Embedding(9, 10)
  (linear_1): Linear(in_features=51, out_features=100, bias=True)
  (linear_1_drop): Dropout(p=0.5)
  (linear_2): Linear(in_features=100, out_features=50, bias=True)
  (linear_2_drop): Dropout(p=0.2)
  (output): Linear(in_features=847, out_features=3, bias=True)
)


In [9]:
train_dataset = wd_dataset['train_dataset']
model.fit(dataset=train_dataset, n_epochs=10, batch_size=64)
test_dataset  = wd_dataset['test_dataset']

# The model object also has a predict_proba method in case you want probabilities instead of class
pred = model.predict_proba(test_dataset)
print('\n {}'.format(pred))

Epoch 1 of 10, Loss: 1.131, accuracy: 0.6735
Epoch 2 of 10, Loss: 0.846, accuracy: 0.6843
Epoch 3 of 10, Loss: 0.902, accuracy: 0.686
Epoch 4 of 10, Loss: 0.806, accuracy: 0.691
Epoch 5 of 10, Loss: 1.015, accuracy: 0.6931
Epoch 6 of 10, Loss: 0.77, accuracy: 0.694
Epoch 7 of 10, Loss: 0.868, accuracy: 0.6962
Epoch 8 of 10, Loss: 0.808, accuracy: 0.6973
Epoch 9 of 10, Loss: 0.977, accuracy: 0.6972
Epoch 10 of 10, Loss: 0.851, accuracy: 0.6968

 [[9.9808323e-01 1.9167198e-03 1.2708337e-07]
 [1.8705309e-12 1.0000000e+00 1.0048575e-09]
 [2.1682714e-08 9.9999905e-01 9.1604261e-07]
 ...
 [1.0082698e-03 9.6010476e-01 3.8887005e-02]
 [2.4448596e-07 9.9994826e-01 5.1442345e-05]
 [6.8863249e-01 3.0600473e-01 5.3628702e-03]]


In [10]:
from sklearn.metrics import f1_score, accuracy_score

print("\n {}".format(f1_score(model.predict(test_dataset), test_dataset.labels, average="weighted")))

print("\n {}".format(accuracy_score(model.predict(test_dataset), test_dataset.labels)))


 0.7318796365288384

 0.7006756295639118


## 3. Linear regression with varying embedding dimensions and varying dropout.

Again, bear in mind that here we use `age` as target just **for illustration purposes**

In [11]:
# Set the experiment
wide_cols = ['hours_per_week','education', 'relationship','workclass',
             'occupation','native_country','gender']
crossed_cols  = (['education', 'occupation'], ['native_country', 'occupation'])
embeddings_cols  = [('education',10), ('relationship',8), ('workclass',10),
                    ('occupation',10),('native_country',10)]
continuous_cols = ["hours_per_week"]
target = 'age'
method = 'regression'

# Prepare the dataset
wd_dataset = prepare_data(DF, wide_cols,crossed_cols,embeddings_cols,continuous_cols,target)

wide_dim = wd_dataset['train_dataset'].wide.shape[1]
n_class=1
deep_column_idx = wd_dataset['deep_column_idx']
embeddings_input= wd_dataset['embeddings_input']
encoding_dict   = wd_dataset['encoding_dict']
hidden_layers = [100,50]
dropout = [0.5, 0.2]
model = WideDeep(wide_dim,embeddings_input,continuous_cols,deep_column_idx,hidden_layers,dropout,encoding_dict,n_class)
model.compile(method=method)
print(model)

WideDeep(
  (emb_layer_native_country): Embedding(42, 10)
  (emb_layer_relationship): Embedding(6, 8)
  (emb_layer_occupation): Embedding(15, 10)
  (emb_layer_education): Embedding(16, 10)
  (emb_layer_workclass): Embedding(9, 10)
  (linear_1): Linear(in_features=49, out_features=100, bias=True)
  (linear_1_drop): Dropout(p=0.5)
  (linear_2): Linear(in_features=100, out_features=50, bias=True)
  (linear_2_drop): Dropout(p=0.2)
  (output): Linear(in_features=847, out_features=1, bias=True)
)


In [12]:
train_dataset = wd_dataset['train_dataset']
model.fit(dataset=train_dataset, n_epochs=100, batch_size=64)

test_dataset  = wd_dataset['test_dataset']
pred = model.predict(test_dataset)

from sklearn.metrics import mean_squared_error
print("\n RMSE: {}".format(np.sqrt(mean_squared_error(pred, test_dataset.labels))))

Epoch 1 of 100, Loss: 246.073
Epoch 2 of 100, Loss: 94.48
Epoch 3 of 100, Loss: 126.0
Epoch 4 of 100, Loss: 115.982
Epoch 5 of 100, Loss: 133.325
Epoch 6 of 100, Loss: 158.583
Epoch 7 of 100, Loss: 54.36
Epoch 8 of 100, Loss: 79.254
Epoch 9 of 100, Loss: 182.114
Epoch 10 of 100, Loss: 51.386
Epoch 11 of 100, Loss: 104.073
Epoch 12 of 100, Loss: 143.09
Epoch 13 of 100, Loss: 70.531
Epoch 14 of 100, Loss: 97.966
Epoch 15 of 100, Loss: 59.099
Epoch 16 of 100, Loss: 94.067
Epoch 17 of 100, Loss: 183.913
Epoch 18 of 100, Loss: 47.979
Epoch 19 of 100, Loss: 179.66
Epoch 20 of 100, Loss: 137.305
Epoch 21 of 100, Loss: 127.122
Epoch 22 of 100, Loss: 229.17
Epoch 23 of 100, Loss: 151.097
Epoch 24 of 100, Loss: 161.884
Epoch 25 of 100, Loss: 128.496
Epoch 26 of 100, Loss: 157.841
Epoch 27 of 100, Loss: 369.787
Epoch 28 of 100, Loss: 94.269
Epoch 29 of 100, Loss: 134.898
Epoch 30 of 100, Loss: 74.488
Epoch 31 of 100, Loss: 214.186
Epoch 32 of 100, Loss: 112.555
Epoch 33 of 100, Loss: 78.977
Epoch