# How to use the model

It helps to have gone through demo1 and demo2 first, but you can also learn how to use the model just by reading this notebook.

I will use three examples to illustrate the different set-ups supported by this PyTorch implementation of Wide and Deep.

## 0. Load the data

Note that as long as your dataset is in a state similar to that of `adult_data.csv` below (NaNs removed, missing values imputed, etc.), you are "good to go".

In [37]:
from __future__ import print_function
import pandas as pd
import numpy as np

DF = pd.read_csv('data/adult_data.csv')

DF.head()

| Unnamed: 0 | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket | income_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 0 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 0 |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | 0 |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | 0 |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | 0 |
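If your own data is not yet in that state, a minimal cleaning pass could look like the sketch below. The toy values are hypothetical (only the column names come from `adult_data.csv`), and median/mode imputation is just one reasonable choice, not something this library requires.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names borrowed from adult_data.csv.
raw = pd.DataFrame({
    "age": [39, np.nan, 38, 53],
    "workclass": ["Private", None, "State-gov", "Private"],
    "hours_per_week": [40, 13, 40, np.nan],
    "income_bracket": ["<=50K", "<=50K", None, ">50K"],
})

# Rows without a target cannot be used for training, so drop them.
clean = raw.dropna(subset=["income_bracket"]).copy()

# Impute numeric columns with the median and categorical columns with the mode.
for col in ["age", "hours_per_week"]:
    clean[col] = clean[col].fillna(clean[col].median())
for col in ["workclass"]:
    clean[col] = clean[col].fillna(clean[col].mode().iloc[0])

# Total number of missing values left in the frame.
print(clean.isna().sum().sum())
```

Once no missing values remain, the frame can be handed to the rest of the pipeline as is.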


## 1. Logistic regression with varying embedding dimensions and no dropout

#### 1_1. Set the experiment

In [12]:
# Let's define a target for logistic regression:
DF['income_label'] = (DF["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)

# Experiment set up
wide_cols = ['age','hours_per_week','education', 'relationship','workclass',
             'occupation','native_country','gender']
crossed_cols = (['education', 'occupation'], ['native_country', 'occupation'])
embeddings_cols = [('education',10), ('relationship',8), ('workclass',10),
                    ('occupation',10),('native_country',10)]
continuous_cols = ["age","hours_per_week"]
target = 'income_label'
method = 'logistic'
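To give an intuition for what `crossed_cols` means, here is a rough sketch of one common way to cross two categorical columns: join their values row-wise and one-hot encode the result. This is an assumption for illustration only; the helper `cross_columns`, the `education_occupation` name and the separator are hypothetical, and `prepare_data` may do it differently.

```python
import pandas as pd

# Hypothetical rows; the column names match the notebook's crossed_cols.
df = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors"],
    "occupation": ["Adm-clerical", "Handlers-cleaners", "Exec-managerial"],
})

def cross_columns(frame, cols, sep="-"):
    """Join the values of `cols` row-wise into a single categorical feature."""
    crossed = frame[cols[0]].astype(str)
    for col in cols[1:]:
        crossed = crossed + sep + frame[col].astype(str)
    return crossed

crossed_name = "education_occupation"  # hypothetical naming scheme
df[crossed_name] = cross_columns(df, ["education", "occupation"])

# One-hot encode the crossed feature so a linear (wide) layer can consume it.
wide_part = pd.get_dummies(df[crossed_name])
print(wide_part.shape)
```

Each distinct pair of values becomes its own binary column, which is why crossed features can make the wide part of the model quite large.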

#### 1_2. Prepare the data

In [13]:
from wide_deep.data_utils import prepare_data

# A single call to prepare_data returns the train/test datasets and the encodings
wd_dataset = prepare_data(DF, wide_cols, crossed_cols, embeddings_cols, continuous_cols, target)

#### 1_3. Build the model

In [22]:
# Network set up
wide_dim = wd_dataset['train_dataset'].wide.shape[1]
n_class = 1  # a single output unit for both logistic and linear regression
deep_column_idx = wd_dataset['deep_column_idx']
embeddings_input = wd_dataset['embeddings_input']
encoding_dict = wd_dataset['encoding_dict']
hidden_layers = [100, 50]
dropout = None

# Build the model. Again, you just need to call WideDeep
from wide_deep.torch_model import WideDeep
model = WideDeep(wide_dim, embeddings_input, continuous_cols, deep_column_idx,
                 hidden_layers, dropout, encoding_dict, n_class)

# The compile method lets you change the fitting method or the optimizer
model.compile(method=method, optimizer="Adam")

Let's have a look:

In [23]:
print(model)

WideDeep (
  (emb_layer_workclass): Embedding(9, 10)
  (emb_layer_education): Embedding(16, 10)
  (emb_layer_native_country): Embedding(42, 10)
  (emb_layer_relationship): Embedding(6, 8)
  (emb_layer_occupation): Embedding(15, 10)
  (linear_1): Linear (50 -> 100)
  (linear_2): Linear (100 -> 50)
  (output): Linear (848 -> 1)
)
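The layer sizes in the printout can be checked with a little arithmetic. The sketch below assumes that the first hidden layer receives the concatenated embeddings plus the continuous columns, and that the output layer receives the last hidden layer concatenated with the wide features. The second assumption is inferred from the printed sizes, not confirmed by the code shown here.

```python
# Embedding sizes and continuous columns as set in the experiment above.
embeddings_cols = [("education", 10), ("relationship", 8), ("workclass", 10),
                   ("occupation", 10), ("native_country", 10)]
continuous_cols = ["age", "hours_per_week"]
hidden_layers = [100, 50]

# Input to the first hidden layer: concatenated embeddings + continuous features.
deep_input_dim = sum(dim for _, dim in embeddings_cols) + len(continuous_cols)

# Assumption: output layer input = last hidden layer + wide features,
# so the wide dimension can be inferred from the printed 848.
output_in_features = 848
inferred_wide_dim = output_in_features - hidden_layers[-1]

print(deep_input_dim)     # 50
print(inferred_wide_dim)  # 798
```

The 50 matches `Linear (50 -> 100)` in the printout, and under the stated assumption the one-hot encoded wide part would have 798 columns.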


#### 1_4. Fit and Predict

In [24]:
train_dataset = wd_dataset['train_dataset']
test_dataset  = wd_dataset['test_dataset']

# As with your usual scikit-learn model, simply call fit/predict
model.fit(dataset=train_dataset, n_epochs=10, batch_size=64)
pred = model.predict(dataset=test_dataset)

from sklearn.metrics import accuracy_score
print(accuracy_score(pred, test_dataset.labels))


Epoch 1 of 10, Loss: 0.425, accuracy: 0.8086
Epoch 2 of 10, Loss: 0.173, accuracy: 0.8361
Epoch 3 of 10, Loss: 0.251, accuracy: 0.8384
Epoch 4 of 10, Loss: 0.237, accuracy: 0.8405
Epoch 5 of 10, Loss: 0.112, accuracy: 0.8404
Epoch 6 of 10, Loss: 0.188, accuracy: 0.8413
Epoch 7 of 10, Loss: 0.074, accuracy: 0.8423
Epoch 8 of 10, Loss: 0.17, accuracy: 0.8432
Epoch 9 of 10, Loss: 0.228, accuracy: 0.8428
Epoch 10 of 10, Loss: 0.376, accuracy: 0.8439
0.837029959735


I have included a `get_embeddings` method to easily retrieve the learned embeddings for a column. It returns a dictionary where the keys are the column values and the values are the corresponding embedding vectors.

In [29]:
model.get_embeddings('education')

{'10th': array([ 1.0558697 , -0.10497121,  1.2519902 , -1.20969331, -0.37003803,
         0.26222366,  1.39537013,  0.66922128, -1.14872277, -1.66497922], dtype=float32),
 '11th': array([ 0.76582593, -0.15720901, -0.79173702,  0.17092067, -1.01140571,
        -0.15254961,  1.59629261, -1.03472006, -0.1246258 ,  0.87272727], dtype=float32),
 '12th': array([ 2.80748963, -0.40501541, -1.66380119,  1.119385  ,  0.11228444,
         0.46560571, -0.2575815 , -0.78553766,  0.40721282,  2.17365384], dtype=float32),
 '1st-4th': array([-0.59988064, -0.91489893,  0.77964532,  1.34235549, -2.21585774,
        -1.20931304,  1.87390292,  0.40189996, -1.43448257,  0.0121912 ], dtype=float32),
 '5th-6th': array([ 0.13168913,  0.50879979,  0.44774669,  0.75261694, -2.11371017,
        -0.86445326, -0.59014183, -1.84488511, -0.8879115 , -0.68353879], dtype=float32),
 '7th-8th': array([ 1.42483819,  0.34507382, -0.05195802,  1.38898981,  0.17512439,
        -0.58219528,  0.94600356, -0.67991239, -1.80070
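One thing you might do with such a dictionary is look for categories whose embeddings point in similar directions. The sketch below uses hypothetical 3-dimensional vectors as a stand-in for the output of `model.get_embeddings('education')`. The helpers `cosine_similarity` and `most_similar` are illustrative and not part of the library.

```python
import numpy as np

# Hypothetical embeddings standing in for model.get_embeddings('education').
embeddings = {
    "10th": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "11th": np.array([0.8, 0.6, 0.0], dtype=np.float32),
    "Doctorate": np.array([-1.0, 0.0, 0.0], dtype=np.float32),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(key, emb):
    """Return the other key whose vector has the highest cosine similarity."""
    scores = {k: cosine_similarity(emb[key], v) for k, v in emb.items() if k != key}
    return max(scores, key=scores.get)

print(most_similar("10th", embeddings))
```

With real embeddings, whether neighbours are meaningful depends on how well the model has trained, so treat the result as exploratory.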

## 2. Multiclass classification with fixed embedding dimensions (10) and varying dropout

Let's first define a feature for multiclass classification. Note that **this is only for illustration purposes**. 
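The age groups are built with `pd.cut`. As a quick sketch of how the binning behaves on a few hypothetical ages: with its default settings `pd.cut` uses right-closed intervals, so the edges `[0, 25, 50, 90]` mean (0, 25], (25, 50] and (50, 90].

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([17, 25, 26, 50, 51, 90])
age_groups = [0, 25, 50, 90]
age_labels = list(range(len(age_groups) - 1))

binned = pd.cut(ages, age_groups, labels=age_labels)
print(binned.tolist())  # [0, 0, 1, 1, 2, 2]
```

Values outside the outer edges (for example an age of 0 or 91) would come back as NaN, so the edges need to cover the full range of the column.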

In [31]:
# Let's define age groups
age_groups = [0, 25, 50, 90]
age_labels = range(len(age_groups) - 1)
DF['age_group'] = pd.cut(DF['age'], age_groups, labels=age_labels)

# Set the experiment
wide_cols = ['hours_per_week','education', 'relationship','workclass',
             'occupation','native_country','gender']
crossed_cols  = (['education', 'occupation'], ['native_country', 'occupation'])
embeddings_cols  = ['education', 'relationship','workclass','occupation','native_country']
continuous_cols = ["hours_per_week"]
target = 'age_group'
method = 'multiclass'

wd_dataset = prepare_data(DF,wide_cols,crossed_cols,embeddings_cols,continuous_cols,target,def_dim=10)

wide_dim = wd_dataset['train_dataset'].wide.shape[1]
n_class=3
deep_column_idx = wd_dataset['deep_column_idx']
embeddings_input= wd_dataset['embeddings_input']
encoding_dict   = wd_dataset['encoding_dict']
hidden_layers = [100,50]
dropout = [0.5, 0.2]

model = WideDeep(wide_dim,embeddings_input,continuous_cols,deep_column_idx,hidden_layers,dropout,encoding_dict,n_class)
model.compile(method=method)

# Let's have a look at the model
print(model)

WideDeep (
  (emb_layer_workclass): Embedding(9, 10)
  (emb_layer_education): Embedding(16, 10)
  (emb_layer_native_country): Embedding(42, 10)
  (emb_layer_relationship): Embedding(6, 8)
  (emb_layer_occupation): Embedding(15, 10)
  (linear_1): Linear (49 -> 100)
  (linear_1_drop): Dropout (p = 0.5)
  (linear_2): Linear (100 -> 50)
  (linear_2_drop): Dropout (p = 0.2)
  (output): Linear (847 -> 3)
)
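The `Dropout` layers in the printout randomly zero activations during training. As a rough NumPy sketch of the idea (the "inverted dropout" scheme that `torch.nn.Dropout` follows, where surviving units are scaled by 1/(1-p) while training and nothing happens at evaluation time): the function below is an illustration, not the code the model runs.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones((4, 100))  # hypothetical output of linear_1

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print(np.unique(train_out))  # dropped units are 0, kept units are scaled up
```

The rescaling keeps the expected activation the same in training and evaluation, which is why no correction is needed when predicting.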


In [32]:
train_dataset = wd_dataset['train_dataset']
model.fit(dataset=train_dataset, n_epochs=10, batch_size=64)
test_dataset  = wd_dataset['test_dataset']

# The model object also has a predict_proba method in case you want probabilities instead of classes
pred = model.predict_proba(test_dataset)
print('\n {}'.format(pred))

Epoch 1 of 10, Loss: 0.964, accuracy: 0.6522
Epoch 2 of 10, Loss: 1.013, accuracy: 0.6829
Epoch 3 of 10, Loss: 0.992, accuracy: 0.6873
Epoch 4 of 10, Loss: 0.991, accuracy: 0.69
Epoch 5 of 10, Loss: 1.024, accuracy: 0.693
Epoch 6 of 10, Loss: 0.706, accuracy: 0.6933
Epoch 7 of 10, Loss: 0.833, accuracy: 0.6959
Epoch 8 of 10, Loss: 0.76, accuracy: 0.6958
Epoch 9 of 10, Loss: 0.783, accuracy: 0.6971
Epoch 10 of 10, Loss: 0.898, accuracy: 0.698

 [[  9.97471273e-01   2.52866116e-03   4.56306566e-08]
 [  9.44395465e-11   1.00000000e+00   5.53709922e-09]
 [  1.76757031e-09   9.99999881e-01   9.40417166e-08]
 ..., 
 [  3.58941092e-04   9.91232693e-01   8.40835553e-03]
 [  3.10289147e-06   9.99976993e-01   1.99342994e-05]
 [  5.78610539e-01   4.08240706e-01   1.31487865e-02]]
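Each row of that array holds one probability per class and sums to roughly 1. If you want class labels from the probabilities yourself, taking the most probable class per row is the usual approach. The sketch below does this on three rows copied from the output above. That `model.predict` works the same way is an assumption, not something shown in this notebook.

```python
import numpy as np

# Three rows copied from the predict_proba output above.
proba = np.array([
    [9.97471273e-01, 2.52866116e-03, 4.56306566e-08],
    [9.44395465e-11, 1.00000000e+00, 5.53709922e-09],
    [5.78610539e-01, 4.08240706e-01, 1.31487865e-02],
])

# Each row is a distribution over the 3 age groups.
row_sums = proba.sum(axis=1)

# Pick the most probable class for each row.
pred_class = proba.argmax(axis=1)
print(pred_class)  # [0 1 0]
```

The last row is a good reminder that the argmax hides uncertainty: class 0 wins with a probability of only about 0.58.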


In [33]:
from sklearn.metrics import f1_score, accuracy_score

print("\n {}".format(f1_score(model.predict(test_dataset), test_dataset.labels, average="weighted")))

print("\n {}".format(accuracy_score(model.predict(test_dataset), test_dataset.labels)))


 0.735653027645

 0.703610182215
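For reference, the weighted F1 is the average of the per-class F1 scores, each weighted by the number of true samples of that class. scikit-learn's `f1_score` takes its arguments as `(y_true, y_pred)`, and since the weights come from the true labels, the weighted score depends on which argument is treated as ground truth (accuracy does not). Here is a small NumPy sketch with hypothetical labels. `weighted_f1` is an illustrative helper, not a library function.

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true support."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0.0
    for cls in np.unique(y_true):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += np.sum(y_true == cls) * f1
    return float(total / len(y_true))

# Hypothetical labels and predictions.
labels = [0, 0, 0, 1]
preds = [0, 0, 1, 1]

print(round(weighted_f1(labels, preds), 4))  # 0.7667
print(round(weighted_f1(preds, labels), 4))  # 0.7333
```

The two calls differ only in argument order, yet they give different scores, so it is worth double-checking the order when reporting a weighted F1.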


## 3. Linear regression with varying embedding dimensions and varying dropout

Again, bear in mind that `age` is used as the target here just **for illustration purposes**.

In [40]:
# Set the experiment
wide_cols = ['hours_per_week','education', 'relationship','workclass',
             'occupation','native_country','gender']
crossed_cols  = (['education', 'occupation'], ['native_country', 'occupation'])
embeddings_cols  = [('education',10), ('relationship',8), ('workclass',10),
                    ('occupation',10),('native_country',10)]
continuous_cols = ["hours_per_week"]
target = 'age'
method = 'regression'

# Prepare the dataset
wd_dataset = prepare_data(DF, wide_cols,crossed_cols,embeddings_cols,continuous_cols,target)

wide_dim = wd_dataset['train_dataset'].wide.shape[1]
n_class=1
deep_column_idx = wd_dataset['deep_column_idx']
embeddings_input= wd_dataset['embeddings_input']
encoding_dict   = wd_dataset['encoding_dict']
hidden_layers = [100,50]
dropout = [0.5, 0.2]
model = WideDeep(wide_dim,embeddings_input,continuous_cols,deep_column_idx,hidden_layers,dropout,encoding_dict,n_class)
model.compile(method=method)
print(model)

WideDeep (
  (emb_layer_workclass): Embedding(9, 10)
  (emb_layer_education): Embedding(16, 10)
  (emb_layer_native_country): Embedding(42, 10)
  (emb_layer_relationship): Embedding(6, 8)
  (emb_layer_occupation): Embedding(15, 10)
  (linear_1): Linear (49 -> 100)
  (linear_1_drop): Dropout (p = 0.5)
  (linear_2): Linear (100 -> 50)
  (linear_2_drop): Dropout (p = 0.2)
  (output): Linear (847 -> 1)
)


In [41]:
train_dataset = wd_dataset['train_dataset']
model.fit(dataset=train_dataset, n_epochs=10, batch_size=64)

test_dataset  = wd_dataset['test_dataset']
pred = model.predict(test_dataset)

from sklearn.metrics import mean_squared_error
print("\n RMSE: {}".format(np.sqrt(mean_squared_error(pred, test_dataset.labels))))

Epoch 1 of 10, Loss: 293.391
Epoch 2 of 10, Loss: 190.727
Epoch 3 of 10, Loss: 229.331
Epoch 4 of 10, Loss: 186.07
Epoch 5 of 10, Loss: 192.077
Epoch 6 of 10, Loss: 59.602
Epoch 7 of 10, Loss: 178.112
Epoch 8 of 10, Loss: 137.38
Epoch 9 of 10, Loss: 135.515
Epoch 10 of 10, Loss: 66.123

 RMSE: 11.2608167188
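An RMSE is easier to interpret next to a trivial baseline, such as always predicting the mean of the target. The sketch below shows the computation on hypothetical ages and predictions. The numbers are made up and are not related to the run above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical targets and predictions.
ages = [30, 40, 50]
preds = [32, 38, 53]

model_rmse = rmse(ages, preds)
# Baseline: predict the mean age for every row.
baseline_rmse = rmse(ages, np.full(len(ages), np.mean(ages)))

print(model_rmse < baseline_rmse)  # True
```

On the real test set, the same comparison against the standard deviation of `age` would tell you how much the model improves on a constant prediction.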
