# Mortgage Workflow with Deep Learning

## Dataset

The dataset used with this workflow is derived from [Fannie Mae’s Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae.

The ETL preprocessing has already been run, and its output is located at `/tmp/eoldridge/fnma_full_data_proc_out4/dnn/`.

## PyTorch Deep Neural Network

### Model
The model constructed below starts with an embedding layer ([`torch.nn.EmbeddingBag`](https://pytorch.org/docs/stable/nn.html#embeddingbag)) that takes the hashed feature indices produced by the ETL pipeline, looks up the corresponding embeddings, and averages them. The resulting vector is passed to a [multilayer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron), which outputs a single score.

Many of the architecture parameters are user-configurable, including the embedding dimension, the number and size of the hidden layers, and the activation function.
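
The architecture described above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual `MortgageNetwork`: the class name `EmbeddingBagMLP`, its constructor arguments, and the toy sizes are assumptions made for the example.

```python
import torch
from torch import nn


class EmbeddingBagMLP(nn.Module):
    """Hypothetical stand-in: mean-pooled embeddings followed by an MLP."""

    def __init__(self, num_features, embedding_size, hidden_dims,
                 dropout=None, activation=None):
        super().__init__()
        # mode="mean" averages the embeddings of all indices in each bag.
        self.embedding = nn.EmbeddingBag(num_features, embedding_size, mode="mean")
        layers = []
        in_dim = embedding_size
        for out_dim in hidden_dims:
            layers.append(nn.Linear(in_dim, out_dim))
            layers.append(activation if activation is not None else nn.ReLU())
            if dropout is not None:
                layers.append(nn.Dropout(dropout))
            in_dim = out_dim
        layers.append(nn.Linear(in_dim, 1))  # final layer emits one score
        self.mlp = nn.Sequential(*layers)

    def forward(self, indices):
        # indices: LongTensor of shape (batch, indices_per_example);
        # each row is treated as one bag.
        pooled = self.embedding(indices)
        return self.mlp(pooled).squeeze(-1)


# Toy sizes, chosen only to keep the example small.
model = EmbeddingBagMLP(num_features=1000, embedding_size=8, hidden_dims=[16, 16])
batch = torch.randint(0, 1000, (4, 5))
scores = model(batch)
print(scores.shape)
```

Passing a 2-D index tensor lets `EmbeddingBag` treat each row as a fixed-length bag, which avoids building an explicit `offsets` tensor. Variable-length examples would need the 1-D input plus `offsets` form instead.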

### Training
To cut down on boilerplate code and get the benefits of [early stopping](https://en.wikipedia.org/wiki/Early_stopping),
we use the [`ignite`](https://pytorch.org/ignite/) library.
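
To make the idea concrete, here is a plain-Python sketch of patience-based early stopping. It illustrates the general technique only and is not `ignite`'s API or this workflow's training loop; the `EarlyStopper` class and the validation losses are made up for the example.

```python
class EarlyStopper:
    """Illustrative patience-based early stopping (not ignite's API)."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        # Reset the counter on improvement; otherwise count a bad epoch.
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, for illustration only.
val_losses = [0.040, 0.031, 0.029, 0.030, 0.032, 0.028]
stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(stopped_at)  # 5: two consecutive epochs without improvement
```

Training halts once the validation loss has failed to improve for `patience` consecutive epochs, which saves compute and limits overfitting.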


## Requirements
Beyond the dependencies that come installed in the standard 
[RAPIDS docker containers](https://hub.docker.com/r/rapidsai/rapidsai) we'll also
need the following `pip` dependencies installed:

In [1]:
!pip install torch==1.0.1 pytorch-ignite==0.1.2

Collecting torch==1.0.1
  Downloading https://files.pythonhosted.org/packages/f7/92/1ae072a56665e36e81046d5fb8a2f39c7728c25c21df1777486c49b179ae/torch-1.0.1-cp36-cp36m-manylinux1_x86_64.whl (560.0MB)
Collecting pytorch-ignite==0.1.2
  Downloading https://files.pythonhosted.org/packages/19/79/7d53d47407668c1e73c4f22efceb40a787fe662017fffe8f2835d7e57a1b/pytorch_ignite-0.1.2-py2.py3-none-any.whl (44kB)
Installing collected packages: torch, pytorch-ignite
Successfully installed pytorch-ignite-0.1.2 torch-1.0.1


## Code
Most of the implementation details live in the accompanying `.py` files (the `model` and `training` modules imported below).

### Imports

In [1]:
from collections import defaultdict, OrderedDict
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np
import pyarrow.parquet as pq

In [2]:
import cudf
cudf.__version__

'0.7.2+0.g3ebd286.dirty'

In [3]:
import pdb

In [4]:
%load_ext autoreload
%autoreload 2

## Configuration

#### ETL - Discretization

In [5]:
max_quantiles = 20  # Maximum number of quantile bins for discretizing continuous features
num_features = 2 ** 22  # Hashed feature indices fall in the range [0, num_features)
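
To show how these two settings could interact, the sketch below bins a continuous value by quantile and then hashes the resulting (feature, bin) pair into `[0, num_features)`. The real ETL code is not shown in this notebook, so everything here is an assumption for illustration: the helper names, the use of `zlib.crc32` as the hash, the `interest_rate` feature name, and the sample values.

```python
import bisect
import zlib

num_features = 2 ** 22


def quantile_edges(values, max_quantiles):
    """Return up to max_quantiles - 1 increasing bin edges from sample values."""
    ordered = sorted(values)
    edges = []
    for q in range(1, max_quantiles):
        idx = (q * len(ordered)) // max_quantiles
        edge = ordered[min(idx, len(ordered) - 1)]
        if not edges or edge > edges[-1]:  # skip duplicate edges
            edges.append(edge)
    return edges


def bucketize(value, edges):
    """Map a continuous value to the index of its quantile bin."""
    return bisect.bisect_right(edges, value)


def hash_feature(name, bucket, num_features):
    """Hash a (feature name, bin) pair into [0, num_features)."""
    key = f"{name}={bucket}".encode("utf-8")
    return zlib.crc32(key) % num_features


# Made-up sample values for a hypothetical continuous feature.
rates = [3.5, 4.0, 4.25, 4.5, 5.0, 5.5, 6.0, 7.0]
edges = quantile_edges(rates, max_quantiles=4)
bucket = bucketize(4.6, edges)
index = hash_feature("interest_rate", bucket, num_features)
print(edges, bucket, index)
```

Hashing keeps the embedding table at a fixed size regardless of how many distinct (feature, bin) pairs occur, at the cost of occasional collisions.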

#### Training - Model Details

In [6]:
embedding_size = 64
hidden_dims = [600, 600, 600, 600]

device = 'cuda'
dropout = None  # Can add dropout probability in [0, 1] here
activation = nn.ReLU()

batch_size = 8096

## Torch Dataset from Parquet
The preprocessed ETL output is stored at `/tmp/eoldridge/fnma_full_data_proc_out4/dnn/`.

In [7]:
data_dir = '/data/mortgage/'
!ls -al --block-size=M /data/mortgage/

total 1M
drwxr-xr-x 5 root root 1M Apr  5 17:00 .
drwxr-xr-x 3 root root 1M May 27 22:31 ..
drwxr-xr-x 2 root root 1M Apr  5 17:24 test
drwxr-xr-x 2 root root 1M Apr  5 17:23 train
drwxr-xr-x 2 root root 1M Apr  5 17:24 validation


### Training starts here

In [8]:
from training import run_training
from model import MortgageNetwork

In [9]:
model = None
model = MortgageNetwork(num_features, embedding_size, hidden_dims,
                        dropout=dropout, activation=activation, use_cuda=True)

In [10]:
model.device

device(type='cuda')

In [None]:
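# batch_size is passed by keyword: a positional argument cannot follow keyword arguments
run_training(model, data_dir, batch_dataload=True, num_workers=0, batch_size=batch_size)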

Epoch[1] Iteration[63/2258] Loss: 0.04063 Example/s: 78709.486 (Total examples: 510048)
Epoch[1] Iteration[126/2258] Loss: 0.02626 Example/s: 106202.064 (Total examples: 1020096)
Epoch[1] Iteration[189/2258] Loss: 0.03163 Example/s: 119798.259 (Total examples: 1530144)
Epoch[1] Iteration[252/2258] Loss: 0.02707 Example/s: 127801.786 (Total examples: 2040192)
Epoch[1] Iteration[315/2258] Loss: 0.03424 Example/s: 133189.829 (Total examples: 2550240)
Epoch[1] Iteration[378/2258] Loss: 0.03752 Example/s: 137085.818 (Total examples: 3060288)
Epoch[1] Iteration[441/2258] Loss: 0.02668 Example/s: 139878.462 (Total examples: 3570336)
Epoch[1] Iteration[504/2258] Loss: 0.03745 Example/s: 142286.898 (Total examples: 4080384)
Epoch[1] Iteration[567/2258] Loss: 0.02969 Example/s: 144299.902 (Total examples: 4590432)
Epoch[1] Iteration[630/2258] Loss: 0.02793 Example/s: 145894.987 (Total examples: 5100480)
Epoch[1] Iteration[693/2258] Loss: 0.03268 Example/s: 147275.914 (Total examples: 5610528)
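
As a quick sanity check, the totals in the log above are consistent with the configured batch size. The arithmetic below uses only numbers that appear in the notebook; the per-epoch figure is an upper bound, since the final batch of an epoch may be smaller.

```python
batch_size = 8096            # from the configuration cell
iterations_per_epoch = 2258  # denominator in the log lines
log_interval = 63            # iterations between consecutive log lines

examples_per_log = log_interval * batch_size
max_examples_per_epoch = iterations_per_epoch * batch_size

print(examples_per_log)        # 510048, matches the first log line
print(11 * examples_per_log)   # 5610528, matches iteration 693
print(max_examples_per_epoch)  # 18280768
```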
