## Deep learning for medical claims -- Autoencoders for billing codes

I've recently been exploring ways to extract more information from the billing claims data that I tend to work with. 

Anthem covers 40.2 million lives, so you can imagine that the size/scope of our claims data is in the billions of records. Each claim contains a bit of information about a member -- what services were delivered and where, what diseases/symptoms prompted the service, what the service cost, who delivered the service, etc. Trying to extract insights from these bits and pieces can be a challenge that typically requires a great deal of human input. Data scientists would call it heavy feature engineering. 

Sure, some information is encoded because coding systems are designed to enforce meanings (e.g., ICD, CPT-4, and Medispan GPI codes have a hierarchical structure where the leading digits group diseases/procedures/drugs into larger, clinically similar categories). Still, in practice we've noticed a few issues with codes *as-is*:

1. There are few, if any, systems that permit understanding of relationships across coding systems. That is, absent clinical expertise, it is a challenge to know whether and how a 3-digit ICD-9 code like 520 'Disorders of tooth development and eruption' relates to an NDC code like 64116001101, which codes for a type of Interferon-Gamma.
2. When semantic relations among codes are captured, it is usually on an ad-hoc basis and requires input from clinical experts. Relationships have to be continuously monitored and updated (especially those involving NDC or Medispan GPI codes) as new technologies, drugs, etc. appear.
3. Coding schemes are discrete and high-dimensional; for example, there are over 70,000 ICD-10 CM codes. This means that in predictive modeling or other tasks some form of dimensionality reduction has to be used, whether it be the exclusion of certain codes because they aren't relevant or some other approach.
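
The hierarchy mentioned earlier gives the crudest form of dimensionality reduction: truncate each code to its category. Below is a minimal sketch, assuming ICD-9-CM diagnosis codes stored as strings. The helper name `icd9_category` and the example codes are made up for illustration and are not part of our pipeline.

```python
def icd9_category(code: str) -> str:
    """Return the category-level prefix of an ICD-9-CM diagnosis code."""
    code = code.strip().upper().replace(".", "")
    # E codes (external causes) use a 4-character category, e.g. E880;
    # numeric codes and V codes use a 3-character category, e.g. 250 or V70.
    width = 4 if code.startswith("E") else 3
    return code[:width]


# Illustrative codes only (assumption: codes may or may not carry a decimal point)
codes = ["250.00", "250.02", "520.6", "V70.0", "E880.9"]
categories = sorted({icd9_category(c) for c in codes})

print(len(codes), "codes roll up into", len(categories), "categories")
print(categories)
```

This kind of roll-up shrinks the vocabulary, but it only uses relationships that the coding system already bakes in, which is why it doesn't help with issues 1 and 2.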

In struggling with these problems we've started playing with the use of deep learning techniques to extract structure from our data. In particular, we've been exploring the use of [autoencoders](https://en.wikipedia.org/wiki/Autoencoder) to extract meaning among this large, disordered set of codes. 

## An overly simple intro to autoencoders

Autoencoders are a type of neural network architecture. Their general goal is to take a set of inputs and then predict those same inputs using a lower-dimensional representation of the data. In other words, these tools learn more parsimonious ways of representing potentially high-dimensional data. Below is a nice picture, courtesy of Stanford's deep learning group, that I actually found on [another site](https://www.doc.ic.ac.uk/~js4416/163/website/nlp/) providing a good description of autoencoders in the context of natural language processing.

![Image of autoencoder](http://ufldl.stanford.edu/tutorial/images/Autoencoder636.png) 

In the image you see an input layer of 6 cells (X) plus 1 bias cell. These feed into an intermediate layer of 4 cells (3 with inputs, 1 bias) and then an output layer of, again, 6 cells. In the intermediate layer we engage in 'encoding' information, condensing it down into a smaller dimension. We then decode this information back out.
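
To make the picture concrete, here is a rough NumPy sketch of a single pass through a 6-3-6 network like the one in the figure. The sigmoid activations, the random weights, and the names `encode` and `decode` are my own assumptions for illustration; the figure doesn't commit to any of them.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden = 6, 3  # layer sizes taken from the figure

# Randomly initialised weights; the bias vectors play the role of the "+1" cells
W_enc = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
b_dec = np.zeros(n_inputs)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def encode(x):
    """Condense 6 inputs into 3 hidden values."""
    return sigmoid(x @ W_enc + b_enc)


def decode(h):
    """Expand 3 hidden values back out to 6 outputs."""
    return sigmoid(h @ W_dec + b_dec)


x = np.array([[1, 0, 0, 1, 0, 1]], dtype=float)  # one made-up input row
h = encode(x)
x_hat = decode(h)
print(h.shape, x_hat.shape)  # (1, 3) (1, 6)
```

With untrained weights `x_hat` is meaningless; training adjusts the weights so that `x_hat` gets close to `x`.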

There's actually a lot of depth and variation to this basic idea. For example, we can add a little bit of noise or drop some input connections out to make it more difficult for the model to learn, or we can add multiple intermediate layers (stacked autoencoders), or... Still, though, the basic idea remains the same: learn how to predict or re-create data using a lower-dimensional representation of it. At least, that's my high-level understanding.
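
As one example of the "add noise" variation, a denoising autoencoder corrupts the inputs and is still asked to reconstruct the clean version. A minimal sketch of the corruption step, assuming simple masking noise (the function name `mask_inputs` and the 30% drop rate are hypothetical choices):

```python
import numpy as np


def mask_inputs(x, drop_prob, rng):
    """Randomly zero out entries of x, each with probability drop_prob."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep


rng = np.random.default_rng(42)
x = np.ones((4, 6))                       # made-up clean inputs
x_noisy = mask_inputs(x, drop_prob=0.3, rng=rng)

# A denoising autoencoder would take x_noisy as input and x as the target.
print(x_noisy)
```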

In the code below I'll show how to apply the concept of autoencoders to ICD-9 CM codes using TensorFlow.

## Step 1: Calling in and getting the data ready

Our data will come from the Medical Expenditure Panel Survey (MEPS), specifically the 'medical component' from the 2015 and 2016 years. I've actually done a little pre-processing in R prior to this step. First, I used the R `lodown` package to download the data. Second, I did a little data tidying with `dplyr`. The result is a data set where each individual is a single row, and each column corresponds to an ICD-9 code. A value of `1` indicates the individual has the medical condition, a value of `0` means they do not (i.e., a 'one-hot' encoding of the data). You can see the data below:

In [11]:
import tensorflow as tf
import pandas as pd
import numpy as np

d_in = pd.read_csv("~/Desktop/Mike-McLaughlin.github.io/_drafts/meps_onehot.csv")

print(d_in.head(n = 10))

# Drop the id column so that only the ICD-9 indicator columns remain
d_in2 = d_in.drop(['dupersid'], axis=1)
n, p = d_in2.shape  # n individuals (rows), p ICD-9 codes (columns)

   dupersid  005  008  009  034  041  053  054  070  074 ...   V68  V70  V71  \
0  40001101    0    0    0    0    0    0    0    0    0 ...     0    0    0   
1  40001102    0    0    0    0    0    0    0    0    0 ...     0    0    0   
2  40001104    0    0    0    0    0    0    0    0    0 ...     0    0    0   
3  40002101    0    0    0    0    0    0    0    0    0 ...     0    0    0   
4  40004102    0    0    0    0    0    0    0    0    0 ...     0    0    0   
5  40004103    0    0    0    0    0    0    0    0    0 ...     0    0    0   
6  40004104    0    0    0    0    0    0    0    0    0 ...     0    1    0   
7  40004105    0    0    0    0    0    0    0    0    0 ...     0    1    0   
8  40004106    0    0    0    0    0    0    0    0    0 ...     0    0    0   
9  40004107    0    0    0    0    0    0    0    0    0 ...     1    1    0   

   V72  V74  V76  V77  V80  V81  V82  
0    0    0    0    0    0    0    0  
1    0    0    0    0    0    0    0  
2 

In [13]:
# Preview the first five rows now that the id column is gone
print(d_in2[0:5])

   005  008  009  034  041  053  054  070  074  075 ...   V68  V70  V71  V72  \
0    0    0    0    0    0    0    0    0    0    0 ...     0    0    0    0   
1    0    0    0    0    0    0    0    0    0    0 ...     0    0    0    0   
2    0    0    0    0    0    0    0    0    0    0 ...     0    0    0    0   
3    0    0    0    0    0    0    0    0    0    0 ...     0    0    0    0   
4    0    0    0    0    0    0    0    0    0    0 ...     0    0    0    0   

   V74  V76  V77  V80  V81  V82  
0    0    0    0    0    0    0  
1    0    0    0    0    0    0  
2    0    0    0    0    0    0  
3    0    0    0    0    0    0  
4    0    1    0    0    0    0  

[5 rows x 366 columns]
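
The tidying that produced this wide 0/1 table was done in R, so it isn't shown here. As a rough idea of what that step does, here is a small Python sketch that pivots long-format (person, code) records into an indicator matrix. The ids and codes are invented, and this is not the actual `dplyr` code.

```python
import numpy as np

# Hypothetical long-format records: one (person id, ICD-9 code) pair per row
records = [
    ("p1", "250"), ("p1", "401"),
    ("p2", "V70"),
    ("p3", "401"), ("p3", "V70"), ("p3", "401"),  # a repeated pair is harmless
]

ids = sorted({pid for pid, _ in records})
codes = sorted({code for _, code in records})
row = {pid: i for i, pid in enumerate(ids)}
col = {code: j for j, code in enumerate(codes)}

# One row per person, one column per code; 1 means the person has the condition
X = np.zeros((len(ids), len(codes)), dtype=np.int8)
for pid, code in records:
    X[row[pid], col[code]] = 1

print(codes)
print(X)
```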


## Step 2: Setting up the autoencoder in TensorFlow

In [None]:
from tensorflow.contrib.layers import fully_connected

# The ID column was already dropped above, so p counts only the ICD-9 columns
n_inputs = p
n_hidden = 25  # 25 cells --> reduce from 366 inputs to 25
n_outputs = n_inputs

learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
# Encoder: fully_connected applies a ReLU activation by default
hidden = fully_connected(X, n_hidden)
# Decoder: no activation here, because the loss below expects raw logits
logits = fully_connected(hidden, n_outputs, activation_fn=None)
outputs = tf.sigmoid(logits)
# Cross-entropy between the 0/1 inputs and their reconstruction
reconstruction_loss = tf.reduce_sum(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits))
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(reconstruction_loss)
init = tf.global_variables_initializer()

n_iterations = 100
codings = hidden  # the 25-dimensional representation of each individual



In [6]:
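with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        # Feed the full indicator matrix on every iteration (no mini-batching)
        training_op.run(feed_dict={X: d_in2.values})
    # Evaluate the hidden layer to get the learned codings for every individual
    codings_val = codings.eval(feed_dict={X: d_in2.values})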

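
For readers without a TensorFlow 1.x install, the same idea can be sketched in plain NumPy. This is a hedged stand-in rather than the model above: it uses synthetic 0/1 data instead of MEPS, a hypothetical 20-column input with 5 hidden cells, and plain gradient descent instead of Adam. It keeps the same structure, though: a ReLU hidden layer, a linear output layer, and a sigmoid cross-entropy reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the indicator matrix: 200 rows, 20 sparse 0/1 columns
X = (rng.random((200, 20)) < 0.1).astype(float)
n, p = X.shape
n_hidden = 5

W1 = rng.normal(scale=0.1, size=(p, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, p))
b2 = np.zeros(p)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X):
    hidden = np.maximum(0.0, X @ W1 + b1)  # ReLU encoder
    logits = hidden @ W2 + b2              # linear decoder output
    return hidden, logits


def bce_with_logits(logits, targets):
    # Numerically stable form: max(z, 0) - z * t + log(1 + exp(-|z|))
    return np.sum(np.maximum(logits, 0) - logits * targets
                  + np.log1p(np.exp(-np.abs(logits))))


learning_rate = 0.1
n_iterations = 200
losses = []

for iteration in range(n_iterations):
    hidden, logits = forward(X)
    losses.append(bce_with_logits(logits, X) / n)  # average loss per row

    # Backpropagation for the averaged loss
    d_logits = (sigmoid(logits) - X) / n
    dW2 = hidden.T @ d_logits
    db2 = d_logits.sum(axis=0)
    d_pre = (d_logits @ W2.T) * (hidden > 0)       # ReLU gradient
    dW1 = X.T @ d_pre
    db1 = d_pre.sum(axis=0)

    # Plain gradient descent step (assumption: no Adam, no mini-batches)
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

codings, _ = forward(X)  # lower-dimensional representation of each row
print("first loss:", losses[0], "last loss:", losses[-1])
print("codings shape:", codings.shape)
```

The `codings` array is the piece we care about: each row is a compressed description of one individual's set of codes.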