## Computational Healthcare Library: Training an embedding model using TensorFlow

#### The goal of the Computational Healthcare Library is to help computer scientists do high-impact healthcare research by providing a simple interface to large, publicly available healthcare datasets.

With the Computational Healthcare Library you can:

- Load & analyze data from up to 200 Million visits & 70 Million patients
- Specify aggregation strategies and compute aggregate statistics in a privacy preserving manner   
- Build embedding models, perform transfer learning, predict rehospitalizations/revisits using TensorFlow 
- Benchmark results against baseline algorithms trained on publicly available datasets
- In the future, test differential privacy algorithms for computing aggregate statistics and for privacy-preserving machine learning

The Computational Healthcare Library (chlib) is also used to build [Computational Healthcare: A Medical Search & Aggregation Engine](http://www.computationalhealthcare.com/).

To use this notebook, follow the instructions in [README.md](https://github.com/AKSHAYUBHAT/ComputationalHealthcare/blob/master/README.md). Once the Docker container is running, provide the CSV file and run `prepare_nrd.sh` to process and load the NRD dataset. When processing finishes, go to localhost:8888 to reach the Jupyter notebook server running inside the container, then open this notebook (blog/introduction.ipynb) in Jupyter.

## Building an embedding model for Diagnosis, Procedure and E Codes

###  Import libraries

In [1]:
import pandas as pd
import tensorflow as tf
import numpy as np
import random
import sys
from collections import defaultdict,Counter
sys.path.append('../') ## since chlib is in parent directory
import chlib

## Get object for Texas dataset

In [6]:
TX = chlib.data.Data.get_from_config('../config.json','TX') # Texas dataset

### The Texas data does not contain patient-level records, only individual visits, each wrapped inside a patient object

In [14]:
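for pkey,p in TX.iter_patients():
    # print p      # uncomment to inspect the patient object
    for v in p.visits:
        # print v  # uncomment to inspect the visit
        print "uncomment above line"
    break  # stop after the first patient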

uncomment above line


## Building a Word2Vec-style embedding model

### Vocabulary of codes 

In [8]:
vocab = {}
reverse_vocab = {}
index = 0
min_count = 100
for code in TX.iter_codes():
    if code.visits_count() > min_count:
        vocab[code.code] = index
        reverse_vocab[index] = code.code        
        index += 1
print len(vocab),len(reverse_vocab)

10027 10027
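
The filtering step above can be tried without the Texas data. The sketch below is only an illustration under stated assumptions: `build_vocab` is a hypothetical helper, the codes are made-up placeholders, and the per-code visit counts are recomputed from a toy list rather than read from chlib's `visits_count()`. It keeps the same strict `> min_count` comparison as the cell above.

```python
from collections import Counter

# Made-up placeholder codes, not real ICD-9 codes; each inner list is one visit.
toy_visits = [
    ["D100", "D200", "P300"],
    ["D100", "D200"],
    ["D100", "P300", "DE400"],
]

def build_vocab(visits, min_count):
    """Assign dense indices to codes that appear in more than min_count visits."""
    visit_counts = Counter()
    for codes in visits:
        visit_counts.update(set(codes))  # count a code at most once per visit
    vocab, reverse_vocab = {}, {}
    for code in sorted(visit_counts):    # sorted only to make the indices reproducible
        if visit_counts[code] > min_count:
            index = len(vocab)
            vocab[code] = index
            reverse_vocab[index] = code
    return vocab, reverse_vocab

toy_vocab, toy_reverse_vocab = build_vocab(toy_visits, min_count=1)
print(sorted(toy_vocab.items()))  # [('D100', 0), ('D200', 1), ('P300', 2)]
```

Here `DE400` appears in a single visit, so it is dropped; the two dictionaries stay inverse to each other, which is what later cells rely on when they map indices back to descriptions.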


###  Load procedure, diagnosis & external event codes from 1,000,000 visits

In [9]:
count = 0
data = []
for _,p in TX.iter_patients():
    v_codes = []
    for v in p.visits:
        for pr in v.prs:
            if pr.pcode in vocab:
                v_codes.append(vocab[pr.pcode])
        for dx in v.dxs:
            if dx in vocab:
                v_codes.append(vocab[dx])
        for ex in v.exs:
            if ex in vocab:
                v_codes.append(vocab[ex])
    v_codes = list(set(v_codes))
    random.shuffle(v_codes)
    data.append(v_codes)
    count += 1
    if count == 1000000:
        break
random.shuffle(data)        
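
As a rough illustration of what the loop above does to one visit, the following sketch encodes a list of code strings as vocabulary indices, drops codes that are out of vocabulary, removes duplicates, and shuffles the result. The helper name `encode_visit`, the toy vocabulary, and the placeholder codes are assumptions made for this example and are not part of chlib.

```python
import random

# Toy vocabulary with made-up placeholder codes (an assumption for this sketch).
toy_vocab = {"D100": 0, "D200": 1, "P300": 2}

def encode_visit(codes, vocab, rng):
    """Return the shuffled, de-duplicated vocabulary indices for one visit."""
    indices = sorted({vocab[c] for c in codes if c in vocab})  # drop unknowns and repeats
    rng.shuffle(indices)  # order no longer reflects the order the codes were listed in
    return indices

rng = random.Random(0)  # seeded so repeated runs give the same order
encoded = encode_visit(["D100", "P300", "D100", "DE400"], toy_vocab, rng)
print(sorted(encoded))  # [0, 2]
```

The shuffle matters later: pairs are formed from each code and the codes that follow it, so without shuffling the codes appended first (procedures in the cell above) would always sit on the input side of a pair.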

### List of codes for a single visit

In [None]:
coder = chlib.codes.Coder() # to get string description
for code_index in data[7001]:
    print code_index,reverse_vocab[code_index],coder[reverse_vocab[code_index]]

### Batch Generator

In [10]:
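class BatchGenerator(object):
    """Yields (input code, co-occurring code) index pairs drawn from visits."""

    def __init__(self,data):
        self.data = data    # list of visits, each a list of vocabulary indices
        self.buffer = []    # pending (code, neighbor) pairs
        self.vindex = 0     # number of visits consumed so far

    def fill_buffer(self):
        # Take the next visit, wrapping around at the end of the data.
        v = self.data[self.vindex % len(self.data)]
        self.vindex += 1
        # Pair each code with every code that follows it in the same visit.
        for i,c in enumerate(v):
            for n in v[i+1:]:
                self.buffer.append((c,n))

    def get_pairs(self,batch_size):
        # Read visits until the buffer holds at least one full batch.
        while len(self.buffer) < batch_size:
            self.fill_buffer()
        while batch_size > 0:
            yield self.buffer.pop()
            batch_size -= 1

    def generate_batch(self,batch_size):
        batch = np.ndarray(shape=(batch_size), dtype=np.int32)
        labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
        for i,c_n in enumerate(self.get_pairs(batch_size)):
            c,n = c_n
            batch[i] = c
            labels[i] = n
        return batch, labels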

generator = BatchGenerator(data)
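
The nested loop in `fill_buffer` pairs every code in a visit with each code that comes after it, which is the same set of pairs that `itertools.combinations` produces. The sketch below shows this on a toy visit; `visit_pairs` is a hypothetical helper written for this illustration, not part of the notebook's generator.

```python
from itertools import combinations

def visit_pairs(visit):
    """All (code, later code) pairs within one visit, in visit order."""
    return list(combinations(visit, 2))

pairs = visit_pairs([7, 3, 9])
print(pairs)  # [(7, 3), (7, 9), (3, 9)]

# A visit with n codes contributes n * (n - 1) / 2 pairs.
print(len(visit_pairs(list(range(10)))))  # 45
```

Each pair is emitted in one direction only, and a visit with fewer than two codes contributes nothing, so visits with many codes dominate the training pairs.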

In [None]:
batch, labels = generator.generate_batch(5)
print generator.vindex
for i in range(5):
    print batch[i], coder[reverse_vocab[batch[i]]][:50],'->', labels[i, 0], coder[reverse_vocab[labels[i, 0]]][:50]            
batch, labels = generator.generate_batch(5)
print generator.vindex
for i in range(5):
    print batch[i], coder[reverse_vocab[batch[i]]][:50],'->', labels[i, 0], coder[reverse_vocab[labels[i, 0]]][:50]                    

### Defining the computation graph

In [12]:
import math
batch_size = 128
vocabulary_size = len(vocab)
embedding_size = 50  # Dimension of the embedding vector.
valid_size = 4       # Number of codes whose nearest neighbors are printed during training.
valid_window = 100   # Validation codes are sampled from the first 100 vocabulary indices.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64     # Number of negative examples to sample.

graph = tf.Graph()
with graph.as_default():
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    with tf.device('/cpu:0'):
        embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)
        nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
    loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,biases=nce_biases,labels=train_labels,inputs=embed,num_sampled=num_sampled,num_classes=vocabulary_size))
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
    init = tf.initialize_all_variables()  # deprecated in later TensorFlow 1.x releases in favor of tf.global_variables_initializer()
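
The last few graph ops divide each embedding row by its L2 norm and multiply the normalized rows together, which yields cosine similarities. A minimal pure-Python sketch of that computation, together with the nearest-neighbor ranking the training loop applies to it, is given below. The helper names and the toy 2-dimensional embeddings are assumptions for illustration; the sketch excludes the query by index, whereas the notebook drops the top-ranked entry by slicing.

```python
import math

def normalize_rows(matrix):
    """Scale every row to unit L2 length (rows must be non-zero)."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row])
    return normalized

def nearest(matrix, query_index, top_k):
    """Indices of the top_k rows most cosine-similar to the query row."""
    unit = normalize_rows(matrix)
    query = unit[query_index]
    # The dot product of unit vectors is the cosine similarity.
    sims = [sum(q * x for q, x in zip(query, row)) for row in unit]
    ranked = sorted(range(len(unit)), key=lambda j: -sims[j])
    return [j for j in ranked if j != query_index][:top_k]

# Toy embeddings, one row per code (made-up values).
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]
print(nearest(toy_embeddings, query_index=0, top_k=2))  # [1, 2]
```

Row 1 points almost the same way as row 0, so it ranks first; row 3 points the opposite way and has similarity -1, so it ranks last.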

## Train embedding model

In [13]:
num_steps = 400000
with tf.Session(graph=graph) as session:
    init.run()
    print("Initialized")
    average_loss = 0
    for step in xrange(num_steps):
        batch_inputs, batch_labels = generator.generate_batch(batch_size)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # Perform one update step by evaluating the optimizer op, including it
        # in the list of values returned by session.run().
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 5000 == 0:
            if step > 0:
                # Average over the 5000 steps since the last report.
                average_loss /= 5000
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0

        if step % 50000 == 0:
            sim = similarity.eval()
            for i in xrange(valid_size):
                valid_word = coder[reverse_vocab[valid_examples[i]]]
                top_k = 5  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = "\nNearest to %s:\n" % valid_word
                for k in xrange(top_k):
                    close_word = coder[reverse_vocab[nearest[k]]]
                    log_str = "%s:\t%s\t%s,\n" % (log_str,reverse_vocab[nearest[k]],close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()


Initialized
('Average loss at step ', 0, ': ', 233.60054016113281)

Nearest to Hb-SS disease with crisis:
:	DE9323	Insulins and antidiabetic agents causing adverse effects in therapeutic use,
:	D78322	Underweight,
:	P9211	Cerebral scan,
:	D8064	Closed fracture of lumbar spine with spinal cord injury,
:	P3562	Repair of ventricular septal defect with tissue graft,


Nearest to Outcome of delivery, twins, both liveborn:
:	D8250	Fracture of calcaneus, closed,
:	P0211	Simple suture of dura mater of brain,
:	D25801	Multiple endocrine neoplasia [MEN] type I,
:	DG546	DRG V24 : SPINAL FUSION EXC CERV WITH CURVATURE OF THE SPINE OR MALIG,
:	D1469	Malignant neoplasm of oropharynx, unspecified site,


Nearest to Multiple cranial nerve palsies:
:	D78060	Fever, unspecified,
:	D32725	Congenital central alveolar hypoventilation syndrome,
:	P9605_2	Other intubation of respiratory tract 2nd ,
:	D70707	Pressure ulcer, heel,
:	D81231	Open fracture of shaft of humerus,


Nearest to Incision of vessel, uppe