# Named Entity Recognition: Disease Extraction

### Tutorial from https://appliedmachinelearning.blog/2019/04/01/training-deep-learning-based-named-entity-recognition-from-scratch-disease-extraction-hackathon/

## Libraries 

In [1]:
import pandas as pd
import os
import numpy as np
from tqdm import tqdm, trange
import unicodedata
 
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense
from keras.layers import TimeDistributed, Dropout, Bidirectional
 
# Defining Constants
 
# Maximum length of text sentences
MAXLEN = 180
# Number of LSTM units
LSTM_N = 150
# batch size
BS=48

Using TensorFlow backend.


### Check if GPU is enabled 

In [2]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7785096482919198375
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 10729594913264868872
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 5065308776090252323
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 7427958375
locality {
  bus_id: 1
  links {
  }
}
incarnation: 1973663827580528049
physical_device_desc: "device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1"
]


In [3]:
from keras import backend as K
K.tensorflow_backend._get_available_gpus()

['/job:localhost/replica:0/task:0/device:GPU:0']

## Importing Data 

In [4]:
os.getcwd()

'/usr/tf/notebooks'

In [9]:
# Reading the training set
data = pd.read_csv("/usr/tf/notebooks/dataset/train.csv", encoding="latin1")

In [10]:
data.head(10)

Unnamed: 0,id,Doc_ID,Sent_ID,Word,tag
0,1,1,1,Obesity,O
1,2,1,1,in,O
2,3,1,1,Low-,O
3,4,1,1,and,O
4,5,1,1,Middle-Income,O
5,6,1,1,Countries,O
6,7,1,1,:,O
7,8,1,1,Burden,O
8,9,1,1,",",O
9,10,1,1,Drivers,O


In [11]:
print("Number of uniques docs, sentences and words in Training set:\n",data.nunique())

Number of uniques docs, sentences and words in Training set:
 id         4543833
Doc_ID       30000
Sent_ID     191282
Word        184505
tag              3
dtype: int64


In [12]:
# Reading the test set
test_data = pd.read_csv("/usr/tf/notebooks/dataset/test.csv", encoding="latin1")
test_data.head(10)

Unnamed: 0,id,Doc_ID,Sent_ID,Word
0,4543834,30001,191283,CCCVA
1,4543835,30001,191283,","
2,4543836,30001,191283,MANOVA
3,4543837,30001,191283,","
4,4543838,30001,191283,my
5,4543839,30001,191283,black
6,4543840,30001,191283,hen
7,4543841,30001,191283,.
8,4543842,30001,191284,Comments
9,4543843,30001,191284,on


In [13]:
print("Number of uniques docs, sentences and words in Training set:\n",data.nunique())
print("\nNumber of uniques docs, sentences and words in Test set:\n",test_data.nunique())
 
# Creating a vocabulary
words = list(set(pd.concat([data["Word"], test_data["Word"]]).values))
words.append("ENDPAD")
 
# Stripping accents and other non-ASCII characters, e.g. 'naïve café' becomes b'naive cafe'
words = [unicodedata.normalize('NFKD', str(w)).encode('ascii','ignore') for w in words]
n_words = len(words)
print("\nLength of vocabulary = ",n_words)
 
tags = list(set(data["tag"].values))
n_tags = len(tags)
print("\nnumber of tags = ",n_tags)
 
# Creating words to indices dictionary.
word2idx = {w: i for i, w in enumerate(words)}
# Creating tags to indices dictionary.
tag2idx = {t: i for i, t in enumerate(tags)}

Number of uniques docs, sentences and words in Training set:
 id         4543833
Doc_ID       30000
Sent_ID     191282
Word        184505
tag              3
dtype: int64

Number of uniques docs, sentences and words in Test set:
 id         2994463
Doc_ID       20000
Sent_ID     125840
Word        139891
dtype: int64

Length of vocabulary =  257203

number of tags =  3


In [14]:
tags

['B-indications', 'I-indications', 'O']
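
The normalization applied while building the vocabulary can be checked in isolation. The snippet below is a small sketch; `to_ascii` is a helper name introduced here for illustration and is not part of the notebook. It shows that accents are stripped, that characters without an ASCII decomposition (Greek letters, for instance) are dropped entirely, and that the result is a `bytes` object rather than a `str`.

```python
import unicodedata

def to_ascii(word):
    """Same expression as in the notebook: NFKD-decompose, then drop non-ASCII characters."""
    return unicodedata.normalize("NFKD", str(word)).encode("ascii", "ignore")

print(to_ascii("naïve café"))  # b'naive cafe'  (accents stripped, base letters kept)
print(to_ascii("β-blocker"))   # b'-blocker'    (the Greek letter is dropped)
print(to_ascii("ENDPAD"))      # b'ENDPAD'      (plain ASCII passes through, as bytes)
```

One consequence is that distinct source words may collapse onto the same byte string, in which case a dictionary such as `word2idx` keeps only the last index seen for that key.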

## Named Entity Tags

The target ‘tag’ column follows the inside-outside-beginning (IOB) format, a common scheme for labelling tokens in named entity recognition. It takes three values:

1. B-indications : Beginning tag indicates that the token is the beginning of a disease entity (disease name in this case).
2. I-indications : Inside tag indicates that the token is inside an entity.
3. O : Outside tag indicates that a token is outside a disease entity.

Therefore, any word that is not part of a disease name has to be classified as “O”. Similarly, the first word of a disease name has to be classified as “B-indications” and the following words of that name as “I-indications”.

An example with IOB format:

```
Alex B-PER
is O
going O
to O
Los B-LOC
Angeles I-LOC
```
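
To make the scheme concrete, here is a minimal sketch of decoding IOB tags back into entity spans. The function `iob_to_spans` and the six-token example sentence are made up for illustration (the entity words are borrowed from the training set); the notebook itself does not include this step.

```python
def iob_to_spans(tokens, tags):
    """Collect (entity_type, text) pairs from parallel token and IOB tag lists."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any entity still open
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            current_tokens.append(token)
        else:
            # "O", or a stray I- tag with no matching open entity: close and reset
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["chronic", "hepatitis", "and", "reactive", "hyperemia", "."]
tags = ["B-indications", "I-indications", "O", "B-indications", "I-indications", "O"]
print(iob_to_spans(tokens, tags))
# [('indications', 'chronic hepatitis'), ('indications', 'reactive hyperemia')]
```

In this sketch a stray `I-` tag that does not continue an open entity is simply ignored; other decoders treat it as the start of a new entity, so the choice is a design decision rather than part of the format.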

In [15]:
data.loc[data['tag'] != 'O']

Unnamed: 0,id,Doc_ID,Sent_ID,Word,tag
171,172,1,8,strategies,B-indications
211,212,2,10,MICROCEPHALIA,B-indications
212,213,2,10,VERA,I-indications
233,234,3,12,reactive,B-indications
234,235,3,12,hyperemia,I-indications
...,...,...,...,...,...
4543377,4543378,29999,191264,hepatitis,I-indications
4543398,4543399,29999,191265,chronic,B-indications
4543399,4543400,29999,191265,hepatitis,I-indications
4543432,4543433,29999,191267,serum,B-indications


## Getting Train & Test Sentences

In [16]:
def get_tagged_sentences(data):
    """Return the training sentences as a list of lists of (word, tag) tuples.

    Each inner list holds the words of one sentence along with their tags.
    """
    agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(), s["tag"].values.tolist())]
    grouped = data.groupby("Sent_ID").apply(agg_func)
    sentences = [s for s in grouped]
    return sentences
 
def get_test_sentences(data):
    """Return the test sentences as a list of lists of words.

    Each inner list holds the words of one sentence.
    """
    agg_func = lambda s: [w for w in s["Word"].values.tolist()]
    grouped = data.groupby("Sent_ID").apply(agg_func)
    sentences = [s for s in grouped]
    return sentences

# Getting training sentences in a list
sentences = get_tagged_sentences(data)
print("First 2 sentences in a word list format:\n",sentences[0:2])

# Getting test sentences in a list
test_sentences = get_test_sentences(test_data)
print("First 2 sentences in a word list format:\n",test_sentences[0:2])

First 2 sentences in a word list format:
 [[('Obesity', 'O'), ('in', 'O'), ('Low-', 'O'), ('and', 'O'), ('Middle-Income', 'O'), ('Countries', 'O'), (':', 'O'), ('Burden', 'O'), (',', 'O'), ('Drivers', 'O'), (',', 'O'), ('and', 'O'), ('Emerging', 'O'), ('Challenges', 'O'), ('.', 'O')], [('We', 'O'), ('have', 'O'), ('reviewed', 'O'), ('the', 'O'), ('distinctive', 'O'), ('features', 'O'), ('of', 'O'), ('excess', 'O'), ('weight', 'O'), (',', 'O'), ('its', 'O'), ('causes', 'O'), (',', 'O'), ('and', 'O'), ('related', 'O'), ('prevention', 'O'), ('and', 'O'), ('management', 'O'), ('efforts', 'O'), (',', 'O'), ('as', 'O'), ('well', 'O'), ('as', 'O'), ('data', 'O'), ('gaps', 'O'), ('and', 'O'), ('recommendations', 'O'), ('for', 'O'), ('future', 'O'), ('research', 'O'), ('in', 'O'), ('low-', 'O'), ('and', 'O'), ('middle-income', 'O'), ('countries', 'O'), ('(', 'O'), ('LMICs', 'O'), (')', 'O'), ('.', 'O')]]
First 2 sentences in a word list format:
 [['CCCVA', ',', 'MANOVA', ',', 'my', 'black', 'he

## Feature Extraction for DL Model

In [17]:
# Converting words to indices for train sentences (Features)
# Stripping accents and other non-ASCII characters in the train set, e.g. 'naïve café' becomes 'naive cafe'
X = [[word2idx[unicodedata.normalize('NFKD', str(w[0])).
encode('ascii','ignore')] for w in s] for s in sentences]
 
# Converting words to indices for test sentences (Features)
# Stripping accents and other non-ASCII characters in the test set, e.g. 'naïve café' becomes 'naive cafe'
X_test = [[word2idx[unicodedata.normalize('NFKD', str(w)).
encode('ascii','ignore')] for w in s] for s in test_sentences]
 
'''
Padding train and test sentences to 180 words.
Sentences of length greater than 180 words are truncated.
Sentences of length less than 180 words are padded with the index of the "ENDPAD" token (n_words - 1).
'''
X = pad_sequences(maxlen=MAXLEN, sequences=X, padding="post", value=n_words - 1)
X_test = pad_sequences(maxlen=MAXLEN, sequences=X_test, padding="post", value=n_words - 1)
 
# Converting tags to indices for train sentences (labels)
y = [[tag2idx[w[1]] for w in s] for s in sentences]
# Padding tag labels to 180 words.
y = pad_sequences(maxlen=MAXLEN, sequences=y, padding="post", value=tag2idx["O"])
 
# Making labels in one hot encoded form for DL model
y = [to_categorical(i, num_classes=n_tags) for i in y]
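
The padding and one-hot steps can be illustrated without Keras. The sketch below re-implements them in plain NumPy under one assumption worth stating: `pad_sequences` is called here with `padding="post"` but without a `truncating` argument, and the Keras default for that argument is `"pre"`, so an over-long sentence keeps its last `maxlen` tokens. The helper names and the toy padding index are hypothetical.

```python
import numpy as np

def pad_post_truncate_pre(seq, maxlen, value):
    """NumPy sketch of pad_sequences(padding="post") with the default truncating="pre"."""
    seq = list(seq)[-maxlen:]  # over-long input keeps its LAST maxlen items
    return np.array(seq + [value] * (maxlen - len(seq)))

def one_hot(indices, num_classes):
    """Row i of the identity matrix is the one-hot vector for class i."""
    return np.eye(num_classes, dtype="float32")[indices]

PAD = 9  # hypothetical padding index, standing in for n_words - 1
short = pad_post_truncate_pre([4, 7, 1], maxlen=5, value=PAD)
long_ = pad_post_truncate_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5, value=PAD)
print(short)  # [4 7 1 9 9]
print(long_)  # [3 4 5 6 7]
print(one_hot(np.array([0, 2, 2]), num_classes=3).shape)  # (3, 3)
```

If dropping the start of long sentences is not wanted, passing `truncating="post"` to `pad_sequences` keeps the first `maxlen` tokens instead.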

## Building Bidirectional LSTM Model

In [18]:
# 180 dimensional word indices as input
inputs = Input(shape=(MAXLEN,))
 
# Embedding layer; the embedding size is also set to MAXLEN, so each word index maps to a 180-dim vector
model = Embedding(input_dim=n_words, output_dim=MAXLEN, input_length=MAXLEN)(inputs)
 
# Adding dropout layer
model = Dropout(0.2)(model)
 
# Bidirectional LSTM to learn from both forward as well as backward context
model = Bidirectional(LSTM(units=LSTM_N, return_sequences=True, recurrent_dropout=0.1))(model)
 
# Adding a TimeDistributed wrapper to apply the same Dense layer at each of the 180 timesteps
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model) # softmax output layer
model = Model(inputs, out)
 
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X, np.array(y), batch_size=BS, epochs=2, validation_split=0.05, verbose=1)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 181717 samples, validate on 9565 samples
Epoch 1/2
Epoch 2/2
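
The sample counts in the training log follow from `validation_split=0.05`. The arithmetic below is a sketch that assumes Keras holds out the last fraction of the samples (before shuffling) and rounds the split point down; under that assumption it reproduces the numbers printed above.

```python
n_samples = 191282       # training sentences, i.e. unique Sent_ID values
validation_split = 0.05

# Assumed split rule: everything before split_at is used for training
split_at = int(n_samples * (1.0 - validation_split))
n_train, n_val = split_at, n_samples - split_at
print(n_train, n_val)  # 181717 9565
```

Because the held-out part is taken from the end of the arrays, the validation sentences here come from the last documents in `train.csv` rather than from a random sample.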


In [19]:
model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 180)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 180, 180)          46296540  
_________________________________________________________________
dropout_1 (Dropout)          (None, 180, 180)          0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 180, 300)          397200    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 180, 3)            903       
Total params: 46,694,643
Trainable params: 46,694,643
Non-trainable params: 0
_________________________________________________________________
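
The parameter counts in the summary can be reproduced by hand, which also shows that the embedding table dominates the model size. The sketch below uses the standard parameter formulas for Embedding, LSTM and Dense layers; the variable names are chosen here for illustration.

```python
n_words, emb_dim, lstm_units, n_tags = 257203, 180, 150, 3

# One row of emb_dim weights per vocabulary entry
embedding = n_words * emb_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
lstm_one_direction = 4 * (lstm_units * (emb_dim + lstm_units) + lstm_units)
bilstm = 2 * lstm_one_direction

# Dense layer applied to the concatenated forward and backward outputs (2 * 150 = 300)
dense = (2 * lstm_units) * n_tags + n_tags

print(embedding, bilstm, dense, embedding + bilstm + dense)
# 46296540 397200 903 46694643
```

Roughly 99% of the weights sit in the embedding layer, so shrinking the vocabulary or the embedding size is the most direct way to reduce the model.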


## Prediction on Test Set

In [20]:
# Predicting on trained model
pred = model.predict(X_test)
print("Predicted Probabilities on Test Set:\n",pred.shape)
# taking tag class with maximum probability
pred_index = np.argmax(pred, axis=-1)
print("Predicted tag indices: \n",pred_index.shape)

Predicted Probabilities on Test Set:
 (125840, 180, 3)
Predicted tag indices: 
 (125840, 180)


In [21]:
# Flatten both the features and predicted tags for submission
ids,tagids = X_test.flatten().tolist(), pred_index.flatten().tolist()
 
# Converting each word index back to its word
words_test = [words[ind].decode('utf-8') for ind in ids]
# Converting each predicted tag index back to its tag
tags_test = [tags[ind] for ind in tagids]
print("Length of words in Padded test set:",len(words_test))
print("Length of tags in Padded test set:",len(tags_test))
print("\nCheck few of words and predicted tags:\n",words_test[:10],tags_test[:10])

Length of words in Padded test set: 22651200
Length of tags in Padded test set: 22651200

Check few of words and predicted tags:
 ['CCCVA', ',', 'MANOVA', ',', 'my', 'black', 'hen', '.', 'ENDPAD', 'ENDPAD'] ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [22]:
type(words_test)

list

### Compose the two lists into a pandas DataFrame

In [23]:
d={'words':words_test,'tags':tags_test}

In [25]:
df = pd.DataFrame(d)

In [26]:
df.head(10)

Unnamed: 0,words,tags
0,CCCVA,O
1,",",O
2,MANOVA,O
3,",",O
4,my,O
5,black,O
6,hen,O
7,.,O
8,ENDPAD,O
9,ENDPAD,O
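
The flattened lists contain 125840 × 180 = 22,651,200 entries, far more than the 2,994,463 tokens in the test set, because every padding position is included (the `ENDPAD` rows above). A possible way to drop them is sketched below; `strip_padding` and the toy arrays are hypothetical and not part of the notebook.

```python
import numpy as np

def strip_padding(pred_index, sentence_lengths, maxlen):
    """Keep only the predictions that correspond to real tokens in each padded row."""
    flat = []
    for row, length in zip(pred_index, sentence_lengths):
        flat.extend(row[:min(length, maxlen)].tolist())
    return flat

# Toy stand-ins for pred_index and the test sentence lengths (hypothetical values)
toy_pred = np.array([[2, 2, 0, 1, 2],
                     [0, 1, 1, 2, 2]])
toy_lengths = [3, 4]
print(strip_padding(toy_pred, toy_lengths, maxlen=5))  # [2, 2, 0, 0, 1, 1, 2]
```

In the notebook's terms the call would look like `strip_padding(pred_index, [len(s) for s in test_sentences], MAXLEN)`. Sentences longer than `MAXLEN` still need separate handling, since their truncated tokens never receive a prediction.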


### Getting a flavor of some of the words labeled by the model 

In [27]:
df.loc[df['tags'] != 'O']

Unnamed: 0,words,tags
1991,Pasteurella,B-indications
1992,multocida,I-indications
4680,Breast,B-indications
4681,cancer,I-indications
4872,breast,B-indications
...,...,...
22646703,pigmentosa,I-indications
22647968,disorders,I-indications
22648506,excitable,B-indications
22648687,excitable,B-indications
