# NER with Dictionary of Diseases

In [1]:
import os
import re
import numpy as np

# Make the modules in the local lib folder importable
import sys
module_path = os.path.abspath(os.path.join('../code'))
if module_path not in sys.path:
    sys.path.append(module_path)
from lib.DataProcess import DataProcess
from lib.Jaccard import Jaccard

import spacy
from spacy.symbols import ORTH, LEMMA, POS

Using TensorFlow backend.


## Read the dataset

In [2]:
DATA_PATH = '../data/'

train_file = 'NCBI_corpus_training.txt'
test_file = 'NCBI_corpus_testing.txt'
diseases_file = 'diseases.txt'

with open(DATA_PATH + train_file, 'r') as fp:
    train_dataset = fp.readlines()

with open(DATA_PATH + test_file, 'r') as fp:
    test_dataset = fp.readlines()

dataset = train_dataset + test_dataset

with open(DATA_PATH + diseases_file, 'r') as fp:
    diseases = fp.readlines()

print('Dataset: %d' % len(dataset))
print('Diseases dictionary: %d' % len(diseases))

Dataset: 693
Diseases dictionary: 316


In [3]:
jaccard = Jaccard()
data_process = DataProcess(jaccard)

## Prepare data

I keep the preprocessing minimal, because I do not want to remove any essential information from the texts.

Every text in the dataset begins with a number, so I remove that number. I also replace the `category` tags with `<entity>` tags, so that I can add them to the spaCy vocabulary.

In [4]:
dataset = data_process.apply_initial_cleaner(dataset)

print(dataset[0]) # Sample

Identification of APC2, a homologue of the <entity>adenomatous polyposis coli tumour</entity> suppressor .	The <entity>adenomatous polyposis coli ( APC ) tumour</entity>-suppressor protein controls the Wnt signalling pathway by forming a complex with glycogen synthase kinase 3beta ( GSK-3beta ) , axin / conductin and betacatenin . Complex formation induces the rapid degradation of betacatenin . In <entity>colon carcinoma</entity> cells , loss of APC leads to the accumulation of betacatenin in the nucleus , where it binds to and activates the Tcf-4 transcription factor ( reviewed in [ 1 ] [ 2 ] ) . Here , we report the identification and genomic structure of APC homologues . Mammalian APC2 , which closely resembles APC in overall domain structure , was functionally analyzed and shown to contain two SAMP domains , both of which are required for binding to conductin . Like APC , APC2 regulates the formation of active betacatenin-Tcf complexes , as demonstrated using transient transcriptio
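
The cleaning step can be pictured with the sketch below. It is not the actual `apply_initial_cleaner` from `lib.DataProcess`, whose implementation is not shown in this notebook. The function name `clean_text`, the regular expressions, the assumed raw tag format `<category="...">...</category>`, and the sample line are all illustrative assumptions.

```python
import re

# Assumed raw format: "<number>\t<text>", with tags such as
# <category="...">...</category>. Both are assumptions for illustration.
LEADING_ID = re.compile(r'^\s*\d+\s+')
OPEN_TAG = re.compile(r'<category="[^"]*">')
CLOSE_TAG = re.compile(r'</category>')


def clean_text(line):
    """Drop the leading number and rename category tags to entity tags."""
    line = LEADING_ID.sub('', line, count=1)
    line = OPEN_TAG.sub('<entity>', line)
    line = CLOSE_TAG.sub('</entity>', line)
    return line.strip()


# Made-up input line, not taken from the corpus.
raw = '12345\tA homologue of the <category="Modifier">APC tumour</category> suppressor .\n'
cleaned = clean_text(raw)
print(cleaned)
```

The real cleaner may do more than this, for example inserting spaces around the tags so that the tokenizer can split them off.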

I split the texts using the **spaCy tokenizer**. Note that I split the texts into sentences as well as words, so each sentence is treated as an independent input to the model.

In [5]:
tok_dataset = data_process.tokenize_texts(dataset)

print(tok_dataset[0]) # Sample

[('identification', 'NOUN', Identification), ('of', 'ADP', of), ('apc2', 'PROPN', APC2), (',', 'PUNCT', ,), ('a', 'DET', a), ('homologue', 'NOUN', homologue), ('of', 'ADP', of), ('the', 'DET', the), ('<entity>', 'X', <entity>), ('adenomatous', 'ADJ', adenomatous), ('polyposis', 'NOUN', polyposis), ('coli', 'NOUN', coli), ('tumour', 'NOUN', tumour), ('</entity>', 'X', </entity>), ('suppressor', 'NOUN', suppressor), ('.', 'PUNCT', .)]


I transform each sequence of words into a sequence of 0s and 1s, where a 1 marks a token that belongs to an **entity** and a 0 marks a **common word**.

In [6]:
indicators = data_process.get_indicator_sequences(tok_dataset)

print(indicators[0]) # Sample

[0 0 0 0 0 0 0 0 1 1 1 1 0 0]
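
A minimal sketch of how such an indicator sequence can be derived from tagged tokens follows. It works on plain strings rather than on the `(lowercase, POS, token)` tuples used in this notebook, and the function name `indicator_sequence` and the tag values are assumptions; the actual `get_indicator_sequences` may differ.

```python
# Assumed values of DataProcess.ENTITY_START and DataProcess.ENTITY_END.
ENTITY_START = '<entity>'
ENTITY_END = '</entity>'


def indicator_sequence(tokens):
    """Return one 0/1 flag per real token; the tag tokens produce no flag."""
    flags = []
    inside = False
    for token in tokens:
        if token == ENTITY_START:
            inside = True
        elif token == ENTITY_END:
            inside = False
        else:
            flags.append(1 if inside else 0)
    return flags


tokens = ['identification', 'of', 'apc2', ',', 'a', 'homologue', 'of', 'the',
          '<entity>', 'adenomatous', 'polyposis', 'coli', 'tumour', '</entity>',
          'suppressor', '.']
flags = indicator_sequence(tokens)
print(flags)
```

The 16 tokens of the sample sentence yield 14 flags, because the two tag tokens are skipped. This is why the tags also have to be removed from the tokenized texts before matching.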


In [7]:
# It is necessary to tokenize the dictionary of diseases too
tok_diseases = data_process.tokenize_texts(diseases)

In [8]:
def remove_entity_tags(tok_dataset):
    return [[token for token in text if token[2].text not in [DataProcess.ENTITY_START, DataProcess.ENTITY_END]] for text in tok_dataset]

In [9]:
tok_dataset = remove_entity_tags(tok_dataset)

## Use dictionary

We are going to use the dictionary to find diseases. Note that I find the entities by using the Jaccard index, so first I need to tune the `min_score` value. This value acts as a threshold that decides whether a span of tokens is marked as an entity or not.
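
To make the role of the threshold concrete, here is a small sketch of a word-level Jaccard similarity and of the averaged score that gets compared with `min_score`. The implementation of `jaccard.word_jaccard` is not shown in this notebook, so comparing character sets is only an assumption, and `char_jaccard` and `span_score` are hypothetical names.

```python
def char_jaccard(a, b):
    """Jaccard index of the character sets of two strings (case-insensitive)."""
    set_a, set_b = set(a.lower()), set(b.lower())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def span_score(entity_tokens, text_tokens):
    """Average per-token similarity between an entity and an equally long span."""
    pairs = zip(entity_tokens, text_tokens)
    return sum(char_jaccard(e, t) for e, t in pairs) / len(entity_tokens)


exact = span_score(['colon', 'carcinoma'], ['colon', 'carcinoma'])
plural = char_jaccard('tumour', 'tumours')   # 5 shared characters out of 6
unrelated = char_jaccard('colon', 'kinase')  # 1 shared character out of 9
print(exact, plural, unrelated)
```

With a threshold of 0.7, the exact match and the plural form would count as hits, while the unrelated pair would not. A lower threshold tolerates more spelling variation at the cost of more false positives.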

I evaluate the model by computing the Jaccard index between the gold and the predicted pseudo-binary indicator sequences. Details: https://en.wikipedia.org/wiki/Jaccard_index
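
For two 0/1 sequences, the Jaccard index is the number of positions where both are 1 divided by the number of positions where at least one is 1. The sketch below shows this for a single pair of sequences. How `jaccard.bin_jaccard` aggregates over all texts is not shown in this notebook, so `binary_jaccard` is a hypothetical stand-in and the predicted sequence is made up.

```python
def binary_jaccard(y_true, y_pred):
    """Jaccard index of the positions marked 1 in two equally long 0/1 sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError('sequences must have the same length')
    both = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    either = sum(1 for t, p in zip(y_true, y_pred) if t == 1 or p == 1)
    # Two all-zero sequences agree perfectly, so treat 0/0 as 1.0.
    return both / either if either else 1.0


gold = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # entity shifted by one token
score = binary_jaccard(gold, pred)
print('%.2f' % score)
```

A prediction that is shifted by one token shares three of the five marked positions, so it scores 0.60 rather than 0. The metric therefore rewards partial overlaps.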

In [14]:
def find_entities(tok_dataset, tok_entities, min_score=0.5, debug=False):
    entities_per_text = []
    indicators_per_text = []
    
    for log_i, text in enumerate(tok_dataset):
        text_len = len(text)
        t_entities = []
        t_indicator = []
        
        # Print progress only when debugging is requested
        if debug and log_i % 100 == 0:
            print('- Text %d of %d' % (log_i, len(tok_dataset)))
        
        i = 0
        while i < text_len:
            entity_found = False
            
            for entity in tok_entities:
                entity_len = len(entity)
                score = 0
                
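                if entity_len == 0 or entity_len + i > text_len:
                    # Skip empty dictionary entries (they would divide by zero below)
                    # and entities that cannot fit in the remaining tokenized words
                    continue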

                k = 0
                while k < entity_len:
                    score = score + jaccard.word_jaccard(entity[k][2].text, text[i + k][2].text)
                    k = k + 1
                
                score = score / entity_len
                if score >= min_score:
                    entity_found = True
                    t_entities.append(text[i:i+k])
                    t_indicator = t_indicator + [1]*k
                    i = i + k
            
            if not entity_found:
                t_indicator.append(0)
                i = i + 1
        
        entities_per_text.append(np.array(t_entities))
        indicators_per_text.append(np.array(t_indicator))
    
    return np.array(entities_per_text), np.array(indicators_per_text)
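
The matching loop above can be tried on toy data with the simplified sketch below. It is a variant, not a copy: it works on plain strings, it stops at the first dictionary entry that clears the threshold, and it uses a character-set Jaccard as a stand-in for `jaccard.word_jaccard`. The names `token_similarity` and `match_dictionary` and the sample data are assumptions.

```python
def token_similarity(a, b):
    """Stand-in similarity: character-set Jaccard (an assumption)."""
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


def match_dictionary(tokens, entities, min_score=0.7):
    """Greedy left-to-right dictionary matching; returns spans and 0/1 flags."""
    spans, flags = [], []
    i = 0
    while i < len(tokens):
        matched = 0
        for entity in entities:
            n = len(entity)
            if n == 0 or i + n > len(tokens):
                continue  # empty entry, or it does not fit in the remaining text
            window = tokens[i:i + n]
            score = sum(token_similarity(e, t) for e, t in zip(entity, window)) / n
            if score >= min_score:
                matched = n
                break  # take the first dictionary entry that clears the threshold
        if matched:
            spans.append(tokens[i:i + matched])
            flags.extend([1] * matched)
            i += matched
        else:
            flags.append(0)
            i += 1
    return spans, flags


text = ['loss', 'of', 'apc', 'in', 'colon', 'carcinoma', 'cells']
dictionary = [['colon', 'carcinoma'], ['ataxia']]
spans, flags = match_dictionary(text, dictionary, min_score=0.7)
print(spans, flags)
```

Every token is compared against every dictionary entry, so the cost grows with the product of text length, dictionary size, and entity length.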

In [None]:
# TO DO: Try to find the optimal threshold value
# min_score_values = np.arange(0.5, 0.8, 0.1)
min_score_values = [0.7]

best_min_score = 0
best_jaccard_score = 0

for min_score in min_score_values:
    print('Threshold: %f' % min_score)
    
    _, pred_indicators = find_entities(tok_dataset, tok_diseases, min_score, debug=True)
    jaccard_score = jaccard.bin_jaccard(np.array(indicators), pred_indicators)
    
    print('- Jaccard: %.4f' % jaccard_score)
    
    if jaccard_score > best_jaccard_score:
        best_min_score = min_score
        best_jaccard_score = jaccard_score

In [None]:
print('Jaccard score: %.4f' % best_jaccard_score)
print('Threshold: %.2f' % best_min_score)
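
For the threshold search marked as a to-do above, one possible sketch is shown below. It builds the candidate grid from integer steps, which avoids the floating-point drift of repeatedly adding 0.1, and keeps the best-scoring value. The helper names, the grid bounds, and in particular `toy_score` are hypothetical; in the notebook the score would come from `find_entities` followed by `jaccard.bin_jaccard`.

```python
def threshold_grid(start, stop, step):
    """Inclusive grid built from integer steps to avoid float accumulation."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 10) for i in range(count)]


def pick_best(thresholds, score_fn):
    """Return (threshold, score) with the highest score; the first wins ties."""
    best = max(thresholds, key=score_fn)
    return best, score_fn(best)


def toy_score(threshold):
    # Hypothetical stand-in for the corpus-level Jaccard score, not a real result.
    return 1.0 - abs(threshold - 0.7)


grid = threshold_grid(0.5, 0.8, 0.1)
best_threshold, best_score = pick_best(grid, toy_score)
print('Threshold: %.2f' % best_threshold)
```

Each candidate requires a full pass over the dataset, so a coarse grid followed by a finer one around the best value keeps the search affordable.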