# Identifying Entities in Healthcare Data

### There are eight major tasks that we need to perform. They are as follows:

- Data preprocessing
- Concept identification
- Defining the features for CRF
- Getting the features and labels of sentences
- Defining input and target variables
- Building the model
- Evaluating the model
- Identifying the diseases and predicted treatment using a custom NER

## Data Preprocessing
The dataset is stored as one token per line rather than as sentences, so we need to reconstruct the sentences from the words. A blank line marks the end of each sentence (or, in the label files, the end of each sequence of labels), and we need logic that groups the lines between blank lines into sentences or label sequences.

### We need to do the following three tasks after processing and modifying the datasets:

- Construct proper sentences from individual words and print five sentences along with their labels.
- Print the correct count of the number of sentences in the processed train and test dataset.
- Correctly count the number of lines of labels in the processed train and test dataset.
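
The grouping logic described above can be tried on a small in-memory sample before touching the real files. This is only a sketch: `group_tokens` and the sample lines are hypothetical, and it uses `itertools.groupby` instead of the file-based function from the notebook. Unlike a line-by-line loop, it silently skips runs of consecutive blank lines.

```python
from itertools import groupby

def group_tokens(lines):
    """Join consecutive non-blank lines into one space-separated sequence."""
    stripped = (line.strip() for line in lines)
    return [
        " ".join(group)
        for is_blank, group in groupby(stripped, key=lambda s: s == "")
        if not is_blank
    ]

# Hypothetical sample: one token (or label) per line, blank line between sequences
sample_tokens = ["All\n", "live\n", "births\n", "\n", "The\n", "rate\n", "\n"]
sample_labels = ["O\n", "O\n", "O\n", "\n", "O\n", "O\n", "\n"]

sentences = group_tokens(sample_tokens)
labels = group_tokens(sample_labels)
print(sentences)
print(labels)
```

Because sentences and labels are grouped by the same rule, the two lists should always have the same length when the input files are consistent.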

In [5]:
# Import the required libraries

import spacy
import pandas as pd
import numpy as np
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from collections import Counter

In [3]:
# Load the en_core_web_sm
nlp_model = spacy.load("en_core_web_sm")

In [11]:
# Check that the required files are readable.
# process_datatxt_file (defined below) opens each file by name, so no handles are left open here.
for name in ('train_sent', 'train_label', 'test_sent', 'test_label'):
  with open(name, 'r'):
    pass

In [12]:
# Write a function that joins tokens into sentences and values into label sequences

def process_datatxt_file(filename):
  # Read all lines; the with-block closes the file even if reading fails
  with open(filename, 'r') as input_file:
    file_content = input_file.readlines()

  # To store list of sequences (sentences or labels)
  out_lines = []

  line_content = ""

  for word in file_content:
    word = word.strip()
    # If empty line, add the current sequence to out_lines
    if word == "":
      out_lines.append(line_content)
      line_content = "" # re-initialize, new line starts
    else:
      if line_content: # if non-empty, add new word after space, part of current sentence
        line_content += " " + word
      else:
        line_content = word # first word, no space required

  # Keep the last sequence if the file does not end with a blank line
  if line_content:
    out_lines.append(line_content)

  return out_lines

In [14]:
# Prepare train sentences and labels 
train_sentences = process_datatxt_file('train_sent')
train_labels = process_datatxt_file('train_label')

# Prepare test sentences and labels
test_sentences = process_datatxt_file('test_sent')
test_labels = process_datatxt_file('test_label')

In [15]:
# Print the 5 sentences from the processed dataset
for i in range(5):
  print("Sentence:", train_sentences[i])
  print("Labels:", train_labels[i], "\n\n")

Sentence: All live births > or = 23 weeks at the University of Vermont in 1995 ( n = 2395 ) were retrospectively analyzed for delivery route , indication for cesarean , gestational age , parity , and practice group ( to reflect risk status )
Labels: O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 


Sentence: The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 )
Labels: O O O O O O O O O O O O O O O O O O O O O O O O O 


Sentence: Abnormal presentation was the most common indication ( 25.6 % , 88 of 344 )
Labels: O O O O O O O O O O O O O O O 


Sentence: The `` corrected '' cesarean rate ( maternal-fetal medicine and transported patients excluded ) was 12.4 % ( 273 of 2194 ) , and the `` corrected '' primary rate was 9.6 % ( 190 of 1975 )
Labels: O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 


Sentence: Arrest of dilation was the most common indication in both `` co

In [16]:
# Count the number of sentences in the processed train and test dataset
print("No. of lines in train_sentences:", len(train_sentences))
print("No. of lines in test_sentences:", len(test_sentences))

No. of lines in train_sentences: 2599
No. of lines in test_sentences: 1056


In [17]:
# Count the number of lines of labels in the processed train and test dataset. 
print("No. of lines in train_labels:", len(train_labels))
print("No. of lines in test_labels:", len(test_labels))

No. of lines in train_labels: 2599
No. of lines in test_labels: 1056


## Concept Identification
After preprocessing, we first explore the concepts present in the dataset using PoS tagging. We identify every word in the corpus tagged NOUN or PROPN, build a dictionary of their counts, and then output the 25 most frequent concepts across the train and test sets.
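
The counting step can be illustrated without loading a spaCy model. The sketch below assumes the tokens have already been tagged; the `(token, tag)` pairs are hand-written for illustration, and lowercasing the token is an assumption made here so that "Patients" and "patients" count as one concept (the notebook cell counts `token.text` as-is).

```python
from collections import Counter

# Hypothetical (token, PoS tag) pairs; in the notebook these come from spaCy.
tagged_tokens = [
    ("patients", "NOUN"), ("received", "VERB"), ("chemotherapy", "NOUN"),
    ("for", "ADP"), ("lung", "NOUN"), ("cancer", "NOUN"),
    ("Patients", "NOUN"), ("with", "ADP"), ("cancer", "NOUN"),
]

noun_tags = {"NOUN", "PROPN"}

# Count only tokens whose tag marks them as a noun or proper noun
concept_counts = Counter(
    token.lower() for token, tag in tagged_tokens if tag in noun_tags
)
top_concepts = concept_counts.most_common(2)
print(top_concepts)
```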

In [18]:
# Extract those tokens which have NOUN or PROPN as their PoS tag and find their frequency
detectCustomPOSTags = {}
for sentences in (train_sentences, test_sentences):
  for sentence in sentences:
    processed_sentence = nlp_model(sentence) # Process each sentence with the spaCy model
    for token in processed_sentence:
      if token.pos_ in ('NOUN', 'PROPN'): # Check if the token is a noun or proper noun
        detectCustomPOSTags[token.text] = detectCustomPOSTags.get(token.text, 0) + 1 # increase its frequency
        

In [19]:
# Print the top 25 most common tokens with NOUN or PROPN PoS tags
cnpt_counter = Counter(detectCustomPOSTags)
cnpt_counter.most_common(25)

[('patients', 492),
 ('treatment', 281),
 ('%', 247),
 ('cancer', 200),
 ('therapy', 175),
 ('study', 154),
 ('disease', 142),
 ('cell', 140),
 ('lung', 116),
 ('group', 94),
 ('gene', 88),
 ('chemotherapy', 88),
 ('effects', 85),
 ('results', 78),
 ('women', 77),
 ('use', 75),
 ('risk', 71),
 ('cases', 71),
 ('surgery', 71),
 ('analysis', 70),
 ('rate', 67),
 ('response', 66),
 ('survival', 65),
 ('children', 64),
 ('effect', 63)]

## Defining features for CRF
- Define the features with the PoS tag as one of the features.
- When defining the features that use PoS tags, also consider the word preceding the current word. Features from the preceding word give the CRF model context about the neighboring token, which usually improves tagging accuracy.
- Mark the beginning and the end words of a sentence correctly in the form of features.

In [20]:
# Let's define the features to get the feature value for one word.
def getFeaturesForOneWord(sentence, pos, pos_tags):
  word = sentence[pos]

  # Define 9 features for the current word, with the PoS tag as one of them
  features = [
    'word.lower=' + word.lower(), # serves as word id
    'word[-3:]=' + word[-3:],     # last three characters
    'word[0:]=' + word[0:],       # the whole word with its original case (word[0:] is the full string)
    'word[-1:]=' + word[-1:],     # last character
    'word[-2:]=' + word[-2:],     # last two characters
    'word.isupper=%s' % word.isupper(),  # is the word in all uppercase
    'word.isdigit=%s' % word.isdigit(),  # is the word a number
    'word.startsWithCapital=%s' % word[0].isupper(), # is the word starting with a capital letter
    'word.pos=' + pos_tags[pos]   # PoS tag of the word
  ]

  if pos > 0:
    # Use the previous word also while defining features (5 more features)
    prev_word = sentence[pos-1]
    features.extend([
      'prev_word.lower=' + prev_word.lower(),
      'prev_word.isupper=%s' % prev_word.isupper(),
      'prev_word.isdigit=%s' % prev_word.isdigit(),
      'prev_word.startsWithCapital=%s' % prev_word[0].isupper(),
      'prev_word.pos=' + pos_tags[pos-1]
    ])
  else:
    # Mark the beginning of a sentence in the form of a feature
    features.append('BEG')

  if pos == len(sentence)-1:
    features.append('END') # feature to track end of sentence
  return features
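
The same idea can also be written with feature dictionaries, which is the input format shown in the sklearn_crfsuite documentation (a list of feature dicts per sentence). The sketch below is an illustration, not the notebook's method: `word_to_feature_dict` is a hypothetical helper, it covers only a subset of the features above, and the PoS tags are hand-written rather than produced by spaCy.

```python
def word_to_feature_dict(words, pos_tags, i):
    """Build a feature dict for words[i]; names mirror the string features above."""
    word = words[i]
    features = {
        "word.lower": word.lower(),
        "word[-3:]": word[-3:],
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "word.pos": pos_tags[i],
    }
    if i > 0:
        # Context from the preceding word
        features["prev_word.lower"] = words[i - 1].lower()
        features["prev_word.pos"] = pos_tags[i - 1]
    else:
        features["BEG"] = True  # first word of the sentence
    if i == len(words) - 1:
        features["END"] = True  # last word of the sentence
    return features

words = ["Abnormal", "presentation", "was", "common"]
pos_tags = ["ADJ", "NOUN", "AUX", "ADJ"]  # hand-written tags for illustration
sentence_features = [
    word_to_feature_dict(words, pos_tags, i) for i in range(len(words))
]
print(sentence_features[0])
```

With dicts, boolean and string values stay separate from the feature name, which avoids formatting every feature as a `name=value` string.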

## Getting the features and the labels of sentences:
- Write the code to get the feature values of a sentence, using the features defined in the previous step.
- Write the code to get a list of labels of a given preprocessed label line that you have created earlier.

In [24]:
# Define a function to get features for a sentence using the 'getFeaturesForOneWord' function.
def getFeaturesForOneSentence(sentence):
  pos_processed_sentence = nlp_model(sentence) # spaCy is applied to the sentence
  # Collect one PoS tag per spaCy token.
  # Caveat: spaCy's tokenizer can split a whitespace-delimited word into several tokens,
  # so pos_tags may be longer than sentence_list and the indices can drift out of alignment.
  pos_tags = [token.pos_ for token in pos_processed_sentence]

  sentence_list = sentence.split() # List of words in sentence

  # Build the feature list for every word position
  return [getFeaturesForOneWord(sentence_list, pos, pos_tags) for pos in range(len(sentence_list))]

In [25]:
# Write a code to get the labels for a sentence.
# Define a function to get the labels for a sentence.
def getLabelsInListForOneSentence(labels):
  return labels.split()

## Defining input and target variables
- Extract the feature values for each sentence as the input variable for the CRF model, for both the train and the test dataset.
- Extract the labels as the target variable for the test and the train dataset.

In [26]:
# Define the features values for each sentence as input variable  for CRF model in test and the train dataset 
X_train = [getFeaturesForOneSentence(sentence) for sentence in train_sentences]
X_test = [getFeaturesForOneSentence(sentence) for sentence in test_sentences]

In [27]:
print(X_train[0])

[['word.lower=all', 'word[-3:]=All', 'word[0:]=All', 'word[-1:]=l', 'word[-2:]=ll', 'word.isupper=False', 'word.isdigit=False', 'word.startsWithCapital=True', 'word.pos=DET', 'BEG'], ['word.lower=live', 'word[-3:]=ive', 'word[0:]=live', 'word[-1:]=e', 'word[-2:]=ve', 'word.isupper=False', 'word.isdigit=False', 'word.startsWithCapital=False', 'word.pos=ADJ', 'prev_word.lower=all', 'prev_word.isupper=False', 'prev_word.isdigit=False', 'prev_word.startsWithCapital=True', 'prev_word.pos=DET'], ['word.lower=births', 'word[-3:]=ths', 'word[0:]=births', 'word[-1:]=s', 'word[-2:]=hs', 'word.isupper=False', 'word.isdigit=False', 'word.startsWithCapital=False', 'word.pos=NOUN', 'prev_word.lower=live', 'prev_word.isupper=False', 'prev_word.isdigit=False', 'prev_word.startsWithCapital=False', 'prev_word.pos=ADJ'], ['word.lower=>', 'word[-3:]=>', 'word[0:]=>', 'word[-1:]=>', 'word[-2:]=>', 'word.isupper=False', 'word.isdigit=False', 'word.startsWithCapital=False', 'word.pos=PUNCT', 'prev_word.lower

In [28]:
print(X_test[0])

[['word.lower=furthermore', 'word[-3:]=ore', 'word[0:]=Furthermore', 'word[-1:]=e', 'word[-2:]=re', 'word.isupper=False', 'word.isdigit=False', 'word.startsWithCapital=True', 'word.pos=ADV', 'BEG'], ['word.lower=,', 'word[-3:]=,', 'word[0:]=,', 'word[-1:]=,', 'word[-2:]=,', 'word.isupper=False', 'word.isdigit=False', 'word.startsWithCapital=False', 'word.pos=PUNCT', 'prev_word.lower=furthermore', 'prev_word.isupper=False', 'prev_word.isdigit=False', 'prev_word.startsWithCapital=True', 'prev_word.pos=ADV'], ['word.lower=when', 'word[-3:]=hen', 'word[0:]=when', 'word[-1:]=n', 'word[-2:]=en', 'word.isupper=False', 'word.isdigit=False', 'word.startsWithCapital=False', 'word.pos=ADV', 'prev_word.lower=,', 'prev_word.isupper=False', 'prev_word.isdigit=False', 'prev_word.startsWithCapital=False', 'prev_word.pos=PUNCT'], ['word.lower=all', 'word[-3:]=all', 'word[0:]=all', 'word[-1:]=l', 'word[-2:]=ll', 'word.isupper=False', 'word.isdigit=False', 'word.startsWithCapital=False', 'word.pos=DET', 

In [29]:
# Define the labels as the target variable for test and the train dataset
Y_train = [getLabelsInListForOneSentence(labels) for labels in train_labels]
Y_test = [getLabelsInListForOneSentence(labels) for labels in test_labels]
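
Before training, it can help to confirm that every feature sequence has exactly one label per token, since a CRF needs aligned inputs and targets. The check below is a sketch on toy data; `find_misaligned` and the sample lists are hypothetical and stand in for `X_train` / `Y_train`.

```python
def find_misaligned(X, Y):
    """Return indices of sentences whose feature and label sequences differ in length."""
    if len(X) != len(Y):
        raise ValueError("X and Y contain a different number of sentences")
    return [i for i, (x, y) in enumerate(zip(X, Y)) if len(x) != len(y)]

# Toy data: the second sentence has one feature list but two labels
X_sample = [[["word.lower=all"], ["word.lower=live"]], [["word.lower=the"]]]
Y_sample = [["O", "O"], ["O", "O"]]

bad_indices = find_misaligned(X_sample, Y_sample)
print(bad_indices)
```

An empty result means each token has exactly one label.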

In [30]:
print(Y_train[0])

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [31]:
print(Y_test[0])

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


## Building the model
You need to build the CRF model for a custom NER application using the features and the target variables.

In [32]:
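# Build the model
crf_model = sklearn_crfsuite.CRF(max_iterations=100)
try:
    crf_model.fit(X_train, Y_train)
except AttributeError:
    # Workaround: with some sklearn_crfsuite / scikit-learn version combinations,
    # fit() can raise AttributeError even though training has completed.
    # Other exception types are not caught and will still surface.
    pass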

## Evaluation
- Predict the labels of each of the tokens in each sentence of the test dataset that has been preprocessed earlier.
- Calculate the f1 score using the actual and the predicted labels of the test dataset.

In [34]:
# Predict the labels of each of the tokens in each sentence of the test dataset that has been preprocessed earlier.
Y_pred = crf_model.predict(X_test)
pred_label=[]
for i in Y_pred:
    pred_label.extend(i)

In [35]:
# Calculate the f1 score using the actual labels and the predicted labels of the test dataset.
metrics.flat_f1_score(Y_test, Y_pred, average='weighted')

0.9104780783739593

- The weighted F1 score on the test dataset, computed from the actual and the predicted labels, is about 0.91.
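
To make the metric concrete, the sketch below computes a support-weighted F1 score by hand on toy label sequences. It is intended to mirror what `average='weighted'` does (per-label F1, weighted by the number of true tokens with that label), but it is an illustration with hypothetical helper names, not the library's implementation. Because most tokens in this dataset are labeled `O`, a weighted score is dominated by that class, so inspecting the per-label scores for `D` and `T` is worthwhile.

```python
from collections import Counter
from itertools import chain

def per_label_f1(y_true, y_pred):
    """Return ({label: F1}, {label: support}) computed over flattened sequences."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[t] += 1  # missed the true label t
    support = Counter(true_flat)
    scores = {}
    for label in support:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores, support

def weighted_f1(y_true, y_pred):
    """Average the per-label F1 scores, weighted by each label's support."""
    scores, support = per_label_f1(y_true, y_pred)
    total = sum(support.values())
    return sum(scores[label] * support[label] for label in scores) / total

# Toy sequences for illustration only
y_true_sample = [["O", "D", "D", "O"], ["T", "O"]]
y_pred_sample = [["O", "D", "O", "O"], ["T", "T"]]

label_scores, label_support = per_label_f1(y_true_sample, y_pred_sample)
overall = weighted_f1(y_true_sample, y_pred_sample)
print(label_scores, overall)
```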

## Identifying the diseases and treatment using a custom NER
Write the logic to collect all the predicted treatment (T) labels corresponding to each disease (D) label in the test dataset. The result is a dictionary in which diseases are the keys and lists of treatments are the values.

In [36]:
# Build a dictionary of diseases and treatments, where the disease is the key and the value is a list of treatments
diseases_and_treatments = {}

for i in range(len(Y_pred)): # For each predicted sequence
  labels = Y_pred[i]
  words = test_sentences[i].split() # words of the matching test sentence

  disease_words = []
  treatment_words = []

  for j in range(len(labels)): # for each individual label in the sequence
    if labels[j] == 'D': # Label D indicates disease, so collect the corresponding word
      disease_words.append(words[j])
    elif labels[j] == 'T': # Label T indicates treatment, so collect the corresponding word
      treatment_words.append(words[j])
    # Label O (other) is ignored

  disease = " ".join(disease_words)
  treatment = " ".join(treatment_words)

  # Record the pair only when the sentence contains both a disease and a treatment.
  # setdefault creates the treatment list the first time a disease is seen;
  # for a disease seen previously, the treatment is appended to its existing list.
  if disease and treatment:
    diseases_and_treatments.setdefault(disease, []).append(treatment)
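
The logic above joins every `D` token in a sentence into one disease string (and likewise for `T`), even when the tokens are not adjacent. A possible refinement, sketched here as an assumption rather than as part of the notebook, is to treat each contiguous run of identical labels as its own entity. `extract_spans` and the sample sentence are hypothetical.

```python
from itertools import groupby

def extract_spans(words, labels):
    """Return (label, text) for each contiguous run of non-O labels."""
    spans = []
    position = 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label != "O":
            spans.append((label, " ".join(words[position:position + length])))
        position += length
    return spans

# Hand-labeled toy sentence for illustration only
sample_words = "The patient with lung cancer received chemotherapy and radiation therapy".split()
sample_labels = ["O", "O", "O", "D", "D", "O", "T", "O", "T", "T"]

spans = extract_spans(sample_words, sample_labels)
diseases = [text for label, text in spans if label == "D"]
treatments = [text for label, text in spans if label == "T"]
print(diseases, treatments)
```

Here the two treatments stay separate instead of being merged into a single string, at the cost of having to decide how to pair several diseases with several treatments in one sentence.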

In [37]:
diseases_and_treatments

{'hereditary retinoblastoma': ['radiotherapy'],
 'unstable angina or non-Q-wave myocardial infarction': ['roxithromycin'],
 'coronary-artery disease': ['Antichlamydial antibiotics'],
 'primary pulmonary hypertension': ['fenfluramines'],
 'essential hypertension': ['moxonidine'],
 'foot infection': ['G-CSF treatment'],
 'hemorrhagic stroke': ['double-bolus alteplase'],
 "early Parkinson 's disease": ['Ropinirole monotherapy'],
 'abdominal tuberculosis': ['steroids'],
 'treating stress urinary incontinence': ['surgical procedures'],
 'female stress urinary incontinence': ['surgical treatment'],
 'stress urinary incontinence': ['therapy'],
 'preeclampsia ( proteinuric hypertension': ['intrauterine insemination with donor sperm versus intrauterine insemination'],
 'cancer': ['organ transplantation and chemotherapy',
  'oral drugs chemotherapy',
  'Matrix metalloproteinase inhibitors'],
 'major pulmonary embolism': ['Thrombolytic treatment',
  'Association between thrombolytic treatment'],


## Predict the treatment for the disease named 'hereditary retinoblastoma'

In [38]:
diseases_identified = list(diseases_and_treatments.keys())
index = 0 

print("Disease: ",diseases_identified[index])
print("Treatment:", diseases_and_treatments.get(diseases_identified[index]))

Disease:  hereditary retinoblastoma
Treatment: ['radiotherapy']


- The model's predicted treatment for hereditary retinoblastoma is radiotherapy.
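
Looking a disease up by name is less fragile than relying on its position in the key list. The helper below is a minimal sketch: `lookup_treatments` is a hypothetical function, the case-insensitive matching is an assumption, and the sample mapping reuses two entries from the output shown earlier.

```python
def lookup_treatments(mapping, disease_name):
    """Return the predicted treatments for a disease, ignoring case; [] if unknown."""
    wanted = disease_name.strip().lower()
    for disease, treatments in mapping.items():
        if disease.lower() == wanted:
            return treatments
    return []

# Two entries copied from the predicted dictionary above
sample_mapping = {
    "hereditary retinoblastoma": ["radiotherapy"],
    "essential hypertension": ["moxonidine"],
}

result = lookup_treatments(sample_mapping, "Hereditary Retinoblastoma")
print("Treatment:", result)
```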