## Identifying Entities in Healthcare Data


In [8]:
#Importing and installing required packages
!pip install pycrf
!pip install sklearn-crfsuite

import spacy
import sklearn_crfsuite 
from sklearn_crfsuite import metrics
import pandas as pd

model = spacy.load("en_core_web_sm")

import warnings
warnings.filterwarnings('ignore')




In [11]:
#Initializing the data
with open('C:\\Users\\moham\\Downloads\\inputs\\train_sent', 'r') as train_sent_file:
  train_words = train_sent_file.readlines()

with open('C:\\Users\\moham\\Downloads\\inputs\\train_label', 'r') as train_labels_file:
  train_labels_by_word = train_labels_file.readlines()

with open('C:\\Users\\moham\\Downloads\\inputs\\test_sent', 'r') as test_sent_file:
  test_words = test_sent_file.readlines()

with open('C:\\Users\\moham\\Downloads\\inputs\\test_label', 'r') as test_labels_file:
  test_labels_by_word = test_labels_file.readlines()
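
The absolute Windows paths in the cell above only resolve on the original author's machine. A minimal sketch of a more portable loader follows; the helper name `read_lines`, the `data_dir` argument, and the UTF-8 encoding are assumptions for illustration and are not part of the original notebook.

```python
from pathlib import Path


def read_lines(data_dir, filename, encoding="utf-8"):
    """Return every line of data_dir/filename, newline characters included.

    Hypothetical helper: the encoding is assumed, adjust it to match the files.
    """
    with open(Path(data_dir) / filename, "r", encoding=encoding) as handle:
        return handle.readlines()
```

With a directory variable of your own (for example `DATA_DIR = "inputs"`), the four reads become calls such as `train_words = read_lines(DATA_DIR, "train_sent")`.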

In [12]:
#Checking the inputs
train_words

['All\n',
 'live\n',
 'births\n',
 '>\n',
 'or\n',
 '=\n',
 '23\n',
 'weeks\n',
 'at\n',
 'the\n',
 'University\n',
 'of\n',
 'Vermont\n',
 'in\n',
 '1995\n',
 '(\n',
 'n\n',
 '=\n',
 '2395\n',
 ')\n',
 'were\n',
 'retrospectively\n',
 'analyzed\n',
 'for\n',
 'delivery\n',
 'route\n',
 ',\n',
 'indication\n',
 'for\n',
 'cesarean\n',
 ',\n',
 'gestational\n',
 'age\n',
 ',\n',
 'parity\n',
 ',\n',
 'and\n',
 'practice\n',
 'group\n',
 '(\n',
 'to\n',
 'reflect\n',
 'risk\n',
 'status\n',
 ')\n',
 '\n',
 'The\n',
 'total\n',
 'cesarean\n',
 'rate\n',
 'was\n',
 '14.4\n',
 '%\n',
 '(\n',
 '344\n',
 'of\n',
 '2395\n',
 ')\n',
 ',\n',
 'and\n',
 'the\n',
 'primary\n',
 'rate\n',
 'was\n',
 '11.4\n',
 '%\n',
 '(\n',
 '244\n',
 'of\n',
 '2144\n',
 ')\n',
 '\n',
 'Abnormal\n',
 'presentation\n',
 'was\n',
 'the\n',
 'most\n',
 'common\n',
 'indication\n',
 '(\n',
 '25.6\n',
 '%\n',
 ',\n',
 '88\n',
 'of\n',
 '344\n',
 ')\n',
 '\n',
 'The\n',
 '``\n',
 'corrected\n',
 "''\n",
 'cesarean\n',
 

The dataset is stored with one token per line, and a line containing only a newline marks the end of a sentence.
We need to pre-process the data to recover the complete sentences and their label sequences.

In [13]:
#Checking to see that the number of tokens and number of corresponding labels are same.
print("Number of tokens in training set","\nNo. of words: ",len(train_words),"\nNo. of labels: ",len(train_labels_by_word))
print("\nNumber of tokens in test set","\nNo. of words: ",len(test_words),"\nNo. of labels: ",len(test_labels_by_word))

Number of tokens in training set 
No. of words:  48501 
No. of labels:  48501

Number of tokens in test set 
No. of words:  19674 
No. of labels:  19674


In [14]:
#Combining tokens belonging to the same sentence. Sentences are separated by "\n" in the dataset.
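def convert_to_sentences(dataset):
    """Join one-token-per-line input into space-separated sentences.

    A line containing only a newline marks the end of a sentence.
    """
    sent_list = []
    tokens = []
    for line in dataset:
        if line != '\n':
            tokens.append(line.rstrip('\n'))    # drop the trailing newline
        else:
            sent_list.append(" ".join(tokens))  # blank line: sentence is complete
            tokens = []
    # Tokens after the last blank line are not emitted, as in the original loop.
    return sent_list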

In [15]:
#Converting tokens to sentences and individual labels to sequences of corresponding labels.
train_sentences = convert_to_sentences(train_words)
train_labels = convert_to_sentences(train_labels_by_word)
test_sentences = convert_to_sentences(test_words)
test_labels = convert_to_sentences(test_labels_by_word)
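
Matching totals for tokens and labels do not guarantee that every sentence lines up with its own label sequence. The sketch below is one way to check this per sentence; `find_misaligned` is a hypothetical helper name, and it assumes tokens and labels are separated by single spaces as produced above.

```python
def find_misaligned(sentences, label_sequences):
    """Return the indices where token count and label count differ."""
    if len(sentences) != len(label_sequences):
        raise ValueError("sentence and label lists differ in length")
    return [
        i
        for i, (sentence, labels) in enumerate(zip(sentences, label_sequences))
        if len(sentence.split()) != len(labels.split())
    ]
```

Calling `find_misaligned(train_sentences, train_labels)` returns an empty list when every sentence has exactly one label per token.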

In [16]:
print("First five training sentences and their labels:\n")
for i in range(5):
    print(train_sentences[i],"\n",train_labels[i],"\n")

First five training sentences and their labels:

All live births > or = 23 weeks at the University of Vermont in 1995 ( n = 2395 ) were retrospectively analyzed for delivery route , indication for cesarean , gestational age , parity , and practice group ( to reflect risk status ) 
 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 

The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 ) 
 O O O O O O O O O O O O O O O O O O O O O O O O O 

Abnormal presentation was the most common indication ( 25.6 % , 88 of 344 ) 
 O O O O O O O O O O O O O O O 

The `` corrected '' cesarean rate ( maternal-fetal medicine and transported patients excluded ) was 12.4 % ( 273 of 2194 ) , and the `` corrected '' primary rate was 9.6 % ( 190 of 1975 ) 
 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 

Arrest of dilation was the most common indication in both `` corrected '' subgroups ( 23.4 a

In [17]:
print("First five test sentences and their labels:\n")
for i in range(5):
    print(test_sentences[i],"\n",test_labels[i],"\n")

First five test sentences and their labels:

Furthermore , when all deliveries were analyzed , regardless of risk status but limited to gestational age > or = 36 weeks , the rates did not change ( 12.6 % , 280 of 2214 ; primary 9.2 % , 183 of 1994 ) 
 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 

As the ambient temperature increases , there is an increase in insensible fluid loss and the potential for dehydration 
 O O O O O O O O O O O O O O O O O O O 

The daily high temperature ranged from 71 to 104 degrees F and AFI values ranged from 1.7 to 24.7 cm during the study period 
 O O O O O O O O O O O O O O O O O O O O O O O O 

There was a significant correlation between the 2- , 3- , and 4-day mean temperature and AFI , with the 4-day mean being the most significant ( r = 0.31 , p & # 60 ; 0.001 ) 
 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 

Fluctuations in ambient temperature are inversely correlated to ch

In [18]:
# number of sentences in the processed train and test dataset
print("Number of sentences in the train dataset: {}".format(len(train_sentences)))
print("Number of sentences in the test dataset: {}".format(len(test_sentences)))

Number of sentences in the train dataset: 2599
Number of sentences in the test dataset: 1056


In [19]:
# number of lines of labels in the processed train and test dataset
print("Number of lines of labels in the train dataset: {}".format(len(train_labels)))
print("Number of lines of labels in the test dataset: {}".format(len(test_labels)))

Number of lines of labels in the train dataset: 2599
Number of lines of labels in the test dataset: 1056


#### EDA

First, we will explore the concepts present in the dataset. For this, we will use PoS tagging.
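
One possible way to do this is sketched below, under the assumption that tokens tagged `NOUN` or `PROPN` are a reasonable proxy for concepts. The function name `noun_frequencies` is hypothetical. It accepts any iterable of documents whose tokens expose `.text` and `.pos_`, which is what spaCy `Doc` objects provide.

```python
from collections import Counter


def noun_frequencies(docs, pos_tags=("NOUN", "PROPN")):
    """Count lower-cased tokens whose coarse PoS tag is in pos_tags.

    Assumption: nouns and proper nouns stand in for "concepts" here.
    """
    counts = Counter()
    for doc in docs:
        for token in doc:
            if token.pos_ in pos_tags:
                counts[token.text.lower()] += 1
    return counts
```

With the pipeline loaded earlier as `model`, a call such as `noun_frequencies(model.pipe(train_sentences + test_sentences)).most_common(25)` would list the most frequent candidates.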