# **Problem Statement**

> "BeHealthy" is a health-tech company and has a web platform that allows doctors to list their services and manage patient interactions and provides services for patients such as booking interactions with doctors and ordering medicines online.
Here, doctors can easily organise appointments, track past medical records and provide e-prescriptions.
So, companies like "BeHealthy" are providing medical services, prescriptions and online consultations and generating huge data day by day.
 > The following snippet of medical data may be generated when a doctor is writing notes to his/her patient or as a review of a therapy that he or she has done.
"The patient was a 62-year-old man with squamous cell lung cancer, which was first successfully treated by a combination of radiation therapy and chemotherapy."
 > As we can see in this text, a person with a non-medical background cannot understand the various medical terms.
We have taken a simple sentence from a medical data set to understand the problem and where one can understand the terms 'cancer' and 'chemotherapy'.
 > We have been given such a data set in which a lot of text is written related to the medical domain.
There are a lot of diseases that can be mentioned in the entire dataset and their related treatments are also mentioned implicitly in the text, which we saw in the aforementioned example that the disease mentioned is cancer and its treatment can be identified as chemotherapy using the sentence.
However, note that it is not explicitly mentioned in the dataset about the diseases and their treatment, but we can build an algorithm to map the diseases and their respective treatment.
 > The train dataset is used to train the Continuous Random Field (CRF) model, and the test dataset is used to evaluate the built model.

# **Setting up Google Collab**

In [1]:
# Setting up Google Colab for usage. Please disable if running locally.
import pathlib
import os


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
base_dir=pathlib.Path('/content/drive/MyDrive/assignment')
os.chdir(str(base_dir))

In [5]:
!ls

test_label  test_sent  train_label  train_sent


In [6]:
# Installing and importing relevant libraries
!pip install pycrf
!pip install sklearn-crfsuite

import spacy
import sklearn_crfsuite
from sklearn_crfsuite import metrics
import pandas as pd

model = spacy.load("en_core_web_sm")

Collecting pycrf
  Downloading pycrf-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: pycrf
  Building wheel for pycrf (setup.py) ... [?25l[?25hdone
  Created wheel for pycrf: filename=pycrf-0.0.1-py3-none-any.whl size=1871 sha256=77991ee0d083d09acc929f1359b093c5a78302cd787546a0e929b8f02c14360a
  Stored in directory: /root/.cache/pip/wheels/fd/3a/fb/e4d15c9c2b169f43811b23a863ee9717ff3eda5d2301789043
Successfully built pycrf
Installing collected packages: pycrf
Successfully installed pycrf-0.0.1
Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-crfsuite>=0.9.7 (from sklearn-crfsuite)
  Downloading python_crfsuite-0.9.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl (10 kB)
Downloading python_crfsuite-0.9.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.wh

# **Data Pre-Processing**

In [7]:
# Reading the train and test sentences and labels
with open('train_sent', 'r') as train_sent_file:
  train_words = train_sent_file.readlines()

with open('train_label','r') as train_labels_file:
  train_labels_by_word = train_labels_file.readlines()

with open('test_sent', 'r') as test_sent_file:
  test_words = test_sent_file.readlines()

with open('test_label', 'r') as test_labels_file:
  test_labels_by_word = test_labels_file.readlines()

In [8]:
# Sanity check to see that the number of tokens and no. of corresponding labels match.
print("Count of tokens in training set\n","No. of words: ",len(train_words),"\nNo. of labels: ",len(train_labels_by_word))
print("\n\nCount of tokens in test set\n","No. of words: ",len(test_words),"\nNo. of labels: ",len(test_labels_by_word))

Count of tokens in training set
 No. of words:  48501 
No. of labels:  48501


Count of tokens in test set
 No. of words:  19674 
No. of labels:  19674


In [9]:
# Function to combine tokens belonging to the same sentence. Sentences are separated by "\n" in the dataset.
def convert_to_sentences(dataset):
    sent_list = []
    sent = ""
    for entity in dataset:
        if entity != '\n':
            sent = sent + entity[:-1] + " "       # Adding word/label to current sentence / sequence of labels
        else:
            sent_list.append(sent[:-1])           # Getting rid of the space added after the last entity.
            sent = ""
    return sent_list

In [10]:
# Converting tokens to sentences and individual labels to sequences of corresponding labels.
train_sentences = convert_to_sentences(train_words)
train_labels = convert_to_sentences(train_labels_by_word)
test_sentences = convert_to_sentences(test_words)
test_labels = convert_to_sentences(test_labels_by_word)

In [11]:
print("First five training sentences and their labels:\n")
for i in range(5):
    print(train_sentences[i],"\n",train_labels[i],"\n")

First five training sentences and their labels:

All live births > or = 23 weeks at the University of Vermont in 1995 ( n = 2395 ) were retrospectively analyzed for delivery route , indication for cesarean , gestational age , parity , and practice group ( to reflect risk status ) 
 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 

The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 ) 
 O O O O O O O O O O O O O O O O O O O O O O O O O 

Abnormal presentation was the most common indication ( 25.6 % , 88 of 344 ) 
 O O O O O O O O O O O O O O O 

The `` corrected '' cesarean rate ( maternal-fetal medicine and transported patients excluded ) was 12.4 % ( 273 of 2194 ) , and the `` corrected '' primary rate was 9.6 % ( 190 of 1975 ) 
 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 

Arrest of dilation was the most common indication in both `` corrected '' subgroups ( 23.4 a

In [12]:
#Count the number of sentences in the processed train and test dataset

print("No. of sentences in training set: ",len(train_sentences))
print("No. of sentences in test set: ",len(test_sentences))

No. of sentences in training set:  2599
No. of sentences in test set:  1056


In [13]:
#Count the number of lines of labels in the processed train and test dataset.

print("No. of lines of labels in training set: ",len(train_labels))
print("No. of lines of labels in test set: ",len(test_labels))

No. of lines of labels in training set:  2599
No. of lines of labels in test set:  1056


# **Exploratory Data Analysis**

Extract tokens which have NOUN or PROPN as their PoS tag and find their frequency

In [14]:
# Creating a combined dataset from training and test sentences
combined_sentences = train_sentences + test_sentences
print("No. of sentences in combined dataset: ",len(combined_sentences))

No. of sentences in combined dataset:  3655


In [15]:
# Creating a list of tokens which have PoS tag of 'NOUN' or 'PROPN'
combined_tokens = []
for sentence in combined_sentences:
    doc = model(sentence)
    for token in doc:
        if token.pos_ == 'NOUN' or token.pos_ == 'PROPN':
            combined_tokens.append(token.text)
print("No. of tokens in combined dataset: ",len(combined_tokens))

No. of tokens in combined dataset:  24373


Top 20 most common tokens with NOUN or PROPN PoS tags

In [16]:
# Import Counter from the collections module
from collections import Counter

token_counts = Counter(combined_tokens)

# Get the top 20 most common tokens
top_20_tokens = token_counts.most_common(20)

# Print the top 20 tokens
print("Top 20 most common tokens (NOUN/PROPN):")
for token, count in top_20_tokens:
    print(f"{token}: {count}")

Top 20 most common tokens (NOUN/PROPN):
patients: 492
treatment: 281
%: 247
cancer: 200
therapy: 175
study: 154
disease: 142
cell: 140
lung: 116
group: 94
chemotherapy: 88
gene: 87
effects: 85
results: 79
women: 77
use: 74
TO_SEE: 74
risk: 71
cases: 71
surgery: 71


In [17]:
# Analysis of PoS tags - Independent assignment for words vs Contextual assignment in a sentence.
sentence = train_sentences[1]
sent_list = sentence.split()      # Splitting the sentence into its constituent words.
position = 2                      # Choosing position of word within sentence. Index starts at 0.

word = sent_list[position]        # Extracting word for PoS tag analysis.

print(sentence)

The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 )


In [18]:
# Analysis of PoS tags: Independent vs Contextual assignment
sentence = train_sentences[1]  # Select a sentence from training data
position = 2  # Index of the word in the sentence (starting from 0)

# Extract the word at the specified position
word = sentence.split()[position]

# Display the original sentence
print("Sentence:", sentence)

# Independent PoS tag assignment (word analyzed in isolation)
independent_pos = model(word)[0].pos_
print(f"\nIndependent PoS tag for '{word}': {independent_pos}")

# Contextual PoS tag assignment (analyzed within sentence)
print("\nContextual PoS tags for all words in the sentence:")
for token in model(sentence):
    print(f"{token.text}: {token.pos_}")

# Contextual PoS tag for the word at the specified position
contextual_pos = None
for idx, token in enumerate(model(sentence)):
    if idx == position:
        contextual_pos = token.pos_
        break

print(f"\nContextual PoS tag for '{word}' at position {position}: {contextual_pos}")

Sentence: The total cesarean rate was 14.4 % ( 344 of 2395 ) , and the primary rate was 11.4 % ( 244 of 2144 )

Independent PoS tag for 'cesarean': VERB

Contextual PoS tags for all words in the sentence:
The: DET
total: ADJ
cesarean: ADJ
rate: NOUN
was: AUX
14.4: NUM
%: NOUN
(: PUNCT
344: NUM
of: ADP
2395: NUM
): PUNCT
,: PUNCT
and: CCONJ
the: DET
primary: ADJ
rate: NOUN
was: AUX
11.4: NUM
%: NOUN
(: PUNCT
244: NUM
of: ADP
2144: NUM
): PUNCT

Contextual PoS tag for 'cesarean' at position 2: ADJ


As we can see in the analysis above, the PoS tag of the word "cesarean" is not captured correctly if the word is considered individually. However, if the word is considered as a part of the sentence, then it is captured correctly. Defining a function below to execute this.

In [19]:
# Function to obtain contextual PoS tagger.
def contextual_pos_tagger(sent_list,position):
    '''Obtaining PoS tag for individual word with sentence context in-tact.
       If the PoS tag is obtained for a word individually, it may not capture the context of use in the sentence and may assign the incorrect PoS tag.'''

    sentence = " ".join(sent_list)          # Sentence needs to be in string format to process it with spacy model. List of words won't work.
    posit = 0                               # Initialising variable to record position of word in joined sentence to compare with the position of the word under considertion.
    for token in model(sentence):
        postag = token.pos_
        if (token.text == word) and (posit == position):
            break
        posit += 1
    return postag

In [20]:
# Define the features for a specific word in a sentence
def get_features_for_one_word(sent_list, position):
    word = sent_list[position]

    # Features for the current word
    features = {
        'word.lower': word.lower(),                          # Word in lowercase
        'word.postag': contextual_pos_tagger(sent_list, position),  # PoS tag of the current word
        'word[-3:]': word[-3:],                              # Last three characters
        'word[-2:]': word[-2:],                              # Last two characters
        'word.isupper': word.isupper(),                      # Is the word in all uppercase?
        'word.isdigit': word.isdigit(),                      # Is the word a number?
        'word.starts_with_capital': word[0].isupper()        # Does the word start with a capital letter?
    }

    # Features for the previous word
    if position > 0:
        prev_word = sent_list[position - 1]
        features.update({
            'prev_word.lower': prev_word.lower(),
            'prev_word.postag': contextual_pos_tagger(sent_list, position - 1),
            'prev_word.isupper': prev_word.isupper(),
            'prev_word.isdigit': prev_word.isdigit(),
            'prev_word.starts_with_capital': prev_word[0].isupper()
        })
    else:
        features['is_begin'] = True  # Indicates the beginning of a sentence

    # Features for the end of the sentence
    if position == len(sent_list) - 1:
        features['is_end'] = True  # Indicates the end of a sentence

    return features

# **Extract Features**

In [21]:
# Write a code to get features for a sentence.
def getFeaturesForOneSentence(sentence):
  sentence_list = sentence.split()
  return [get_features_for_one_word(sentence_list, position) for position in range(len(sentence_list))]

In [22]:
# Checking feature extraction
example_sentence = train_sentences[5]
print(example_sentence)

features = getFeaturesForOneSentence(example_sentence)
features[0]

Cesarean rates at tertiary care hospitals should be compared with rates at community hospitals only after correcting for dissimilar patient groups or gestational age


{'word.lower': 'cesarean',
 'word.postag': 'NOUN',
 'word[-3:]': 'ean',
 'word[-2:]': 'an',
 'word.isupper': False,
 'word.isdigit': False,
 'word.starts_with_capital': True,
 'is_begin': True}

In [23]:
features[4]

{'word.lower': 'care',
 'word.postag': 'NOUN',
 'word[-3:]': 'are',
 'word[-2:]': 're',
 'word.isupper': False,
 'word.isdigit': False,
 'word.starts_with_capital': False,
 'prev_word.lower': 'tertiary',
 'prev_word.postag': 'NOUN',
 'prev_word.isupper': False,
 'prev_word.isdigit': False,
 'prev_word.starts_with_capital': False}

In [24]:
# Write a code to get the labels for a sentence.
def getLabelsInListForOneSentence(labels):
  return labels.split()

In [25]:
# Checking label extraction
example_labels = getLabelsInListForOneSentence(train_labels[5])
print(example_labels)

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


# **Building the Model**

In [26]:
X_train = [getFeaturesForOneSentence(sentence) for sentence in train_sentences]
X_test = [getFeaturesForOneSentence(sentence) for sentence in test_sentences]

In [27]:
Y_train = [getLabelsInListForOneSentence(labels) for labels in train_labels]
Y_test = [getLabelsInListForOneSentence(labels) for labels in test_labels]

In [29]:
# Building the CRF model. Using max_iterations as 200.
crf = sklearn_crfsuite.CRF(max_iterations=300)

crf.fit(X_train, Y_train)

# **Evaluation**

In [30]:
Y_pred = crf.predict(X_test)

In [31]:
metrics.flat_f1_score(Y_test, Y_pred, average='weighted')

0.9081250616038787

In [32]:
# Example test sentence and corresponding actual and predicted labels
print("Sentence: ",test_sentences[13])
print("Actual labels:    ", Y_test[13])
print("Predicted labels: ", Y_pred[13])

Sentence:  The objective of this study was to determine if the rate of preeclampsia is increased in triplet as compared to twin gestations
Actual labels:     ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'D', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Predicted labels:  ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'D', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [33]:
# Feature list of sentence above
print(X_test[13])

[{'word.lower': 'the', 'word.postag': 'NOUN', 'word[-3:]': 'The', 'word[-2:]': 'he', 'word.isupper': False, 'word.isdigit': False, 'word.starts_with_capital': True, 'is_begin': True}, {'word.lower': 'objective', 'word.postag': 'NOUN', 'word[-3:]': 'ive', 'word[-2:]': 've', 'word.isupper': False, 'word.isdigit': False, 'word.starts_with_capital': False, 'prev_word.lower': 'the', 'prev_word.postag': 'NOUN', 'prev_word.isupper': False, 'prev_word.isdigit': False, 'prev_word.starts_with_capital': True}, {'word.lower': 'of', 'word.postag': 'NOUN', 'word[-3:]': 'of', 'word[-2:]': 'of', 'word.isupper': False, 'word.isdigit': False, 'word.starts_with_capital': False, 'prev_word.lower': 'objective', 'prev_word.postag': 'NOUN', 'prev_word.isupper': False, 'prev_word.isdigit': False, 'prev_word.starts_with_capital': False}, {'word.lower': 'this', 'word.postag': 'NOUN', 'word[-3:]': 'his', 'word[-2:]': 'is', 'word.isupper': False, 'word.isdigit': False, 'word.starts_with_capital': False, 'prev_wor

In [39]:
print(type(X_test))
print(X_test[0])  # Print the first element to check its structure
print(X_test[0][0])  # Drill down further to inspect the nested structure


<class 'list'>
[{'word.lower': 'furthermore', 'word.postag': 'PUNCT', 'word[-3:]': 'ore', 'word[-2:]': 're', 'word.isupper': False, 'word.isdigit': False, 'word.starts_with_capital': True, 'is_begin': True}, {'word.lower': ',', 'word.postag': 'PUNCT', 'word[-3:]': ',', 'word[-2:]': ',', 'word.isupper': False, 'word.isdigit': False, 'word.starts_with_capital': False, 'prev_word.lower': 'furthermore', 'prev_word.postag': 'PUNCT', 'prev_word.isupper': False, 'prev_word.isdigit': False, 'prev_word.starts_with_capital': True}, {'word.lower': 'when', 'word.postag': 'PUNCT', 'word[-3:]': 'hen', 'word[-2:]': 'en', 'word.isupper': False, 'word.isdigit': False, 'word.starts_with_capital': False, 'prev_word.lower': ',', 'prev_word.postag': 'PUNCT', 'prev_word.isupper': False, 'prev_word.isdigit': False, 'prev_word.starts_with_capital': False}, {'word.lower': 'all', 'word.postag': 'PUNCT', 'word[-3:]': 'all', 'word[-2:]': 'll', 'word.isupper': False, 'word.isdigit': False, 'word.starts_with_capita

# **Identifying Diseases and Treatments using Custom NER**

 We now use the CRF model's prediction to prepare a record of diseases identified in the corpus and treatments used for the diseases.

In [42]:
# Extracting a dictionary of all the predicted diseases from our test data and the corresponding treatments.
disease_treatment = {}  # Initializing an empty dictionary

for i in range(len(Y_pred)):
    diseases = []  # List of diseases in the current sentence
    treatments = []  # List of treatments in the current sentence
    current_disease = ""  # Temporary storage for multi-word disease
    current_treatment = ""  # Temporary storage for multi-word treatment
    length = len(Y_pred[i])  # Length of current sentence

    for j in range(length):
        # Access the 'word.lower' key to get the actual word string
        word = X_test[i][j]['word.lower']  # Assuming 'word.lower' key holds the word

        # Check for disease labels ('D')
        if Y_pred[i][j] == 'D':
            current_disease += word + " "
            if j == length - 1 or Y_pred[i][j + 1] != 'D':  # End of disease name
                diseases.append(current_disease.strip())
                current_disease = ""

        # Check for treatment labels ('T')
        if Y_pred[i][j] == 'T':
            current_treatment += word + " "
            if j == length - 1 or Y_pred[i][j + 1] != 'T':  # End of treatment name
                treatments.append(current_treatment.strip())
                current_treatment = ""

    # Update the disease_treatment dictionary
    for disease in diseases:
        if disease in disease_treatment:
            disease_treatment[disease].extend(treatments)
        else:
            disease_treatment[disease] = treatments[:]

# The `disease_treatment` dictionary now contains diseases as keys and their treatments as values.

In [43]:
# Displaying dictionary of extracted diseases and potential treatments.
disease_treatment

{'gestational diabetes cases': [],
 'preeclampsia': [],
 'severe preeclampsia': [],
 'asymmetric double hemiplegia': [],
 'reversible nonimmune hydrops fetalis': [],
 'breast and/or ovarian cancer': [],
 'breast cancer': ['hormone replacement therapy',
  'undergone subcutaneous mastectomy'],
 'ovarian cancer': [],
 'prostate cancer': ['radical prostatectomy and iodine 125 interstitial radiotherapy'],
 'mutated prostate cancer': [],
 'hereditary prostate cancer': [],
 'multiple sclerosis ( ms )': [],
 'hereditary retinoblastoma': ['radiotherapy'],
 'epilepsy': [],
 'unstable angina or non-q-wave myocardial infarction': ['roxithromycin'],
 'coronary-artery disease': ['antichlamydial antibiotics'],
 'early-stage cervical carcinoma': [],
 'advanced disease': [],
 'cerebral palsy': ['hyperbaric oxygen therapy'],
 'severe pain': [],
 'myofascial trigger point pain': [],
 'infections': [],
 'primary pulmonary hypertension ( pph )': ['fenfluramines'],
 'essential hypertension': [],
 'osteoporo

In [44]:
# Obtaining a cleaned version of our "disease_treatment" dictionary
cleaned_dict = {}
for key in disease_treatment.keys():
    if disease_treatment[key] != []:
        cleaned_dict[key] = disease_treatment[key]
cleaned_dict

{'breast cancer': ['hormone replacement therapy',
  'undergone subcutaneous mastectomy'],
 'prostate cancer': ['radical prostatectomy and iodine 125 interstitial radiotherapy'],
 'hereditary retinoblastoma': ['radiotherapy'],
 'unstable angina or non-q-wave myocardial infarction': ['roxithromycin'],
 'coronary-artery disease': ['antichlamydial antibiotics'],
 'cerebral palsy': ['hyperbaric oxygen therapy'],
 'primary pulmonary hypertension ( pph )': ['fenfluramines'],
 'cellulitis': ['g-csf therapy', 'intravenous antibiotic treatment'],
 'foot infection': ['g-csf treatment'],
 "early parkinson 's disease": ['ropinirole monotherapy'],
 'sore throat': ['antibiotics'],
 'female stress urinary incontinence': ['surgical treatment'],
 'stress urinary incontinence': ['therapy'],
 'preeclampsia ( proteinuric hypertension )': ['intrauterine insemination with donor sperm versus intrauterine insemination'],
 'severe acquired hyperammonemia': ['organ transplantation and chemotherapy'],
 'cancer': 

In [45]:
# Converting dictionary to a dataframe
cleaned_df = pd.DataFrame({"Disease":cleaned_dict.keys(),"Treatments":cleaned_dict.values()})
cleaned_df.head()

Unnamed: 0,Disease,Treatments
0,breast cancer,"[hormone replacement therapy, undergone subcut..."
1,prostate cancer,[radical prostatectomy and iodine 125 intersti...
2,hereditary retinoblastoma,[radiotherapy]
3,unstable angina or non-q-wave myocardial infar...,[roxithromycin]
4,coronary-artery disease,[antichlamydial antibiotics]


In [46]:
search_item = 'hereditary retinoblastoma'
treatments = cleaned_dict[search_item]
print("Treatments for '{0}' is/are ".format(search_item), end = "")
for i in range(len(treatments)-1):
    print("'{}'".format(treatments[i]),",", end="")
print("'{}'".format(treatments[-1]))

Treatments for 'hereditary retinoblastoma' is/are 'radiotherapy'
