In [1]:
"""
Property Of Tausight - September 2023
"""
print(__doc__)


Property Of Tausight - September 2023



# Problem Definition

#### This internship project aims to develop a robust Medical Entity Detection and Tracking model. 

In this interview assessment, we assume that you have already built a Named Entity Recognition (NER) model that extracts Personally Identifiable Information (PII) from medical documents, and that its performance is acceptable. Your primary task is to extend this model so that it accurately associates the extracted PIIs with patients while distinguishing them from other PIIs (such as those belonging to doctors, nurses, etc.). Whenever you detect and map PIIs to a patient, add the mapping to a database so that you can query patient information when needed. For this exercise, we recommend that you create your own database using a suitable data structure and add your patients to it.



### <font color='red'> To achieve this, you must tackle several critical challenges: </font>


__Complex Patient Information:__ Patient PIIs extend beyond just a name and may include elements like Date of Birth (DOB), Medical Record Number (MRN), phone numbers, and addresses.

__Repetitive PIIs:__ Patient PIIs can occur multiple times within a single document. For instance, a patient's name and MRN may appear in various sections of the document.

__Dispersed Patient PIIs:__ Patient PIIs may not always be grouped together within the document; instead, they may be scattered throughout the text.

__Recurrence of Healthcare Provider Names:__ The names of doctors and clinicians may also recur within the document, potentially complicating the accurate identification of patient PIIs.

__Multiple Patient PIIs:__ A document might contain information about multiple patients at once.

__Vague Patient Identity:__ A document might contain only some of a patient's PIIs rather than all of them.
    
    Ex: You might find:
    - Name = George Micheal, Phone = 61754584587
    - Name = george Micheal, MRN = 4546546587
    - Name = george Micheal, DOB = 01/01/1990
    In this scenario, we cannot be sure that George Micheal is the same person across these records. It is better to keep track of all of them.


You are provided with a CSV file containing eight synthetic medical records, each with detected PIIs. Your primary objective is to develop a model that, given a set of detected PIIs, can precisely identify and associate the relevant information with the patient. It's crucial to note that your model will solely have access to the list of detected PIIs and not the actual text of the document. Consequently, you must devise a strategy to handle various scenarios where patient PIIs may be incomplete or distributed across different document sections.

__Expected Model's Input:__ [OTHER, NAME, NAME, OTHER, ID, ......]

### Database Example:

Your model detected:

patient's name: Babak and DOB: 10/10/2021, but MRN is not detected. 

However, you have previously identified a patient with the name Babak, DOB: 10/10/2021, and MRN: 12457458, and added this patient to the database.

How can you leverage this prior knowledge to enhance your detection process?
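One way to approach this question is a small in-memory database that merges a new detection into an existing record only when the names match and at least one other identifier agrees. The sketch below is a minimal illustration under that assumption; the class names, the field names (`dob`, `mrn`, `phone`), and the matching rule are all hypothetical choices, not a prescribed solution.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical identifier names used for matching (an assumption of this sketch).
IDENTIFYING_FIELDS = ("dob", "mrn", "phone")


@dataclass
class PatientRecord:
    name: str
    fields: Dict[str, str] = field(default_factory=dict)


class PatientDatabase:
    def __init__(self) -> None:
        self.records: List[PatientRecord] = []

    @staticmethod
    def _normalize(name: str) -> str:
        # Case- and order-insensitive key: "Wallace, Sarthak" == "sarthak wallace".
        return " ".join(sorted(name.replace(",", " ").lower().split()))

    def find_match(self, name: str, fields: Dict[str, str]) -> Optional[PatientRecord]:
        key = self._normalize(name)
        for rec in self.records:
            if self._normalize(rec.name) != key:
                continue
            shared = [f for f in IDENTIFYING_FIELDS if f in fields and f in rec.fields]
            # Require at least one shared identifier, and no disagreement among shared ones.
            if shared and all(fields[f] == rec.fields[f] for f in shared):
                return rec
        return None

    def upsert(self, name: str, **fields: str) -> PatientRecord:
        match = self.find_match(name, fields)
        if match is None:
            # No confident match: keep this detection as its own record.
            match = PatientRecord(name=name, fields=dict(fields))
            self.records.append(match)
        else:
            # Fill in only the fields the stored record is missing.
            for k, v in fields.items():
                match.fields.setdefault(k, v)
        return match

    def query(self, name: str) -> List[PatientRecord]:
        key = self._normalize(name)
        return [r for r in self.records if self._normalize(r.name) == key]


db = PatientDatabase()
db.upsert("Babak", dob="10/10/2021", mrn="12457458")
rec = db.upsert("Babak", dob="10/10/2021")  # MRN not detected this time
print(rec.fields["mrn"])                    # recovered from the stored record

# Same name but no shared identifier: the records are kept apart.
db.upsert("George Micheal", phone="61754584587")
db.upsert("george Micheal", mrn="4546546587")
db.upsert("george Micheal", dob="01/01/1990")
print(len(db.query("George Micheal")))
```

With this rule, the second Babak detection inherits the stored MRN, while the three George Micheal detections stay separate because none of them shares an identifier with another.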

## Edge Cases to consider

Your model should be capable of addressing scenarios such as:

* "Wallace, Sarthak MRN: 546584 DOB: 01/01/1980 visited the hospital, ....."
* "Sarthak DOB: 01/01/1980 visited the hospital, during his stay at the clinic he did receive his flu shot. MRN: 546584 and ...."
* "Wallace, DOB: 01/01/1980 visited the hospital, ....."
* "Wallace, Sarthak visited the hospital, AM HOW OFTEN TO TAKE every day HOW TO TAKE INSTRUCTIONS mouth Start taMARAKI medication: MRN: 546584 On Discharge OMEPRAZOLE as: PRILOSEC 40 mg Last dose given 11/08/2010 08:03 AM two times a day mouth"

Your solution should comprehensively address these complexities, providing accurate patient PII mapping in diverse medical document scenarios.
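As a starting point for the edge cases above, consecutive tokens that share a PII label can be merged into spans, so that a name split over two tokens or an MRN that appears far from the name still ends up in one candidate record. The sketch below assumes whitespace-split tokens aligned one-to-one with their labels; the helper names and the example labels are hypothetical, and the "first value per label" rule is a deliberately naive heuristic.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens sharing a non-OTHER label into (label, text) spans."""
    spans = []
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        if label != "OTHER":
            text = " ".join(tok for tok, _ in group)
            spans.append((label, text.strip(",")))
    return spans


def first_value_per_label(spans: List[Tuple[str, str]]) -> Dict[str, str]:
    """Naive heuristic: keep the first span seen for each label."""
    record: Dict[str, str] = {}
    for label, text in spans:
        record.setdefault(label, text)
    return record


# Shortened version of the last edge case, with assumed labels.
tokens = "Wallace, Sarthak visited the hospital, medication: MRN: 546584 On Discharge".split()
labels = ["NAME", "NAME", "OTHER", "OTHER", "OTHER", "OTHER", "OTHER", "ID", "OTHER", "OTHER"]

spans = group_entities(tokens, labels)
record = first_value_per_label(spans)
print(record)
```

The first-value rule breaks down when a provider's name precedes the patient's, so a real solution would need additional evidence, such as proximity to an ID or DOB span.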

### Hint


We recommend that you create new examples from the provided data to verify that your code works. Example:

    You could create new data from: text = "Discharge Summary FINAL AHMADI, JOSEPH MRN: 22954568 DOB: 1956-01-08 60F Admission: 12/30/2015"
        - new text = "Discharge Summary FINAL AHMADI, JOSEPH Phone: 61754854587 DOB: 1956-01-08 60F Admission: 12/30/2015"
        - new text = "Discharge Summary FINAL AHMADI, JOSEPH Phone: 61754854587"
        
    Be creative
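The variants in this hint can be generated programmatically. The following is a minimal sketch using regular expressions; the helper names `swap_mrn_for_phone` and `drop_field` are hypothetical, and the patterns assume fields written as `Label: value` with no spaces inside the value.

```python
import re


def swap_mrn_for_phone(text: str, phone: str) -> str:
    """Replace the first 'MRN: <digits>' field with a 'Phone: <digits>' field."""
    return re.sub(r"MRN:\s*\d+", f"Phone: {phone}", text, count=1)


def drop_field(text: str, field_name: str) -> str:
    """Remove the first '<field_name>: <value>' pair from the text."""
    return re.sub(rf"\s*{re.escape(field_name)}:\s*\S+", "", text, count=1)


text = "Discharge Summary FINAL AHMADI, JOSEPH MRN: 22954568 DOB: 1956-01-08 60F Admission: 12/30/2015"

variant = swap_mrn_for_phone(text, "61754854587")
print(variant)

shorter = drop_field(variant, "DOB")
print(shorter)
```

These helpers change only the text. The matching PII label list has to be updated in the same way (for example `ID` becomes `PHONE`, and the labels of a dropped field are removed) so that tokens and labels stay aligned.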

In [2]:
import pandas as pd

In [3]:
data = pd.read_csv('sample_data.csv')
data.head()

Unnamed: 0,Text,PIIs
0,"Discharge Summary FINAL AHMADI, JOSEPH MRN: 22...","['OTHER', 'OTHER', 'OTHER', 'NAME', 'NAME', 'O..."
1,DAVID KING M MRN: 15424576 DFCI MRN: 564756 DO...,"['NAME', 'NAME', 'OTHER', 'OTHER', 'ID', 'OTHE..."
2,PATIENT MEDICATION LIST ON DISCHARGE FROM Mass...,"['OTHER', 'OTHER', 'OTHER', 'OTHER', 'OTHER', ..."
3,PATIENT MEDICATION LIST ON DISCHARGE FROM Mass...,"['OTHER', 'OTHER', 'OTHER', 'OTHER', 'OTHER', ..."
4,"Discharge Summary FINAL AHMADI, DOB: 1956-01-...","['OTHER', 'OTHER', 'OTHER', 'NAME', 'OTHER', '..."


In [4]:
import warnings 
warnings.filterwarnings("ignore")

In [5]:
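# Note: 'PIIs' is stored as a string, so this prints its character count, not the number of labels.
for i in range(8):
    p_row = data.iloc[i]
    print(len(p_row['PIIs']))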

8921
7057
2310
5057
8821
6989
2310
5057


In [6]:
# Here I build a dictionary in order to observe how many different PIIs there are
PIIs_labels = {}
replace_id = 1
for i in range(8):
    test_row = data.iloc[i]
    test_PIIs = test_row['PIIs']
    test_PIIs = test_PIIs.replace("[", "").replace("]", "").replace(" ", "").replace("'", "")
    test_PIIs = test_PIIs.split(',')
    for label in test_PIIs:
        if label not in PIIs_labels:
            PIIs_labels[label] = replace_id 
            replace_id  += 1
print(PIIs_labels)

{'OTHER': 1, 'NAME': 2, 'ID': 3, 'DOB': 4, 'PHONE': 5, 'ADDRESS': 6, 'AGE': 7, 'DATE': 8}
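Since each `PIIs` cell holds the string form of a Python list, it can also be parsed directly instead of stripping brackets and quotes by hand. A minimal sketch, using a shortened illustrative string rather than a full row from the CSV:

```python
import ast
from collections import Counter

# Shortened, illustrative stand-in for one 'PIIs' cell.
raw = "['OTHER', 'OTHER', 'OTHER', 'NAME', 'NAME', 'OTHER', 'ID', 'OTHER', 'DOB']"

labels = ast.literal_eval(raw)  # safely parse the string into a real list
counts = Counter(labels)        # how often each tag occurs

# 0-based ids in order of first appearance (dict.fromkeys keeps insertion order).
label2id = {label: i for i, label in enumerate(dict.fromkeys(labels))}

print(len(labels), counts, label2id)
```

Parsing first also makes `len(labels)` report the number of tags rather than the number of characters in the string.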


In [7]:
# Data preprocessing for PIIs
# The idea is to convert the PII tags into a single space-separated string, the same format as Text.
# I used a fast tokenizer later and could not pass the labels as list[int] or list[str], so I handled it this way.
# Initial thought: use the dictionary above to convert all PIIs into numeric ids and keep them as a list.
# Because of that limitation, the approach was changed.
new_data = data.copy()  # copy so the original DataFrame is left untouched
for i in range(8):
    c_row = data.iloc[i]
    c_PIIs = c_row['PIIs']
    c_PIIs = c_PIIs.replace("[", "").replace("]", "").replace(" ", "").replace("'", "").replace(",", " ")
    new_data.loc[i, 'PIIs'] = c_PIIs  # .loc avoids chained assignment, which may write to a temporary copy

In [8]:
print(new_data)

                                                Text  \
0  Discharge Summary FINAL AHMADI, JOSEPH MRN: 22...   
1  DAVID KING M MRN: 15424576 DFCI MRN: 564756 DO...   
2  PATIENT MEDICATION LIST ON DISCHARGE FROM Mass...   
3  PATIENT MEDICATION LIST ON DISCHARGE FROM Mass...   
4  Discharge Summary FINAL AHMADI,  DOB: 1956-01-...   
5  DAVID M DFCI MRN: 564756 DOB: 11/12/1964 64F A...   
6  PATIENT MEDICATION LIST ON DISCHARGE FROM Mass...   
7  PATIENT MEDICATION LIST ON DISCHARGE FROM Mass...   

                                                PIIs  
0  OTHER OTHER OTHER NAME NAME OTHER ID OTHER DOB...  
1  NAME NAME OTHER OTHER ID OTHER OTHER ID OTHER ...  
2  OTHER OTHER OTHER OTHER OTHER OTHER ADDRESS OT...  
3  OTHER OTHER OTHER OTHER OTHER OTHER OTHER OTHE...  
4  OTHER OTHER OTHER NAME OTHER DOB OTHER OTHER O...  
5  NAME OTHER OTHER OTHER ID OTHER DOB OTHER OTHE...  
6  OTHER OTHER OTHER OTHER OTHER OTHER OTHER OTHE...  
7  OTHER OTHER OTHER OTHER OTHER OTHER OTHER OTHE...  


In [9]:
# Encode the contents of Text and PIIs,
# because the model needs tensor inputs during training.

import torch
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
Text_input_ids = []
Text_attention_masks = []
PIIs_input_ids = []
PIIs_attention_masks = []

for i in range(8):
    Text_tokens = new_data.iloc[i]['Text']
    PIIs_tokens = new_data.iloc[i]['PIIs']
    
    Text_encode = tokenizer.encode_plus(Text_tokens, return_offsets_mapping=True, truncation=True, padding='max_length', max_length=64)
    PIIs_encode = tokenizer.encode_plus(PIIs_tokens, return_offsets_mapping=True, truncation=True, padding='max_length', max_length=64)
    # output of encode_plus: 'input_ids', 'attention_mask', 'offset_mapping'
    
#     if i == 0:
#         print(PIIs_encode.input_ids)
    
    Text_input_ids.append(Text_encode['input_ids'])
    Text_attention_masks.append(Text_encode['attention_mask'])
    
    PIIs_input_ids.append(PIIs_encode['input_ids'])
    PIIs_attention_masks.append(PIIs_encode['attention_mask'])

# Change data to tensor
Text_input_ids = torch.tensor(Text_input_ids)
Text_attention_masks = torch.tensor(Text_attention_masks)
PIIs_input_ids = torch.tensor(PIIs_input_ids)
PIIs_attention_masks = torch.tensor(PIIs_attention_masks)

# print(Text_input_ids.shape)
# print(Text_attention_masks.shape)

[101, 2060, 2060, 2060, 2171, 2171, 2060, 8909, 2060, 2079, 2497, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2171, 2171, 2060, 8909, 2060, 2079, 2497, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 2060, 3042, 102]
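The preprocessing comments mention that the tags were turned into text because the labels could not be passed to the tokenizer as a list. A common alternative is to tokenize only the text and then expand the word-level labels to subword tokens. With a fast tokenizer, the encoding's `word_ids()` method reports which word each token came from; the sketch below takes such a list as plain input so that it runs without downloading a model. The function name, the example `word_ids`, and the 0-based `label2id` map are assumptions for illustration.

```python
from typing import Dict, List, Optional

IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[str],
    label2id: Dict[str, int],
) -> List[int]:
    """Expand word-level labels to token level; special tokens get IGNORE_INDEX."""
    aligned = []
    for word_id in word_ids:
        if word_id is None:
            # [CLS], [SEP] and padding have no source word.
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(label2id[word_labels[word_id]])
    return aligned


label2id = {"OTHER": 0, "NAME": 1, "ID": 2}     # hypothetical 0-based mapping
word_labels = ["NAME", "NAME", "OTHER", "ID"]   # one label per word
word_ids = [None, 0, 0, 1, 2, 3, 3, None]       # e.g. words 0 and 3 split into two subwords

aligned = align_labels(word_ids, word_labels, label2id)
print(aligned)
```

Labels built this way have the same length as `input_ids`, and the classifier would only need as many outputs as there are PII tags.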


In [10]:
# Split the dataset
# Train:Validation = 6:2 (no separate test set)
from torch.utils.data import random_split

dataset_size = len(Text_input_ids)
train_size = int(0.8 * dataset_size)  # Train dataset
val_size = dataset_size - train_size  # Validation dataset

full_dataset = TensorDataset(Text_input_ids, Text_attention_masks, PIIs_input_ids)
train_dataset, val_dataset = random_split(full_dataset, [train_size, val_size])

In [11]:
# Since the data is medical, BioBERT is used as the base model for training.
# It is pretrained on biomedical text, so it should handle medical-domain text better.
from transformers import BertTokenizer, BertForTokenClassification
from torch.optim import Adam
from transformers import get_linear_schedule_with_warmup

model_name = "monologg/biobert_v1.1_pubmed"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name, num_labels=9999)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
model.train()

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at monologg/biobert_v1.1_pubmed and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
# DataLoader
batch_size = 16
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

num_epochs = 20 # Num of Epoch

# Learning Rate
total_steps = len(train_dataloader) * num_epochs
warmup_steps = int(total_steps * 0.1)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)

# Start of training


for epoch in range(num_epochs):
    # Training Process
    model.train()
    total_train_loss = 0
    for batch in train_dataloader:
        inputs, masks, labels = batch
        inputs, masks, labels = inputs.to(device), masks.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(inputs, attention_mask=masks, labels=labels)
        loss = outputs.loss
        total_train_loss += loss.item()
        loss.backward()
        optimizer.step()
        scheduler.step()
        
    average_train_loss = total_train_loss / len(train_dataloader)
    
    # Validation Process
    model.eval()
    total_val_loss = 0
    with torch.no_grad():
        for batch in val_dataloader:
            inputs, masks, labels = batch
            inputs, masks, labels = inputs.to(device), masks.to(device), labels.to(device)
            
            outputs = model(inputs, attention_mask=masks, labels=labels)
            total_val_loss += outputs.loss.item()

    average_val_loss = total_val_loss / len(val_dataloader)

    print(f"Epoch: {epoch+1}/{num_epochs}, Training Loss: {average_train_loss:.4f}, Validation Loss: {average_val_loss:.4f}")

# Save the model and tokenizer
model_path = "./model.bin"
torch.save(model.state_dict(), model_path)
tokenizer_path = "./tokenizer"
tokenizer.save_pretrained(tokenizer_path)

Epoch: 1/20, Training Loss: 9.1532, Validation Loss: 9.1276
Epoch: 2/20, Training Loss: 9.1127, Validation Loss: 8.9824
Epoch: 3/20, Training Loss: 8.9365, Validation Loss: 8.6563
Epoch: 4/20, Training Loss: 8.5193, Validation Loss: 8.2707
Epoch: 5/20, Training Loss: 8.0633, Validation Loss: 7.8114
Epoch: 6/20, Training Loss: 7.6623, Validation Loss: 7.3430
Epoch: 7/20, Training Loss: 7.2214, Validation Loss: 6.9362
Epoch: 8/20, Training Loss: 6.8810, Validation Loss: 6.6320
Epoch: 9/20, Training Loss: 6.5758, Validation Loss: 6.3868
Epoch: 10/20, Training Loss: 6.3452, Validation Loss: 6.1793
Epoch: 11/20, Training Loss: 6.2443, Validation Loss: 6.0043
Epoch: 12/20, Training Loss: 5.9621, Validation Loss: 5.8571
Epoch: 13/20, Training Loss: 5.8347, Validation Loss: 5.7359
Epoch: 14/20, Training Loss: 5.7143, Validation Loss: 5.6429
Epoch: 15/20, Training Loss: 5.5500, Validation Loss: 5.5669
Epoch: 16/20, Training Loss: 5.3964, Validation Loss: 5.5040
Epoch: 17/20, Training Loss: 5.32

('./tokenizer\\tokenizer_config.json',
 './tokenizer\\special_tokens_map.json',
 './tokenizer\\vocab.txt',
 './tokenizer\\added_tokens.json')

In [15]:
# Text recognition

# Because the predictions have to be decoded, the decoded text does not correspond accurately to the labels.
# I think this is mainly a matter of time: I may need more time to tune the model parameters,
# but my computer cannot handle long training runs.

# In addition, I have some ideas about the model.
# For example, to avoid passing the labels as a list, I converted all the PIIs into text.
# I now think it would be enough to map each PII to its label id through a dictionary,
# with no further encoding needed, and that this would probably be more accurate.

model.eval()

text = "Discharge Summary FINAL AHMADI, JOSEPH Phone: 61754854587 DOB: 1956-01-08 60F Admission: 12/30/2015"
encoded_text = tokenizer.encode_plus(
    text, 
    truncation=True, 
    padding='max_length', 
    max_length=8,
    return_tensors="pt"
)

input_ids = encoded_text["input_ids"].to(device)
attention_mask = encoded_text["attention_mask"].to(device)

with torch.no_grad():
    output = model(input_ids, attention_mask=attention_mask)
logits = output.logits

predicted_label_ids = torch.argmax(logits, dim=-1).squeeze().tolist()

print(predicted_label_ids)

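# Note: `tokenizer` now refers to the BioBERT tokenizer (In [11]), while the labels were encoded
# with the bert-base-uncased tokenizer (In [9]), so the decoded words do not map back to PII tags.
predicted_labels = tokenizer.decode(predicted_label_ids, skip_special_tokens=True)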

print(predicted_labels)


[3177, 7134, 551, 8361, 6039, 4479, 4075, 8234]
De abilities ס cared vampires Jimmy heads interaction
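If the model were instead trained with one class per PII tag, as the comments in this cell suggest, the predicted ids could be mapped back through a dictionary rather than decoded with a tokenizer. A minimal sketch under that assumption; the 0-based `id2label` map follows the tag order found earlier, and the prediction values are made up for illustration.

```python
from typing import Dict, List

# Hypothetical 0-based mapping in the order the tags were discovered.
id2label: Dict[int, str] = {
    0: "OTHER", 1: "NAME", 2: "ID", 3: "DOB",
    4: "PHONE", 5: "ADDRESS", 6: "AGE", 7: "DATE",
}


def decode_predictions(pred_ids: List[int], attention_mask: List[int]) -> List[str]:
    """Map predicted class ids to tag names, skipping padded positions."""
    return [id2label[p] for p, m in zip(pred_ids, attention_mask) if m == 1]


pred_ids = [0, 1, 1, 0, 2, 0, 3, 0]        # illustrative values, not real model output
attention_mask = [1, 1, 1, 1, 1, 1, 1, 0]  # last position is padding

decoded = decode_predictions(pred_ids, attention_mask)
print(decoded)
```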
