
# Using NLP to Process Diagnosis Data for Triage Prediction

In healthcare, predicting triage levels and understanding patient influx patterns are essential for effective resource management, particularly in emergency departments.
This notebook uses natural language processing (NLP) techniques to analyze diagnosis data and extract features that can inform predictions of patient severity levels.

**Objective**: The primary objective of this notebook is to preprocess and analyze a dataset containing patient diagnoses in order to develop a predictive model for patient severity levels. Specifically, the notebook aims to:
- Clean and standardize the diagnosis data.
- Apply NLP techniques such as tokenization and lemmatization to prepare the text for analysis.
- Create a Bag of Words (BoW) representation for machine learning.

## 1. Library Imports

This section imports the libraries used throughout the notebook. The spaCy library handles NLP tasks such as tokenization and lemmatization, the re library provides regular expressions for text cleaning, and tqdm displays progress bars for long-running operations. pandas, NumPy, and scikit-learn are used for data handling, vectorization, and the test model.

In [1]:
import pandas as pd
import re
import spacy
from tqdm import tqdm
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import pickle 
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

import warnings  # The warnings module to handle warnings during code execution

# Suppress warnings
warnings.filterwarnings('ignore')

In [2]:
# Load the English NLP model from spaCy
nlp = spacy.load("en_core_web_sm")

## 2. Loading Dataset

Initial exploration is performed by displaying the first rows of the DataFrame.

In [3]:
# Loading the dataset
data = pd.read_csv('../Data/data_after_EDA.csv')

data.head()

Unnamed: 0,ROWNUM,Hospital,Eligibility Class,Gender,Arrival Time,Severity Level,Deparment,Main Diagnosis,Discharge Time,Waiting Time (Minutes),Length of Stay (Minutes),Treatment Time(Minutes),Cluster,No Treatment
0,1,Royal Commission Health Services Program,ROYAL COMMISSION,Female,2023-12-13 13:17:48,Level Ⅳ,Emergency Medicine,"Pain, unspecified",2023-12-13 16:43:00,14.0,205.0,191.0,2,0
1,2,Royal Commission Health Services Program,ROYAL COMMISSION,Female,2023-12-08 10:59:28,Level Ⅲ,Emergency Medicine,Low back pain,2023-12-08 12:50:00,7.0,111.0,104.0,1,0
2,3,Royal Commission Health Services Program,ROYAL COMMISSION,Female,2023-11-05 14:03:02,Level Ⅲ,Emergency Medicine,"Acute upper respiratory infection, unspecified",2023-11-05 14:54:00,24.0,51.0,27.0,1,0
3,4,Royal Commission Health Services Program,ROYAL COMMISSION,Female,2023-10-07 22:57:41,Level Ⅲ,Emergency Medicine,Epistaxis,2023-10-08 00:09:00,26.0,71.0,0.0,1,1
4,5,Royal Commission Health Services Program,ROYAL COMMISSION,Female,2023-10-21 21:32:17,Level Ⅳ,Emergency Medicine,"Acute upper respiratory infection, unspecified",2023-10-21 23:10:00,56.0,98.0,42.0,0,0


Records with an 'Unrated' severity level or an 'Unknown' main diagnosis are removed, so that only labeled and informative rows remain for further analysis.

In [4]:
# Filter out rows with 'Unrated' severity and 'Unknown' diagnosis
data = data[data['Severity Level'] != 'Unrated']
data = data[data['Main Diagnosis'] != 'Unknown']

# Renumber the rows from 0 and drop the ROWNUM column
data = data.reset_index(drop=True)
data = data.drop(columns=['ROWNUM'])

# Save the cleaned dataset to a new CSV file
data.to_csv('../Data/data_after_EDA2.csv', index=False)
print("Updated dataset saved to '../Data/data_after_EDA2.csv'.")

Updated dataset saved to '../Data/data_after_EDA2.csv'.


## 3. Data Preparation

A new DataFrame is created to focus on the 'Main Diagnosis' column, which is crucial for subsequent NLP tasks. 

In [5]:
# Create a new DataFrame with only the 'Main Diagnosis' column
df = pd.DataFrame(data, columns=['Main Diagnosis'])
df

Unnamed: 0,Main Diagnosis
0,"Pain, unspecified"
1,Low back pain
2,"Acute upper respiratory infection, unspecified"
3,Epistaxis
4,"Acute upper respiratory infection, unspecified"
...,...
93184,"Asthma, unspecified"
93185,"Acute upper respiratory infection, unspecified"
93186,"Acute upper respiratory infection, unspecified"
93187,"Cutaneous abscess, furuncle and carbuncle, uns..."


- All entries in this column are converted to lowercase.
- Punctuation and special characters are removed with a regular expression so that the diagnoses are clean and consistent.
- A few common diagnoses are mapped to shorter canonical names (exact, whole-string matches only), which reduces redundant variants before model training.

In [6]:
# Step 1: Lowercase all the entries in the 'Main Diagnosis' column
df['Main Diagnosis'] = df['Main Diagnosis'].str.lower()

# Step 2: Remove punctuation and special characters from the diagnoses
df['Main Diagnosis'] = df['Main Diagnosis'].str.replace(r'[^\w\s]', '', regex=True)

# Step 3: Standardize some common diagnoses (exact whole-string matches)
df['Main Diagnosis'] = df['Main Diagnosis'].replace({
    'acute upper respiratory infection unspecified': 'upper respiratory infection',
    'low back pain': 'back pain',
})

# Print the cleaned 'Main Diagnosis' column
print(df['Main Diagnosis'])
df.shape

0                                         pain unspecified
1                                                back pain
2                              upper respiratory infection
3                                                epistaxis
4                              upper respiratory infection
                               ...                        
93184                                   asthma unspecified
93185                          upper respiratory infection
93186                          upper respiratory infection
93187    cutaneous abscess furuncle and carbuncle unspe...
93188                          pain in limb multiple sites
Name: Main Diagnosis, Length: 93189, dtype: object


(93189, 1)
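
The three cleaning steps can also be tried in isolation on a few sample strings. The sketch below is a minimal standard-library version; the function name clean_diagnosis and the hyphenated sample 'Sickle-cell anaemia with crisis' are illustrative assumptions, not part of the notebook.

```python
import re

# Exact-match replacements, mirroring the dictionary used in Step 3
CANONICAL = {
    "acute upper respiratory infection unspecified": "upper respiratory infection",
    "low back pain": "back pain",
}

def clean_diagnosis(text: str) -> str:
    """Lowercase, strip punctuation, then apply whole-string replacements."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # deletes commas, hyphens, slashes, ...
    return CANONICAL.get(text, text)

samples = [
    "Acute upper respiratory infection, unspecified",
    "Low back pain",
    "Sickle-cell anaemia with crisis",  # hypothetical input containing a hyphen
]
cleaned = [clean_diagnosis(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(f"{raw!r} -> {out!r}")
```

Because the pattern deletes punctuation rather than replacing it with a space, hyphenated terms are fused into a single word ('sicklecell'), and a replacement only fires when the entire cleaned string matches a key.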

## 4. Tokenization
Tokenization breaks text into individual words, or tokens. A small function built on spaCy tokenizes each entry of the 'Main Diagnosis' column, with a progress bar displayed to monitor the operation. The resulting token lists are stored in a new column, 'Main Diagnosis Tokens', for use in the following steps.

In [7]:
#Tokenization
# Function to apply spaCy processing and tokenize the Main Diagnosis column
def spacy_tokenizer(text):
    doc = nlp(text)
    return [token.text for token in doc]

# Apply spaCy tokenizer to the 'Main Diagnosis' column
tqdm.pandas(desc="Tokenizing")
df['Main Diagnosis Tokens'] = df['Main Diagnosis'].progress_apply(spacy_tokenizer)

# Display the tokenized diagnosis column
print(df[['Main Diagnosis', 'Main Diagnosis Tokens']])
df.shape

Tokenizing: 100%|██████████| 93189/93189 [02:06<00:00, 737.59it/s]

                                          Main Diagnosis  \
0                                       pain unspecified   
1                                              back pain   
2                            upper respiratory infection   
3                                              epistaxis   
4                            upper respiratory infection   
...                                                  ...   
93184                                 asthma unspecified   
93185                        upper respiratory infection   
93186                        upper respiratory infection   
93187  cutaneous abscess furuncle and carbuncle unspe...   
93188                        pain in limb multiple sites   

                                   Main Diagnosis Tokens  
0                                    [pain, unspecified]  
1                                           [back, pain]  
2                        [upper, respiratory, infection]  
3                                          




(93189, 2)
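
Tokenizing 93,189 rows took about two minutes, yet many diagnoses repeat (for example 'upper respiratory infection'). One possible speed-up, sketched here under the assumption that the tokenizer is deterministic, is to process each distinct string once and reuse the result. The whitespace tokenizer is only a stand-in for spacy_tokenizer so that the example runs without a spaCy model, and tokenize_unique is a hypothetical helper name.

```python
calls = 0

def stand_in_tokenizer(text):
    """Placeholder for spacy_tokenizer; counts how often it is invoked."""
    global calls
    calls += 1
    return text.split()

def tokenize_unique(texts, tokenizer):
    """Tokenize each distinct string once, then map results back to every row."""
    cache = {text: tokenizer(text) for text in set(texts)}
    # Copy each list so that rows do not share one mutable object
    return [list(cache[text]) for text in texts]

diagnoses = [
    "pain unspecified",
    "back pain",
    "upper respiratory infection",
    "epistaxis",
    "upper respiratory infection",
]
tokens = tokenize_unique(diagnoses, stand_in_tokenizer)
print(tokens[2], tokens[4])
print(f"{calls} tokenizer calls for {len(diagnoses)} rows")
```

With pandas, the same idea could be expressed by building the cache from df['Main Diagnosis'].unique() and applying it with Series.map.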

## 5. Stop Word Removal
Following tokenization, stop words (common words that add little meaning) are removed. spaCy's built-in English stop word list is loaded and extended with 'unspecified', and a custom function filters these words out of the token lists, again tracked with a progress bar. The built-in list includes 'back', which is why 'back pain' is reduced to [pain] in the output of this step. The filtered tokens are saved in a new column, leaving only the more meaningful terms.

In [8]:
# Remove the stop words
# Copy spaCy's English stop words so the library's own set is left unchanged
from spacy.lang.en.stop_words import STOP_WORDS
stop_words = set(STOP_WORDS)
stop_words.add("unspecified")

# Function to remove stop words from the tokens
def remove_stop_words(tokens):
    return [token for token in tokens if token not in stop_words]

# Apply tqdm to monitor the process
tqdm.pandas(desc="Removing Stop Words")

# Apply the stop word removal function with a progress bar
df['Main Diagnosis Tokens Without Stopwords'] = df['Main Diagnosis Tokens'].progress_apply(remove_stop_words)

# Display the updated tokenized diagnosis column without stop words
print(df[['Main Diagnosis', 'Main Diagnosis Tokens Without Stopwords']])

Removing Stop Words: 100%|██████████| 93189/93189 [00:00<00:00, 711616.15it/s]

                                          Main Diagnosis  \
0                                       pain unspecified   
1                                              back pain   
2                            upper respiratory infection   
3                                              epistaxis   
4                            upper respiratory infection   
...                                                  ...   
93184                                 asthma unspecified   
93185                        upper respiratory infection   
93186                        upper respiratory infection   
93187  cutaneous abscess furuncle and carbuncle unspe...   
93188                        pain in limb multiple sites   

         Main Diagnosis Tokens Without Stopwords  
0                                         [pain]  
1                                         [pain]  
2                [upper, respiratory, infection]  
3                                    [epistaxis]  
4                [upper,
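
Because stop word lists are written for general English, they can remove clinically meaningful words such as 'back'. One possible adjustment, shown here as a sketch, is to work on a copy of the list, add domain-specific noise words, and subtract the terms worth keeping. GENERAL_STOP_WORDS is a small stand-in for spaCy's STOP_WORDS, and the choice of words to keep is an assumption made for illustration.

```python
# Small stand-in for spaCy's STOP_WORDS so the example runs without spaCy
GENERAL_STOP_WORDS = {"and", "in", "of", "with", "back"}

EXTRA_STOP_WORDS = {"unspecified"}  # domain noise, as added in the notebook
KEEP_WORDS = {"back"}               # hypothetical: anatomical terms to retain

# Build a new set instead of mutating the shared one
stop_words = (GENERAL_STOP_WORDS | EXTRA_STOP_WORDS) - KEEP_WORDS

def remove_stop_words(tokens):
    """Return the tokens that are not in the customized stop word set."""
    return [token for token in tokens if token not in stop_words]

print(remove_stop_words(["back", "pain"]))
print(remove_stop_words(["pain", "in", "limb", "multiple", "sites"]))
print(remove_stop_words(["asthma", "unspecified"]))
```

Whether a word like 'back' should be kept is a modeling decision; with the default list, 'back pain' and 'pain unspecified' collapse to the same single token.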




## 6. Lemmatization
Lemmatization reduces words to their base (dictionary) forms, so that inflected variants of the same word map to a single feature. A function joins each token list back into a string, runs it through spaCy, and collects the lemma of every token. It is applied to the column containing tokens without stop words, with progress monitored via a progress bar. The lemmatized tokens are stored in a new column, 'Main Diagnosis Lemmatized'.

In [9]:
# Lemmatization
# Enable progress bar for pandas
tqdm.pandas(desc="Lemmatizing Tokens")

# Function to apply spaCy lemmatization to tokens
def lemmatize_tokens(tokens):
    doc = nlp(" ".join(tokens))  # Process the tokens into a single string
    return [token.lemma_ for token in doc]  # Return the lemmatized tokens

# Apply the lemmatization function with a progress bar
df['Main Diagnosis Lemmatized'] = df['Main Diagnosis Tokens Without Stopwords'].progress_apply(lemmatize_tokens)

# Display the lemmatized tokens
print(df[['Main Diagnosis', 'Main Diagnosis Lemmatized']])

Lemmatizing Tokens: 100%|██████████| 93189/93189 [01:59<00:00, 781.21it/s]

                                          Main Diagnosis  \
0                                       pain unspecified   
1                                              back pain   
2                            upper respiratory infection   
3                                              epistaxis   
4                            upper respiratory infection   
...                                                  ...   
93184                                 asthma unspecified   
93185                        upper respiratory infection   
93186                        upper respiratory infection   
93187  cutaneous abscess furuncle and carbuncle unspe...   
93188                        pain in limb multiple sites   

                       Main Diagnosis Lemmatized  
0                                         [pain]  
1                                         [pain]  
2                [upper, respiratory, infection]  
3                                    [epistaxis]  
4                [upper,
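
To make the idea concrete, the sketch below lemmatizes with a tiny hand-written lookup table. This is not how spaCy works internally; the table entries are assumptions chosen for illustration, and unknown words are passed through unchanged.

```python
# Hypothetical lookup table: inflected form -> base form
LEMMA_TABLE = {
    "injuries": "injury",
    "wounds": "wound",
    "vomiting": "vomit",
    "wheezing": "wheeze",
}

def lookup_lemmatize(tokens):
    """Replace each token by its table entry, or keep it if it is not listed."""
    return [LEMMA_TABLE.get(token, token) for token in tokens]

print(lookup_lemmatize(["nausea", "vomiting"]))
print(lookup_lemmatize(["open", "wounds"]))
```

Mapping variants to one lemma means they share a single BoW column instead of splitting their counts across several.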




## 7. Bag of Words (BoW) Representation
To prepare the lemmatized tokens for machine learning, a Bag of Words (BoW) representation is generated. The lemmatized tokens are joined back into strings, and CountVectorizer from sklearn converts each string into a vector of term counts, limited here to the 110 most frequent terms (max_features=110). The sparse result is converted to a dense array, and the fitted vectorizer is saved using pickle so that the same vocabulary can be applied to new inputs.

In [10]:
# Create BoW representation
# Convert lemmatized tokens back to a single string for vectorization
dataset = df['Main Diagnosis Lemmatized'].apply(lambda x: ' '.join(x))
vectorizer = CountVectorizer(max_features=110)
X = vectorizer.fit_transform(dataset)
X_dense = X.toarray()

# Save the vectorizer model to a file
with open('../Models/nlp_diagnosis.pkl', 'wb') as file:
    pickle.dump(vectorizer, file)
    
# Creating a DataFrame for the BoW vectors
vector_df = pd.DataFrame(X_dense, columns=vectorizer.get_feature_names_out())
vector_df

Unnamed: 0,abdominal,abnormal,abrasion,abscess,accident,acute,allergy,anaemia,ankle,arthropathy,...,tract,traumatic,upper,urinary,uterine,vaginal,vomit,wheeze,wound,wrist
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93184,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
93185,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
93186,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
93187,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
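
What CountVectorizer computes can be illustrated without scikit-learn. The sketch below is a simplified reimplementation, not the library's code: it splits on whitespace, keeps the max_features most frequent terms (breaking ties alphabetically, which is an assumption of this sketch), and orders the columns alphabetically. The real CountVectorizer also lowercases its input and, by default, ignores single-character tokens.

```python
from collections import Counter

def fit_vocabulary(documents, max_features):
    """Pick the most frequent terms across all documents, in alphabetical order."""
    totals = Counter(tok for doc in documents for tok in doc.split())
    top = sorted(totals, key=lambda t: (-totals[t], t))[:max_features]
    return sorted(top)

def transform(documents, vocabulary):
    """Count, per document, how often each vocabulary term occurs."""
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for tok in doc.split():
            if tok in index:  # out-of-vocabulary terms are silently dropped
                row[index[tok]] += 1
        rows.append(row)
    return rows

docs = [
    "pain",
    "pain",
    "upper respiratory infection",
    "epistaxis",
    "upper respiratory infection",
]
vocabulary = fit_vocabulary(docs, max_features=4)
bow = transform(docs, vocabulary)
print(vocabulary)
for doc, row in zip(docs, bow):
    print(row, doc)
```

In this toy corpus 'epistaxis' falls outside the four-term vocabulary, so its row is all zeros and a classifier would receive no information about it. The same effect applies to any diagnosis whose terms are cut off by max_features.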


## 8. DataFrame Concatenation
The BoW vectors are concatenated with the diagnosis DataFrame (df), creating a dataset that holds both the cleaned diagnosis text and its numerical representation. Intermediate columns that are no longer needed are then removed. The final dataset is saved to a new CSV file, ready for model training and evaluation.

In [11]:
# Concatenating the BoW vectors back to the original DataFrame
new_df = pd.concat([df.reset_index(drop=True), vector_df.reset_index(drop=True)], axis=1)
new_df

Unnamed: 0,Main Diagnosis,Main Diagnosis Tokens,Main Diagnosis Tokens Without Stopwords,Main Diagnosis Lemmatized,abdominal,abnormal,abrasion,abscess,accident,acute,...,tract,traumatic,upper,urinary,uterine,vaginal,vomit,wheeze,wound,wrist
0,pain unspecified,"[pain, unspecified]",[pain],[pain],0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,back pain,"[back, pain]",[pain],[pain],0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,upper respiratory infection,"[upper, respiratory, infection]","[upper, respiratory, infection]","[upper, respiratory, infection]",0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,epistaxis,[epistaxis],[epistaxis],[epistaxis],0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,upper respiratory infection,"[upper, respiratory, infection]","[upper, respiratory, infection]","[upper, respiratory, infection]",0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93184,asthma unspecified,"[asthma, unspecified]",[asthma],[asthma],0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
93185,upper respiratory infection,"[upper, respiratory, infection]","[upper, respiratory, infection]","[upper, respiratory, infection]",0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
93186,upper respiratory infection,"[upper, respiratory, infection]","[upper, respiratory, infection]","[upper, respiratory, infection]",0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
93187,cutaneous abscess furuncle and carbuncle unspe...,"[cutaneous, abscess, furuncle, and, carbuncle,...","[cutaneous, abscess, furuncle, carbuncle]","[cutaneous, abscess, furuncle, carbuncle]",0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
# Remove intermediate columns that are no longer needed
new_df=new_df.drop(columns=['Main Diagnosis Tokens','Main Diagnosis Tokens Without Stopwords','Main Diagnosis Lemmatized'])

# Saving the updated DataFrame with BoW vectors and Main Diagnosis
new_df.to_csv('../Data/updated_dataset.csv', index=False)
print("Updated dataset saved to '../Data/updated_dataset.csv'.")

Updated dataset saved to '../Data/updated_dataset.csv'.
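
pd.concat with axis=1 aligns rows by index label, not by position, which is why both frames are reset before they are joined. The sketch below shows what can go wrong after filtering; the toy frames and their values are made up for illustration.

```python
import pandas as pd

# Toy data: filtering leaves the index labels 0 and 2
left = pd.DataFrame({"diagnosis": ["pain", "unknown", "epistaxis"]})
left = left[left["diagnosis"] != "unknown"]

# Features built afterwards get a fresh 0..n-1 index
right = pd.DataFrame({"feature": [10, 20]})

misaligned = pd.concat([left, right], axis=1)
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned)  # three rows, with NaN where the labels do not match
print(aligned)     # two rows, paired by position
```

Resetting the index right after filtering, as done when the dataset is loaded, keeps later concatenations row-aligned.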


## 9. Test Functionality for Model Training
In this section, the prepared dataset is used to train a machine learning model with the goal of testing whether the Bag of Words (BoW) vector can accurately predict new inputs.

In [13]:
# Combine the original dataset with the diagnosis text and its BoW vectors
new_df = pd.concat([data, new_df], axis=1)
labels = new_df['Severity Level']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_dense, labels, test_size=0.2, random_state=42)

# Train a Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Predicting new input, preprocessed with the same steps as the training data
new_input = ["elbow pain"]
new_tokens = lemmatize_tokens(remove_stop_words(spacy_tokenizer(new_input[0])))
new_input_vector = vectorizer.transform([' '.join(new_tokens)]).toarray()
prediction = model.predict(new_input_vector)

print("Predicted label for the new input:", prediction)

Predicted label for the new input: ['Level Ⅳ']


In [14]:
y_pred = model.predict(X_test)

# Compare the predicted labels (y_pred) with the actual labels (y_test).
# The result is stored under a new name so the imported function is not overwritten.
report = classification_report(y_test, y_pred)

# Print the classification report
print(report)

              precision    recall  f1-score   support

     Level Ⅰ       0.00      0.00      0.00         3
     Level Ⅱ       0.07      0.01      0.02        90
     Level Ⅲ       0.56      0.42      0.48      4819
     Level Ⅳ       0.77      0.83      0.80     12963
     Level Ⅴ       0.16      0.22      0.18       763

    accuracy                           0.69     18638
   macro avg       0.31      0.30      0.30     18638
weighted avg       0.69      0.69      0.69     18638
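
The gap between the macro and weighted averages follows directly from the class imbalance. The short calculation below recomputes both from the rounded per-class F1 scores and supports printed in the report, and adds the accuracy of a trivial baseline that always predicts the most frequent class. The baseline is an added point of comparison, not something the notebook computes.

```python
# Per-class (F1, support), copied from the classification report
report_rows = {
    "Level Ⅰ": (0.00, 3),
    "Level Ⅱ": (0.02, 90),
    "Level Ⅲ": (0.48, 4819),
    "Level Ⅳ": (0.80, 12963),
    "Level Ⅴ": (0.18, 763),
}

total = sum(support for _, support in report_rows.values())

# Macro average: every class counts equally
macro_f1 = sum(f1 for f1, _ in report_rows.values()) / len(report_rows)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in report_rows.values()) / total

# Accuracy of always predicting the largest class
majority_baseline = max(support for _, support in report_rows.values()) / total

print(f"total support:     {total}")
print(f"macro F1:          {macro_f1:.2f}")
print(f"weighted F1:       {weighted_f1:.2f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

Always predicting Level Ⅳ would already be correct for about 69.6% of the test records, so the reported accuracy of 0.69 mostly reflects the dominant class. The macro average of 0.30 is the more informative figure here.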



## 10. Conclusion
In this notebook, we applied natural language processing (NLP) techniques to diagnosis data and tested whether a Bag of Words (BoW) representation can support triage-level prediction in emergency departments. Cleaning, tokenization, stop word removal, and lemmatization prepared the text for machine learning, and a Naive Bayes classifier was trained to check that the BoW vectors can be used to classify new input diagnoses.
The test model reaches an accuracy of 0.69, but its macro-averaged F1 score is only 0.30. Level Ⅳ, which accounts for roughly 70% of the test set (12,963 of 18,638 records), is predicted well (F1 of 0.80), while Levels Ⅰ and Ⅱ are almost never identified. This model therefore serves only as a preliminary test of the vectorization technique. A dedicated notebook will be developed for more comprehensive model development, focusing on refining the predictive model, optimizing parameters, and improving overall performance. This future work will aim to build a robust system that can reliably aid triage decision-making in real-world healthcare settings.

In [15]:
import pickle
import re

import pandas as pd
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Copy spaCy's stop words and add the same extra term used during training
stop_words = set(STOP_WORDS)
stop_words.add("unspecified")

# Function for preprocessing new input (same steps as sections 3 to 6)
def preprocess_input(new_input):
    # Step 1: Lowercase and remove punctuation
    text = re.sub(r'[^\w\s]', '', new_input.lower())

    # Step 2: Tokenization
    tokens = [token.text for token in nlp(text)]

    # Step 3: Stop Word Removal
    tokens = [token for token in tokens if token not in stop_words]

    # Step 4: Lemmatization
    lemmatized_tokens = [token.lemma_ for token in nlp(" ".join(tokens))]

    return lemmatized_tokens

# Example of new diagnosis input
new_diagnosis = "sicklecell anaemia with crisis"

# Preprocess the new input
processed_input = preprocess_input(new_diagnosis)

# Display the processed tokens
print("Processed Input Tokens:", processed_input)

# Load the saved CountVectorizer
with open('../Models/nlp_diagnosis.pkl', 'rb') as file:
    vectorizer = pickle.load(file)

# Join the lemmatized tokens into a string and convert it to a BoW vector
input_vector = vectorizer.transform([' '.join(processed_input)]).toarray()

# Display the BoW vector
print("BoW Vector:", input_vector)

# Convert the BoW array to a DataFrame
vector_df = pd.DataFrame(input_vector, columns=vectorizer.get_feature_names_out())
vector_df

Processed Input Tokens: ['sicklecell', 'anaemia', 'crisis']
BoW Vector: [[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0]]


Unnamed: 0,abdominal,abnormal,abrasion,abscess,accident,acute,allergy,anaemia,ankle,arthropathy,...,tract,traumatic,upper,urinary,uterine,vaginal,vomit,wheeze,wound,wrist
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
