# Introduction

Biomedical text mining and natural language processing (BioNLP) deals with processing text from journals, medical records, and other biomedical documents to extract information, relationships, and insights.

# Modules Used:

**scispaCy:**

`scispaCy` is a Python package containing `spaCy` models for processing biomedical, scientific, or clinical text. It includes models for tagging, parsing, named entity recognition (NER), text classification, and more.

Here, *NER* models are used to identify drug and disease names mentioned in a medical transcription dataset.

**en_core_sci_sm:**
A full spaCy pipeline for biomedical data. F1 score = 78.01

**en_core_sci_md:**
A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors (DISEASE, CHEMICAL).

F1 Score = 84.28

UAS (Unlabeled Attachment Score):90.08

LAS (Labeled Attachment Score):88.16

**en_ner_bionlp13cg_md:**

A spaCy NER model trained on the BIONLP13CG corpus. Entity types: AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE

F1 Score = 77.84
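
The scores quoted above can be illustrated with a small sketch. It uses the standard textbook definitions of UAS, LAS, and entity-level F1 on made-up toy data. The function names, the toy parses, and the exact-match criterion for entity spans are assumptions for illustration, not the evaluation code behind the reported numbers.

```python
# Toy data (hypothetical): one (head index, dependency label) pair per token.
gold_parse = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred_parse = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "nmod")]


def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages.

    UAS counts tokens whose predicted head is correct.
    LAS counts tokens whose predicted head AND label are both correct.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    n = len(gold)
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    return 100 * head_hits / n, 100 * full_hits / n


def f1_score(true_spans, pred_spans):
    """Entity-level F1 (percentage) over exact (start, end, label) matches."""
    true_set, pred_set = set(true_spans), set(pred_spans)
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 100 * 2 * precision * recall / (precision + recall)


# Toy entity spans (hypothetical): the third prediction has the wrong label.
true_spans = [(0, 6, "DISEASE"), (10, 17, "DISEASE"), (30, 34, "CHEMICAL")]
pred_spans = [(0, 6, "DISEASE"), (10, 17, "DISEASE"), (30, 34, "DISEASE")]

uas, las = attachment_scores(gold_parse, pred_parse)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
print(f"F1 = {f1_score(true_spans, pred_spans):.2f}")
```

LAS can never exceed UAS, because every token counted for LAS must also have the correct head.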


In [103]:
# !pip install scispacy
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bionlp13cg_md-0.5.4.tar.gz

In [104]:
import pandas as pd
import scispacy
import spacy
import re

In [105]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [106]:
import sys
sys.path.append('/content/drive/My Drive/Colab Notebooks')

In [107]:
# Define the file path
# source of dataset https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions
file_path = '/content/drive/My Drive/Colab Notebooks/dataset/mtsamples.csv'

medical_dataframe = pd.read_csv(file_path, index_col=0)

medical_dataframe.info()


<class 'pandas.core.frame.DataFrame'>
Index: 4999 entries, 0 to 4998
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   description        4999 non-null   object
 1   medical_specialty  4999 non-null   object
 2   sample_name        4999 non-null   object
 3   transcription      4966 non-null   object
 4   keywords           3931 non-null   object
dtypes: object(5)
memory usage: 234.3+ KB


In [108]:
# Drop rows without a transcription, since NER needs the text
medical_dataframe.dropna(subset=['transcription'], inplace=True)

In [109]:
medical_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4966 entries, 0 to 4998
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   description        4966 non-null   object
 1   medical_specialty  4966 non-null   object
 2   sample_name        4966 non-null   object
 3   transcription      4966 non-null   object
 4   keywords           3898 non-null   object
dtypes: object(5)
memory usage: 232.8+ KB
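
The two `info()` summaries can be cross-checked with a little arithmetic. The sketch below only uses the counts printed above; the variable names are made up for illustration.

```python
# Counts copied from the two info() outputs above.
rows_before = 4999
transcription_non_null_before = 4966
keywords_non_null_before = 3931

rows_after = 4966
keywords_non_null_after = 3898

# Rows removed by dropna(subset=['transcription'])
dropped_rows = rows_before - rows_after
# Rows that had no transcription to begin with
missing_transcriptions = rows_before - transcription_non_null_before
# Keyword values that disappeared together with the dropped rows
keywords_lost = keywords_non_null_before - keywords_non_null_after
# Rows still lacking keywords, before and after the drop
missing_keywords_before = rows_before - keywords_non_null_before
missing_keywords_after = rows_after - keywords_non_null_after

print(dropped_rows, missing_transcriptions, keywords_lost)
print(missing_keywords_before, missing_keywords_after)
```

All three differences on the first line are equal, so every dropped row had keywords but no transcription, and the number of rows with missing keywords is unchanged by the drop.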


In [110]:
print("Number of medical specialties in this dataset:")
medical_dataframe['medical_specialty'].nunique()

Number of medical specialties in this dataset:


40

In [111]:
print(list(medical_dataframe['medical_specialty'].unique()))

[' Allergy / Immunology', ' Bariatrics', ' Cardiovascular / Pulmonary', ' Neurology', ' Dentistry', ' Urology', ' General Medicine', ' Surgery', ' Speech - Language', ' SOAP / Chart / Progress Notes', ' Sleep Medicine', ' Rheumatology', ' Radiology', ' Psychiatry / Psychology', ' Podiatry', ' Physical Medicine - Rehab', ' Pediatrics - Neonatal', ' Pain Management', ' Orthopedic', ' Ophthalmology', ' Office Notes', ' Obstetrics / Gynecology', ' Neurosurgery', ' Nephrology', ' Letters', ' Lab Medicine - Pathology', ' IME-QME-Work Comp etc.', ' Hospice - Palliative Care', ' Hematology - Oncology', ' Gastroenterology', ' ENT - Otolaryngology', ' Endocrinology', ' Emergency Room Reports', ' Discharge Summary', ' Diets and Nutritions', ' Dermatology', ' Cosmetic / Plastic Surgery', ' Consult - History and Phy.', ' Chiropractic', ' Autopsy']
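
Every specialty label in the output above starts with a space, so an exact comparison such as `== 'Surgery'` would not match. A minimal sketch of normalizing the labels follows. It works on a plain list with three labels copied from the output; applying the same idea to the DataFrame column with `medical_dataframe['medical_specialty'].str.strip()` is a suggestion, not something the notebook does.

```python
# Three labels copied from the unique() output above, leading space included.
raw_labels = [' Allergy / Immunology', ' Bariatrics', ' Surgery']

# Remove leading and trailing whitespace from each label.
clean_labels = [label.strip() for label in raw_labels]

print(clean_labels)
print('Surgery' in raw_labels, 'Surgery' in clean_labels)
```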


In [112]:
print("Number of unique sample names in this dataset:")
medical_dataframe['sample_name'].nunique()

Number of unique sample names in this dataset:


2360

In [113]:
medical_dataframe_small = medical_dataframe.sample(n=100, replace=False, random_state=42)
medical_dataframe_small.info()
medical_dataframe_small.head()

<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 3162 to 3581
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   description        100 non-null    object
 1   medical_specialty  100 non-null    object
 2   sample_name        100 non-null    object
 3   transcription      100 non-null    object
 4   keywords           78 non-null     object
dtypes: object(5)
memory usage: 4.7+ KB


| | description | medical_specialty | sample_name | transcription | keywords |
|---|---|---|---|---|---|
| 3162 | Markedly elevated PT INR despite stopping Cou... | Hematology - Oncology | Hematology Consult - 1 | HISTORY OF PRESENT ILLNESS:, The patient is w... | |
| 1981 | Intercostal block from fourth to tenth interc... | Pain Management | Intercostal block - 1 | PREPROCEDURE DIAGNOSIS:, Chest pain secondary... | pain management, xylocaine, marcaine, intercos... |
| 1361 | The patient is a 65-year-old female who under... | SOAP / Chart / Progress Notes | Lobectomy - Followup | HISTORY OF PRESENT ILLNESS: , The patient is a... | soap / chart / progress notes, non-small cell ... |
| 3008 | Construction of right upper arm hemodialysis ... | Nephrology | Hemodialysis Fistula Construction | PREOPERATIVE DIAGNOSIS: , End-stage renal dise... | nephrology, end-stage renal disease, av dialys... |
| 4943 | Bronchoscopy with brush biopsies. Persistent... | Cardiovascular / Pulmonary | Bronchoscopy - 8 | PREOPERATIVE DIAGNOSIS: , Persistent pneumonia... | cardiovascular / pulmonary, persistent pneumon... |


In [114]:
# Take the first transcription of the subsample as a test input

sample_transcription = medical_dataframe_small['transcription'].iloc[0]


# Sample Output

For the first sample of the subsample (position `0`), the output begins as follows:

* TEXT START END ENTITY TYPE
* iron-deficiency anemia 79 101 DISEASE
* blood loss 117 127 DISEASE
* colitis 133 140 DISEASE
* iron 203 207 CHEMICAL
* colitis 286 293 DISEASE
* hematoma 391 399 DISEASE
* ...

*This transcription contains many entities, including drug, disease, and exam names.*

# Named entity recognition (NER)
Named entity recognition (NER) is a subtask of natural language processing used to identify and classify named entities mentioned in unstructured text into pre-defined categories.


The NER model used here, `en_ner_bc5cdr_md`, is trained on the BC5CDR corpus. This corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases, and 3116 chemical-disease interactions.
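
To make the output format concrete, here is a toy sketch of what "identify and classify entities" produces: a text span, its start and end character offsets, and a category. The lexicon lookup below is a deliberately simple stand-in and is not how `en_ner_bc5cdr_md` works (that model is statistical and learned from the corpus). The lexicon, the function name, and the example sentence are all hypothetical.

```python
import re

# Hypothetical mini-lexicon mapping surface forms to entity types.
LEXICON = {
    "anemia": "DISEASE",
    "colitis": "DISEASE",
    "iron": "CHEMICAL",
    "vancomycin": "CHEMICAL",
}

# One case-insensitive pattern that matches any lexicon term as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in LEXICON) + r")\b",
    re.IGNORECASE,
)


def tag_entities(text):
    """Return (text, start_char, end_char, label) tuples for lexicon matches."""
    return [
        (m.group(0), m.start(), m.end(), LEXICON[m.group(0).lower()])
        for m in _PATTERN.finditer(text)
    ]


text = "The patient has anemia and colitis and was started on iron."
entities = tag_entities(text)

for ent_text, start, end, label in entities:
    # The offsets index into the original string: text[start:end] is the span.
    print(ent_text, start, end, label, text[start:end] == ent_text)
```

The `(text, start, end, label)` tuples have the same shape as the rows built from `ent.text`, `ent.start_char`, `ent.end_char`, and `ent.label_` in the cells that follow.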

In [115]:
nlp = spacy.load("en_ner_bc5cdr_md")

`spacy.load()` returns a `Language` object containing all components and data needed to process text. This object is usually called `nlp` in the documentation and tutorials.

In [116]:
# all identified entities
# Named entities detected in the document using doc.ents

doc = nlp(sample_transcription)

# Initialize lists to store entities
disease_entities = []
chemical_entities = []

# Iterate over named entities
print("TEXT", "START", "END", "ENTITY TYPE")
for ent in doc.ents:
    # Check if the entity is labeled as "DISEASE" or "CHEMICAL"
    if ent.label_ == 'DISEASE':
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
        disease_entities.append((ent.text, ent.start_char, ent.end_char, ent.label_))
    elif ent.label_ == 'CHEMICAL':
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
        chemical_entities.append((ent.text, ent.start_char, ent.end_char, ent.label_))

# Convert lists to DataFrames
disease_df = pd.DataFrame(disease_entities, columns=["TEXT", "START", "END", "ENTITY TYPE"])
chemical_df = pd.DataFrame(chemical_entities, columns=["TEXT", "START", "END", "ENTITY TYPE"])

# Export DataFrames to Excel files
disease_df.to_excel("/content/drive/My Drive/Colab Notebooks/BioNLP_result/disease_entities.xlsx", index=False)
chemical_df.to_excel("/content/drive/My Drive/Colab Notebooks/BioNLP_result/chemical_entities.xlsx", index=False)

TEXT START END ENTITY TYPE
iron-deficiency anemia 79 101 DISEASE
blood loss 117 127 DISEASE
colitis 133 140 DISEASE
iron 203 207 CHEMICAL
colitis 286 293 DISEASE
hematoma 391 399 DISEASE
venous thrombosis 473 490 DISEASE
DVT 492 495 DISEASE
septic 537 543 DISEASE
pelvic hematoma 742 757 DISEASE
vancomycin 781 791 CHEMICAL
infectious disease 873 891 DISEASE
Coumadin 1452 1460 CHEMICAL
vitamin K 1503 1512 CHEMICAL
uric acid 1830 1839 CHEMICAL
bilirubin 1853 1862 CHEMICAL
Creatinine 1911 1921 CHEMICAL
creatinine 1951 1961 CHEMICAL
Folic acid 2079 2089 CHEMICAL
Iron 2103 2107 CHEMICAL
heparin 2322 2329 CHEMICAL
loperamide 2339 2349 CHEMICAL
niacin 2351 2357 CHEMICAL
pantoprazole 2359 2371 CHEMICAL
Diovan 2373 2379 CHEMICAL
caspofungin 2400 2411 CHEMICAL
daptomycin 2413 2423 CHEMICAL
fentanyl 2436 2444 CHEMICAL
morphine 2448 2456 CHEMICAL
pain 2464 2468 DISEASE
Compazine 2474 2483 CHEMICAL
Zofran 2487 2493 CHEMICAL
epistaxis 2629 2638 DISEASE
bleeding 3057 3065 DISEASE
adenopathy 3156 3166 
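
Once entities are collected as tuples, a quick frequency summary is often more readable than the raw listing. The sketch below uses `collections.Counter` on the first six rows of the output above. Lower-casing the text before counting is an assumption made here so that variants such as `Creatinine` and `creatinine` would be merged.

```python
from collections import Counter

# First six rows of the output above, in (text, start, end, label) form.
entities = [
    ("iron-deficiency anemia", 79, 101, "DISEASE"),
    ("blood loss", 117, 127, "DISEASE"),
    ("colitis", 133, 140, "DISEASE"),
    ("iron", 203, 207, "CHEMICAL"),
    ("colitis", 286, 293, "DISEASE"),
    ("hematoma", 391, 399, "DISEASE"),
]

# How many mentions of each entity type were found
label_counts = Counter(label for _, _, _, label in entities)

# How often each distinct (lower-cased) mention occurs, per entity type
mention_counts = Counter((text.lower(), label) for text, _, _, label in entities)

print(label_counts)
print(mention_counts.most_common(1))
```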

In [117]:
def extract_entities_to_excel(specific_medical_speciality_dataset, speciality_name):
    # Initialize lists to store entities

    disease_entities = []
    chemical_entities = []

    formatted_speciality_name = re.sub(r'\W+', '', speciality_name)
    print(formatted_speciality_name+': \n')

    # Iterate over each row in the DataFrame
    for index, row in specific_medical_speciality_dataset.iterrows():
        # Run the NER pipeline on this transcription
        doc = nlp(row['transcription'])

        # Print a column header before this sample's entities
        print("Sample Name", "TEXT", "START", "END", "ENTITY TYPE")

        # Iterate over the entities found in this transcription
        for ent in doc.ents:
            # Check if the entity is labeled as "DISEASE" or "CHEMICAL"
            if ent.label_ == 'DISEASE':
                print(row['sample_name'], ent.text, ent.start_char, ent.end_char, ent.label_)
                disease_entities.append((row['sample_name'], ent.text, ent.start_char, ent.end_char, ent.label_))
            elif ent.label_ == 'CHEMICAL':
                print(row['sample_name'],ent.text, ent.start_char, ent.end_char, ent.label_)
                chemical_entities.append((row['sample_name'], ent.text, ent.start_char, ent.end_char, ent.label_))

    # Convert lists to DataFrames
    disease_df = pd.DataFrame(disease_entities, columns=["SAMPLE_NAME","TEXT", "START", "END", "ENTITY TYPE"])
    chemical_df = pd.DataFrame(chemical_entities, columns=["SAMPLE_NAME","TEXT", "START", "END", "ENTITY TYPE"])

    # Export DataFrames to Excel files
    disease_df.to_excel(f"/content/drive/My Drive/Colab Notebooks/BioNLP_result/{formatted_speciality_name}_disease_entities.xlsx", index=False)
    print("\nDisease Data saved successfully")
    chemical_df.to_excel(f"/content/drive/My Drive/Colab Notebooks/BioNLP_result/{formatted_speciality_name}_chemical_entities.xlsx", index=False)
    print("\nChemical Data saved successfully")
    print("\n\n")
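
The `re.sub(r'\W+', '', speciality_name)` call in the function above is what turns a specialty label into a file-name-safe string: it deletes every run of non-word characters (spaces, slashes, hyphens, periods). A small sketch on labels from the dataset shows the effect. The helper name `format_specialty_name` is made up for illustration.

```python
import re


def format_specialty_name(name):
    """Remove every run of non-word characters, as the function above does."""
    return re.sub(r'\W+', '', name)


# Labels copied from the unique() output earlier in the notebook.
labels = [
    ' Cardiovascular / Pulmonary',
    ' IME-QME-Work Comp etc.',
    ' Consult - History and Phy.',
]

for label in labels:
    print(repr(label), '->', format_specialty_name(label))
```

Because word boundaries are removed as well, two different labels could in principle collapse to the same file name. That does not appear to happen with the 40 labels listed earlier.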

In [118]:
# to extract all the speciality specific data
medical_specialties = medical_dataframe['medical_specialty'].unique()

for specialty in medical_specialties:
    # Get the sample names and transcriptions for the current medical specialty
    specialty_dataset = medical_dataframe.loc[
        medical_dataframe['medical_specialty'] == specialty,
        ['sample_name', 'transcription']
    ]

    # Call the function to extract entities and create Excel files
    extract_entities_to_excel(specialty_dataset, specialty)



Streaming output truncated to the last 5000 lines.
 Consult - Stasis Ulcer  Primaxin 4370 4378 CHEMICAL
Sample Name TEXT START END ENTITY TYPE
 Consult - Syncope  syncope 202 209 DISEASE
 Consult - Syncope  seizure 255 262 DISEASE
 Consult - Syncope  chest pain 343 353 DISEASE
 Consult - Syncope  shortness of breath 357 376 DISEASE
 Consult - Syncope  palpitation 382 393 DISEASE
 Consult - Syncope  hypertension 527 539 DISEASE
 Consult - Syncope  diabetes mellitus 556 573 DISEASE
 Consult - Syncope  Cholesterol 588 599 CHEMICAL
 Consult - Syncope  coronary artery disease 658 681 DISEASE
 Consult - Syncope  noncontributory.,PAST 699 720 DISEASE
 Consult - Syncope  Hypertension 740 752 DISEASE
 Consult - Syncope  hyperlipidemia 754 768 DISEASE
 Consult - Syncope  Parkinson's 794 805 DISEASE
 Consult - Syncope  Parkinson's tremor 812 830 DISEASE
 Consult - Syncope  evaluation.,PAST 854 870 CHEMICAL
 Consult - Syncope  Nonsignificant.,MEDICATIONS:,1 960 990 DISEASE
 Consult -
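
Some spans in the output above, such as `noncontributory.,PAST` and `Nonsignificant.,MEDICATIONS:,1`, run across section boundaries. In this dataset, sections appear to be joined by a comma directly after a period or colon, with no space. One possible mitigation, sketched below under that assumption, is to normalize these delimiters before passing the text to `nlp`. The helper name is hypothetical, this step is not part of the notebook, and it shifts character offsets relative to the raw transcription.

```python
import re


def normalize_section_delimiters(text):
    """Replace '.,' and ':,' (plus any following whitespace) with '. ' and ': '."""
    return re.sub(r'([.:]),\s*', r'\1 ', text)


# Fragments copied from the notebook output and the data preview.
examples = [
    "noncontributory.,PAST",
    "Nonsignificant.,MEDICATIONS:,1",
    "HISTORY OF PRESENT ILLNESS:, The patient",
]

for fragment in examples:
    print(repr(fragment), '->', repr(normalize_section_delimiters(fragment)))
```

Whether this actually improves the recognized entities would need to be checked by rerunning the model on the cleaned text.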