# Intent Detection and Entity Extraction from BioMedical Literature

## Overview

This Python notebook performs named entity extraction on medical text using token classification with the Hugging Face Transformers library. It runs several Transformers models over an input text file and saves the per-token results as JSON.

## Dependencies

- Python >= 3.6
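- transformers
- torch (PyTorch backend required by `AutoModelForTokenClassification`)
- json (part of the Python standard library; no installation needed)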

## Installation

Ensure you have the necessary dependencies installed. You can install them using pip:

```bash
pip install transformers
```


In [None]:
import json
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline


In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
# import sys
# sys.path.append('/content/drive/My Drive/Colab Notebooks')

# Binder-PubMedBERT Token Classification (NER) Documentation

## Overview

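This section of the notebook loads tokenizers and models for Named Entity Recognition (NER). The checkpoint stored under the `binder_` names is BlueBERT (`bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12`), a BERT model pre-trained on PubMed abstracts and MIMIC-III clinical notes. None of the checkpoints loaded here ships a fine-tuned token-classification head, so each classifier layer is randomly initialized, as the loading warnings state, and the resulting labels and scores are not meaningful until the model is fine-tuned on a labeled dataset.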


In [5]:
# Load the model and tokenizer for BINDER-PubMedBERT (NER)
binder_tokenizer = AutoTokenizer.from_pretrained("bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12")
binder_model = AutoModelForTokenClassification.from_pretrained("bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12")

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
# Load the model and tokenizer for RoBERTa (Intent Detection)
roberta_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
roberta_model = AutoModelForTokenClassification.from_pretrained("roberta-base")

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
# Load the model and tokenizer for PubMedBERT (NER)
pubmed_tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
pubmed_model = AutoModelForTokenClassification.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")

Some weights of BertForTokenClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
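
All three checkpoints above report newly initialized `classifier` weights. The sketch below is a rough illustration of why such a head tends to produce top-label scores only slightly above 0.5 when there are two labels. It does not load any model: the unit-variance Gaussian hidden states, the hidden size of 768, the initialization standard deviation of 0.02, and the names `softmax` and `random_head_scores` are all assumptions made for the illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_head_scores(num_tokens, hidden_size=768, num_labels=2, init_std=0.02, seed=0):
    """Top-label probability per token for an untrained linear head.

    Assumptions (not taken from the notebook): hidden states are roughly
    unit-variance Gaussians and head weights are drawn from N(0, init_std**2).
    """
    rng = random.Random(seed)
    weights = [
        [rng.gauss(0.0, init_std) for _ in range(hidden_size)]
        for _ in range(num_labels)
    ]
    scores = []
    for _ in range(num_tokens):
        hidden = [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
        scores.append(max(softmax(logits)))
    return scores


scores = random_head_scores(num_tokens=20)
print([round(s, 2) for s in scores])
```

With two labels the top score can never fall below 0.5, so a fixed cut-off applied to these scores mostly measures noise in the untrained head rather than anything about the text.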


In [8]:
# Create token-classification ("ner") pipelines. The RoBERTa "intent" pipeline is
# also token-level; sentence intent is derived later from its per-token labels.
binder_ner_pipeline = pipeline("ner", model=binder_model, tokenizer=binder_tokenizer)
pubmed_ner_pipeline = pipeline("ner", model=pubmed_model, tokenizer=pubmed_tokenizer)
roberta_intent_pipeline = pipeline("ner", model=roberta_model, tokenizer=roberta_tokenizer)

In [9]:
# # Sample text
# paragraph = ""
# text = "However, it's important to consult a healthcare professional if symptoms persist or worsen, as they could indicate a more serious underlying condition"
# text4 = "On the other hand, individuals might ask about the best practices for maintaining a healthy lifestyle, including diet and exercise routines."
# text3 = "Additionally, concerns about mental health issues, such as anxiety or depression, are frequently raised, prompting discussions about therapy options or medication management."
# text1 = "Pharmacokinetic properties of abacavir were not altered by the addition of either lamivudine or zidovudine."
# text2 = "People may also inquire about preventive measures against contagious diseases, especially during flu season or outbreaks."

In [10]:
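def read_txt_file(file_path):
    """Return the contents of a text file, or None if the file does not exist."""
    try:
        # Read as UTF-8 so the result does not depend on the platform default encoding
        with open(file_path, "r", encoding="utf-8") as file:
            return file.read()
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return None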

# Example usage:
# file_path = "input.txt"  # Change to the path of your text file



In [11]:
file_path = "medical_data.txt"

paragraph = read_txt_file(file_path)
if paragraph:
    print("Text from file:")
    print(paragraph)

Text from file:
Recent studies have shown promising results in the treatment of pancreatic cancer using targeted therapies. The combination of gemcitabine and nab-paclitaxel has demonstrated improved survival rates in patients with advanced pancreatic adenocarcinoma. Additionally, immunotherapy has emerged as a potential breakthrough in cancer treatment, with checkpoint inhibitors showing efficacy in various malignancies, including melanoma and non-small cell lung cancer. On the other hand, individuals often seek medical advice for common ailments such as the common cold or seasonal allergies. Symptoms like sneezing, coughing, and congestion are commonly associated with these conditions. Over-the-counter remedies such as antihistamines and decongestants are commonly recommended for symptom relief. I love to go to school.


In [12]:
def classify_entities_paragraph_pubmed(paragraph):
    # Tokenize the paragraph into sentences
    sentences = paragraph.split(". ")

    # Initialize a dictionary to store the entities for each sentence
    sentence_entities = {}

    # Process each sentence separately
    for i, sentence in enumerate(sentences):
        # Perform named entity recognition using PubMedBERT
        pubmed_ner_results = pubmed_ner_pipeline(sentence)

        # Initialize lists to store medical and generic entities for this sentence
        medical_entities = []
        generic_entities = []

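        # Bucket tokens by confidence: scores above the threshold are treated as
        # "medical", the rest as "generic" (the predicted label itself is ignored)
        for entity in pubmed_ner_results:
            if entity['score'] > 0.65:  # Adjust score threshold as needed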
                medical_entities.append(entity['word'])
            else:
                generic_entities.append(entity['word'])

        # Store the classified entities for this sentence
        sentence_entities[f"Sentence {i+1}"] = {"Medical Entities": medical_entities, "Generic Entities": generic_entities}

    # Return the dictionary containing entities for each sentence
    return sentence_entities
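
The function above splits on the literal string `". "`, which drops the period from every sentence except the last and ignores `?` and `!`. A minimal regex-based alternative is sketched below; the name `split_sentences` and the boundary rule (terminal punctuation, whitespace, then an uppercase letter or digit) are assumptions for illustration, not part of the notebook.

```python
import re

# Split after '.', '!' or '?' when followed by whitespace and an uppercase letter or digit.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_sentences(paragraph):
    """Split a paragraph into sentences, keeping the terminal punctuation."""
    parts = _SENTENCE_BOUNDARY.split(paragraph.strip())
    return [part for part in parts if part]


sample = (
    "Symptoms like sneezing, coughing, and congestion are commonly associated "
    "with these conditions. I love to go to school."
)
print(split_sentences(sample))
```

This rule still splits after abbreviations that are followed by a capitalized word, so a dedicated sentence tokenizer may be preferable for real biomedical text.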

In [13]:
sentence_entities = classify_entities_paragraph_pubmed(paragraph)

# Print the entities for each sentence
for sentence, entities in sentence_entities.items():
    print(f"{sentence}:")
    print("Medical Entities detected by PubMedBERT:", entities["Medical Entities"])
    print("Generic Entities detected by PubMedBERT:", entities["Generic Entities"])
    print()

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Sentence 1:
Medical Entities detected by PubMedBERT: []
Generic Entities detected by PubMedBERT: ['recent', 'studies', 'have', 'shown', 'promising', 'results', 'in', 'the', 'treatment', 'of', 'pancreatic', 'cancer', 'using', 'targeted', 'therapies']

Sentence 2:
Medical Entities detected by PubMedBERT: ['pancreatic']
Generic Entities detected by PubMedBERT: ['the', 'combination', 'of', 'gemcitabine', 'and', 'nab', '-', 'paclitaxel', 'has', 'demonstrated', 'improved', 'survival', 'rates', 'in', 'patients', 'with', 'advanced', 'adenocarcinoma']

Sentence 3:
Medical Entities detected by PubMedBERT: []
Generic Entities detected by PubMedBERT: ['additionally', ',', 'immunotherapy', 'has', 'emerged', 'as', 'a', 'potential', 'breakthrough', 'in', 'cancer', 'treatment', ',', 'with', 'checkpoint', 'inhibitors', 'showing', 'efficacy', 'in', 'various', 'malignancies', ',', 'including', 'melanoma', 'and', 'non', '-', 'small', 'cell', 'lung', 'cancer']

Sentence 4:
Medical Entities detected by PubM

In [14]:
def classify_entities_binder_paragraph(paragraph):
    # Tokenize the paragraph into sentences
    sentences = paragraph.split(". ")

    # Initialize a dictionary to store the entities for each sentence
    sentence_entities = {}

    # Process each sentence separately
    for i, sentence in enumerate(sentences):
        # Perform named entity recognition using BINDER-PubMedBERT
        binder_ner_results = binder_ner_pipeline(sentence)

        # Initialize lists to store medical and generic entities for this sentence
        medical_entities = []
        generic_entities = []

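        # Bucket tokens by confidence: scores above the threshold are treated as
        # "medical", the rest as "generic" (the predicted label itself is ignored)
        for entity in binder_ner_results:
            if entity['score'] > 0.65:  # Adjust score threshold as needed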
                medical_entities.append(entity['word'])
            else:
                generic_entities.append(entity['word'])

        # Store the classified entities for this sentence
        sentence_entities[f"Sentence {i+1}"] = {"Medical Entities": medical_entities, "Generic Entities": generic_entities}

    # Return the dictionary containing entities for each sentence
    return sentence_entities
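
The two entity functions differ only in which pipeline they call, so they could be folded into one function that takes the pipeline as an argument. The sketch below assumes that refactoring; `classify_entities_paragraph` and the stand-in `fake_pipeline` are hypothetical names, and the fake pipeline exists only so the example runs without downloading a model.

```python
def classify_entities_paragraph(paragraph, ner_pipeline, threshold=0.65):
    """Group each sentence's tokens by whether their score exceeds the threshold.

    `ner_pipeline` is any callable mapping a string to a list of dicts with
    'word' and 'score' keys, such as a Transformers token-classification pipeline.
    """
    results = {}
    for i, sentence in enumerate(paragraph.split(". "), start=1):
        medical, generic = [], []
        for entity in ner_pipeline(sentence):
            target = medical if entity["score"] > threshold else generic
            target.append(entity["word"])
        results[f"Sentence {i}"] = {
            "Medical Entities": medical,
            "Generic Entities": generic,
        }
    return results


def fake_pipeline(sentence):
    """Hypothetical stand-in for a real pipeline: scores each word by its length."""
    return [{"word": w, "score": min(len(w) / 10, 1.0)} for w in sentence.split()]


demo = classify_entities_paragraph("Gemcitabine helps. I love school", fake_pipeline)
print(demo)
```

With this shape, `classify_entities_paragraph(paragraph, pubmed_ner_pipeline)` and `classify_entities_paragraph(paragraph, binder_ner_pipeline)` would replace the two separate definitions.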


In [15]:
sentence_entities = classify_entities_binder_paragraph(paragraph)

# Print the entities for each sentence
for sentence, entities in sentence_entities.items():
    print(f"{sentence}:")
    print("Entities detected by BINDER-PubMedBERT:", entities["Medical Entities"])
    print("Generic Entities detected by BINDER-PubMedBERT:", entities["Generic Entities"])
    print()

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Sentence 1:
Entities detected by BINDER-PubMedBERT: ['##ra']
Generic Entities detected by BINDER-PubMedBERT: ['recent', 'studies', 'have', 'shown', 'promising', 'results', 'in', 'the', 'treatment', 'of', 'pan', '##cre', '##atic', 'cancer', 'using', 'targeted', 'the', '##pies']

Sentence 2:
Entities detected by BINDER-PubMedBERT: []
Generic Entities detected by BINDER-PubMedBERT: ['the', 'combination', 'of', 'gem', '##cit', '##abi', '##ne', 'and', 'na', '##b', '-', 'pac', '##lita', '##x', '##el', 'has', 'demonstrated', 'improved', 'survival', 'rates', 'in', 'patients', 'with', 'advanced', 'pan', '##cre', '##atic', 'aden', '##oca', '##rc', '##ino', '##ma']

Sentence 3:
Entities detected by BINDER-PubMedBERT: ['##gnan', '##cies', '##ano', 'cell']
Generic Entities detected by BINDER-PubMedBERT: ['additionally', ',', 'im', '##mun', '##oth', '##era', '##py', 'has', 'emerged', 'as', 'a', 'potential', 'breakthrough', 'in', 'cancer', 'treatment', ',', 'with', 'checkpoint', 'inhibitors', 'showin
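
The output above contains WordPiece fragments such as `'pan', '##cre', '##atic'` because this tokenizer splits words that are missing from its vocabulary. A minimal sketch for joining the fragments back into words follows; `merge_wordpieces` is a hypothetical helper, and it only works if fragments are merged before tokens are divided between the two lists. The Transformers pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups sub-word tokens, which may be the simpler route.

```python
def merge_wordpieces(tokens):
    """Join WordPiece fragments ('##xyz') onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: append it to the previous word without the marker
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


pieces = ['the', 'combination', 'of', 'gem', '##cit', '##abi', '##ne',
          'and', 'na', '##b', '-', 'pac', '##lita', '##x', '##el']
print(merge_wordpieces(pieces))
```

A fragment with no preceding token, like the lone `'##ra'` in Sentence 1, is left unchanged because its first half ended up in the other list.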

In [16]:
def classify_intent_paragraph(paragraph, model_pipeline, threshold, count):
    # Tokenize the paragraph into sentences
    sentences = paragraph.split(". ")

    # Initialize a dictionary to store the intent label for each sentence
    sentence_intents = {}

    # Process each sentence separately
    for i, sentence in enumerate(sentences):
        # Run the supplied token-classification pipeline on the sentence
        intent_results = model_pipeline(sentence)

        # Count tokens labelled LABEL_1 with a score above the threshold;
        # the sentence is a "Medical Inquiry" if at least `count` tokens qualify
        confident_hits = sum(
            token['entity'] == 'LABEL_1' and token['score'] > threshold
            for token in intent_results
        )
        intent_label = "Medical Inquiry" if confident_hits >= count else "General Inquiry"


        # Store the classified intent label for this sentence
        sentence_intents[f"Sentence {i+1}"] = intent_label

    # Return the dictionary containing intent labels for each sentence
    return sentence_intents

# # Example usage:
# # paragraph = "Pharmacokinetic properties of abacavir were not altered by the addition of either lamivudine or zidovudine. Recent studies have shown promising results in the treatment of pancreatic cancer using targeted therapies."
# sentence_intents = classify_intent_paragraph(paragraph, roberta_intent_pipeline, 0.60, 5)
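
The decision rule inside `classify_intent_paragraph` can be checked in isolation on a hand-written token list. In the sketch below, `label_intent` is a hypothetical helper that mirrors the rule for a single sentence, and the token values are invented for illustration rather than taken from a model run.

```python
def label_intent(token_results, threshold, count):
    """Return 'Medical Inquiry' if at least `count` tokens are LABEL_1 above `threshold`."""
    hits = sum(
        1 for token in token_results
        if token["entity"] == "LABEL_1" and token["score"] > threshold
    )
    return "Medical Inquiry" if hits >= count else "General Inquiry"


# Hypothetical pipeline output for one sentence (values invented for illustration).
tokens = [
    {"word": "checkpoint", "entity": "LABEL_1", "score": 0.71},
    {"word": "inhibitors", "entity": "LABEL_1", "score": 0.66},
    {"word": "showing", "entity": "LABEL_0", "score": 0.80},
    {"word": "efficacy", "entity": "LABEL_1", "score": 0.55},
]

print(label_intent(tokens, threshold=0.60, count=2))  # two tokens qualify
print(label_intent(tokens, threshold=0.60, count=3))  # still only two qualify
```

Because the label depends on both `threshold` and `count`, results from runs that use different settings are not directly comparable.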




In [17]:
def save_intent_data_to_json(paragraph, model_pipeline, filename):
    # Tokenize the paragraph into sentences
    sentences = paragraph.split(". ")

    # Initialize a list to store JSON-serializable intent data
    json_serializable_data = []

    # Process each sentence separately
    for sentence in sentences:
        # Run the supplied token-classification pipeline on the sentence
        intent_results = model_pipeline(sentence)

        # Initialize a list to store JSON-serializable token data for this sentence
        sentence_data = []

        # Iterate over each token in the intent results and extract JSON-serializable information
        for token in intent_results:
            token_info = {
                "word": token.get("word", ""),
                "entity": token.get("entity", ""),
                "score": float(token.get("score", 0.0)),  # NumPy float -> Python float
                "index": int(token.get("index") or 0),
                "start": int(token.get("start") or 0),    # Offsets may be missing or None
                "end": int(token.get("end") or 0)
            }
            sentence_data.append(token_info)

        # Append the JSON-serializable token data for this sentence to the list
        json_serializable_data.append({"sentence": sentence, "intent_data": sentence_data})

    # Specify the file path to save the JSON file
    file_path = f"{filename}.json"

    # Write JSON data to file
    with open(file_path, "w") as json_file:
        json.dump(json_serializable_data, json_file, indent=4)

    print(f"JSON data has been saved to {file_path}.")
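
Once a file has been written, it can be loaded back for a quick sanity check. The sketch below assumes the schema produced above (a list of records with `sentence` and `intent_data` keys); `count_labels_per_sentence`, the sample record, and the temporary file name are hypothetical and used only so the example runs on its own.

```python
import json
import os
import tempfile
from collections import Counter


def count_labels_per_sentence(json_path):
    """Return a list of (sentence, Counter of entity labels) from a saved file."""
    with open(json_path, "r", encoding="utf-8") as handle:
        records = json.load(handle)
    return [
        (record["sentence"], Counter(tok["entity"] for tok in record["intent_data"]))
        for record in records
    ]


# Hypothetical record in the same schema, written to a temporary file for the demo.
sample = [{
    "sentence": "I love to go to school.",
    "intent_data": [
        {"word": "i", "entity": "LABEL_0", "score": 0.58, "index": 1, "start": 0, "end": 1},
        {"word": "love", "entity": "LABEL_1", "score": 0.52, "index": 2, "start": 2, "end": 6},
    ],
}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_output.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(sample, handle, indent=4)
    summary = count_labels_per_sentence(path)

print(summary)
```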





In [18]:
save_intent_data_to_json(paragraph, roberta_intent_pipeline, "robert_dataset_output")

sentence_intents = classify_intent_paragraph(paragraph, roberta_intent_pipeline, 0.60, 5)
print("roberta_intent_pipeline output: ")
# Print the intent label for each sentence
for sentence, intent_label in sentence_intents.items():
    print(f"{sentence}: {intent_label}")

JSON data has been saved to robert_dataset_output.json.
roberta_intent_pipeline output: 
Sentence 1: General Inquiry
Sentence 2: General Inquiry
Sentence 3: Medical Inquiry
Sentence 4: General Inquiry
Sentence 5: Medical Inquiry
Sentence 6: General Inquiry
Sentence 7: General Inquiry


In [25]:
save_intent_data_to_json(paragraph, binder_ner_pipeline, "binder_dataset_output")

sentence_intents = classify_intent_paragraph(paragraph, binder_ner_pipeline, 0.60, 5)
print("binder_ner_pipeline output: ")
# Print the intent label for each sentence
for sentence, intent_label in sentence_intents.items():
    print(f"{sentence}: {intent_label}")

JSON data has been saved to binder_dataset_output.json.
binder_ner_pipeline output: 
Sentence 1: General Inquiry
Sentence 2: General Inquiry
Sentence 3: Medical Inquiry
Sentence 4: General Inquiry
Sentence 5: General Inquiry
Sentence 6: General Inquiry
Sentence 7: General Inquiry


In [24]:
save_intent_data_to_json(paragraph, pubmed_ner_pipeline, "pubmed_dataset_output")

sentence_intents = classify_intent_paragraph(paragraph, pubmed_ner_pipeline, 0.5, 1)
print("pubmed_ner_pipeline output: ")
# Print the intent label for each sentence
for sentence, intent_label in sentence_intents.items():
    print(f"{sentence}: {intent_label}")

JSON data has been saved to pubmed_dataset_output.json.
pubmed_ner_pipeline output: 
Sentence 1: Medical Inquiry
Sentence 2: General Inquiry
Sentence 3: Medical Inquiry
Sentence 4: General Inquiry
Sentence 5: General Inquiry
Sentence 6: General Inquiry
Sentence 7: General Inquiry
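
The three runs above can be compared sentence by sentence. The sketch below copies the recorded labels into short strings (M for Medical Inquiry, G for General Inquiry) and lists the sentences on which all three pipelines agree; the `recorded` dictionary and the `agreement` helper are hypothetical names. Note that the PubMedBERT run used a threshold of 0.5 and a count of 1, while the other two used 0.60 and 5, so agreement here is only indicative.

```python
# Intent labels copied from the three recorded outputs above
# (M = Medical Inquiry, G = General Inquiry), one character per sentence.
recorded = {
    "roberta": "GGMGMGG",
    "binder": "GGMGGGG",
    "pubmed": "MGMGGGG",
}


def agreement(labels_by_model):
    """Return 1-based sentence numbers on which every model assigns the same label."""
    columns = zip(*labels_by_model.values())
    return [i for i, column in enumerate(columns, start=1) if len(set(column)) == 1]


agreed = agreement(recorded)
print(f"All three pipelines agree on sentences {agreed}")
```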
