# **Medical Question Answering System**

Objective: Build a Question Answering (QA) system that extracts specific information from medical transcriptions written by attending physicians. The system answers a set of predefined questions using only the text of each transcription.

# Import Required Libraries

In [1]:
# Install required libraries
!pip install transformers datasets huggingface_hub nltk spacy

# Import necessary libraries
import pandas as pd
import numpy as np
import re
from transformers import pipeline, AutoModelForQuestionAnswering, AutoTokenizer, Trainer, TrainingArguments
from datasets import Dataset
from sklearn.model_selection import train_test_split
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Data Description:

The dataset used in this project is a subset of the Medical Transcripts repository scraped from Kaggle. It consists of 4999 rows and 6 variables, representing medical transcription records. For question answering, each record is framed in the following format:

- Question: The question based on the transcription (e.g., "How old is the patient?").
- Context: The medical transcription that contains the information needed to answer the question.
- Answers: The ground-truth answers. Because manual annotations are not available for this project, ground-truth answers are simulated or evaluation relies on automated methods.

This structure enables the use of a reading comprehension model, whose goal is to extract the answer from the context given a question.
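
A record in this format can be sanity-checked before it is used for training. The sketch below assumes the SQuAD-style layout used later in this notebook, where `answers` holds parallel `text` and `answer_start` lists. The helper name `is_valid_record` and the toy record are hypothetical, introduced only for illustration.

```python
def is_valid_record(record):
    """Return True if every answer text appears in the context at its stated offset."""
    context = record["context"]
    texts = record["answers"]["text"]
    starts = record["answers"]["answer_start"]
    if len(texts) != len(starts):
        return False
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(texts, starts)
    )

# Toy record for illustration only (not taken from the dataset)
example = {
    "question": "How old is the patient?",
    "context": "The patient is a 60 year old male.",
    "answers": {"text": ["60"], "answer_start": [17]},
}
print(is_valid_record(example))  # True
```

A record that fails this check would give an extractive model a label pointing at the wrong characters, so it is worth running over every annotation.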



In [30]:
# Load the dataset (assuming you uploaded it to Colab as 'mtsamples.csv')
data_path = '/content/mtsamples.csv'
df = pd.read_csv(data_path)

# Display the first few rows of the dataset
df.head()

# Focus on the 'transcription' column
df = df[['transcription']]
print(f"Number of records: {len(df)}")



Number of records: 4999


In [3]:
# Convert non-string values to empty strings and preprocess
def preprocess_text(text):
    if not isinstance(text, str):
        text = ""  # Replace non-string values with empty string
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = re.sub(r'[^A-Za-z0-9.,!?\'"]+', ' ', text)  # Keep alphanumeric and punctuation
    return text.strip()

# Apply preprocessing
df['transcription'] = df['transcription'].apply(preprocess_text)

# Check cleaned text
df['transcription'].head()



| | transcription |
|---|---|
| 0 | SUBJECTIVE , This 23 year old white female pre... |
| 1 | PAST MEDICAL HISTORY , He has difficulty climb... |
| 2 | HISTORY OF PRESENT ILLNESS , I have seen ABC t... |
| 3 | 2 D M MODE , ,1. Left atrial enlargement with ... |
| 4 | 1. The left ventricular cavity size and wall t... |


# Manual annotations

In [4]:
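# Example manual annotations (can extend this later).
# NOTE: the answer texts and offsets below are placeholders. For instance, record 0
# reads "This 23 year old ...", not 45. Before training, replace them with real spans
# such that context[answer_start:answer_start + len(text)] == text.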
annotations = [
    {
        "context": df['transcription'][0],
        "question": "How old is the patient?",
        "answers": {"text": ["45"], "answer_start": [15]}
    },
    {
        "context": df['transcription'][1],
        "question": "What is the reason for consultation?",
        "answers": {"text": ["chest pain"], "answer_start": [50]}
    }
]

# Convert to DataFrame for later processing
annotated_df = pd.DataFrame(annotations)
annotated_df

| | context | question | answers |
|---|---|---|---|
| 0 | SUBJECTIVE , This 23 year old white female pre... | How old is the patient? | {'text': ['45'], 'answer_start': [15]} |
| 1 | PAST MEDICAL HISTORY , He has difficulty climb... | What is the reason for consultation? | {'text': ['chest pain'], 'answer_start': [50]} |
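
Hand-typed offsets are easy to get wrong, so one option is to derive `answer_start` from the context itself. The following is a minimal sketch, assuming the answer appears verbatim in the context; `make_annotation` is a hypothetical helper, and the snippet is just the opening words of the first transcription as displayed in this notebook.

```python
def make_annotation(context, question, answer_text):
    """Build a SQuAD-style annotation, locating the answer by exact string search."""
    start = context.find(answer_text)  # First occurrence only
    if start == -1:
        raise ValueError(f"Answer {answer_text!r} not found in context")
    return {
        "context": context,
        "question": question,
        "answers": {"text": [answer_text], "answer_start": [start]},
    }

# Opening words of the first transcription
snippet = "SUBJECTIVE , This 23 year old white female"
annotation = make_annotation(snippet, "How old is the patient?", "23")
print(annotation["answers"])  # {'text': ['23'], 'answer_start': [18]}
```

Because `str.find` returns the first match, an answer string that occurs several times in a transcription would still need a manual check.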


# Load and preprocess data

In [5]:
import re

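# Function to clean text (this redefines preprocess_text from the earlier cell).
# The contexts were already cleaned above, so this pass leaves them unchanged.
# Note that altering a context after annotation would shift its answer_start offsets.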
def preprocess_text(text):
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = text.strip()
    return text

# Apply preprocessing
annotated_df["context"] = annotated_df["context"].apply(preprocess_text)
print("\nPreprocessed Contexts:")
print(annotated_df["context"].head())


Preprocessed Contexts:
0    SUBJECTIVE , This 23 year old white female pre...
1    PAST MEDICAL HISTORY , He has difficulty climb...
Name: context, dtype: object


In [6]:
from datasets import Dataset

# Convert DataFrame to Hugging Face Dataset
qa_dataset = Dataset.from_pandas(annotated_df)

# Display the dataset
print(qa_dataset)

# Use the Hugging Face Dataset split method instead of train_test_split
split_datasets = qa_dataset.train_test_split(test_size=0.2)

# Separate train and validation datasets
train_data = split_datasets['train']
val_data = split_datasets['test']

# Display the sizes of train and validation sets
print(f"Train dataset size: {len(train_data)}")
print(f"Validation dataset size: {len(val_data)}")



Dataset({
    features: ['context', 'question', 'answers'],
    num_rows: 2
})
Train dataset size: 1
Validation dataset size: 1


# Tokenization

In [8]:
from transformers import AutoTokenizer

# Load tokenizer
model_name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

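# Define preprocessing function
def preprocess_function(examples):
    # Question first, then context (the usual extractive-QA input order);
    # "only_second" truncates the context and never the question
    return tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        padding="max_length",
        max_length=512
    )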

# Tokenize the datasets
tokenized_train = train_data.map(preprocess_function, batched=True)
tokenized_val = val_data.map(preprocess_function, batched=True)

# Verify tokenized data
print(tokenized_train[0])



{'context': 'SUBJECTIVE , This 23 year old white female presents with complaint of allergies. She used to have allergies when she lived in Seattle but she thinks they are worse here. In the past, she has tried Claritin, and Zyrtec. Both worked for short time but then seemed to lose effectiveness. She has used Allegra also. She used that last summer and she began using it again two weeks ago. It does not appear to be working very well. She has used over the counter sprays but no prescription nasal sprays. She does have asthma but doest not require daily medication for this and does not think it is flaring up.,MEDICATIONS , Her only medication currently is Ortho Tri Cyclen and the Allegra.,ALLERGIES , She has no known medicine allergies.,OBJECTIVE ,Vitals Weight was 130 pounds and blood pressure 124 78.,HEENT Her throat was mildly erythematous without exudate. Nasal mucosa was erythematous and swollen. Only clear drainage was seen. TMs were clear.,Neck Supple without adenopathy.,Lungs Cl

In [15]:
def preprocess_data(examples):
    inputs = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",  # Truncate the context, never the question
        padding="max_length",
        max_length=512,
        return_offsets_mapping=True  # To map char positions to token positions
    )

    start_positions = []
    end_positions = []

    for i, offset in enumerate(inputs["offset_mapping"]):
        # Extract answer details (take the first answer for simplicity)
        answer = examples["answers"][i]  # Each entry is a dict of lists
        start_char = answer["answer_start"][0]  # First start index
        end_char = start_char + len(answer["text"][0])  # Exclusive end index

        # Offsets are relative to each sequence, so only context tokens
        # (sequence id 1) may match; question tokens would otherwise collide
        sequence_ids = inputs.sequence_ids(i)

        # Find token indices for start and end
        start_idx = end_idx = None
        for idx, (start, end) in enumerate(offset):
            if sequence_ids[idx] != 1:
                continue
            if start <= start_char < end:
                start_idx = idx
            if start < end_char <= end:
                end_idx = idx

        # If either end of the span is missing (e.g. truncated away),
        # point both positions at token 0 to mark the example as unanswerable
        if start_idx is None or end_idx is None:
            start_idx = end_idx = 0
        start_positions.append(start_idx)
        end_positions.append(end_idx)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    del inputs["offset_mapping"]  # Remove offset mapping after use
    return inputs
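
The character-to-token mapping in `preprocess_data` can be tried out without downloading a model. The sketch below swaps in a simple whitespace tokenizer as a stand-in for the real subword tokenizer (an assumption made purely for illustration); `whitespace_offsets` and `char_span_to_token_span` are hypothetical helpers that mirror the two overlap tests used above.

```python
import re

def whitespace_offsets(text):
    """Return (start, end) character offsets for each whitespace-delimited token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to inclusive token indices."""
    start_idx = end_idx = None
    for idx, (start, end) in enumerate(offsets):
        if start <= start_char < end:   # Token containing the first answer character
            start_idx = idx
        if start < end_char <= end:     # Token containing the last answer character
            end_idx = idx
    return start_idx, end_idx

context = "This 23 year old white female presents with complaint of allergies."
answer = "23 year old"
start_char = context.find(answer)
offsets = whitespace_offsets(context)
print(char_span_to_token_span(offsets, start_char, start_char + len(answer)))  # (1, 3)
```

With a subword tokenizer the indices differ, but the overlap logic is the same: the start token is the one whose range contains the first answer character, and the end token is the one containing the last.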


In [16]:
tokenized_train = train_data.map(preprocess_data, batched=True)
tokenized_val = val_data.map(preprocess_data, batched=True)


In [17]:
print(tokenized_train[0])

{'context': 'SUBJECTIVE , This 23 year old white female presents with complaint of allergies. She used to have allergies when she lived in Seattle but she thinks they are worse here. In the past, she has tried Claritin, and Zyrtec. Both worked for short time but then seemed to lose effectiveness. She has used Allegra also. She used that last summer and she began using it again two weeks ago. It does not appear to be working very well. She has used over the counter sprays but no prescription nasal sprays. She does have asthma but doest not require daily medication for this and does not think it is flaring up.,MEDICATIONS , Her only medication currently is Ortho Tri Cyclen and the Allegra.,ALLERGIES , She has no known medicine allergies.,OBJECTIVE ,Vitals Weight was 130 pounds and blood pressure 124 78.,HEENT Her throat was mildly erythematous without exudate. Nasal mucosa was erythematous and swollen. Only clear drainage was seen. TMs were clear.,Neck Supple without adenopathy.,Lungs Cl

In [18]:
print(train_data[0]["answers"])

{'answer_start': [15], 'text': ['45']}


# Evaluation

In [20]:
!pip install evaluate

# Import the evaluation library
import evaluate

# Load evaluation metric
metric = evaluate.load("squad")


# Example predictions and references for evaluation

In [21]:
# Toy example: each prediction is identical to its reference,
# so both scores are trivially 100
predictions = [
    {"id": "1", "prediction_text": "The patient is 45 years old."},
    {"id": "2", "prediction_text": "The patient complains of headaches."},
]
references = [
    {"id": "1", "answers": {"text": ["The patient is 45 years old."], "answer_start": [0]}},
    {"id": "2", "answers": {"text": ["The patient complains of headaches."], "answer_start": [0]}},
]

# Compute the metric
results = metric.compute(predictions=predictions, references=references)
print(results)

{'exact_match': 100.0, 'f1': 100.0}
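
To see what these two scores measure when a prediction only partly matches, the sketch below re-implements exact match and token-level F1 in plain Python. It follows the normalization commonly used for SQuAD v1.1 evaluation (lowercasing, removing punctuation and English articles, collapsing whitespace) as an approximation, not as the metric's actual source. It returns values in the range 0 to 1, whereas the `squad` metric above reports 0 to 100.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))

def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A prediction that contains the reference plus extra words
print(exact_match("23 year old", "23"))  # 0.0
print(f1_score("23 year old", "23"))     # 0.5
```

A predicted span that is longer than the reference therefore gets partial F1 credit but no exact-match credit, which is worth keeping in mind when deciding how tightly to annotate answers.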


# Example Passage

In [28]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Example passage
passage = """This 23 year old white female presents with complaint of allergies.
She used to have allergies when she lived in Seattle but she thinks they are worse here.
In the past, she has tried Claritin, and Zyrtec. Both worked for short time but then seemed to lose effectiveness.
She has used Allegra also. She used that last summer and she began using it again two weeks ago.
It does not appear to be working very well. She has used over the counter sprays but no prescription nasal sprays.
She does have asthma but doest not require daily medication for this and does not think it is flaring up.,MEDICATIONS ,
Her only medication currently is Ortho Tri Cyclen and the Allegra.,ALLERGIES ,
She has no known medicine allergies.,OBJECTIVE ,Vitals Weight was 130 pounds and blood pressure 124 78.,HEENT Her throat was mildly erythematous without exudate.
Nasal mucosa was erythematous and swollen. Only clear drainage was seen.
TMs were clear.,Neck Supple without adenopathy.,Lungs Clear.,ASSESSMENT , Allergic rhinitis.,PLAN ,
1. She will try Zyrtec instead of Allegra again. Another option will be to use loratadine.
She does not think she has prescription coverage so that might be cheaper.,
2. Samples of Nasonex two sprays in each nostril given for three weeks. A prescription was written as well."""

# Define questions
q1 = "How old is the patient?"
q2 = "Does the patient have any complaints?"
q3 = "What is the reason for this consultation?"


questions = [q1, q2, q3]

# Load model and tokenizer
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Function to get answer from the model
def get_model_answer(model, question, passage, tokenizer):
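    # Tokenize the input, truncating the passage (never the question)
    # so the pair fits within the model's 512-token limit
    inputs = tokenizer(question, passage, truncation="only_second", max_length=512, return_tensors="pt")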
    with torch.no_grad():
        outputs = model(**inputs)

    # Get start and end scores for the answer
    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

    # Get the most likely start and end of the answer
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores) + 1  # add 1 because the end position is inclusive

    # Convert token ids to string
    answer_tokens = inputs["input_ids"][0][answer_start:answer_end]
    answer = tokenizer.decode(answer_tokens)
    return answer.strip()

# Loop through the questions and get answers
for i, q in enumerate(questions):
    print(f"Question {i+1}: {q}")
    print()
    print(f"Answer: {get_model_answer(model, q, passage, tokenizer)}")
    print()
    print()

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Question 1: How old is the patient?

Answer: 23 year old


Question 2: Does the patient have any complaints?

Answer: allergies


Question 3: What is the reason for this consultation?

Answer: allergic rhinitis
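
`get_model_answer` takes the argmax of the start and end scores independently, which can produce an end position before the start and therefore an empty answer. One common alternative is to search for the best-scoring valid span. The sketch below is a plain-Python illustration under stated assumptions: `best_span` and `max_answer_len` are hypothetical names, the logits are toy values, and the function does not exclude question or special tokens.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair maximizing start_logits[start] + end_logits[end],
    subject to start <= end < start + max_answer_len (end is inclusive)."""
    best_score = float("-inf")
    best = (0, 0)
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best

# Toy logits: independent argmax would pick start=3 and end=1, an invalid span
start_logits = [0.1, 2.0, 0.3, 2.5]
end_logits = [0.2, 3.0, 1.0, 0.5]
print(best_span(start_logits, end_logits))  # (1, 1)
```

To use it with the model above, the logits could be passed as `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and the answer decoded from `input_ids[0][start:end + 1]`.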


