
MedQuAD (Medical Question Answering Dataset)
======

The MedQuAD dataset is a collection of medical questions and answers intended for natural
language processing. With over 43,000 patient inquiries from real-life situations, categorized into 31
distinct types of questions, the dataset makes it possible to study relationships between
treatments, chronic diseases, medical protocols and more. Answers in this dataset come not
only from doctors but also from other healthcare professionals such as nurses and pharmacists, giving
researchers a broader range of responses to draw on.
## How to use the dataset
To make the most of this dataset, start by looking at the column names and
understanding what information they offer: qtype (the type of medical question), Question (the question
itself), and Answer (the expert response). The qtype column lets you filter the dataset
down to the question topics you care about. Once you have narrowed your criteria as far as possible
using qtype, it is time to analyze the data. Start by asking yourself questions such as “What treatments do
most patients search for?” or “Are there any correlations between chronic conditions and protocols?”
Then use simple queries such as
`SELECT Answer FROM MedQuad WHERE qtype='treatment' AND Question LIKE '%pain%'`
to get closer to answering those questions.
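
The query above can be tried without a database server. The following is a minimal sketch using Python's built-in `sqlite3` module with an in-memory table; the three rows are hypothetical placeholders rather than real dataset entries, and the lowercase label `treatment` follows the `qtype` values visible in the data preview.

```python
import sqlite3

# Hypothetical sample rows; the real table would be loaded from the dataset CSV.
rows = [
    ("treatment", "What are the treatments for chronic back pain ?", "Placeholder answer A"),
    ("treatment", "What are the treatments for glaucoma ?", "Placeholder answer B"),
    ("symptoms", "What are the symptoms of chest pain ?", "Placeholder answer C"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MedQuad (qtype TEXT, Question TEXT, Answer TEXT)")
conn.executemany("INSERT INTO MedQuad VALUES (?, ?, ?)", rows)

# Same filter as the query in the text, with the values passed as parameters.
query = "SELECT Answer FROM MedQuad WHERE qtype = ? AND Question LIKE ?"
matches = [answer for (answer,) in conn.execute(query, ("treatment", "%pain%"))]
conn.close()

print(matches)
```

Only the first row satisfies both conditions, so the list contains a single answer. With the CSV already loaded in pandas, a boolean mask on `qtype` combined with `Question.str.contains('pain', case=False)` would express the same filter.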

Once you have obtained new insights about healthcare from the answers in this dynamic
data set, it is time for action. Use that understanding of patient needs to
develop educational materials and implement any suggested changes. If more criteria are
needed for querying this data set, check whether MedQuAD offers additional columns; extra columns
may be added periodically that could further enhance analysis capabilities.
Link: https://www.kaggle.com/datasets/thedevastator/comprehensive-medical-q-a-dataset/data

---
### TASK) Question Answering using Transformer-based models
Implement the following Transformer-based variants for the Question Answering task.
1. BERT
2. MobileBERT
3. RoBERTa
   
Link: https://simpletransformers.ai/docs/qa-specifics/

The link above describes the models you need to fine-tune.
It also gives guidelines on how the input must be formatted before it is passed to Transformer-based models.

Use 75% for training and 25% for testing.

For each of these models, try different hyperparameters (for example, the number of encoder layers,
or a dropout rate of 0.3 or 0.7) and report the best results together with the parameter values used.
Set n_best_size = 5 and, for a few questions, show the model's top 5 predicted answers along with
the actual answer.
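
One way to organize the hyperparameter trials is to generate one argument dictionary per combination and train a model with each. The sketch below only builds the dictionaries. It assumes that Simple Transformers accepts a nested `config` entry that overrides fields of the pretrained model configuration, and that the models expose the Hugging Face fields `hidden_dropout_prob` and `num_hidden_layers`; both should be checked against the installed versions. The layer counts are illustrative values, not part of the task.

```python
from itertools import product

# Settings shared by every run (values taken from the notebook's model_args).
base_args = {
    "num_train_epochs": 3,
    "learning_rate": 3e-5,
    "n_best_size": 5,
    "max_seq_length": 384,
    "doc_stride": 128,
}

# Dropout rates come from the task text; the layer counts are hypothetical.
dropout_rates = [0.3, 0.7]
encoder_layers = [6, 12]


def build_runs(base, dropouts, layers):
    """Return one args dict per (dropout, layer count) combination."""
    runs = []
    for dropout, n_layers in product(dropouts, layers):
        args = dict(base)  # copy so the shared settings stay untouched
        # Assumption: a nested "config" dict overrides the pretrained model config.
        args["config"] = {
            "hidden_dropout_prob": dropout,
            "num_hidden_layers": n_layers,
        }
        runs.append(args)
    return runs


runs = build_runs(base_args, dropout_rates, encoder_layers)
for run in runs:
    print(run["config"])
```

Each dictionary in `runs` could then be passed as `args=` when constructing a `QuestionAnsweringModel`, and the `config` values logged next to the resulting scores.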

Use “wandb” to record training visualizations.

Calculate the BLEU and ROUGE scores for all the models and report the results in a table.

Also report the parameter values that were used to obtain the results.
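
For intuition about the ROUGE metric requested above, here is a simplified sketch of a ROUGE-1 F1 score computed as clipped unigram overlap. It uses whitespace tokenization and no stemming, so its values can differ from those of the `rouge_score` package used later in the notebook; the two strings are hypothetical examples.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: clipped unigram overlap, whitespace tokens, no stemming."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and prediction.
reference = "replace cortisol and monitor growth"
candidate = "replace cortisol"
print(round(rouge1_f1(reference, candidate), 3))  # prints 0.571
```

Here both candidate tokens appear in the reference (precision 1.0) but cover only two of its five tokens (recall 0.4), which gives an F1 of 4/7.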

In [None]:
!pip install -q opendatasets 
import opendatasets as od
import os

# Download the dataset
od.download("https://www.kaggle.com/thedevastator/comprehensive-medical-q-a-dataset")


- Creating the `st` conda environment
- https://simpletransformers.ai/docs/installation/
```bash
conda create -n st python pandas tqdm
conda activate st
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install simpletransformers
pip install wandb
pip install --upgrade simpletransformers
pip install --upgrade transformers
```

In [1]:
import pandas as pd
data=pd.read_csv('comprehensive-medical-q-a-dataset/train.csv')
data

| Unnamed: 0 | qtype | Question | Answer |
|---|---|---|---|
| 0 | susceptibility | Who is at risk for Lymphocytic Choriomeningiti... | LCMV infections can occur after exposure to fr... |
| 1 | symptoms | What are the symptoms of Lymphocytic Choriomen... | LCMV is most commonly recognized as causing ne... |
| 2 | susceptibility | Who is at risk for Lymphocytic Choriomeningiti... | Individuals of all ages who come into contact ... |
| 3 | exams and tests | How to diagnose Lymphocytic Choriomeningitis (... | During the first phase of the disease, the mos... |
| 4 | treatment | What are the treatments for Lymphocytic Chorio... | Aseptic meningitis, encephalitis, or meningoen... |
| ... | ... | ... | ... |
| 16402 | symptoms | What are the symptoms of Familial visceral myo... | What are the signs and symptoms of Familial vi... |
| 16403 | information | What is (are) Pseudopelade of Brocq ? | Pseudopelade of Brocq (PBB) is a slowly progre... |
| 16404 | symptoms | What are the symptoms of Pseudopelade of Brocq ? | What are the signs and symptoms of Pseudopelad... |
| 16405 | treatment | What are the treatments for Pseudopelade of Br... | Is there treatment or a cure for pseudopelade ... |


**Data Conversion**

In [2]:
from sklearn.model_selection import train_test_split

converted_data = []
for index, row in data.iterrows():
    context = row['Question'] + " " + row['Answer']
    answer_start = len(row['Question']) + 1  # +1 for the space
    converted_data.append({
        'qas': [
            {
                'id': str(index),
                'question': row['Question'],
                'answers': [
                    {
                        'text': row['Answer'],
                        'answer_start': answer_start
                    }
                ]
            }
        ],
        'context': context
    })

train_data, test_data = train_test_split(converted_data, test_size=0.25,random_state=42)
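
Because the conversion builds each context as question, space, answer, the `answer_start` offset must point exactly at the first character of the answer. The following sketch repeats that construction in a small helper and adds a consistency check. The function names `to_squad_record` and `spans_are_consistent` and the example row are hypothetical and not part of the notebook.

```python
def to_squad_record(index, question, answer):
    """Build one SQuAD-style record whose context is 'question + space + answer'."""
    context = question + " " + answer
    answer_start = len(question) + 1  # skip the question and the joining space
    return {
        "context": context,
        "qas": [{
            "id": str(index),
            "question": question,
            "answers": [{"text": answer, "answer_start": answer_start}],
        }],
    }


def spans_are_consistent(record):
    """Check that every answer text sits exactly at its answer_start offset."""
    context = record["context"]
    for qa in record["qas"]:
        for ans in qa["answers"]:
            start = ans["answer_start"]
            if context[start:start + len(ans["text"])] != ans["text"]:
                return False
    return True


# Hypothetical example row, not taken from the dataset.
record = to_squad_record(0, "What is X ?", "X is a placeholder condition.")
print(spans_are_consistent(record))  # prints True
```

Running such a check over every converted record before training is a cheap way to catch offset mistakes, since extractive QA models are trained on the character span rather than on the answer text itself.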

In [9]:
train_data[1]

{'qas': [{'id': '6723',
   'question': 'What is (are) Muckle-Wells syndrome ?',
   'answers': [{'text': 'Muckle-Wells syndrome is a disorder characterized by periodic episodes of skin rash, fever, and joint pain. Progressive hearing loss and kidney damage also occur in this disorder.  People with Muckle-Wells syndrome have recurrent "flare-ups" that begin during infancy or early childhood. These episodes may appear to arise spontaneously or be triggered by cold, heat, fatigue, or other stresses. Affected individuals typically develop a non-itchy rash, mild to moderate fever, painful and swollen joints, and in some cases redness in the whites of the eyes (conjunctivitis).  Hearing loss caused by progressive nerve damage (sensorineural deafness) typically becomes apparent during the teenage years. Abnormal deposits of a protein called amyloid (amyloidosis) cause progressive kidney damage in about one-third of people with Muckle-Wells syndrome; these deposits may also damage other organs.

In [10]:
test_data[1]

{'qas': [{'id': '15104',
   'question': 'What are the treatments for 21-hydroxylase deficiency ?',
   'answers': [{'text': 'What is the goal for treating 21-hydroxylase-deficient congenital adrenal hyperplasia? The objectives for treating 21-hydroxylase deficiency differ with age. In childhood, the overall goal is to replace cortisol. Obtaining hormonal balance is important and patients growth velocity and bone age is monitored. Routine analysis of blood, urine, and/or saliva may also be necessary. Corrective surgery is frequently required for females born with abnormal genitalia. In late childhood and adolescence, maintaining hormonal balance is equally important. Overtreatment may result in obesity and delayed menarche/puberty, whereas under-replacement will result in sexual precocity. Also, it is important that teens and young adults with 21-hydroxylase deficiency be successfully transitioned to adult care facilities. Follow-up of adult patients should involve multidisciplinary clin

In [None]:
!pip install -q simpletransformers rouge_score nltk

In [3]:
import wandb
wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: ammar-90123 (ammar-90). Use `wandb login --relogin` to force relogin


True

**BERT**

In [11]:


def extract_predictions(predictions):
    predicted_answers = []
    for prediction in predictions:
        predicted_answers.append(prediction['answer'])
    return predicted_answers


In [21]:
def extract_answers(data):
    actual_answers = []
    for data_item in data:  # Iterate over each dictionary in the list
        for item in data_item['qas']:
            for answer in item['answers']:
                actual_answers.append(answer['text'])
    return actual_answers
# To get the actual answers from the test data
actual_answers = extract_answers(test_data)

# To get the predicted answers from the model:
# model.predict() expects a list of dicts, each with a 'context' and a list of 'qas'.
# Reuse the original question ids so that every id is unique across the test set.
to_predict = [
    {
        'context': item['context'],
        'qas': [{'question': qa['question'], 'id': qa['id']} for qa in item['qas']],
    }
    for item in test_data
]
# Predict answers (requires a trained `model`)
predictions = model.predict(to_predict)

# The predictions are a list of two lists. The first list contains dictionaries with 'id' and 'answer' keys.
# With n_best_size = 5, each 'answer' is a ranked list of candidates; keep the top one.
predicted_answers = [pred['answer'][0] for pred in predictions[0]]
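
The task also asks for the top 5 predicted answers to be shown next to the actual answer for a few questions. The sketch below does this with a hypothetical helper, `format_top_k`. It assumes the structure described in the comment above (a list of dictionaries with 'id' and 'answer' keys, where 'answer' is a ranked list) and uses made-up sample values in place of real model output.

```python
def format_top_k(predictions, actual_by_id, k=5):
    """Return printable lines pairing each question id with its top-k answers."""
    lines = []
    for pred in predictions:
        lines.append(f"id={pred['id']} actual: {actual_by_id[pred['id']]}")
        # Keep at most k candidates, in the order the model ranked them.
        for rank, candidate in enumerate(pred["answer"][:k], start=1):
            lines.append(f"  {rank}. {candidate}")
    return lines


# Hypothetical stand-in for predictions[0]; a real run would use the model output.
sample_predictions = [
    {"id": "7", "answer": ["cand a", "cand b", "cand c", "cand d", "cand e", "cand f"]},
]
sample_actual = {"7": "reference answer"}

for line in format_top_k(sample_predictions, sample_actual):
    print(line)
```

With real data, `actual_by_id` could be built from the test set by mapping each question id to the text of its first answer.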

In [22]:
# import logging
# import wandb
# import os
from nltk.translate.bleu_score import sentence_bleu  # needed by calculate_bleu below
from rouge_score import rouge_scorer  # needed by calculate_rouge below
# from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs
import numpy as np  # needed to average the ROUGE scores

# os.environ["WANDB_HTTP_TIMEOUT"] = "180"
# wandb.init(project="MedQuad", entity="ammar-90",
#            config={"batch_size": 12, "epochs": 3, "learning_rate": 3e-5, "train_size": len(train_data), "eval_size": len(test_data) })

# logging.basicConfig(level=logging.INFO)
# transformers_logger = logging.getLogger("transformers")
# transformers_logger.setLevel(logging.WARNING)

# model_args = {
#     'reprocess_input_data': True,
#     'overwrite_output_dir': True,
#     'num_train_epochs': 3,
#     'learning_rate': 3e-5,
#     'n_best_size': 5,
#     'max_seq_length': 384,
#     'doc_stride': 128,
#     'train_batch_size': 12,
#     'gradient_accumulation_steps': 8,
#      'wandb_project': 'MedQuad',
#      "use_multiprocessing_for_evaluation": True,
# "multiprocessing_chunksize": 5
# }

# model = QuestionAnsweringModel(
#     "bert", "bert-base-uncased", args=model_args, use_cuda=True
# )


# # Train the model
# model.train_model(train_data)

# Evaluate the model
# eval_results = model.eval_model(test_data)

# print(eval_results)

# actual_answers = extract_answers(test_data)
# The predictions are a list of two lists. The first list contains dictionaries with 'id' and 'answer' keys.
predicted_answers = [pred['answer'][0] for pred in predictions[0]]
# Function to calculate BLEU score
def calculate_bleu(actual_answers, predicted_answers):
    scores = []
    for actual, predicted in zip(actual_answers, predicted_answers):
        reference = actual.split()  # Actual answer tokens
        candidate = predicted.split()  # Predicted answer tokens
        score = sentence_bleu([reference], candidate)
        scores.append(score)
    return sum(scores) / len(scores)  # Return average BLEU score

bleu_score = calculate_bleu(actual_answers, predicted_answers)
print("Average BLEU Score:", bleu_score)

# Function to calculate ROUGE scores
def calculate_rouge(actual_answers, predicted_answers):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = {key: [] for key in ['rouge1', 'rouge2', 'rougeL']}
    for actual, predicted in zip(actual_answers, predicted_answers):
        score = scorer.score(actual, predicted)
        for key in scores:
            scores[key].append(score[key].fmeasure)  # We are using the F1 measure here

    # Calculate average scores
    avg_scores = {key: np.mean(value) for key, value in scores.items()}
    return avg_scores

rouge_scores = calculate_rouge(actual_answers, predicted_answers)
print("ROUGE Scores:", rouge_scores)

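TypeError: Fraction.__new__() got an unexpected keyword argument '_normalize'

Note: this error typically comes from an older NLTK release running on Python 3.12 or newer, where `fractions.Fraction` no longer accepts the private `_normalize` argument that `sentence_bleu` relied on. Upgrading NLTK (`pip install --upgrade nltk`) usually resolves it.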

**MobileBERT**

In [None]:
import logging
import wandb
import os
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs
import numpy as np

os.environ["WANDB_HTTP_TIMEOUT"] = "180"
wandb.init(project="MedQuad", entity="ammar-90",
           config={"batch_size": 12, "epochs": 3, "learning_rate": 3e-5, "train_size": len(train_data), "eval_size": len(test_data) })

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

model_args = {
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'num_train_epochs': 3,
    'learning_rate': 3e-5,
    'n_best_size': 5,
    'max_seq_length': 384,
    'doc_stride': 128,
    'train_batch_size': 12,
    'gradient_accumulation_steps': 8,
    'wandb_project': 'MedQuad',
    'use_multiprocessing_for_evaluation': True,
    'multiprocessing_chunksize': 5
}

model = QuestionAnsweringModel(
    "mobilebert", "google/mobilebert-uncased", args=model_args, use_cuda=True
)

model.train_model(train_data)

# For question answering, eval_model is expected to return the result dict and the categorized texts
results, texts = model.eval_model(test_data)

print(results)

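# Ground-truth answers: each test item nests its answers under 'qas'
actual_answers = [qa['answers'][0]['text'] for x in test_data for qa in x['qas']]

# predict() expects dicts with a 'context' and a list of 'qas' (question plus unique id)
to_predict = [
    {'context': x['context'],
     'qas': [{'question': qa['question'], 'id': qa['id']} for qa in x['qas']]}
    for x in test_data
]
answers, probabilities = model.predict(to_predict)
# With n_best_size = 5 each 'answer' is a ranked list; keep the top candidate
predicted_answers = [pred['answer'][0] for pred in answers]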

# Function to calculate BLEU score
def calculate_bleu(actual_answers, predicted_answers):
    scores = []
    for actual, predicted in zip(actual_answers, predicted_answers):
        reference = actual.split()  # Actual answer tokens
        candidate = predicted.split()  # Predicted answer tokens
        score = sentence_bleu([reference], candidate)
        scores.append(score)
    return sum(scores) / len(scores)  # Return average BLEU score

bleu_score = calculate_bleu(actual_answers, predicted_answers)
print("Average BLEU Score:", bleu_score)

# Function to calculate ROUGE scores
def calculate_rouge(actual_answers, predicted_answers):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = {key: [] for key in ['rouge1', 'rouge2', 'rougeL']}
    for actual, predicted in zip(actual_answers, predicted_answers):
        score = scorer.score(actual, predicted)
        for key in scores:
            scores[key].append(score[key].fmeasure)  # We are using the F1 measure here

    # Calculate average scores
    avg_scores = {key: np.mean(value) for key, value in scores.items()}
    return avg_scores

rouge_scores = calculate_rouge(actual_answers, predicted_answers)
print("ROUGE Scores:", rouge_scores)

**RoBERTa**

In [None]:
import logging
import wandb
import os
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs
import numpy as np

os.environ["WANDB_HTTP_TIMEOUT"] = "180"
wandb.init(project="MedQuad", entity="ammar-90",
           config={"batch_size": 12, "epochs": 3, "learning_rate": 3e-5, "train_size": len(train_data), "eval_size": len(test_data) })

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

model_args = {
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'num_train_epochs': 3,
    'learning_rate': 3e-5,
    'n_best_size': 5,
    'max_seq_length': 384,
    'doc_stride': 128,
    'train_batch_size': 12,
    'gradient_accumulation_steps': 8,
    'wandb_project': 'MedQuad',
    'use_multiprocessing_for_evaluation': True,
    'multiprocessing_chunksize': 5
}

model = QuestionAnsweringModel(
    "roberta", "roberta-base", args=model_args, use_cuda=True
)

model.train_model(train_data)

# For question answering, eval_model is expected to return the result dict and the categorized texts
results, texts = model.eval_model(test_data)

print(results)

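# Ground-truth answers: each test item nests its answers under 'qas'
actual_answers = [qa['answers'][0]['text'] for x in test_data for qa in x['qas']]

# predict() expects dicts with a 'context' and a list of 'qas' (question plus unique id)
to_predict = [
    {'context': x['context'],
     'qas': [{'question': qa['question'], 'id': qa['id']} for qa in x['qas']]}
    for x in test_data
]
answers, probabilities = model.predict(to_predict)
# With n_best_size = 5 each 'answer' is a ranked list; keep the top candidate
predicted_answers = [pred['answer'][0] for pred in answers]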

# Function to calculate BLEU score
def calculate_bleu(actual_answers, predicted_answers):
    scores = []
    for actual, predicted in zip(actual_answers, predicted_answers):
        reference = actual.split()  # Actual answer tokens
        candidate = predicted.split()  # Predicted answer tokens
        score = sentence_bleu([reference], candidate)
        scores.append(score)
    return sum(scores) / len(scores)  # Return average BLEU score

bleu_score = calculate_bleu(actual_answers, predicted_answers)
print("Average BLEU Score:", bleu_score)

# Function to calculate ROUGE scores
def calculate_rouge(actual_answers, predicted_answers):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = {key: [] for key in ['rouge1', 'rouge2', 'rougeL']}
    for actual, predicted in zip(actual_answers, predicted_answers):
        score = scorer.score(actual, predicted)
        for key in scores:
            scores[key].append(score[key].fmeasure)  # We are using the F1 measure here

    # Calculate average scores
    avg_scores = {key: np.mean(value) for key, value in scores.items()}
    return avg_scores

rouge_scores = calculate_rouge(actual_answers, predicted_answers)
print("ROUGE Scores:", rouge_scores)