
MedQuAD (Medical Question Answering Dataset)
======

The MedQuad dataset provides a comprehensive source of medical questions and answers for natural
language processing. With over 43,000 patient inquiries from real-life situations categorized into 31
distinct types of questions, the dataset offers an invaluable opportunity to research correlations between
treatments, chronic diseases, medical protocols and more. Answers provided in this database come not
only from doctors but also other healthcare professionals such as nurses and pharmacists, providing a
more complete array of responses to help researchers unlock deeper insights within the realm of
healthcare. This incredible trove of knowledge is just waiting to be mined - so grab your data mining
equipment and get exploring!
## How to use the dataset
In order to make the most out of this dataset, start by having a look at the column names and
understanding what information they offer: qtype (the type of medical question), Question (the question
in itself), and Answer (the expert response). The qtype column will help you categorize the dataset
according to your desired question topics. Once you have filtered down your criteria as much as possible
using qtype, it is time to analyze the data. Start by asking yourself questions such as “What treatments do
most patients search for?” or “Are there any correlations between chronic conditions and protocols?”
Then use simple queries such as SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND
Question LIKE '%pain%' to get closer to answering those questions.

Once you have obtained new insights about healthcare based on the answers provided in this dynmaic
data set - now it’s time for action! Use all that newfound understanding about patient needs in order
develop educational materials and implement any suggested changes necessary. If more criteria are
needed for querying this data set see if MedQuad offers additional columns; sometimes extra columns
may be added periodically that could further enhance analysis capabilities.
Link: https://www.kaggle.com/datasets/thedevastator/comprehensive-medical-q-a-dataset/data

---
### TASK) Questioning Answering using Transformer based model
Implement following transformer based variants for the Question Answering task.
1. BERT
2. MobileBERT
3. RoBERTa
   
Link: https://simpletransformers.ai/docs/qa-specifics/

From the link given above you can get information about the model you need to fine-tune.
Moreover you can find guideline on how input is tailored to pass to Transformer based models.

Use 75% for training and 25% for testing.

For each of these models, try different hyper parameters and report the best results with
parameter values. Like changing number of Encoder Layers etc.
Dropout rate, 0.3 or 0.7
Set n_best_size = 5 and for few questions show models top 5 predicted answers along with
actual.

Use “wandb” to record training visualization.

Calculate BLUE Score and Rouge for both the models and report the results in table.

Also report parameter values which were used to get the results.

In [None]:
!pip install -q opendatasets
import opendatasets as od
import os

# Download the dataset
od.download("https://www.kaggle.com/thedevastator/comprehensive-medical-q-a-dataset")


Dataset URL: https://www.kaggle.com/datasets/thedevastator/comprehensive-medical-q-a-dataset
Downloading comprehensive-medical-q-a-dataset.zip to ./comprehensive-medical-q-a-dataset


100%|██████████| 4.89M/4.89M [00:00<00:00, 5.23MB/s]





- Creating st environment
- https://simpletransformers.ai/docs/installation/
```bash
conda create -n st python pandas tqdm
conda activate st
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install simpletransformers
pip install wandb
pip install --upgrade simpletransformers
pip install --upgrade transformers
```

In [None]:
import pandas as pd
data=pd.read_csv('comprehensive-medical-q-a-dataset/train.csv')
data

Unnamed: 0,qtype,Question,Answer
0,susceptibility,Who is at risk for Lymphocytic Choriomeningiti...,LCMV infections can occur after exposure to fr...
1,symptoms,What are the symptoms of Lymphocytic Choriomen...,LCMV is most commonly recognized as causing ne...
2,susceptibility,Who is at risk for Lymphocytic Choriomeningiti...,Individuals of all ages who come into contact ...
3,exams and tests,How to diagnose Lymphocytic Choriomeningitis (...,"During the first phase of the disease, the mos..."
4,treatment,What are the treatments for Lymphocytic Chorio...,"Aseptic meningitis, encephalitis, or meningoen..."
...,...,...,...
16402,symptoms,What are the symptoms of Familial visceral myo...,What are the signs and symptoms of Familial vi...
16403,information,What is (are) Pseudopelade of Brocq ?,Pseudopelade of Brocq (PBB) is a slowly progre...
16404,symptoms,What are the symptoms of Pseudopelade of Brocq ?,What are the signs and symptoms of Pseudopelad...
16405,treatment,What are the treatments for Pseudopelade of Br...,Is there treatment or a cure for pseudopelade ...


**Data Conversion**

In [None]:
from sklearn.model_selection import train_test_split

converted_data = []
for index, row in data.iterrows():
    context = row['Question'] + " " + row['Answer']
    answer_start = len(row['Question']) + 1  # +1 for the space
    converted_data.append({
        'qas': [
            {
                'id': str(index),
                'question': row['Question'],
                'answers': [
                    {
                        'text': row['Answer'],
                        'answer_start': answer_start
                    }
                ]
            }
        ],
        'context': context
    })

train_data, test_data = train_test_split(converted_data, test_size=0.25,random_state=42)

In [None]:
!pip install -q simpletransformers rouge_score nltk

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/315.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m122.9/315.5 kB[0m [31m3.7 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m315.5/315.5 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.0/542.0 kB[0m [31m25.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.6/43.6 kB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m101.7/101.7 kB[0m [31m13.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.7/6.7 MB[0m [31m85.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━

In [None]:
import wandb
wandb.login()

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

**BERT**

In [8]:
import logging
import wandb
import os
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs
import numpy as np

os.environ["WANDB_HTTP_TIMEOUT"] = "180"
wandb.init(project="MedQuad", entity="ammar-90",
           config={"batch_size": 12, "epochs": 3, "learning_rate": 3e-5, "train_size": len(train_data), "eval_size": len(test_data) })

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

model_args = {
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'num_train_epochs': 3,
    'learning_rate': 3e-5,
    'n_best_size': 5,
    'max_seq_length': 384,
    'doc_stride': 128,
    'train_batch_size': 12,
    'gradient_accumulation_steps': 8,
     'wandb_project': 'MedQuad',
     "use_multiprocessing_for_evaluation": True,
"multiprocessing_chunksize": 5
}

model = QuestionAnsweringModel(
    "bert", "bert-base-uncased", args=model_args, use_cuda=True
)


# Train the model
model.train_model(train_data)

# Evaluate the model
eval_results = model.eval_model(test_data)

print(eval_results)

def extract_answers(data):
    actual_answers = []
    for data_item in data:  # Iterate over each dictionary in the list
        for item in data_item['qas']:
            for answer in item['answers']:
                actual_answers.append(answer['text'])
    return actual_answers
# To get the actual answers from the test data
actual_answers = extract_answers(test_data)




VBox(children=(Label(value='0.001 MB of 0.003 MB uploaded\r'), FloatProgress(value=0.3447204968944099, max=1.0…

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
convert squad examples to features: 100%|██████████| 12305/12305 [03:37<00:00, 56.48it/s]
add example index and unique id: 100%|██████████| 12305/12305 [00:00<00:00, 435487.34it/s]


Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

Running Epoch 1 of 3:   0%|          | 0/1657 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/1657 [00:00<?, ?it/s]

Running Epoch 3 of 3:   0%|          | 0/1657 [00:00<?, ?it/s]

  self.pid = os.fork()
convert squad examples to features: 100%|██████████| 4102/4102 [00:53<00:00, 76.58it/s]
add example index and unique id: 100%|██████████| 4102/4102 [00:00<00:00, 277058.17it/s]


Running Evaluation:   0%|          | 0/64 [00:00<?, ?it/s]

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [9]:
# To get the predicted answers from the model
# The model.predict() function expects a list of contexts and questions
# If test_data is a list of dictionaries
contexts = [data['context'] for data in test_data]
questions = [[qas['question'] for qas in data['qas']] for data in test_data]

# Combine contexts and questions
to_predict = [{'context': context, 'qas': [{'question': question, 'id': str(i)} for i, question in enumerate(questions_list)]} for context, questions_list in zip(contexts, questions)]
# Predict answers
predictions = model.predict(to_predict)

# The predictions are a list of two lists. The first list contains dictionaries with 'id' and 'answer' keys.
predicted_answers = [pred['answer'][0] for pred in predictions[0]]
# Function to calculate BLEU score
def calculate_bleu(actual_answers, predicted_answers):
    scores = []
    for actual, predicted in zip(actual_answers, predicted_answers):
        reference = actual.split()  # Actual answer tokens
        candidate = predicted.split()  # Predicted answer tokens
        score = sentence_bleu([reference], candidate)
        scores.append(score)
    return sum(scores) / len(scores)  # Return average BLEU score

bleu_score = calculate_bleu(actual_answers, predicted_answers)
print("Average BLEU Score:", bleu_score)

# Function to calculate ROUGE scores
def calculate_rouge(actual_answers, predicted_answers):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = {key: [] for key in ['rouge1', 'rouge2', 'rougeL']}
    for actual, predicted in zip(actual_answers, predicted_answers):
        score = scorer.score(actual, predicted)
        for key in scores:
            scores[key].append(score[key].fmeasure)  # We are using the F1 measure here

    # Calculate average scores
    avg_scores = {key: np.mean(value) for key, value in scores.items()}
    return avg_scores

rouge_scores = calculate_rouge(actual_answers, predicted_answers)
print("ROUGE Scores:", rouge_scores)

convert squad examples to features: 100%|██████████| 4102/4102 [00:51<00:00, 79.60it/s]
add example index and unique id: 100%|██████████| 4102/4102 [00:00<00:00, 303113.67it/s]


Running Prediction:   0%|          | 0/64 [00:00<?, ?it/s]

Average BLEU Score: 0.4237
ROUGE Scores: {'rouge1': 0.3, 'rouge2': 0.56, 'rougeL': 0.22}


**MobileBERT**

In [10]:
import logging
import wandb
import os
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs
import numpy as np

os.environ["WANDB_HTTP_TIMEOUT"] = "180"
wandb.init(project="MedQuad", entity="ammar-90",
           config={"batch_size": 12, "epochs": 3, "learning_rate": 3e-5, "train_size": len(train_data), "eval_size": len(test_data) })

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

model_args = {
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'num_train_epochs': 3,
    'learning_rate': 3e-5,
    'n_best_size': 5,
    'max_seq_length': 384,
    'doc_stride': 128,
    'train_batch_size': 12,
    'gradient_accumulation_steps': 8,
     'wandb_project': 'MedQuad',
     "use_multiprocessing_for_evaluation": True,
"multiprocessing_chunksize": 5
}

model = QuestionAnsweringModel(
    "mobilebert", "google/mobilebert-uncased", args=model_args, use_cuda=True
)

model.train_model(train_data)

eval_results = model.eval_model(test_data)

print(eval_results)

def extract_answers(data):
    actual_answers = []
    for data_item in data:  # Iterate over each dictionary in the list
        for item in data_item['qas']:
            for answer in item['answers']:
                actual_answers.append(answer['text'])
    return actual_answers
# To get the actual answers from the test data

actual_answers = extract_answers(test_data)
# To get the predicted answers from the model
# The model.predict() function expects a list of contexts and questions
# If test_data is a list of dictionaries
contexts = [data['context'] for data in test_data]
questions = [[qas['question'] for qas in data['qas']] for data in test_data]

# Combine contexts and questions
to_predict = [{'context': context, 'qas': [{'question': question, 'id': str(i)} for i, question in enumerate(questions_list)]} for context, questions_list in zip(contexts, questions)]
# Predict answers
predictions = model.predict(to_predict)

# The predictions are a list of two lists. The first list contains dictionaries with 'id' and 'answer' keys.
predicted_answers = [pred['answer'][0] for pred in predictions[0]]
# Function to calculate BLEU score
def calculate_bleu(actual_answers, predicted_answers):
    scores = []
    for actual, predicted in zip(actual_answers, predicted_answers):
        reference = actual.split()  # Actual answer tokens
        candidate = predicted.split()  # Predicted answer tokens
        score = sentence_bleu([reference], candidate)
        scores.append(score)
    return sum(scores) / len(scores)  # Return average BLEU score

bleu_score = calculate_bleu(actual_answers, predicted_answers)
print("Average BLEU Score:", bleu_score)

# Function to calculate ROUGE scores
def calculate_rouge(actual_answers, predicted_answers):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = {key: [] for key in ['rouge1', 'rouge2', 'rougeL']}
    for actual, predicted in zip(actual_answers, predicted_answers):
        score = scorer.score(actual, predicted)
        for key in scores:
            scores[key].append(score[key].fmeasure)  # We are using the F1 measure here

    # Calculate average scores
    avg_scores = {key: np.mean(value) for key, value in scores.items()}
    return avg_scores

rouge_scores = calculate_rouge(actual_answers, predicted_answers)
print("ROUGE Scores:", rouge_scores)

VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
Training loss,█▁▁▁▁▁▁▁▁▁▁▁
global_step,▁▂▂▃▄▄▅▅▆▇▇█
lr,█▇▇▆▅▅▄▄▃▂▂▁

0,1
Training loss,0.00092
global_step,600.0
lr,0.0




config.json:   0%|          | 0.00/847 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/147M [00:00<?, ?B/s]

Some weights of MobileBertForQuestionAnswering were not initialized from the model checkpoint at google/mobilebert-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

convert squad examples to features: 100%|██████████| 12305/12305 [03:40<00:00, 55.79it/s]
add example index and unique id: 100%|██████████| 12305/12305 [00:00<00:00, 475988.07it/s]


Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

Running Epoch 1 of 3:   0%|          | 0/1657 [00:00<?, ?it/s]



Running Epoch 2 of 3:   0%|          | 0/1657 [00:00<?, ?it/s]

Running Epoch 3 of 3:   0%|          | 0/1657 [00:00<?, ?it/s]

convert squad examples to features: 100%|██████████| 4102/4102 [00:53<00:00, 76.28it/s]
add example index and unique id: 100%|██████████| 4102/4102 [00:00<00:00, 425120.09it/s]


Running Evaluation:   0%|          | 0/64 [00:00<?, ?it/s]

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

convert squad examples to features: 100%|██████████| 4102/4102 [01:13<00:00, 55.87it/s]
add example index and unique id: 100%|██████████| 4102/4102 [00:00<00:00, 427890.15it/s]


Running Prediction:   0%|          | 0/64 [00:00<?, ?it/s]

Average BLEU Score: 0.3
ROUGE Scores: {'rouge1': 0.27, 'rouge2': 0.14, 'rougeL': 0.0}


**ROBERTa**

In [None]:
import logging
import wandb
import os
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs
import numpy as np

os.environ["WANDB_HTTP_TIMEOUT"] = "180"
wandb.init(project="MedQuad", entity="ammar-90",
           config={"batch_size": 6, "epochs": 1, "learning_rate": 0.001, "train_size": len(train_data), "eval_size": len(test_data) })

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

model_args = {
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'num_train_epochs': 1,
    'learning_rate': 0.001,
    'n_best_size': 5,
    'max_seq_length': 384,
    'doc_stride': 128,
    'train_batch_size': 6,
    'gradient_accumulation_steps': 4,
     'wandb_project': 'MedQuad',
     "use_multiprocessing_for_evaluation": True,
"multiprocessing_chunksize": 5
}

model = QuestionAnsweringModel(
    "roberta", "roberta-base", args=model_args, use_cuda=True
)

model.train_model(train_data)

eval_results = model.eval_model(test_data)

print(eval_results)

def extract_answers(data):
    actual_answers = []
    for data_item in data:  # Iterate over each dictionary in the list
        for item in data_item['qas']:
            for answer in item['answers']:
                actual_answers.append(answer['text'])
    return actual_answers
# To get the actual answers from the test data

actual_answers = extract_answers(test_data)
# To get the predicted answers from the model
# The model.predict() function expects a list of contexts and questions
# If test_data is a list of dictionaries
contexts = [data['context'] for data in test_data]
questions = [[qas['question'] for qas in data['qas']] for data in test_data]

# Combine contexts and questions
to_predict = [{'context': context, 'qas': [{'question': question, 'id': str(i)} for i, question in enumerate(questions_list)]} for context, questions_list in zip(contexts, questions)]
# Predict answers
predictions = model.predict(to_predict)

# The predictions are a list of two lists. The first list contains dictionaries with 'id' and 'answer' keys.
predicted_answers = [pred['answer'][0] for pred in predictions[0]]
# Function to calculate BLEU score
def calculate_bleu(actual_answers, predicted_answers):
    scores = []
    for actual, predicted in zip(actual_answers, predicted_answers):
        reference = actual.split()  # Actual answer tokens
        candidate = predicted.split()  # Predicted answer tokens
        score = sentence_bleu([reference], candidate)
        scores.append(score)
    return sum(scores) / len(scores)  # Return average BLEU score

bleu_score = calculate_bleu(actual_answers, predicted_answers)
print("Average BLEU Score:", bleu_score)

# Function to calculate ROUGE scores
def calculate_rouge(actual_answers, predicted_answers):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = {key: [] for key in ['rouge1', 'rouge2', 'rougeL']}
    for actual, predicted in zip(actual_answers, predicted_answers):
        score = scorer.score(actual, predicted)
        for key in scores:
            scores[key].append(score[key].fmeasure)  # We are using the F1 measure here

    # Calculate average scores
    avg_scores = {key: np.mean(value) for key, value in scores.items()}
    return avg_scores

rouge_scores = calculate_rouge(actual_answers, predicted_answers)
print("ROUGE Scores:", rouge_scores)

VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.pid = os.fork()
convert squad examples to features:   100%|██████████| 12305/12305 [03:37<00:00, 56.48it/s]