### Download the dataset

[comprehensive-medical-q-a-dataset](https://www.kaggle.com/datasets/thedevastator/comprehensive-medical-q-a-dataset)\
The MedQuAD dataset is a source of medical questions and answers for natural language processing. According to its Kaggle description, it contains over 43,000 real-life patient inquiries categorized into 31 question types, with answers from doctors as well as other healthcare professionals such as nurses and pharmacists. The CSV loaded in this notebook is smaller than that description suggests: `describe()` reports 16,407 rows and 16 distinct `qtype` values.

In [1]:
# Download the dataset from Kaggle

import kagglehub

# Download latest version
path = kagglehub.dataset_download("thedevastator/comprehensive-medical-q-a-dataset")

print("Path to dataset files:", path)

Path to dataset files: /home/mshifa/.cache/kagglehub/datasets/thedevastator/comprehensive-medical-q-a-dataset/versions/2
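
The download lands in the kagglehub cache, while the next cell reads `./../dataset/medical_qa_raw.csv`. The notebook does not show how the file got there, so the following is only a sketch of one way to bridge the two. The helper name `copy_first_csv` is hypothetical, and it assumes the download directory contains a single relevant CSV file.

```python
import shutil
from pathlib import Path


def copy_first_csv(download_dir, target_path):
    """Copy the first CSV found under download_dir to target_path (hypothetical helper)."""
    csv_files = sorted(Path(download_dir).rglob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found under {download_dir}")
    target = Path(target_path)
    # Create the destination folder if it does not exist yet
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(csv_files[0], target)
    return target


# Possible usage in this notebook (assumes `path` from the cell above):
# copy_first_csv(path, "./../dataset/medical_qa_raw.csv")
```

If the download contains several CSV files, it is safer to inspect `csv_files` and pick the right one by name.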


In [1]:
import json
import re
import pandas as pd
from tqdm.auto import tqdm

# Load dataset
df_raw = pd.read_csv("./../dataset/medical_qa_raw.csv")
df_raw.head(5)

Unnamed: 0,qtype,Question,Answer
0,susceptibility,Who is at risk for Lymphocytic Choriomeningiti...,LCMV infections can occur after exposure to fr...
1,symptoms,What are the symptoms of Lymphocytic Choriomen...,LCMV is most commonly recognized as causing ne...
2,susceptibility,Who is at risk for Lymphocytic Choriomeningiti...,Individuals of all ages who come into contact ...
3,exams and tests,How to diagnose Lymphocytic Choriomeningitis (...,"During the first phase of the disease, the mos..."
4,treatment,What are the treatments for Lymphocytic Chorio...,"Aseptic meningitis, encephalitis, or meningoen..."


In [2]:
df_raw.describe()

Unnamed: 0,qtype,Question,Answer
count,16407,16407,16407
unique,16,14979,15817
top,information,What causes Causes of Diabetes ?,This condition is inherited in an autosomal re...
freq,4535,20,348


In [3]:
# Shape and columns
print("Shape:", df_raw.shape)
print("Columns:", df_raw.columns.tolist())

Shape: (16407, 3)
Columns: ['qtype', 'Question', 'Answer']


In [4]:
# Null values check
print("\nMissing Values:")
print(df_raw.isnull().sum())


Missing Values:
qtype       0
Question    0
Answer      0
dtype: int64


In [5]:
# Normalize whitespace and question marks in a text field
def clean_text(text):
    """Return `text` with whitespace collapsed and question marks normalized."""
    if pd.isna(text):
        return ""
    text = re.sub(r"\s+", " ", str(text))           # collapse tabs, newlines, and repeated spaces
    text = re.sub(r"\s+\?", "?", text)              # remove space before ?
    text = re.sub(r"\?+", "?", text)                # replace ?? or ??? with single ?
    return text.strip()

df_clean = df_raw.copy()
# Apply cleaning to all string columns
for col in df_clean.select_dtypes(include=["object"]).columns:
    df_clean[col] = df_clean[col].apply(clean_text)
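
The same cleaning rules can also be expressed with pandas string methods, which avoids calling a Python function once per cell. This is an alternative sketch, not the notebook's method; `clean_series` is a hypothetical name, and it assumes missing values should become empty strings, as in `clean_text`.

```python
import pandas as pd


def clean_series(s: pd.Series) -> pd.Series:
    """Vectorized version of the cleaning rules (hypothetical helper)."""
    return (
        s.fillna("")
        .astype(str)
        .str.replace(r"\s+", " ", regex=True)    # collapse any run of whitespace
        .str.replace(r"\s+\?", "?", regex=True)  # remove space before ?
        .str.replace(r"\?+", "?", regex=True)    # collapse repeated question marks
        .str.strip()
    )


sample = pd.Series(["What causes\tDiabetes ??", None, "  How to  prevent LCM ?  "])
cleaned = clean_series(sample).tolist()
print(cleaned)
```

It could be applied per column with `df_clean[col] = clean_series(df_clean[col])`.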
  

In [6]:
def remove_duplicate_questions(df):
    """Remove exact duplicate questions"""
    print(f"Original dataset shape: {df.shape}")
    
    # Remove exact duplicates based on Question column
    df_clean = df.drop_duplicates(subset=['Question'], keep='first')
    
    print(f"After removing duplicate questions: {df_clean.shape}")
    print(f"Removed {len(df) - len(df_clean)} duplicate questions")
    
    return df_clean

# Basic duplicate removal
df_unique = remove_duplicate_questions(df_clean)

Original dataset shape: (16407, 3)
After removing duplicate questions: (14979, 3)
Removed 1428 duplicate questions
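
Dropping on `Question` alone keeps only the first answer for each question. The `head()` output earlier shows the same LCM risk question at index 0 and index 2 with different answers, so some of the 1,428 removed rows may carry distinct information. A small sketch for checking this before dropping follows; `questions_with_conflicting_answers` is a hypothetical helper, and the toy frame only illustrates the behavior.

```python
import pandas as pd


def questions_with_conflicting_answers(df: pd.DataFrame) -> pd.Series:
    """Return questions that map to more than one distinct answer."""
    answers_per_question = df.groupby("Question")["Answer"].nunique()
    return answers_per_question[answers_per_question > 1]


# Toy data: Q1 has two different answers, Q2 is an exact duplicate
toy = pd.DataFrame({
    "Question": ["Q1?", "Q1?", "Q2?", "Q2?", "Q3?"],
    "Answer": ["a", "b", "c", "c", "d"],
})
conflicts = questions_with_conflicting_answers(toy)
print(conflicts)
```

Running it as `questions_with_conflicting_answers(df_clean)` would show how many questions lose an alternative answer in the deduplication step.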


In [7]:
df_unique.describe()

Unnamed: 0,qtype,Question,Answer
count,14979,14979,14979
unique,16,14979,14443
top,information,Who is at risk for Lymphocytic Choriomeningiti...,This condition is inherited in an autosomal re...
freq,3822,1,348


#### Generate a unique ID for each record

In [8]:
import hashlib

def generate_document_id(doc):
    """Return an 8-character hex ID built from the question and the first 15 characters of the answer."""
    combined = f"{doc['Question']}-{doc['Answer'][:15]}"
    hash_hex = hashlib.md5(combined.encode()).hexdigest()
    return hash_hex[:8]

In [9]:
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
df_unique = df_unique.copy()
df_unique["id"] = df_unique.apply(generate_document_id, axis=1)
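
An 8-character hex prefix carries only 32 bits, so two different records can end up with the same ID. Assuming the hash prefixes are uniformly distributed, a birthday-bound estimate for 14,979 records gives a collision chance on the order of a few percent, which makes it worth checking before the IDs are used as keys. The sketch below uses hypothetical helpers `short_id` and `find_collisions` on toy records; it mirrors the ID scheme above but is not part of the notebook.

```python
import hashlib
from collections import Counter


def short_id(question: str, answer: str, length: int = 8) -> str:
    """Same scheme as above: MD5 of question plus a 15-character answer prefix."""
    combined = f"{question}-{answer[:15]}"
    return hashlib.md5(combined.encode()).hexdigest()[:length]


def find_collisions(ids):
    """Return a dict of IDs that occur more than once, with their counts."""
    counts = Counter(ids)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Toy records: the third repeats the first, so its ID collides by construction
records = [("Q1?", "answer one"), ("Q2?", "answer two"), ("Q1?", "answer one")]
ids = [short_id(q, a) for q, a in records]
collisions = find_collisions(ids)
print(collisions)
```

On the real data, `df_unique["id"].is_unique` gives the same yes/no answer in one call; if it is `False`, a longer prefix is one option.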


In [10]:
# Save cleaned CSV
df_unique.to_csv("./../dataset/medical_qa_with_id.csv", index=False)

In [11]:
df_unique

Unnamed: 0,qtype,Question,Answer,id
0,susceptibility,Who is at risk for Lymphocytic Choriomeningiti...,LCMV infections can occur after exposure to fr...,f72c0d85
1,symptoms,What are the symptoms of Lymphocytic Choriomen...,LCMV is most commonly recognized as causing ne...,9e8711f0
3,exams and tests,How to diagnose Lymphocytic Choriomeningitis (...,"During the first phase of the disease, the mos...",261d4d14
4,treatment,What are the treatments for Lymphocytic Chorio...,"Aseptic meningitis, encephalitis, or meningoen...",9e68d9a5
5,prevention,How to prevent Lymphocytic Choriomeningitis (L...,LCMV infection can be prevented by avoiding co...,2b1db317
...,...,...,...,...
16402,symptoms,What are the symptoms of Familial visceral myo...,What are the signs and symptoms of Familial vi...,78c3def4
16403,information,What is (are) Pseudopelade of Brocq?,Pseudopelade of Brocq (PBB) is a slowly progre...,1cac2b57
16404,symptoms,What are the symptoms of Pseudopelade of Brocq?,What are the signs and symptoms of Pseudopelad...,6e42cd46
16405,treatment,What are the treatments for Pseudopelade of Br...,Is there treatment or a cure for pseudopelade ...,1e5e6716


### Save the documents as a JSON file

In [12]:
documents = []

for _, row in df_unique.iterrows():
    doc = {
        "answer": row["Answer"].strip(),
        "question": row["Question"].strip(),
        "qtype": row["qtype"].strip(),
        "id": row["id"],
    }
    documents.append(doc)

final_data = [
    {
        "document_info": "Comprehensive Medical Q&A Dataset",
        "documents": documents
    }
]
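
Since the text columns were already stripped by `clean_text`, the same list of records can likely be built without an explicit loop. The following is an alternative sketch using `DataFrame.to_dict(orient="records")`; `build_documents` is a hypothetical helper and assumes the frame has the `qtype`, `Question`, `Answer`, and `id` columns.

```python
import pandas as pd


def build_documents(df: pd.DataFrame) -> list:
    """Convert rows to dicts with the same keys and key order as the loop above."""
    renamed = df.rename(columns={"Answer": "answer", "Question": "question"})
    return renamed[["answer", "question", "qtype", "id"]].to_dict(orient="records")


toy = pd.DataFrame({
    "qtype": ["symptoms"],
    "Question": ["What are the symptoms of X?"],
    "Answer": ["Fever and headache."],
    "id": ["abcd1234"],
})
toy_documents = build_documents(toy)
print(toy_documents)
```

For a frame of this size the loop is fine too; the vectorized form mainly removes boilerplate.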

In [13]:
# Save JSON file
with open("./../dataset/medical_qa_documents_with_id.json", "w", encoding="utf-8") as f_out:
    json.dump(final_data, f_out, indent=2, ensure_ascii=False)

print("Saved cleaned JSON at ./../dataset/medical_qa_documents_with_id.json")


Saved cleaned JSON at ./../dataset/medical_qa_documents_with_id.json
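
A quick round-trip check can confirm that the file loads and that every document has the expected keys. This is a minimal sketch that assumes the nested layout written above (a list of groups, each with a `documents` list); `load_documents` and `check_documents` are hypothetical helpers.

```python
import json


def load_documents(json_path):
    """Flatten the list of groups into a single list of documents."""
    with open(json_path, encoding="utf-8") as f:
        groups = json.load(f)
    return [doc for group in groups for doc in group["documents"]]


def check_documents(docs, required=("answer", "question", "qtype", "id")):
    """Return the indices of documents that are missing any required key."""
    return [i for i, doc in enumerate(docs) if not all(k in doc for k in required)]


# Possible usage in this notebook:
# docs = load_documents("./../dataset/medical_qa_documents_with_id.json")
# assert len(docs) == len(df_unique) and not check_documents(docs)
```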


In [14]:
!head ./../dataset/medical_qa_documents_with_id.json


[
  {
    "document_info": "Comprehensive Medical Q&A Dataset",
    "documents": [
      {
        "answer": "LCMV infections can occur after exposure to fresh urine, droppings, saliva, or nesting materials from infected rodents. Transmission may also occur when these materials are directly introduced into broken skin, the nose, the eyes, or the mouth, or presumably, via the bite of an infected rodent. Person-to-person transmission has not been reported, with the exception of vertical transmission from infected mother to fetus, and rarely, through organ transplantation.",
        "question": "Who is at risk for Lymphocytic Choriomeningitis (LCM)?",
        "qtype": "susceptibility",
        "id": "f72c0d85"
      },
