# Medical Question-Answer Dataset Curation Using Label Studio

In this tutorial, we'll explore the process of curating a dataset for a Generative AI model, specifically focusing on medical question and answer generation using Label Studio. This involves setting up projects in Label Studio, importing datasets, and configuring tasks for question and answer generation.

## Installation
To set up your environment, install the required libraries using pip:

In [32]:
!pip install -r requirements.txt



## Setup Label Studio
Start Label Studio and connect to it with the SDK. You can retrieve an API key from the user settings. 

```
$ label-studio
```

In [33]:
# Import the SDK and the client module
from label_studio_sdk import Client

# Define the URL where Label Studio is accessible and the API key for your user account
LABEL_STUDIO_URL = 'http://localhost:8080'
API_KEY = '<YOUR_API_KEY>'  # Replace with the API key from your Label Studio user settings

# Connect to the Label Studio API and check the connection
client = Client(url=LABEL_STUDIO_URL, api_key=API_KEY)
client.check_connection()

{'status': 'UP'}
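
Rather than pasting the key into the notebook, the connection settings could be read from the environment. The following is a minimal sketch; the variable names `LABEL_STUDIO_URL` and `LABEL_STUDIO_API_KEY` and the helper `load_label_studio_settings` are assumptions made for illustration and are not required by the SDK.

```python
import os


def load_label_studio_settings(env=os.environ):
    """Return (url, api_key) read from an environment-style mapping.

    The variable names used here are hypothetical conventions.
    """
    url = env.get("LABEL_STUDIO_URL", "http://localhost:8080")
    api_key = env.get("LABEL_STUDIO_API_KEY")
    if not api_key:
        raise RuntimeError("Set LABEL_STUDIO_API_KEY before connecting.")
    return url, api_key
```

The returned values can then be passed to `Client(url=..., api_key=...)` exactly as in the cell above.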

## MedChat Dataset
In this section, we will import the [MedChat QA dataset](https://huggingface.co/datasets/ngram/medchat-qa) from Hugging Face datasets and set up a Label Studio project to curate it.

In [37]:
from datasets import load_dataset

medchat_dataset = load_dataset("ngram/medchat-qa")

In [38]:
medchat_dataset['train'][1]

{'question': 'What is one of the methods of administering this drug, as mentioned in the section?',
 'answer': 'For intravenous infusion into a peripheral or central vein.'}

### Creating a Label Studio Project
Next, create a new project in Label Studio for this dataset:

In [39]:
medchat_project = client.start_project(
    title='Project 1: MedChat',
    color='#008000',
    label_config='''
<View className="root">
  <Style>
  .root {
    font-family: 'Roboto', sans-serif;
    line-height: 1.6;
    background-color: #f0f0f0;
  }
  .container {
    margin: 0 auto;
    padding: 20px;
    background-color: #ffffff;
    border-radius: 5px;
    box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.1), 0 6px 20px 0 rgba(0, 0, 0, 0.1);
  }
  .prompt {
    padding: 20px;
    background-color: #0084ff;
    color: #ffffff;
    border-radius: 5px;
    margin-bottom: 20px;
    box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
  }
  .prompt-input {
    flex-basis: 49%;
    padding: 20px;
    background-color: rgba(44, 62, 80, 0.9);
    color: #ffffff;
    border-radius: 5px;
    box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
    width: 100%;
    border: none;
    font-family: 'Roboto', sans-serif;
    font-size: 16px;
    outline: none;
  }
  .prompt-input:focus {
    outline: none;
  }
  .prompt-input:hover {
    background-color: rgba(52, 73, 94, 0.9);
    cursor: pointer;
    transition: all 0.3s ease;
  }
  .lsf-richtext__line:hover {
    background: unset;
  }
  </Style>
  <Text name="chat" value="$question" layout="dialogue" />
  <Header value="Answer:"/>
  <Text name="summary" value="$answer"/>
  <Header value="User prompt:" />
  <View className="prompt">
  <TextArea name="prompt" toName="chat" rows="4" editable="true" maxSubmissions="1" showSubmitButton="true" />
  </View>
  <Header value="Bot answer:"/>
    <TextArea name="response" toName="chat" rows="4" editable="false" maxSubmissions="1" showSubmitButton="false" />
</View>
    '''
)

This project setup also includes the capability to configure the Label Studio ML Backend for using an LLM to help label the data. 

To set up the ML Backend, see the [LLM Interactive Example](https://github.com/HumanSignal/label-studio-ml-backend/tree/master/label_studio_ml/examples/llm_interactive).

```bash
git clone https://github.com/HumanSignal/label-studio-ml-backend.git
cd label-studio-ml-backend/label_studio_ml/examples/llm_interactive
```

We can edit `docker-compose.yml` to point to the model we are interested in. For instance, to use Llama 3 running with Ollama, we can set the following environment variables:

```yaml
OPENAI_PROVIDER=ollama
OPENAI_MODEL=llama3
OLLAMA_ENDPOINT=http://host.docker.internal:11434/v1/
```

Insert the data into the Label Studio project. 

In [40]:
# task = medchat_dataset['train'][1]
medchat_tasks = []
for t in medchat_dataset['train']:
    medchat_tasks.append(t)
medchat_project.import_tasks(medchat_tasks) 

[404575,
 404576,
 404577,
 ...]

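Notice that even this curated dataset contains examples that may not be desirable. The sample shown earlier, for instance, asks about "this drug, as mentioned in the section", which cannot be answered without the source text it refers to.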
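
One way to surface such examples before manual review is a simple keyword heuristic. The sketch below is only an illustration: the phrase list in `CONTEXT_PATTERN` and the helper `needs_context` are assumptions, and a real curation pass would still rely on human judgment in Label Studio.

```python
import re

# Phrases suggesting the question points at text the reader cannot see.
# The word list is illustrative, not exhaustive.
CONTEXT_PATTERN = re.compile(
    r"\b(this|the)\s+(drug|section|study|article|passage|text|medication)\b",
    re.IGNORECASE,
)


def needs_context(question: str) -> bool:
    """Return True if the question appears to depend on missing context."""
    return bool(CONTEXT_PATTERN.search(question))


examples = [
    "What is one of the methods of administering this drug, as mentioned in the section?",
    "How is hypertension diagnosed?",
]
flagged = [q for q in examples if needs_context(q)]
```

Here `flagged` contains only the first question, which could then be prioritized for review or removal.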

## Question Generation with MeDAL 
The [MeDAL dataset](https://huggingface.co/datasets/medal) is a comprehensive medical text collection curated specifically for abbreviation disambiguation, aiding in natural language understanding pre-training within the medical domain.

We can use this dataset to provide context for creating a synthetic Q&A dataset. We'll first start with a Label Studio project to generate questions. 

Set up a new project for question generation: 

In [41]:
medal_questions_project = client.start_project(
    title='Project 2: MeDAL Question Generation',
    label_config='''
<View className="root">
  <Style>
  .root {
    font-family: 'Roboto', sans-serif;
    line-height: 1.6;
    background-color: #f0f0f0;
  }
  .container {
    margin: 0 auto;
    padding: 20px;
    background-color: #ffffff;
    border-radius: 5px;
    box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.1), 0 6px 20px 0 rgba(0, 0, 0, 0.1);
  }
  .prompt {
    padding: 20px;
    background-color: #0084ff;
    color: #ffffff;
    border-radius: 5px;
    margin-bottom: 20px;
    box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
  }
  .prompt-input {
    flex-basis: 49%;
    padding: 20px;
    background-color: rgba(44, 62, 80, 0.9);
    color: #ffffff;
    border-radius: 5px;
    box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
    width: 100%;
    border: none;
    font-family: 'Roboto', sans-serif;
    font-size: 16px;
    outline: none;
  }
  .prompt-input:focus {
    outline: none;
  }
  .prompt-input:hover {
    background-color: rgba(52, 73, 94, 0.9);
    cursor: pointer;
    transition: all 0.3s ease;
  }
  .lsf-richtext__line:hover {
    background: unset;
  }
  </Style>
  <Text name="chat" value="$text" layout="dialogue"/>
  <Header value="Question prompt:"/>
  <View className="prompt">
    <TextArea name="prompt" toName="chat" rows="4" editable="true" maxSubmissions="1" showSubmitButton="false"/>
  </View>
  <Header value="Proposed questions:"/>
  <TextArea name="response" toName="chat" rows="3" editable="true" maxSubmissions="1" showSubmitButton="false"/>
</View>
    '''
)

Load the dataset and insert it into Label Studio. The dataset is quite large, so we'll only load a few examples first. 

In [42]:
medal_dataset = load_dataset("medal", split='train')

In [43]:
# Insert 10 examples into Label Studio
num_examples = 10
for i in range(num_examples):
    task = medal_dataset[i]
    medal_questions_project.import_tasks(task)
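
When importing more than a handful of examples, it may be preferable to send tasks in batches instead of one request per task. The helper below is a sketch under that assumption; `chunked` and the batch size of 4 are hypothetical choices, and the commented loop shows where the import call from the cell above would go.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in tasks; in the notebook these would be rows of the MeDAL dataset.
tasks = [{"text": f"example {i}"} for i in range(10)]
batches = list(chunked(tasks, 4))

# for batch in batches:
#     medal_questions_project.import_tasks(batch)
```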

The quality of the generated questions depends heavily on the prompt. Here is a useful prompt for generating medical questions from MeDAL examples:

```text
Given a block of medical text, generate several direct, succinct, and unique questions that stand alone, focusing on extracting specific medical information such as symptoms, diagnosis, treatment options, or patient management strategies. Each question should aim to elicit precise and informative responses without requiring additional context. The questions should cover diverse aspects of the medical content to ensure a comprehensive understanding. Ensure each question is clear and formulated to be self-contained. Here are examples to guide your question generation:

What are the common symptoms associated with [specific condition]?
How is [specific condition] diagnosed?
What treatment options are available for [specific condition]?
What are the potential side effects of [specific medication]?
What preventive measures are recommended for [specific condition]?

Use these examples as a template, tailoring questions to different parts of the text to maximize the dataset's utility and accuracy. Questions should be separated by a new line and not prefixed by any markers or numbers.
```

## Answer Generation with MeDAL
The final step involves setting up a project for answer generation using the questions created in the previous step.

We'll set up a project, export the questions generated in the previous section, and generate answers in Label Studio.

In [44]:
medal_anwers_project = client.start_project(
    title='Project 3: MeDAL Answer Generation',
    label_config='''
<View className="root">
  <Style>
  .root {
    font-family: 'Roboto', sans-serif;
    line-height: 1.6;
    background-color: #f0f0f0;
  }
  .container {
    margin: 0 auto;
    padding: 20px;
    background-color: #ffffff;
    border-radius: 5px;
    box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.1), 0 6px 20px 0 rgba(0, 0, 0, 0.1);
  }
  .prompt {
    padding: 20px;
    background-color: #0084ff;
    color: #ffffff;
    border-radius: 5px;
    margin-bottom: 20px;
    box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
  }
  .prompt-input {
    flex-basis: 49%;
    padding: 20px;
    background-color: rgba(44, 62, 80, 0.9);
    color: #ffffff;
    border-radius: 5px;
    box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
    width: 100%;
    border: none;
    font-family: 'Roboto', sans-serif;
    font-size: 16px;
    outline: none;
  }
  .prompt-input:focus {
    outline: none;
  }
  .prompt-input:hover {
    background-color: rgba(52, 73, 94, 0.9);
    cursor: pointer;
    transition: all 0.3s ease;
  }
  .lsf-richtext__line:hover {
    background: unset;
  }
  </Style>
  <Text name="chat" value="$text" layout="dialogue"/>
  <Header value="Answer prompt:"/>
  <View className="prompt">
    <TextArea name="prompt" toName="chat" rows="4" editable="true" maxSubmissions="1" showSubmitButton="false"/>
  </View>
  <Header value="Proposed answer:"/>
  <TextArea name="response" toName="chat" rows="3" editable="true" maxSubmissions="1" showSubmitButton="false"/>
</View>
    '''
)

Export questions from our previous project. 

In [45]:
# Download questions from Label Studio
questions_tasks = medal_questions_project.get_labeled_tasks()
len(questions_tasks)

4

In [46]:
questions_tasks[0]['annotations'][0]['result'][1]

{'value': {'text': ['**Question:**\n\nWhich section of this medical article provides information on the potential benefits of svap for preventing cardiac fibrosis from pressure overload?\n\n**Answer:** The section that discusses the effects of svap in cultured cardiac fibroblasts suggests that it may have antifibrotic properties by blocking TGFβ signaling.\n\n**Question:**\n\nWhat are the specific mechanisms through which svap is thought to prevent cardiac fibrosis from pressure overload, according to this medical article?\n\n**Answer:** The section entitled "Discussion" provides more details on how svap works by blocking TGFβ and its downstream signaling pathways in cardiac fibroblasts.\n\n**Question:**\n\nWhich section of the article describes the effects of angiotensin II (Ang II) on collagen synthesis and α-smooth muscle actin-positive CFs?\n\n**Answer:** The section that discusses the effect of Ang II treatment in cultured cardiac fibroblasts suggests that it enhances prolifera

Format as a Hugging Face dataset.

In [47]:
import re
from datasets import Dataset
# Extract questions
def extract_questions_data(questions_tasks):
    data = []
    for task in questions_tasks:
        for result in task['annotations'][0]['result']:
            if result['from_name'] == 'response':
                # Extract the abstract_id
                abstract_id = task['data']['abstract_id']
                
                # Extract the question text and split by newlines to handle multiple questions
                questions = result['value']['text'][0].split('\n')
                
                # Store each question with its corresponding abstract_id
                for question in questions:
                    # Check if the question is not empty and contains at least one alphanumeric character
                    if question.strip() and re.search('[a-zA-Z0-9]', question):
                        data.append({'abstract_id': abstract_id, 'text': question})
                break
    return data

extracted_questions_data = extract_questions_data(questions_tasks)

# Build a Hugging Face dataset with one row per extracted question
questions_dataset = Dataset.from_dict({
    'abstract_id': [item['abstract_id'] for item in extracted_questions_data],
    'text': [item['text'] for item in extracted_questions_data],
})

Review our dataset and insert it into our answers project. 

In [48]:
questions_dataset

Dataset({
    features: ['abstract_id', 'text'],
    num_rows: 37
})
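
The model output shown earlier mixes `**Question:**` markers and `**Answer:**` lines in with the questions, so splitting on newlines alone keeps some rows that are not questions. A stricter filter could look like the sketch below; `keep_question_lines` is a hypothetical helper, the rule "skip lines starting with `**`, keep lines ending in `?`" is an assumption about the output format, and the sample string is made up to mirror that format.

```python
def keep_question_lines(raw_text):
    """Keep only lines that look like standalone questions."""
    questions = []
    for line in raw_text.split("\n"):
        line = line.strip()
        # Skip markdown markers such as "**Question:**" and "**Answer:** ..." lines
        if line.startswith("**"):
            continue
        if line.endswith("?"):
            questions.append(line)
    return questions


sample = (
    "**Question:**\n\n"
    "How is svap thought to prevent cardiac fibrosis?\n\n"
    "**Answer:** It may block TGFβ signaling."
)
filtered = keep_question_lines(sample)
```

Applied inside the extraction loop, a filter like this would reduce the number of rows that need to be discarded by hand in the answers project.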

In [49]:
# Upload the dataset to our Answers Project
for question in questions_dataset: 
    medal_anwers_project.import_tasks(question)

As with question generation, we need a strong prompt for generating answers to these questions. Here is a sample prompt:

```text
You are a medical expert tasked with providing the most accurate and succinct answers to specific questions based on detailed medical data. Focus on precision and directness in your responses, ensuring that each answer is factual, concise, and to the point. Avoid unnecessary elaboration and prioritize accuracy over sounding confident. Here are some guidelines for your responses:

- Provide clear, direct answers without filler or extraneous details.
- Base your responses solely on the information available in the medical text provided.
- Ensure that your answers are straightforward and easy to understand, yet medically accurate.
- Avoid speculative or generalized statements that are not directly supported by the text.

Use these guidelines to formulate your answers to the questions presented. 
```

## Curate Q&A Dataset
Once question-answer pairs are generated and refined, download the dataset. For a more robust use case, we would also perform any necessary post-processing here such as formatting and anonymization. 

Finally, upload the curated dataset to Hugging Face, ensuring it meets their standards for public datasets. 

In [50]:
# Download answers from Label Studio
answers_tasks = medal_anwers_project.get_labeled_tasks()
len(answers_tasks)

1

In [51]:
# Extract question-answer pairs
def extract_answers_data(answers_tasks):
    data = []
    for task in answers_tasks:
        for result in task['annotations'][0]['result']:
            if result['from_name'] == 'response':
                # Extract the abstract_id
                abstract_id = task['data']['abstract_id']
                
                # The answer comes from the annotation; the question is the task's input text
                answer = result['value']['text'][0]
                question = task['data']['text']
                
                # Store each question-answer pair with its corresponding abstract_id
                data.append({'abstract_id': abstract_id, 'question': question, 'answer': answer})
    return data

extracted_answers_data = extract_answers_data(answers_tasks)

qa_dataset = Dataset.from_dict({'abstract_id': [item['abstract_id'] for item in extracted_answers_data], 
                             'question': [item['question'] for item in extracted_answers_data],
                             'answer': [item['answer'] for item in extracted_answers_data]})

In [52]:
qa_dataset[0]

{'abstract_id': 14145090,
 'question': 'What are the specific mechanisms through which svap is thought to prevent cardiac fibrosis from pressure overload, according to this medical article?',
 'answer': '**Source Text**:\n\n"What are the specific mechanisms through which svap is thought to prevent cardiac fibrosis from pressure overload, according to this medical article?"\n\n**Response**:\n\nSVAP inhibits JNK1/2 and NF-kappaB signaling pathways by binding to their upstream activator MEK3/6. It also reduces levels of TGF-beta, a major fibroblast growth factor, in the myocardium. SVAP can inhibit the production of extracellular matrix proteins such as collagen I and fibronectin that are associated with cardiac fibrosis.'}
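
The answer above still carries the model's `**Source Text**` and `**Response**:` wrapper, which is the kind of formatting issue post-processing should remove. The following is a sketch of one way to do that; `strip_response_wrapper` is a hypothetical helper, it assumes the wrapper always uses the literal `**Response**:` marker, and the `raw` string is a shortened stand-in for the real output.

```python
import re


def strip_response_wrapper(answer: str) -> str:
    """Return the text after a '**Response**:' marker, or the whole answer if absent."""
    parts = re.split(r"\*\*Response\*\*:\s*", answer, maxsplit=1)
    return parts[-1].strip()


raw = (
    '**Source Text**:\n\n"What are the mechanisms?"\n\n'
    "**Response**:\n\nSVAP reduces levels of TGF-beta."
)
cleaned = strip_response_wrapper(raw)
```

Answers without the marker pass through unchanged apart from whitespace trimming, so the helper can be applied to every row.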

In [None]:
qa_dataset.push_to_hub("<HF_Username>/med-qa")
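
Before pushing, it can help to run a basic sanity check over the pairs. The sketch below works on plain dictionaries with the same keys as `qa_dataset`; `validate_pairs` and its rules (drop empty fields, drop exact duplicates) are assumptions about what "meets the standards" might mean here, not a Hugging Face requirement.

```python
def validate_pairs(rows):
    """Drop rows with empty fields and exact duplicate question-answer pairs."""
    seen = set()
    clean = []
    for row in rows:
        question = row["question"].strip()
        answer = row["answer"].strip()
        if not question or not answer:
            continue  # skip incomplete pairs
        if (question, answer) in seen:
            continue  # skip exact duplicates
        seen.add((question, answer))
        clean.append({**row, "question": question, "answer": answer})
    return clean


rows = [
    {"abstract_id": 1, "question": " Q1? ", "answer": "A1"},
    {"abstract_id": 1, "question": "Q1?", "answer": "A1"},
    {"abstract_id": 2, "question": "Q2?", "answer": "  "},
]
checked = validate_pairs(rows)
```

The cleaned rows could then be turned back into a dataset with `Dataset.from_dict`, as in the earlier cells, before uploading.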