## **Synthetic data generation**

For our fine-tuning data, we will use the [NoteChat](https://arxiv.org/abs/2310.15959) dataset [(Hugging Face link)](https://huggingface.co/datasets/akemiH/NoteChat). This dataset contains 167K pairs, each matching a real clinical note extracted from [PMC-Patients](https://arxiv.org/abs/2202.13876) with a generated patient-doctor conversation.

<p align="center">
<img src="figures/notechat.png" width="50%">
</p>

In this notebook, we extend this dataset of pairs to triplets with GPT-3.5: 


(`clinical note`, `dialogue transcript`) $\to$ (`clinical note`, `dialogue transcript`, `patient summary`)

We extract patient summaries by prompting GPT-3.5 with a clinical note, its corresponding dialogue transcript and a comprehensive template of patient features. 

In [2]:
#!pip install -r requirements.txt

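# json and pandas are used in later cells; import them explicitly
# rather than relying on the star import below to provide them.
import json
import pandas as pd

from utils.chat import *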

%reload_ext autoreload
%autoreload 2

### 1. **Load NoteChat data**

We first take a look at the NoteChat dataset. We load it with the Hugging Face `datasets` library and display a few examples.

In [2]:
from datasets import load_dataset

dataset = load_dataset("akemiH/NoteChat")
df = dataset['train'].to_pandas()
df


| Unnamed: 0 | data | conversation |
|---|---|---|
| 0 | This 60-year-old male was hospitalized due to ... | \nDoctor: Hi, Mr. X, I'm Dr. Y. How are you fe... |
| 1 | A 39-year-old man was hospitalized due to an i... | \nDoctor: Hello, I am Dr. Smith. Can you tell ... |
| 2 | One week after a positive COVID-19 result this... | \nDoctor: Good morning, how are you feeling to... |
| 3 | This 69-year-old male was admitted to the ICU ... | Doctor: Good morning, sir. How are you feeling... |
| 4 | This 57-year-old male was admitted to the ICU ... | \nDoctor: Good morning, Mr. Patient. How are y... |
| ... | ... | ... |
| 206996 | A 63-year-old woman with metastatic breast car... | Doctor: Hi, how are you feeling today?\nPatien... |
| 206997 | A 6 years old, neutered male Lhasa Apso was pr... | Doctor: Hello, what brings you in today?\nPati... |
| 206998 | An 8 years old, neutered male mixed breed dog ... | Doctor: Hi, how are you today?\nPatient: I'm n... |
| 206999 | A 4 years old spayed female Doberman Pinscher ... | Doctor: Hello there, how are you feeling today... |


In [3]:
note = df['data'][0]
conversation = df['conversation'][0]  

print('CLINICAL NOTE')
print(note)
print('\n\nCONVERSATION')
print(conversation.replace('\n\n', '\n'))

CLINICAL NOTE
This 60-year-old male was hospitalized due to moderate ARDS from COVID-19 with symptoms of fever, dry cough, and dyspnea. We encountered several difficulties during physical therapy on the acute ward. First, any change of position or deep breathing triggered coughing attacks that induced oxygen desaturation and dyspnea. To avoid rapid deterioration and respiratory failure, we instructed and performed position changes very slowly and step-by-step. In this way, a position change to the 135° prone position () took around 30 minutes. This approach was well tolerated and increased oxygen saturation, for example, on day 5 with 6 L/min of oxygen from 93% to 97%. Second, we had to adapt the breathing exercises to avoid prolonged coughing and oxygen desaturation. Accordingly, we instructed the patient to stop every deep breath before the need to cough and to hold inspiration for better air distribution. In this manner, the patient performed the breathing exercises well and managed
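The transcripts shown above use `Doctor:` and `Patient:` speaker labels separated by newlines. As a minimal sketch, assuming only those two labels occur (other labels are not handled), a transcript can be split into speaker turns like this. The helper name `split_turns` is hypothetical and not part of `utils.chat`.

```python
import re

# Speaker labels observed in the sample output; this is an assumption about the data.
TURN_PATTERN = re.compile(r"^(Doctor|Patient):\s*(.*)$")


def split_turns(conversation: str) -> list[tuple[str, str]]:
    """Split a transcript into (speaker, utterance) pairs."""
    turns = []
    for line in conversation.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between turns
        match = TURN_PATTERN.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
        elif turns:
            # Unlabeled line: treat it as a continuation of the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line}".strip())
        # Unlabeled lines before the first speaker label are dropped.
    return turns


example = (
    "\nDoctor: Hi, how are you feeling today?\n\n"
    "Patient: I'm not feeling well.\nI have a dry cough."
)
for speaker, utterance in split_turns(example):
    print(f"{speaker}: {utterance}")
```

Working with turns rather than raw text makes it easier to inspect or truncate long dialogues before they go into a prompt.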

### 2. **Experiment with extraction of patient summaries**


$$\text{Prompt} + \text{Clinical note} + \text{Dialogue transcript} + \text{Template (with definitions?)} \to \text{Patient summary}$$

- Find best template + prompt to extract patient summaries. 
- Cost analysis using token estimates. 
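The cost analysis can be approximated before any API call is made. The following is a rough sketch, assuming about four characters per token (a rule of thumb for English text, not a real tokenizer) and per-1K-token prices supplied by the caller. The names `estimate_tokens` and `estimate_cost` and the price parameters are hypothetical and not part of `utils.chat`.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not an exact tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count of `text` from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(prompts, expected_output_tokens, input_price_per_1k, output_price_per_1k):
    """Estimate the total cost of sending `prompts`, in the currency of the prices.

    `expected_output_tokens` is a guess at the summary length per prompt.
    """
    input_tokens = sum(estimate_tokens(p) for p in prompts)
    output_tokens = expected_output_tokens * len(prompts)
    return (input_tokens * input_price_per_1k + output_tokens * output_price_per_1k) / 1000


# Placeholder prices for illustration only; substitute the provider's current rates.
sample_prompts = ["a" * 4000, "b" * 2000]
print(estimate_cost(sample_prompts, expected_output_tokens=500,
                    input_price_per_1k=1.0, output_price_per_1k=2.0))
```

For a tighter estimate, the character heuristic could be replaced by the tokenizer of the target model; the rest of the calculation stays the same.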

#### Extraction options

We will have to test out a few options to see which one works best.

- **Clinical note and/or dialogue transcript?** The dialogue is generated from the clinical note, so any information in the dialogue should already be in the note. Is the clinical note alone enough?

- **Zero-shot vs. One-shot?** Do we include an example of (clinical note, dialogue transcript, patient summary) in the prompt? Ideally yes, but it might not fit in the prompt. If we remove the dialogue, it might fit better. 

- **Template definitions?** Do we include definitions of the template in the prompt? Adding the definitions might not help, and might not fit in the prompt. 
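The options above can be compared by building each prompt variant from the same components. This is a minimal sketch under stated assumptions: the function `build_prompt`, its section labels, and its ordering are hypothetical, and it is not necessarily how `extract` in `utils.chat` assembles its prompt.

```python
import json


def build_prompt(instructions, template, note=None, dialogue=None,
                 definitions=None, example=None):
    """Assemble one extraction prompt from the selected components.

    note / dialogue: include either or both (clinical note and/or transcript).
    definitions:     optional dict of template field definitions.
    example:         optional worked example for a one-shot prompt.
    """
    if note is None and dialogue is None:
        raise ValueError("Provide a clinical note, a dialogue transcript, or both.")
    parts = [instructions.strip()]
    if definitions is not None:
        parts.append("Template definitions:\n" + json.dumps(definitions, indent=2))
    parts.append("Template:\n" + json.dumps(template, indent=2))
    if example is not None:
        parts.append("Example:\n" + example.strip())
    if note is not None:
        parts.append("Clinical note:\n" + note.strip())
    if dialogue is not None:
        parts.append("Dialogue transcript:\n" + dialogue.strip())
    return "\n\n".join(parts)


# Zero-shot, note-only variant with a toy template.
prompt = build_prompt(
    "Extract the patient information following the template provided.",
    {"visit motivation": ""},
    note="This 60-year-old male was hospitalized due to moderate ARDS.",
)
print(prompt)
```

Comparing the length of each variant shows which combinations are likely to fit in the model's context window.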

In [4]:
# Load the extraction prompt
instruction_path = 'generation/instructions/instructions.txt'
with open(instruction_path, 'r') as f:
    instructions = f.read()

print(instructions)

Given the provided clinical note and patient-doctor dialogue, extract the patient information following the template provided. 

If a field is not mentioned in the dialogue, simply write "feature": None.



In [5]:
# Load the template
with open('generation/templates/template.json', 'r') as f:
    template = json.load(f)

template

{'visit motivation': '',
 'patient information': {'age': '',
  'gender': '',
  'family medical history': '',
  'recent travels': '',
  'socio economic context': '',
  'occupation': '',
  'exercise frequency': '',
  'nutrition': '',
  'sexual history': '',
  'alcohol consumption': '',
  'drug usage': '',
  'smoking status': ''},
 'current symptoms': [{'name of symptom': '',
   'intensity of symptom': '',
   'location': '',
   'when did the symptom appear': '',
   'temporalisation': ''}],
 'previous diagnostics': [{'name of condition': '',
   'severity': '',
   'prescribed treatment': '',
   'reaction to treatment': ''}],
 'current medication': [{'name of medication': '',
   'dosage': '',
   'frequency': '',
   'duration': '',
   'reason for taking': ''}],
 'patient medical history': {'physiological context': '',
  'psychological context': '',
  'vaccination history': '',
  'recent surgeries': '',
  'allergies': ''},
 'visit conclusion': {'diagnosis': '',
  'prescribed treatment': '',
  

In [6]:
# Load the template definitions
with open('generation/templates/template_definitions.json', 'r') as f:
    template_def = json.load(f)

template_def

{'visit motivation': "Reason for the patient's visit",
 'patient information': {'age': "Patient's age",
  'sex': "Patient's sex",
  'family medical history': 'Information about family medical history',
  'recent travels': "Details about patient's recent travels",
  'socio economic context': "Patient's socioeconomic background",
  'occupation': "Patient's occupation",
  'exercise frequency': "Frequency of patient's exercise activity",
  'nutrition': "Information about patient's nutrition",
  'sexual history': "Relevant details about patient's sexual history",
  'alcohol consumption': "Patient's alcohol consumption habits",
  'drug usage': 'Information about any drugs used by patient',
  'smoking status': "Patient's smoking status"},
 'current symptoms': [{'name of symptom': 'Specific symptom experienced by the patient',
   'intensity of symptom': 'Severity or intensity of the symptom',
   'location': 'Where the symptom is localized',
   'when did the symptom appear': 'Time of onset for 

In [5]:
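model = 'gpt-4'                                                # model name passed to extract()
template_path = 'generation/templates/template_definitions.json'  # template with field definitions
save_path = 'generation/summaries.jsonl'                       # generated summaries, one JSON record per line
keys_path = 'generation/keys.json'                             # presumably the API credentials file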

In [None]:
extract(
    model,
    template_path,
    save_path,
    keys_path,
    dataframe=df,
    use_notes=True, 
    use_dialogues=True)

In [8]:
# Load the generated summaries and keep only rows with a non-empty summary
summaries = pd.read_json(save_path, lines=True)
summaries = summaries[summaries['summary'].str.len() > 0]
for i, row in summaries.iterrows():
    print(f'CLINICAL NOTE {i}')
    print(row['data'])
    print('\n\nCONVERSATION')
    print(row['conversation'].replace('\n\n', '\n'))
    print('\n\nSUMMARY')
    print(row['summary'].replace('\n\n', '\n'))
    print('\n\n')

CLINICAL NOTE 0
This 60-year-old male was hospitalized due to moderate ARDS from COVID-19 with symptoms of fever, dry cough, and dyspnea. We encountered several difficulties during physical therapy on the acute ward. First, any change of position or deep breathing triggered coughing attacks that induced oxygen desaturation and dyspnea. To avoid rapid deterioration and respiratory failure, we instructed and performed position changes very slowly and step-by-step. In this way, a position change to the 135° prone position () took around 30 minutes. This approach was well tolerated and increased oxygen saturation, for example, on day 5 with 6 L/min of oxygen from 93% to 97%. Second, we had to adapt the breathing exercises to avoid prolonged coughing and oxygen desaturation. Accordingly, we instructed the patient to stop every deep breath before the need to cough and to hold inspiration for better air distribution. In this manner, the patient performed the breathing exercises well and manag
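Beyond reading the summaries by eye, each one can be checked against the template structure. The sketch below assumes a summary has already been parsed into Python objects (for example with `json.loads`, which only works if the model returned valid JSON) and reports which template fields are missing. `missing_keys` is a hypothetical helper, not part of `utils.chat`.

```python
import json


def missing_keys(template, summary, prefix=""):
    """Return the paths of template keys that are absent from `summary`."""
    if summary is None:
        return []  # the instructions allow None for fields that are not mentioned
    missing = []
    if isinstance(template, dict):
        if not isinstance(summary, dict):
            return [prefix or "<root>"]
        for key, sub_template in template.items():
            path = f"{prefix}.{key}" if prefix else key
            if key not in summary:
                missing.append(path)
            else:
                missing.extend(missing_keys(sub_template, summary[key], path))
    elif isinstance(template, list) and template:
        if not isinstance(summary, list):
            return [prefix or "<root>"]
        # The template holds one item schema; check every summary item against it.
        for index, item in enumerate(summary):
            missing.extend(missing_keys(template[0], item, f"{prefix}[{index}]"))
    return missing


toy_template = {
    "visit motivation": "",
    "patient information": {"age": "", "sex": ""},
    "current symptoms": [{"name of symptom": ""}],
}
toy_summary = json.loads(
    '{"visit motivation": "cough", "patient information": {"age": "60"},'
    ' "current symptoms": [{"name of symptom": "dry cough"}, {}]}'
)
print(missing_keys(toy_template, toy_summary))
```

Counting missing fields per summary gives a simple quantitative signal for comparing prompting strategies.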

### 3. **Generate triplets**

Once we have a good prompting strategy, we generate triplets for the whole dataset.
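Generating summaries for the whole dataset takes many API calls, so it helps if an interrupted run can resume where it stopped. The following is a minimal sketch of that idea, not the implementation of `extract`. The `summarize` callable stands in for the model call and the `idx` field is a hypothetical addition; the `data`, `conversation`, and `summary` field names follow the columns read from `summaries.jsonl` above.

```python
import json
import os


def load_done_indices(save_path):
    """Collect the row indices already written to the JSONL file, if it exists."""
    done = set()
    if os.path.exists(save_path):
        with open(save_path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    done.add(json.loads(line)["idx"])
    return done


def generate_triplets(rows, summarize, save_path):
    """Append one (note, dialogue, summary) record per unprocessed row.

    rows:      iterable of (clinical note, dialogue transcript) pairs.
    summarize: callable(note, dialogue) -> summary string (the model call).
    Returns the number of records written during this run.
    """
    done = load_done_indices(save_path)
    written = 0
    with open(save_path, "a", encoding="utf-8") as f:
        for idx, (note, dialogue) in enumerate(rows):
            if idx in done:
                continue  # already generated in a previous run
            record = {
                "idx": idx,
                "data": note,
                "conversation": dialogue,
                "summary": summarize(note, dialogue),
            }
            f.write(json.dumps(record) + "\n")
            f.flush()  # keep progress on disk if the run is interrupted
            written += 1
    return written
```

With the dataframe loaded earlier, `rows` could be `zip(df['data'], df['conversation'])`, and `summarize` would wrap whichever prompt and model were selected in the experiments.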
