## **Synthetic data generation**

For our fine-tuning data, we use the [NoteChat](https://arxiv.org/abs/2310.15959) dataset [(Hugging Face link)](https://huggingface.co/datasets/akemiH/NoteChat). This dataset contains 167K pairs, each matching a real clinical note extracted from [PMC-Patients](https://arxiv.org/abs/2202.13876) with a generated patient-doctor conversation.

<p align="center">
<img src="figures/notechat.png" width="50%">
</p>

In this notebook, we extend this dataset of pairs to triplets with GPT-3.5: 


(`clinical note`, `dialogue transcript`) $\to$ (`clinical note`, `dialogue transcript`, `patient summary`)

We extract patient summaries by prompting GPT-3.5 with a clinical note, optionally its corresponding dialogue transcript, and a comprehensive template of patient features.

In [1]:
#!pip install -r requirements.txt

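from utils.chat import *
from datasets import load_dataset

# Explicit imports for the names used in the cells below
# (they may also be re-exported by utils.chat)
import json
import os

import pandas as pd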

%reload_ext autoreload
%autoreload 2

### 1. **Load NoteChat data**

We first take a look at the NoteChat dataset. We load it from the Hugging Face Hub, cache a local JSONL copy, and display a few examples.

In [2]:
df_path = 'data/NoteChat.jsonl'

if not os.path.exists(df_path):
    # First run: download from the Hugging Face Hub and cache locally
    dataset = load_dataset("akemiH/NoteChat")
    data = dataset['train'].to_pandas()
    data.to_json(df_path, orient='records', lines=True)
else:
    # Later runs: read the cached copy
    data = pd.read_json(df_path, orient='records', lines=True)

data


| | data | conversation |
|---|---|---|
| 0 | This 60-year-old male was hospitalized due to ... | \nDoctor: Hi, Mr. X, I'm Dr. Y. How are you fe... |
| 1 | A 39-year-old man was hospitalized due to an i... | \nDoctor: Hello, I am Dr. Smith. Can you tell ... |
| 2 | One week after a positive COVID-19 result this... | \nDoctor: Good morning, how are you feeling to... |
| 3 | This 69-year-old male was admitted to the ICU ... | Doctor: Good morning, sir. How are you feeling... |
| 4 | This 57-year-old male was admitted to the ICU ... | \nDoctor: Good morning, Mr. Patient. How are y... |
| ... | ... | ... |
| 206996 | A 63-year-old woman with metastatic breast car... | Doctor: Hi, how are you feeling today?\nPatien... |
| 206997 | A 6 years old, neutered male Lhasa Apso was pr... | Doctor: Hello, what brings you in today?\nPati... |
| 206998 | An 8 years old, neutered male mixed breed dog ... | Doctor: Hi, how are you today?\nPatient: I'm n... |
| 206999 | A 4 years old spayed female Doberman Pinscher ... | Doctor: Hello there, how are you feeling today... |


In [3]:
note = data['data'][0].replace('. ', '.\n')
conversation = data['conversation'][0].strip()  

print('CLINICAL NOTE')
print(note)
print('\n\nCONVERSATION')
print(conversation.replace('\n\n', '\n'))

CLINICAL NOTE
This 60-year-old male was hospitalized due to moderate ARDS from COVID-19 with symptoms of fever, dry cough, and dyspnea.
We encountered several difficulties during physical therapy on the acute ward.
First, any change of position or deep breathing triggered coughing attacks that induced oxygen desaturation and dyspnea.
To avoid rapid deterioration and respiratory failure, we instructed and performed position changes very slowly and step-by-step.
In this way, a position change to the 135° prone position () took around 30 minutes.
This approach was well tolerated and increased oxygen saturation, for example, on day 5 with 6 L/min of oxygen from 93% to 97%.
Second, we had to adapt the breathing exercises to avoid prolonged coughing and oxygen desaturation.
Accordingly, we instructed the patient to stop every deep breath before the need to cough and to hold inspiration for better air distribution.
In this manner, the patient performed the breathing exercises well and managed

### 2. **Experiment with extraction of patient summaries**


$$\text{Prompt} + \text{Clinical note} + \text{Template (with definitions?)} \to \text{Patient summary}$$

We will have to test out a few options to see which one works best.

- **Clinical note and/or dialogue transcript?** The dialogue is generated from the clinical note, so any information in the dialogue should already be in the note. We therefore use only the clinical note.

- **Zero-shot vs. one-shot?** Should the prompt include an example of (clinical note, dialogue transcript, patient summary)? Ideally yes, but the example might not fit in the context window, even though removing the dialogue frees some room. We use zero-shot prompting to stay within the context limit.

- **Template definitions?** Should the prompt include a definition for each template field? The definitions might not help, and they add to the prompt length.

After a few tests, we choose to generate summaries from clinical notes only, zero-shot, using the template with definitions.
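
To make the chosen configuration concrete, here is a minimal sketch of how a zero-shot prompt could be assembled from the instructions, the template with definitions, and a clinical note. The function `build_prompt` and the prompt layout (the `Template:` and `Clinical note:` headers) are hypothetical and chosen for illustration; the actual prompt construction lives in `utils.chat` and may differ.

```python
import json


def build_prompt(instructions: str, template_def: dict, note: str) -> str:
    """Assemble a zero-shot extraction prompt (hypothetical layout).

    No worked example is included (zero-shot), and the dialogue
    transcript is deliberately left out.
    """
    template_str = json.dumps(template_def, indent=2)
    return (
        f"{instructions.strip()}\n\n"
        f"Template:\n{template_str}\n\n"
        f"Clinical note:\n{note.strip()}"
    )


# Toy inputs for illustration only (not the real instructions or template)
toy_instructions = "Extract the patient summary following the template."
toy_template = {"visit motivation": "Reason for the patient's visit"}
toy_note = "This 60-year-old male was hospitalized due to moderate ARDS."

prompt = build_prompt(toy_instructions, toy_template, toy_note)
print(prompt)
```

Serializing the definitions as indented JSON keeps the nesting of the template visible to the model, at the cost of a longer prompt.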

In [4]:
# Load the extraction prompt
instruction_path = 'generation/instructions/instructions.txt'
with open(instruction_path, 'r') as f:
    instructions = f.read()

print(instructions)

Given the provided clinical note, extract the corresponding patient summary following the template provided. 
Be as thorough as possible to extract all the information from the clinical note, but do not add any new information. 
If a field is not mentioned in the dialogue, simply write "feature": None.


In [5]:
# Load the template
with open('generation/templates/template.json', 'r') as f:
    template = json.load(f)

# Load the template definitions
with open('generation/templates/template_definitions.json', 'r') as f:
    template_def = json.load(f)

template_def

{'visit motivation': "Reason for the patient's visit",
 'hospitalization': [{'reason': 'Reason for hospitalization',
   'date': 'Date of hospitalization',
   'duration': "Length of patient's hospital stay",
   'department': 'Department of hospital where patient was hospitalized'}],
 'patient information': {'age': "Patient's age",
  'sex': "Patient's sex",
  'ethnicity': "Patient's ethnicity",
  'weight': "Patient's weight",
  'height': "Patient's height",
  'family medical history': 'Information about family medical history',
  'recent travels': "Details about patient's recent travels",
  'socio economic context': "Patient's socioeconomic background",
  'occupation': "Patient's occupation"},
 'patient medical history': {'physiological context': 'Relevant physiological history of the patient',
  'psychological context': 'Relevant psychological history of the patient',
  'vaccination history': 'History of vaccinations received by the patient',
  'recent surgeries': 'Details about any rec

In [6]:
model = 'gpt-4'
template_path = 'generation/templates/template_definitions.json'
instruction_path = 'generation/instructions/instructions.txt'
data_path = 'data/NoteChat.jsonl'
save_path = 'generation/summaries.jsonl'
keys_path = 'generation/keys.json'

In [7]:
extract(
    model,
    template_path,
    instruction_path,
    data_path,
    save_path,
    keys_path=keys_path,
    use_notes=True, 
    use_dialogues=False, 
    batch_size=4)

  0%|          | 0/51751 [00:00<?, ?it/s]

Created 4 partitions with token number:[1110, 1129, 1086, 1088]
Partition 1/4: 1 points and 1110 tokens
.
Break for 20 seconds.
End of break.
Partition 2/4: 1 points and 1129 tokens
.
Break for 18 seconds.
End of break.
Partition 3/4: 1 points and 1086 tokens
.
Break for 0.01 seconds.
End of break.
Partition 4/4: 1 points and 1088 tokens


Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x4537f6220>



Chat TimeOut
Retrying chat (TimeOut) (1/5)...

Chat TimeOut
Retrying chat (TimeOut) (2/5)...
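
The log above suggests that `extract` groups prompts into partitions by token count and pauses between partitions. As a rough sketch of that idea, and not the actual implementation in `utils.chat`, a greedy partitioner under a token budget could look like this. The function name `partition_by_tokens` and the budget of 2000 tokens are assumptions made for illustration.

```python
def partition_by_tokens(token_counts, max_tokens):
    """Greedily group item indices so each partition stays within max_tokens.

    An item larger than the budget still gets a partition of its own.
    """
    partitions, current, current_tokens = [], [], 0
    for idx, n_tokens in enumerate(token_counts):
        # Close the current partition if adding this item would exceed the budget
        if current and current_tokens + n_tokens > max_tokens:
            partitions.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n_tokens
    if current:
        partitions.append(current)
    return partitions


# Token counts taken from the log above; the budget is an assumed value
counts = [1110, 1129, 1086, 1088]
parts = partition_by_tokens(counts, max_tokens=2000)
print(parts)
```

With this assumed budget, each of the four prompts lands in its own partition, which matches the "1 points" per partition reported in the log. A larger budget would pack several prompts together.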


In [15]:
# Show sample results
summaries = pd.read_json(save_path, lines=True)
summaries = summaries[summaries['summary'].str.len() > 0]
for i, row in summaries.iterrows():
    print(f'CLINICAL NOTE {i}')
    print(row['data'].replace('. ', '.\n'))
    #print('\n\nCONVERSATION')
    #print(row['conversation'].replace('\n\n', '\n'))
    print('\n\nSUMMARY')
    print(row['summary'].replace('\n\n', '\n'))
    print('\n\n')

CLINICAL NOTE 0
This 60-year-old male was hospitalized due to moderate ARDS from COVID-19 with symptoms of fever, dry cough, and dyspnea.
We encountered several difficulties during physical therapy on the acute ward.
First, any change of position or deep breathing triggered coughing attacks that induced oxygen desaturation and dyspnea.
To avoid rapid deterioration and respiratory failure, we instructed and performed position changes very slowly and step-by-step.
In this way, a position change to the 135° prone position () took around 30 minutes.
This approach was well tolerated and increased oxygen saturation, for example, on day 5 with 6 L/min of oxygen from 93% to 97%.
Second, we had to adapt the breathing exercises to avoid prolonged coughing and oxygen desaturation.
Accordingly, we instructed the patient to stop every deep breath before the need to cough and to hold inspiration for better air distribution.
In this manner, the patient performed the breathing exercises well and manag

### 3. **Generate triplets**

Once we have a good prompting strategy, we generate triplets for the whole dataset.
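
As a minimal sketch of the final assembly step, the generated summaries can be filtered and reduced to the three triplet columns. This assumes the saved file carries the `data`, `conversation` and `summary` columns used in the cells above; the helper `make_triplets` is hypothetical.

```python
import pandas as pd


def make_triplets(summaries: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with a non-empty summary and return the triplet columns."""
    has_summary = summaries["summary"].fillna("").str.len() > 0
    columns = ["data", "conversation", "summary"]
    return summaries.loc[has_summary, columns].reset_index(drop=True)


# Toy frame for illustration only
toy = pd.DataFrame({
    "data": ["note A", "note B", "note C"],
    "conversation": ["dialogue A", "dialogue B", "dialogue C"],
    "summary": ['{"visit motivation": "cough"}', "", None],
})

triplets = make_triplets(toy)
print(triplets)
```

On the real data, the same function could be applied to `pd.read_json(save_path, lines=True)` once generation has finished.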
