## **Synthetic data generation**

For our fine-tuning data, we will use the [NoteChat](https://arxiv.org/abs/2310.15959) dataset [(Hugging Face link)](https://huggingface.co/datasets/akemiH/NoteChat). It contains 167K pairs, each matching a real clinical note extracted from [PMC-Patients](https://arxiv.org/abs/2202.13876) with a generated patient-doctor conversation.

<p align="center">
<img src="figures/notechat.png" width="50%">
</p>

In this notebook, we extend this dataset of pairs to triplets with GPT-3.5: 


(`clinical note`, `dialogue transcript`) $\to$ (`clinical note`, `dialogue transcript`, `patient summary`)

We extract patient summaries by prompting GPT-3.5 with a clinical note, its corresponding dialogue transcript and a comprehensive template of patient features. 

In [7]:
#!pip install -r requirements.txt
# json and pandas are used directly in later cells
import json
import os

import pandas as pd
from utils.chat import *
from datasets import load_dataset

%reload_ext autoreload
%autoreload 2

### 1. **Load NoteChat data**

We first take a look at the NoteChat dataset. We load it from the Hugging Face Hub with the `datasets` library, cache it locally as a JSON Lines file, and display a few examples.

In [2]:
df_path = 'data/NoteChat.jsonl'

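# Download from the Hugging Face Hub on the first run, then reuse the local copy
if not os.path.exists(df_path):
    dataset = load_dataset("akemiH/NoteChat")
    data = dataset['train'].to_pandas()
    os.makedirs(os.path.dirname(df_path), exist_ok=True)
    data.to_json(df_path, orient='records', lines=True)
else:
    data = pd.read_json(df_path, orient='records', lines=True)

# Sort notes in reverse alphabetical order (original index labels are kept)
data = data.sort_values(by=['data'], ascending=False)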

data


| | data | conversation |
|---|---|---|
| 163999 | “Z,” a 14-year-old girl, was referred to our c... | Doctor: Hello, Z. I'm Dr. X, and I'll be your ... |
| 66442 | “Theresa” was a widowed, White British woman i... | \nDoctor: Hi Theresa, how are you today?\n\nTh... |
| 33502 | “Sami Rami” is a 5-year and 8-month-old boy fr... | \nDoctor: Hi, I’m Dr. X. How are you doing tod... |
| 70760 | “Mr. L” was a 54-year-old single, jobless man.... | \nDoctor: Hello Mr. L, I am Dr. X. How are you... |
| 50554 | “Mr. D” is a 42-year-old man, presently living... | Doctor: Hello Mr. D, I'm Dr. X. How are you fe... |
| ... | ... | ... |
| 181740 | (1) A 30 y.o. P0+1 underwent a successful ovul... | Doctor: Hello, how can I help you today?\nPati... |
| 191740 | (1) A 30 y.o. P0+1 underwent a successful ovul... | Doctor: Hello, how can I help you today?\nPati... |
| 104501 | ()A 34 year-old African American male with no ... | \nDoctor: Good morning, may I have your name?\... |
| 134012 | () A 53-year-old man was diagnosed with Rai II... | \nDoctor: Good morning, sir. How are you feeli... |


In [3]:
data.iloc[0]['data']

'“Z,” a 14-year-old girl, was referred to our child psychiatry clinic because she was experiencing a mania episode. Both a magnetic resonance imaging scan and an electroencephalography revealed no hint of an organic cause of her manic symptoms. Since her history also revealed a depressive episode, she was diagnosed as having bipolar disorder type 1 using the Kiddie-Sads-Present and Lifetime Version. Then, the Young Mania Rating Scale was used to measure the severity of her manic symptoms; her score was 45.\nWe started treatment with SV 250 mg/day for the first 4 days. We also administered risperidone 1 mg/day to resolve the ideas of reference symptom throughout the course of treatment. At the 15th day, the dose of SV was titrated to 750 mg/day (250 mg at noon and 500 mg at night). Since the SV dose reached 750mg/day Z began bedwetting. About 4 weeks later, when the dose of SV was titrated up to 500mg twice a day (a blood level of 75μg/ml), Z developed diurnal and nocturnal enuresis eve

In [4]:
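# Label-based lookup: this selects the row whose index label is 0,
# not the first row of the sorted frame (that would be data.iloc[0])
note = data['data'][0].replace('. ', '.\n')
conversation = data['conversation'][0].strip()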

print('CLINICAL NOTE')
print(note)
print('\n\nCONVERSATION')
print(conversation.replace('\n\n', '\n'))

CLINICAL NOTE
This 60-year-old male was hospitalized due to moderate ARDS from COVID-19 with symptoms of fever, dry cough, and dyspnea.
We encountered several difficulties during physical therapy on the acute ward.
First, any change of position or deep breathing triggered coughing attacks that induced oxygen desaturation and dyspnea.
To avoid rapid deterioration and respiratory failure, we instructed and performed position changes very slowly and step-by-step.
In this way, a position change to the 135° prone position () took around 30 minutes.
This approach was well tolerated and increased oxygen saturation, for example, on day 5 with 6 L/min of oxygen from 93% to 97%.
Second, we had to adapt the breathing exercises to avoid prolonged coughing and oxygen desaturation.
Accordingly, we instructed the patient to stop every deep breath before the need to cough and to hold inspiration for better air distribution.
In this manner, the patient performed the breathing exercises well and managed

### 2. **Experiment with extraction of patient summaries**


$$\text{Prompt} + \text{Clinical note} + \text{Template (with definitions?)} \to \text{Patient summary}$$

We will have to test out a few options to see which one works best.

- **Clinical note and/or dialogue transcript?** The dialogue is generated from the clinical note, so any information in the dialogue should already be in the note. The clinical note alone may therefore be sufficient.

- **Zero-shot vs. one-shot?** Do we include an example of (clinical note, dialogue transcript, patient summary) in the prompt? Ideally yes, but it might not fit in the context window, although dropping the dialogue would free up some room. We use zero-shot prompting so that the prompt fits into the context.

- **Template definitions?** Do we include a definition of each template field in the prompt? The definitions might not help, and they take up additional context space.

After a few tests, we choose to generate using only clinical notes, zero-shot and using a template with definitions. 
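
The actual prompt construction happens inside `utils.chat` and is not shown in this notebook. As a rough illustration of the chosen strategy (clinical note only, zero-shot, template with definitions), the sketch below assembles a prompt from the three parts. The function name `build_prompt` and the `Template:` / `Clinical note:` section labels are assumptions made for this example, not the project's real format.

```python
import json


def build_prompt(instructions: str, template_def: dict, note: str) -> str:
    """Assemble a zero-shot extraction prompt (hypothetical layout)."""
    # Serialize the template with its definitions so the model sees every field
    template_text = json.dumps(template_def, indent=2, ensure_ascii=False)
    return (
        f"{instructions.strip()}\n\n"
        f"Template:\n{template_text}\n\n"
        f"Clinical note:\n{note.strip()}"
    )


# Toy inputs: a shortened instruction, one template field, a truncated note
example_prompt = build_prompt(
    "Given the provided clinical note, extract the corresponding patient summary.",
    {"visit motivation": "Reason for the patient's visit"},
    "A 49-year-old male presented with a complaint ...",
)
print(example_prompt)
```

No worked example and no dialogue transcript are included, which keeps the prompt short enough to leave room for long clinical notes.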

In [8]:
# Load the extraction prompt
instruction_path = 'generation/instructions/instructions.txt'
with open(instruction_path, 'r') as f:
    instructions = f.read()

print(instructions)

Given the provided clinical note, extract the corresponding patient summary following the template provided. 
Be as thorough as possible to extract all the information from the clinical note, but do not add any new information. 
Make sure all details mentioned in the clinical note appear in your output. If necessary add more field at the end of the template.
If a field is not mentioned, simply write "feature": None.


In [9]:
# Load the template
with open('generation/templates/template.json', 'r') as f:
    template = json.load(f)

# Load the template definitions
with open('generation/templates/template_definitions.json', 'r') as f:
    template_def = json.load(f)

template_def

{'visit motivation': "Reason for the patient's visit",
 'admission': [{'reason': 'Reason for admission to a care center',
   'date': 'Date of first admission',
   'duration': "Length of patient's stay",
   'center': 'Name, type and details of care center'}],
 'patient information': {'age': "Patient's age",
  'sex': "Patient's sex",
  'ethnicity': "Patient's ethnicity or nationality",
  'weight': "Patient's weight",
  'height': "Patient's height",
  'family medical history': 'Information about family medical history',
  'recent travels': "Details about patient's recent travels",
  'socio economic context': "Patient's socioeconomic background",
  'occupation': "Patient's occupation"},
 'patient medical history': {'physiological context': 'Relevant physiological history of the patient',
  'psychological context': 'Relevant psychological history of the patient',
  'vaccination history': 'History of vaccinations received by the patient',
  'allergies': 'Any known allergies of the patient',


In [15]:
notechat = pd.read_json('data/NoteChat.jsonl', orient='records', lines=True)
# Sort by decreasing clinical note length, measured in whitespace-separated words
notechat['length'] = notechat['data'].apply(lambda x: len(x.split()))
notechat = notechat.sort_values(by=['length'], ascending=False)
notechat = notechat.drop(columns=['length'])
# Keep each row's original index label as an explicit column
notechat['idx'] = notechat.index
notechat.to_json('data/NoteChat_sorted.jsonl', orient='records', lines=True)

In [10]:
model = 'gpt-4-1106-preview' #for token count only
template_path = 'generation/templates/template_definitions.json'
instruction_path = 'generation/instructions/instructions.txt'
data_path = 'data/NoteChat_sorted.jsonl'
save_path = 'generation/summaries_testv3.jsonl'
keys_path = 'generation/keys.json'

In [11]:
extract(
    model,
    template_path,
    instruction_path,
    data_path,
    save_path,
    use_notes=True,
    use_dialogues=False,
    batch_size=10,
    nb_to_generate=10)

  0%|          | 0/1 [00:00<?, ?it/s]

Created 1 partitions with token number:[13904]
Partition 1/1: 10 points and 13904 tokens
..........
Break for 0.01 seconds.
End of break.
Saved
Batch done. Waiting 5 seconds...


100%|██████████| 1/1 [02:11<00:00, 131.28s/it]

Done.





| | idx | data | conversation | summary |
|---|---|---|---|---|
| 0 | 155216 | A a sixteen year-old girl, presented to our Ou... | \nDoctor: Good morning, what brings you to the... | {\n "visit motivation": "Discomfort in the ... |
| 1 | 77465 | This is the case of a 56-year-old man that was... | Doctor: Hi, how are you feeling today?\nPatien... | {\n "visit motivation": "Complaints of a du... |
| 2 | 133948 | A 36-year old female patient visited our hospi... | \nDoctor: Hello, what brings you to the hospit... | {\n "visit motivation": "Pain and restricte... |
| 3 | 80176 | A 49-year-old male presented with a complaint ... | \nDoctor: Good morning, Mr. [Patient's Name]. ... | {\n "visit motivation": "Pain in the left p... |
| 4 | 72232 | A 47-year-old male patient was referred to the... | \nDoctor: Good morning, how are you feeling to... | {\n "visit motivation": "Recurrent attacks ... |
| 5 | 31864 | A 24-year-old Yemeni female presented to the e... | Doctor: Good morning, how are you feeling toda... | {\n "visit motivation": "Inability to walk ... |
| 6 | 26809 | We report a 24-day-old female baby who present... | Doctor: Hi there, I am Dr. Smith. How can I he... | {\n "visit motivation": "Presented with dys... |
| 7 | 149866 | A 16 years old female patient presented to us ... | Doctor: Good morning, what brings you here tod... | {\n "visit motivation": "Inability to walk ... |
| 8 | 87064 | We present a case of a seventy-three-year-old ... | Doctor: Good morning, sir. How can I help you ... | {\n "visit motivation": "Concerned with hav... |
| 9 | 123006 | A 23-year-old female patient was admitted to a... | \nDoctor: Hi, how are you feeling today?\n\nPa... | {\n "visit motivation": "esthetic problem c... |


In [10]:
# Show sample results
summaries = pd.read_json(save_path, lines=True)
summaries = summaries[summaries['summary'].str.len() > 0]
for i, row in summaries.iterrows():
    print(f'CLINICAL NOTE {i}')
    print(row['data'].replace('. ', '.\n'))
    #print('\n\nCONVERSATION')
    #print(row['conversation'].replace('\n\n', '\n'))
    print('\n\nSUMMARY')
    print(row['summary'].replace('\n\n', '\n'))
    print('\n\n')

CLINICAL NOTE 0
A a sixteen year-old girl, presented to our Outpatient department with the complaints of discomfort in the neck and lower back as well as restriction of body movements.
She was not able to maintain an erect posture and would tend to fall on either side while standing up from a sitting position.
She would keep her head turned to the right and upwards due to the sustained contraction of the neck muscles.
There was a sideways bending of the back in the lumbar region.
To counter the abnormal positioning of the back and neck, she would keep her limbs in a specific position to allow her body weight to be supported.
Due to the restrictions with the body movements at the neck and in the lumbar region, she would require assistance in standing and walking.
She would require her parents to help her with daily chores, including all activities of self-care.
She had been experiencing these difficulties for the past four months since when she was introduced to olanzapine tablets for t
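
The instructions ask the model to write `"feature": None` for missing fields, which is Python syntax rather than JSON (`null`), so a stored summary string may not always parse with `json.loads`. The sketch below shows one possible way to parse summaries tolerantly and to check which top-level template fields are absent. Both `parse_summary` and `missing_keys` are hypothetical helpers written for this example; they are not part of `utils.chat`, and the toy strings are not real model outputs.

```python
import ast
import json


def parse_summary(text):
    """Parse a summary string into a dict, or return None if it cannot be parsed."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax, which accepts None instead of null
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError, TypeError):
            return None
    return parsed if isinstance(parsed, dict) else None


def missing_keys(summary, template):
    """List the top-level template fields that are absent from a parsed summary."""
    return [key for key in template if key not in summary]


# Toy inputs illustrating both syntaxes
json_style = parse_summary('{"visit motivation": "Pain", "admission": null}')
python_style = parse_summary('{"visit motivation": None}')
unparseable = parse_summary('not a summary')
absent = missing_keys(python_style, {"visit motivation": "", "admission": []})
```

Rows for which `parse_summary` returns `None` could then be inspected by hand or regenerated before scaling up.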

### 3. **Generate triplets**

Once we have a good prompting strategy, we generate triplets for the whole dataset.
