# Project: Building an LLM Training Dataset for Clinical Conversations

This project aims to build a synthetic dataset suitable for training Large Language Models (LLMs) on clinical conversations. The dataset is based on synthetic healthcare admission data, which has been processed, transformed, and augmented to simulate conversational turns. The goal is to create a resource that can be used to fine-tune LLMs for tasks related to clinical communication.

The process involved several key steps:
1.  **Data Understanding:** Examining the dataset structure, columns, and data types to identify relevant features for clinical conversations.
2.  **Preprocessing:** Cleaning the data, including handling missing values and encoding categorical variables using one-hot encoding.
3.  **Feature Engineering/Selection:** Selecting the most relevant features for simulating clinical conversations.
4.  **Data Transformation:** Converting the structured data into a text-based conversational format.
5.  **Data Augmentation:** Applying simple techniques to increase the variability of the generated conversation text.
6.  **Dataset Split:** Dividing the dataset into training, validation, and testing sets for model development and evaluation.
7.  **Dataset Formatting:** Preparing the dataset splits in a format compatible with common LLM training frameworks.

The resulting dataset provides a foundation for training LLMs on simulated clinical interactions, which can be further enhanced with more sophisticated augmentation and domain-specific vocabulary.

In [21]:
import pandas as pd
from datasets import load_dataset

# Load the dataset from Hugging Face
dataset = load_dataset("syncora/synthetic-healthcare-admissions")

# Convert the dataset to a pandas DataFrame
df = dataset['train'].to_pandas()

# Display the first few rows of the DataFrame
display(df.head())

| | Age | Gender | Blood Type | Medical Condition | Billing Amount | Admission Type | Medication | Test Results |
|---|---|---|---|---|---|---|---|---|
| 0 | 80 | 1.0 | 7.0 | 0.0 | 37303.079537 | 0.0 | 0.0 | 0.0 |
| 1 | 80 | 0.0 | 0.0 | 4.0 | 19201.947163 | 2.0 | 0.0 | 2.0 |
| 2 | 52 | 0.0 | 5.0 | 5.0 | 16161.339916 | 1.0 | 4.0 | 0.0 |
| 3 | 56 | 0.0 | 7.0 | 1.0 | 30310.878492 | 1.0 | 1.0 | 0.0 |
| 4 | 80 | 0.0 | 4.0 | 2.0 | 45593.67518 | 2.0 | 0.0 | 2.0 |


## Understand the data


Examine the columns in the dataset to understand their relevance to clinical conversations.


In [22]:
# Print column names
print("Column Names:")
print(df.columns)

# Print data types
print("\nData Types:")
print(df.dtypes)

# Display descriptive statistics for numerical columns
print("\nDescriptive Statistics for Numerical Columns:")
display(df.describe())

# Print unique values for categorical columns
categorical_cols = ['Gender', 'Blood Type', 'Medical Condition', 'Admission Type', 'Medication', 'Test Results']
print("\nUnique Values for Categorical Columns:")
for col in categorical_cols:
    if col in df.columns:
        print(f"\nUnique values for '{col}':")
        print(df[col].unique())
    else:
        print(f"\nColumn '{col}' not found in the DataFrame.")


Column Names:
Index(['Age', 'Gender', 'Blood Type', 'Medical Condition', 'Billing Amount',
       'Admission Type', 'Medication', 'Test Results'],
      dtype='object')

Data Types:
Age                    int64
Gender               float64
Blood Type           float64
Medical Condition    float64
Billing Amount       float64
Admission Type       float64
Medication           float64
Test Results         float64
dtype: object

Descriptive Statistics for Numerical Columns:


| | Age | Gender | Blood Type | Medical Condition | Billing Amount | Admission Type | Medication | Test Results |
|---|---|---|---|---|---|---|---|---|
| count | 99998.0 | 99998.0 | 99998.0 | 99998.0 | 99998.0 | 99998.0 | 99998.0 | 99998.0 |
| mean | 51.01801 | 0.50296 | 3.51317 | 2.50917 | 25542.756273 | 0.99601 | 1.99719 | 0.99797 |
| std | 19.630072 | 0.499994 | 2.292357 | 1.708607 | 14288.62185 | 0.820338 | 1.412339 | 0.820784 |
| min | 11.0 | 0.0 | 0.0 | 0.0 | -4154.580956 | 0.0 | 0.0 | 0.0 |
| 25% | 34.0 | 0.0 | 2.0 | 1.0 | 13240.845232 | 0.0 | 1.0 | 0.0 |
| 50% | 51.0 | 1.0 | 4.0 | 3.0 | 25593.09207 | 1.0 | 2.0 | 1.0 |
| 75% | 68.0 | 1.0 | 6.0 | 4.0 | 37854.493543 | 2.0 | 3.0 | 2.0 |
| max | 90.0 | 1.0 | 7.0 | 5.0 | 55759.286015 | 2.0 | 4.0 | 2.0 |



Unique Values for Categorical Columns:

Unique values for 'Gender':
[1. 0.]

Unique values for 'Blood Type':
[7. 0. 5. 4. 3. 1. 2. 6.]

Unique values for 'Medical Condition':
[0. 4. 5. 1. 2. 3.]

Unique values for 'Admission Type':
[0. 2. 1.]

Unique values for 'Medication':
[0. 4. 1. 3. 2.]

Unique values for 'Test Results':
[0. 2. 1.]
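
The descriptive statistics above report a minimum `Billing Amount` of about -4154.58, which is hard to read as a real charge. A minimal sketch of how such rows could be counted and flagged is given below. The toy frame and the `billing_is_negative` column name are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df; only the column names come from the notebook.
toy = pd.DataFrame({
    "Age": [80, 52, 56],
    "Billing Amount": [37303.08, -4154.58, 30310.88],
})

# Flag rather than drop, so the decision about these rows stays explicit.
toy["billing_is_negative"] = toy["Billing Amount"] < 0
n_negative = int(toy["billing_is_negative"].sum())
print(f"Rows with negative billing: {n_negative}")
```

Because `Billing Amount` is not used in the generated conversation text, this is a data-quality check rather than a required fix.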


## Preprocessing

Clean and prepare the data for use in an LLM. This may involve handling missing values, encoding categorical variables, or transforming text data.

In [23]:
# 1. Identify and handle missing values
print("Missing values before handling:")
print(df.isnull().sum())

# The isnull() check reports no missing values in any column, so the fills below
# are a defensive no-op for this dataset. If missing values were present, they would
# be filled with the mode for categorical columns and the mean for numerical columns.
# We identify numerical and categorical columns based on the previous analysis.
numerical_cols = ['Age', 'Billing Amount']
categorical_cols = ['Gender', 'Blood Type', 'Medical Condition', 'Admission Type', 'Medication', 'Test Results']

for col in categorical_cols:
    if col in df.columns:
        mode_value = df[col].mode()[0] # Mode can return multiple values, take the first
        df[col] = df[col].fillna(mode_value) # Assign back; inplace fillna on a column selection is deprecated

for col in numerical_cols:
    if col in df.columns:
        mean_value = df[col].mean()
        df[col] = df[col].fillna(mean_value)

print("\nMissing values after handling:")
print(df.isnull().sum())

# Convert categorical columns to integer type after filling NaNs with mode (which will be an integer)
for col in categorical_cols:
    if col in df.columns:
        df[col] = df[col].astype(int)


# 2. Convert categorical variables using one-hot encoding
# Ensure categorical columns are treated as categorical for one-hot encoding
df[categorical_cols] = df[categorical_cols].astype('category')

df_processed = pd.get_dummies(df, columns=categorical_cols, drop_first=False) # Keep all categories for LLM

# 3. Consider scaling numerical columns and display descriptive statistics
# Scaling might not be strictly necessary depending on the LLM architecture,
# but we will display descriptive statistics after handling NaNs.
print("\nDescriptive Statistics for Numerical Columns after handling NaNs:")
display(df_processed[numerical_cols].describe())

# 4. Display the first few rows of the preprocessed DataFrame
print("\nFirst few rows of the preprocessed DataFrame:")
display(df_processed.head())

Missing values before handling:
Age                  0
Gender               0
Blood Type           0
Medical Condition    0
Billing Amount       0
Admission Type       0
Medication           0
Test Results         0
dtype: int64

Missing values after handling:
Age                  0
Gender               0
Blood Type           0
Medical Condition    0
Billing Amount       0
Admission Type       0
Medication           0
Test Results         0
dtype: int64

Descriptive Statistics for Numerical Columns after handling NaNs:




| | Age | Billing Amount |
|---|---|---|
| count | 99998.0 | 99998.0 |
| mean | 51.01801 | 25542.756273 |
| std | 19.630072 | 14288.62185 |
| min | 11.0 | -4154.580956 |
| 25% | 34.0 | 13240.845232 |
| 50% | 51.0 | 25593.09207 |
| 75% | 68.0 | 37854.493543 |
| max | 90.0 | 55759.286015 |



First few rows of the preprocessed DataFrame:


| | Age | Billing Amount | Gender_0 | Gender_1 | Blood Type_0 | Blood Type_1 | Blood Type_2 | Blood Type_3 | Blood Type_4 | Blood Type_5 | ... | Admission Type_1 | Admission Type_2 | Medication_0 | Medication_1 | Medication_2 | Medication_3 | Medication_4 | Test Results_0 | Test Results_1 | Test Results_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80 | 37303.079537 | False | True | False | False | False | False | False | False | ... | False | False | True | False | False | False | False | True | False | False |
| 1 | 80 | 19201.947163 | True | False | True | False | False | False | False | False | ... | False | True | True | False | False | False | False | False | False | True |
| 2 | 52 | 16161.339916 | True | False | False | False | False | False | False | True | ... | True | False | False | False | False | False | True | True | False | False |
| 3 | 56 | 30310.878492 | True | False | False | False | False | False | False | False | ... | True | False | False | True | False | False | False | True | False | False |
| 4 | 80 | 45593.67518 | True | False | False | False | False | False | True | False | ... | False | True | True | False | False | False | False | False | False | True |
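
The imputation and one-hot encoding steps are easier to follow on a few rows that actually contain gaps. The sketch below uses a hypothetical three-row frame; only the column names are taken from the notebook, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows with one missing value per column.
toy = pd.DataFrame({
    "Age": [80.0, None, 52.0],
    "Gender": [1.0, 0.0, None],
})

# Assign the result back instead of calling fillna(..., inplace=True) on a column selection.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0]).astype(int)

# One indicator column per category value; drop_first=False keeps every category.
encoded = pd.get_dummies(toy, columns=["Gender"], drop_first=False)
print(encoded.columns.tolist())
```

When several values tie for the mode, `mode()` returns them in sorted order, so taking `[0]` picks the smallest of them.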


## Feature engineering/selection

Determine which features are most relevant for training an LLM on clinical conversations. This might involve creating new features or selecting a subset of existing ones.

In [24]:
# Review the columns in df_processed
print("Columns in df_processed:")
print(df_processed.columns)

# Based on the nature of clinical conversations, select relevant features.
# Age is directly relevant.
# Billing Amount is less central to the *clinical* side of a conversation, but it is kept in the selection.
# Gender, Blood Type, Medical Condition, Admission Type, Medication, and Test Results are all highly relevant.
# We select 'Age', 'Billing Amount', and every one-hot encoded column derived from the categorical variables.

selected_features = ['Age', 'Billing Amount'] + [col for col in df_processed.columns if any(cat_col in col for cat_col in categorical_cols)]

print("\nSelected features for clinical conversations:")
print(selected_features)

Columns in df_processed:
Index(['Age', 'Billing Amount', 'Gender_0', 'Gender_1', 'Blood Type_0',
       'Blood Type_1', 'Blood Type_2', 'Blood Type_3', 'Blood Type_4',
       'Blood Type_5', 'Blood Type_6', 'Blood Type_7', 'Medical Condition_0',
       'Medical Condition_1', 'Medical Condition_2', 'Medical Condition_3',
       'Medical Condition_4', 'Medical Condition_5', 'Admission Type_0',
       'Admission Type_1', 'Admission Type_2', 'Medication_0', 'Medication_1',
       'Medication_2', 'Medication_3', 'Medication_4', 'Test Results_0',
       'Test Results_1', 'Test Results_2'],
      dtype='object')

Selected features for clinical conversations:
['Age', 'Billing Amount', 'Gender_0', 'Gender_1', 'Blood Type_0', 'Blood Type_1', 'Blood Type_2', 'Blood Type_3', 'Blood Type_4', 'Blood Type_5', 'Blood Type_6', 'Blood Type_7', 'Medical Condition_0', 'Medical Condition_1', 'Medical Condition_2', 'Medical Condition_3', 'Medical Condition_4', 'Medical Condition_5', 'Admission Type_0', 'Adm
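
The selection above keeps a column whenever a categorical name appears anywhere inside it. A stricter variant, sketched here on a hard-coded list, matches the `<name>_` prefix that `pd.get_dummies` produces. The column `Medication Notes` is a made-up example that exists only to show the difference.

```python
categorical_cols = ["Gender", "Blood Type", "Medication"]
columns = ["Age", "Gender_0", "Gender_1", "Blood Type_3", "Medication_2", "Medication Notes"]

# Substring match: also picks up the unrelated 'Medication Notes' column.
loose = [c for c in columns if any(cat in c for cat in categorical_cols)]

# Prefix match: keeps only columns of the form '<categorical name>_<value>'.
strict = [c for c in columns if any(c.startswith(cat + "_") for cat in categorical_cols)]

print(loose)
print(strict)
```

For the columns in this notebook both variants select the same set, so the stricter match only matters if free-text columns are added later.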

## Data transformation


Transform the data into a format suitable for LLM training. This could involve structuring the data as conversation turns or extracting key information.

In [25]:
# Work from the original integer-coded columns rather than the one-hot columns selected above.
selected_features = ['Age', 'Gender', 'Blood Type', 'Medical Condition', 'Admission Type', 'Medication', 'Test Results']

def create_conversation_turn(row):
    """Creates a simulated clinical conversation turn from a row of data."""
    conversation_parts = []

    # Add age information
    conversation_parts.append(f"Patient is {int(row['Age'])} years old.")

    # Add gender information (assumes 0: Female, 1: Male; the dataset shown here does not document this mapping)
    gender = "Male" if row['Gender'] == 1 else "Female"
    conversation_parts.append(f"Gender: {gender}.")

    # Add medical condition information (using the original numerical codes for now)
    conversation_parts.append(f"Medical Condition: Code {int(row['Medical Condition'])}.")

    # Add medication information (using the original numerical codes)
    conversation_parts.append(f"Medication: Code {int(row['Medication'])}.")

    # Add test results information (using the original numerical codes)
    conversation_parts.append(f"Test Results: Code {int(row['Test Results'])}.")

    # Add admission type information (using the original numerical codes)
    conversation_parts.append(f"Admission Type: Code {int(row['Admission Type'])}.")

    # Add blood type information (using the original numerical codes)
    conversation_parts.append(f"Blood Type: Code {int(row['Blood Type'])}.")


    return " ".join(conversation_parts)

# Apply the function to the original dataframe 'df' to create the conversation text
# and then add this column to the processed dataframe 'df_processed'.
df_processed['conversation_text'] = df.apply(create_conversation_turn, axis=1)

# Display the first few rows of df_processed with the new column and relevant original columns
display(df_processed[['Age', 'Gender_0', 'Gender_1', 'Medical Condition_0', 'conversation_text']].head())

| | Age | Gender_0 | Gender_1 | Medical Condition_0 | conversation_text |
|---|---|---|---|---|---|
| 0 | 80 | False | True | True | Patient is 80 years old. Gender: Male. Medical... |
| 1 | 80 | True | False | False | Patient is 80 years old. Gender: Female. Medic... |
| 2 | 52 | True | False | False | Patient is 52 years old. Gender: Female. Medic... |
| 3 | 56 | True | False | False | Patient is 56 years old. Gender: Female. Medic... |
| 4 | 80 | True | False | False | Patient is 80 years old. Gender: Female. Medic... |


## Potential data augmentation/generation

Since the current dataset has a fixed set of values for categorical features, creating more varied conversation text could improve the LLM's ability to handle less common combinations. We will explore augmenting the existing conversation text by incorporating different phrasing or sentence structures while retaining the core information.

In [26]:
import random

# Define the augmentation function
def augment_conversation_text(conversation):
    """Augments a conversation string with random variations."""
    aug_conversation = conversation

    # Split the original conversation into sentences. The last sentence keeps its
    # trailing period, so a doubled period ("..") appears whenever the shuffle
    # moves it away from the end.
    parts = aug_conversation.split(". ")
    random.shuffle(parts) # Randomize the sentence order

    # Rejoin the parts, ensuring proper sentence ending
    aug_conversation = ". ".join(parts).strip()
    if not aug_conversation.endswith("."):
        aug_conversation += "."

    # Add slight variations to phrasing (simple example)
    phrasing_variations = [
        "The patient's",
        "Information about the patient:",
        "Details include:",
        "Findings:"
    ]
    # Replace the first word of the string with a random phrasing variation.
    # This is a simple approach: the replaced word is lost (for example the
    # "Admission" in "Admission Type"), so it could be made more sophisticated.
    first_part_tokens = aug_conversation.split(" ", 1)
    if len(first_part_tokens) > 1:
        aug_conversation = random.choice(phrasing_variations) + " " + first_part_tokens[1]

    # Add more specific variations for code introductions (example for Medical Condition)
    aug_conversation = aug_conversation.replace("Medical Condition: Code", random.choice(["Diagnosed with condition code", "Medical status code", "Condition identified as code"]))

    return aug_conversation

# Apply the function to the existing conversation_text column
df_processed['augmented_conversation_text'] = df_processed['conversation_text'].apply(augment_conversation_text)

# Display the original and augmented columns for verification
display(df_processed[['conversation_text', 'augmented_conversation_text']].head())

| | conversation_text | augmented_conversation_text |
|---|---|---|
| 0 | Patient is 80 years old. Gender: Male. Medical... | Information about the patient: is 80 years old... |
| 1 | Patient is 80 years old. Gender: Female. Medic... | Findings: Type: Code 2. Test Results: Code 2. ... |
| 2 | Patient is 52 years old. Gender: Female. Medic... | Findings: Type: Code 5.. Test Results: Code 0.... |
| 3 | Patient is 56 years old. Gender: Female. Medic... | The patient's is 56 years old. Diagnosed with ... |
| 4 | Patient is 80 years old. Gender: Female. Medic... | The patient's is 80 years old. Medical status ... |
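
The augmented samples above show two artifacts: a doubled period (`Code 5..`) where the original last sentence lands mid-string after shuffling, and a lost leading word (`Findings: Type: Code 2.`) because the introductory phrase overwrites the first token. One possible repair is sketched below. The function name `augment_safely`, the shortened intro list, and the seeded `random.Random` are assumptions for illustration and are not the notebook's implementation.

```python
import random

# Intros that read correctly when placed in front of a full sentence.
INTROS = ["Information about the patient:", "Details include:", "Findings:"]

def augment_safely(conversation, rng):
    """Shuffle sentence order and prepend an intro without losing any field."""
    # Strip the final period first so every part is period-free before shuffling.
    parts = [p.strip() for p in conversation.rstrip(".").split(". ") if p.strip()]
    rng.shuffle(parts)
    body = ". ".join(parts) + "."
    # Prepend the intro instead of overwriting the first word.
    return f"{rng.choice(INTROS)} {body}"

rng = random.Random(42)  # seeded so the example is reproducible
original = "Patient is 80 years old. Gender: Male. Medical Condition: Code 0. Blood Type: Code 7."
augmented = augment_safely(original, rng)
print(augmented)
```

Passing the random generator in explicitly keeps the augmentation reproducible, which is useful when comparing dataset versions.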


## Dataset split

Split the dataset into training, validation, and testing sets.

In [27]:
from sklearn.model_selection import train_test_split

# Define the features (X) and the target (y).
# Since we are just splitting the text data for LLM training where there isn't a specific target in this step,
# we will treat the augmented conversation text as the data to be split.
X = df_processed['augmented_conversation_text']
y = None # No specific target for this split

In [28]:
# Split the augmented conversation text into training and testing sets
train_texts, test_texts = train_test_split(X, test_size=0.2, random_state=42)

# Further split the training texts into training and validation sets
train_texts_split, val_texts = train_test_split(train_texts, test_size=0.1, random_state=42) # 10% of the 80% training portion, about 8% of all rows

# Print the number of samples in each set to verify the split
print(f"Number of samples in training set: {len(train_texts_split)}")
print(f"Number of samples in validation set: {len(val_texts)}")
print(f"Number of samples in testing set: {len(test_texts)}")

Number of samples in training set: 71998
Number of samples in validation set: 8000
Number of samples in testing set: 20000
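
The reported sizes follow from applying the two fractions in sequence. The short check below reproduces them, assuming `train_test_split` rounds a fractional test size up to the next whole sample, which is consistent with the counts printed above.

```python
import math

n_total = 99998                        # rows reported by df.describe()
n_test = math.ceil(0.2 * n_total)      # first split: 20% held out for testing
n_remaining = n_total - n_test
n_val = math.ceil(0.1 * n_remaining)   # second split: 10% of the remaining rows
n_train = n_remaining - n_val

print(n_train, n_val, n_test)
print(f"validation share of all rows: {n_val / n_total:.3f}")
```

The overall proportions are therefore roughly 72% training, 8% validation, and 20% testing.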


## Dataset formatting


Format the dataset according to the requirements of the specific LLM training framework or library you plan to use.

In [29]:
# Format the training, validation, and testing sets into a list of dictionaries
# Each dictionary will have a single key, 'text', containing the conversation string.

train_dataset_formatted = [{'text': text} for text in train_texts_split]
val_dataset_formatted = [{'text': text} for text in val_texts]
test_dataset_formatted = [{'text': text} for text in test_texts]

# Display a small sample of each formatted dataset split to verify
print("Sample of formatted training dataset:")
display(train_dataset_formatted[:3])

print("\nSample of formatted validation dataset:")
display(val_dataset_formatted[:3])

print("\nSample of formatted testing dataset:")
display(test_dataset_formatted[:3])

Sample of formatted training dataset:


[{'text': 'Details include: Code 0. Admission Type: Code 0. Patient is 46 years old. Test Results: Code 0. Medical status code 0. Gender: Male. Blood Type: Code 3.'},
 {'text': 'Information about the patient: Results: Code 1. Admission Type: Code 1. Condition identified as code 1. Patient is 41 years old. Gender: Female. Medication: Code 1. Blood Type: Code 6.'},
 {'text': "The patient's Female. Admission Type: Code 2. Medication: Code 2. Test Results: Code 2. Blood Type: Code 3.. Diagnosed with condition code 4. Patient is 58 years old."}]


Sample of formatted validation dataset:


[{'text': 'Details include: is 47 years old. Gender: Female. Test Results: Code 0. Blood Type: Code 4.. Admission Type: Code 2. Diagnosed with condition code 2. Medication: Code 3.'},
 {'text': 'Findings: Condition: Code 4. Admission Type: Code 2. Test Results: Code 2. Blood Type: Code 2.. Medication: Code 4. Patient is 79 years old. Gender: Female.'},
 {'text': 'Information about the patient: Results: Code 1. Blood Type: Code 2.. Gender: Male. Admission Type: Code 2. Condition identified as code 4. Medication: Code 2. Patient is 84 years old.'}]


Sample of formatted testing dataset:


[{'text': "The patient's Condition: Code 5. Patient is 34 years old. Blood Type: Code 5.. Medication: Code 1. Test Results: Code 0. Admission Type: Code 1. Gender: Male."},
 {'text': "The patient's Type: Code 1. Test Results: Code 0. Diagnosed with condition code 4. Gender: Male. Medication: Code 4. Patient is 77 years old. Blood Type: Code 6."},
 {'text': 'Findings: Code 3. Blood Type: Code 3.. Gender: Male. Admission Type: Code 2. Test Results: Code 0. Patient is 37 years old. Diagnosed with condition code 0.'}]
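
Before handing the splits to a training library, it can help to confirm that every record has the expected shape. The helper below is a hypothetical check (`validate_records` is not part of the notebook), shown on two hand-written records.

```python
def validate_records(records):
    """Return the indices of records that are not {'text': <non-empty str>}."""
    bad = []
    for i, rec in enumerate(records):
        text = rec.get("text") if isinstance(rec, dict) else None
        is_valid = (
            isinstance(rec, dict)
            and set(rec) == {"text"}
            and isinstance(text, str)
            and bool(text.strip())
        )
        if not is_valid:
            bad.append(i)
    return bad

sample = [
    {"text": "Findings: Patient is 37 years old. Gender: Male."},
    {"text": ""},  # empty text should be reported
]
print(validate_records(sample))
```

An empty result from `validate_records(train_dataset_formatted)` would indicate that every training record carries a non-empty string under the single `text` key.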

## Summary

### Data Analysis Key Findings

* The initial dataset contained numerical representations for categorical features like 'Gender', 'Blood Type', and 'Medical Condition'.
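* The dataset had no missing values, so the imputation step left the data unchanged.
* One-hot encoding was applied to the categorical features, creating one column per category value; the conversation text itself was generated from the original integer-coded columns.
* A new column, 'augmented_conversation_text', was created by transforming the original features into simulated clinical conversation turns and applying simple augmentation: shuffling sentence order and varying the introductory phrase. The sampled output shows that this augmentation also introduces artifacts, such as doubled periods ("Code 5..") and a dropped leading word ("Findings: Type: Code 2.").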
* The dataset was split into training (71,998 samples), validation (8,000 samples), and testing (20,000 samples) sets.
* The final dataset was formatted as a list of dictionaries, each containing a 'text' key with the conversation string, which is a standard format for many LLM training frameworks.

### Insights or Next Steps

* The current conversation text uses numerical codes for medical conditions, medications, etc. Replacing these codes with descriptive text (e.g., "Medical Condition: Diabetes" instead of "Medical Condition: Code 5") would make the conversations more natural and clinically relevant for LLM training.
* Explore more advanced data augmentation techniques to create more diverse and realistic clinical conversation examples, such as adding variations in tone, incorporating questions and answers, or simulating dialogue between patient and clinician.
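
Both ideas can be prototyped together. In the sketch below the code-to-label dictionaries are placeholders: the notebook does not document what each integer code means, so the labels ("Diabetes", "Emergency", and so on) are invented for illustration and would need to be replaced with the real mapping. The role names and the message layout are likewise assumptions.

```python
# Placeholder mappings: the real meaning of each code is NOT documented in this notebook.
CONDITION_LABELS = {5: "Diabetes", 4: "Hypertension"}
ADMISSION_LABELS = {0: "Elective", 1: "Emergency", 2: "Urgent"}

def build_dialogue(row):
    """Turn one record into a short clinician/patient exchange."""
    # Fall back to the raw code when no label is known.
    condition = CONDITION_LABELS.get(row["Medical Condition"], f"code {row['Medical Condition']}")
    admission = ADMISSION_LABELS.get(row["Admission Type"], f"code {row['Admission Type']}")
    return [
        {"role": "clinician", "content": "Can you confirm your age and why you were admitted?"},
        {"role": "patient", "content": f"I am {row['Age']} years old and I was admitted for {condition}."},
        {"role": "clinician", "content": f"Thank you. Your admission is recorded as {admission.lower()}."},
    ]

example = {"Age": 52, "Medical Condition": 5, "Admission Type": 1}
for turn in build_dialogue(example):
    print(f"{turn['role']}: {turn['content']}")
```

Keeping the fallback to the raw code means unmapped values stay visible in the text instead of being silently dropped.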

## Save Dataset to Google Drive


Save the formatted training, validation, and testing datasets to Google Drive for future reference.

In [31]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [32]:
import json
import os

# Define the directory in Google Drive to save the datasets
drive_dir = '/content/drive/MyDrive/clinical_conversation_dataset'
os.makedirs(drive_dir, exist_ok=True)

# Define the file paths for each dataset split
train_file = os.path.join(drive_dir, 'train_dataset.jsonl')
val_file = os.path.join(drive_dir, 'val_dataset.jsonl')
test_file = os.path.join(drive_dir, 'test_dataset.jsonl')

# Function to save a dataset split to a JSON Lines file
def save_dataset_to_jsonl(dataset, filename):
    with open(filename, 'w') as f:
        for item in dataset:
            f.write(json.dumps(item) + '\n')

# Save each dataset split
save_dataset_to_jsonl(train_dataset_formatted, train_file)
save_dataset_to_jsonl(val_dataset_formatted, val_file)
save_dataset_to_jsonl(test_dataset_formatted, test_file)

print(f"Training dataset saved to: {train_file}")
print(f"Validation dataset saved to: {val_file}")
print(f"Testing dataset saved to: {test_file}")

Training dataset saved to: /content/drive/MyDrive/clinical_conversation_dataset/train_dataset.jsonl
Validation dataset saved to: /content/drive/MyDrive/clinical_conversation_dataset/val_dataset.jsonl
Testing dataset saved to: /content/drive/MyDrive/clinical_conversation_dataset/test_dataset.jsonl
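
A quick way to confirm that a file was written correctly is to read it back and compare it with the in-memory list. The sketch below does this in a temporary directory rather than on Google Drive; `load_jsonl` is a hypothetical helper and the two records are hand-written samples.

```python
import json
import os
import tempfile

def load_jsonl(filename):
    """Read a JSON Lines file into a list of dictionaries."""
    with open(filename, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

records = [
    {"text": "Findings: Patient is 37 years old."},
    {"text": "Details include: Gender: Male."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.jsonl")
    # Write one JSON object per line, mirroring save_dataset_to_jsonl above.
    with open(path, "w", encoding="utf-8") as f:
        for item in records:
            f.write(json.dumps(item) + "\n")
    restored = load_jsonl(path)

print(restored == records)
```

The same comparison could be run against `train_file`, `val_file`, and `test_file` after the Drive mount.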
