# **ChronicGPT: An Approach to Convert Tabular Clinical Records into Clinical Narratives for Effectively Tuning Disease-Specific GPT Models**

This study uses heart patient data from the UCI Heart Disease dataset. The data is tabular, as is typical of structured clinical datasets and of datasets derived from Electronic Health Records (EHRs). However, large language models (LLMs) such as GPT are designed for text input rather than tabular data. This introduced two primary challenges:

    1. The data was in a tabular format, which is not commonly used with transformer-based models like GPT.

    2. The dataset contained very few heart disease cases, making it difficult to train an effective model.

The key advantage of a language model like GPT over traditional machine learning models or standard neural networks lies in its dual capability. Traditional models make predictions based solely on statistical patterns and observed probabilities within the data. While often effective, such models typically fail to capture the clinical significance behind categorical features, particularly in medical datasets where each category (e.g., “normal,” “abnormal,” “present,” “not present”) conveys nuanced meaning. In contrast, pre-trained GPT models bring prior knowledge of clinical language to the task, which lets them relate these categories to their clinical context while also learning the underlying statistical relationships. This combination of semantic understanding and statistical reasoning makes GPT models well suited to healthcare applications, where both interpretability and domain knowledge are critical.

Creating synthetic examples for heart disease prediction poses additional difficulties, especially when the dataset includes numerous categorical features (such as "yes"/"no" or "normal"/"abnormal") with significant clinical implications. Common resampling techniques such as SMOTE, ADASYN, and ENN are designed for numerical data and often handle categorical medical data poorly. This limitation makes it hard to improve model performance when real heart disease data is scarce.
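
To make this limitation concrete, the following minimal sketch mimics the linear interpolation step at the core of SMOTE on two hypothetical patient records. The record values and the helper name `interpolate` are illustrative assumptions, not taken from the dataset or from any library. Interpolating the thalassemia code between 3 (Normal) and 7 (Reversible Defect) can produce a value that corresponds to no clinical category.

```python
# Illustrative sketch only: mimics SMOTE's linear interpolation step
# on two hypothetical records; this is not a library implementation.

VALID_THAL_CODES = {3, 6, 7}  # 3 = Normal, 6 = Fixed Defect, 7 = Reversible Defect

def interpolate(a, b, lam):
    """Return a + lam * (b - a) for every feature shared by the two records."""
    return {key: a[key] + lam * (b[key] - a[key]) for key in a}

# Hypothetical minority-class neighbours.
patient_a = {"age": 54, "chol": 266, "thal": 3}
patient_b = {"age": 58, "chol": 270, "thal": 7}

synthetic = interpolate(patient_a, patient_b, lam=0.5)
print(synthetic)                              # {'age': 56.0, 'chol': 268.0, 'thal': 5.0}
print(synthetic["thal"] in VALID_THAL_CODES)  # False
```

The interpolated age and cholesterol remain plausible, while the thalassemia code 5.0 has no clinical meaning.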

To address this, a table-to-text approach was developed. First, each tabular record was converted into a short clinical-style text resembling a physician's note. Then a GPT-based model (GPT-4) was used to generate realistic, medically accurate examples by paraphrasing existing samples from heart disease patients.

This approach preserves the clinical meaning while producing new examples, a feat that traditional methods for creating synthetic data for categorical features such as SMOTE-NC, CTGAN, TVAE, and CopulaGAN often struggle to achieve. These alternative techniques either fail to capture the clinical context adequately or generate samples that are less meaningful when dealing with categorical medical features.

Following the generation of realistic medical text samples, a GPT-2 model (the `distilgpt2` checkpoint) was fine-tuned on the short texts. The objective was to assess whether the model could accurately predict the presence of heart disease in new patients. The model performed strongly despite the limited training data, reaching 94.12% accuracy on the 51-patient held-out test set.

All GPT models used in this study were obtained from Hugging Face’s library of pre-trained models. The findings demonstrate that LLMs can be adapted to handle medical tabular data by transforming it into text, thereby enhancing the model's ability to learn clinically relevant patterns associated with heart disease.

Additionally, this approach was compared to a simpler method in which each tabular record was converted into a sequential input format: the feature values listed in order, without clinical-style wording. While this made the data compatible with transformer models, performance was lower. The transformer-based models struggled to capture the relationships and clinical significance embedded in plain sequences of numbers or categories.
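
The difference between the two input formats can be sketched as follows. The helper names `to_sequence` and `to_clinical_text` are hypothetical, and the exact sequential format used in the study is not shown in this notebook, so the space-separated form below is an assumption.

```python
# Hypothetical record and helper names, for illustration only.
record = {"age": 67, "sex": 1, "trestbps": 160, "chol": 286, "exang": 1}

def to_sequence(rec):
    """Baseline: feature values listed in order, with no clinical wording."""
    return " ".join(str(value) for value in rec.values())

def to_clinical_text(rec):
    """Table-to-text: the same values embedded in a clinical-style sentence."""
    sex = "male" if rec["sex"] == 1 else "female"
    angina = "with" if rec["exang"] == 1 else "without"
    return (f"Patient is a {sex} aged {rec['age']} years, "
            f"resting blood pressure {rec['trestbps']} mm Hg, "
            f"serum cholesterol {rec['chol']} mg/dL, "
            f"{angina} exercise-induced angina.")

print(to_sequence(record))       # 67 1 160 286 1
print(to_clinical_text(record))
```

In the sequential form the model sees only bare values, whereas the clinical text names each measurement and its unit.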

In contrast, the table-to-clinical-text approach enabled the GPT model to interpret both the semantic and clinical context of each feature. This resulted in significantly improved performance and generalization, particularly for the underrepresented heart disease class.

To evaluate the generalizability of the proposed method, it was also applied to the UCI Heart Failure Clinical dataset and the UCI Chronic Kidney Disease dataset, with consistent results obtained across all datasets.

## **Keywords:**

    1. Large Language Models (LLMs)

    2. Table-to-Text Conversion

    3. Clinical Text Generation

    4. Class Imbalance

    5. Transformer-Based Prediction

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
import seaborn as sns
import plotly.express as px

%matplotlib inline

In [2]:
# Set DPI for figures

plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300

In [3]:
# Set the default font size and weight
plt.rcParams['font.size'] = 30
plt.rcParams['font.weight'] = 'bold'

In [4]:
# Drive connection

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Dataset: UCI Heart Disease Dataset

In [None]:
df.rename(columns={
    "age": "Age (years)",
    "sex": "Sex (1 = Male, 0 = Female)",
    "cp": "Chest Pain Type (1 = Typical Angina, 2 = Atypical Angina, 3 = Non-anginal Pain, 4 = Asymptomatic)",
    "trestbps": "Resting Blood Pressure (mm Hg)",
    "chol": "Serum Cholesterol (mg/dL)",
    "fbs": "Fasting Blood Sugar (> 120 mg/dL, 1 = True, 0 = False)",
    "restecg": "Resting ECG Results (0 = Normal, 1 = ST-T Wave Abnormality, 2 = Left Ventricular Hypertrophy)",
    "thalach": "Maximum Heart Rate Achieved",
    "exang": "Exercise-Induced Angina (1 = Yes, 0 = No)",
    "oldpeak": "ST Depression Induced by Exercise (mm)",
    "slope": "Slope of Peak Exercise ST Segment (1 = Upsloping, 2 = Flat, 3 = Downsloping)",
    "ca": "Number of Major Vessels (0-3) Colored by Fluoroscopy",
    "thal": "Thalassemia (3 = Normal, 6 = Fixed Defect, 7 = Reversible Defect)",
    "num": "Heart Disease (0 = No Disease, 1 = Heart Disease)"
}, inplace=True)

In [None]:
load_path = "/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/df.csv"

# Load the DataFrame
df = pd.read_csv(load_path)

In [None]:
df.head()

Unnamed: 0,Age (years),"Sex (1 = Male, 0 = Female)","Chest Pain Type (1 = Typical Angina, 2 = Atypical Angina, 3 = Non-anginal Pain, 4 = Asymptomatic)",Resting Blood Pressure (mm Hg),Serum Cholesterol (mg/dL),"Fasting Blood Sugar (> 120 mg/dL, 1 = True, 0 = False)","Resting ECG Results (0 = Normal, 1 = ST-T Wave Abnormality, 2 = Left Ventricular Hypertrophy)",Maximum Heart Rate Achieved,"Exercise-Induced Angina (1 = Yes, 0 = No)",ST Depression Induced by Exercise (mm),"Slope of Peak Exercise ST Segment (1 = Upsloping, 2 = Flat, 3 = Downsloping)",Number of Major Vessels (0-3) Colored by Fluoroscopy,"Thalassemia (3 = Normal, 6 = Fixed Defect, 7 = Reversible Defect)","Heart Disease (0 = No Disease, 1 = Heart Disease)"
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,1
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0


# Table to Clinical Text for GPT Model





In [None]:
import torch
from sklearn.model_selection import train_test_split

# Mapping for readable text
ecg_mapping = {
    0: "Normal",
    1: "ST-T Wave Abnormality",
    2: "Left Ventricular Hypertrophy"
}

slope_mapping = {
    1: "Upsloping",
    2: "Flat",
    3: "Downsloping"
}

thal_mapping = {
    3: "Normal",
    6: "Fixed Defect",
    7: "Reversible Defect"
}

# Separate features and target
input_features = df.drop(columns=["Heart Disease (0 = No Disease, 1 = Heart Disease)"])
target = df["Heart Disease (0 = No Disease, 1 = Heart Disease)"]

# Convert rows to clinical-style text
def row_to_text(row):
    return (f"Patient is a {'male' if int(row['Sex (1 = Male, 0 = Female)']) == 1 else 'female'} aged {row['Age (years)']} years, "
            f"presenting with chest pain type {row['Chest Pain Type (1 = Typical Angina, 2 = Atypical Angina, 3 = Non-anginal Pain, 4 = Asymptomatic)']}, "
            f"resting blood pressure {row['Resting Blood Pressure (mm Hg)']} mm Hg, serum cholesterol {row['Serum Cholesterol (mg/dL)']} mg/dL, "
            f"{'elevated' if row['Fasting Blood Sugar (> 120 mg/dL, 1 = True, 0 = False)'] else 'normal'} fasting blood sugar, "
            f"resting ECG showing result {ecg_mapping[row['Resting ECG Results (0 = Normal, 1 = ST-T Wave Abnormality, 2 = Left Ventricular Hypertrophy)']]}, "
            f"maximum heart rate achieved {row['Maximum Heart Rate Achieved']}, "
            f"{'with' if row['Exercise-Induced Angina (1 = Yes, 0 = No)'] else 'without'} exercise-induced angina, "
            f"ST depression of {row['ST Depression Induced by Exercise (mm)']} mm, "
            f"Slope of Peak Exercise ST Segment: {slope_mapping[row['Slope of Peak Exercise ST Segment (1 = Upsloping, 2 = Flat, 3 = Downsloping)']]}, "
            f"{row['Number of Major Vessels (0-3) Colored by Fluoroscopy']} major vessels affected, "
            f"thalassemia type {thal_mapping.get(row['Thalassemia (3 = Normal, 6 = Fixed Defect, 7 = Reversible Defect)'], 'Unknown')}.")

df["clinical_text"] = input_features.apply(row_to_text, axis=1)
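
The conversion above maps the ECG, slope, and thalassemia codes to readable labels but leaves the chest pain type as a numeric code. A minimal sketch of how that code could also be mapped is shown below. The names `cp_mapping` and `describe_chest_pain` are hypothetical, with labels taken from the column name. The reported results were obtained with the numeric code, so using this mapping would change the training texts.

```python
# Hypothetical mapping, following the codes given in the column name
# "Chest Pain Type (1 = Typical Angina, ..., 4 = Asymptomatic)".
cp_mapping = {
    1: "Typical Angina",
    2: "Atypical Angina",
    3: "Non-anginal Pain",
    4: "Asymptomatic",
}

def describe_chest_pain(code):
    """Return a readable chest pain label, or 'Unknown' for unmapped or missing codes."""
    try:
        return cp_mapping.get(int(code), "Unknown")
    except (TypeError, ValueError):
        return "Unknown"

print(describe_chest_pain(4))     # Asymptomatic
print(describe_chest_pain(2.0))   # Atypical Angina
print(describe_chest_pain(None))  # Unknown
```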

In [None]:
load_path = "/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/clinical_text_dataset.csv"

# Load the DataFrame
df = pd.read_csv(load_path)

# Preview
df.head()

In [None]:
# Drop target column to get input features only
input_features = df.drop(columns=["Heart Disease (0 = No Disease, 1 = Heart Disease)"])

# Apply the conversion function row-wise
df["clinical_text"] = input_features.apply(row_to_text, axis=1)

# View the converted text
print(df["clinical_text"].head())

0    Patient is a male aged 63 years, presenting wi...
1    Patient is a male aged 67 years, presenting wi...
2    Patient is a male aged 67 years, presenting wi...
3    Patient is a male aged 37 years, presenting wi...
4    Patient is a female aged 41 years, presenting ...
Name: clinical_text, dtype: object


In [None]:
for idx, text in enumerate(df["clinical_text"]):
    print(f"Row {idx + 1}:\n{text}\n{'-'*80}")

Row 1:
Patient is a male aged 63 years, presenting with chest pain type 1, resting blood pressure 145 mm Hg, serum cholesterol 233 mg/dL, elevated fasting blood sugar, resting ECG showing result Left Ventricular Hypertrophy, maximum heart rate achieved 150, without exercise-induced angina, ST depression of 2.3 mm, Slope of Peak Exercise ST Segment: Downsloping, 0.0 major vessels affected, thalassemia type Fixed Defect.
--------------------------------------------------------------------------------
Row 2:
Patient is a male aged 67 years, presenting with chest pain type 4, resting blood pressure 160 mm Hg, serum cholesterol 286 mg/dL, normal fasting blood sugar, resting ECG showing result Left Ventricular Hypertrophy, maximum heart rate achieved 108, with exercise-induced angina, ST depression of 1.5 mm, Slope of Peak Exercise ST Segment: Flat, 3.0 major vessels affected, thalassemia type Normal.
--------------------------------------------------------------------------------
Row 3:
Pat

# Train/Test Split

In [None]:
# Before Augmented
x_train = pd.read_csv("/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Updated/Before_Augmented/x_train.csv")
y_train = pd.read_csv("/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Updated/Before_Augmented/y_train.csv")

In [None]:
# Train Shape
x_train.shape

(246, 1)

In [None]:
# Train Label Shape
y_train.shape

(246, 1)

In [None]:
# Train Count
print(y_train.value_counts())

Heart Disease (0 = No Disease, 1 = Heart Disease)
0                                                    128
1                                                    118
Name: count, dtype: int64


In [None]:
# Loading Test Set
x_test = pd.read_csv("/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Updated/Final_Data/x_test.csv")
y_test = pd.read_csv("/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Updated/Final_Data/y_test.csv")

In [None]:
# Test shape
x_test.shape

(51, 1)

In [None]:
# Test Label shape
y_test.shape

(51, 1)

In [None]:
# Test Count
print(y_test.value_counts())

0
0    32
1    19
Name: count, dtype: int64


### Paraphrasing Model (humarin/chatgpt_paraphraser_on_T5_base, T5-based) for Synthetic Data (Optional)

In [None]:
# Merge x_train and y_train
df = pd.concat([x_train, y_train], axis=1)

### Positive cases

In [None]:
# Filter the rows where heart disease == 1
hd_rows = df[df["Heart Disease (0 = No Disease, 1 = Heart Disease)"] == 1]

# Print clinical texts for heart disease patients only
for idx, text in enumerate(hd_rows["clinical_text"]):
    print(f"Row {idx + 1}:\n{text}\n{'-'*80}")

Row 1:
Patient is a male aged 54 years, presenting with chest pain type 4, resting blood pressure 124 mm Hg, serum cholesterol 266 mg/dL, normal fasting blood sugar, resting ECG showing result Left Ventricular Hypertrophy, maximum heart rate achieved 109, with exercise-induced angina, ST depression of 2.2 mm, Slope of Peak Exercise ST Segment: Flat, 1.0 major vessels affected, thalassemia type Reversible Defect.
--------------------------------------------------------------------------------
Row 2:
Patient is a male aged 58 years, presenting with chest pain type 4, resting blood pressure 150 mm Hg, serum cholesterol 270 mg/dL, normal fasting blood sugar, resting ECG showing result Left Ventricular Hypertrophy, maximum heart rate achieved 111, with exercise-induced angina, ST depression of 0.8 mm, Slope of Peak Exercise ST Segment: Upsloping, 0.0 major vessels affected, thalassemia type Reversible Defect.
--------------------------------------------------------------------------------
R

In [None]:
import torch
from transformers import pipeline

# Step 1: Filter class 1 examples
df_class_1 = df[df["Heart Disease (0 = No Disease, 1 = Heart Disease)"] == 1]
clinical_sentences_class_1 = df_class_1["clinical_text"].tolist()

# Step 2: Calculate how many paraphrases we need
num_class_0 = df[df["Heart Disease (0 = No Disease, 1 = Heart Disease)"] == 0].shape[0]
num_class_1 = len(clinical_sentences_class_1)
num_needed = num_class_0 - num_class_1

# Step 3: Load the paraphrasing model (T5-based; a medically fine-tuned model could be substituted)
paraphrase_pipe = pipeline(
    "text2text-generation",
    model="humarin/chatgpt_paraphraser_on_T5_base",  # This is T5-based, tuned for paraphrasing
    device=0 if torch.cuda.is_available() else -1
)

# Step 4: Generate paraphrases
synthetic_paraphrases = []
num_generated = 0
i = 0

while num_generated < num_needed:
    sentence = clinical_sentences_class_1[i % num_class_1]
    prompt = f"Paraphrase medically accurately: {sentence}"
    result = paraphrase_pipe(prompt, max_length=512, num_return_sequences=1, do_sample=True)
    paraphrased_text = result[0]['generated_text']
    synthetic_paraphrases.append(paraphrased_text)
    num_generated += 1
    i += 1

# Step 5: Store separately in a DataFrame
df_synthetic = pd.DataFrame({
    "clinical_text": synthetic_paraphrases,
    "Heart Disease (0 = No Disease, 1 = Heart Disease)": [1] * len(synthetic_paraphrases)
})

print(f"Generated {len(df_synthetic)} paraphrased class 1 samples.")

Generated 23 paraphrased class 1 samples.


In [None]:
load_path = "/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Augmented/Final/synthetic_HD_Data.csv"

# Load the DataFrame
df_synthetic = pd.read_csv(load_path, encoding='ISO-8859-1')

# Preview
df_synthetic.head()

Unnamed: 0,clinical_text,"Heart Disease (0 = No Disease, 1 = Heart Disease)"
0,A 54-year-old male with chest pain type 4 show...,1
1,Resting ECG shows Left Ventricular Hypertrophy...,1
2,A male patient with elevated fasting blood sug...,1
3,A 50-year-old male with chest pain type 4 has ...,1
4,Presenting with chest pain type 4 and exercise...,1


In [None]:
# Visualizing Rows
for idx, text in enumerate(df_synthetic["clinical_text"]):
    print(f"Row {idx + 1}:\n{text}\n{'-'*80}")

Row 1:
A 54-year-old male with chest pain type 4 shows ST depression of 2.2 mm and a flat slope on peak exercise. He has a resting blood pressure of 124 mm Hg and serum cholesterol at 266 mg/dL. Fasting blood sugar is normal. Resting ECG indicates Left Ventricular Hypertrophy. Maximum heart rate achieved is 109, and he experiences exercise-induced angina. One major vessel is affected, and thalassemia is of the Reversible Defect type.
--------------------------------------------------------------------------------
Row 2:
Resting ECG shows Left Ventricular Hypertrophy in this 58-year-old male with chest pain type 4. Blood pressure is 150 mm Hg, and cholesterol level is 270 mg/dL. Fasting blood sugar is normal. The patient reached a maximum heart rate of 111 and had exercise-induced angina. ST depression is 0.8 mm with an upsloping segment. No major vessels are affected, and thalassemia is of the Reversible Defect type.
---------------------------------------------------------------------
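
Because paraphrases are only useful if they keep the clinical values intact, a simple sanity check can compare the numbers in each original text with those in its paraphrase. The sketch below is one possible check, not part of the original pipeline. The helper names `numbers_in` and `missing_numbers` are hypothetical, and the two strings are shortened versions of the first original and synthetic rows shown above.

```python
import re

def numbers_in(text):
    """Extract all integer or decimal numbers in the text as floats."""
    return {float(match) for match in re.findall(r"\d+(?:\.\d+)?", text)}

def missing_numbers(original, paraphrase):
    """Numbers present in the original but absent from the paraphrase."""
    return numbers_in(original) - numbers_in(paraphrase)

original = ("Patient is a male aged 54 years, presenting with chest pain type 4, "
            "resting blood pressure 124 mm Hg, serum cholesterol 266 mg/dL, "
            "ST depression of 2.2 mm, 1.0 major vessels affected.")
paraphrase = ("A 54-year-old male with chest pain type 4 shows ST depression of 2.2 mm. "
              "He has a resting blood pressure of 124 mm Hg and serum cholesterol at "
              "266 mg/dL. One major vessel is affected.")

print(missing_numbers(original, paraphrase))  # {1.0}
```

Here the only flagged value is the vessel count, which the paraphrase spells out as "One", so a flagged number calls for manual review rather than automatic rejection.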

## After Augmentation

In [None]:
# Concatenating along rows (axis=0)

x_train = pd.concat(
    [ x_train[['clinical_text']],
      df_synthetic[['clinical_text']] ],
    axis=0,
    ignore_index=True
)

y_train = pd.concat(
    [ y_train[['Heart Disease (0 = No Disease, 1 = Heart Disease)']],
      df_synthetic[['Heart Disease (0 = No Disease, 1 = Heart Disease)']] ],
    axis=0,
    ignore_index=True
)

In [None]:
"""
# Save to CSV
x_train.to_csv("/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Updated/Final_Data/Final/x_train.csv", index=False)
y_train.to_csv("/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Updated/Final_Data/Final/y_train", index=False)
"""

In [5]:
# Augmented
x_train = pd.read_csv("/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Updated/Final_Data/Final/x_train.csv")
y_train = pd.read_csv("/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Updated/Final_Data/Final/y_train")

In [None]:
# Train Shape
x_train.shape

(269, 1)

In [None]:
# Train Label Shape
y_train.shape

(292, 1)

In [None]:
# Train Count
print(y_train.value_counts())

0
1    164
0    128
Name: count, dtype: int64


In [6]:
# Loading Test Set
x_test = pd.read_csv("/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Updated/Final_Data/x_test.csv")
y_test = pd.read_csv("/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Updated/Final_Data/y_test.csv")

In [None]:
# Test shape
x_test.shape

(51, 1)

In [None]:
# Test Label shape
y_test.shape

(51, 1)

In [None]:
# Test Count
print(y_test.value_counts())

0
0    32
1    19
Name: count, dtype: int64


# GPT as Classification Model

In [None]:
pip install --upgrade transformers



In [None]:
import transformers
print(transformers.__version__)

4.51.3


In [None]:
!pip install huggingface_hub[hf_xet]


In [None]:
pip install transformers datasets scikit-learn

In [7]:
import torch
import pandas as pd
from transformers import GPT2Tokenizer, GPT2Model, GPT2Config
from transformers import Trainer, TrainingArguments
from torch import nn
from torch.utils.data import Dataset
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Load tokenizer and GPT2
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # set pad token to eos_token

# 2. Custom Dataset
class ClinicalDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.encodings = tokenizer(list(texts), truncation=True, padding=True, max_length=max_len, return_tensors='pt')
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

# 3. Custom GPT2 Classification Model
class GPT2ForClassification(nn.Module):
    def __init__(self, n_classes=2):
        super(GPT2ForClassification, self).__init__()
        self.gpt2 = GPT2Model.from_pretrained("distilgpt2")
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(self.gpt2.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.gpt2(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs.last_hidden_state  # [batch_size, seq_len, hidden_dim]
        cls_output = last_hidden_state[:, -1, :]  # hidden state at the final position (a pad token for right-padded shorter texts)
        logits = self.classifier(self.dropout(cls_output))
        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(logits, labels)
        return {"loss": loss, "logits": logits}

# Texts
if isinstance(x_train, pd.DataFrame):
    x_train = x_train.squeeze().astype(str).tolist()
if isinstance(x_test, pd.DataFrame):
    x_test = x_test.squeeze().astype(str).tolist()

# Labels
if isinstance(y_train, pd.DataFrame):
    y_train = y_train.squeeze().astype(int).tolist()
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze().astype(int).tolist()

# 4. Prepare dataset
train_dataset = ClinicalDataset(x_train, y_train, tokenizer)
test_dataset = ClinicalDataset(x_test, y_test, tokenizer)

# 5. Load model
model = GPT2ForClassification()
model.to(device)

# 6. TrainingArguments and Trainer
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=5,
    save_steps=10,
    eval_steps=10,                     # takes effect only if an evaluation strategy is enabled
    metric_for_best_model="accuracy",  # or "f1" depending on your task
    greater_is_better=True,
    warmup_ratio=0.1,              # Warmup stabilizes the first optimization steps
    gradient_accumulation_steps=2, # Simulates larger batch size
    fp16=True,                     # Use if on GPU with mixed precision support
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

# 7. Train the model
trainer.train()


Step,Training Loss
5,4.1137
10,1.8797
15,3.4278
20,1.7062
25,1.4692
30,1.1253
35,1.2491
40,1.8573
45,1.9591
50,1.1173


TrainOutput(global_step=680, training_loss=0.7316580455092823, metrics={'train_runtime': 1387.1139, 'train_samples_per_second': 1.939, 'train_steps_per_second': 0.49, 'total_flos': 0.0, 'train_loss': 0.7316580455092823, 'epoch': 10.0})
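
The classifier above reads the hidden state at the final sequence position. Because the tokenizer pads on the right, that position holds a padding token for texts shorter than the longest one. A common alternative, used for example by Hugging Face's `GPT2ForSequenceClassification`, is to pool the last non-padding token instead. The sketch below shows that selection on toy tensors. The function name `last_real_token_state` is hypothetical, and the reported results were obtained with the original pooling, not with this variant.

```python
import torch

def last_real_token_state(hidden, attention_mask):
    """Select, per sequence, the hidden state of the last non-padding token.

    hidden:         [batch, seq_len, hidden_dim]
    attention_mask: [batch, seq_len], 1 for real tokens and 0 for padding
                    (assumes right padding and at least one real token per row).
    """
    last_idx = attention_mask.sum(dim=1) - 1                         # [batch]
    batch_idx = torch.arange(hidden.size(0), device=hidden.device)   # [batch]
    return hidden[batch_idx, last_idx]                               # [batch, hidden_dim]

# Toy tensors standing in for GPT-2 outputs: 2 sequences, 4 positions, 3 dims.
hidden = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]])

pooled = last_real_token_state(hidden, mask)
print(pooled)  # rows [9, 10, 11] and [15, 16, 17]
```

The first sequence has no padding, so its final position is used. The second has two padding tokens, so position 1 is used instead of position 3.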

In [8]:
# 8. Evaluation
preds_output = trainer.predict(test_dataset)
predictions = np.argmax(preds_output.predictions, axis=1)

In [9]:
# 9. Classification report & confusion matrix
print("Classification Report:\n", classification_report(y_test, predictions, digits=4))
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))

Classification Report:
               precision    recall  f1-score   support

           0     0.9677    0.9375    0.9524        32
           1     0.9000    0.9474    0.9231        19

    accuracy                         0.9412        51
   macro avg     0.9339    0.9424    0.9377        51
weighted avg     0.9425    0.9412    0.9415        51

Confusion Matrix:
 [[30  2]
 [ 1 18]]
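
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them in plain Python. The helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix from the report above: rows = true class, columns = predicted class.
cm = [[30, 2],
      [1, 18]]

def per_class_metrics(cm, cls):
    """Precision, recall and F1 for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)   # column sum: everything predicted as cls
    actual = sum(cm[cls])                     # row sum: everything truly in cls
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for cls in range(len(cm)):
    p, r, f1 = per_class_metrics(cm, cls)
    print(f"class {cls}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
print(f"accuracy={accuracy:.4f}")
```

For the heart disease class, 18 of 20 positive predictions are correct (precision 0.9000) and 18 of 19 true cases are found (recall 0.9474). Overall, 48 of 51 test patients are classified correctly (accuracy 0.9412).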


In [10]:
"""
save_path = "/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Model/"

trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)

print("Model and tokenizer saved to:", save_path)
"""

Model and tokenizer saved to: /content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Model/


In [11]:
from transformers import GPT2Tokenizer
import torch

save_path = "/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Model/"

# Reload tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(save_path)
tokenizer.pad_token = tokenizer.eos_token

# Reload model
model = GPT2ForClassification(n_classes=2)
state_dict = torch.load(save_path + "pytorch_model.bin", map_location="cuda" if torch.cuda.is_available() else "cpu")
model.load_state_dict(state_dict)
model.to(device)
model.eval()

print("Model loaded from:", save_path)

Model loaded from: /content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Model/


## Ablation

In [None]:
import torch
import pandas as pd
from transformers import GPT2Tokenizer, GPT2Model, GPT2Config
from transformers import Trainer, TrainingArguments
from torch import nn
from torch.utils.data import Dataset
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Load tokenizer and GPT2
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # set pad token to eos_token

# 2. Custom Dataset
class ClinicalDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.encodings = tokenizer(list(texts), truncation=True, padding=True, max_length=max_len, return_tensors='pt')
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

# 3. Custom GPT2 Classification Model
class GPT2ForClassification(nn.Module):
    def __init__(self, n_classes=2):
        super(GPT2ForClassification, self).__init__()
        self.gpt2 = GPT2Model.from_pretrained("distilgpt2")
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(self.gpt2.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.gpt2(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs.last_hidden_state  # [batch_size, seq_len, hidden_dim]
        cls_output = last_hidden_state[:, -1, :]  # hidden state at the final position (a pad token for right-padded shorter texts)
        logits = self.classifier(self.dropout(cls_output))
        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(logits, labels)
        return {"loss": loss, "logits": logits}

# Texts
if isinstance(x_train, pd.DataFrame):
    x_train = x_train.squeeze().astype(str).tolist()
if isinstance(x_test, pd.DataFrame):
    x_test = x_test.squeeze().astype(str).tolist()

# Labels
if isinstance(y_train, pd.DataFrame):
    y_train = y_train.squeeze().astype(int).tolist()
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze().astype(int).tolist()

# 4. Prepare dataset
train_dataset = ClinicalDataset(x_train, y_train, tokenizer)
test_dataset = ClinicalDataset(x_test, y_test, tokenizer)

# 5. Load model
model = GPT2ForClassification()
model.to(device)

# 6. TrainingArguments and Trainer
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.001,
    logging_dir="./logs",
    logging_steps=5,
    save_steps=10,
    eval_steps=10,                     # takes effect only if an evaluation strategy is enabled
    metric_for_best_model="accuracy",  # or "f1" depending on your task
    greater_is_better=True,
    warmup_ratio=0.1,              # Warmup stabilizes the first optimization steps
    gradient_accumulation_steps=2, # Simulates larger batch size
    fp16=True,                     # Use if on GPU with mixed precision support
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

# 7. Train the model
trainer.train()

Step,Training Loss
5,0.9986
10,2.1239
15,1.0534
20,1.4737
25,1.1274
30,1.2136
35,0.9339
40,0.8216
45,0.8325
50,0.8108


TrainOutput(global_step=183, training_loss=0.8105736430225476, metrics={'train_runtime': 290.2068, 'train_samples_per_second': 2.543, 'train_steps_per_second': 0.631, 'total_flos': 0.0, 'train_loss': 0.8105736430225476, 'epoch': 2.959349593495935})

In [None]:
# 8. Evaluation
preds_output = trainer.predict(test_dataset)
predictions = np.argmax(preds_output.predictions, axis=1)

In [None]:
# 9. Classification report & confusion matrix
print("Classification Report:\n", classification_report(y_test, predictions, digits=4))
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))

Classification Report:
               precision    recall  f1-score   support

           0     0.8750    0.8750    0.8750        32
           1     0.7895    0.7895    0.7895        19

    accuracy                         0.8431        51
   macro avg     0.8322    0.8322    0.8322        51
weighted avg     0.8431    0.8431    0.8431        51

Confusion Matrix:
 [[28  4]
 [ 4 15]]
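
As a cross-check, the figures in the classification report can be recomputed by hand from the confusion matrix. The short sketch below assumes scikit-learn's convention (rows are true labels, columns are predicted labels); the helper name `per_class_metrics` is introduced here for illustration only and is not part of the notebook.

```python
# Confusion matrix from the report above: rows = true class, columns = predicted class.
cm = [[28, 4],
      [4, 15]]

def per_class_metrics(cm, cls):
    """Return (precision, recall, f1) for class index `cls` (illustrative helper)."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column sum: everything predicted as `cls`
    actual = sum(cm[cls])                    # row sum: everything that truly is `cls`
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

for cls in range(len(cm)):
    p, r, f = per_class_metrics(cm, cls)
    print(f"class {cls}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
print(f"accuracy={accuracy:.4f}")
```

If class 1 denotes heart disease (an assumption about the label encoding), a recall of 15/19 means that 4 of the 19 disease cases in the test set were missed.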


# Interpretability

We take the trained GPT-2 model and measure how much each input token influences its prediction. To do this, we compute gradients of the output with respect to each token's embedding, which indicate how much the output would change if that token's representation were perturbed slightly. Tokens with larger gradient magnitudes have a larger local effect on the decision. This is a sensitivity measure and is distinct from the model's attention weights. By reporting scores only for keywords of interest (such as "blood pressure" or "sugar"), we can see which clinical features the prediction was most sensitive to.

Concretely, we compute the gradient of the predicted class's logit with respect to each input token's embedding and take the L2 norm of that gradient as the token's importance score. A larger norm means the prediction is more sensitive to changes in that token.

This is a post-hoc interpretability method: it works directly on the trained model without modifying it, and gradient-based saliency of this kind is widely used in NLP for token-level importance visualization.
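
Written out, and assuming (as in the saliency code later in this notebook) that the score is the logit of the predicted class and that the norm is the L2 norm, the importance of token $i$ is

$$
s_i = \left\lVert \frac{\partial z_{\hat{c}}}{\partial \mathbf{e}_i} \right\rVert_2,
\qquad \hat{c} = \arg\max_{c} z_c ,
$$

where $\mathbf{e}_i$ is the input embedding of token $i$ and $z_c$ is the classifier logit for class $c$. The symbols are notation introduced here for illustration. The scores are unnormalized, so they are best compared across tokens within a single input.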

In [25]:
for idx, text in enumerate(x_test.values):
    print(f"Row {idx + 1}:\n{text}\n{'-' * 80}")

Row 1:
['Patient is a male aged 41 years, presenting with chest pain type 2, resting blood pressure 120 mm Hg, serum cholesterol 157 mg/dL, normal fasting blood sugar, resting ECG showing result Normal, maximum heart rate achieved 182, without exercise-induced angina, ST depression of 0.0 mm, Slope of Peak Exercise ST Segment: Upsloping, 0.0 major vessels affected, thalassemia type Normal.']
--------------------------------------------------------------------------------
Row 2:
['Patient is a female aged 49 years, presenting with chest pain type 4, resting blood pressure 130 mm Hg, serum cholesterol 269 mg/dL, normal fasting blood sugar, resting ECG showing result Normal, maximum heart rate achieved 163, without exercise-induced angina, ST depression of 0.0 mm, Slope of Peak Exercise ST Segment: Upsloping, 0.0 major vessels affected, thalassemia type Normal.']
--------------------------------------------------------------------------------
Row 3:
['Patient is a male aged 66 years, pres

In [7]:
# Texts
if isinstance(x_train, pd.DataFrame):
    x_train = x_train.squeeze().astype(str).tolist()
if isinstance(x_test, pd.DataFrame):
    x_test = x_test.squeeze().astype(str).tolist()

# Labels
if isinstance(y_train, pd.DataFrame):
    y_train = y_train.squeeze().astype(int).tolist()
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze().astype(int).tolist()

In [8]:
# If y_train/y_test are already lists
y_train = [int(i) for i in y_train]
y_test  = [int(i) for i in y_test]

# If x_train/x_test are lists, make sure they are strings
x_train = [str(i) for i in x_train]
x_test  = [str(i) for i in x_test]

In [12]:
import torch
import pandas as pd
from transformers import GPT2Tokenizer, GPT2Model, GPT2Config, Trainer, TrainingArguments
from torch import nn
from torch.utils.data import Dataset
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# =====================
# 0. Device
# =====================
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# =====================
# 1. Load tokenizer
# =====================
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # pad token

# =====================
# 2. Dataset
# =====================
class ClinicalDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.encodings = tokenizer(list(texts), truncation=True, padding=True,
                                   max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

# =====================
# 3. GPT2 Classification Model (with attention)
# =====================
class GPT2ForClassification(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        # GPT2 with attention output
        config = GPT2Config.from_pretrained("distilgpt2", output_attentions=True, return_dict=True)
        self.gpt2 = GPT2Model.from_pretrained("distilgpt2", config=config)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(self.gpt2.config.hidden_size, n_classes)

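    def forward(self, input_ids=None, attention_mask=None, labels=None, return_dict=True, inputs_embeds=None):
        outputs = self.gpt2(input_ids=input_ids,
                            attention_mask=attention_mask,
                            inputs_embeds=inputs_embeds,
                            output_attentions=True,
                            return_dict=True)
        last_hidden_state = outputs.last_hidden_state
        if attention_mask is not None:
            # Inputs are right-padded: pool from the last non-padded position of each sequence
            last_idx = attention_mask.sum(dim=1) - 1
            batch_idx = torch.arange(last_hidden_state.size(0), device=last_hidden_state.device)
            cls_output = last_hidden_state[batch_idx, last_idx]
        else:
            cls_output = last_hidden_state[:, -1, :]  # no mask given: use the final position
        logits = self.classifier(self.dropout(cls_output))

        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(logits, labels)

        return {"loss": loss, "logits": logits, "attentions": outputs.attentions}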

# =====================
# 4. Load your dataset
# =====================
# Make sure x_train, y_train, x_test, y_test are loaded as pandas DataFrame
# Example conversion:
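def df_to_list(df, dtype=str):
    # Texts are cast to str; labels must be cast to int so they can become a long tensor
    if isinstance(df, (pd.DataFrame, pd.Series)):
        return df.squeeze().astype(dtype).tolist()
    return df

x_train = df_to_list(x_train)
x_test = df_to_list(x_test)
y_train = df_to_list(y_train, dtype=int)
y_test = df_to_list(y_test, dtype=int)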

train_dataset = ClinicalDataset(x_train, y_train, tokenizer)
test_dataset = ClinicalDataset(x_test, y_test, tokenizer)

# =====================
# 5. Initialize model
# =====================
model = GPT2ForClassification(n_classes=2)
model.to(device)

# =====================
# 6. Training
# =====================
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=5,
    save_steps=10,
    eval_steps=10,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    warmup_ratio=0.1,
    gradient_accumulation_steps=2,
    fp16=torch.cuda.is_available(),  # mixed precision needs a CUDA GPU; falls back to fp32 on CPU
)

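def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # This model also returns attentions, so predictions may arrive as a tuple; keep the logits only
    if isinstance(logits, (tuple, list)):
        logits = logits[0]
    preds = np.argmax(logits, axis=1)
    return {"accuracy": (preds == labels).mean()}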

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Step,Training Loss
5,2.5353
10,1.7174
15,2.0509
20,2.1478
25,1.1147
30,1.0501
35,1.0035
40,2.8833
45,2.0856
50,0.6745


TrainOutput(global_step=680, training_loss=0.6815892156432657, metrics={'train_runtime': 3302.0155, 'train_samples_per_second': 0.815, 'train_steps_per_second': 0.206, 'total_flos': 0.0, 'train_loss': 0.6815892156432657, 'epoch': 10.0})

In [13]:
"""
from transformers import GPT2Tokenizer
import torch

save_path = "/content/drive/Shareddrives/Best Shared Drive Ever/Simon-personal/CardioGPT/Model/Test/"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
"""

In [11]:
# Sample clinical sentence
sample_text = ("Patient is a male aged 41 years, presenting with chest pain type 2, resting blood pressure 120 mm Hg, serum cholesterol 157 mg/dL, normal fasting blood sugar, resting ECG showing result Normal, maximum heart rate achieved 182, without exercise-induced angina, ST depression of 0.0 mm, Slope of Peak Exercise ST Segment: Upsloping, 0.0 major vessels affected, thalassemia type Normal.")

# Tokenize
inputs = tokenizer(
    sample_text,
    return_tensors="pt",
    truncation=True,
    padding=True
).to(device)

print("Input IDs shape:", inputs["input_ids"].shape)
print("Attention mask shape:", inputs["attention_mask"].shape)

Input IDs shape: torch.Size([1, 91])
Attention mask shape: torch.Size([1, 91])


In [12]:
# Forward pass
model.eval()  # switch off dropout so the prediction is deterministic
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"]
    )

logits = outputs["logits"]
attentions = outputs["attentions"]  # tuple of attention matrices per layer
predicted_class = logits.argmax(dim=-1).item()

print("Predicted class:", predicted_class)
print("Number of attention layers:", len(attentions))
print("Shape of first layer attentions:", attentions[0].shape)  # [batch, heads, seq_len, seq_len]

Predicted class: 0
Number of attention layers: 6
Shape of first layer attentions: torch.Size([1, 12, 91, 91])
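
The attention tensors returned above are not used further in this notebook. As an illustration only, one common way to reduce a layer's attention to a per-token weight is to average over heads and read the row belonging to the final position, which is the position the classifier pools from. The sketch below uses a small random array as a stand-in for one layer's attentions; the function name `last_position_attention` and the toy shapes are assumptions made for this example.

```python
import numpy as np

def last_position_attention(layer_attention):
    """Average over heads, then return the attention row of the final position.

    layer_attention: array of shape [batch, heads, seq_len, seq_len] for one layer.
    Returns an array of shape [batch, seq_len].
    """
    head_mean = layer_attention.mean(axis=1)  # [batch, seq_len, seq_len]
    return head_mean[:, -1, :]                # [batch, seq_len]

# Toy stand-in for one layer's attentions: random values, each row normalized to sum to 1.
rng = np.random.default_rng(0)
raw = rng.random((1, 12, 5, 5))
toy_attention = raw / raw.sum(axis=-1, keepdims=True)

weights = last_position_attention(toy_attention)
print(weights.shape)         # one weight per input position
print(weights.sum(axis=-1))  # each row still sums to 1 after averaging over heads
```

With the real model output, the input would be something like `attentions[-1].cpu().numpy()` for the final layer. Attention weights and gradient saliency measure different things, so the two rankings need not agree.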


In [18]:
# =====================
# Updated Sample clinical text
# =====================
sample_text = (
    "Patient is a male aged 50 years, presenting with chest pain type 3, "
    "resting blood pressure 129 mm Hg, serum cholesterol 196 mg/dL, normal fasting blood sugar, "
    "resting ECG showing result Normal, maximum heart rate achieved 163, without exercise-induced angina, "
    "ST depression of 0.0 mm, Slope of Peak Exercise ST Segment: Upsloping, 0.0 major vessels affected, "
    "thalassemia type Normal."
)

# =====================
# Define keywords of interest for this text
# =====================
key_words = ["age", "chest", "pain", "blood", "pressure", "cholesterol", "blood", "sugar",
             "ECG", "heart", "rate", "angina", "ST", "depression", "ST", "segment",
             "major", "vessels", "thalassemia"]

# =====================
# Tokenize
# =====================
inputs = tokenizer(sample_text, return_tensors="pt", truncation=True, padding=True).to(device)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# =====================
# Forward pass with embedding gradient tracking
# =====================
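model.eval()  # disable dropout so the saliency pass is deterministic
embed_layer = model.gpt2.wte
embeds = embed_layer(inputs["input_ids"])
embeds.requires_grad_(True)
embeds.retain_grad()  # embeds is a non-leaf tensor, so its gradient must be retained explicitly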

outputs = model(input_ids=None, attention_mask=inputs["attention_mask"], inputs_embeds=embeds)
cls_index = outputs["logits"].argmax(dim=-1).item()
score = outputs["logits"][0, cls_index]

model.zero_grad()
score.backward()

# =====================
# Compute token importance
# =====================
grads = embeds.grad[0]  # (seq_len, embedding_dim)
token_importance = grads.norm(dim=-1).detach().cpu().numpy()

# =====================
# Map token saliency to keywords only
# =====================
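token_scores = {}
for t, s in zip(tokens, token_importance):
    clean_token = t.replace("Ġ", "").lower()  # strip GPT-2's leading-space marker
    # Substring match: a keyword such as "ST" also matches tokens like "resting" or "fasting"
    if any(k.lower() in clean_token for k in key_words):
        token_scores[t] = s  # a repeated token keeps only its last occurrence's score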

# =====================
# Print keyword-specific saliency
# =====================
print("Keyword-level gradient saliency:")
for t, s in token_scores.items():
    print(f"{t}: {s:.4f}")

Keyword-level gradient saliency:
Ġaged: 0.0871
Ġchest: 0.0666
Ġpain: 0.0920
Ġresting: 0.0310
Ġblood: 0.0345
Ġpressure: 0.0393
Ġcholesterol: 0.0679
Ġfasting: 0.0507
Ġsugar: 0.0503
Ġheart: 0.0534
Ġrate: 0.0435
ĠST: 0.2561
Ġdepression: 0.0952
Ġmajor: 0.4008
Ġvessels: 0.3308
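
The list above includes `Ġresting` and `Ġfasting` because the filter matches substrings ("st"), and keywords that GPT-2 splits into several subword tokens may not match at all. One possible refinement, sketched below under stated assumptions, is to merge subword tokens back into words, sum their saliency, and match whole words only. The function `word_level_saliency`, the toy tokenization, and the toy scores are all hypothetical and are not taken from the notebook's run.

```python
def word_level_saliency(tokens, scores, keywords):
    """Merge GPT-2 subword tokens into words and sum their saliency scores.

    A token starting with "Ġ" begins a new word; an alphanumeric token without
    that marker continues the previous alphanumeric word. Returns a list of
    (word, summed_score) pairs, one per occurrence, for exact keyword matches.
    """
    words = []  # each entry: [word_text, summed_score]
    for tok, score in zip(tokens, scores):
        piece = tok.replace("Ġ", "")
        continues_word = (
            bool(words)
            and not tok.startswith("Ġ")
            and piece.isalnum()
            and words[-1][0].isalnum()
        )
        if continues_word:
            words[-1][0] += piece
            words[-1][1] += float(score)
        else:
            words.append([piece, float(score)])
    wanted = {k.lower() for k in keywords}
    return [(w, s) for w, s in words if w.lower() in wanted]

# Hypothetical tokenization and scores, for illustration only.
toy_tokens = ["ĠST", "Ġdepression", ",", "Ġresting", "ĠEC", "G", "Ġthal", "asse", "mia"]
toy_scores = [0.25, 0.10, 0.01, 0.03, 0.04, 0.02, 0.05, 0.03, 0.02]

result = word_level_saliency(toy_tokens, toy_scores, ["ST", "ECG", "thalassemia"])
for word, score in result:
    print(f"{word}: {score:.4f}")
```

Because the result is a list rather than a dictionary, repeated words such as "blood" or "ST" keep one score per occurrence instead of being overwritten.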
