### Processing and Filtering MIMIC-III Discharge Notes

This code filters and processes discharge summaries from the MIMIC-III dataset. It removes newborn admissions and incomplete records, keeps only the latest admission per patient, and combines notes by admission. With `admission_text_only=True`, it keeps only the sections typically available at admission (e.g., chief complaint, medical history), so that later sections of the discharge summary do not leak the diagnoses the model is meant to predict.


In [None]:
import pandas as pd
import re

def filter_notes(notes_df: pd.DataFrame, admissions_df: pd.DataFrame, admission_text_only: bool = False) -> pd.DataFrame:
    """
    Filter and clean MIMIC-III note data, retaining only the last admission per patient.

    Parameters:
      notes_df: DataFrame containing note data.
      admissions_df: DataFrame containing admission data.
      admission_text_only: If True, further extract only sections known at admission.

    Returns:
      A cleaned DataFrame with a consolidated TEXT field (or TEXT_ADMISSION if filtered),
      with only the latest admission per patient.
    """
    # Filter out newborn admissions.
    adult_admissions = admissions_df[admissions_df.ADMISSION_TYPE != "NEWBORN"]
    notes_df = notes_df[notes_df.HADM_ID.isin(adult_admissions.HADM_ID)]

    # Drop rows without TEXT or HADM_ID.
    notes_df = notes_df.dropna(subset=["TEXT", "HADM_ID"])

    # Keep only discharge summaries.
    notes_df = notes_df[notes_df.CATEGORY == "Discharge summary"]

    # Sort by CHARTDATE and remove duplicate TEXT entries (keep the latest).
    notes_df = notes_df.sort_values(by=["CHARTDATE"])
    notes_df = notes_df.drop_duplicates(subset=["TEXT"], keep="last")

    # Combine all texts for each HADM_ID.
    combined_texts = notes_df.groupby("HADM_ID")["TEXT"].apply(lambda texts: "\n\n".join(texts)).reset_index()

    # Get main report rows to retain metadata.
    main_reports = notes_df[notes_df.DESCRIPTION == "Report"].copy()
    main_reports = main_reports[["HADM_ID", "ROW_ID", "SUBJECT_ID", "CHARTDATE"]]
    main_reports = main_reports.drop_duplicates(subset=["HADM_ID"], keep="last")

    # Merge the combined texts with metadata.
    notes_df = pd.merge(combined_texts, main_reports, on="HADM_ID", how="inner")
    notes_df["TEXT"] = notes_df["TEXT"].str.strip()

    # Ensure we have all critical fields.
    notes_df = notes_df.dropna(subset=["HADM_ID", "SUBJECT_ID", "TEXT"])

    # Retain only the last admission per patient.
    notes_df = notes_df.sort_values(by=["SUBJECT_ID", "CHARTDATE"])
    notes_df = notes_df.drop_duplicates(subset=["SUBJECT_ID"], keep="last")

    # Optionally filter text to admission-only sections.
    if admission_text_only:
        notes_df = filter_admission_text(notes_df)

    return notes_df



def filter_admission_text(notes_df: pd.DataFrame) -> pd.DataFrame:
    """
    Extract and combine admission-related sections from the TEXT field.

    The function extracts sections that are typically available at admission:
      - CHIEF_COMPLAINT
      - PRESENT_ILLNESS
      - MEDICAL_HISTORY
      - MEDICATIONS ON ADMISSION
      - ALLERGIES
      - FAMILY_HISTORY
      - SOCIAL_HISTORY

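    It uses a regex pattern that (case-insensitively) captures the text following each section header,
    up to the next line that looks like a header (letters and whitespace ending in a colon) or the end of the text.
    Because the match is case-insensitive, mixed-case headers such as "Past Medical History:" also end a section.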

    Parameters:
      notes_df: DataFrame containing a TEXT column.

    Returns:
      DataFrame with additional columns for each section and a new TEXT_ADMISSION column that
      concatenates the extracted sections.
    """
    # Define mapping of section keys to header strings.
    admission_sections = {
        "CHIEF_COMPLAINT": "chief complaint:",
        "PRESENT_ILLNESS": "present illness:",
        "MEDICAL_HISTORY": "medical history:",
        "MEDICATION_ADM": "medications on admission:",
        "ALLERGIES": "allergies:",
        "FAMILY_HISTORY": "family history:",
        "SOCIAL_HISTORY": "social history:"
    }

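    # Work on a copy so the column assignments below do not modify the caller's frame.
    notes_df = notes_df.copy()

    # For each section, extract text using a regex pattern.
    # The pattern finds the header, then lazily captures text until a newline followed by
    # another header line (letters, spaces, or tabs ending in a colon) or the end of the string.
    # With the (?i) flag, [A-Z] also matches lowercase letters. The lookahead uses [A-Z \t]
    # rather than \s so it cannot run across line breaks and cut off purely alphabetic
    # sections such as "chest pain".
    for key, header in admission_sections.items():
        # Escape header to avoid regex metacharacter issues.
        regex_pattern = r'(?is){}(.*?)(?=\n[A-Z][A-Z \t]+?:|$)'.format(re.escape(header))
        notes_df[key] = notes_df["TEXT"].str.extract(regex_pattern, expand=False).fillna("").str.strip()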

        # Optionally clear sections that start with unwanted tokens.
        notes_df.loc[notes_df[key].str.startswith("[]"), key] = ""

    # Drop notes in which all three main admission sections are empty
    # (a note is kept if at least one of them was found).
    notes_df = notes_df[
        (notes_df["CHIEF_COMPLAINT"] != "") |
        (notes_df["PRESENT_ILLNESS"] != "") |
        (notes_df["MEDICAL_HISTORY"] != "")
    ].copy()

    # Combine the extracted sections into a new column.
    notes_df["TEXT_ADMISSION"] = (
        "CHIEF COMPLAINT: " + notes_df["CHIEF_COMPLAINT"].astype(str) + "\n\n" +
        "PRESENT ILLNESS: " + notes_df["PRESENT_ILLNESS"].astype(str) + "\n\n" +
        "MEDICAL HISTORY: " + notes_df["MEDICAL_HISTORY"].astype(str) + "\n\n" +
        "MEDICATION ON ADMISSION: " + notes_df["MEDICATION_ADM"].astype(str) + "\n\n" +
        "ALLERGIES: " + notes_df["ALLERGIES"].astype(str) + "\n\n" +
        "FAMILY HISTORY: " + notes_df["FAMILY_HISTORY"].astype(str) + "\n\n" +
        "SOCIAL HISTORY: " + notes_df["SOCIAL_HISTORY"].astype(str)
    )

    return notes_df
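
To see how the section extraction behaves, here is a minimal sketch that applies the same kind of pattern to a single string with the standard `re` module. The helper name `extract_section` and the toy note are invented for illustration, and the lookahead is restricted to a single line (`[A-Z \t]`), which is an assumption about how headers are laid out.

```python
import re

def extract_section(text: str, header: str) -> str:
    """Return the text after `header` up to the next header-like line (or end of text)."""
    # (?i) makes [A-Z] match lowercase too; [A-Z \t] keeps the lookahead on one line.
    pattern = r"(?is){}(.*?)(?=\n[A-Z][A-Z \t]+?:|$)".format(re.escape(header))
    match = re.search(pattern, text)
    return match.group(1).strip() if match else ""

# Invented toy note, not taken from MIMIC-III.
toy_note = (
    "Chief Complaint:\nchest pain\n\n"
    "History of Present Illness:\n65 yo M with two days of chest pain.\n\n"
    "Past Medical History:\nhypertension\n\n"
    "Discharge Diagnosis:\nNSTEMI\n"
)

chief = extract_section(toy_note, "chief complaint:")
illness = extract_section(toy_note, "present illness:")
history = extract_section(toy_note, "medical history:")
missing = extract_section(toy_note, "allergies:")
print(chief, illness, history, repr(missing), sep=" | ")
```

Note that the discharge diagnosis never appears in any extracted section, which is the point of restricting the text to admission-time sections.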


In [None]:
import pandas as pd

notes_df = pd.read_csv('NOTEEVENTS.csv.gz', compression='gzip')
admissions_df = pd.read_csv('ADMISSIONS.csv.gz', compression='gzip')

cleaned_notes_admission = filter_notes(notes_df, admissions_df, admission_text_only=True)
cleaned_notes_admission.head()

### Merging Diagnoses with Clinical Notes

This code loads diagnosis data from MIMIC-III, groups ICD-9 codes by hospital admission, and merges them with the previously cleaned clinical notes. It keeps only records with associated diagnoses for further analysis.


In [None]:
# Load the DIAGNOSES_ICD data; read ICD9_CODE as text so leading zeros (e.g. "0389") are preserved
diagnoses_df = pd.read_csv('DIAGNOSES_ICD.csv.gz', compression='gzip', dtype={'ICD9_CODE': str})

# Group by HADM_ID and aggregate unique ICD9 codes into a list
diagnoses_grouped = diagnoses_df.groupby('HADM_ID')['ICD9_CODE'] \
    .apply(lambda codes: list(codes.unique())) \
    .reset_index(name='DIAGNOSES')

merged_df = pd.merge(cleaned_notes_admission, diagnoses_grouped, on='HADM_ID', how='left')

merged_df = merged_df.dropna(subset=['DIAGNOSES'])

merged_df.head()

### Creating Label Mappings for ICD-9 Codes

This code generates a sorted list of unique ICD-9 codes and creates two dictionaries: one mapping each code to a unique integer ID (`label2id`), and another mapping IDs back to their corresponding codes (`id2label`).


In [None]:
import pandas as pd

# Load the DIAGNOSES_ICD data; read ICD9_CODE as text so leading zeros (e.g. "0389") are preserved
diagnoses_df = pd.read_csv('DIAGNOSES_ICD.csv.gz', compression='gzip', dtype={'ICD9_CODE': str})

# Get sorted list of unique ICD9 codes
unique_icd9_codes = sorted(diagnoses_df['ICD9_CODE'].dropna().unique())

# Create mappings
label2id = {code: idx for idx, code in enumerate(unique_icd9_codes)}
id2label = {idx: code for code, idx in label2id.items()}

print(label2id)
print(id2label)

### Creating Label Mappings for ICD-9 Code Chapters

This mapping is used for the second model, which predicts the 19 ICD-9 chapters instead of individual codes and therefore has far fewer labels.

In [None]:
# Mapping from chapter name to ID
chaptertoid = {
    "Infectious and Parasitic Diseases": 0,
    "Neoplasms": 1,
    "Endocrine, Nutritional and Metabolic Diseases, and Immunity Disorders": 2,
    "Diseases of the Blood and Blood-Forming Organs": 3,
    "Mental Disorders": 4,
    "Diseases of the Nervous System and Sense Organs": 5,
    "Diseases of the Circulatory System": 6,
    "Diseases of the Respiratory System": 7,
    "Diseases of the Digestive System": 8,
    "Diseases of the Genitourinary System": 9,
    "Complications of Pregnancy, Childbirth, and the Puerperium": 10,
    "Diseases of the Skin and Subcutaneous Tissue": 11,
    "Diseases of the Musculoskeletal System and Connective Tissue": 12,
    "Congenital Anomalies": 13,
    "Certain Conditions Originating in the Perinatal Period": 14,
    "Symptoms, Signs, and Ill-Defined Conditions": 15,
    "Injury and Poisoning": 16,
    "External Causes of Injury (E codes)": 17,
    "Supplemental Classification (V codes)": 18
}

# Reverse mapping from ID to chapter name
idtochapter = {v: k for k, v in chaptertoid.items()}

print(chaptertoid)
print(idtochapter)

### Converting Diagnoses to Label Indices

This code maps each list of ICD-9 codes in the dataset to their corresponding integer IDs using the `label2id` dictionary, and stores the result in a new `DIAGNOSES_INDICES` column.


In [None]:
# Create the DIAGNOSES_INDICES column using the label2id mapping
merged_df['DIAGNOSES_INDICES'] = merged_df['DIAGNOSES'].apply(
    lambda codes: [label2id[code] for code in codes if code in label2id]
)


In [None]:
# Define ICD-9 numeric code ranges and their chapter names
icd9_chapter_ranges = [
    ((1, 139), "Infectious and Parasitic Diseases"),
    ((140, 239), "Neoplasms"),
    ((240, 279), "Endocrine, Nutritional and Metabolic Diseases, and Immunity Disorders"),
    ((280, 289), "Diseases of the Blood and Blood-Forming Organs"),
    ((290, 319), "Mental Disorders"),
    ((320, 389), "Diseases of the Nervous System and Sense Organs"),
    ((390, 459), "Diseases of the Circulatory System"),
    ((460, 519), "Diseases of the Respiratory System"),
    ((520, 579), "Diseases of the Digestive System"),
    ((580, 629), "Diseases of the Genitourinary System"),
    ((630, 679), "Complications of Pregnancy, Childbirth, and the Puerperium"),
    ((680, 709), "Diseases of the Skin and Subcutaneous Tissue"),
    ((710, 739), "Diseases of the Musculoskeletal System and Connective Tissue"),
    ((740, 759), "Congenital Anomalies"),
    ((760, 779), "Certain Conditions Originating in the Perinatal Period"),
    ((780, 799), "Symptoms, Signs, and Ill-Defined Conditions"),
    ((800, 999), "Injury and Poisoning")
]

def icd9_to_chapter_index(code):
    try:
        code = str(code).strip()

        # Handle E and V codes directly
        if code.startswith('E'):
            return chaptertoid["External Causes of Injury (E codes)"]
        elif code.startswith('V'):
            return chaptertoid["Supplemental Classification (V codes)"]

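        # MIMIC-III stores ICD-9 codes without the decimal point and with leading zeros
        # preserved (e.g. "0389" is 038.9), so the category is the first three characters.
        # Stripping leading zeros first would turn "0389" into 389 and select the wrong chapter.
        root_code = code[:3]
        numeric_code = int(root_code)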

        for (start, end), chapter in icd9_chapter_ranges:
            if start <= numeric_code <= end:
                return chaptertoid[chapter]
    except (ValueError, TypeError):
        pass

    return None
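
As a quick sanity check of the chapter lookup, the standalone sketch below maps a few codes to chapters. It assumes codes are stored as in MIMIC-III (no decimal point, leading zeros kept) and takes the first three characters as the category. The function name `chapter_of` and the shortened chapter labels are hypothetical and used only here.

```python
from typing import Optional

# (first category, last category, short chapter label); labels abbreviated for this sketch.
RANGES = [
    (1, 139, "infectious"), (140, 239, "neoplasms"), (240, 279, "endocrine"),
    (280, 289, "blood"), (290, 319, "mental"), (320, 389, "nervous"),
    (390, 459, "circulatory"), (460, 519, "respiratory"), (520, 579, "digestive"),
    (580, 629, "genitourinary"), (630, 679, "pregnancy"), (680, 709, "skin"),
    (710, 739, "musculoskeletal"), (740, 759, "congenital"), (760, 779, "perinatal"),
    (780, 799, "ill-defined"), (800, 999, "injury"),
]

def chapter_of(code: str) -> Optional[str]:
    code = str(code).strip()
    # E and V codes form their own supplementary chapters.
    if code.startswith("E"):
        return "external causes (E codes)"
    if code.startswith("V"):
        return "supplemental (V codes)"
    try:
        category = int(code[:3])  # "0389" -> "038" -> 38
    except ValueError:
        return None
    for start, end, name in RANGES:
        if start <= category <= end:
            return name
    return None

for code in ["0389", "4280", "25000", "V4581", "E8490"]:
    print(code, "->", chapter_of(code))
```

Here `"0389"` (038.9) lands in the infectious diseases chapter. If leading zeros were stripped first, the same code would be read as 389 and assigned to the nervous system chapter.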

In [None]:
# Deduplicate chapter indices per admission; sort so the stored order is deterministic.
merged_df['CHAPTER_INDICES'] = merged_df['DIAGNOSES'].apply(
    lambda codes: sorted({idx for idx in map(icd9_to_chapter_index, codes) if idx is not None})
)

### Splitting Data into Train and Validation Sets

This code splits the dataset into training and validation sets using an 80/20 split. The data is shuffled before splitting, and indices are reset for convenience.

In [None]:
from sklearn.model_selection import train_test_split

# Split the DataFrame
train_df, val_df = train_test_split(merged_df, test_size=0.2, random_state=42, shuffle=True)

train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)


### Tokenizing and Saving Data in JSONL Format

This code tokenizes the admission text with the `nlpie/distil-clinicalbert` tokenizer, padding or truncating every note to 512 tokens, and saves the data in JSONL format. Each line contains the admission ID, the tokenized inputs (`input_ids` and `attention_mask`), and the corresponding labels (ICD-9 code indices or chapter indices). Separate files are created for the training and validation sets.


In [None]:
from transformers import AutoTokenizer
import json
from tqdm import tqdm

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("nlpie/distil-clinicalbert")

# Function to convert DataFrame to JSONL format
def dataframe_to_jsonl(df, output_path, type="codes"):
    with open(output_path, 'w') as f:
        for _, row in tqdm(df.iterrows(), total=len(df)):
            tokens = tokenizer(
                row["TEXT_ADMISSION"],
                padding='max_length',
                truncation=True,
                max_length=512,
                return_tensors=None
            )
            labels = row["DIAGNOSES_INDICES"] if type == "codes" else row["CHAPTER_INDICES"]

            json_obj = {
                "id": int(row["HADM_ID"]),
                "tokens": {
                    "input_ids": tokens["input_ids"],
                    "attention_mask": tokens["attention_mask"]
                },
                "labels": labels
            }
            f.write(json.dumps(json_obj) + "\n")
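
For reference, the sketch below writes and reads back two records in the same JSONL layout, using an in-memory buffer in place of a file. The admission IDs, token IDs, and label indices are made-up values, and the sequences are shortened to four tokens for readability.

```python
import io
import json

# Hypothetical record; real records hold 512 token IDs and mask entries.
record = {
    "id": 100001,
    "tokens": {"input_ids": [101, 7592, 102, 0], "attention_mask": [1, 1, 1, 0]},
    "labels": [3, 17],
}

buffer = io.StringIO()  # stands in for a file opened with open(path, "w")
buffer.write(json.dumps(record) + "\n")
buffer.write(json.dumps({**record, "id": 100002, "labels": []}) + "\n")

# Reading: one JSON object per line.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]

# The attention mask marks real tokens with 1 and padding with 0.
real_tokens = [sum(r["tokens"]["attention_mask"]) for r in loaded]
print(len(loaded), real_tokens)
```

An empty `labels` list is valid and simply yields an all-zero target vector later on.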

In [None]:
# Save train and val sets
dataframe_to_jsonl(train_df, "train.jsonl", type="codes")
dataframe_to_jsonl(val_df, "val.jsonl", type="codes")

In [None]:
# Save train and val sets
dataframe_to_jsonl(train_df, "train_chapters.jsonl", type="chapters")
dataframe_to_jsonl(val_df, "val_chapters.jsonl", type="chapters")

100%|██████████| 29505/29505 [01:14<00:00, 397.03it/s]
100%|██████████| 7377/7377 [00:16<00:00, 442.28it/s]


### Save Files for Later Use

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
import os

# Define desired path inside Google Drive
save_dir = "/content/drive/MyDrive/clinical_notes_data"
os.makedirs(save_dir, exist_ok=True)

In [None]:
# Paths to save the mapping dictionaries
label2id_path = os.path.join(save_dir, "label2id.json")
id2label_path = os.path.join(save_dir, "id2label.json")

# Save as JSON
with open(label2id_path, 'w') as f:
    json.dump(label2id, f)

with open(id2label_path, 'w') as f:
    json.dump(id2label, f)


In [None]:
# Paths to save the mapping dictionaries
chapter2id_path = os.path.join(save_dir, "chapter2id.json")
id2chapter_path = os.path.join(save_dir, "id2chapter.json")

# Save as JSON
with open(chapter2id_path, 'w') as f:
    json.dump(chaptertoid, f)

with open(id2chapter_path, 'w') as f:
    json.dump(idtochapter, f)

In [None]:
import shutil

local_train_path = "/content/train.jsonl"
local_val_path = "/content/val.jsonl"

drive_train_path = os.path.join(save_dir, "train.jsonl")
drive_val_path = os.path.join(save_dir, "val.jsonl")

shutil.move(local_train_path, drive_train_path)
shutil.move(local_val_path, drive_val_path)


'/content/drive/MyDrive/clinical_notes_data/val.jsonl'

In [None]:
local_train_chapter_path = "/content/train_chapters.jsonl"
local_val_chapter_path = "/content/val_chapters.jsonl"

drive_train_chapter_path = os.path.join(save_dir, "train_chapters.jsonl")
drive_val_chapter_path = os.path.join(save_dir, "val_chapters.jsonl")

shutil.move(local_train_chapter_path, drive_train_chapter_path)
shutil.move(local_val_chapter_path, drive_val_chapter_path)

'/content/drive/MyDrive/clinical_notes_data/val_chapters.jsonl'

### Custom PyTorch Dataset for Clinical Notes

This code defines a custom `Dataset` class for loading tokenized clinical notes from a JSONL file. Each sample includes input tokens, attention masks, and multi-hot encoded diagnosis labels, making it suitable for multi-label classification tasks.


In [None]:
import torch
from torch.utils.data import Dataset
import json

class ClinicalNotesDataset(Dataset):
    def __init__(self, json_path, num_labels):
        """
        Args:
            json_path (str): Path to the JSONL file.
            num_labels (int): Total number of label classes (used for multi-hot encoding).
        """
        self.samples = []
        self.num_labels = num_labels

        # Load data from JSONL file
        with open(json_path, 'r') as f:
            for line in f:
                self.samples.append(json.loads(line.strip()))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        item = self.samples[idx]

        # Extract input_ids and attention_mask
        input_ids = torch.tensor(item["tokens"]["input_ids"], dtype=torch.long)
        attention_mask = torch.tensor(item["tokens"]["attention_mask"], dtype=torch.long)

        # Create multi-hot encoded label vector
        labels = torch.zeros(self.num_labels, dtype=torch.float)
        for label_idx in item["labels"]:
            labels[label_idx] = 1.0

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        }
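
The multi-hot encoding in `__getitem__` can be illustrated without PyTorch. This is a plain-Python sketch of the same idea; the function name `multi_hot` and the bounds check are additions for illustration and are not part of the dataset class.

```python
from typing import List

def multi_hot(label_indices: List[int], num_labels: int) -> List[float]:
    """Return a vector with 1.0 at every listed index and 0.0 elsewhere."""
    vector = [0.0] * num_labels
    for idx in label_indices:
        # Guard against indices that do not fit the label space.
        if not 0 <= idx < num_labels:
            raise IndexError(f"label index {idx} outside [0, {num_labels})")
        vector[idx] = 1.0
    return vector

example = multi_hot([0, 3], num_labels=5)
print(example)  # [1.0, 0.0, 0.0, 1.0, 0.0]
```

A check like this can be useful because `num_labels` has to match the mapping that produced the indices (`label2id` for codes, `chapter2id` for chapters).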


### Evaluation Function for Multi-Label Classification

This function evaluates a trained model on a validation set using several multi-label metrics: micro-averaged ROC AUC and PR AUC, micro- and macro-averaged precision, recall, and F1, and Hamming loss. Probabilities come from a sigmoid over the logits, and binary predictions use a 0.5 threshold.


In [None]:
import torch
import numpy as np
from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    precision_score, recall_score, f1_score,
    hamming_loss
)
from tqdm import tqdm

def evaluate(model, dataloader, device, k=5):
    model.eval()
    model.to(device)
    all_logits = []
    all_labels = []

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits  # [batch_size, num_labels]

            all_logits.append(logits.cpu())
            all_labels.append(labels.cpu())

    # Concatenate across batches
    all_logits = torch.cat(all_logits)
    all_labels = torch.cat(all_labels)

    # Convert to numpy
    probs = torch.sigmoid(all_logits).numpy()
    true = all_labels.numpy()
    preds = (probs >= 0.5).astype(int)  # Binary predictions with 0.5 threshold

    # Compute metrics
    metrics = {}

    # AUC metrics
    metrics["AUC_ROC_micro"] = roc_auc_score(true, probs, average="micro")
    metrics["AUC_PR_micro"] = average_precision_score(true, probs, average="micro")

    # Precision / Recall / F1
    metrics["Precision_micro"] = precision_score(true, preds, average="micro", zero_division=0)
    metrics["Recall_micro"] = recall_score(true, preds, average="micro", zero_division=0)
    metrics["F1_micro"] = f1_score(true, preds, average="micro", zero_division=0)

    metrics["Precision_macro"] = precision_score(true, preds, average="macro", zero_division=0)
    metrics["Recall_macro"] = recall_score(true, preds, average="macro", zero_division=0)
    metrics["F1_macro"] = f1_score(true, preds, average="macro", zero_division=0)

    # Hamming Loss
    metrics["Hamming_Loss"] = hamming_loss(true, preds)

    return metrics
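
To make the difference between micro and macro averaging concrete, here is a small hand computation in plain Python on an invented 4-sample, 3-label example. It mirrors the definitions behind the metrics above (with undefined F1 treated as 0, as with `zero_division=0`), but it is only a sketch and not a replacement for the scikit-learn calls.

```python
true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]  # a model that always predicts label 0

def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

num_labels = len(true[0])
counts = []  # (tp, fp, fn) per label
for j in range(num_labels):
    tp = sum(1 for t, p in zip(true, pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(true, pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(true, pred) if t[j] == 1 and p[j] == 0)
    counts.append((tp, fp, fn))

# Macro: average the per-label F1 scores, so rare labels weigh as much as common ones.
macro_f1 = sum(f1_from_counts(*c) for c in counts) / num_labels
# Micro: pool the counts over all labels first, so frequent labels dominate.
micro_f1 = f1_from_counts(*[sum(c[i] for c in counts) for i in range(3)])
# Hamming loss: fraction of individual label decisions that are wrong.
wrong = sum(t[j] != p[j] for t, p in zip(true, pred) for j in range(num_labels))
hamming = wrong / (len(true) * num_labels)

print(f"micro F1={micro_f1:.4f} macro F1={macro_f1:.4f} Hamming={hamming:.4f}")
```

In this example the micro F1 (about 0.67) looks much better than the macro F1 (about 0.29) because the two rare labels are never predicted.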


### Training Loop for Multi-Label Classification

This function trains the model over multiple epochs using the provided data loaders, loss function, and optimizer. After each epoch, it prints the average training loss, evaluates the model on the validation set using the `evaluate` function, and saves a checkpoint of the model weights to `save_dir`.


In [None]:
import os
import torch

def train_model(model, device, train_loader, val_loader, loss_fn, optimizer, num_epochs, save_dir="checkpoints"):
    model.to(device)

    # Create directory if it doesn't exist
    os.makedirs(save_dir, exist_ok=True)

    for epoch in range(1, num_epochs + 1):
        model.train()
        total_loss = 0.0

        for batch in tqdm(train_loader, desc=f"Training Epoch {epoch}"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits

            loss = loss_fn(logits, labels)
            loss.backward()

            optimizer.step()
            optimizer.zero_grad()

            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        print(f"\nEpoch {epoch}/{num_epochs} - Training Loss: {avg_loss:.4f}")

        # Evaluate model
        val_metrics = evaluate(model, val_loader, device)
        print("Validation Metrics:")
        for key, value in val_metrics.items():
            print(f"{key}: {value:.4f}")

        # Save model at end of epoch
        checkpoint_path = os.path.join(save_dir, f"model_epoch_{epoch}.pt")
        torch.save(model.state_dict(), checkpoint_path)
        print(f"Model saved to {checkpoint_path}")

    return model


In [None]:
import os
import json

data_dir = "/content/drive/MyDrive/clinical_notes_data"

with open(os.path.join(data_dir, "label2id.json"), 'r') as f:
    label2id = json.load(f)

with open(os.path.join(data_dir, "id2label.json"), 'r') as f:
    id2label = json.load(f)

with open(os.path.join(data_dir, "chapter2id.json"), 'r') as f:
    chapter2id = json.load(f)

with open(os.path.join(data_dir, "id2chapter.json"), 'r') as f:
    id2chapter = json.load(f)

num_labels_codes = len(label2id)
num_labels_chapters = len(chapter2id)
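
One detail to keep in mind when reloading these mappings: JSON object keys are always strings, so the integer keys of `id2label` and `id2chapter` come back as strings. The sketch below shows the round trip on a hypothetical two-entry mapping and one way to restore integer keys.

```python
import json

id2label_original = {0: "0030", 1: "0031"}  # hypothetical two-entry mapping

# Saving and loading turns the integer keys into strings.
restored = json.loads(json.dumps(id2label_original))
print(restored)  # {'0': '0030', '1': '0031'}

# Convert the keys back to int before looking up predicted indices.
id2label_int = {int(k): v for k, v in restored.items()}
print(id2label_int[1])  # 0031
```

The forward mappings (`label2id`, `chapter2id`) are unaffected, since their keys were strings to begin with.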


In [None]:
train_path_codes = "/content/drive/MyDrive/clinical_notes_data/train.jsonl"
val_path_codes = "/content/drive/MyDrive/clinical_notes_data/val.jsonl"
train_path_chapters = "/content/drive/MyDrive/clinical_notes_data/train_chapters.jsonl"
val_path_chapters = "/content/drive/MyDrive/clinical_notes_data/val_chapters.jsonl"

In [None]:
train_dataset_codes = ClinicalNotesDataset(train_path_codes, num_labels_codes)
val_dataset_codes = ClinicalNotesDataset(val_path_codes, num_labels_codes)

In [None]:
train_dataset_chapters = ClinicalNotesDataset(train_path_chapters, num_labels_chapters)
val_dataset_chapters = ClinicalNotesDataset(val_path_chapters, num_labels_chapters)

In [None]:
from transformers import AutoModelForSequenceClassification

model_codes = AutoModelForSequenceClassification.from_pretrained(
    "nlpie/distil-clinicalbert",
    num_labels=num_labels_codes,
    problem_type="multi_label_classification"
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at nlpie/distil-clinicalbert and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import AutoModelForSequenceClassification

model_chapters = AutoModelForSequenceClassification.from_pretrained(
    "nlpie/distil-clinicalbert",
    num_labels=num_labels_chapters,
    problem_type="multi_label_classification"
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at nlpie/distil-clinicalbert and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size = 256
num_workers = 2

In [None]:
# Create DataLoaders
train_loader_codes = DataLoader(
    train_dataset_codes,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers
)

val_loader_codes = DataLoader(
    val_dataset_codes,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers
)

In [None]:
train_loader_chapters = DataLoader(
    train_dataset_chapters,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers
)

val_loader_chapters = DataLoader(
    val_dataset_chapters,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers
)

In [None]:
# Evaluate using full evaluation function
initial_metrics_codes = evaluate(model_codes, val_loader_codes, device)

# Print the results
print("Evaluation BEFORE training:")
for key, value in initial_metrics_codes.items():
    print(f"{key}: {value:.4f}")


Evaluating: 100%|██████████| 58/58 [00:21<00:00,  2.74it/s]


Evaluation BEFORE training:
AUC_ROC_micro: 0.5035
AUC_PR_micro: 0.0017
Precision_micro: 0.0018
Recall_micro: 0.5067
F1_micro: 0.0035
Precision_macro: 0.0014
Recall_macro: 0.2840
F1_macro: 0.0024
Hamming_Loss: 0.4938


In [None]:
initial_metrics_chapters = evaluate(model_chapters, val_loader_chapters, device)

print("Evaluation BEFORE training:")
for key, value in initial_metrics_chapters.items():
    print(f"{key}: {value:.4f}")

Evaluating: 100%|██████████| 29/29 [00:20<00:00,  1.41it/s]

Evaluation BEFORE training:
AUC_ROC_micro: 0.4741
AUC_PR_micro: 0.3059
Precision_micro: 0.3315
Recall_micro: 0.4268
F1_micro: 0.3731
Precision_macro: 0.2794
Recall_macro: 0.4323
F1_macro: 0.2325
Hamming_Loss: 0.4759





In [None]:
from peft import get_peft_model, LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    target_modules=["query", "value"]  # Standard names in BERT-like models
)

In [None]:
model_codes = get_peft_model(model_codes, lora_config)
model_codes.print_trainable_parameters()

In [None]:
model_chapters = get_peft_model(model_chapters, lora_config)
model_chapters.print_trainable_parameters()

trainable params: 162,067 || all params: 65,959,718 || trainable%: 0.2457
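
The reported count of 162,067 trainable parameters can be reproduced by hand. The sketch assumes the encoder has 6 layers with hidden size 768 (an assumption about `nlpie/distil-clinicalbert`, not stated in this notebook) and that PEFT keeps the classification head trainable for `TaskType.SEQ_CLS`.

```python
hidden_size = 768       # assumed encoder width
num_layers = 6          # assumed number of transformer layers
rank = 8                # r in LoraConfig
targets_per_layer = 2   # "query" and "value"
num_labels = 19         # chapter labels
all_params = 65_959_718 # total reported by print_trainable_parameters()

# Each adapted d x d linear layer gains two low-rank matrices: A (r x d) and B (d x r).
lora_params = num_layers * targets_per_layer * rank * (hidden_size + hidden_size)

# Classification head: weight matrix plus bias.
classifier_params = hidden_size * num_labels + num_labels

total = lora_params + classifier_params
print(lora_params, classifier_params, total)
print(f"trainable%: {100 * total / all_params:.4f}")
```

Under these assumptions the LoRA adapters account for 147,456 parameters and the classifier for 14,611, which matches the printed total for the chapter model.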


In [None]:
from torch.nn import BCEWithLogitsLoss
from torch.optim import AdamW

num_epochs_codes = 5

# Initialize count array
label_counts_codes = np.zeros(num_labels_codes)

# Loop through file and count labels
with open(train_path_codes, 'r') as f:
    for line in f:
        data = json.loads(line.strip())
        for label_idx in data["labels"]:
            label_counts_codes[label_idx] += 1

# Avoid division by zero (clip label counts to minimum of 1)
label_counts_codes = np.clip(label_counts_codes, 1, None)

# Compute positive weight per class: (negatives / positives)
pos_weight_codes = (len(train_dataset_codes) - label_counts_codes) / label_counts_codes

# Convert to torch tensor
pos_weight_codes = torch.tensor(pos_weight_codes, dtype=torch.float)

# Create the loss function with pos_weight
loss_fn_codes = BCEWithLogitsLoss(pos_weight=pos_weight_codes.to(device))
optimizer_codes = AdamW(model_codes.parameters(), lr=2e-5)
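
The `pos_weight` formula is easier to judge on small numbers. Below is a plain-Python sketch of the same computation (negatives divided by positives, with counts clipped to at least 1). The function name `pos_weights` and the counts are hypothetical.

```python
from typing import List

def pos_weights(label_counts: List[int], num_samples: int) -> List[float]:
    weights = []
    for count in label_counts:
        positives = max(count, 1)  # same effect as np.clip(counts, 1, None)
        weights.append((num_samples - positives) / positives)
    return weights

# Hypothetical counts in a 1,000-note training set: common, rare, and unseen labels.
weights = pos_weights([500, 10, 0], num_samples=1000)
print(weights)  # [1.0, 99.0, 999.0]
```

Rare and unseen labels receive very large weights, so missing a positive is penalized far more than raising a false alarm. With thousands of rare ICD-9 codes, this may help explain a pattern of high recall and very low precision at a 0.5 threshold.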


In [None]:
from torch.nn import BCEWithLogitsLoss
from torch.optim import AdamW

num_epochs_chapters = 5

# Initialize count array
label_counts_chapters = np.zeros(num_labels_chapters)

with open(train_path_chapters, 'r') as f:
    for line in f:
        data = json.loads(line.strip())
        for label_idx in data["labels"]:
            label_counts_chapters[label_idx] += 1

label_counts_chapters = np.clip(label_counts_chapters, 1, None)

pos_weight_chapters = (len(train_dataset_chapters) - label_counts_chapters) / label_counts_chapters

pos_weight_chapters = torch.tensor(pos_weight_chapters, dtype=torch.float)

loss_fn_chapters = BCEWithLogitsLoss(pos_weight=pos_weight_chapters.to(device))
optimizer_chapters = AdamW(model_chapters.parameters(), lr=2e-5)

In [None]:
print(label_counts_chapters)
print(len(label_counts_chapters))

[ 1623.  5167. 20415. 10905.  9570. 11334. 24838. 14392. 11816. 12058.
   109.  4327.  5758.  1069.    69. 11492. 13592.  9892. 15024.]
19


In [None]:
trained_model_codes = train_model(
    model=model_codes,
    device=device,
    train_loader=train_loader_codes,
    val_loader=val_loader_codes,
    loss_fn=loss_fn_codes,
    optimizer=optimizer_codes,
    num_epochs=num_epochs_codes,
    save_dir="/content/drive/MyDrive/clinical_notes_checkpoints"
)

Training Epoch 1: 100%|██████████| 116/116 [03:03<00:00,  1.59s/it]



Epoch 1/5 - Training Loss: 1.2494


Evaluating: 100%|██████████| 29/29 [00:21<00:00,  1.36it/s]


Validation Metrics:
AUC_ROC_micro: 0.8054
AUC_PR_micro: 0.0045
Precision_micro: 0.0052
Recall_micro: 0.5439
F1_micro: 0.0102
Precision_macro: 0.0025
Recall_macro: 0.1808
F1_macro: 0.0042
Hamming_Loss: 0.1807
Model saved to /content/drive/MyDrive/clinical_notes_checkpoints/model_epoch_1.pt


Training Epoch 2: 100%|██████████| 116/116 [03:03<00:00,  1.59s/it]



Epoch 2/5 - Training Loss: 1.2395


Evaluating: 100%|██████████| 29/29 [00:21<00:00,  1.36it/s]


Validation Metrics:
AUC_ROC_micro: 0.8136
AUC_PR_micro: 0.0053
Precision_micro: 0.0054
Recall_micro: 0.5712
F1_micro: 0.0108
Precision_macro: 0.0025
Recall_macro: 0.1925
F1_macro: 0.0044
Hamming_Loss: 0.1799
Model saved to /content/drive/MyDrive/clinical_notes_checkpoints/model_epoch_2.pt


Training Epoch 3: 100%|██████████| 116/116 [03:03<00:00,  1.59s/it]



Epoch 3/5 - Training Loss: 1.2230


Evaluating: 100%|██████████| 29/29 [00:21<00:00,  1.36it/s]


Validation Metrics:
AUC_ROC_micro: 0.8223
AUC_PR_micro: 0.0085
Precision_micro: 0.0057
Recall_micro: 0.6265
F1_micro: 0.0112
Precision_macro: 0.0026
Recall_macro: 0.2146
F1_macro: 0.0046
Hamming_Loss: 0.1890
Model saved to /content/drive/MyDrive/clinical_notes_checkpoints/model_epoch_3.pt


Training Epoch 4: 100%|██████████| 116/116 [03:03<00:00,  1.59s/it]



Epoch 4/5 - Training Loss: 1.2094


Evaluating: 100%|██████████| 29/29 [00:21<00:00,  1.36it/s]


Validation Metrics:
AUC_ROC_micro: 0.8329
AUC_PR_micro: 0.0162
Precision_micro: 0.0058
Recall_micro: 0.6830
F1_micro: 0.0116
Precision_macro: 0.0026
Recall_macro: 0.2436
F1_macro: 0.0047
Hamming_Loss: 0.2001
Model saved to /content/drive/MyDrive/clinical_notes_checkpoints/model_epoch_4.pt


Training Epoch 5: 100%|██████████| 116/116 [03:03<00:00,  1.58s/it]



Epoch 5/5 - Training Loss: 1.1946


Evaluating: 100%|██████████| 29/29 [00:21<00:00,  1.36it/s]


Validation Metrics:
AUC_ROC_micro: 0.8395
AUC_PR_micro: 0.0238
Precision_micro: 0.0059
Recall_micro: 0.7173
F1_micro: 0.0117
Precision_macro: 0.0027
Recall_macro: 0.2677
F1_macro: 0.0049
Hamming_Loss: 0.2077
Model saved to /content/drive/MyDrive/clinical_notes_checkpoints/model_epoch_5.pt


In [None]:
trained_model_chapters = train_model(
    model=model_chapters,
    device=device,
    train_loader=train_loader_chapters,
    val_loader=val_loader_chapters,
    loss_fn=loss_fn_chapters,
    optimizer=optimizer_chapters,
    num_epochs=num_epochs_chapters,
    save_dir="/content/drive/MyDrive/clinical_notes_checkpoints_chapters"
)


Training Epoch 1: 100%|██████████| 116/116 [03:03<00:00,  1.59s/it]



Epoch 1/5 - Training Loss: 0.9312


Evaluating: 100%|██████████| 29/29 [00:20<00:00,  1.39it/s]


Validation Metrics:
AUC_ROC_micro: 0.5576
AUC_PR_micro: 0.3512
Precision_micro: 0.3757
Recall_micro: 0.5253
F1_micro: 0.4381
Precision_macro: 0.3821
Recall_macro: 0.5166
F1_macro: 0.3717
Hamming_Loss: 0.4472
Model saved to /content/drive/MyDrive/clinical_notes_checkpoints_chapters/model_epoch_1.pt


Training Epoch 2: 100%|██████████| 116/116 [03:03<00:00,  1.58s/it]



Epoch 2/5 - Training Loss: 0.9261


Evaluating: 100%|██████████| 29/29 [00:20<00:00,  1.39it/s]


Validation Metrics:
AUC_ROC_micro: 0.6496
AUC_PR_micro: 0.4769
Precision_micro: 0.4262
Recall_micro: 0.6234
F1_micro: 0.5063
Precision_macro: 0.3937
Recall_macro: 0.5980
F1_macro: 0.4248
Hamming_Loss: 0.4035
Model saved to /content/drive/MyDrive/clinical_notes_checkpoints_chapters/model_epoch_2.pt


Training Epoch 3: 100%|██████████| 116/116 [03:03<00:00,  1.58s/it]



Epoch 3/5 - Training Loss: 0.9174


Evaluating: 100%|██████████| 29/29 [00:20<00:00,  1.39it/s]


Validation Metrics:
AUC_ROC_micro: 0.6942
AUC_PR_micro: 0.5468
Precision_micro: 0.4554
Recall_micro: 0.6835
F1_micro: 0.5466
Precision_macro: 0.3986
Recall_macro: 0.6507
F1_macro: 0.4477
Hamming_Loss: 0.3763
Model saved to /content/drive/MyDrive/clinical_notes_checkpoints_chapters/model_epoch_3.pt


Training Epoch 4: 100%|██████████| 116/116 [03:03<00:00,  1.58s/it]



Epoch 4/5 - Training Loss: 0.9034


Evaluating: 100%|██████████| 29/29 [00:20<00:00,  1.39it/s]


Validation Metrics:
AUC_ROC_micro: 0.6987
AUC_PR_micro: 0.5705
Precision_micro: 0.4468
Recall_micro: 0.6961
F1_micro: 0.5443
Precision_macro: 0.4004
Recall_macro: 0.7030
F1_macro: 0.4551
Hamming_Loss: 0.3868
Model saved to /content/drive/MyDrive/clinical_notes_checkpoints_chapters/model_epoch_4.pt


Training Epoch 5: 100%|██████████| 116/116 [03:03<00:00,  1.58s/it]



Epoch 5/5 - Training Loss: 0.8805


Evaluating: 100%|██████████| 29/29 [00:20<00:00,  1.39it/s]


Validation Metrics:
AUC_ROC_micro: 0.7087
AUC_PR_micro: 0.5826
Precision_micro: 0.4619
Recall_micro: 0.6981
F1_micro: 0.5560
Precision_macro: 0.4091
Recall_macro: 0.7040
F1_macro: 0.4638
Hamming_Loss: 0.3701
Model saved to /content/drive/MyDrive/clinical_notes_checkpoints_chapters/model_epoch_5.pt
