<a href="https://colab.research.google.com/github/April-Taylor/NLP-Qualifications-Project/blob/main/Revised_Paired_MIMIC_NLP_ATaylor_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔧 Initial Project Setup (Run Once per Project)

This notebook sets up GitHub integration for a Google Colab project.

It connects Colab to Google Drive and a GitHub repository, allowing work done
in Colab to be saved locally in Drive and versioned via GitHub. It also securely
loads any API keys or tokens needed (e.g., GitHub, WandB, UMLS, HuggingFace) using Colab's Secrets tool.



**📁 Important:**


You must:
- Have a GitHub account and pre-created repository
- Use the same GitHub repo name in the USER INPUT section
- Place your working notebooks inside the `notebooks/` subfolder of the project folder in Google Drive. Everything in the project folder is tracked by Git and is pushed to GitHub when you run the end-of-session push cell.
- Create API tokens

⚠️ To ensure autosave works correctly, always open this notebook directly from Google Drive — not from “Recent,” “GitHub,” or an uploaded file.

🔐 Required Secrets:
- GitHubToken
- wandb (optional)
- UMLS (optional)
- HF_TOKEN (optional)

⚠️ Make sure you have created these API tokens on their respective platforms and added them via the Colab Secrets UI (🔑 icon on the left sidebar → Add new secret). The GitHub Token is required for proper setup. The other secrets are optional depending on the project needs.
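A minimal sketch of loading one required secret and several optional ones without failing when an optional one is absent. The `load_secrets` helper and the `fake_getter` stand-in are hypothetical names used only for illustration; in Colab, `userdata.get` would be passed as the getter, assuming it raises when a secret is missing.

```python
import os

def load_secrets(getter, required, optional, env=None):
    """Copy secrets into env and return the optional secret names that were skipped."""
    env = os.environ if env is None else env
    for secret_name, env_name in required.items():
        # A missing required secret should stop setup, so let the error propagate.
        env[env_name] = getter(secret_name)
    skipped = []
    for secret_name, env_name in optional.items():
        try:
            value = getter(secret_name)
        except Exception:
            value = None
        if value:
            env[env_name] = value
        else:
            skipped.append(secret_name)
    return skipped

# Stand-in for the Colab secret store so the sketch runs anywhere (hypothetical values).
_fake_store = {"GitHubToken": "ghp_example", "HF_TOKEN": "hf_example"}

def fake_getter(name):
    return _fake_store[name]  # raises KeyError when the secret is absent

env = {}
skipped = load_secrets(
    fake_getter,
    required={"GitHubToken": "GITHUB_TOKEN"},
    optional={"wandb": "WANDB_API_KEY", "UMLS": "UMLS_API_KEY", "HF_TOKEN": "HF_TOKEN"},
    env=env,
)
print("Skipped optional secrets:", skipped)
```

In the notebook itself, the same call would pass `userdata.get` as the getter and rely on the default `env` (`os.environ`).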



🚀 **To initiate a new project**
1.   Update the user inputs
2.   Run this notebook through RUN SETUP


📌 Note: This entire section should be run only once when starting a new project. It initializes the GitHub connection, sets up the folder structure, and loads your Colab secrets.

For regular use:

*   At the start of each new Colab session, run the "Daily Colab Session Setup" cell to mount Google Drive, set your Git identity, and load secrets.

*   At the end of your session, run the "END-OF-SESSION PUSH TO GITHUB" cell to commit your work, pull any remote updates, and push to the repository.
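The end-of-session routine can also be written as a plain Python sketch. It assumes the default branch is `main`, that a remote named `origin` already has push credentials configured, and that `end_of_session_commands` and `run_all` are hypothetical helper names rather than part of this notebook.

```python
import subprocess

def end_of_session_commands(message, branch="main"):
    """Return the git commands for an end-of-session sync, in order."""
    return [
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        ["git", "pull", "--rebase", "origin", branch],
        ["git", "push", "origin", branch],
    ]

def run_all(commands, cwd, runner=subprocess.run):
    """Run each command inside cwd, stopping at the first failure.

    Returns the number of commands that completed successfully.
    """
    completed = 0
    for cmd in commands:
        result = runner(cmd, cwd=cwd)
        if result.returncode != 0:
            print("Stopped at:", " ".join(cmd))
            break
        completed += 1
    return completed

commands = end_of_session_commands("End-of-session update from Colab")
```

Calling `run_all(commands, cwd=project_path)` would execute the sequence inside the cloned repository. Note that `git commit` exits with a non-zero status when there is nothing to commit, so this runner stops before pulling in that case.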

## USER INPUT



In [3]:
# Only update this section to reuse across projects
GITHUB_USER = "April-Taylor"
REPO_NAME = "NLP-Qualifications-Project"
USER_EMAIL = "Taylor.april.dawn@gmail.com"
USER_FULLNAME = "April Taylor"
REPO_DESCRIPTION = """
This repository contains code, data, and deliverables related to NLP-based adverse event detection integrating structured glucose-insulin time series and unstructured clinical text. Designed for modular reuse across projects.
"""
DRIVE_BASE = "MyDrive/ColabRepos"  # ✅ Adjust this if your folder structure is different

from google.colab import drive, userdata
import os



## FUNCTIONALIZED SETUP
Project setup function - run only once per project

In [4]:
def setup_colab_project(github_user, repo_name, user_email, user_fullname, repo_description, drive_base):
    project_path = f"/content/drive/{drive_base}/{repo_name}"
    repo_url = f"https://github.com/{github_user}/{repo_name}.git"

    # Mount Google Drive
    if not os.path.ismount("/content/drive"):
        drive.mount('/content/drive')
    os.makedirs(f"/content/drive/{drive_base}", exist_ok=True)

    # Git identity setup
    !git config --global user.email "{user_email}"
    !git config --global user.name "{user_fullname}"

    # Clone repo if not already in Drive
    if not os.path.exists(project_path):
        !git clone {repo_url} "{project_path}"

    %cd "{project_path}"

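    # Load secrets securely (GitHubToken is required; the others are optional)
    os.environ['GITHUB_TOKEN'] = userdata.get('GitHubToken')
    for secret, env_var in [('wandb', 'WANDB_API_KEY'), ('UMLS', 'UMLS_API_KEY'), ('HF_TOKEN', 'HF_TOKEN')]:
        try:
            os.environ[env_var] = userdata.get(secret)
        except (userdata.SecretNotFoundError, userdata.NotebookAccessError):
            print(f"Optional secret '{secret}' not available; skipping.")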

    # Create folder structure
    for folder in ["notebooks", "models", "data", "src", "outputs"]:
        folder_path = os.path.join(project_path, folder)
        os.makedirs(folder_path, exist_ok=True)
        gitkeep = os.path.join(folder_path, ".gitkeep")
        if not os.path.exists(gitkeep):
            open(gitkeep, "w").close()

    # Add README if not present
    readme_path = os.path.join(project_path, "README.md")
    if not os.path.exists(readme_path):
        with open(readme_path, "w") as f:
            f.write(f"""# {repo_name}

{repo_description}

---
Maintained by **{github_user}**, 2025
""")

    # Initial Git push
    push_url = f"https://{github_user}:{os.environ['GITHUB_TOKEN']}@github.com/{github_user}/{repo_name}.git"
    !git add .
    !git commit -m "Initial setup from Colab"
    !git push {push_url}

    return project_path, github_user, repo_name

In [None]:
## RUN SETUP (RUN ONCE: Set up project structure and sync with GitHub)
project_path, github_user, repo_name = setup_colab_project(
    GITHUB_USER, REPO_NAME, USER_EMAIL, USER_FULLNAME, REPO_DESCRIPTION, DRIVE_BASE
)

# Daily Colab Session Setup (Run Every Day)
📥 Mount Drive & Load Secrets


In [None]:
from google.colab import drive, userdata
import os

# Mount your Google Drive
drive.mount('/content/drive')

# Navigate to your repo
%cd /content/drive/MyDrive/ColabRepos/NLP-Qualifications-Project

# Git identity (needed for commits)
!git config --global user.email "Taylor.april.dawn@gmail.com"
!git config --global user.name "April Taylor"

# 🔐 Load API tokens into environment
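os.environ['GITHUB_TOKEN'] = userdata.get('GitHubToken')  # required
for secret, env_var in [('wandb', 'WANDB_API_KEY'), ('UMLS', 'UMLS_API_KEY'), ('HF_TOKEN', 'HF_TOKEN')]:
    try:
        os.environ[env_var] = userdata.get(secret)
    except (userdata.SecretNotFoundError, userdata.NotebookAccessError):
        print(f"Optional secret '{secret}' not available; skipping.")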


# END-OF-SESSION PUSH TO GITHUB
Run this manually after your work session to sync with GitHub

In [2]:
import os
from google.colab import userdata

# USER SETTINGS
GITHUB_USER = "April-Taylor"
REPO_NAME = "NLP-Qualifications-Project"
COMMIT_MESSAGE = "End-of-session update from Colab"
PROJECT_PATH = f"/content/drive/MyDrive/ColabRepos/{REPO_NAME}"

# Git commands must run inside the cloned repo (Drive must already be mounted);
# otherwise every command fails with "fatal: not a git repository"
%cd "{PROJECT_PATH}"

# Authenticate using stored GitHub token
GITHUB_TOKEN = userdata.get("GitHubToken")
push_url = f"https://{GITHUB_USER}:{GITHUB_TOKEN}@github.com/{GITHUB_USER}/{REPO_NAME}.git"

# Git push sequence
!git add .
!git commit -m "{COMMIT_MESSAGE}"
!git pull origin main --rebase
!git push "{push_url}"




# Load Libraries

In [None]:
# Install Packages
!pip install --upgrade pip
!pip install spacy==3.4.4 scispacy==0.5.1
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_md-0.5.1.tar.gz
!pip install nmslib==2.1.1 pybind11==2.6.1

# Import Python Libraries
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re

#  Machine Learning & Evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from joblib import Parallel, delayed
import torch
from torch.utils.data import Dataset

# Transformers (Hugging Face)
from transformers import (
    BertTokenizerFast, BertForTokenClassification,
    Trainer, TrainingArguments,
    DataCollatorForTokenClassification
)

# SpaCy + SciSpaCy
import spacy
import scispacy
from scispacy.linking import EntityLinker  # importing this registers the "scispacy_linker" factory
from spacy.pipeline import EntityRuler
from spacy import displacy
from pathlib import Path
from collections import defaultdict

# W&B Setup
import wandb
wandb.init(
    project="clinical-umls-linking",
    config={"min_score": 0.75},
    tags=["scispacy-md", "glucose-insulin", "umls-caching"]
)

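# Load NLP Model with UMLS Linker
from scispacy.abbreviation import AbbreviationDetector  # registers the "abbreviation_detector" factory
nlp = spacy.load("en_core_sci_md")
nlp.add_pipe("abbreviation_detector")  # resolve_abbreviations only takes effect when this runs before the linker
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})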

# Download NLTK Stopwords
nltk.download('stopwords')
nltk_stopwords = set(nltk.corpus.stopwords.words('english'))

# Set Seeds for Reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)



# Data Loading

## Load MIMIC-III curated glucose-insulin paired data

The project uses the glucose_insulin_pair.csv dataset from the curated glucose-insulin files, available on PhysioNet (https://physionet.org/content/glucose-management-mimic/1.0.1/Notebooks/#files-panel).
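A minimal sketch of loading only the needed columns while failing early if the file does not contain them. The `read_required_columns` helper and the inline sample data are hypothetical and stand in for the real CSV on Drive.

```python
import io
import pandas as pd

def read_required_columns(source, required):
    """Read only the required columns and raise if any of them is missing."""
    wanted = set(required)
    # A callable usecols keeps matching columns without erroring on absent ones.
    df = pd.read_csv(source, usecols=lambda name: name in wanted)
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return df[list(required)]

# Tiny stand-in for glucose_insulin_pair.csv (made-up values).
sample_csv = "SUBJECT_ID,HADM_ID,GLC,EXTRA\n1,10,120,x\n2,20,,y\n"
sample = read_required_columns(io.StringIO(sample_csv), ["SUBJECT_ID", "GLC"])
print(sample)
```

Checking the columns up front gives a clearer error than a failure deep inside a later cell, which can help if a different release of the dataset renames a field.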

In [None]:
# Load the data into a dataframe
insulin_file_path = "/content/drive/MyDrive/glucose_insulin_pair.csv"
columns_needed = ['SUBJECT_ID', 'HADM_ID', 'GLC', 'GLC_AL', 'GLCSOURCE','EVENT', 'INSULINTYPE','INPUT', 'INPUT_HRS', 'INFXSTOP', 'TIMER']
insulin_data = pd.read_csv(insulin_file_path, usecols=columns_needed)


In [None]:
insulin_data.shape

## Load MIMIC-III v1.4 notes data

The project uses the NOTEEVENTS.csv dataset from MIMIC-III, available on PhysioNet (https://physionet.org/content/mimiciii/1.4/). These notes align with the same MIMIC-III data from which the curated glucose_insulin_pair data is derived.
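Because NOTEEVENTS.csv is large, one option is to read it in chunks and keep only the admissions that also appear in the glucose-insulin data. The sketch below assumes that filtering on `HADM_ID` is the desired link between the two sources; `load_notes_for_admissions`, the chunk size, and the sample rows are hypothetical.

```python
import io
import pandas as pd

def load_notes_for_admissions(source, hadm_ids, usecols, chunksize=100_000):
    """Stream a notes CSV and keep only rows whose HADM_ID is in hadm_ids."""
    keep = set(hadm_ids)
    parts = []
    for chunk in pd.read_csv(source, usecols=usecols, chunksize=chunksize):
        parts.append(chunk[chunk["HADM_ID"].isin(keep)])
    if not parts:
        return pd.DataFrame(columns=usecols)
    return pd.concat(parts, ignore_index=True)

# Made-up rows standing in for NOTEEVENTS.csv; one note has no HADM_ID.
sample_notes = (
    "ROW_ID,SUBJECT_ID,HADM_ID,CATEGORY,TEXT\n"
    "1,1,100,Nursing,a\n"
    "2,1,,Nursing,b\n"
    "3,2,200,Physician,c\n"
    "4,1,100,Physician,d\n"
)
notes_subset = load_notes_for_admissions(
    io.StringIO(sample_notes), hadm_ids=[100],
    usecols=["ROW_ID", "HADM_ID", "TEXT"], chunksize=2,
)
print(notes_subset)
```

Only the filtered rows are held in memory at the end, which may matter on a standard Colab runtime.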

In [None]:
# Load the data into a dataframe
notes_file_path = "/content/drive/MyDrive/NOTEEVENTS.csv"
notes_cols = ['ROW_ID', 'SUBJECT_ID', 'HADM_ID', 'CHARTDATE', 'CHARTTIME', 'CATEGORY', 'TEXT']
noteevents = pd.read_csv(notes_file_path, usecols=notes_cols, low_memory=False)

# Inspect Data

## Inspect the Glucose_Insulin_Pair dataset

**Description of fields:**
*   **SUBJECT_ID:** Unique identifier for an individual patient.
*   **HADM_ID:** Unique identifier for a single hospital admission.
*   **ICUSTAY_ID:** Unique identifier for a single patient’s admission to the ICU.
*   **LOS_ICU_days:** Length of the ICU stay in days.
*   **first_ICU_stay:** True if it is the first admission to the ICU for a hospital admission.
*   **TIMER:** Holds either the STARTTIME of a single insulin input or the GLCTIMER of a single glucose reading. It is used to order events chronologically within a hospital admission.
*   **STARTTIME:** Timestamp of when an insulin administration started or a new infusion rate was set.
*   **INPUT:** Dose for a single bolus of insulin, in units.
*   **INPUT_HRS:** Insulin infusion rate in units/hr.
*   **ENDTIME:** Timestamp that specifies when an insulin input stopped, or an infusion rate changed.
*   **INSULINTYPE:** Acting type of insulin: short, intermediate, or long.
*   **EVENT:** Specifies whether the bolus of insulin was subcutaneous (BOLUS_INYECTION), or intravenous (BOLUS_PUSH), or if the insulin was infused (INFUSION).
*   **INFXSTOP:** Indicates when an infusion of insulin was discontinued. A value equal to 1 indicates when an infusion was discontinued, otherwise (entries equal to 0) this column indicates that the associated infusion started or the rate of infusion was modified.
*   **GLCTIMER:** Timestamp that depicts when a glycemic check was done.
*   **GLC:** Glycemia value in mg/dL.
*   **GLCSOURCE:** Reading method for a glycemic check: fingerstick (FINGERSTICK) or lab analyzer (BLOOD).

Specific to glucose_insulin_pair.csv:
*   **GLCTIMER_AL:** Timestamp that depicts when a glycemic check was done for a paired glucose reading. This value should match with the timestamp in GLCTIMER of a preceding glucose reading according to the rule applied for this pairing case.
*   **GLC_AL:** Glycemia value in mg/dL for a paired glucose reading with a single insulin input. This value should match with the value in GLC of a preceding glucose reading according to the rule applied for this pairing case.
*   **GLCSOURCE_AL:** Reading method for a glycemic check that was paired with an insulin input. This value should match with the GLCSOURCE value of a preceding glucose reading according to the rule applied for this pairing case.
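The EVENT and INFXSTOP descriptions above can be turned into a small lookup, sketched here. `EVENT_ROUTES` and `describe_event` are hypothetical names, and treating a missing EVENT as "no insulin event recorded" is an assumption rather than something the dataset documentation states.

```python
# Labels taken from the field descriptions; keys are spelled as in the dataset.
EVENT_ROUTES = {
    "BOLUS_INYECTION": "subcutaneous bolus",
    "BOLUS_PUSH": "intravenous bolus",
}

def describe_event(event, infxstop=0):
    """Translate an EVENT code (and INFXSTOP flag) into a readable label."""
    if not isinstance(event, str):
        # Assumption: rows without an EVENT value carry no insulin event.
        return "no insulin event recorded"
    if event == "INFUSION":
        if infxstop == 1:
            return "infusion discontinued"
        return "infusion started or rate changed"
    return EVENT_ROUTES.get(event, "unrecognized event")

for code, stop in [("BOLUS_PUSH", 0), ("INFUSION", 0), ("INFUSION", 1)]:
    print(code, stop, "->", describe_event(code, stop))
```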









In [None]:
# Inspect the Glucose_Insulin_Pair dataset
print(insulin_data.head(10))
print(insulin_data.info())
print(insulin_data.describe())

In [None]:
# Inspect initial dataset sizes
print(f"Original insulin_data size: {insulin_data.shape[0]} rows")
print(f"Original noteevents size: {noteevents.shape[0]} rows")

In [None]:
print(f"Missing TIMER in insulin_data: {insulin_data['TIMER'].isna().sum()} out of {len(insulin_data)}")



# PreProcess Glucose_Insulin Data

## Filter and format glucose/insulin data
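The cleaning steps in this section (parse timestamps, remove duplicates, sort by time) can be tried on toy data first. This is a sketch under assumptions: `clean_timestamps` is a hypothetical helper, the rows are made up, and unlike the notebook cell it drops unparseable timestamps instead of asserting that none exist.

```python
import pandas as pd

def clean_timestamps(df, column="TIMER"):
    """Parse a timestamp column, drop bad and duplicate rows, and sort by time."""
    out = df.copy()
    out[column] = pd.to_datetime(out[column], errors="coerce")
    n_bad = int(out[column].isna().sum())  # values that could not be parsed
    out = out.dropna(subset=[column]).drop_duplicates()
    out = out.sort_values(column).reset_index(drop=True)
    return out, n_bad

toy = pd.DataFrame({
    "TIMER": ["2150-01-02 08:00:00", "not a date",
              "2150-01-01 09:30:00", "2150-01-02 08:00:00"],
    "GLC": [110, 95, 140, 110],
})
cleaned, n_bad = clean_timestamps(toy)
print(cleaned)
print("Unparseable timestamps:", n_bad)
```

Returning the count of bad values makes it easy to decide whether dropping them is acceptable or whether the source file needs a closer look.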

In [None]:
# Convert HADM_ID to int
insulin_data['HADM_ID'] = insulin_data['HADM_ID'].fillna(-1).astype(int)

# Ensure TIMER is datetime and timezone-naive
insulin_data['TIMER'] = pd.to_datetime(insulin_data['TIMER'], errors='coerce').dt.tz_localize(None)
if insulin_data['TIMER'].isna().any():
    print(f"Invalid TIMER values detected: {insulin_data['TIMER'].isna().sum()} rows.")


# Verify that all TIMER values are properly formatted
assert insulin_data['TIMER'].notna().all(), "TIMER in insulin_data contains NaT!"


# Remove fully duplicate rows (all columns identical)
insulin_data = insulin_data.drop_duplicates()

# Sort the insulin_data by TIMER for merging
insulin_data = insulin_data.sort_values(by='TIMER').reset_index(drop=True)

# Verify sorting
assert insulin_data['TIMER'].is_monotonic_increasing, "TIMER column is not sorted!"

# Final check of the data structure
print("Final insulin_data structure:")
print(insulin_data.info())

# Sample to verify the structure
print("Sample of data after removing duplicates:")
print(insulin_data.head())


### Create a combined column for all glucose values
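A toy illustration of time-window matching with `pd.merge_asof`. When both frames are joined with `on='TIMER'`, the result keeps a single `TIMER` column holding the left-hand values, so it cannot reveal which rows found no partner. Giving the right-hand key its own name (here the hypothetical `TIMER_AL`) makes unmatched rows show up as `NaT`. The data and the choice to match within `SUBJECT_ID` are assumptions for this sketch.

```python
import pandas as pd

left = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2],
    "TIMER": pd.to_datetime(
        ["2150-01-01 08:00", "2150-01-01 15:00", "2150-01-01 08:30"]
    ),
    "GLC": [150.0, 90.0, 200.0],
})
right = pd.DataFrame({
    "SUBJECT_ID": [1],
    "TIMER_AL": pd.to_datetime(["2150-01-01 09:00"]),
})

matched = pd.merge_asof(
    left.sort_values("TIMER"),        # both sides must be sorted on their key
    right.sort_values("TIMER_AL"),
    left_on="TIMER",
    right_on="TIMER_AL",
    by="SUBJECT_ID",                  # only match rows of the same patient
    tolerance=pd.Timedelta(hours=2),  # 2-hour window
    direction="nearest",              # look both backward and forward
)

# Rows with no partner inside the window have NaT in TIMER_AL.
unmatched = matched[matched["TIMER_AL"].isna()]
print(unmatched[["SUBJECT_ID", "TIMER", "GLC"]])
```

Here only the 08:00 reading of subject 1 finds a partner; the 15:00 reading is outside the window and subject 2 has no candidate rows at all.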

In [None]:
# Separate rows with GLC_AL values
glc_al_data = insulin_data[~insulin_data['GLC_AL'].isna()].copy()

# Separate rows with GLC values
glc_only_data = insulin_data[~insulin_data['GLC'].isna()].copy()

# Sort both datasets by TIMER (required by merge_asof) and reset their indexes
glc_only_data = glc_only_data.sort_values('TIMER').reset_index(drop=True)
glc_al_data = glc_al_data.sort_values('TIMER').reset_index(drop=True)

# Find GLC rows that have a GLC_AL row for the same patient within 2 hours.
# The right-hand key is renamed to TIMER_AL: with on='TIMER' only the left
# timestamps survive the merge, so unmatched rows could never be detected.
merged_within_2hr = pd.merge_asof(
    glc_only_data,
    glc_al_data[['SUBJECT_ID', 'TIMER']].rename(columns={'TIMER': 'TIMER_AL'}),
    left_on='TIMER',
    right_on='TIMER_AL',
    by='SUBJECT_ID',  # never match readings across patients
    tolerance=pd.Timedelta(hours=2),  # Define the 2-hour window
    direction='nearest'  # Allow matching in both directions
)

# Keep only GLC rows without a GLC_AL match (merge_asof preserves the left row order)
filtered_glc_only_data = glc_only_data[merged_within_2hr['TIMER_AL'].isna().to_numpy()]

# Combine GLC_AL and filtered GLC data
GLC_ALL = pd.concat([glc_al_data, filtered_glc_only_data], ignore_index=True)


# Add GLC_ALL column to insulin_data for consistency
insulin_data['GLC_ALL'] = insulin_data['GLC_AL'].combine_first(insulin_data['GLC'])

# Verify the structure of insulin_data
print("Updated insulin_data with GLC_ALL:")
print(insulin_data.info())
print(insulin_data[['TIMER', 'GLC', 'GLC_AL', 'GLC_ALL']].head())

In [None]:
# Aggregate insulin_data at SUBJECT_ID, HADM_ID level
insulin_aggregated = insulin_data.groupby(['SUBJECT_ID', 'HADM_ID']).agg(
    TotalBolus=('INPUT', 'sum'),                  # Total bolus insulin administered
    AvgBolus=('INPUT', 'mean'),                  # Average bolus insulin per event
    MaxBolus=('INPUT', 'max'),                   # Max bolus insulin
    TotalInfusion=('INPUT_HRS', 'sum'),          # Sum of recorded infusion rates (units/hr), not a duration or total dose
    AvgInfusionRate=('INPUT_HRS', 'mean'),       # Average infusion rate
    MaxGlucose=('GLC_ALL', 'max'),                   # Maximum glucose level
    MinGlucose=('GLC_ALL', 'min'),                   # Minimum glucose level
    AvgGlucose=('GLC_ALL', 'mean'),                  # Average glucose level
    InfusionStops=('INFXSTOP', 'sum')            # Count of infusion stops
).reset_index()

In [None]:
print(insulin_aggregated.head())

## Categorize Glucose_Insulin data
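For large tables, the glucose thresholds used in this section (below 70 mg/dL, above 180 mg/dL) can be applied with boolean masks instead of a row-by-row `apply`. This is a sketch of an alternative, not the notebook's own method; `categorize_glucose_vectorized` is a hypothetical name.

```python
import pandas as pd

def categorize_glucose_vectorized(values):
    """Label each glucose value; missing values stay missing."""
    labels = pd.Series("Euglycemic", index=values.index, dtype="object")
    labels[values < 70] = "Hypoglycemic"
    labels[values > 180] = "Hyperglycemic"
    labels[values.isna()] = None  # comparisons with NaN are False, so handle it explicitly
    return labels

glucose = pd.Series([65.0, 70.0, 180.0, 181.0, float("nan")])
categories = categorize_glucose_vectorized(glucose)
print(categories)
```

The boundary values 70 and 180 both fall in the "Euglycemic" band, matching the strict inequalities used for the other two categories.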

In [None]:
# Categorize insulin delivery types
def categorize_insulin_delivery(row):
    # Neither a bolus dose nor an infusion rate was recorded
    if pd.isna(row['INPUT']) and pd.isna(row['INPUT_HRS']):
        return "No insulin adjustment"
    # Otherwise classify by EVENT, including rows that only carry an infusion rate
    event = row['EVENT']
    if isinstance(event, str):
        if "BOLUS_INYECTION" in event or "BOLUS_PUSH" in event:
            return "Bolus"
        if "INFUSION" in event:
            return "Infusion"
    return "Unknown"  # Fallback case

# Apply the function to categorize insulin delivery
insulin_data['InsulinDeliveryCategory'] = insulin_data.apply(categorize_insulin_delivery, axis=1)

# Categorize glucose levels
def categorize_glucose(value):
    if pd.isna(value):  # Handle NaN case
        return None
    elif value < 70:
        return "Hypoglycemic"
    elif value > 180:
        return "Hyperglycemic"
    else:
        return "Euglycemic"

# Apply glucose categorization to the GLC_ALL column
insulin_data['GlucoseCategory'] = insulin_data['GLC_ALL'].apply(categorize_glucose)

# Verify updated categories
print("\nUnique Insulin Delivery Categories in insulin_data after refined processing:")
print(insulin_data['InsulinDeliveryCategory'].unique())

print("\nUnique Glucose Categories in insulin_data after refined processing:")
print(insulin_data['GlucoseCategory'].unique())


In [None]:
# Count the number of glucose events by category
glucose_category_counts = insulin_data['GlucoseCategory'].value_counts()

# Bar Plot for Glucose Category Distribution with AGP Colors
plt.figure(figsize=(10, 6))

# Define AGP colors for each category
agp_colors = {
    'Hypoglycemic': 'red',
    'Euglycemic': 'green',
    'Hyperglycemic': 'orange'
}

# Get the colors for the bars in the correct order
bar_colors = [agp_colors[category] for category in glucose_category_counts.index]

# Create the bar plot with AGP colors
bars = sns.barplot(
    x=glucose_category_counts.index,
    y=glucose_category_counts.values,
    palette=bar_colors  # Use the AGP colors
)

plt.title('Distribution of Glucose Categories')
plt.xlabel('Glucose Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


In [None]:
# Count occurrences of each label in 'GlucoseCategory'
glucose_category_counts = insulin_data['GlucoseCategory'].value_counts()
print("Glucose Category Counts:\n", glucose_category_counts)

# Count occurrences of each label in 'InsulinDeliveryCategory'
insulin_delivery_counts = insulin_data['InsulinDeliveryCategory'].value_counts()
print("\nInsulin Delivery Category Counts:\n", insulin_delivery_counts)

In [None]:
# Frequency of insulin categories
plt.figure(figsize=(10, 6))
sns.countplot(data=insulin_data, x='InsulinDeliveryCategory', order=insulin_data['InsulinDeliveryCategory'].value_counts().index)
plt.title('Frequency of Insulin Delivery Categories')
plt.xlabel('Insulin Delivery Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

# Exploratory Data Analysis (EDA)

## Distributions

In [None]:
# Statistical summaries for Glucose, Bolus, and Infusion
summary_stats = insulin_data[['GLC_ALL', 'INPUT', 'INPUT_HRS']].rename(columns={
    'GLC_ALL': 'Glucose',
    'INPUT': 'Bolus',
    'INPUT_HRS': 'Infusion'
}).describe()

print("Statistical Summary:")
print(summary_stats)


In [None]:
# Consolidated Distribution Plot
def plot_distribution(data, column, title, xlabel, bins=30, color='blue'):
    plt.figure(figsize=(10, 6))
    sns.histplot(data[column], bins=bins, kde=True, color=color)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel('Frequency')

# Plot Glucose Levels
plot_distribution(insulin_data, 'GLC_ALL', 'Distribution of Glucose Levels', 'Glucose (mg/dL)', color='red')
plt.xticks(range(0, int(insulin_data['GLC_ALL'].max()) + 100, 100))
plt.show()

# Plot Bolus Insulin Doses
plot_distribution(insulin_data, 'INPUT', 'Distribution of Bolus Insulin Doses', 'Bolus Insulin (units)', color='orange')
plt.show()

# Plot Infusion Rates
plot_distribution(insulin_data, 'INPUT_HRS', 'Distribution of Infusion Rates', 'Infusion Rate (units/hr)', color='blue')
plt.show()

In [None]:
# Select relevant columns for plotting
boxplot_data = insulin_data[['GLC_ALL', 'INPUT', 'INPUT_HRS']].rename(columns={
    'GLC_ALL': 'Glucose',
    'INPUT': 'Bolus',
    'INPUT_HRS': 'Infusion'
})

# Define the column names and titles for the plots
columns = ['Glucose', 'Bolus', 'Infusion']
titles = ['Distribution of Glucose Levels', 'Distribution of Bolus Insulin Doses', 'Distribution of Infusion Rates']

# Create individual boxplots
for i, col in enumerate(columns):
    plt.figure(figsize=(8, 6))
    sns.boxplot(data=boxplot_data, y=col, color='skyblue', width=0.5)

    # Add plot labels and title
    plt.title(titles[i], fontsize=16)
    plt.ylabel(col, fontsize=14)
    plt.xticks([])  # Remove x-ticks for simplicity
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # Adjust y-axis ticks for better granularity
    if col == 'Bolus':  # For Bolus, set y-ticks to every 20 units
        plt.yticks(range(0, int(boxplot_data[col].max() + 1), 20))
    elif col == 'Infusion':  # For Infusion, finer granularity (1 unit steps)
        plt.yticks(range(0, int(boxplot_data[col].max() + 1), 1))

    # Display the plot
    plt.tight_layout()
    plt.show()


In [None]:
hypoglycemic_cases = insulin_data[insulin_data['GLC_ALL'] < 70]
print(hypoglycemic_cases)

In [None]:
# Count of each insulin category
insulin_category_counts = insulin_data['InsulinDeliveryCategory'].value_counts()
print("Insulin Category Counts:\n", insulin_category_counts)

# Count of each glucose category
glucose_category_counts = insulin_data['GlucoseCategory'].value_counts()
print("\nGlucose Category Counts:\n", glucose_category_counts)

In [None]:
# Group data by GlucoseCategory and InsulinDeliveryCategory
grouped_data = insulin_data.groupby(['GlucoseCategory', 'InsulinDeliveryCategory']).size().reset_index(name='Count')

# Pivot table for heatmap
pivot_data = grouped_data.pivot_table(index='GlucoseCategory', columns='InsulinDeliveryCategory', values='Count', aggfunc='sum')

plt.figure(figsize=(10, 6))
sns.heatmap(
    pivot_data,
    annot=True,
    fmt=".0f",
    cmap="YlGnBu",
    linewidths=0.5,
    cbar_kws={"label": "Event Count"}
)
plt.title('Insulin Delivery by Glucose Category')
plt.xlabel('Insulin Delivery Method')
plt.ylabel('Glucose Category')
plt.show()


In [None]:
# Group by SUBJECT_ID and calculate metrics
grouped_data = insulin_data.groupby('SUBJECT_ID').agg({
    'GLC_ALL': 'mean',  # Average glucose level
    'INPUT': 'mean',  # Average insulin dose per event
    'GlucoseCategory': 'count'  # Number of glucose readings
}).rename(columns={'GLC_ALL': 'AvgGlucose', 'INPUT': 'AvgInsulinPerEvent', 'GlucoseCategory': 'NumReadings'})

# Full scatterplot
plt.figure(figsize=(12, 8))
sns.scatterplot(
    data=grouped_data,
    x='AvgInsulinPerEvent',
    y='AvgGlucose',
    hue='NumReadings',
    palette='coolwarm',
    size='NumReadings',
    sizes=(20, 200)
)
plt.axhline(y=70, color='red', linestyle='--', label='Hypoglycemia (<70 mg/dL)')
plt.axhline(y=180, color='orange', linestyle='--', label='Hyperglycemia (>180 mg/dL)')
plt.title('Relationship between Avg. Insulin Dose per Event and Avg. Glucose Levels (Full Range)', fontsize=16)
plt.xlabel('Avg. Insulin Dose per Event (units)', fontsize=14)
plt.ylabel('Average Glucose Level (mg/dL)', fontsize=14)
plt.legend(title="Num. of Readings", loc='upper right', fontsize=10)
plt.grid(True)
plt.show()

# Filter for granular zoom: 1–20 units of insulin
filtered_zoom_data = grouped_data[(grouped_data['AvgInsulinPerEvent'] >= 1) & (grouped_data['AvgInsulinPerEvent'] <= 20)]

# Zoomed scatterplot
plt.figure(figsize=(12, 8))
sns.scatterplot(
    data=filtered_zoom_data,
    x='AvgInsulinPerEvent',
    y='AvgGlucose',
    hue='NumReadings',
    palette='coolwarm',
    size='NumReadings',
    sizes=(20, 200)
)
plt.axhline(y=70, color='red', linestyle='--', label='Hypoglycemia (<70 mg/dL)')
plt.axhline(y=180, color='orange', linestyle='--', label='Hyperglycemia (>180 mg/dL)')
plt.title('Relationship between Avg. Insulin Dose per Event and Avg. Glucose Levels (1–20 Units)', fontsize=16)
plt.xlabel('Avg. Insulin Dose per Event (units)', fontsize=14)
plt.ylabel('Average Glucose Level (mg/dL)', fontsize=14)
plt.xticks(ticks=range(1, 21, 1))  # Sets x-axis ticks from 1 to 20 with step of 1
plt.legend(title="Num. of Readings", loc='upper right', fontsize=10)
plt.grid(True)
plt.show()



## Time Series Comparisons

In [None]:
import random
import matplotlib.pyplot as plt

random.seed(42)

def plot_time_series(insulin_data, subject_ids, random_seed=42):
    """Plots time series for specified subjects, highlighting insulin values with black labels and red dots."""

    for subject_id in subject_ids:
        subject_data = insulin_data[insulin_data['SUBJECT_ID'] == subject_id]
        if not subject_data.empty:
            plt.figure(figsize=(15, 4))

            # Plot glucose levels
            plt.plot(subject_data['TIMER'], subject_data['GLC_ALL'], label='Glucose (mg/dL)', color='blue')

            # Highlight insulin data points as red dots
            insulin_data_points = subject_data[~subject_data['INPUT'].isna()]
            plt.scatter(
                insulin_data_points['TIMER'],
                insulin_data_points['INPUT'],
                color='red',
                label='Bolus Doses (units)',
                marker='o',
                s=50
            )

            # Add data labels for insulin points (as whole numbers and black text)
            for _, row in insulin_data_points.iterrows():
                plt.text(
                    row['TIMER'],
                    row['INPUT'] + 10,  # Offset the label slightly above the point
                    f"{int(row['INPUT'])}",  # Convert to whole number
                    fontsize=9,
                    color='black',
                    ha='center'
                )

            plt.xlabel('Date')
            plt.ylabel('Values')
            plt.title(f'Time Series for Subject {subject_id}')
            plt.xticks(rotation=45)
            plt.legend()
            plt.tight_layout()
            plt.ylim(0, 400)
            plt.show()

# Get three random subject IDs from the data
num_subjects_to_plot = 3
subject_ids = random.sample(list(insulin_data['SUBJECT_ID'].unique()), num_subjects_to_plot)

# Plot the time series for those subjects, one plot per subject
plot_time_series(insulin_data, subject_ids, random_seed=42)


In [None]:
# Inspect raw insulin and glucose values for the selected patient
patient_id = 55642  # Replace with the patient ID from your plot
raw_patient_data = insulin_data[insulin_data['SUBJECT_ID'] == patient_id]

# Display raw insulin and glucose data
print("Raw Data for Patient:")
print(raw_patient_data[['TIMER', 'GLC_ALL', 'INPUT', 'INPUT_HRS', 'InsulinDeliveryCategory']])

# Check unique values for INPUT and INPUT_HRS
print("\nUnique Bolus Insulin Doses (INPUT):")
print(raw_patient_data['INPUT'].unique())

print("\nUnique Infusion Rates (INPUT_HRS):")
print(raw_patient_data['INPUT_HRS'].unique())

# Check if there are duplicates or overlapping data
duplicates = raw_patient_data.duplicated(subset=['TIMER', 'INPUT', 'INPUT_HRS'])
print(f"\nNumber of Duplicate Entries for Patient {patient_id}: {duplicates.sum()}")


In [None]:
!pip install adjustText
from adjustText import adjust_text


# Ensure TIMER is a datetime object
insulin_data['TIMER'] = pd.to_datetime(insulin_data['TIMER'], errors='coerce')

# Extract Hour and Day from TIMER
insulin_data['Hour'] = insulin_data['TIMER'].dt.hour
insulin_data['Day'] = insulin_data['TIMER'].dt.date

# Group by Hour and calculate mean values for Glucose, Bolus Insulin, and Infusion Rate
hourly_avg = insulin_data.groupby('Hour').agg({
    'GLC_ALL': 'mean',  # Average Glucose Level
    'INPUT': 'mean',         # Average Bolus Insulin
    'INPUT_HRS': 'mean'      # Average Infusion Rate
}).rename(columns={
    'GLC_ALL': 'Glucose (mg/dL)',
    'INPUT': 'Bolus (units)',
    'INPUT_HRS': 'Infusion (units/hr)'
}).reset_index()

# Plot the hourly trends
plt.figure(figsize=(14, 8))

# Initialize the list for dynamic text annotations
texts = []

# Plot Glucose Levels
plt.plot(hourly_avg['Hour'], hourly_avg['Glucose (mg/dL)'], marker='o', label='Glucose (mg/dL)', color='blue', linestyle='-')
for _, row in hourly_avg.iterrows():
    texts.append(
        plt.text(
            row['Hour'], row['Glucose (mg/dL)'],  # Add text dynamically
            f"{round(row['Glucose (mg/dL)'], 1)}",
            fontsize=9, color='blue', ha='center'
        )
    )

# Plot Bolus Insulin Levels
plt.plot(hourly_avg['Hour'], hourly_avg['Bolus (units)'], marker='x', label='Bolus (units)', color='red', linestyle='--')
for _, row in hourly_avg.iterrows():
    texts.append(
        plt.text(
            row['Hour'], row['Bolus (units)'],  # Add text dynamically
            f"{round(row['Bolus (units)'], 1)}",
            fontsize=9, color='red', ha='center'
        )
    )

# Plot Infusion Rates
plt.plot(hourly_avg['Hour'], hourly_avg['Infusion (units/hr)'], marker='s', label='Infusion (units/hr)', color='green', linestyle='-.')
for _, row in hourly_avg.iterrows():
    texts.append(
        plt.text(
            row['Hour'], row['Infusion (units/hr)'],  # Add text dynamically
            f"{round(row['Infusion (units/hr)'], 1)}",
            fontsize=9, color='green', ha='center'
        )
    )

# Adjust overlapping text labels
adjust_text(texts, arrowprops=dict(arrowstyle='-', color='gray'))

# Customize plot
plt.title('Hourly Trends in Glucose and Insulin Levels (Hours 0-23)', fontsize=16)
plt.xlabel('Hour of Day (0-23)', fontsize=14)
plt.ylabel('Average Levels', fontsize=14)

# Set y-axis ticks and limits
y_min = 0  # Minimum value for y-axis
y_max = int(hourly_avg[['Glucose (mg/dL)', 'Bolus (units)', 'Infusion (units/hr)']].max().max() + 20)
plt.yticks(np.arange(y_min, y_max + 1, 20))  # Intervals of 20

# Configure x-axis ticks for hourly scale
plt.xticks(ticks=np.arange(0, 24), labels=np.arange(0, 24), fontsize=12)
plt.grid(True, linestyle='--', alpha=0.6)

# Adjust legend placement
plt.legend(title="Legend", fontsize=12, loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=3)

# Adjust layout to accommodate legend
plt.tight_layout(rect=[0, 0.05, 1, 1])

# Show the plot
plt.show()


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from adjustText import adjust_text
import random

random_seed = 42
random.seed(random_seed)

# Filter for patients with both insulin and glucose values
pairwise_distances_argmin_min_data = insulin_data[
    (insulin_data['GLC_ALL'].notna()) &  # Glucose value present
    (
       insulin_data['INPUT'].notna() |  # Bolus insulin
        insulin_data['INPUT_HRS'].notna() |  # Infusion rate
        (insulin_data['INFXSTOP'] == 1)  # Infusion stop
    )
]

# Get a list of unique patient IDs from the paired (filtered) data
valid_patient_ids = pairwise_distances_argmin_min_data['SUBJECT_ID'].unique()

# Select 3 random patient IDs from valid ones
if len(valid_patient_ids) < 3:
    print("Not enough patients with both insulin and glucose values for plotting.")
else:
    random_patient_ids = random.sample(list(valid_patient_ids), 3)

    # Iterate over each random patient ID and create individual plots
    for patient_id in random_patient_ids:
        # Filter data for this patient
        patient_data = insulin_data[insulin_data['SUBJECT_ID'] == patient_id].copy()

        # Ensure TIMER is sorted
        patient_data = patient_data.sort_values(by='TIMER')

        # Check if patient data is empty after filtering
        if patient_data.empty:
            print(f"No valid data for patient ID {patient_id}.")
            continue

        # Create text annotations list
        texts = []

        # Initialize the figure
        plt.figure(figsize=(14, 8))

        # Plot Glucose Levels
        glucose_data = patient_data[patient_data['GLC_ALL'] > 0]
        plt.plot(
            glucose_data['TIMER'],
            glucose_data['GLC_ALL'],
            color='blue',
            label='Glucose Levels (mg/dL)',
            marker='o',
            linestyle='-',
            alpha=0.7
        )
        for _, row in glucose_data.iterrows():
            texts.append(
                plt.text(
                    row['TIMER'], row['GLC_ALL'] + 5,  # Offset upward
                    f"{int(row['GLC_ALL'])} mg/dL",
                    fontsize=10,
                    color='blue',
                    ha='center'
                )
            )

        # Scatterplot for Bolus Insulin
        bolus_data = patient_data[patient_data['INPUT'] > 0]
        plt.scatter(
            bolus_data['TIMER'],
            bolus_data['INPUT'],
            color='red',
            label='Bolus Insulin (Units)',
            alpha=0.7,
            marker='x'
        )
        for _, row in bolus_data.iterrows():
            texts.append(
                plt.text(
                    row['TIMER'], row['INPUT'] + 5,  # Offset upward
                    f"{int(row['INPUT'])} units",
                    fontsize=10,
                    color='red',
                    ha='center'
                )
            )

        # Scatterplot for Infusion Rates
        infusion_data = patient_data[patient_data['INPUT_HRS'] > 0]
        plt.scatter(
            infusion_data['TIMER'],
            infusion_data['INPUT_HRS'],
            color='green',
            label='Infusion Rate (Units/hr)',
            alpha=0.7,
            marker='s'
        )
        for _, row in infusion_data.iterrows():
            texts.append(
                plt.text(
                    row['TIMER'], row['INPUT_HRS'] + 5,  # Offset upward
                    f"{int(row['INPUT_HRS'])} units/hr",
                    fontsize=10,
                    color='green',
                    ha='center'
                )
            )

        # Scatterplot for Infusion Stops
        infusion_stop_data = patient_data[patient_data['INFXSTOP'] == 1]
        plt.scatter(
            infusion_stop_data['TIMER'],
            infusion_stop_data['INPUT_HRS'],
            color='purple',
            label='Infusion Stopped',
            alpha=0.9,
            marker='P',
            s=150
        )
        for _, row in infusion_stop_data.iterrows():
            texts.append(
                plt.text(
                    row['TIMER'], row['INPUT_HRS'] + 5,  # Offset upward
                    "Stopped",
                    fontsize=10,
                    color='purple',
                    ha='center'
                )
            )

        # Adjust overlapping text labels
        adjust_text(texts, arrowprops=dict(arrowstyle='-', color='black'))

        # Customize the plot
        plt.title(f'Overlay of Glucose and Insulin Trends for Patient ID {patient_id}', fontsize=16)
        plt.xlabel('Time', fontsize=14)
        plt.ylabel('Values (Glucose Levels and Insulin Doses)', fontsize=14)
        plt.legend(fontsize=12)
        plt.ylim(0)
        plt.grid(True)
        plt.xticks(rotation=45)

        # Show the plot
        plt.tight_layout()
        plt.show()


## Correlations

In [None]:
# Correlation matrix
correlation_matrix = insulin_data[['GLC_ALL', 'INPUT', 'INPUT_HRS']].rename(columns={
    'GLC_ALL': 'Glucose',
    'INPUT': 'Bolus',
    'INPUT_HRS': 'Infusion'
}).corr(method='pearson')

print("Correlation Matrix:")
print(correlation_matrix)
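
`DataFrame.corr` computes each coefficient from the rows where both columns are present. Bolus and infusion values are often recorded on different rows, so some coefficients may rest on very few pairs. The sketch below uses made-up toy values (not MIMIC data) to count the usable pairs per column combination and to require a minimum number of them through `min_periods`.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; bolus and infusion never share a row here
toy = pd.DataFrame({
    "Glucose": [180, 150, 210, 120, 160, 140],
    "Bolus": [4, 2, np.nan, np.nan, 3, np.nan],
    "Infusion": [np.nan, np.nan, 6, 2, np.nan, 3],
})

# Number of rows where both columns of each pair are non-missing
present = toy.notna().astype(int)
pair_counts = present.T @ present

# Coefficients backed by fewer than 3 complete pairs come back as NaN
corr = toy.corr(method="pearson", min_periods=3)

print(pair_counts)
print(corr)
```

Checking `pair_counts` before reading the heatmap helps avoid interpreting a coefficient that is based on a handful of rows.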


In [None]:
# Select relevant columns for correlation analysis
correlation_data = insulin_data[['GLC_ALL', 'INPUT', 'INPUT_HRS']].rename(columns={
    'GLC_ALL': 'Glucose',
    'INPUT': 'Bolus',
    'INPUT_HRS': 'Infusion'
})

# Calculate the correlation matrix
correlation_matrix = correlation_data.corr()

# Plot the correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
    correlation_matrix,
    annot=True,  # Display correlation coefficients
    cmap='coolwarm',  # Color map for visualization
    fmt=".2f",  # Format the correlation coefficients
    linewidths=0.5,  # Add space between cells
    cbar_kws={"shrink": 0.8}  # Shrink color bar for better layout
)

# Customize plot labels and title
plt.title('Correlation Matrix: Glucose, Bolus, and Infusion', fontsize=16)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(rotation=45, fontsize=12)

# Display the plot
plt.tight_layout()
plt.show()


In [None]:
# Import the required function
from scipy.stats import ttest_ind

# Separate groups, dropping rows without a glucose value (NaNs would make the test return NaN)
infusion_group = insulin_data.loc[insulin_data['INPUT_HRS'] > 0, 'GLC_ALL'].dropna()  # Glucose readings recorded with an infusion rate
bolus_group = insulin_data.loc[insulin_data['INPUT'] > 0, 'GLC_ALL'].dropna()  # Glucose readings recorded with a bolus dose

# Perform Welch's t-test (does not assume equal variances)
t_stat, p_value = ttest_ind(infusion_group, bolus_group, equal_var=False)
print(f"T-Test Results: t-statistic = {t_stat}, p-value = {p_value}")

# Interpret results
if p_value < 0.05:
    print("There is a significant difference in glucose levels between patients on infusion vs. bolus insulin.")
else:
    print("No significant difference in glucose levels between patients on infusion vs. bolus insulin.")
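
With `equal_var=False`, `ttest_ind` performs Welch's t-test. As a rough illustration of what that statistic measures, the sketch below computes it by hand with the standard library. The function name `welch_t` and the glucose values are made up for the example; they are not taken from the dataset.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical glucose readings (mg/dL) for the two groups
infusion_glucose = [180.0, 210.0, 195.0, 240.0, 205.0]
bolus_glucose = [150.0, 165.0, 140.0, 170.0]

t, df = welch_t(infusion_glucose, bolus_glucose)
print(f"t = {t:.3f}, df = {df:.2f}")
```

Note that the test treats every reading as independent. Repeated readings from the same patient violate that assumption, so the p-value above should be read with caution.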

In [None]:
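import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Prepare the data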
# Map categories to numerical values, handling potential NaN values
insulin_data['GlucoseCategoryEncoded'] = insulin_data['GlucoseCategory'].map({
    'Hypoglycemic': 0,
    'Euglycemic': 1,
    'Hyperglycemic': 2
})

# Select features and target
features = ['INPUT', 'INPUT_HRS']  # Bolus and infusion insulin rates
target = 'GlucoseCategoryEncoded'

X = insulin_data[features]
y = insulin_data[target]

# Impute NaN values with the mean of each column using SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean') # You can also use 'median' or other strategies
X = imputer.fit_transform(X)

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


# Drop rows with NaN values in the target variable 'y'
# Create a boolean mask indicating rows with NaN in 'y'
nan_mask = y.isna()

# Filter out rows with NaN in both X and y
X_scaled = X_scaled[~nan_mask]
y = y[~nan_mask]

# Split data into train and test sets (after dropping NaNs)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

# Check class distribution in y_train
print("Class distribution in training data:", np.bincount(y_train.astype(int)))  # Convert to int for bincount

# Ensure all classes are present in training data
classes = np.unique(np.concatenate((y_train, y_test)))  # Get all unique classes from both train and test

# Calculate 'balanced' class weights to address class imbalance,
# keyed by the actual class values rather than by list position
from sklearn.utils.class_weight import compute_class_weight
train_classes = np.unique(y_train)
class_weight_dict = dict(zip(
    train_classes,
    compute_class_weight('balanced', classes=train_classes, y=y_train)
))


# Train the logistic regression model
from sklearn.linear_model import LogisticRegression # Import LogisticRegression
model = LogisticRegression(
    multi_class='multinomial', solver='lbfgs', max_iter=1000, class_weight=class_weight_dict
)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Classification report with explicit labels
print("Classification Report:")
print(classification_report(
    y_test, y_pred,
    labels=[0, 1, 2],
    target_names=['Hypoglycemic', 'Euglycemic', 'Hyperglycemic']
))

# Confusion matrix
from sklearn.metrics import confusion_matrix, classification_report # Import confusion_matrix, classification_report
conf_matrix = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
plt.figure(figsize=(8, 6))
sns.heatmap(
    conf_matrix,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=['Hypoglycemic', 'Euglycemic', 'Hyperglycemic'],
    yticklabels=['Hypoglycemic', 'Euglycemic', 'Hyperglycemic']
)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
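
The `'balanced'` strategy used for the class weights above assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python on a made-up label list, to show how rarer classes receive larger weights. The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy encoded labels: 0 = Hypoglycemic, 1 = Euglycemic, 2 = Hyperglycemic
y_toy = [0] * 2 + [1] * 6 + [2] * 2
weights = balanced_class_weights(y_toy)
print(weights)
```

The majority class ends up with a weight below 1 and the two minority classes with weights above 1, which is what pushes the model to pay more attention to the rarer categories.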

# Inspect the Noteevents dataset

**Description of fields:**


*   **ROW_ID:** Unique identifier for each row in the dataset.
*   **SUBJECT_ID:** Unique identifier for an individual patient.
*   **HADM_ID:** Identifier for a single hospital admission of a patient.
*   **CHARTDATE:** The date on which the note was recorded or created.
*   **CHARTTIME:** The specific time (if available) when the note was recorded or created. This column often contains NaN for notes where the exact time isn't recorded.
*   **STORETIME:** The time when the note was stored in the database (if available). Often NaN in this dataset.
*   **CATEGORY:** The category of the clinical note (e.g., "Discharge summary," "Nursing," etc.), indicating the type or source of the note.
*   **DESCRIPTION:** Additional descriptive information about the note (e.g., "Report"). Typically, it provides a summary of the note's content or purpose.
*   **CGID:** Identifier for the caregiver who wrote the note, if available. Often NaN if the caregiver's ID isn't recorded.
*   **ISERROR:** Indicator of whether the note contains an error. Often NaN if errors aren't explicitly flagged.
*   **TEXT:** The full text content of the clinical note, including details about the patient's condition, treatment, and other relevant observations.

In [None]:
# Display the first 5 rows
print(noteevents.head())

In [None]:
# Display column data types
print(noteevents.dtypes)

In [None]:
print(f"Missing CHARTTIME in noteevents: {noteevents['CHARTTIME'].isna().sum()} out of {len(noteevents)}")

In [None]:
# Plot the distribution of note categories
noteevents['CATEGORY'].value_counts().plot(kind='bar')
plt.title('Distribution of Note Categories')
plt.xlabel('Category')
plt.ylabel('Count')
plt.show()


In [None]:
# View unique values in a column
print(noteevents['CATEGORY'].unique())  # List of unique categories

# View value counts for a column
print(noteevents['TEXT'].value_counts())


In [None]:
# Filter noteevents DataFrame
categories_to_keep = ['Discharge summary', 'Nursing', 'Physician', 'Nutrition', 'Nursing/other']
noteevents = noteevents[noteevents['CATEGORY'].isin(categories_to_keep)]

In [None]:
# Display a sample of note text
print(noteevents['TEXT'].iloc[0])  # First note



In [None]:
# Display another sample of note text
print(noteevents['TEXT'].iloc[1000])  # Note at position 1000 (the 1001st note)

# Preprocess Noteevents Data

## Filter and format notes data

In [None]:
import string

def preprocess_text(text):
    """Lowercase a clinical note, strip punctuation, and drop stop words.

    Expects a `stop_words` collection to be defined before the function is called.
    """
    if pd.isna(text):  # Handle NaN values
        return ""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)
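
To make the effect of this preprocessing concrete, here is a self-contained sketch of the same steps applied to one made-up sentence. The tiny `STOP_WORDS` set is a stand-in chosen for the example, since the notebook's actual stop-word list is not shown here, and the `isinstance` check replaces the pandas NaN test.

```python
import string

# Hypothetical stop-word list for illustration only
STOP_WORDS = {"the", "was", "on", "and", "of", "with"}

def preprocess_text(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words."""
    if not isinstance(text, str):  # Covers None and NaN
        return ""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned = preprocess_text("The patient was started on insulin, 5 units/hr.")
print(cleaned)
```

Removing all punctuation also fuses `units/hr` into `unitshr`, so units, ratios, and decimal doses lose their structure. That may matter if the cleaned text is later used to extract dose information.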

# Merge glucose/insulin and notes data

In [None]:



# Build a single 'DATETIME' column. In MIMIC-III, CHARTTIME already holds a full
# timestamp, so use it where present and fall back to CHARTDATE (midnight) otherwise.
# Concatenating the two strings would fail to parse and yield NaT.
chart_time = pd.to_datetime(noteevents['CHARTTIME'], errors='coerce')
chart_date = pd.to_datetime(noteevents['CHARTDATE'], errors='coerce')
noteevents['DATETIME'] = chart_time.fillna(chart_date)

# Ensure DATETIME is timezone-naive
noteevents['DATETIME'] = noteevents['DATETIME'].dt.tz_localize(None)

# Extract CHARTDATE (date-only) from DATETIME
noteevents['CHARTDATE'] = noteevents['DATETIME'].dt.date

# Ensure TIMER in insulin_data is timezone-naive
insulin_data['TIMER'] = pd.to_datetime(insulin_data['TIMER'], errors='coerce').dt.tz_localize(None)

# Extract CHARTDATE (date-only) from TIMER in insulin_data
insulin_data['CHARTDATE'] = insulin_data['TIMER'].dt.date

# Separate notes with a real HADM_ID from those carrying the -1 placeholder
valid_hadm_notes = noteevents[noteevents['HADM_ID'] != -1].copy()
invalid_hadm_notes = noteevents[noteevents['HADM_ID'] == -1].copy()

# Sort by CHARTDATE for readable output (pd.merge itself does not require sorted input)
valid_hadm_notes = valid_hadm_notes.sort_values('CHARTDATE')
invalid_hadm_notes = invalid_hadm_notes.sort_values('CHARTDATE')
insulin_data = insulin_data.sort_values('CHARTDATE')

# Merge valid HADM_ID rows on SUBJECT_ID, HADM_ID, and CHARTDATE
merged_valid = pd.merge(
    valid_hadm_notes,
    insulin_data,
    on=['SUBJECT_ID', 'HADM_ID', 'CHARTDATE'],  # Match on same day
    how='inner'  # Only include rows where all keys match
)

# Merge invalid HADM_ID rows on SUBJECT_ID and CHARTDATE only
merged_invalid = pd.merge(
    invalid_hadm_notes,
    insulin_data,
    on=['SUBJECT_ID', 'CHARTDATE'],  # Match on same day
    how='inner'  # Only include rows where all keys match
)

# Combine the results
merged_data = pd.concat([merged_valid, merged_invalid], ignore_index=True)

# Verify the structure and content of the merged dataset
print(f"Final merged dataset shape: {merged_data.shape}")
print(merged_data.info())
print(merged_data.head())
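
The same-day join above can be checked on a small example before running it on the full tables. The sketch below uses made-up rows with the notebook's column names, builds a note timestamp that falls back to `CHARTDATE` when `CHARTTIME` is missing, and then merges on patient and calendar day. All values are invented for illustration.

```python
import pandas as pd

# Toy notes: the second note has no CHARTTIME
notes = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2],
    "CHARTDATE": ["2150-01-01", "2150-01-02", "2150-01-01"],
    "CHARTTIME": ["2150-01-01 08:30:00", None, "2150-01-01 21:00:00"],
    "TEXT": ["note a", "note b", "note c"],
})
# Toy glucose measurements
glucose = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2],
    "TIMER": ["2150-01-01 09:00:00", "2150-01-03 10:00:00", "2150-01-01 06:00:00"],
    "GLC_ALL": [180, 140, 95],
})

# Prefer the full timestamp, fall back to the date-only field
chart_time = pd.to_datetime(notes["CHARTTIME"], errors="coerce")
chart_date = pd.to_datetime(notes["CHARTDATE"], errors="coerce")
notes["DATETIME"] = chart_time.fillna(chart_date)
notes["CHARTDATE"] = notes["DATETIME"].dt.date

glucose["TIMER"] = pd.to_datetime(glucose["TIMER"], errors="coerce")
glucose["CHARTDATE"] = glucose["TIMER"].dt.date

# Inner join keeps only notes with a measurement for the same patient on the same day
merged = notes.merge(glucose, on=["SUBJECT_ID", "CHARTDATE"], how="inner")
print(merged[["SUBJECT_ID", "CHARTDATE", "TEXT", "GLC_ALL"]])
```

A same-day inner join pairs every note with every measurement from that day, so one note can appear on many rows. Comparing the row count of the merged table with the number of distinct notes is a quick way to see how much duplication the join introduces.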


# Label data

## Create UMLS dictionary

In [None]:
# Define synonym mapping with CUIs
synonym_dict = {
    "BLOOD": {"CUI": "C0184597", "Synonyms": ["serum glucose", "plasma glucose", "glucose level", "glucose measurement"]},
    "FINGERSTICK": {"CUI": "C0184597", "Synonyms": ["blood sugar", "FBG", "BG", "blood glucose", "random glucose", "post-prandial insulin", "glucose reading"]},
    "Hypoglycemic": {"CUI": "C0020615", "Synonyms": ["low blood sugar", "low glucose", "hypo", "hypoglycemia"]},
    "Hyperglycemic": {"CUI": "C0020456", "Synonyms": ["high blood sugar", "high glucose", "hyper", "hyperglycemia"]},
    "Euglycemic": {"CUI": "C0006681", "Synonyms": ["normal blood sugar", "normal glucose", "euglycemia"]},
    "Bolus": {"CUI": "C0201974", "Synonyms": ["bolus insulin", "bolus push", "SSI"]},
    "Infusion": {"CUI": "C0202076", "Synonyms": ["continuous insulin infusion", "IV insulin", "intravenous insulin"]},
    "Short-acting": {"CUI": "C0201974", "Synonyms": ["short-acting insulin", "rapid-acting insulin", "regular insulin"]},
    "Intermediate-acting": {"CUI": "C0201982", "Synonyms": ["intermediate-acting insulin", "NPH insulin"]},
    "Long-acting": {"CUI": "C0201985", "Synonyms": ["long-acting insulin", "detemir", "glargine"]},
    "No insulin adjustment": {"CUI": "C0037595", "Synonyms": ["no change in insulin dose", "no insulin adjustment", "no dose changed", "stable insulin rate", "no change in insulin rate"]},
}

# Collect all synonyms
all_synonyms = []
for concept, data in synonym_dict.items():
    for phrase in data["Synonyms"]:
        # Keep the label identical to the synonym_dict key so later lookups on ent.label_ match
        all_synonyms.append({"label": concept, "pattern": phrase})
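
The pattern list built above has one `{"label": ..., "pattern": ...}` entry per synonym. As a dependency-free illustration of what such patterns are meant to capture, the sketch below builds the same structure from a shortened dictionary and looks the phrases up with regular expressions. The helper `find_mentions` is hypothetical and matches case-insensitively, which is an assumption of this sketch rather than the behavior of the spaCy `EntityRuler`.

```python
import re

# Shortened, illustrative version of the synonym dictionary
synonyms = {
    "Hypoglycemic": ["low blood sugar", "hypoglycemia"],
    "Infusion": ["IV insulin", "continuous insulin infusion"],
}

# One pattern entry per synonym, labeled with the dictionary key
patterns = [
    {"label": concept, "pattern": phrase}
    for concept, phrases in synonyms.items()
    for phrase in phrases
]

def find_mentions(text, patterns):
    """Return sorted (start, end, label) spans for whole-phrase matches."""
    hits = []
    for entry in patterns:
        regex = r"\b" + re.escape(entry["pattern"]) + r"\b"
        for match in re.finditer(regex, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), entry["label"]))
    return sorted(hits)

note = "Pt with hypoglycemia overnight; IV insulin held."
hits = find_mentions(note, patterns)
for start, end, label in hits:
    print(label, repr(note[start:end]))
```

Short abbreviations such as "BG" or "hypo" are worth testing this way first, because they can match inside unrelated contexts and inflate the concept counts.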



In [None]:
# Add EntityRuler to pre-tag important synonym phrases
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns(all_synonyms)

# Add UMLS entity linker to the pipeline and keep a handle to the component
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})
linker = nlp.get_pipe("scispacy_linker")

# Define CUI filter terms from dictionary
target_cuis = {entry["CUI"] for entry in synonym_dict.values()}

In [None]:
# Function to annotate and cache notes, log to wandb
def annotate_notes_with_umls(clinical_notes, subject_ids=None, allowed_cuis=None, min_score=0.75, artifact_name="umls_annotations", artifact_version="v1"):
    cache_path = Path("cache/umls_output.parquet")
    if cache_path.exists():
        umls_df = pd.read_parquet(cache_path)
    else:
        results = []
        subjects = subject_ids if subject_ids is not None else [None] * len(clinical_notes)
        for i, (doc, subject_id) in enumerate(zip(nlp.pipe(clinical_notes, batch_size=50), subjects)):
            for ent in doc.ents:
                for umls_ent in ent._.kb_ents:
                    cui = umls_ent[0]
                    score = umls_ent[1]
                    if (allowed_cuis and cui not in allowed_cuis) or (score < min_score):
                        continue
                    name = linker.kb.cui_to_entity[cui].canonical_name
                    concept_label = ent.label_ if ent.label_ in synonym_dict else None
                    results.append({
                        "note_id": i,
                        "subject_id": subject_id,
                        "entity_text": ent.text,
                        "start_char": ent.start_char,
                        "end_char": ent.end_char,
                        "umls_cui": cui,
                        "umls_name": name,
                        "score": score,
                        "concept_label": concept_label
                    })
        umls_df = pd.DataFrame(results)
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        umls_df.to_parquet(cache_path)


    # Log the annotation table to Weights & Biases
    wandb.log({"umls_df": wandb.Table(dataframe=umls_df)})

    # Summarize and visualize concept counts
    summary_df = summarize_cui_counts(umls_df)
    concept_counts = summary_df.groupby("concept_label")["count"].sum().sort_values(ascending=False).reset_index()
    fig, ax = plt.subplots(figsize=(10, 5))
    sns.barplot(data=concept_counts, x="concept_label", y="count", ax=ax)
    plt.xticks(rotation=45, ha="right")
    plt.title("Total Mentions per Concept Label")
    plt.tight_layout()
    wandb.log({"concept_distribution": wandb.Image(fig)})

    heatmap_fig, heatmap_ax = plt.subplots(figsize=(12, 8))
    pivot = summary_df.pivot_table(index="subject_id", columns="concept_label", values="count", fill_value=0)
    sns.heatmap(pivot, annot=False, cmap="Blues", ax=heatmap_ax)
    plt.title("Concept Frequency per Patient")
    plt.tight_layout()
    wandb.log({"concept_heatmap": wandb.Image(heatmap_fig)})

    artifact = wandb.Artifact(artifact_name, type="dataset", metadata={"version": artifact_version, "min_score": min_score, "note_count": len(clinical_notes)})
    artifact.add_file(str(cache_path))
    wandb.log_artifact(artifact)

    return umls_df

In [None]:
def summarize_cui_counts(df):
    if df.empty:
        return pd.DataFrame(columns=["subject_id", "concept_label", "count"])
    return df.groupby(["subject_id", "concept_label"]).size().reset_index(name="count")
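
A quick way to sanity-check the summary step is to run the same `groupby` on a handful of made-up annotation rows. The sketch below does that. The row values are invented, and one row deliberately has no concept label.

```python
import pandas as pd

# Toy annotation rows; the last one has no concept label
toy_annotations = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "concept_label": ["Bolus", "Bolus", "Infusion", "Bolus", None],
})

summary = (
    toy_annotations
    .groupby(["subject_id", "concept_label"])
    .size()
    .reset_index(name="count")
)
print(summary)
```

By default `groupby` drops rows whose key is missing, so annotations without a `concept_label` do not appear in the counts. If those should be counted too, filling the column with a placeholder label first is one option.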

In [None]:
# Usage example - integrate with merged_data
notes = merged_data["TEXT"].fillna("").astype(str).tolist()
subjects = merged_data["SUBJECT_ID"].tolist()

umls_df = annotate_notes_with_umls(
    clinical_notes=notes,
    subject_ids=subjects,
    allowed_cuis=target_cuis,
    min_score=0.75
)

In [None]:
# Combine UMLS annotations back into merged_data
umls_grouped = umls_df.groupby("subject_id").agg({
    "umls_cui": lambda x: list(set(x)),
    "concept_label": lambda x: list(set(filter(None, x)))
}).reset_index()

umls_grouped.rename(columns={
    "umls_cui": "UMLS_Annotations",
    "concept_label": "UMLS_Annotations_Labels"
}, inplace=True)

merged_data = merged_data.merge(umls_grouped, how="left", left_on="SUBJECT_ID", right_on="subject_id")

label_to_id = {label: i+1 for i, label in enumerate(synonym_dict.keys())}
id_to_label = {v: k for k, v in label_to_id.items()}
merged_data["UMLS_Annotations_Numeric"] = merged_data["UMLS_Annotations_Labels"].apply(
    lambda labels: [label_to_id.get(lbl, 0) for lbl in labels] if isinstance(labels, list) else []
)

# NER Modelling

In [None]:
# Prepare data for NER: whitespace-split each note, treating missing text as empty
texts = merged_data["TEXT"].fillna("").astype(str).str.split().tolist()
labels = merged_data["UMLS_Annotations_Numeric"].tolist()

In [None]:
# Initialize ClinicalBERT tokenizer and device
tokenizer = BertTokenizerFast.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


## Data Tokenization and Labeling

### Tokenize

In [None]:
# Function to align labels with tokenized inputs
def align_labels_with_tokens(labels, word_ids):
    """Give every sub-token the label of its source word; -100 marks ignored positions."""
    aligned_labels = []
    for word_id in word_ids:
        if word_id is None or word_id >= len(labels):
            aligned_labels.append(-100)  # Special tokens, padding, or words without a label
        else:
            aligned_labels.append(labels[word_id])
    return aligned_labels
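
Label alignment is easier to follow on a hand-written example. The sketch below uses a made-up `word_ids` list of the kind a fast tokenizer returns (`None` for special tokens, repeated indices for sub-tokens of one word) and shows two common choices: labeling every sub-token, or labeling only the first sub-token of each word. The function name `align_word_labels` and the `label_all_subwords` switch are illustrative and not part of the notebook.

```python
IGNORE_INDEX = -100  # Default ignore_index of PyTorch's cross-entropy loss

def align_word_labels(word_labels, word_ids, label_all_subwords=True):
    """Map per-word labels onto sub-token positions."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id >= len(word_labels):
            aligned.append(IGNORE_INDEX)          # Special token or unlabeled word
        elif word_id != previous or label_all_subwords:
            aligned.append(word_labels[word_id])  # First sub-token, or all sub-tokens
        else:
            aligned.append(IGNORE_INDEX)          # Later sub-token, skipped in the loss
        previous = word_id
    return aligned

# Three words; the second one is split into three sub-tokens
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [0, 4, 0]

all_subwords = align_word_labels(word_labels, word_ids)
first_only = align_word_labels(word_labels, word_ids, label_all_subwords=False)
print(all_subwords)
print(first_only)
```

Both variants need one label per whitespace-separated word. A list of concept IDs per note, without word positions, cannot be aligned this way.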


In [None]:
# Function to tokenize text and align labels
def tokenize_and_align_labels(texts, labels):
    tokenized_inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,  # Adjust as needed
        return_tensors="pt",
        is_split_into_words=True,
    )
    aligned_labels_list = []
    for i, label in enumerate(labels):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        aligned_labels = align_labels_with_tokens(label, word_ids)
        aligned_labels_list.append(aligned_labels)
    return tokenized_inputs, aligned_labels_list

tokenized_inputs, aligned_labels = tokenize_and_align_labels(texts, labels)

In [None]:
class ClinicalNERDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        return {
            "input_ids": self.encodings["input_ids"][idx],
            "attention_mask": self.encodings["attention_mask"][idx],
            "labels": torch.tensor(self.labels[idx], dtype=torch.long)
        }
    def __len__(self):
        return len(self.labels)

dataset = ClinicalNERDataset(tokenized_inputs, aligned_labels)
train_idx, test_idx = train_test_split(list(range(len(dataset))), test_size=0.2, random_state=42)
train_dataset = Subset(dataset, train_idx)
test_dataset = Subset(dataset, test_idx)


In [None]:
# Training function
def train_ner_model(train_dataset, eval_dataset, model, learning_rate=5e-5, epochs=3):
    """Fine-tune `model` with the Hugging Face Trainer and return the trainer."""
    training_args = TrainingArguments(
        output_dir="./clinicalbert_results",
        num_train_epochs=epochs,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=16,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_dir="./clinicalbert_logs",
        logging_steps=10,
        learning_rate=learning_rate,
        load_best_model_at_end=True,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,  # Required because evaluation runs every epoch
        tokenizer=tokenizer,
    )
    trainer.train()
    return trainer

In [None]:
# Evaluate ClinicalBERT
predictions, labels, _ = trainer.predict(test_dataset)
predictions = predictions.argmax(axis=-1)

true_labels = []
pred_labels = []
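for pred, label in zip(predictions, labels):
    for p, l in zip(pred, label):
        if l != -100:
            # ID 0 has no entry in id_to_label, so fall back to an "O" (outside) tag
            true_labels.append(id_to_label.get(int(l), "O"))
            pred_labels.append(id_to_label.get(int(p), "O"))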

report = classification_report(true_labels, pred_labels, zero_division=0)
accuracy = accuracy_score(true_labels, pred_labels)
precision = precision_score(true_labels, pred_labels, average='weighted', zero_division=0)
recall = recall_score(true_labels, pred_labels, average='weighted', zero_division=0)
f1 = f1_score(true_labels, pred_labels, average='weighted', zero_division=0)

print("NER Model Evaluation:")
print(report)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")


In [None]:
from tqdm import tqdm
import torch.nn as nn

# Check if GPU is available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}") # Print the device being used


# Number of epochs
num_epochs = 3  # Adjust as needed

# Move model to the correct device
model.to(device)

# Loss tracker
training_loss = []

# Training loop
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0

    for batch in tqdm(dataloader, desc=f"Epoch {epoch + 1}/{num_epochs}"):
        # Move batch to device
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        # Zero gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

        # Compute loss
        loss = outputs.loss
        epoch_loss += loss.item()

        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        scheduler.step()

    # Track and log loss
    avg_loss = epoch_loss / len(dataloader)
    training_loss.append(avg_loss)
    print(f"Epoch {epoch + 1} Loss: {avg_loss:.4f}")

## Split data and Create dataset for Training

## Train ClinicalBERT for NER

## Evaluate the NER Model

In [None]:
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score

# Evaluation function
def evaluate_ner_model(trainer, test_dataset, id_to_label):
    predictions, labels, _ = trainer.predict(test_dataset)
    predictions = predictions.argmax(axis=-1)

    true_labels = []
    pred_labels = []

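    for pred, label in zip(predictions, labels):
        for p, l in zip(pred, label):
            if l != -100:
                # ID 0 has no entry in id_to_label, so fall back to an "O" (outside) tag
                true_labels.append(id_to_label.get(int(l), "O"))
                pred_labels.append(id_to_label.get(int(p), "O"))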

    report = classification_report(true_labels, pred_labels, zero_division=0)
    accuracy = accuracy_score(true_labels, pred_labels)
    precision = precision_score(true_labels, pred_labels, average='weighted', zero_division=0)
    recall = recall_score(true_labels, pred_labels, average='weighted', zero_division=0)
    f1 = f1_score(true_labels, pred_labels, average='weighted', zero_division=0)

    print("NER Model Evaluation:")
    print(report)
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")

# Evaluate Model
evaluate_ner_model(trainer, test_dataset, id_to_label)
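
The core of the evaluation is the flattening step: positions labeled `-100` are skipped and the remaining IDs are mapped to tag names. The sketch below isolates that step with plain lists so it can be checked without a trained model. The predictions, gold labels, and the `"O"` fallback for IDs without a name are assumptions made for the example.

```python
IGNORE_INDEX = -100

def flatten_token_labels(pred_ids, gold_ids, id_to_label, outside="O"):
    """Drop ignored positions and convert label IDs to tag names."""
    true_tags, pred_tags = [], []
    for pred_row, gold_row in zip(pred_ids, gold_ids):
        for p, g in zip(pred_row, gold_row):
            if g == IGNORE_INDEX:
                continue  # Special token, padding, or skipped sub-token
            true_tags.append(id_to_label.get(int(g), outside))
            pred_tags.append(id_to_label.get(int(p), outside))
    return true_tags, pred_tags

# Toy mapping and toy sequences (two sentences, already argmax-ed)
id_to_label = {3: "Hypoglycemic", 4: "Hyperglycemic"}
gold = [[-100, 0, 4, 4, -100], [-100, 3, 0, -100, -100]]
pred = [[0, 0, 4, 0, 0], [0, 3, 3, 0, 0]]

true_tags, pred_tags = flatten_token_labels(pred, gold, id_to_label)
accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
print(true_tags)
print(pred_tags)
print(f"Token accuracy: {accuracy:.2f}")
```

Token-level accuracy is usually dominated by the outside tag, so the per-class rows of the classification report tend to be more informative than the overall number.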



# Relation Extraction

In [None]:
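def prepare_relation_data_fixed(data, umls_dict):
    """Build all unordered CUI pairs per row; `umls_dict` is currently unused."""
    pairs, labels = [], []
    for _, row in data.iterrows():
        entities = row['UMLS_Annotations']
        # Rows without annotations hold NaN after the left merge, not a list
        if not isinstance(entities, list) or not entities:
            continue
        for i, entity1 in enumerate(entities):
            for j, entity2 in enumerate(entities):
                if i < j:  # Avoid redundancy (entity1, entity2) == (entity2, entity1)
                    pairs.append((entity1, entity2))
                    labels.append(1 if entity1 == entity2 else 0)
    return pairs, labels

# Call the function under the name it was defined with
pairs, labels = prepare_relation_data_fixed(merged_data, synonym_dict)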

# Create a custom dataset class
class RelationDataset(Dataset):
    def __init__(self, pairs, labels, tokenizer):
        self.pairs = pairs
        self.labels = labels
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        entity1, entity2 = self.pairs[idx]  # Unpack the tuple
        label = self.labels[idx]

        # Tokenize the pair using the [SEP] token
        encoding = self.tokenizer(
            entity1,
            entity2,
            add_special_tokens=True,
            padding='max_length',
            truncation=True,
            max_length=128,  # Adjust as needed
            return_tensors='pt'
        )

        # Return a dictionary
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Initialize tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Create Dataset objects
train_dataset = RelationDataset(pairs, labels, tokenizer)
# Note: evaluation below reuses the training pairs, so the reported metrics
# are not held-out estimates; swap in a separate split for a real evaluation
eval_dataset = RelationDataset(pairs, labels, tokenizer)

# Use BertForSequenceClassification for relation extraction
from transformers import BertForSequenceClassification # Import BertForSequenceClassification
relation_model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

relation_training_args = TrainingArguments(
    output_dir="./relation_results",
    evaluation_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64
)
relation_trainer = Trainer(
    model=relation_model,
    args=relation_training_args,
    train_dataset=train_dataset,  # Pass the Dataset object
    eval_dataset=eval_dataset     # Pass the Dataset object
)
relation_trainer.train()

In [None]:
relation_eval = relation_trainer.evaluate()
print("Relation Extraction Results:", relation_eval)

# Visualize RE Results
plt.figure(figsize=(10, 6))
sns.histplot([f"{pair}: {label}" for pair, label in zip(pairs, labels)], kde=False)
plt.title('Common Relationships')
plt.xlabel('Relationship')
plt.ylabel('Frequency')
plt.show()
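
The relation pairs used above are all unordered combinations of the CUIs attached to a row. The sketch below reproduces that construction with `itertools.combinations` on a made-up CUI list, mainly to show what the resulting labels look like. The helper name `cui_pairs` is hypothetical.

```python
from itertools import combinations

def cui_pairs(cuis):
    """Return every unordered pair of distinct CUIs for one row."""
    unique = sorted(set(cuis))  # The annotation lists are de-duplicated per patient
    return list(combinations(unique, 2))

# Toy annotation list containing a repeated CUI
row_cuis = ["C0020615", "C0020456", "C0020615"]

pairs_toy = cui_pairs(row_cuis)
labels_toy = [1 if a == b else 0 for a, b in pairs_toy]
print(pairs_toy)
print(labels_toy)
```

Because the CUIs in a row are distinct, the rule "label 1 when both entities are equal" never fires and every pair is labeled 0. A classifier trained on a single class has nothing to learn, so a relation label derived from another source (for example, co-occurrence of a glucose state and an insulin action in the same sentence) would be needed for this step to be meaningful.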

In [None]:
from transformers import BertTokenizerFast, BertForTokenClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset
import torch

# Initialize ClinicalBERT tokenizer and device
tokenizer = BertTokenizerFast.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Function to align labels with tokenized inputs
def align_labels_with_tokens(labels, word_ids):
    """Give every sub-token the label of its source word; -100 marks ignored positions."""
    aligned_labels = []
    for word_id in word_ids:
        if word_id is None or word_id >= len(labels):
            aligned_labels.append(-100)  # Special tokens, padding, or words without a label
        else:
            aligned_labels.append(labels[word_id])
    return aligned_labels

# Function to tokenize text and align labels
def tokenize_and_align_labels(texts, labels):
    tokenized_inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,  # Adjust as needed
        return_tensors="pt",
        is_split_into_words=True,
    )
    aligned_labels_list = []
    for i, label in enumerate(labels):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        aligned_labels = align_labels_with_tokens(label, word_ids)
        aligned_labels_list.append(aligned_labels)
    return tokenized_inputs, aligned_labels_list

# Dataset class
class NERDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Encodings are already tensors (return_tensors="pt"), so index them directly
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

# Example texts, pre-split on whitespace because the tokenizer is called with
# is_split_into_words=True; each label list holds one illustrative ID per word
texts = [
    "Patient presented with hyperglycemia.".split(),
    "Administered insulin and rechecked glucose levels.".split(),
    "The glucose reading was 150 mg/dL.".split(),
    "No signs of hypoglycemia were noted.".split(),
]
labels = [
    [0, 0, 0, 3],        # Example labels for the first sentence (4 words)
    [0, 6, 0, 0, 4, 0],  # Example labels for the second sentence (6 words)
    [0, 4, 0, 0, 0, 0],  # Example labels for the third sentence (6 words)
    [0, 0, 0, 3, 0, 0],  # Example labels for the fourth sentence (6 words)
]

# Train-Test Split
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Tokenize and align labels
train_encodings, train_labels = tokenize_and_align_labels(texts_train, labels_train)
test_encodings, test_labels = tokenize_and_align_labels(texts_test, labels_test)

# Prepare datasets
train_dataset = NERDataset(train_encodings, train_labels)
test_dataset = NERDataset(test_encodings, test_labels)

# Load ClinicalBERT model
num_labels = 12 # Replace with the actual number of labels
model = BertForTokenClassification.from_pretrained("emilyalsentzer/Bio_ClinicalBERT", num_labels=num_labels)
model.to(device)

# Training function
def train_ner_model(train_dataset, test_dataset, model, learning_rate=5e-5, epochs=3):
    training_args = TrainingArguments(
        output_dir="./clinicalbert_results",
        num_train_epochs=epochs,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=16,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_dir="./clinicalbert_logs",
        logging_steps=10,
        learning_rate=learning_rate,
        load_best_model_at_end=True,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        tokenizer=tokenizer,
    )
    trainer.train()
    return trainer

# Train the model
trainer = train_ner_model(train_dataset, test_dataset, model, learning_rate=5e-5, epochs=3)
