# Patient Journey Mapping

This notebook demonstrates how to extract clinical events from a PDF document and visualize them as a patient journey timeline using various Python libraries and a large language model (LLM).

## Setup

The following cells install necessary libraries for PDF processing, document handling, text splitting, and interacting with a large language model (LLM).

### Install Unsloth and vLLM

This cell installs Unsloth and vLLM, which are libraries used for accelerating the fine-tuning and inference of large language models, particularly relevant for working with models like Mistral. The conditional installation ensures it works correctly in Google Colab and other environments.

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm==0.8.5.post1

### Install Additional Colab Dependencies

This cell installs extra dependencies specifically needed for Colab when using Unsloth and vLLM, including libraries for quantization (bitsandbytes), acceleration, and other related tools. It also handles potential conflicts with existing libraries like numpy.

In [2]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm==0.8.5.post1
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

### Install PDF, Docx, and Langchain Libraries

This cell installs PyMuPDF for PDF processing, python-docx for generating Word documents, and langchain for text splitting and working with language models.

In [3]:
!pip install PyMuPDF python-docx langchain

Collecting PyMuPDF
  Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Collecting python-docx
  Downloading python_docx-1.2.0-py3-none-any.whl.metadata (2.0 kB)
Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Downloading python_docx-1.2.0-py3-none-any.whl (252 kB)
Installing collected packages: python-docx, PyMuPDF
Successfully installed PyMuPDF-1.26.3 python-docx-1.2.0


### Mount Google Drive and Define File Paths

This cell mounts your Google Drive so the notebook can read and write files stored there, and defines the input and output paths. `pdf_File_Path` is the PDF to process and, despite its name, `pdf_Output_Path` is the `.docx` file that the extracted text is written to. **Make sure to update both variables to point to your own files.**

In [16]:
from google.colab import drive
drive.mount('/content/drive')
pdf_File_Path = '/content/drive/MyDrive/HealthcareCJM/Referral_Letter_Events.pdf'
pdf_Output_Path = '/content/drive/MyDrive/HealthcareCJM/Referral Letter_ Nephrology Evaluation for Mr.docx'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## PDF Processing

The following cells handle the extraction of text from the PDF document.

### Extract Text from PDF

This function uses PyMuPDF to open the specified PDF file, iterate through its pages, and extract all text content. The extracted text is then stored in the `pdf_text` variable.

In [17]:
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Return the concatenated text of every page in the PDF."""
    # The context manager closes the file handle once extraction is done
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)

pdf_text = extract_text_from_pdf(pdf_File_Path)
print(f"Extracted {len(pdf_text)} characters from PDF.")

Extracted 19500 characters from PDF.


### Save Extracted Text to Docx

This function takes the extracted text and saves it as a .docx file using the python-docx library. Each paragraph in the original text (separated by double newlines) is added as a paragraph in the Word document.

In [18]:
from docx import Document

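def save_text_to_docx(text, output_path="output.docx"):
    """Write `text` to a .docx file, one Word paragraph per blank-line-separated block."""
    doc = Document()
    for paragraph in text.split('\n\n'):
        doc.add_paragraph(paragraph.strip())
    # Save to the path passed in, not to the global variable
    doc.save(output_path)
    print(f"Saved text to {output_path}")

save_text_to_docx(pdf_text, pdf_Output_Path)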

Saved text to /content/drive/MyDrive/HealthcareCJM/Referral Letter_ Nephrology Evaluation for Mr.docx


## Text Chunking for LLM Processing

Since large documents can exceed the context window of LLMs, the text is split into smaller chunks.

### Split Text into Chunks

This cell uses LangChain's `RecursiveCharacterTextSplitter` to divide the extracted PDF text into chunks. With the default length function, `chunk_size=1000` targets chunks of at most about 1,000 characters, and `chunk_overlap=100` lets adjacent chunks share up to 100 characters so that an event straddling a boundary is less likely to be cut in half.

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document as LC_Document

# Initialize the splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Treat the full PDF text as one document
doc = LC_Document(page_content=pdf_text)

# Chunk it
page_chunks = splitter.split_documents([doc])
print(f"Created {len(page_chunks)} chunks.")

Created 16 chunks.
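
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter in plain Python. It is a simplification and not LangChain's algorithm: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, whereas the hypothetical `sliding_chunks` helper below cuts at fixed character offsets.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# Synthetic 2,500-character sample, used only for illustration
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next
print(chunks[0][-100:] == chunks[1][:100])
```

A larger overlap lowers the chance that an event is split across two chunks, at the cost of more duplicated text and therefore more duplicate events to reconcile later.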


## LLM Setup and Event Extraction

This section sets up the large language model and processes each text chunk to extract structured clinical events.

### Define LLM Prompt for Event Extraction

This cell defines the system prompt used to instruct the LLM on how to extract clinical events. It specifies the goal, instructions, required JSON format, and provides an example. The `<<CHUNK>>` placeholder will be replaced with each text chunk during processing.

In [9]:
mistral_system_prompt = """
You are an AI system specialized in extracting structured clinical events for Healthcare Journey Mapping.

Your goal: From the given text chunk, extract all clinical events mentioned and represent them in a structured, chronological manner.

### Instructions:
1. Identify events such as:
   - Diagnosis
   - Referral Sent
   - Specialist Visit
   - Hospitalization
   - Procedure
   - Imaging Ordered / Imaging Result
   - Lab Ordered / Lab Result
   - Medication Started / Changed / Stopped
   - Follow-up Visit
   - Missed Appointment
   - Patient Reported Symptom
   - Functional/Social Note
   - Other relevant clinical events

2. For each event, provide:
   - **event_type**: One of the categories above.
   - **date**: Use `YYYY-MM-DD` if exact date is available; otherwise `YYYY-MM` or "Unknown".
   - **provider**: Doctor name or facility if present; otherwise "N/A".
   - **details**: Short summary (e.g., reason, findings, medication name/dose, lab value).

3. Do NOT infer or hallucinate events not explicitly mentioned in the text.
4. Keep identifiers exactly as in text (do not generate new names).
5. Output strictly as a JSON array.

### JSON Format:
[
  {
    "event_type": "",
    "date": "",
    "provider": "",
    "details": ""
  }
]

### Example:
[
  {
    "event_type": "Referral Sent",
    "date": "2023-03-01",
    "provider": "Dr. Smith",
    "details": "Concern: nephropathy"
  },
  {
    "event_type": "Hospitalization",
    "date": "2024-02-10",
    "provider": "St. Mary’s Hospital",
    "details": "Acute kidney injury, creatinine peaked at 3.0 mg/dL"
  }
]

Now process this text chunk and return ONLY valid JSON with all events found:
<<CHUNK>>
"""
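
The prompt restricts `event_type` to a fixed list and asks for four fields per event, but nothing enforces this once the model has answered. The sketch below shows one way the extracted dictionaries could be checked afterwards. The `normalize_event` helper, the `ALLOWED_EVENT_TYPES` set (the prompt's combined categories split into individual labels), and the choice to relabel anything else as `"Other"` are assumptions made for illustration, not part of the original pipeline.

```python
# Hypothetical post-processing step: the names and the "Other" fallback are assumptions
ALLOWED_EVENT_TYPES = {
    "Diagnosis", "Referral Sent", "Specialist Visit", "Hospitalization",
    "Procedure", "Imaging Ordered", "Imaging Result", "Lab Ordered",
    "Lab Result", "Medication Started", "Medication Changed",
    "Medication Stopped", "Follow-up Visit", "Missed Appointment",
    "Patient Reported Symptom", "Functional/Social Note",
}
REQUIRED_KEYS = ("event_type", "date", "provider", "details")

def normalize_event(raw):
    """Return a cleaned event dict, or None if `raw` is not a dict."""
    if not isinstance(raw, dict):
        return None
    # Coerce every required field to a stripped string; missing/null becomes ""
    event = {key: str(raw.get(key) or "").strip() for key in REQUIRED_KEYS}
    if event["event_type"] not in ALLOWED_EVENT_TYPES:
        event["event_type"] = "Other"
    # Apply the defaults the prompt asks the model to use
    event["date"] = event["date"] or "Unknown"
    event["provider"] = event["provider"] or "N/A"
    return event

examples = [
    {"event_type": "Lab Result", "date": "2023-05", "provider": "N/A",
     "details": "Creatinine: 2.1 mg/dL"},
    {"event_type": "Dialysis Planning", "date": None,
     "details": "pre-dialysis education"},
    "not a dict",
]
cleaned = [e for e in map(normalize_event, examples) if e is not None]
print(cleaned)
```

Relabeling unexpected categories keeps the chart legends short. Whether out-of-list types should be kept, mapped, or dropped is a design decision for the person reviewing the timeline.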

### Load and Configure the Language Model

This cell loads the specified large language model (`mistralai/Mistral-7B-Instruct-v0.3`) and configures it for text generation. It includes optional 4-bit quantization to reduce memory usage, which is helpful in environments with limited GPU resources like Colab. A text generation pipeline is then created for easy interaction with the model.

In [10]:
# Install dependencies
!pip install -q -U transformers accelerate bitsandbytes

# Import libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline

# Choose model ID
model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# Optional: configure 4-bit quantization if using limited GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model with quantization and auto device placement
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)

# Build a text-generation pipeline
# The model is already quantized and placed on the GPU above, so the pipeline
# only needs the loaded model and tokenizer
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unsloth-zoo 2025.7.7 requires tyro, which is not installed.
unsloth 2025.7.5 requires tyro, which is not installed.
unsloth-zoo 2025.7.7 requires datasets<4.0.0,>=3.4.1, but you have datasets 4.0.0 which is incompatible.
unsloth 2025.7.5 requires datasets<4.0.0,>=3.4.1, but you have datasets 4.0.0 which is incompatible.

tokenizer_config.json:   0%|          | 0.00/141k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/601 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Device set to use cuda:0
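
As a rough check on why 4-bit quantization helps, the arithmetic below estimates the weight memory from the shard sizes in the download log above. It assumes the checkpoint is stored at 16 bits (2 bytes) per parameter and counts the quantized weights only. Quantization constants, activations, and the KV cache add to the real footprint, so treat the result as a lower bound.

```python
# Shard sizes in GB, read from the download log above
shard_sizes_gb = [4.95, 5.00, 4.55]
checkpoint_gb = sum(shard_sizes_gb)

# Assumption: the checkpoint stores 16-bit weights, i.e. 2 bytes per parameter
BYTES_PER_PARAM_16BIT = 2
approx_params_billion = checkpoint_gb / BYTES_PER_PARAM_16BIT

# 4-bit weights take half a byte per parameter
BYTES_PER_PARAM_4BIT = 0.5
weights_4bit_gb = approx_params_billion * BYTES_PER_PARAM_4BIT

print(f"16-bit checkpoint: {checkpoint_gb:.1f} GB")
print(f"Approximate parameters: {approx_params_billion:.2f} billion")
print(f"4-bit weights only: {weights_4bit_gb:.1f} GB")
```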


### Process Chunks and Extract Events

This cell iterates through each text chunk, constructs the full prompt by adding the chunk to the system prompt, and calls the language model pipeline to extract events. It attempts to parse the model's output as JSON and collects all extracted events into a single list (`all_events`). Error handling is included for JSON parsing issues. Finally, the collected events are sorted by date.

In [19]:
import json

# List to store all events from all chunks
all_events = []

# Iterate through all chunks
for i, chunk in enumerate(page_chunks):
    print(f"\n--- Processing chunk {i+1}/{len(page_chunks)} ---")

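    # Combine the pre-defined prompt with the chunk's text. Use `page_content`:
    # str(chunk) would embed the Document repr (page_content='...' metadata={}) in the prompt
    full_prompt = mistral_system_prompt.replace("<<CHUNK>>", chunk.page_content) + "\n\n Return ONLY a valid JSON array without any text before or after."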

    # Call the model pipeline
    # Set return_full_text=False to get only the generated part after the prompt
    output = pipe(full_prompt, max_new_tokens=4000, do_sample=True, return_full_text=False)[0]['generated_text']

    # Debug: Print raw output for verification
    print("\nRaw Model Output (after prompt):\n", output)

    # ✅ Extract JSON safely
    try:
        # Clean up the output string to isolate the JSON
        cleaned_output = output.strip()
        if cleaned_output.startswith("```json"):
            cleaned_output = cleaned_output[len("```json"):].strip()
        if cleaned_output.endswith("```"):
            cleaned_output = cleaned_output[:-len("```")].strip()

        # Find the first '[' and the last ']' in the cleaned output
        start_idx = cleaned_output.find('[')
        end_idx = cleaned_output.rfind(']')

        # Either bracket missing (-1) or ']' before '[' means there is no array to parse
        if start_idx == -1 or end_idx <= start_idx:
            raise ValueError("Could not find a valid JSON array in cleaned output.")

        json_text = cleaned_output[start_idx:end_idx + 1]

        # Load JSON
        chunk_events = json.loads(json_text)

        all_events.extend(chunk_events)
        print(f"\n✅ Extracted {len(chunk_events)} events from chunk {i}")

    except json.JSONDecodeError as e:
        print(f"❌ JSON decoding error for chunk {i}: {e}")
        print("\n--- Debug Info ---")
        print("Attempted JSON text snippet:\n", json_text[:500] if 'json_text' in locals() else "json_text not defined") # Print snippet of problematic text
        print("\nCleaned Output snippet:\n", cleaned_output[:500] if 'cleaned_output' in locals() else "cleaned_output not defined") # Print snippet of problematic text
        print("\nRaw Output snippet:\n", output[:500]) # Print snippet of raw output
    except Exception as e:
        print(f"❌ An unexpected error occurred for chunk {i}: {e}")
        print("\n--- Debug Info ---")
        print("\nRaw Output snippet:\n", output[:500])


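# ✅ Sort events by date
def sort_key(event):
    # Dates are ISO-like strings, so string order matches chronological order;
    # missing, null, or "Unknown" dates are pushed to the end explicitly
    date = event.get("date")
    return "9999-99-99" if date in (None, "", "Unknown") else str(date)

all_events.sort(key=sort_key)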

# ✅ Print combined results
print("\n✅ Final Combined Events:")
print(json.dumps(all_events, indent=2))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



--- Processing chunk 1/16 ---


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

 ```json
[
  {
    "event_type": "Diagnosis",
    "date": "2008-06",
    "provider": "Dr. Ortiz",
    "details": "Type 2 Diabetes Mellitus"
  },
  {
    "event_type": "Diagnosis",
    "date": "Unknown",
    "provider": "N/A",
    "details": "Progressive Diabetic Nephropathy, Long-standing Hypertension, Cardiovascular Disease"
  },
  {
    "event_type": "Referral Sent",
    "date": "2025-06-15",
    "provider": "Dr. Ortiz",
    "details": "Nephrology Evaluation for chronic kidney disease"
  }
]
```

✅ Extracted 3 events from chunk 0

--- Processing chunk 2/16 ---


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

 ```
 [
  {
    "event_type": "Medication Started",
    "date": "2012-01-01",
    "provider": "N/A",
    "details": "Metformin monotherapy initiated"
  },
  {
    "event_type": "Medication Changed",
    "date": "2014-01-01",
    "provider": "N/A",
    "details": "Transitioned to basal insulin regimen (glargine) and continued metformin"
  },
  {
    "event_type": "Lab Result",
    "date": "2015-09-01",
    "provider": "N/A",
    "details": "Urine microalbuminuria noted"
  },
  {
    "event_type": "Medication Started",
    "date": "2015-09-01",
    "provider": "N/A",
    "details": "ACE inhibitor therapy (lisinopril 10 mg) initiated"
  }
]
```

✅ Extracted 4 events from chunk 1

--- Processing chunk 3/16 ---


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

 [
  {
    "event_type": "Medication Started",
    "date": "2017",
    "provider": "N/A",
    "details": "Amlodipine added to medication regimen"
  },
  {
    "event_type": "Patient Reported Symptom",
    "date": "2019-Late",
    "provider": "N/A",
    "details": "Occasional tingling in both feet"
  },
  {
    "event_type": "Diagnosis",
    "date": "2020-01-01",
    "provider": "N/A",
    "details": "Diagnosed with non-proliferative diabetic retinopathy"
  },
  {
    "event_type": "Emergency Department Visit",
    "date": "2020-02",
    "provider": "Emergency Department",
    "details": "New-onset chest pressure, shortness of breath, and elevated troponins"
  },
  {
    "event_type": "Diagnosis",
    "date": "2020-02",
    "provider": "N/A",
    "details": "Diagnosed with Non-ST elevation myocardial infarction (NSTEMI)"
  },
  {
    "event_type": "Procedure",
    "date": "2020-02",
    "provider": "N/A",
    "details": "Coronary angiography with ste

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

 [
  {
    "event_type": "Procedure",
    "date": "20XX-XX-XX",
    "provider": "N/A",
    "details": "Stenting of the LAD"
  },
  {
    "event_type": "Lab Result",
    "date": "20XX-XX-XX",
    "provider": "N/A",
    "details": "Creatinine briefly rose from 1.5 to 1.9 mg/dL"
  },
  {
    "event_type": "Medication Started",
    "date": "20XX-XX-XX",
    "provider": "N/A",
    "details": "Dual antiplatelet therapy, intensified statin dosing, and a beta blocker"
  },
  {
    "event_type": "Lab Result",
    "date": "2020-XX-XX",
    "provider": "N/A",
    "details": "Creatinine 1.8–2.0, eGFR ~42–45 mL/min"
  },
  {
    "event_type": "Lab Result",
    "date": "Unknown",
    "provider": "N/A",
    "details": "Proteinuria increased steadily, with occasional findings in the range of 1.0 to 1.4 g/g."
  },
  {
    "event_type": "Follow-up Visit",
    "date": "2023-08-XX",
    "provider": "N/A",
    "details": "Noted progressive fatigue, worsening exertional 

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

```
[
  {
    "event_type": "Hospitalization",
    "date": "2025-05-28",
    "provider": "St. David’s Medical Center",
    "details": "IV antihypertensives and diuretics treatment, peak creatinine 3.0 mg/dL, mild hyperkalemia, metabolic acidosis"
  },
  {
    "event_type": "Lab Result",
    "date": "2025-05-28",
    "provider": "St. David’s Medical Center",
    "details": "Creatinine: 2.6 mg/dL, Estimated GFR: 24 mL/min, BUN: 38 mg/dL, Potassium: 5.1 mmol/L, Urine Protein/Creatinine: 1.8 g/g, Hemoglobin: 11.6 g/dL, HbA1c (March 2025): 7.3%"
  },
  {
    "event_type": "Medication Started",
    "date": "2014",
    "provider": "N/A",
    "details": "Lantus (glargine) added"
  },
  {
    "event_type": "Medication Changed",
    "date": "2017",
    "provider": "N/A",
    "details": "Amlodipine added"
  }
]
```

✅ Extracted 4 events from chunk 4

--- Processing chunk 6/16 ---


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

[
  {
    "event_type": "Medication Started",
    "date": "Unknown",
    "provider": "N/A",
    "details": "Amlodipine 20 mg daily, Atorvastatin 40 mg at bedtime, Furosemide 40 mg daily, Tamsulosin 0.4 mg HS, Allopurinol 100 mg daily"
  },
  {
    "event_type": "Medication Changed",
    "date": "2017-01-01",
    "provider": "N/A",
    "details": "Up-titrated Amlodipine over years"
  },
  {
    "event_type": "Medication Stopped",
    "date": "2024-12-01",
    "provider": "N/A",
    "details": "Ibuprofen PRN (discontinued), Clopidogrel 75 mg daily"
  },
  {
    "event_type": "Lab Result",
    "date": "Unknown",
    "provider": "N/A",
    "details": "Creatinine increased from 2.1 to 2.6 mg/dL, eGFR fell below 25 mL/min, significant increase in urine protein excretion"
  }
]

✅ Extracted 4 events from chunk 5

--- Processing chunk 7/16 ---


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

[
  {
    "event_type": "Patient Reported Symptom",
    "date": "2023-09-01",
    "provider": "N/A",
    "details": "Greater day-to-day fatigue, requiring longer rest periods between basic activities, increased lower extremity swelling in the evenings, mild dyspnea on exertion"
  },
  {
    "event_type": "Lab Result",
    "date": "2023-09-01",
    "provider": "N/A",
    "details": "Mild normocytic anemia (Hgb ~11.8), slightly elevated potassium"
  },
  {
    "event_type": "Follow-up Visit",
    "date": "2023-09-01",
    "provider": "N/A",
    "details": "Subtle change in complexion and demeanor, expressed frustration about feeling “slower” and more mentally foggy, denied syncope, falls, or severe memory issues"
  },
  {
    "event_type": "Lab Ordered / Lab Result (Repeated)",
    "date": "2023-09-01",
    "provider": "N/A",
    "details": "Repeated renal function tests every 6–8 weeks"
  }
]

✅ Extracted 4 events from chunk 6

--- Processing chunk 8/16 ---

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 
 [
  {
    "event_type": "Lab Result",
    "date": "2025-04-04",
    "provider": "ER Staff",
    "details": "Creatinine peaked at 3.0 mg/dL, BUN of 45, and a serum potassium of 5.3"
  },
  {
    "event_type": "Hospitalization",
    "date": "2025-04-04",
    "provider": "ER",
    "details": "Admitted due to hypertensive emergency, acute-on-chronic kidney injury, and edema"
  },
  {
    "event_type": "Medication Changed",
    "date": "2025-04",
    "provider": "Dr. John",
    "details": "Increased furosemide to 40 mg daily"
  }
]

✅ Extracted 3 events from chunk 7

--- Processing chunk 9/16 ---


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

[
  {
    "event_type": "Lab Result",
    "date": "2023-05",
    "provider": "N/A",
    "details": "Creatinine: 2.1 mg/dL"
  },
  {
    "event_type": "Lab Result",
    "date": "2023-06",
    "provider": "N/A",
    "details": "Creatinine: 2.6 mg/dL"
  },
  {
    "event_type": "Lab Result",
    "date": "2023-06",
    "provider": "N/A",
    "details": "Proteinuria: 1.8 g/g"
  },
  {
    "event_type": "Lab Result",
    "date": "2023-06",
    "provider": "N/A",
    "details": "Potassium: 5.3 mEq/L"
  },
  {
    "event_type": "Lab Result",
    "date": "2023-06",
    "provider": "N/A",
    "details": "Hemoglobin: Trending low"
  },
  {
    "event_type": "Medication Changed",
    "date": "2023-04-04",
    "provider": "N/A",
    "details": "Lasix dose increased to 40 mg daily"
  },
  {
    "event_type": "Hospitalization",
    "date": "2023-04-04",
    "provider": "St. Mary’s Hospital",
    "details": "HTN emergency"
  },
  {
    "event_type": "Follow-up Visi

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

 [
  {
    "event_type": "Referral Decision",
    "date": "2025-05-Unknown",
    "provider": "Dr. Smith",
    "details": "Referred to nephrology specialist"
  },
  {
    "event_type": "Referral Sent",
    "date": "2025-06-18",
    "provider": "Dr. N/A",
    "details": "Renal ultrasound ordered"
  },
  {
    "event_type": "Imaging Ordered",
    "date": "2025-06-18",
    "provider": "Southlake Imaging",
    "details": "Renal ultrasound scheduled"
  },
  {
    "event_type": "Functional/Social Note",
    "date": "2025-06-18",
    "provider": "N/A",
    "details": "Lives alone in single-level home"
  }
]

✅ Extracted 4 events from chunk 9

--- Processing chunk 11/16 ---


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

[
  {
    "event_type": "Patient Reported Symptom",
    "date": "Unknown",
    "provider": "N/A",
    "details": "Minor lapses in memory, forgetting lab appointments or misplacing glucometer"
  },
  {
    "event_type": "Follow-up Visit",
    "date": "Unknown",
    "provider": "N/A",
    "details": "Daughter visits monthly from Denver"
  },
  {
    "event_type": "Medication Started",
    "date": "Unknown",
    "provider": "N/A",
    "details": "Compliant with Medicare Part B"
  },
  {
    "event_type": "Procedure",
    "date": "2021-08-15",
    "provider": "N/A",
    "details": "Knee replacement"
  },
  {
    "event_type": "Functional/Social Note",
    "date": "Unknown",
    "provider": "N/A",
    "details": "Mr. Halperin lives alone in a single-level home, cooks for himself and drives short distances, uses a cane on uneven surfaces due to right knee osteoarthritis, stopped traveling beyond his immediate area, expressed anxiety about future hospitali

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

 [
  {
    "event_type": "Patient Reported Symptom",
    "date": "Unknown",
    "provider": "N/A",
    "details": "fatigue, poor appetite, mild cognitive slowing"
  },
  {
    "event_type": "Lab Result",
    "date": "Unknown",
    "provider": "N/A",
    "details": "creatinine continues to rise, and eGFR has dropped into the low 20s"
  },
  {
    "event_type": "Lab Result",
    "date": "Unknown",
    "provider": "N/A",
    "details": "increasing proteinuria"
  },
  {
    "event_type": "Follow-up Visit",
    "date": "Unknown",
    "provider": "N/A",
    "details": "Despite optimized therapy with ACE inhibitors, statin, insulin, and diuretic support"
  },
  {
    "event_type": "Clinical Question",
    "date": "Unknown",
    "provider": "N/A",
    "details": "Etiologic Clarification: While diabetic nephropathy remains the leading hypothesis"
  }
]

✅ Extracted 5 events from chunk 11

--- Processing chunk 13/16 ---


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

[
  {
    "event_type": "Patient Reported Symptom",
    "date": "Unknown",
    "provider": "N/A",
    "details": "rate of decline and degree of proteinuria"
  },
  {
    "event_type": "Consultation Recommendation",
    "date": "Unknown",
    "provider": "N/A",
    "details": "consider biopsy"
  },
  {
    "event_type": "Dialysis Planning",
    "date": "Unknown",
    "provider": "N/A",
    "details": "pre-dialysis education or vascular access evaluation"
  },
  {
    "event_type": "Medication Started",
    "date": "Unknown",
    "provider": "N/A",
    "details": "lisinopril and furosemide"
  },
  {
    "event_type": "Medication Recommendation",
    "date": "Unknown",
    "provider": "N/A",
    "details": "SGLT2 inhibitor or other renally-protective agent"
  },
  {
    "event_type": "Electrolyte Surveillance",
    "date": "Unknown",
    "provider": "N/A",
    "details": "mild hyperkalemia and fluctuating bicarbonate levels"
  },
  {
    "event_type": 

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

 [
  {
    "event_type": "Follow-up Visit",
    "date": "2023-05-XX",
    "provider": "Your Name",
    "details": "Discussion on dialysis, medication management, lifestyle suggestions, and renal ultrasound"
  },
  {
    "event_type": "Imaging Ordered",
    "date": "2023-06-18",
    "provider": "Imaging Facility",
    "details": "Renal ultrasound"
  },
  {
    "event_type": "Lab Result",
    "date": "Unknown",
    "provider": "Lab Facility",
    "details": "List of recent labs"
  },
  {
    "event_type": "Medication Started",
    "date": "Unknown",
    "provider": "Your Name",
    "details": "Medication for nephropathy management"
  }
]

✅ Extracted 4 events from chunk 13

--- Processing chunk 15/16 ---


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Raw Model Output (after prompt):
 

 [
  {
    "event_type": "Referral Sent",
    "date": "2023-03-01",
    "provider": "Jennifer L. Ortiz, MD",
    "details": "Referral for specialty care"
  },
  {
    "event_type": "Patient Reported Symptom",
    "date": "2023-01",
    "provider": "N/A",
    "details": "Mild morning headaches and occasional dizziness"
  },
  {
    "event_type": "Lab Result",
    "date": "2023-02",
    "provider": "N/A",
    "details": "Mild upward trend in creatinine to 2.0 mg/dL"
  },
  {
    "event_type": "Patient Reported Symptom",
    "date": "2023-03",
    "provider": "N/A",
    "details": "Muscle cramps during nighttime"
  },
  {
    "event_type": "Lab Result",
    "date": "2023-04",
    "provider": "N/A",
    "details": "Review of his home glucose logs showed increased postprandial spikes"
  },
  {
    "event_type": "Patient Reported Symptom",
    "date": "2023-05",
    "provider": "N/A",
    "details": "Increased fatigue and reduced appetite"
  },
  {
    "e
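
Several raw outputs in the log above stop in the middle of an object, in which case `json.loads` rejects the whole array and every event from that chunk is lost. The sketch below shows one way to salvage the objects that were completed before the cut-off, using `json.JSONDecoder.raw_decode` to parse one object at a time. The `extract_events` helper is hypothetical and assumes the first `[` in the output opens the event array.

```python
import json

def extract_events(raw_output: str) -> list:
    """Parse a JSON array of objects from model output, keeping the complete
    objects when the array is cut off part-way through."""
    start = raw_output.find("[")  # assumption: the first '[' opens the array
    if start == -1:
        return []
    decoder = json.JSONDecoder()
    events = []
    pos = start + 1
    while pos < len(raw_output):
        # Skip whitespace and commas between objects
        while pos < len(raw_output) and raw_output[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(raw_output) or raw_output[pos] != "{":
            break  # closing bracket, trailing text, or end of output
        try:
            obj, pos = decoder.raw_decode(raw_output, pos)
        except json.JSONDecodeError:
            break  # incomplete trailing object: keep what was parsed so far
        events.append(obj)
    return events

complete = ' [\n  {"event_type": "Diagnosis", "date": "2008-06"},\n  {"event_type": "Referral Sent", "date": "2025-06-15"}\n]'
truncated = '[\n  {"event_type": "Procedure", "date": "2020-02"},\n  {"event_type": "Lab Res'

print(len(extract_events(complete)))
print(len(extract_events(truncated)))
```

Salvaged events come from output that ended abnormally, so it may be worth flagging them for manual review rather than mixing them silently with the rest.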

## Data Visualization

The following cells visualize the extracted clinical events as a timeline.
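
The model returns dates at mixed precision (`2008-06`, `2017`, `2025-06-15`) and sometimes with placeholders (`2023-08-XX`, `20XX-XX-XX`, `2019-Late`). Values that cannot be parsed become `NaT` under `pd.to_datetime(..., errors='coerce')` and drop out of the charts. The sketch below is one possible normalization step in plain Python. The `parse_partial_date` helper and its convention of defaulting a missing month or day to 1 are assumptions for illustration only.

```python
import re
from datetime import date

_DATE_RE = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$")

def parse_partial_date(value):
    """Return a datetime.date for 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD', else None.
    A missing month or day defaults to 1 (a display convention, not a fact)."""
    # Drop trailing '-XX' placeholders so '2023-08-XX' is read as '2023-08'
    text = re.sub(r"(?:-XX)+$", "", str(value).strip(), flags=re.IGNORECASE)
    match = _DATE_RE.match(text)
    if not match:
        return None
    year, month, day = (int(part) if part else 1 for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13 or day 32
        return None

def chronological_key(event):
    """Sort key: parsed dates first in order, unparseable dates last."""
    parsed = parse_partial_date(event.get("date"))
    return (parsed is None, parsed or date.max)

samples = ["2025-06-15", "2008-06", "2017", "2023-08-XX", "20XX-XX-XX", "2019-Late", "Unknown"]
for s in samples:
    print(s, "->", parse_partial_date(s))

ordered = sorted([{"date": s} for s in samples], key=chronological_key)
print([e["date"] for e in ordered])
```

Keeping the original string next to the parsed value lets the timeline show that an event was only dated to the month or year.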

### Visualize Events with Plotly (Gantt and Scatter)

This cell uses Plotly to create two different visualizations of the extracted events: a Gantt-style timeline and a scatter plot timeline. Both charts show the events plotted against time, colored by event type. The Gantt chart uses bars, while the scatter plot uses points. Both are combined into a single figure using subplots.

In [20]:
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Convert all_events to DataFrame
df = pd.DataFrame(all_events)

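# Parse dates; "Unknown" and malformed values such as "20XX-XX-XX" become NaT and are dropped
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df = df.dropna(subset=['date']).copy()

# Sort by date
df.sort_values(by='date', inplace=True)

# Events are points in time; give each a nominal 30-day span (display only),
# because a bar whose start equals its end has zero width and is invisible
df['date_end'] = df['date'] + pd.Timedelta(days=30)

# -------------------------
# ✅ Gantt-style timeline (bars)
fig_timeline = px.timeline(
    df,
    x_start="date",
    x_end="date_end",
    y="event_type",
    color="event_type",
    hover_data=["details", "provider"]
)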
fig_timeline.update_yaxes(autorange="reversed")
fig_timeline.update_layout(title="Patient Journey Timeline (Gantt-style)")

# -------------------------
# ✅ Scatter plot timeline (points)
fig_scatter = px.scatter(
    df,
    x="date",
    y="event_type",
    color="event_type",
    hover_data=["details", "provider"],
    symbol="event_type"
)
fig_scatter.update_traces(marker=dict(size=12))
fig_scatter.update_layout(title="Patient Journey Timeline (Scatter)")

# -------------------------
# ✅ Combine both in subplots
fig = make_subplots(
    rows=2, cols=1,
    subplot_titles=("Gantt-style Timeline", "Scatter Timeline"),
    vertical_spacing=0.2
)

# Add Gantt bars
for trace in fig_timeline.data:
    fig.add_trace(trace, row=1, col=1)

# Add scatter points
for trace in fig_scatter.data:
    fig.add_trace(trace, row=2, col=1)

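# Update layout
# Traces copied out of px.timeline lose its layout settings, so restore the
# date axis and overlay bar mode for the Gantt subplot
fig.update_xaxes(type="date", row=1, col=1)
fig.update_layout(
    height=1000,
    title_text="Patient Journey Dashboard",
    showlegend=True,
    barmode="overlay"
)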

fig.show()
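
Because adjacent chunks share up to 100 characters, the same event can be extracted from two chunks and then appears twice in `all_events` and in every chart. A minimal de-duplication sketch follows. The `dedupe_events` helper is hypothetical and only removes exact repeats after normalizing case and whitespace, so near-duplicates such as "Amlodipine added" and "Amlodipine added to medication regimen" in the log would still need fuzzy matching or manual review.

```python
def dedupe_events(events):
    """Keep the first occurrence of each (event_type, date, details) combination."""
    seen = set()
    unique = []
    for event in events:
        key = (
            event.get("event_type", ""),
            event.get("date", ""),
            # Normalize case and whitespace so trivial differences do not matter
            " ".join(str(event.get("details", "")).lower().split()),
        )
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique

# Illustrative records shaped like the extracted events
sample_events = [
    {"event_type": "Lab Result", "date": "2023-06", "provider": "N/A", "details": "Creatinine: 2.6 mg/dL"},
    {"event_type": "Lab Result", "date": "2023-06", "provider": "N/A", "details": "creatinine:  2.6 mg/dL"},
    {"event_type": "Lab Result", "date": "2023-06", "provider": "N/A", "details": "Proteinuria: 1.8 g/g"},
]
unique_events = dedupe_events(sample_events)
print(len(sample_events), "->", len(unique_events))
```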

### Visualize Events with Altair

This cell uses Altair to create an interactive scatter plot timeline of the extracted events. This visualization allows filtering by event type using the legend and by date range using a brush selection on the x-axis. Hovering over a point reveals detailed information about the event.

In [21]:
import pandas as pd
import altair as alt

df = pd.DataFrame(all_events)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df = df.sort_values('date')

# Legend-based event-type filter (selection_point replaces selection_multi, deprecated since Altair 5)
legend_sel = alt.selection_point(fields=['event_type'], bind='legend')

# Brush for date range selection
brush = alt.selection_interval(encodings=['x'])

chart = alt.Chart(df).mark_circle(size=100).encode(
    x='date:T',
    y=alt.Y('event_type:N', sort=alt.EncodingSortField('date', order='ascending')),
    color='event_type:N',
    opacity=alt.condition(legend_sel & brush, alt.value(1), alt.value(0.2)),
    tooltip=['date:T', 'event_type:N', 'provider:N', 'details:N']
).add_params(  # add_params replaces add_selection, deprecated since Altair 5
    legend_sel, brush
).properties(
    width=800, height=400,
    title="Patient Journey: Filter by Event Type and Date"
).interactive()

chart



### Visualize Events with Bokeh

This cell uses Bokeh to create an interactive scatter plot timeline. It displays events as circles plotted against date and event type. Hovering over a circle shows a tooltip with event details. The legend allows toggling the visibility of different event types. Vertical jitter is added to the points to prevent overlap when multiple events occur on the same date for the same event type.

In [22]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.palettes import Category10, Category20
from bokeh.transform import jitter
import pandas as pd

output_notebook()

# Prepare DataFrame
df = pd.DataFrame(all_events)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df = df.dropna(subset=['date'])  # Remove rows without valid dates

# Define unique event categories and color mapping
categories = sorted(df['event_type'].unique())
num = len(categories)
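# Category20 only defines palettes for 3 to 20 colors, so request at least 3 and trim
palette = Category20[max(num, 3)][:num] if num <= 20 else (Category10[10] * ((num // 10) + 1))[:num]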
color_map = dict(zip(categories, palette))
df['color'] = df['event_type'].map(color_map)

# Create ColumnDataSource
source = ColumnDataSource(df)

# Create the figure
p = figure(
    width=900, height=400,
    x_axis_type='datetime',
    y_range=categories,
    title="Patient Journey – Hover for Details, Click Legend to Toggle",
    tools="pan,box_zoom,reset,save"
)

# Add HoverTool
hover = HoverTool(
    tooltips=[
        ("Date", "@date{%F}"),
        ("Event", "@event_type"),
        ("Provider", "@provider"),
        ("Details", "@details")
    ],
    formatters={'@date': 'datetime'},
    mode='mouse'
)
p.add_tools(hover)

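# Plot using scatter with vertical jitter to avoid overlap.
# One renderer per event type: Bokeh's interactive legend acts on whole glyph
# renderers, so a single `legend_field` renderer cannot be muted per category
for category in categories:
    category_source = ColumnDataSource(df[df['event_type'] == category])
    p.scatter(
        x='date',
        y=jitter('event_type', width=0.6, range=p.y_range),
        source=category_source,
        color=color_map[category],
        size=10,
        alpha=0.8,
        legend_label=category,
        muted_alpha=0.2,
        marker="circle"
    )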

# Configure legend for click-to-toggle behavior
p.legend.location = 'top_left'
p.legend.click_policy = 'mute'

# Label axes
p.xaxis.axis_label = 'Date'
p.yaxis.axis_label = 'Event Type'

show(p)