## Final Assignment Overview: Working with Patient Records and Encounter Notes

In this final assignment, we’ll focus on patient records related to COVID-19 encounters. Our task is to analyze, process, and transform the data while applying the concepts we’ve covered throughout this course. Here's a detailed breakdown of the assignment:

### What Are Encounter Notes?
An encounter note is a record that captures details about a patient’s visit with a doctor. It includes both structured and semi-structured information that is crucial for understanding the context of the visit. Here’s what an encounter note typically looks like:

```
AMBULATORY ENCOUNTER NOTE
Date of Service: March 2, 2020 15:45-16:30

DEMOGRAPHICS:
Name: Jeffrey Greenfelder
DOB: 1/16/2005
Gender: Male
Address: 428 Wiza Glen Unit 91, Springfield, Massachusetts 01104
Insurance: Guardian
MRN: 055ae6fc-7e18-4a39-8058-64082ca6d515

PERTINENT MEDICAL HISTORY:
- Obesity 

Recent Visit: Well child visit (2/23/2020)
Immunizations: Influenza vaccine (2/23/2020)

Recent Baseline (2/23/2020):
Height: 155.0 cm
Weight: 81.2 kg
BMI: 33.8 kg/m² (99.1th percentile)
BP: 123/80 mmHg
HR: 92/min
RR: 13/min

SUBJECTIVE:
Adolescent patient presents with multiple symptoms including:
- Cough
- Sore throat
- Severe fatigue
- Muscle pain
- Joint pain
- Fever
Never smoker. Symptoms began recently.

OBJECTIVE:
Vitals:
Temperature: 39.3°C (102.7°F)
Heart Rate: 131.1/min
Blood Pressure: 120/73 mmHg
Respiratory Rate: 27.6/min
O2 Saturation: 75.8% on room air
Weight: 81.2 kg

Laboratory/Testing:
Comprehensive Respiratory Panel:
- Influenza A RNA: Negative
- Influenza B RNA: Negative
- RSV RNA: Negative
- Parainfluenza virus 1,2,3 RNA: Negative
- Rhinovirus RNA: Negative
- Human metapneumovirus RNA: Negative
- Adenovirus DNA: Negative
- SARS-CoV-2 RNA: Positive

ASSESSMENT:
1. Suspected COVID-19 with severe symptoms
2. Severe hypoxemia requiring immediate intervention
3. Tachycardia (HR 131)
4. High-grade fever
5. Risk factors:
   - Obesity (BMI 33.8)
   - Adolescent age

PLAN:
1. Face mask provided for immediate oxygen support
2. Infectious disease care plan initiated
3. Close monitoring required due to:
   - Severe hypoxemia
   - Tachycardia
   - Age and obesity risk factors
4. Parent/patient education on:
   - Home isolation protocols
   - Warning signs requiring emergency care
   - Return precautions
5. Follow-up plan:
   - Daily monitoring during acute phase
   - Virtual check-ins as needed

Encounter Duration: 45 minutes
Encounter Type: Ambulatory
Provider: ID# e2c226c2-3e1e-3d0b-b997-ce9544c10528
Facility: 5103c940-0c08-392f-95cd-446e0cea042a
```


The encounter note contains:

* General encounter information: 

  * When the encounter took place: Date and time of the visit.
  * Demographics: Patient’s age, gender, and unique medical record identifier.
  * Encounter details: The reason for the visit, diagnosis, and any associated costs.


* Semi-Structured Notes:

These notes mirror how doctors organize their thoughts and observations during an encounter. They generally follow a SOAP format:

* Subjective: The patient’s subjective description of their symptoms, feelings, and medical concerns.
* Objective: The doctor’s objective findings, including test results, measurements, or physical examination outcomes.
* Assessment: The doctor’s evaluation or diagnosis based on subjective and objective information.
* Plan: The proposed treatment plan, including medications, follow-ups, or other interventions.

While some encounter notes might include additional details, the majority conform to this semi-structured format, making them ideal for analysis and transformation.
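
Because the SOAP headers sit on their own lines, a note can be split into sections with a regular expression before any LLM call. The following is a minimal sketch, assuming the headers are written exactly as in the sample note (`SUBJECTIVE:`, `OBJECTIVE:`, `ASSESSMENT:`, `PLAN:`). The function name `split_soap_sections` and the shortened sample text are illustrative and not part of the assignment code.

```python
import re

# Header labels assumed from the sample note; real notes may use other labels.
SOAP_HEADERS = ("SUBJECTIVE", "OBJECTIVE", "ASSESSMENT", "PLAN")

def split_soap_sections(note_text: str) -> dict:
    """Map each SOAP header found in the note to the text that follows it."""
    # Match a header that stands alone on its line, e.g. "SUBJECTIVE:"
    pattern = re.compile(r"^(%s):[ \t]*$" % "|".join(SOAP_HEADERS), re.MULTILINE)
    matches = list(pattern.finditer(note_text))
    sections = {}
    for i, match in enumerate(matches):
        start = match.end()
        # A section runs until the next SOAP header or the end of the note,
        # so any footer after PLAN ends up inside the "plan" section.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note_text)
        sections[match.group(1).lower()] = note_text[start:end].strip()
    return sections

sample_note = """SUBJECTIVE:
- Cough
- Fever

OBJECTIVE:
Temperature: 39.3°C (102.7°F)

ASSESSMENT:
1. Suspected COVID-19 with severe symptoms

PLAN:
1. Infectious disease care plan initiated
"""

sections = split_soap_sections(sample_note)
print(sections["subjective"])
```

Splitting first keeps each LLM prompt short and makes it easy to spot notes that are missing a section, though notes that deviate from the header convention would need a fallback.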

### Goals for the Assignment

1. Transforming Encounter Notes:

Use an LLM to convert the semi-structured encounter notes into a JSON format that organizes the information into structured fields. The JSON will include details such as demographics, encounter specifics, and the SOAP components of the note. You will then transform the JSON data into a Parquet file, which is suitable both for analysis in Spark and for later storage.
Here we will use ML classification to assign the objective and assessment semi-structured fields to standardized, structured fields. The medical taxonomy for this task will be the one provided by the CDC, which defines standard codes for diagnoses, symptoms, procedures, and treatments. This step ensures the structured data aligns with domain-wide medical standards, making it interoperable and ready for deeper analysis.

The JSON format should capture the hierarchies described in the structure below.




2. Basic Analytics and Visualizations:
Using Apache Spark, perform comprehensive data analysis on the encounter data and create visualizations that reveal meaningful patterns. Your analysis must include:
- COVID-19 Case Demographics: Case breakdown by age ranges ([0-5], [6-10], [11-17], [18-30], [31-50], [51-70], [71+])
- Cumulative COVID-19 case count between the earliest and the last case observed in the dataset
- Symptoms for all COVID-19 patients versus patients who were admitted to the intensive care unit due to COVID-19.
- Rank medications by frequency of prescription
- Analyze medication patterns across different demographic groups (e.g., top 3 per age group)
- Identify and plot co-morbidity information from the patient records (e.g., hypertension, obesity, prediabetes, etc.) provided in the dataset. 
- An independent group analysis: You need to develop and execute THREE original analyses that provide meaningful insights about COVID-19 patterns in this dataset. For each analysis:
  - Clearly state your analytical question/hypothesis
  - Justify why this analysis is valuable
  - Show your Spark code and methodology
  - Present results with appropriate visualizations
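
The age-range breakdown and the cumulative case count both reduce to simple logic that is worth checking on a few rows before porting it to Spark. The sketch below uses plain Python on hypothetical toy rows. The names `age_bucket` and `cumulative_cases` are illustrative, and the real analysis is expected to run in Spark on the full dataset.

```python
from collections import Counter
from datetime import date
from itertools import accumulate

# Inclusive upper bounds and labels for the age ranges listed in the assignment
AGE_BUCKETS = [(5, "0-5"), (10, "6-10"), (17, "11-17"),
               (30, "18-30"), (50, "31-50"), (70, "51-70")]

def age_bucket(age: int) -> str:
    """Map an age in years to one of the assignment's age-range labels."""
    for upper, label in AGE_BUCKETS:
        if age <= upper:
            return label
    return "71+"

def cumulative_cases(case_dates):
    """Return (date, running total) pairs from the earliest to the latest case date."""
    daily = Counter(case_dates)
    ordered = sorted(daily)
    totals = accumulate(daily[d] for d in ordered)
    return list(zip(ordered, totals))

# Hypothetical toy rows: (age, date of positive test)
cases = [
    (16, date(2020, 3, 2)),
    (15, date(2020, 3, 2)),
    (72, date(2020, 3, 4)),
    (45, date(2020, 3, 3)),
]

bucket_counts = Counter(age_bucket(age) for age, _ in cases)
running = cumulative_cases(d for _, d in cases)
print(bucket_counts)
print(running)
```

In Spark, the same bucketing could be written as a chain of `when`/`otherwise` expressions and the running total as a `sum` over a window ordered by date.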


In [None]:
EncounterType:
    code
    description

Encounter:
    id
    date
    time
    type: EncounterType
    provider_id
    facility_id

Address:
    city
    state

Demographics:
    id
    name
    date_of_birth
    age
    gender
    address: Address
    insurance

Condition:
    code
    description

Medication:
    code
    description

Immunization:
    code
    description
    date: date

VitalMeasurement:
    code
    value: float
    unit

BloodPressure:
    systolic: VitalMeasurement
    diastolic: VitalMeasurement

CurrentVitals:
    temperature: VitalMeasurement
    heart_rate: VitalMeasurement
    blood_pressure: BloodPressure
    respiratory_rate: VitalMeasurement
    oxygen_saturation: VitalMeasurement
    weight: VitalMeasurement

BaselineVitals:
    date: date
    height: VitalMeasurement
    weight: VitalMeasurement
    bmi: VitalMeasurement
    bmi_percentile: VitalMeasurement

Vitals:
    current: CurrentVitals
    baseline: BaselineVitals

RespiratoryTest:
    code
    result

RespiratoryPanel:
    influenza_a: RespiratoryTest
    influenza_b: RespiratoryTest
    rsv: RespiratoryTest
    parainfluenza_1: RespiratoryTest
    parainfluenza_2: RespiratoryTest
    parainfluenza_3: RespiratoryTest
    rhinovirus: RespiratoryTest
    metapneumovirus: RespiratoryTest
    adenovirus: RespiratoryTest

Covid19Test:
    code
    description
    result

Laboratory:
    covid19: Covid19Test
    respiratory_panel: RespiratoryPanel

Procedure:
    code
    description
    date: date
    reasonCode
    reasonDescription

CarePlan:
    id
    code
    description
    start: date
    stop: date
    reasonCode
    reasonDescription

PatientRecord:
    encounter: Encounter
    demographics: Demographics
    conditions: List[Condition]
    medications: List[Medication]
    immunizations: List[Immunization]
    vitals: Vitals
    laboratory: Laboratory
    procedures: List[Procedure]
    care_plans: List[CarePlan]

To find the list of medications and their codes, search by similarity at the embedding level (find the top 10 most similar entries with FAISS).
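
To make the top-k idea concrete, here is a minimal brute-force sketch in plain Python using cosine similarity. It is a stand-in for the FAISS search, not the FAISS API. On L2-normalized vectors a flat L2 index returns the same ranking. The catalog codes, descriptions, and three-dimensional vectors are hypothetical toy values, while real embeddings would come from the sentence-transformer model.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query_vec, catalog, k=10):
    """Return the k catalog entries most similar to query_vec as (score, code, description)."""
    scored = (
        (cosine_similarity(query_vec, vec), code, desc)
        for code, desc, vec in catalog
    )
    return heapq.nlargest(k, scored)

# Hypothetical catalog: (code, description, embedding)
catalog = [
    ("A1", "oral contraceptive", [1.0, 0.0, 0.0]),
    ("B2", "antihypertensive", [0.0, 1.0, 0.0]),
    ("C3", "oral contraceptive, low dose", [0.9, 0.1, 0.0]),
]

query = [1.0, 0.05, 0.0]
top = top_k_similar(query, catalog, k=2)
for score, code, desc in top:
    print(code, desc, round(score, 3))
```

Returning several candidates instead of a single nearest neighbor leaves room for a later step, such as an LLM prompt, to pick the best code from the shortlist.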

Retrieval Augmented Generation


In [58]:
file_name = "199c586f-af16-4091-9998-ee4cfc02ee7a.txt"
file_path = "data/encounter_notes/" + file_name
with open(file_path, "r") as f:
    # Read the entire note into a string
    query_text = f.read()
print(query_text)

URGENT CARE ENCOUNTER NOTE
Date of Service: March 2, 2020 04:15-05:15

DEMOGRAPHICS:
Name: Jimmie Harris
DOB: 1/9/2004 (16y/o)
Gender: Female
Address: Pembroke, MA
Insurance: Medicare/Medicaid
MRN: 199c586f-af16-4091-9998-ee4cfc02ee7a

PERTINENT MEDICAL HISTORY:
No significant past medical history
Current Medications:
- Jolivette (oral contraceptive)
Last Visit: Well child visit (2/22/2020)
Immunizations: 
- Influenza vaccine (2/21/2020)
- Meningococcal vaccine (2/21/2020)

Recent Labs (2/21/2020):
CBC Results:
- WBC: 7.9 K/uL
- RBC: 4.6 M/uL
- Hemoglobin: 12.6 g/dL
- Hematocrit: 46.5%
- Platelets: 398.3 K/uL

SUBJECTIVE:
Previously healthy adolescent presents with fever, productive cough with sputum, nausea, and vomiting. Symptoms began yesterday. The patient has no history of smoking and reports no known contacts with COVID-19.

OBJECTIVE:
Vitals:
Temperature: 40.7°C (105.3°F)
Heart Rate: 98/min
Blood Pressure: 120/89 mmHg
Respiratory Rate: 22/min
O2 Saturation: 78.2% on room air
Wei

In [None]:
from pydantic import BaseModel, Field, field_validator
from typing import List, Optional
from datetime import date, time
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from langchain.chains import LLMChain

class Encounter(BaseModel): 
    code: Optional[int] = Field(None, description="Code representing the encounter type.") 
    encounter_start: date = Field(..., description="Date of the encounter (ISO format).")
    description: str = Field(..., description="Description of the encounter type.")
    provider_id: str = Field(..., description="Unique identifier for the healthcare provider.")
    facility_id: str = Field(..., description="Unique identifier for the facility where the encounter occurred.")

class Address(BaseModel):
    street: Optional[str] = Field(None, description="Street address of the patient's residence.")
    city: str = Field(..., description="City of the patient's residence.")
    state: str = Field(..., description="State of the patient's residence.")
    zip: Optional[str] = Field(None, description="Zip code of the patient's residence.")

class Demographics(BaseModel):
    name: str = Field(..., description="Full name of the patient.")
    date_of_birth: date = Field(..., description="Date of birth of the patient (ISO format).")
    age: int = Field(..., description="Age of the patient in years, to be calculated from subtracting date of birth from date of service.")
    gender: str = Field(..., description="Gender of the patient.")
    insurance: str = Field(..., description="Insurance information for the patient.")
    address: Optional[Address] = Field(None, description="Address of the patient's residence.")

class Symptom(BaseModel):
    description: List[str] = Field(..., description="List of all symptoms (e.g., 'fever', 'cough').")

    @field_validator("description", mode="before")
    def ensure_list(cls, value):
        """Ensures the description is always a list."""
        if isinstance(value, str):
            return [value]  # Convert single string to list
        if isinstance(value, list):
            return value
        raise ValueError("Description must be a list of strings.")

class Medication(BaseModel):
    code: Optional[int] = Field(None, description="Code representing the medication.")
    description: Optional[str] = Field(None, description="Description of the medication (e.g., 'Hydrochlorothiazide 12.5 MG daily').")

class Immunization(BaseModel):
    code: Optional[int] = Field(None, description="Code representing the immunization.")
    description: Optional[str] = Field(None, description="Description of the immunization (e.g., 'Influenza vaccine').")
    immunization_date: Optional[date] = Field(None, description="Date the immunization was administered (ISO format).")

class VitalMeasurement(BaseModel):
    code: Optional[str] = Field(None, description="Code representing the vital sign (e.g., '8310-5' for Body temperature).")
    value: float = Field(..., description="Value of the measurement.")
    unit: str = Field(..., description="Unit of the measurement (e.g., 'Celsius').")

class BloodPressure(BaseModel):
    systolic: VitalMeasurement = Field(..., description="Systolic blood pressure measurement.")
    diastolic: VitalMeasurement = Field(..., description="Diastolic blood pressure measurement.")

class CurrentVitals(BaseModel):
    temperature: VitalMeasurement = Field(..., description="Patient's current temperature.")
    heart_rate: VitalMeasurement = Field(..., description="Patient's current heart rate.")
    blood_pressure: BloodPressure = Field(..., description="Patient's current blood pressure.")
    respiratory_rate: VitalMeasurement = Field(..., description="Patient's current respiratory rate.")
    oxygen_saturation: VitalMeasurement = Field(..., description="Patient's current oxygen saturation level.")
    weight: VitalMeasurement = Field(..., description="Patient's current weight.")

class BaselineVitals(BaseModel):
    vital_measurement_date: date = Field(..., description="Date of the baseline measurement (ISO format).")
    height: VitalMeasurement = Field(..., description="Patient's height at baseline.")
    weight: VitalMeasurement = Field(..., description="Patient's weight at baseline.")
    bmi: VitalMeasurement = Field(..., description="Patient's BMI at baseline.")
    bmi_percentile: VitalMeasurement = Field(..., description="Patient's BMI percentile at baseline.")

class Vitals(BaseModel):
    current: CurrentVitals = Field(..., description="Current vitals of the patient.")
    baseline: BaselineVitals = Field(..., description="Baseline vitals of the patient.")

class RespiratoryTest(BaseModel):
    code: str = Field(..., description="Code for the respiratory test (e.g., 'influenza_a').")
    result: str = Field(..., description="Result of the respiratory test (e.g., 'Negative').")

class RespiratoryPanel(BaseModel):
    influenza_a: RespiratoryTest = Field(..., description="Result for Influenza A RNA test.")
    influenza_b: RespiratoryTest = Field(..., description="Result for Influenza B RNA test.")
    rsv: RespiratoryTest = Field(..., description="Result for RSV RNA test.")
    parainfluenza_1: RespiratoryTest = Field(..., description="Result for Parainfluenza 1 RNA test.")
    parainfluenza_2: RespiratoryTest = Field(..., description="Result for Parainfluenza 2 RNA test.")
    parainfluenza_3: RespiratoryTest = Field(..., description="Result for Parainfluenza 3 RNA test.")
    rhinovirus: RespiratoryTest = Field(..., description="Result for Rhinovirus RNA test.")
    metapneumovirus: RespiratoryTest = Field(..., description="Result for Human Metapneumovirus RNA test.")
    adenovirus: RespiratoryTest = Field(..., description="Result for Adenovirus DNA test.")

class Covid19Test(BaseModel):
    code: str = Field(..., description="Code for the COVID-19 test.")
    description: str = Field(..., description="Description of the COVID-19 test.")
    result: str = Field(..., description="Result of the COVID-19 test (e.g., 'Positive').")

class Laboratory(BaseModel):
    covid19: Covid19Test = Field(..., description="COVID-19 test details.")
    respiratory_panel: RespiratoryPanel = Field(..., description="Details of the comprehensive respiratory panel.")

class Procedure(BaseModel):
    code: str = Field(..., description="Code representing the procedure.")
    description: str = Field(..., description="Description of the procedure.")
    procedure_date: date = Field(..., description="Date the procedure was performed.")
    reasonCode: str = Field(..., description="Code for the reason the procedure was performed.")
    reasonDescription: str = Field(..., description="Description of the reason the procedure was performed.")

class CarePlan(BaseModel):
    description: str = Field(..., description="Description of the care plan (e.g., 'Face mask and oxygen support provided').")
    start: Optional[date] = Field(None, description="Start date of the care plan.")
    stop: Optional[date] = Field(None, description="Stop date of the care plan (if applicable).")
    reasonCode: Optional[int] = Field(None, description="Code for the reason the care plan was initiated.")
    reasonDescription: Optional[str] = Field(None, description="Description of the reason the care plan was initiated.")

class PatientRecord(BaseModel):
    id: str = Field(..., description="Medical Record Number for the patient (e.g., '055ae6fc-7e18-4a39-8058-64082ca6d515').")
    encounter: Encounter = Field(..., description="Details of the patient's encounter.")
    symptoms: List[Symptom] = Field(..., description="List of symptoms reported by the patient.")
    demographics: Demographics = Field(..., description="Demographic information of the patient.")
    medications: List[Medication] = Field(..., description="List of medications prescribed to the patient.")
    immunizations: List[Immunization] = Field(..., description="List of immunizations received by the patient.")
    vitals: Vitals = Field(..., description="Current and baseline vitals of the patient.")
    laboratory: Laboratory = Field(..., description="Laboratory test results for the patient.")
    procedures: Optional[List[Procedure]] = Field(None, description="List of procedures performed on the patient.")
    care_plans: List[CarePlan] = Field(..., description="List of care plans for the patient.")

    @field_validator("care_plans")
    def validate_care_plans(cls, value):
        """Ensure that care_plans is not empty."""
        if not value:
            raise ValueError("care_plans must contain at least one entry.")
        return value
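
The `age` field above asks the model to derive the age from the date of birth and the date of service. Since this is plain date arithmetic, one option is to recompute it after extraction rather than rely on the model's answer. The helper below is a minimal sketch. The name `age_at_service` is an assumption and not part of the schema.

```python
from datetime import date

def age_at_service(date_of_birth: date, date_of_service: date) -> int:
    """Return the number of completed years between birth and the service date."""
    years = date_of_service.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the service year
    if (date_of_service.month, date_of_service.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

# Dates taken from the urgent care note shown earlier (DOB 1/9/2004, service March 2, 2020)
print(age_at_service(date(2004, 1, 9), date(2020, 3, 2)))  # prints 16
```

The recomputed value could then overwrite `demographics.age` on the parsed record, or be compared with it to flag extraction errors.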

In [None]:
import json
from dotenv import load_dotenv
import os

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
llm = ChatOpenAI(
  model_name="gpt-3.5-turbo",
  temperature=0,
  api_key=OPENAI_API_KEY
)

structured_llm = llm.with_structured_output(PatientRecord)
patient_record = structured_llm.invoke(query_text)

# parser = PydanticOutputParser(pydantic_object=PatientRecord)

# prompt = PromptTemplate(
#     template="Extract patient record information from the following text.\n{format_instructions}\nText:\n{query}\n",
#     input_variables=["query"],
#     partial_variables={"format_instructions": parser.get_format_instructions()},
# )

# chain = LLMChain(llm=llm, prompt=prompt)

# # Format the input and run the chain
# formatted_prompt = prompt.format(query=query_text)
# response = chain.run(query=query_text)

# # Parse the response into the Pydantic model
# patient_record = parser.parse(response)
# print(patient_record)

### Save to .json file

In [31]:
# Convert to a JSON string
patient_record_json = patient_record.model_dump_json(indent=4)
# Save the JSON string to a file
with open(file_name + ".json", "w") as json_file:
    json_file.write(patient_record_json)
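
For the later tabular analysis, it can help to flatten the nested JSON into single-level rows. This is a sketch under assumptions: the helper name `flatten_record` and the underscore separator are illustrative choices, list fields are left untouched, and the toy record only mirrors a few fields of the schema.

```python
def flatten_record(record: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts into one level with joined keys; lists are kept as-is."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects such as vitals -> current -> temperature
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Toy record mirroring part of the PatientRecord structure
toy_record = {
    "id": "199c586f-af16-4091-9998-ee4cfc02ee7a",
    "demographics": {"gender": "Female", "age": 16},
    "vitals": {"current": {"temperature": {"value": 40.7, "unit": "Celsius"}}},
    "medications": [{"code": None, "description": "Jolivette"}],
}

flat = flatten_record(toy_record)
print(sorted(flat))
```

A list of such flat dicts could be loaded into a pandas DataFrame and written with `DataFrame.to_parquet`, which needs a Parquet engine such as pyarrow installed. Parquet and Spark also support nested columns, so flattening is a convenience rather than a requirement.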

In [46]:
import pandas as pd
from sentence_transformers import SentenceTransformer
file_path = "data/medications_assignment_1.csv"
medications_df = pd.read_csv(file_path)
codes = medications_df['CODE'].tolist()
medications = medications_df['DESCRIPTION'].tolist()
model = SentenceTransformer('all-MiniLM-L6-v2')
medication_embeddings = model.encode(medications, convert_to_numpy=True)


In [47]:
import faiss
import numpy as np

# Create a FAISS index
dimension = medication_embeddings.shape[1]  # Dimensionality of the embeddings
index = faiss.IndexFlatL2(dimension)
index.add(medication_embeddings)

# Save the index and codes mapping for later use
faiss.write_index(index, "medication_index.faiss")
with open("codes_mapping.txt", "w") as f:
    for code in codes:
        f.write(f"{code}\n")

In [None]:
# Load the patient data JSON
with open("199c586f-af16-4091-9998-ee4cfc02ee7a.txt.json", "r") as f:
    patient_data = json.load(f)

# Extract the medications that have a description, keeping each one's position
# in the original list so the matched code is written back to the right entry
described_meds = [
    (pos, med['description'])
    for pos, med in enumerate(patient_data.get('medications', []))
    if med.get('description') is not None
]
json_medications = [description for _, description in described_meds]

# Load the FAISS index and codes mapping
index = faiss.read_index("medication_index.faiss")
with open("codes_mapping.txt", "r") as f:
    codes = [line.strip() for line in f]

# Load the embedding model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Generate embeddings for medications in the JSON
json_medication_embeddings = model.encode(json_medications, convert_to_numpy=True)

# Find the closest match for each medication in the JSON
for (pos, _), embedding in zip(described_meds, json_medication_embeddings):
    distances, indices = index.search(np.array([embedding]), 1)  # Find the closest match
    closest_index = indices[0][0]
    # Write the code to the entry the embedding came from, not to its position in the filtered list
    patient_data['medications'][pos]['code'] = codes[closest_index]

# Save the updated JSON
with open("199c586f-af16-4091-9998-ee4cfc02ee7a(updated).json", "w") as f:
    json.dump(patient_data, f, indent=4)

[[-4.60334830e-02  4.19001430e-02 -2.87848376e-02  8.63825232e-02
  -1.64056402e-02  1.44396825e-02  3.40992846e-02  1.22785665e-01
   6.51678890e-02 -6.88969716e-02  2.87695718e-03  2.60263421e-02
  -3.56357545e-02 -2.04211213e-02  3.19922678e-02  7.54131703e-03
   7.43771046e-02  5.99673875e-02 -1.31955324e-02  8.53589773e-02
   5.02683111e-02  1.02522913e-02  1.01783304e-02 -4.21547964e-02
   4.59218994e-02 -2.34145839e-02 -1.49428565e-02  2.52934862e-02
   2.91561503e-02  8.29499273e-04  6.72279522e-02  3.86803038e-02
  -2.37191431e-02 -1.14343770e-01 -5.80655821e-02 -3.96173745e-02
  -4.76787761e-02 -6.74951682e-03 -2.07032561e-02  1.90995317e-02
   2.13018972e-02 -1.83223058e-02 -7.12340325e-02 -5.30714840e-02
   3.18807065e-02  4.23531421e-03 -8.34204555e-02  3.96485813e-02
  -6.16532378e-02  8.61814469e-02 -1.07873939e-01 -2.07861602e-01
   2.77951453e-02  5.48124835e-02 -4.55260351e-02 -1.20067254e-01
  -3.66952680e-02  2.00975370e-02  2.31909174e-02  5.79630695e-02
  -3.28474