## Final Assignment Overview: Working with Patient Records and Encounter Notes

In this final assignment, we’ll focus on patient records related to COVID-19 encounters. Our task is to analyze, process, and transform the data while applying the concepts we’ve covered throughout this course. Here's a detailed breakdown of the assignment:

What Are Encounter Notes?
An encounter note is a record that captures details about a patient’s visit with a doctor. It includes both structured and semi-structured information that is crucial for understanding the context of the visit. Here’s what an encounter note typically looks like:

```
AMBULATORY ENCOUNTER NOTE
Date of Service: March 2, 2020 15:45-16:30

DEMOGRAPHICS:
Name: Jeffrey Greenfelder
DOB: 1/16/2005
Gender: Male
Address: 428 Wiza Glen Unit 91, Springfield, Massachusetts 01104
Insurance: Guardian
MRN: 055ae6fc-7e18-4a39-8058-64082ca6d515

PERTINENT MEDICAL HISTORY:
- Obesity 

Recent Visit: Well child visit (2/23/2020)
Immunizations: Influenza vaccine (2/23/2020)

Recent Baseline (2/23/2020):
Height: 155.0 cm
Weight: 81.2 kg
BMI: 33.8 kg/m² (99.1th percentile)
BP: 123/80 mmHg
HR: 92/min
RR: 13/min

SUBJECTIVE:
Adolescent patient presents with multiple symptoms including:
- Cough
- Sore throat
- Severe fatigue
- Muscle pain
- Joint pain
- Fever
Never smoker. Symptoms began recently.

OBJECTIVE:
Vitals:
Temperature: 39.3°C (102.7°F)
Heart Rate: 131.1/min
Blood Pressure: 120/73 mmHg
Respiratory Rate: 27.6/min
O2 Saturation: 75.8% on room air
Weight: 81.2 kg

Laboratory/Testing:
Comprehensive Respiratory Panel:
- Influenza A RNA: Negative
- Influenza B RNA: Negative
- RSV RNA: Negative
- Parainfluenza virus 1,2,3 RNA: Negative
- Rhinovirus RNA: Negative
- Human metapneumovirus RNA: Negative
- Adenovirus DNA: Negative
- SARS-CoV-2 RNA: Positive

ASSESSMENT:
1. Suspected COVID-19 with severe symptoms
2. Severe hypoxemia requiring immediate intervention
3. Tachycardia (HR 131)
4. High-grade fever
5. Risk factors:
   - Obesity (BMI 33.8)
   - Adolescent age

PLAN:
1. Face mask provided for immediate oxygen support
2. Infectious disease care plan initiated
3. Close monitoring required due to:
   - Severe hypoxemia
   - Tachycardia
   - Age and obesity risk factors
4. Parent/patient education on:
   - Home isolation protocols
   - Warning signs requiring emergency care
   - Return precautions
5. Follow-up plan:
   - Daily monitoring during acute phase
   - Virtual check-ins as needed

Encounter Duration: 45 minutes
Encounter Type: Ambulatory
Provider: ID# e2c226c2-3e1e-3d0b-b997-ce9544c10528
Facility: 5103c940-0c08-392f-95cd-446e0cea042a
```


The encounter note contains:

* General encounter information: 

  * When the encounter took place: Date and time of the visit.
  * Demographics: Patient’s age, gender, and unique medical record identifier.
  * Encounter details: The reason for the visit, diagnosis, and any associated costs.


* Semi-Structured Notes:

These notes mirror how doctors organize their thoughts and observations during an encounter. They generally follow a SOAP format:

* Subjective: The patient’s subjective description of their symptoms, feelings, and medical concerns.
* Objective: The doctor’s objective findings, including test results, measurements, or physical examination outcomes.
* Assessment: The doctor’s evaluation or diagnosis based on subjective and objective information.
* Plan: The proposed treatment plan, including medications, follow-ups, or other interventions.

While some encounter notes might include additional details, the majority conform to this semi-structured format, making them ideal for analysis and transformation.
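
Because the section headers are fixed, upper-case labels on their own line, a note can be split into its SOAP sections with a regular expression before any LLM call. The following is a minimal sketch, assuming headers look like `SUBJECTIVE:` as in the sample note; `split_soap_sections` and `SOAP_HEADERS` are hypothetical names, not part of the assignment code.

```python
import re

# Section headers as they appear in the sample note (an assumption for other notes).
SOAP_HEADERS = ("SUBJECTIVE", "OBJECTIVE", "ASSESSMENT", "PLAN")

def split_soap_sections(note_text):
    """Return a dict mapping each SOAP header found in the note to its body text."""
    pattern = re.compile(r"^(%s):[ \t]*$" % "|".join(SOAP_HEADERS), re.MULTILINE)
    matches = list(pattern.finditer(note_text))
    sections = {}
    for i, match in enumerate(matches):
        start = match.end()
        # A section runs until the next SOAP header, or to the end of the note.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note_text)
        sections[match.group(1).lower()] = note_text[start:end].strip()
    return sections

sample = """SUBJECTIVE:
- Cough
- Fever

OBJECTIVE:
Temperature: 39.3°C

ASSESSMENT:
1. Suspected COVID-19

PLAN:
1. Home isolation
"""

sections = split_soap_sections(sample)
print(sections["subjective"])
```

In the full sample note, the trailing lines after the plan (encounter duration, provider, facility) would end up inside the `plan` section, so they would need separate handling.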

* Goals for the Assignment

1. Transforming Encounter Notes:

Use an LLM to convert semi-structured encounter notes into a JSON format that organizes the information into structured fields. The JSON will include details such as demographics, encounter specifics, and the SOAP components of the note. You will then transform the JSON data into a Parquet file, which is suitable both for analysis in Spark and for later storage.
Here we will use ML classification to assign the objective and assessment semi-structured fields to standardized, structured fields. The medical taxonomy for this task will be the one provided by the CDC, which defines standard codes for diagnoses, symptoms, procedures, and treatments. This step ensures the structured data aligns with domain-wide medical standards, making it interoperable and ready for deeper analysis.

The JSON format should capture the hierarchies described in the structure below.




2. Basic Analytics and Visualizations:
Using Apache Spark, perform comprehensive data analysis on the encounter data and create visualizations that reveal meaningful patterns. Your analysis must include:
- COVID-19 Case Demographics: Case breakdown by age ranges ([0-5], [6-10], [11-17], [18-30], [31-50], [51-70], [71+])
- Cumulative count of COVID-19 cases between the earliest and the last case observed in the dataset
- Symptoms for all COVID-19 patients versus patients who were admitted to the intensive care unit due to COVID-19.
- Rank medications by frequency of prescription
- Analyze medication patterns across different demographic groups (e.g., top 3 per age group)
- Identify and plot co-morbidity information from the patient records (e.g., hypertension, obesity, prediabetes, etc.) provided in the dataset. 
- An independent group analysis: You need to develop and execute THREE original analyses that provide meaningful insights about COVID-19 patterns in this dataset. For each analysis:
  - Clearly state your analytical question/hypothesis
  - Justify why this analysis is valuable
  - Show your Spark code and methodology
  - Present results with appropriate visualizations
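
The age ranges listed for the case demographics can be pinned down with a small helper before being translated to Spark (for example as a chain of conditional expressions). This is a minimal sketch in plain Python, assuming ages are whole years; `AGE_BOUNDS` and `age_range` are hypothetical names and the sample ages are made up for illustration.

```python
from collections import Counter

# Inclusive upper bound of each range, paired with its label.
AGE_BOUNDS = [
    (5, "0-5"),
    (10, "6-10"),
    (17, "11-17"),
    (30, "18-30"),
    (50, "31-50"),
    (70, "51-70"),
]

def age_range(age):
    """Map an age in whole years to the assignment's age-range label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for upper, label in AGE_BOUNDS:
        if age <= upper:
            return label
    return "71+"

# Illustrative ages only; the real values come from the encounter data.
ages = [15, 3, 42, 71, 17, 18, 70]
counts = Counter(age_range(a) for a in ages)
print(counts)
```

Keeping the boundaries in one list makes it easy to check the edge cases (5 vs 6, 17 vs 18, 70 vs 71) before the same logic is applied to the full dataset.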


In [None]:
"""
EncounterType:
    code
    description

Encounter:
    id
    date
    time
    type: EncounterType
    provider_id
    facility_id

Address:
    city
    state

Demographics:
    id
    name
    date_of_birth
    age
    gender
    address: Address
    insurance

Condition:
    code
    description

Medication:
    code
    description

Immunization:
    code
    description
    date: date

VitalMeasurement:
    code
    value: float
    unit

BloodPressure:
    systolic: VitalMeasurement
    diastolic: VitalMeasurement

CurrentVitals:
    temperature: VitalMeasurement
    heart_rate: VitalMeasurement
    blood_pressure: BloodPressure
    respiratory_rate: VitalMeasurement
    oxygen_saturation: VitalMeasurement
    weight: VitalMeasurement

BaselineVitals:
    date: date
    height: VitalMeasurement
    weight: VitalMeasurement
    bmi: VitalMeasurement
    bmi_percentile: VitalMeasurement

Vitals:
    current: CurrentVitals
    baseline: BaselineVitals

RespiratoryTest:
    code
    result

RespiratoryPanel:
    influenza_a: RespiratoryTest
    influenza_b: RespiratoryTest
    rsv: RespiratoryTest
    parainfluenza_1: RespiratoryTest
    parainfluenza_2: RespiratoryTest
    parainfluenza_3: RespiratoryTest
    rhinovirus: RespiratoryTest
    metapneumovirus: RespiratoryTest
    adenovirus: RespiratoryTest

Covid19Test:
    code
    description
    result

Laboratory:
    covid19: Covid19Test
    respiratory_panel: RespiratoryPanel

Procedure:
    code
    description
    date: date
    reasonCode
    reasonDescription

CarePlan:
    id
    code
    description
    start: date
    stop: date
    reasonCode
    reasonDescription

PatientRecord:
    encounter: Encounter
    demographics: Demographics
    conditions: List[Condition]
    medications: List[Medication]
    immunizations: List[Immunization]
    vitals: Vitals
    laboratory: Laboratory
    procedures: List[Procedure]
"""

In [3]:
!pip install langchain openai pydantic faiss-cpu sentence-transformers pandas



In [4]:
!pip install langchain-community langchain-core



In [5]:
!pip install -qU langchain-openai

In [6]:
!pip install --upgrade typing_extensions



In [7]:
!pip install tiktoken



In [8]:
import os

In [9]:
import json

In [10]:
import pandas as pd

In [11]:
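# ChatOpenAI now lives in the langchain-openai package installed above;
# the langchain_community import path is deprecated.
from langchain_openai import ChatOpenAI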

In [12]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

In [13]:
# Read the key from the environment; never commit a literal key to the notebook
assert os.environ.get('OPENAI_API_KEY'), 'Set OPENAI_API_KEY before running this notebook'

In [14]:
from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import date, time

In [15]:
class EncounterType(BaseModel):
    code: str
    description: str

class Encounter(BaseModel):
    id: str
    date: date
    time: time
    type: EncounterType
    provider_id: str
    facility_id: str

class Address(BaseModel):
    city: str
    state: str

class Demographics(BaseModel):
    id: str
    name: str
    date_of_birth: date
    age: int
    gender: str
    address: Address
    insurance: str

class Condition(BaseModel):
    code: str
    description: str

class Medication(BaseModel):
    code: str
    description: str

class Immunization(BaseModel):
    code: str
    description: str
    date: date

class VitalMeasurement(BaseModel):
    code: str
    value: float
    unit: str

class BloodPressure(BaseModel):
    systolic: VitalMeasurement
    diastolic: VitalMeasurement

class CurrentVitals(BaseModel):
    temperature: VitalMeasurement
    heart_rate: VitalMeasurement
    blood_pressure: BloodPressure
    respiratory_rate: VitalMeasurement
    oxygen_saturation: VitalMeasurement
    weight: VitalMeasurement

class BaselineVitals(BaseModel):
    date: date
    height: VitalMeasurement
    weight: VitalMeasurement
    bmi: VitalMeasurement
    bmi_percentile: VitalMeasurement

class Vitals(BaseModel):
    current: CurrentVitals
    baseline: BaselineVitals

class RespiratoryTest(BaseModel):
    code: str
    result: str

class RespiratoryPanel(BaseModel):
    influenza_a: RespiratoryTest
    influenza_b: RespiratoryTest
    rsv: RespiratoryTest
    parainfluenza_1: RespiratoryTest
    parainfluenza_2: RespiratoryTest
    parainfluenza_3: RespiratoryTest
    rhinovirus: RespiratoryTest
    metapneumovirus: RespiratoryTest
    adenovirus: RespiratoryTest

class Covid19Test(BaseModel):
    code: str
    description: str
    result: str

class Laboratory(BaseModel):
    covid19: Covid19Test
    respiratory_panel: RespiratoryPanel

class Procedure(BaseModel):
    code: str
    description: str
    date: date
    reasonCode: str
    reasonDescription: str

class CarePlan(BaseModel):
    id: str
    code: str
    description: str
    start: date
    stop: Optional[date]
    reasonCode: str
    reasonDescription: str

class PatientRecord(BaseModel):
    encounter: Encounter
    demographics: Demographics
    conditions: List[Condition]
    medications: List[Medication]
    immunizations: List[Immunization]
    vitals: Vitals
    laboratory: Laboratory
    procedures: List[Procedure]
    care_plan: Optional[List[CarePlan]] = None
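
The `Encounter` model expects separate `date` and `time` values and `Demographics` expects an `age`, while the sample note gives `March 2, 2020 15:45-16:30` and a DOB of `1/16/2005`. The sketch below derives those values with the standard library, assuming every note uses these two formats and English month names; `parse_date_of_service` and `age_on` are hypothetical helpers.

```python
from datetime import datetime

def parse_date_of_service(text):
    """Split 'March 2, 2020 15:45-16:30' into (date, start time, end time)."""
    day_part, time_range = text.rsplit(" ", 1)
    start_text, end_text = time_range.split("-")
    service_date = datetime.strptime(day_part, "%B %d, %Y").date()
    start = datetime.strptime(start_text, "%H:%M").time()
    end = datetime.strptime(end_text, "%H:%M").time()
    return service_date, start, end

def age_on(date_of_birth, on_date):
    """Age in completed years on the given date."""
    had_birthday = (on_date.month, on_date.day) >= (date_of_birth.month, date_of_birth.day)
    return on_date.year - date_of_birth.year - (0 if had_birthday else 1)

service_date, start, end = parse_date_of_service("March 2, 2020 15:45-16:30")
dob = datetime.strptime("1/16/2005", "%m/%d/%Y").date()
age = age_on(dob, service_date)
print(service_date, start, end, age)
```

Computing the age relative to the date of service, rather than today's date, keeps the value consistent with what the clinician saw at the encounter.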

In [17]:
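# Current import locations; the top-level `langchain`, `langchain.llms`,
# and `langchain.callbacks` paths are deprecated.
from langchain_core.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_openai import OpenAI
from langchain_community.callbacks import get_openai_callback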

In [19]:
encounter_note_text = """AMBULATORY ENCOUNTER NOTE
Date of Service: March 2, 2020 15:45-16:30

DEMOGRAPHICS:
Name: Jeffrey Greenfelder
DOB: 1/16/2005
Gender: Male
Address: 428 Wiza Glen Unit 91, Springfield, Massachusetts 01104
Insurance: Guardian
MRN: 055ae6fc-7e18-4a39-8058-64082ca6d515

PERTINENT MEDICAL HISTORY:
- Obesity 

Recent Visit: Well child visit (2/23/2020)
Immunizations: Influenza vaccine (2/23/2020)

Recent Baseline (2/23/2020):
Height: 155.0 cm
Weight: 81.2 kg
BMI: 33.8 kg/m² (99.1th percentile)
BP: 123/80 mmHg
HR: 92/min
RR: 13/min

SUBJECTIVE:
Adolescent patient presents with multiple symptoms including:
- Cough
- Sore throat
- Severe fatigue
- Muscle pain
- Joint pain
- Fever
Never smoker. Symptoms began recently.

OBJECTIVE:
Vitals:
Temperature: 39.3°C (102.7°F)
Heart Rate: 131.1/min
Blood Pressure: 120/73 mmHg
Respiratory Rate: 27.6/min
O2 Saturation: 75.8% on room air
Weight: 81.2 kg

Laboratory/Testing:
Comprehensive Respiratory Panel:
- Influenza A RNA: Negative
- Influenza B RNA: Negative
- RSV RNA: Negative
- Parainfluenza virus 1,2,3 RNA: Negative
- Rhinovirus RNA: Negative
- Human metapneumovirus RNA: Negative
- Adenovirus DNA: Negative
- SARS-CoV-2 RNA: Positive

ASSESSMENT:
1. Suspected COVID-19 with severe symptoms
2. Severe hypoxemia requiring immediate intervention
3. Tachycardia (HR 131)
4. High-grade fever
5. Risk factors:
   - Obesity (BMI 33.8)
   - Adolescent age

PLAN:
1. Face mask provided for immediate oxygen support
2. Infectious disease care plan initiated
3. Close monitoring required due to:
   - Severe hypoxemia
   - Tachycardia
   - Age and obesity risk factors
4. Parent/patient education on:
   - Home isolation protocols
   - Warning signs requiring emergency care
   - Return precautions
5. Follow-up plan:
   - Daily monitoring during acute phase
   - Virtual check-ins as needed

Encounter Duration: 45 minutes
Encounter Type: Ambulatory
Provider: ID# e2c226c2-3e1e-3d0b-b997-ce9544c10528
Facility: 5103c940-0c08-392f-95cd-446e0cea042a"""


In [20]:
# things to load into prompt
schema_description = """
PatientRecord:
    encounter: Encounter
    demographics: Demographics
    conditions: List[Condition]
    medications: List[Medication]
    immunizations: List[Immunization]
    vitals: Vitals
    laboratory: Laboratory
    procedures: List[Procedure]
    care_plan: Optional[List[CarePlan]]
"""

In [21]:
prompt_template = """
Please return the following information in a strictly valid JSON object with no additional text or formatting. The JSON must be complete and well-formed. If you are unsure of any value, leave it as an empty string.

{schema}

Encounter Note:
{encounter_note}

Provide the extracted information in JSON format according to the schema. Leave the attributes named "code" as empty strings.
"""

In [22]:
prompt = prompt_template.format(
    encounter_note=encounter_note_text,
    schema=schema_description
)
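
`str.format` treats every `{` and `}` in a template as a placeholder delimiter. The template above contains no literal braces, but one that embeds a JSON example would need its braces doubled. A small illustration with a hypothetical template (not the one used in this notebook):

```python
# Doubled braces {{ }} are emitted as literal { } by str.format;
# single braces mark placeholders such as {encounter_note}.
template = """Return JSON shaped like {{"code": "", "description": "..."}}.

Encounter Note:
{encounter_note}
"""

filled = template.format(encounter_note="Name: Jane Doe")
print(filled)
```

Without the doubling, `format` would try to interpret the JSON example as a placeholder and raise an error instead of producing the prompt.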

In [23]:

model = ChatOpenAI(model="gpt-4o-mini", temperature=0.0, max_tokens=3000)


In [24]:
assistant_message = model.invoke(prompt)

In [25]:
import tiktoken
# cl100k_base gives an approximate count; gpt-4o models are tokenized with o200k_base
tokenizer = tiktoken.get_encoding("cl100k_base")
prompt_tokens = len(tokenizer.encode(prompt))
prompt_tokens
# If prompt_tokens is very high, you'll have fewer tokens left for the response.

798

In [26]:
print(assistant_message.content)

{
  "PatientRecord": {
    "encounter": {
      "date_of_service": "March 2, 2020 15:45-16:30",
      "duration": "45 minutes",
      "type": "Ambulatory",
      "provider_id": "e2c226c2-3e1e-3d0b-b997-ce9544c10528",
      "facility_id": "5103c940-0c08-392f-95cd-446e0cea042a"
    },
    "demographics": {
      "name": "Jeffrey Greenfelder",
      "dob": "1/16/2005",
      "gender": "Male",
      "address": "428 Wiza Glen Unit 91, Springfield, Massachusetts 01104",
      "insurance": "Guardian",
      "mrn": "055ae6fc-7e18-4a39-8058-64082ca6d515"
    },
    "conditions": [
      {
        "code": "",
        "description": "Obesity"
      },
      {
        "code": "",
        "description": "Suspected COVID-19 with severe symptoms"
      },
      {
        "code": "",
        "description": "Severe hypoxemia requiring immediate intervention"
      },
      {
        "code": "",
        "description": "Tachycardia (HR 131)"
      },
      {
        "code": "",
        "description": "Hi

In [27]:
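def parse_ai_response(response_text):
    """Parse the model's reply as JSON; return {} if it cannot be parsed."""
    text = response_text.strip()
    fence = "`" * 3  # Markdown code fence the model sometimes wraps around JSON
    if text.startswith(fence):
        # Drop the opening fence line and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(fence, 1)[0]
    try:
        data = json.loads(text)
    except json.JSONDecodeError as e:
        # Handle parsing errors, e.g. a reply cut off by max_tokens
        print("Error parsing JSON:", e)
        data = {}
    return data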
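
A reply can be valid JSON and still miss the schema, for example by wrapping everything in an extra top-level key or by omitting sections. The sketch below is one way to check the parsed dict against the required `PatientRecord` fields; `EXPECTED_KEYS` and `missing_sections` are hypothetical names, and the `wrapped` dict is a made-up stand-in for a parsed reply.

```python
# Required top-level sections of PatientRecord (care_plan is optional).
EXPECTED_KEYS = [
    "encounter", "demographics", "conditions", "medications",
    "immunizations", "vitals", "laboratory", "procedures",
]

def missing_sections(data):
    """Return the expected top-level keys that are absent from the parsed reply."""
    return [key for key in EXPECTED_KEYS if key not in data]

# Stand-in for a reply that nests its sections under an extra key.
wrapped = {"PatientRecord": {"encounter": {}, "demographics": {}}}

top_level_missing = missing_sections(wrapped)
nested_missing = missing_sections(wrapped["PatientRecord"])
print(top_level_missing)
print(nested_missing)
```

When every expected key is missing at the top level but present one level down, the reply is wrapped rather than empty, which is a cue to unwrap it before further processing.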


In [28]:
extracted_data = parse_ai_response(assistant_message.content)

In [29]:
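# 'extracted_data' is the parsed JSON from the AI response. The model nested every
# section under a top-level 'PatientRecord' key, so look inside it when present.
record = extracted_data.get('PatientRecord', extracted_data)
medications = record.get('medications', [])
encounter_type = record.get('encounter', {}).get('type', {})
# The model may return the encounter type as a plain string; wrap it in a dict
if isinstance(encounter_type, str):
    encounter_type = {'description': encounter_type}
immunizations = record.get('immunizations', [])
conditions = record.get('conditions', [])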

In [30]:
# Load medications data
medications_df = pd.read_csv('data/medications_assignment_1.csv')

# Load encounter types data
encounter_types_df = pd.read_csv('data/encounters_types_assignment_1.csv')

# Load immunizations data
immunizations_df = pd.read_csv('data/immunizations_assignment_1.csv')

# Load conditions data
# conditions_df = pd.read_csv('data/observations_assignment_1.csv')
conditions_df = pd.read_csv('data/observations_assignment_1.csv', header=None, names=['CODE', 'CONDITION'])

In [31]:
# extract from json
# Medications
medication_descriptions = medications_df['DESCRIPTION'].tolist()
medication_codes = medications_df['CODE'].tolist()

# Encounter Types
encounter_type_descriptions = encounter_types_df['DESCRIPTION'].tolist()
encounter_type_codes = encounter_types_df['CODE'].tolist()

# Immunizations
immunization_descriptions = immunizations_df['DESCRIPTION'].tolist()
immunization_codes = immunizations_df['CODE'].tolist()

# Conditions
condition_descriptions = conditions_df['CONDITION'].tolist()
condition_codes = conditions_df['CODE'].tolist()

In [32]:
# Initialize the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [33]:
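def build_faiss_index(descriptions):
    """Embed the reference descriptions and store them in an exact L2 index."""
    embeddings = model.encode(descriptions)
    # FAISS expects a float32 matrix of shape (n_descriptions, dim)
    embeddings = np.array(embeddings).astype('float32')
    # IndexFlatL2 performs exact search and returns squared L2 distances
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return index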
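
`faiss.IndexFlatL2` reports squared Euclidean distances. For unit-length vectors, squared distance and cosine similarity are related by d² = 2 − 2·cos, so a distance threshold can be read as a similarity cut-off. The sketch below checks that identity in plain Python on toy vectors; it assumes the embeddings are normalized to length 1, and the helper names are hypothetical.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_unit(a, b):
    """Cosine similarity; a plain dot product is enough for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

d2 = squared_l2(a, b)
cos = cosine_unit(a, b)
print(d2, 2 - 2 * cos)  # the two values agree up to rounding

# Under the same assumption, a squared-distance threshold t maps to cosine 1 - t / 2.
threshold_as_cosine = 1 - 0.5 / 2
print(threshold_as_cosine)
```

Read this way, a threshold of 0.5 accepts matches whose cosine similarity is at least 0.75, which helps when deciding whether the cut-off is too strict for short clinical phrases.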

In [34]:
# Build indices
medications_index = build_faiss_index(medication_descriptions)
encounter_types_index = build_faiss_index(encounter_type_descriptions)
immunizations_index = build_faiss_index(immunization_descriptions)
conditions_index = build_faiss_index(condition_descriptions)

In [35]:
def match_entities(extracted_entities, index, reference_descriptions, reference_codes, top_k=1, threshold=0.5):
    matched_entities = []
    for entity in extracted_entities:
        # The extracted JSON uses lowercase keys ('description'), unlike the CSV headers
        description = entity.get('description', '')
        if not description:
            continue

        # Embed the entity description
        entity_embedding = model.encode([description]).astype('float32')

        # Search the index
        distances, indices = index.search(entity_embedding, top_k)

        # Check if the best match is within the threshold
        if distances[0][0] <= threshold:
            matched_idx = indices[0][0]
            matched_entity = {
                'code': reference_codes[matched_idx],
                'description': reference_descriptions[matched_idx]
            }
        else:
            # No good match found
            matched_entity = {
                'code': None,
                'description': description
            }
        matched_entities.append(matched_entity)
    return matched_entities

In [36]:
def match_single_entity(entity, index, reference_descriptions, reference_codes, threshold=0.5):
    # The extracted JSON uses lowercase keys ('description'), unlike the CSV headers
    description = entity.get('description', '')
    if not description:
        return entity  # Return as is if no description

    # Embed the description
    entity_embedding = model.encode([description]).astype('float32')

    # Search the index
    distances, indices = index.search(entity_embedding, 1)

    # Check if the best match is within the threshold
    if distances[0][0] <= threshold:
        matched_idx = indices[0][0]
        matched_entity = {
            'code': reference_codes[matched_idx],
            'description': reference_descriptions[matched_idx]
        }
    else:
        # No good match found
        matched_entity = {
            'code': None,
            'description': description
        }
    return matched_entity

In [37]:
# Update medications with matched codes
matched_medications = match_entities(
    medications,
    medications_index,
    medication_descriptions,
    medication_codes
)
extracted_data['medications'] = matched_medications

In [38]:
print(extracted_data.keys())

dict_keys(['PatientRecord', 'medications'])


In [39]:
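# Update encounter type with matched code
matched_encounter_type = match_single_entity(
    encounter_type,
    encounter_types_index,
    encounter_type_descriptions,
    encounter_type_codes
)
# The parsed reply nests its sections under 'PatientRecord' (see the keys printed
# above), so indexing extracted_data['encounter'] directly raises KeyError.
record = extracted_data.get('PatientRecord', extracted_data)
if matched_encounter_type:
    record.setdefault('encounter', {})['type'] = matched_encounter_type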