## Final Assignment Overview: Working with Patient Records and Encounter Notes

In this final assignment, we’ll focus on patient records related to COVID-19 encounters. Our task is to analyze, process, and transform the data while applying the concepts we’ve covered throughout this course. Here's a detailed breakdown of the assignment:

What Are Encounter Notes?
An encounter note is a record that captures details about a patient’s visit with a doctor. It includes both structured and semi-structured information that is crucial for understanding the context of the visit. Here’s what an encounter note typically looks like:

```
AMBULATORY ENCOUNTER NOTE
Date of Service: March 2, 2020 15:45-16:30

DEMOGRAPHICS:
Name: Jeffrey Greenfelder
DOB: 1/16/2005
Gender: Male
Address: 428 Wiza Glen Unit 91, Springfield, Massachusetts 01104
Insurance: Guardian
MRN: 055ae6fc-7e18-4a39-8058-64082ca6d515

PERTINENT MEDICAL HISTORY:
- Obesity

Recent Visit: Well child visit (2/23/2020)
Immunizations: Influenza vaccine (2/23/2020)

Recent Baseline (2/23/2020):
Height: 155.0 cm
Weight: 81.2 kg
BMI: 33.8 kg/m² (99.1th percentile)
BP: 123/80 mmHg
HR: 92/min
RR: 13/min

SUBJECTIVE:
Adolescent patient presents with multiple symptoms including:
- Cough
- Sore throat
- Severe fatigue
- Muscle pain
- Joint pain
- Fever
Never smoker. Symptoms began recently.

OBJECTIVE:
Vitals:
Temperature: 39.3°C (102.7°F)
Heart Rate: 131.1/min
Blood Pressure: 120/73 mmHg
Respiratory Rate: 27.6/min
O2 Saturation: 75.8% on room air
Weight: 81.2 kg

Laboratory/Testing:
Comprehensive Respiratory Panel:
- Influenza A RNA: Negative
- Influenza B RNA: Negative
- RSV RNA: Negative
- Parainfluenza virus 1,2,3 RNA: Negative
- Rhinovirus RNA: Negative
- Human metapneumovirus RNA: Negative
- Adenovirus DNA: Negative
- SARS-CoV-2 RNA: Positive

ASSESSMENT:
1. Suspected COVID-19 with severe symptoms
2. Severe hypoxemia requiring immediate intervention
3. Tachycardia (HR 131)
4. High-grade fever
5. Risk factors:
   - Obesity (BMI 33.8)
   - Adolescent age

PLAN:
1. Face mask provided for immediate oxygen support
2. Infectious disease care plan initiated
3. Close monitoring required due to:
   - Severe hypoxemia
   - Tachycardia
   - Age and obesity risk factors
4. Parent/patient education on:
   - Home isolation protocols
   - Warning signs requiring emergency care
   - Return precautions
5. Follow-up plan:
   - Daily monitoring during acute phase
   - Virtual check-ins as needed

Encounter Duration: 45 minutes
Encounter Type: Ambulatory
Provider: ID# e2c226c2-3e1e-3d0b-b997-ce9544c10528
Facility: 5103c940-0c08-392f-95cd-446e0cea042a
```


The encounter note contains:

* General encounter information:

  * When the encounter took place: Date and time of the visit.
  * Demographics: Patient’s age, gender, and unique medical record identifier.
  * Encounter details: The reason for the visit, diagnosis, and any associated costs.


* Semi-Structured Notes:

These notes mirror how doctors organize their thoughts and observations during an encounter. They generally follow a SOAP format:

* Subjective: The patient’s subjective description of their symptoms, feelings, and medical concerns.
* Objective: The doctor’s objective findings, including test results, measurements, or physical examination outcomes.
* Assessment: The doctor’s evaluation or diagnosis based on subjective and objective information.
* Plan: The proposed treatment plan, including medications, follow-ups, or other interventions.

While some encounter notes might include additional details, the majority conform to this semi-structured format, making them ideal for analysis and transformation.
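
Because the section headers are regular, a note can be pre-split before any model sees it. The sketch below is one possible way to do that with the standard library; the header list is an assumption taken from the sample note above, and `split_sections` is a hypothetical helper, not part of the assignment.

```python
import re

# Headers assumed from the sample note; other notes may use different ones.
SECTION_HEADERS = [
    "DEMOGRAPHICS", "PERTINENT MEDICAL HISTORY",
    "SUBJECTIVE", "OBJECTIVE", "ASSESSMENT", "PLAN",
]

def split_sections(note: str) -> dict:
    """Map each known header to the text between it and the next known header."""
    alternatives = "|".join(re.escape(h) for h in SECTION_HEADERS)
    pattern = re.compile(rf"^({alternatives}):[ \t]*$", re.MULTILINE)
    matches = list(pattern.finditer(note))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        sections[match.group(1)] = note[match.end():end].strip()
    return sections

sample = """SUBJECTIVE:
- Cough
- Fever

OBJECTIVE:
Temperature: 39.3°C

ASSESSMENT:
1. Suspected COVID-19

PLAN:
1. Home isolation
"""

sections = split_sections(sample)
print(sections["SUBJECTIVE"])
```

Text that follows the last known header (for example the encounter duration and provider lines) ends up in the final section, so a real pipeline would need to handle that trailer separately.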

* Goals for the Assignment

1. Transforming Encounter Notes:

Use an LLM to convert the semi-structured encounter notes into a JSON format that organizes the information into structured fields. The JSON will include details such as demographics, encounter specifics, and the SOAP components of the note. You will then transform the JSON data into a Parquet file, which is suitable both for analysis in Spark and for later storage.
Here we will use ML classification to assign the objective and assessment semi-structured fields to standardized, structured fields. The medical taxonomy for this task is the one provided by the CDC, which defines standard codes for diagnoses, symptoms, procedures, and treatments. This step ensures the structured data aligns with domain-wide medical standards, making it interoperable and ready for deeper analysis.

The JSON format should capture the hierarchies described in the structure below.




2. Basic Analytics and Visualizations:
Using Apache Spark, perform comprehensive data analysis on the encounter data and create visualizations that reveal meaningful patterns. Your analysis must include:
- COVID-19 Case Demographics: Case breakdown by age ranges ([0-5], [6-10], [11-17], [18-30], [31-50], [51-70], [71+])
- Cumulative COVID-19 case count between the earliest and the latest case observed in the dataset
- Symptoms for all COVID-19 patients versus patients who were admitted to the intensive care unit due to COVID-19
- Rank medications by frequency of prescription
- Analyze medication patterns across different demographic groups (e.g., top 3 per age group)
- Identify and plot co-morbidity information from the patient records (e.g., hypertension, obesity, prediabetes, etc.) provided in the dataset.
- An independent group analysis: You need to develop and execute THREE original analyses that provide meaningful insights about COVID-19 patterns in this dataset. For each analysis:
  - Clearly state your analytical question/hypothesis
  - Justify why this analysis is valuable
  - Show your Spark code and methodology
  - Present results with appropriate visualizations
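
The age ranges listed above can be expressed as a small lookup. This is only a sketch of the bucketing logic in plain Python, assuming ages are whole years; `age_bucket` is a hypothetical helper, and in Spark the same rule could be written with `when`/`otherwise` expressions or wrapped in a UDF.

```python
from bisect import bisect_left

# Inclusive upper bound of every range except the open-ended last one.
UPPER_BOUNDS = [5, 10, 17, 30, 50, 70]
LABELS = ["0-5", "6-10", "11-17", "18-30", "31-50", "51-70", "71+"]

def age_bucket(age: int) -> str:
    """Return the assignment's age-range label for a whole-year age."""
    if age < 0:
        raise ValueError(f"age must be non-negative, got {age}")
    # bisect_left finds the first bound that is >= age.
    return LABELS[bisect_left(UPPER_BOUNDS, age)]

for sample_age in (0, 5, 6, 17, 18, 70, 71):
    print(sample_age, age_bucket(sample_age))
```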


### Part 1 - Extracting information from 11 encounter notes

In [21]:
Patient_note = """
AMBULATORY ENCOUNTER NOTE
Date of Service: March 2, 2020 15:45-16:30

DEMOGRAPHICS:
Name: Jeffrey Greenfelder
DOB: 1/16/2005
Gender: Male
Address: 428 Wiza Glen Unit 91, Springfield, Massachusetts 01104
Insurance: Guardian
MRN: 055ae6fc-7e18-4a39-8058-64082ca6d515

PERTINENT MEDICAL HISTORY:
- Obesity

Recent Visit: Well child visit (2/23/2020)
Immunizations: Influenza vaccine (2/23/2020)

Recent Baseline (2/23/2020):
Height: 155.0 cm
Weight: 81.2 kg
BMI: 33.8 kg/m² (99.1th percentile)
BP: 123/80 mmHg
HR: 92/min
RR: 13/min

SUBJECTIVE:
Adolescent patient presents with multiple symptoms including:
- Cough
- Sore throat
- Severe fatigue
- Muscle pain
- Joint pain
- Fever
Never smoker. Symptoms began recently.

OBJECTIVE:
Vitals:
Temperature: 39.3°C (102.7°F)
Heart Rate: 131.1/min
Blood Pressure: 120/73 mmHg
Respiratory Rate: 27.6/min
O2 Saturation: 75.8% on room air
Weight: 81.2 kg

Laboratory/Testing:
Comprehensive Respiratory Panel:
- Influenza A RNA: Negative
- Influenza B RNA: Negative
- RSV RNA: Negative
- Parainfluenza virus 1,2,3 RNA: Negative
- Rhinovirus RNA: Negative
- Human metapneumovirus RNA: Negative
- Adenovirus DNA: Negative
- SARS-CoV-2 RNA: Positive

ASSESSMENT:
1. Suspected COVID-19 with severe symptoms
2. Severe hypoxemia requiring immediate intervention
3. Tachycardia (HR 131)
4. High-grade fever
5. Risk factors:
   - Obesity (BMI 33.8)
   - Adolescent age

PLAN:
1. Face mask provided for immediate oxygen support
2. Infectious disease care plan initiated
3. Close monitoring required due to:
   - Severe hypoxemia
   - Tachycardia
   - Age and obesity risk factors
4. Parent/patient education on:
   - Home isolation protocols
   - Warning signs requiring emergency care
   - Return precautions
5. Follow-up plan:
   - Daily monitoring during acute phase
   - Virtual check-ins as needed

Encounter Duration: 45 minutes
Encounter Type: Ambulatory
Provider: ID# e2c226c2-3e1e-3d0b-b997-ce9544c10528
Facility: 5103c940-0c08-392f-95cd-446e0cea042a
"""

A lot of the data is not structured. However, in healthcare, everything has a code (e.g., severe, minimal).

Break down symptoms for all patients vs. patients in ICU

Use a language model to structure the unstructured data
- Give everything a code (ex: same code for all people admitted to the ICU)
- Example structure is given below
- Use pydantic to build a class with inheritance that will mimic a structured json file

Define pydantic object --> pass to model with langchain

We want LangChain to return an answer shaped like the following example:

In [None]:
{
    "encounter" : {
        "date":
        "time":
        "provider_id":
        "facility_id":
    },
    "medications": [ {
        "code":
        "description":
    }, {}, {}]
}

However, the medication codes live in a separate file; the encounter notes contain only the medication names.

Search by similarity at the embedding level of medicine name

After searching for the codes relevant to the medicine, we pass them in as an input variable using LangChain. The idea is to give the LLM a subset of the database that may map to the medication descriptions:
- encode description and database
- use faiss to match description with most similar codes in database
- provide that subset to LLM through langchain (top k most likely matches)
- this will handle misspellings since the llm can find the most likely match to the subset out of the top k most likely
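
The retrieve-then-select idea in the list above can be tried out without any model or index. The sketch below is a toy stand-in, not the notebook's method: character trigram counts replace the sentence embeddings, and a brute-force cosine scan replaces FAISS. `trigram_vector`, `cosine`, and `top_k` are hypothetical names.

```python
import math
from collections import Counter

def trigram_vector(text: str) -> Counter:
    """Bag of character trigrams; a crude stand-in for a sentence embedding."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query: str, descriptions: list, k: int = 3) -> list:
    """Return the k (score, description) pairs most similar to the query."""
    query_vector = trigram_vector(query)
    scored = [(cosine(query_vector, trigram_vector(d)), d) for d in descriptions]
    return sorted(scored, reverse=True)[:k]

catalog = [
    "PACLitaxel 100 MG Injection",
    "Natazia 28 Day Pack",
    "1 ML medroxyPROGESTERone acetate 150 MG/ML Injection",
]

# The query is misspelled on purpose to mimic a noisy encounter note.
matches = top_k("medroxyprogestrone", catalog, k=2)
for score, description in matches:
    print(f"{score:.2f}  {description}")
```

The shortlist, rather than the whole catalog, is what would be handed to the LLM for the final choice.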

The database has a CODE, DESCRIPTION structure. The first prompt extracts the medication names from the encounter note. The second prompt compares each description from the encounter note against the database of medication names and codes.

#### 1st deliverable: come up with a model that allows me to take all the information --> pass it to the LLM --> return a completed object

extract 2 or 3 things

In [3]:
%%capture
!pip install langchain -U
!pip install langchain-openai -U
!pip install sentence_transformers
!pip install faiss-cpu
!pip install --force-reinstall -v openai==1.55.3
!pip install --upgrade "httpx<0.28"

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [11]:
!cp -r /content/drive/MyDrive/assignment3_data /content/data

### Let's try to parse just the medication and demographic information for now

In [5]:
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI()

In [6]:
from pydantic import BaseModel, Field
from typing import List

class Address(BaseModel):
    city: str = Field(description=" the city where the patient lives. Should be under DEMOGRAPHICS header")
    state: str = Field(description=" the state where the patient lives. Should be under DEMOGRAPHICS header")

class Demographics(BaseModel):
    name: str = Field(description=" the name of the patient")
    date_of_birth: str = Field(description=" the date of birth of the patient")
    age: int = Field(description=" the age of the patient")
    gender: str = Field(description=" the gender of the patient")
    address: Address = Field(description=" the address of the patient")
    insurance: str = Field(description=" the insurance of the patient")

class Medication(BaseModel):
    code: str = Field(description=" the code of the medication")
    description: str = Field(description=" the description of the medication")

class PatientRecord(BaseModel):
    demographics: Demographics
    medications: List[Medication]

In [7]:
from langchain.output_parsers import PydanticOutputParser

In [126]:
Patient_Record = PydanticOutputParser(pydantic_object=PatientRecord)

In [20]:
structured_model = chat_model.with_structured_output(PatientRecord)  # method="json_mode" would instead require the prompt itself to ask for JSON output

In [22]:
parser_object = structured_model.invoke(Patient_note)

In [23]:
parser_object

PatientRecord(demographics=Demographics(name='Jeffrey Greenfelder', date_of_birth='1/16/2005', age=15, gender='Male', address=Address(city='Springfield', state='Massachusetts'), insurance='Guardian'), medications=[Medication(code='N/A', description='COVID-19 treatment plan with oxygen support')])
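
The note contains a date of birth but no explicit age, so the `age=15` above was inferred by the model. A deterministic cross-check is cheap. This is a sketch assuming the date formats of the sample note and English month names; `age_at` is a hypothetical helper.

```python
from datetime import date, datetime

def age_at(dob: date, on: date) -> int:
    """Whole years elapsed between a date of birth and a reference date."""
    # Subtract one year if the birthday has not yet occurred in the reference year.
    return on.year - dob.year - ((on.month, on.day) < (dob.month, dob.day))

# Formats assumed from the sample note: "1/16/2005" and "March 2, 2020".
dob = datetime.strptime("1/16/2005", "%m/%d/%Y").date()
service_date = datetime.strptime("March 2, 2020", "%B %d, %Y").date()

computed_age = age_at(dob, service_date)
print(computed_age)
```

For the sample note this agrees with the model's answer of 15.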

### Creating Medical Code FAISS Index

Now that we have a good base of information, we want to be able to fill the code field for each medication using a FAISS nearest neighbor search and another query to select from the top 5 nearest neighbors

In [13]:
import faiss
import pandas as pd
import numpy as np
import openai

#### 1. Generate Embeddings

In [12]:
%%capture
from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

In [130]:
# Helper function to encode a line from a csv file
def get_embedding(text, model):
    text = text.replace("\n", " ")
    return model.encode(text)

In [203]:
csv_file = "/content/data/medications_assignment_1.csv"

def create_embeddings(csv_file, type_emb):
  """
    Create embeddings for medical data and save them in a FAISS index.

    Args:
        csv_file (str): Path to the medical CSV file to create embeddings from.
        type_emb (str): Type of embedding for file naming.

    Returns:
        faiss_index: A FAISS index object containing the embeddings.

    Description:
        This function generates embeddings for each row in the 'description' column
        of the provided CSV. The goal is to encode the description column to enable
        fast nearest neighbor queries using FAISS, which helps in retrieving the most
        relevant codes for lookup operations efficiently.
    """
  lines = []
  for i, line in enumerate(open(csv_file)):
      if i > 1:
          # Remove any leading/trailing whitespace
          line = line.strip()

          # Split the line into parts
          parts = line.split(",")

          # Check if there are at least 2 parts before accessing
          if len(parts) > 1:
            # want to encode the descriptions field
            embed = model.encode(parts[1].rstrip())
            lines.append(embed)
          else:
            continue

  embeds = np.array(lines)
  faiss.normalize_L2(embeds) # normalize so that the inner product (IndexFlatIP) equals cosine similarity

  index = faiss.IndexFlatIP(384)
  index.add(embeds)

  faiss_index_path = f"/content/{type_emb}_embeddings.index"
  faiss.write_index(index, faiss_index_path)

  print(f"FAISS index saved to {faiss_index_path}")
  print("Process complete!")

  return index

medication_index = create_embeddings(csv_file, "medication")

FAISS index saved to /content/medication_embeddings.index
Process complete!
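
One thing worth checking in `create_embeddings` is the manual `line.split(",")`: if a description ever contains a comma, the split truncates it. The sketch below contrasts that with the standard `csv` module. The two-row file is made up for illustration (the second code and description are hypothetical), and the CODE, DESCRIPTION column layout is assumed from the notes above.

```python
import csv
import io

# Hypothetical file contents; the second description contains a quoted comma.
raw = (
    "CODE,DESCRIPTION\n"
    "1000126,1 ML medroxyPROGESTERone acetate 150 MG/ML Injection\n"
    '999999,"Example drug, extended release"\n'
)

# Naive approach: split every data line on commas.
naive_rows = [line.strip().split(",") for line in raw.splitlines()[1:]]

# csv.DictReader honours the quoting and uses the header for field names.
parsed_rows = list(csv.DictReader(io.StringIO(raw)))

print(naive_rows[1][1])                # '"Example drug' (truncated)
print(parsed_rows[1]["DESCRIPTION"])   # 'Example drug, extended release'
```

It may also be worth verifying that row `i` of the FAISS index and `mapping_df.iloc[i]` refer to the same CSV row: the loop's `i > 1` condition skips the first two lines of the file, while `pd.read_csv` consumes only one header line.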


In [132]:
some_med = "piperacillin 400 MG / tazobactam 500 MG Inetion"
embed_some_med = model.encode(some_med)
embed_some_med

medication_index.search(np.array([embed_some_med]), k=3)

(array([[0.87212795, 0.87212795, 0.59823257]], dtype=float32),
 array([[ 31,  30, 106]]))

In [136]:
mapping_df = pd.read_csv("/content/data/medications_assignment_1.csv")
query_description = "medroxyPROGESTERone"

def get_nearest_neighbors(query_description, index, mapping_df):
  """
    Search for top 5 most likely descriptions and codes that match query_description

    Args:
        query_description (str): description query
        index (faiss.Index): FAISS index
        mapping_df (pandas df): for looking up codes

    Returns:
        nearest (dict): top 5 nearest descriptions and codes to query_description

    Description:
        This function conducts a FAISS search in order to search up the
        5 closest descriptions to the query description. Then we can
        extract the codes for each of the closest descriptions and return
        a dict of the closest descriptions and codes.
    """

  # Get the embedding for the query
  query_embedding = get_embedding(query_description, model).astype("float32")  # FAISS requires float32

  # Perform the search in the FAISS index
  k = 5  # Number of nearest neighbors to retrieve
  distances, indices = index.search(np.array([query_embedding]), k)
  nearest = {}

  #print(indices)
  # Get the closest match code
  for i in (indices.flatten()):
    if i != -1:  # Ensure a valid result
        nearest[i] = {}
        matched_code = mapping_df.iloc[i]["CODE"]
        nearest[i]["CODE"] = matched_code
        matched_description = mapping_df.iloc[i]["DESCRIPTION"]
        nearest[i]["DESCRIPTION"] = matched_description
        #print(f"Matched Code: {matched_code}")
        #print(f"Matched Description: {matched_description}")
    else:
        print("No match found.")
        return None

  return nearest

nearest = get_nearest_neighbors(query_description, medication_index, mapping_df)

print("Top_5_nearest_neighbors:")
print(nearest)

Top_5_nearest_neighbors:
{0: {'CODE': 1000126, 'DESCRIPTION': '1 ML medroxyPROGESTERone acetate 150 MG/ML Injection'}, 22: {'CODE': 1367439, 'DESCRIPTION': 'NuvaRing 0.12/0.015 MG per 24HR 21 Day Vaginal Ring'}, 163: {'CODE': 978950, 'DESCRIPTION': 'Natazia 28 Day Pack'}, 119: {'CODE': 583214, 'DESCRIPTION': 'PACLitaxel 100 MG Injection'}, 118: {'CODE': 562366, 'DESCRIPTION': 'desflurane 1000 MG/ML Inhalation Solution'}}


### Implementing Nearest Neighbor Selector

The chat model will help us decide which nearest neighbor is the closest

In [118]:
from typing import Optional
from pydantic import BaseModel, Field

# Pydantic class to ensure the model only returns the code
class Selection(BaseModel):
    CODE: str = Field(description="the chosen 'CODE' field from nearest neighbors")

We create a new model with the purpose of choosing from the top 5 nearest descriptions

In [119]:
neighbor_selector = ChatOpenAI(temperature=0)

In [120]:
neighbor_selector_structured = neighbor_selector.with_structured_output(Selection)

In [121]:
prompt = """
You are to assist in selecting the most accurate nearest neighbor\n
Based on the description: {description_to_match}\n
Select the nearest neighbor with the closest description in the following object:\n
{nearest_neighbors}\n
Return the CODE of the closest description.
"""

In [122]:
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template(prompt)

chain = prompt | neighbor_selector_structured

In [123]:
returned = chain.invoke(
    {
        "description_to_match": "medroxyPROGESTERone",
        "nearest_neighbors": str(nearest),
    }
)
returned

Selection(CODE='1000126')

In [124]:
returned = chain.invoke(
    {
        "description_to_match": "Natazia",
        "nearest_neighbors": str(nearest),
    }
)
returned

Selection(CODE='978950')

In [207]:
returned.CODE

'978950'
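
The selector returns the code as a string, while the mapping table stores it as an integer, and nothing guarantees the model picks one of the offered candidates. A small guard can catch both issues. This is a sketch; `validate_selection` is a hypothetical helper and the candidate dict below copies two entries from the output above.

```python
def validate_selection(code, nearest):
    """Return the candidate whose CODE matches the model's answer, else None."""
    wanted = str(code).strip()
    for candidate in nearest.values():
        # Compare as strings because the table holds ints and the LLM returns str.
        if str(candidate["CODE"]) == wanted:
            return candidate
    return None

nearest_example = {
    0: {"CODE": 1000126, "DESCRIPTION": "1 ML medroxyPROGESTERone acetate 150 MG/ML Injection"},
    163: {"CODE": 978950, "DESCRIPTION": "Natazia 28 Day Pack"},
}

print(validate_selection("978950", nearest_example))
print(validate_selection("000000", nearest_example))
```

A `None` result could be used to fall back to the top FAISS hit or to leave the code empty.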

### Implement for all given encounter notes

In [144]:
import os

# Function to get a list of all of the encounter notes in the encounter_notes directory
def get_file_names(directory):
  # List all files and directories in the given directory
  entries = os.listdir(directory)
  # Filter out directories and keep only file names
  files = [file for file in entries if os.path.isfile(os.path.join(directory, file))]
  return files


encounters_path = "/content/data/encounter_notes"
file_names = get_file_names(encounters_path)
print(file_names)

['df6b563d-1ff4-4833-9af8-84431e641e9c.txt', 'b9fd2dd8-181b-494b-ab15-e9f286d668d9.txt', '28658715-b770-4576-9a81-fbb2282a98ea.txt', 'f0f3bc8d-ef38-49ce-a2bd-dfdda982b271.txt', '199c586f-af16-4091-9998-ee4cfc02ee7a.txt~', 'ae9efba3-ddc4-43f9-a781-f72019388548.txt', 'd22592ac-552f-4ecd-a63d-7663d77ce9ba.txt', '353016ea-a0ff-4154-85bb-1cf8b6cedf20.txt', '055ae6fc-7e18-4a39-8058-64082ca6d515.txt', '199c586f-af16-4091-9998-ee4cfc02ee7a.txt', 'f73d6f41-0091-4485-8b43-9d38eb98fb36.txt']


Remove the duplicate file (the copy whose name ends in `~`)

In [145]:
file_names.remove('199c586f-af16-4091-9998-ee4cfc02ee7a.txt~')

In [146]:
file_names

['df6b563d-1ff4-4833-9af8-84431e641e9c.txt',
 'b9fd2dd8-181b-494b-ab15-e9f286d668d9.txt',
 '28658715-b770-4576-9a81-fbb2282a98ea.txt',
 'f0f3bc8d-ef38-49ce-a2bd-dfdda982b271.txt',
 'ae9efba3-ddc4-43f9-a781-f72019388548.txt',
 'd22592ac-552f-4ecd-a63d-7663d77ce9ba.txt',
 '353016ea-a0ff-4154-85bb-1cf8b6cedf20.txt',
 '055ae6fc-7e18-4a39-8058-64082ca6d515.txt',
 '199c586f-af16-4091-9998-ee4cfc02ee7a.txt',
 'f73d6f41-0091-4485-8b43-9d38eb98fb36.txt']
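
Removing the stray file by name works for this dataset; filtering by extension is a slightly more general option. A minimal sketch, assuming every real note ends in `.txt`; `keep_note_files` is a hypothetical helper and sorting is added only to make the order deterministic.

```python
def keep_note_files(names):
    """Keep only .txt files, dropping backups such as 'note.txt~'."""
    return sorted(name for name in names if name.endswith(".txt"))

listing = [
    "199c586f-af16-4091-9998-ee4cfc02ee7a.txt~",
    "ae9efba3-ddc4-43f9-a781-f72019388548.txt",
    "199c586f-af16-4091-9998-ee4cfc02ee7a.txt",
]

filtered = keep_note_files(listing)
print(filtered)
```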

In [154]:
def extract_file_contents(directory, file_list):
  """
  Extract file contents in text format from a list of files in a directory.

  Args:
      directory (str): Path to the directory containing the files.
      file_list (list): List of file names to extract contents from.

  Returns:
      file_contents (list): List of extracted file contents.

  """
  file_contents = []
  for file_name in file_list:
      file_path = os.path.join(directory, file_name)

      try:
        with open(file_path, 'r', encoding='utf-8') as file:
            file_content = file.read()
            file_contents.append(file_content)
      except FileNotFoundError:
        print(f"The file '{file_path}' does not exist.")
      except Exception as e:
        print(f"An error occurred: {e}")

  return file_contents

all_encounters = extract_file_contents(encounters_path, file_names)
all_encounters[0]

'URGENT CARE ENCOUNTER NOTE\nDate of Service: March 13, 2020 16:12-17:11\n\nDEMOGRAPHICS:\nName: Ms. Brown\nDOB: 9/29/1982 (37y/o)\nGender: Female\nAddress: Boston, MA\nInsurance: Medicare/Medicaid\nMRN: df6b563d-1ff4-4833-9af8-84431e641e9c\n\nPERTINENT MEDICAL HISTORY:\n- Pulmonary emphysema (diagnosed 2015)\n- Hypertension (since 2000)\n- Multiple allergies (bee venom, grass/tree pollen, fish)\nCurrent Medications:\n- Hydrochlorothiazide 12.5 MG daily\n- Fluticasone/Salmeterol 250/50 mcg inhaler BID\nLast Visit: Routine wellness check (3/11/2020)\nImmunizations: Influenza vaccine received 3/11/2020\n\nSUBJECTIVE:\nPatient presents with acute onset of symptoms including fever, cough, severe fatigue, and complete loss of taste. Symptoms began approximately 48 hours ago. Patient reports worsening shortness of breath over the last 24 hours. Denies recent travel. No known COVID contacts. Never smoker. Has been compliant with maintenance inhalers for emphysema.\n\nOBJECTIVE:\nVitals:\nTemp

#### Parse All file outputs

We want to parse all of the encounter notes and save the results as JSON objects, so that we have a checkpoint before we try to fill in the codes.

We also want to redefine the schema to include more than just medications

Since there are so many fields, I set them all to optional because there is a good chance that the model misses one

In [190]:
class EncounterType(BaseModel):
    code: Optional[str] = None
    description: Optional[str] = None

class Encounter(BaseModel):
    date: Optional[str] = None
    time: Optional[str] = None
    provider_id: Optional[str] = None
    facility_id: Optional[str] = None
    encounter_type: Optional[EncounterType] = None

class Address(BaseModel):
    city: Optional[str] = None
    state: Optional[str] = None

class Demographics(BaseModel):
    name: Optional[str] = None
    date_of_birth: Optional[str] = None
    age: Optional[str] = None
    gender: Optional[str] = None
    address: Optional[Address] = None
    insurance: Optional[str] = None

class Condition(BaseModel):
    code: Optional[str] = None
    description: Optional[str] = None

class Medication(BaseModel):
    code: Optional[str] = None
    description: Optional[str] = None

class Immunization(BaseModel):
    code: Optional[str] = None
    description: Optional[str] = None
    date: Optional[str] = None

class VitalMeasurement(BaseModel):
    code: Optional[str] = None
    value: Optional[float] = None
    unit: Optional[str] = None

class BloodPressure(BaseModel):
    systolic: Optional[VitalMeasurement] = None
    diastolic: Optional[VitalMeasurement] = None

class CurrentVitals(BaseModel):
    temperature: Optional[VitalMeasurement] = None
    heart_rate: Optional[VitalMeasurement] = None
    blood_pressure: Optional[BloodPressure] = None
    respiratory_rate: Optional[VitalMeasurement] = None
    oxygen_saturation: Optional[VitalMeasurement] = None
    weight: Optional[VitalMeasurement] = None

class BaselineVitals(BaseModel):
    date: Optional[str] = None
    height: Optional[VitalMeasurement] = None
    weight: Optional[VitalMeasurement] = None
    bmi: Optional[VitalMeasurement] = None
    bmi_percentile: Optional[VitalMeasurement] = None


class Vitals(BaseModel):
    current: Optional[CurrentVitals] = None
    baseline: Optional[BaselineVitals] = None

class RespiratoryTest(BaseModel):
    code: Optional[str] = None
    result: Optional[str] = None

class RespiratoryPanel(BaseModel):
    influenza_a: Optional[RespiratoryTest] = None
    influenza_b: Optional[RespiratoryTest] = None
    rsv: Optional[RespiratoryTest] = None
    parainfluenza_1: Optional[RespiratoryTest] = None
    parainfluenza_2: Optional[RespiratoryTest] = None
    parainfluenza_3: Optional[RespiratoryTest] = None
    rhinovirus: Optional[RespiratoryTest] = None
    metapneumovirus: Optional[RespiratoryTest] = None
    adenovirus: Optional[RespiratoryTest] = None


class Covid19Test(BaseModel):
    code: Optional[str] = None
    description: Optional[str] = None
    result: Optional[str] = None


class Laboratory(BaseModel):
    covid19: Optional[Covid19Test] = None
    respiratory_panel: Optional[RespiratoryPanel] = None


class Procedure(BaseModel):
    code: Optional[str] = None
    description: Optional[str] = None
    date: Optional[str] = None
    reasonCode: Optional[str] = None
    reasonDescription: Optional[str] = None


class CarePlan(BaseModel):
    code: Optional[str] = None
    description: Optional[str] = None
    start: Optional[str] = None
    stop: Optional[str] = None
    reasonCode: Optional[str] = None
    reasonDescription: Optional[str] = None


class PatientRecord_Complete(BaseModel):
    demographics: Demographics
    encounter: Optional[Encounter] = None
    conditions: Optional[List[Condition]] = None
    medications: Optional[List[Medication]] = None
    immunizations: Optional[List[Immunization]] = None
    vitals: Optional[Vitals] = None
    laboratory: Optional[Laboratory] = None
    procedures: Optional[List[Procedure]] = None

In [191]:
structured_complete =  chat_model.with_structured_output(PatientRecord_Complete)

In [333]:
# Helper function to query llm and return pydantic object
def parse_note(model, input):
  parser_object = model.invoke(input)
  return parser_object

test = parse_note(structured_complete, all_encounters[0])
#test

In [334]:
#test.model_dump()

The model is filling in the code fields with invented codes, presumably because the notes themselves contain no codes. We can still replace these with the FAISS nearest neighbor search.

Let's get the encounter dictionaries for all the encounter notes

In [195]:
dict_outputs = []
for i in range(len(all_encounters)):

  parsed_object = parse_note(structured_complete, all_encounters[i])

  dictionary = parsed_object.model_dump()
  dict_outputs.append(dictionary)


Let's save the output to a file.

In [198]:
import json
with open('encounter_output.json', 'w') as f:
    json.dump(dict_outputs, f, indent=4)

In [199]:
!cp /content/encounter_output.json /content/drive/MyDrive/assignment3_data/
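
Since the records are headed for Spark, it may help to also write them as JSON Lines (one object per line), which is the layout Spark's JSON reader expects by default. The sketch below uses only the standard library; `write_jsonl` and `read_jsonl` are hypothetical helpers, and the two records are placeholders standing in for `dict_outputs`.

```python
import json
import os
import tempfile

def write_jsonl(records, path):
    """Write each record as a single JSON object on its own line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Placeholder records; in the notebook this would be dict_outputs.
records = [
    {"demographics": {"name": "Ms. Brown"}, "medications": []},
    {"demographics": {"name": "Jeffrey Greenfelder"}, "medications": []},
]

path = os.path.join(tempfile.mkdtemp(), "encounter_output.jsonl")
write_jsonl(records, path)
round_trip = read_jsonl(path)
print(len(round_trip), "records written to", path)
```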

### Now we process the json file and fill in each of the fields that need codes with the available data from our csv files

Let's create an index for each CSV file. We already have one for medications.

In [200]:
encounter_type_file = "/content/data/encounters_types_assignment_1.csv"
encounter_type_index = create_embeddings(encounter_type_file, "encounter_type")

FAISS index saved to /content/encounter_type_embeddings.index
Process complete!


In [204]:
immunization_file = "/content/data/immunizations_assignment_1.csv"
immunizations_index = create_embeddings(immunization_file, "immunization")

FAISS index saved to /content/immunization_embeddings.index
Process complete!


In [205]:
vitals = "/content/data/observations_assignment_1.csv"
vitals_index = create_embeddings(vitals, "vitals")

FAISS index saved to /content/vitals_embeddings.index
Process complete!


Using these indexes, the medications, immunizations, vitals, and encounter types can be filled.

In [206]:
medications_mapping_df = pd.read_csv("/content/data/medications_assignment_1.csv")
immunizations_mapping_df = pd.read_csv("/content/data/immunizations_assignment_1.csv")
vitals_mapping_df = pd.read_csv("/content/data/observations_assignment_1.csv")
encounter_types_mapping_df = pd.read_csv("/content/data/encounters_types_assignment_1.csv")

#test = get_nearest_neighbors(query_description, vitals, mapping_df)

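Rename the vitals_mapping_df columns to 'CODE' and 'DESCRIPTION' so they match the other data frames. (The observations file appears to have no header row, so pandas used its first data row as the column names.)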

In [292]:
vitals_mapping_df = vitals_mapping_df.rename(columns={"10230-1": "CODE", "Left ventricular Ejection fraction":"DESCRIPTION"})
vitals_mapping_df.head(2)

Unnamed: 0,CODE,DESCRIPTION
0,10480-2,Estrogen+Progesterone receptor Ag [Presence] i...
1,10834-0,Globulin


Now we want to write a function that fills in the codes of a single JSON object by searching just one of the indices and querying the selection LLM.

In [208]:
# Helper function to query open ai llm and return the CODE picked from top 5 nearest neighbors
def select_neighbor(chain, input, nearest):
  returned = chain.invoke(
    {
      "description_to_match": input,
      "nearest_neighbors": str(nearest),
    }
  )
  return returned.CODE

Because the fields in the pydantic schema have different structures, three different functions were created to handle these cases.

In [299]:
# helper function that calls select_neighbor and nearest neighbor
# This was the only required function for getting the encounter_type code
def get_code(description, index, mapping_df, chain):
  #print(description)
  nearest = get_nearest_neighbors(description, index, mapping_df)
  #print(nearest)
  code = select_neighbor(chain, description, nearest)

  return code
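
Each `get_code` call costs one FAISS search plus one LLM request, and descriptions such as "Influenza vaccine" recur across notes. Memoizing by description is one way to avoid repeat calls. This is a sketch: `make_cached_lookup` is hypothetical, and `fake_lookup` stands in for something like `lambda d: get_code(d, index, mapping_df, chain)`.

```python
from functools import lru_cache

def make_cached_lookup(lookup):
    """Wrap a description -> code function so each description is resolved once."""
    @lru_cache(maxsize=None)
    def cached(description: str) -> str:
        return lookup(description)
    return cached

calls = []

def fake_lookup(description):
    # Stand-in for the real FAISS + LLM lookup; records how often it is hit.
    calls.append(description)
    return f"code-for-{description.lower()}"

cached_lookup = make_cached_lookup(fake_lookup)
first = cached_lookup("Influenza vaccine")
second = cached_lookup("Influenza vaccine")
print(first, len(calls))
```

One cached wrapper per index would be needed, since the same description could map to different codes in different tables.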

In [300]:
def process_list_objs(objs, field_type, index, mapping_df, chain):
  """
  Process a list of objects with code fields (medications, immunizations)

  Args:
      objs (list):list of objs
      index (faiss.Index): FAISS index
      mapping_df (pandas df): for looking up codes
      chain: langchain chain with prompt and model

  Returns:
      processed objs: list of objs with codes added

  Description:
      Using the 5 closest codes from the given FAISS index,
      we can query our OpenAI LLM to help us choose one
      from the top 5. We fill the code field with the chosen code
      and return the processed list of objs.
  """

  processed_objs = []
  for obj in objs:

      #Fall back to key if there is no decription field
      description = obj.get("description", obj.get("key"))

      # Get code from llm and FAISS nearest neighbor
      code = get_code(description, index, mapping_df, chain)
      obj["code"] = code

      processed_objs.append(obj)

  return processed_objs

In [303]:
def process_dict_objs(objs, field_type, index, mapping_df, chain):
  """
  Process dictionary with code fields (vitals.current, vitals.baseline)

  Args:
      objs (dict): dict of dicts with codes
      index (faiss.Index): FAISS index
      mapping_df (pandas df): for looking up codes
      chain: langchain chain with prompt and model

  Returns:
      processed objs: dict of dicts with codes added

  Description:
      Using the 5 closest codes from the given FAISS index,
      we can query our OpenAI LLM to help us choose one
      from the top 5. We fill the code field with the chosen code
      and return the processed dict of dicts.
  """

  processed_objs = {}

  keys = list(objs.keys())
  for key in keys:

    # skip date field (in vitals.baseline)
    if key == 'date':
      processed_objs[key] = objs[key]
      continue

    # skip reassignment if the field is not populated
    if objs[key] is None:
      processed_objs[key] = objs[key]
      continue

    description = key

    code = get_code(description, index, mapping_df, chain)
    objs[key]["code"] = code

    processed_objs[key] = objs[key]

  return processed_objs

In [328]:
def fill_codes(objs, field_type, index, mapping_df, chain):
  """
  Process a field in an encounter_note dict

  Args:
      objs: could be list of dicts, dict of dicts, or single dict
      index (faiss.Index): FAISS index
      mapping_df (pandas df): for looking up codes
      chain: langchain chain with prompt and model

  Returns:
      processed objs: processed object that will reassign the code field

  Description:
      This function includes logic to decide which processing function
      to use to fill in the codes. This separation was needed because
      of the hierarchy of the encounter_note schema.
  """

  # Process vitals field
  if field_type == "vitals":
    current = objs["current"]
    baseline = objs["baseline"]

    # either current or baseline can exist without the other;
    # this must be accounted for by returning None if already None
    if current is not None:
      processed_current = process_dict_objs(current, field_type, index, mapping_df, chain)
    else:
      processed_current = None

    if baseline is not None:
      processed_baseline = process_dict_objs(baseline, field_type, index, mapping_df, chain)
    else:
      processed_baseline = None

    return {"current": processed_current, "baseline": processed_baseline}

  # Process encounter_type
  elif field_type == "encounter_type":
    description = objs["description"]
    code = get_code(description, index, mapping_df, chain)
    objs["code"] = code

    return objs

  # Process medications or immunizations
  else:

    # objs should be a list of objs in this case
    processed = process_list_objs(objs, field_type, index, mapping_df, chain)

    return processed


Let's test each of the code-assignment fields.

In [260]:
test_medications = test.model_dump()['medications']
test_medications

[{'code': '313426', 'description': 'Hydrochlorothiazide 12.5 MG'},
 {'code': '400436',
  'description': 'Fluticasone/Salmeterol 250/50 mcg inhaler'}]

In [261]:
result = fill_codes(test_medications, "medications", medication_index, medications_mapping_df, chain)
result

[{'code': '997501', 'description': 'Hydrochlorothiazide 12.5 MG'},
 {'code': '895994',
  'description': 'Fluticasone/Salmeterol 250/50 mcg inhaler'}]

In [262]:
test_immunizations = test.model_dump()['immunizations']
test_immunizations

[{'code': 'CVX-140', 'description': 'Influenza vaccine', 'date': '3/11/2020'}]

In [231]:
a = get_nearest_neighbors("Influenza vaccine", immunizations_index, immunizations_mapping_df)
a

{6: {'CODE': 133, 'DESCRIPTION': 'Pneumococcal conjugate PCV 13'},
 10: {'CODE': 3, 'DESCRIPTION': 'MMR'},
 15: {'CODE': 62, 'DESCRIPTION': 'HPV  quadrivalent'},
 1: {'CODE': 113, 'DESCRIPTION': 'Td (adult) preservative free'},
 11: {'CODE': 33,
  'DESCRIPTION': 'pneumococcal polysaccharide vaccine  23 valent'}}

In [263]:
result2 = fill_codes(test_immunizations, "immunizations", immunizations_index, immunizations_mapping_df, chain)
result2

[{'code': '113', 'description': 'Influenza vaccine', 'date': '3/11/2020'}]

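The nearest-neighbor search is not perfect: none of the five retrieved candidates is an influenza vaccine, so the model can only choose among unrelated codes and returns 113 ('Td (adult) preservative free') for 'Influenza vaccine'.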
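
One way to catch a retrieval miss like this before the LLM is asked to choose is to check whether any retrieved candidate shares a meaningful word with the query. The sketch below is an assumption, not part of the pipeline above: `tokenize`, `has_lexical_overlap`, and `STOPWORDS` are hypothetical names, and the `neighbors` dict simply mirrors the shape of the `get_nearest_neighbors` output shown earlier.

```python
import re

# Hypothetical list of words too generic to count as a real match
STOPWORDS = {"vaccine", "free", "adult"}


def tokenize(text):
    """Lowercase a description and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def has_lexical_overlap(query, candidates, stopwords=STOPWORDS):
    """Return True if any candidate shares a non-stopword token with the query.

    `candidates` is assumed to look like {row_id: {"CODE": ..., "DESCRIPTION": ...}}.
    """
    query_tokens = tokenize(query) - stopwords
    for candidate in candidates.values():
        if query_tokens & (tokenize(candidate["DESCRIPTION"]) - stopwords):
            return True
    return False


# Candidates copied from the nearest-neighbor output above
neighbors = {
    6: {"CODE": 133, "DESCRIPTION": "Pneumococcal conjugate PCV 13"},
    10: {"CODE": 3, "DESCRIPTION": "MMR"},
    15: {"CODE": 62, "DESCRIPTION": "HPV  quadrivalent"},
    1: {"CODE": 113, "DESCRIPTION": "Td (adult) preservative free"},
    11: {"CODE": 33, "DESCRIPTION": "pneumococcal polysaccharide vaccine  23 valent"},
}

overlap = has_lexical_overlap("Influenza vaccine", neighbors)
print(overlap)
```

When the check fails, the entry could be flagged for review or left uncoded rather than accepting whichever of the five candidates the model picks.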

In [233]:
test_encounter = test.model_dump()['encounter']['encounter_type']
test_encounter

{'code': 'Ambulatory/Urgent Care', 'description': 'Urgent Care'}

In [264]:
result3 = fill_codes(test_encounter, "encounter_type", encounter_type_index, encounter_types_mapping_df, chain)
result3

{'code': '50849002', 'description': 'Urgent Care'}

In [280]:
test_vitals = test.model_dump()['vitals']
test_vitals

{'current': {'temperature': {'code': '8310-5', 'value': 40.6, 'unit': 'Cel'},
  'heart_rate': {'code': '8867-4', 'value': 179.0, 'unit': '/min'},
  'blood_pressure': {'systolic': {'code': '8480-6',
    'value': 106.0,
    'unit': 'mmHg'},
   'diastolic': {'code': '8462-4', 'value': 78.0, 'unit': 'mmHg'}},
  'respiratory_rate': {'code': '9279-1', 'value': 24.0, 'unit': '/min'},
  'oxygen_saturation': {'code': '20564-1', 'value': 83.6, 'unit': '%'},
  'weight': {'code': '29463-7', 'value': 59.9, 'unit': 'kg'}},
 'baseline': {'date': '3/11/2020',
  'height': None,
  'weight': {'code': '29463-7', 'value': 59.9, 'unit': 'kg'},
  'bmi': None,
  'bmi_percentile': None}}

In [320]:
result4 = fill_codes(test_vitals, "vitals", vitals_index, vitals_mapping_df, chain)
result4

{'current': {'temperature': {'code': '2947-0', 'value': 40.6, 'unit': 'Cel'},
  'heart_rate': {'code': '8462-4', 'value': 179.0, 'unit': '/min'},
  'blood_pressure': {'systolic': {'code': '8480-6',
    'value': 106.0,
    'unit': 'mmHg'},
   'diastolic': {'code': '8462-4', 'value': 78.0, 'unit': 'mmHg'},
   'code': '8462-4'},
  'respiratory_rate': {'code': '92142-9', 'value': 24.0, 'unit': '/min'},
  'oxygen_saturation': {'code': '2703-7', 'value': 83.6, 'unit': '%'},
  'weight': {'code': '59557-9', 'value': 59.9, 'unit': 'kg'}},
 'baseline': {'date': '3/11/2020',
  'height': None,
  'weight': {'code': '59557-9', 'value': 59.9, 'unit': 'kg'},
  'bmi': None,
  'bmi_percentile': None}}
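
In the output above, `blood_pressure` picked up a top-level `'code'` key that was not in the input, because it holds nested `systolic` and `diastolic` measurements rather than a single measurement. A minimal sketch of one way to handle that nesting follows. `fill_vital_codes` is a hypothetical helper, and the stub `lookup` stands in for the `get_code(description, index, mapping_df, chain)` call used above; it returns placeholders, not real codes.

```python
def fill_vital_codes(vitals, lookup):
    """Return a new dict with a code assigned to every measurement.

    A measurement is assumed to be a dict with a "value" key; any other
    dict (such as blood_pressure) is treated as a container and recursed into.
    """
    processed = {}
    for key, entry in vitals.items():
        if not isinstance(entry, dict):
            # Dates and unpopulated (None) fields pass through unchanged
            processed[key] = entry
        elif "value" in entry:
            # Leaf measurement: copy it and overwrite the code
            processed[key] = {**entry, "code": lookup(key)}
        else:
            # Container of measurements: recurse instead of adding a code
            processed[key] = fill_vital_codes(entry, lookup)
    return processed


def lookup(description):
    """Stub for illustration only; returns a placeholder, not a real code."""
    return f"CODE-FOR-{description}"


sample = {
    "date": "3/11/2020",
    "height": None,
    "blood_pressure": {
        "systolic": {"code": "", "value": 106.0, "unit": "mmHg"},
        "diastolic": {"code": "", "value": 78.0, "unit": "mmHg"},
    },
    "weight": {"code": "", "value": 59.9, "unit": "kg"},
}

coded = fill_vital_codes(sample, lookup)
```

Because each leaf is copied before its code is overwritten, `sample` is left untouched. In practice a fuller description than the bare key (for example "systolic blood pressure") may retrieve better candidates.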

### Full Implementation on all dict objects

In [317]:
with open('/content/encounter_output.json', 'r') as file:
    encounter_outputs = json.load(file)

In [323]:
#encounter_outputs[0]

In [329]:
from copy import deepcopy

def code_encounter_note(encounter_note, chain):
  """
  Process an entire encounter note and create a copy with codes added

  Args:
      encounter_note (dict): dict with Encounter_Note_Complete structure
      chain: langchain chain with prompt and model

  Returns:
      copy (dict): copy of encounter_note with codes added

  Description:
      We process each of the fields with codes only if they exist. A deep copy is created
      so that the original dictionary, including its nested dicts, is not mutated.
  """

  # dict.copy() is shallow and the fill functions assign codes in place,
  # so work on a deep copy and read the fields from that copy
  copy = deepcopy(encounter_note)

  encounter_medications = copy["medications"]
  encounter_immunizations = copy["immunizations"]
  encounter_encounter_type = copy["encounter"]["encounter_type"]
  encounter_vitals = copy["vitals"]

  # Only process if exists

  if encounter_medications is not None:
    copy["medications"] = fill_codes(encounter_medications, "medications", medication_index, medications_mapping_df, chain)

  if encounter_immunizations is not None:
    copy["immunizations"] = fill_codes(encounter_immunizations, "immunizations", immunizations_index, immunizations_mapping_df, chain)

  if encounter_encounter_type is not None:
    copy["encounter"]["encounter_type"] = fill_codes(encounter_encounter_type, "encounter_type", encounter_type_index, encounter_types_mapping_df, chain)

  if encounter_vitals is not None:
    copy["vitals"] = fill_codes(encounter_vitals, "vitals", vitals_index, vitals_mapping_df, chain)

  return copy
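
A note on copying nested dictionaries: `dict.copy()` is shallow, so nested dicts such as `note["encounter"]` are shared between the original and the copy, while `copy.deepcopy` duplicates them. The toy `note` below is made up for illustration and is much smaller than a real encounter note.

```python
from copy import deepcopy

note = {"encounter": {"encounter_type": {"code": "", "description": "Urgent Care"}}}

shallow = note.copy()
shallow["encounter"]["encounter_type"]["code"] = "X"
# The nested dict is shared, so the original changed as well
shallow_leaked = note["encounter"]["encounter_type"]["code"] == "X"

note["encounter"]["encounter_type"]["code"] = ""  # reset the original
deep = deepcopy(note)
deep["encounter"]["encounter_type"]["code"] = "X"
# The deep copy owns its nested dicts, so the original is untouched
deep_leaked = note["encounter"]["encounter_type"]["code"] == "X"

print(shallow_leaked, deep_leaked)  # prints: True False
```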


In [322]:
#test = code_encounter_note(encounter_outputs[0], chain)
#test

In [330]:
processed_encounter_notes = []
for encounter_output in encounter_outputs:
  processed_encounter_notes.append(code_encounter_note(encounter_output, chain))
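
Because each note triggers several LLM calls, a single failure part-way through this loop would discard everything processed so far. The following is a hedged sketch of one way to guard against that, not what was run here: `process_all` is a hypothetical helper and `flaky_coder` is a made-up stand-in for `code_encounter_note`.

```python
def process_all(notes, process_fn):
    """Apply process_fn to every note, collecting failures instead of stopping."""
    processed, failed = [], []
    for position, note in enumerate(notes):
        try:
            processed.append(process_fn(note))
        except Exception as exc:  # broad on purpose: API and parsing errors alike
            failed.append((position, repr(exc)))
    return processed, failed


def flaky_coder(note):
    """Stand-in for code_encounter_note that fails on notes without vitals."""
    if "vitals" not in note:
        raise KeyError("vitals")
    return {**note, "coded": True}


done, errors = process_all([{"vitals": {}}, {}, {"vitals": {}}], flaky_coder)
print(len(done), errors)
```

The positions recorded in `errors` make it possible to retry only the notes that failed.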

Save the processed notes in a JSON file

In [331]:
import json
with open('encounter_output_processed.json', 'w') as f:
    json.dump(processed_encounter_notes, f, indent=4)

In [332]:
!cp /content/encounter_output_processed.json /content/drive/MyDrive/assignment3_data/

### Part 2 will be done on Databricks