## Final Assignment Overview: Working with Patient Records and Encounter Notes

In this final assignment, we’ll focus on patient records related to COVID-19 encounters. Our task is to analyze, process, and transform the data while applying the concepts we’ve covered throughout this course. Here's a detailed breakdown of the assignment:

### What Are Encounter Notes?
An encounter note is a record that captures details about a patient’s visit with a doctor. It includes both structured and semi-structured information that is crucial for understanding the context of the visit.

An encounter note typically contains:

* General encounter information: 

  * When the encounter took place: Date and time of the visit.
  * Demographics: Patient’s age, gender, and unique medical record identifier.
  * Encounter details: The reason for the visit, diagnosis, and any associated costs.


Semi-Structured Notes:

These notes mirror how doctors organize their thoughts and observations during an encounter. They generally follow a SOAP format:

* **Subjective**: The patient’s subjective description of their symptoms, feelings, and medical concerns.
* **Objective**: The doctor’s objective findings, including test results, measurements, or physical examination outcomes.
* **Assessment**: The doctor’s evaluation or diagnosis based on subjective and objective information.
* **Plan**: The proposed treatment plan, including medications, follow-ups, or other interventions.

While some encounter notes might include additional details, the majority conform to this semi-structured format, making them ideal for analysis and transformation.
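
As a rough illustration of why the SOAP layout lends itself to processing, the sketch below splits a note into its four sections. It assumes each section begins with its name followed by a colon at the start of a line; the `split_soap` helper and the sample note are made up for illustration, and the real notes in the dataset may be laid out differently.

```python
import re

SOAP_SECTIONS = ("Subjective", "Objective", "Assessment", "Plan")

# Matches a section header such as "Subjective:" at the start of a line.
HEADER = re.compile(
    r"^(%s):\s*" % "|".join(SOAP_SECTIONS), re.IGNORECASE | re.MULTILINE
)

def split_soap(note: str) -> dict:
    """Return {section_name: text} for each SOAP header found in the note."""
    sections = {}
    matches = list(HEADER.finditer(note))
    for i, m in enumerate(matches):
        # A section runs until the next header, or to the end of the note.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        sections[m.group(1).capitalize()] = note[m.end():end].strip()
    return sections

# Hypothetical note, not taken from the dataset.
sample_note = """Subjective:
Patient reports fever and cough for three days.
Objective:
Temperature 38.6 C, oxygen saturation 94%.
Assessment:
Suspected COVID-19.
Plan:
Order a COVID-19 test and isolate at home.
"""

soap = split_soap(sample_note)
```

A rule-based split like this is brittle, which is one reason the assignment hands the extraction to an LLM instead.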

## Goals for the Assignment

## 1. Transforming Encounter Notes:

Use an LLM to convert the semi-structured encounter notes into JSON that organizes the information into structured fields. The JSON will include demographics, encounter specifics, and the SOAP components of the note. You will then convert the JSON data into a Parquet file, which is suitable both for analysis in Spark and for later storage.

Here we will use ML classification to assign the objective and assessment semi-structured fields to standardized, structured fields. The medical taxonomy for this task will be the one provided by the CDC, which defines standard codes for diagnoses, symptoms, procedures, and treatments. This step ensures the structured data aligns with domain-wide medical standards, making it interoperable and ready for deeper analysis.

The JSON format should capture the hierarchies described in the structure below.

### From Explanation:
- There is a code related to each symptom; this will help us turn the Subjective description into structured data.
- We want everything to have a code associated with it, so we will ask an LLM to return structured data.
- **We have to generate a plan of what the structure will look like.**


## 2. Basic Analytics and Visualizations:
Using Apache Spark, perform comprehensive data analysis on the encounter data and create visualizations that reveal meaningful patterns. Your analysis must include:
- COVID-19 Case Demographics: Case breakdown by age ranges ([0-5], [6-10], [11-17], [18-30], [31-50], [51-70], [71+])
- Cumulative COVID-19 case count between the earliest and the latest case observed in the dataset
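
The two items above can be sketched in plain Python to pin down the logic before porting it to Spark. The `age_range` helper and the sample `cases` are made up for illustration. In Spark, the same boundaries could be expressed with `when`/`otherwise` from `pyspark.sql.functions`, and the running total with a sum over a window ordered by date.

```python
from collections import Counter
from datetime import date
from itertools import accumulate

# Inclusive upper bound and label for each age range from the assignment.
AGE_RANGES = [(5, "0-5"), (10, "6-10"), (17, "11-17"), (30, "18-30"),
              (50, "31-50"), (70, "51-70")]

def age_range(age: int) -> str:
    """Map an age in whole years to its assignment age-range label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for upper, label in AGE_RANGES:
        if age <= upper:
            return label
    return "71+"

# Hypothetical (age, diagnosis_date) pairs; not taken from the dataset.
cases = [
    (4, date(2020, 3, 1)),
    (34, date(2020, 3, 1)),
    (72, date(2020, 3, 3)),
    (45, date(2020, 3, 2)),
    (17, date(2020, 3, 3)),
]

# Case count per age range.
cases_by_age_range = Counter(age_range(age) for age, _ in cases)

# Daily counts in date order, then a running total from first to last case.
daily = sorted(Counter(d for _, d in cases).items())
cumulative = list(zip((d for d, _ in daily), accumulate(n for _, n in daily)))
```

Note that days with no cases do not appear in `cumulative`; a plot over a continuous date axis would need those gaps filled with the previous total.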

- Symptoms for all COVID-19 patients versus patients who were admitted to the intensive care unit due to COVID-19.
- *This will use encounters_assignment_1.csv and encounters_types_assignment_1.csv: Intensive care unit has a specific encounter code; then we can*

- Rank medications by frequency of prescription
- Analyze medication patterns across different demographic groups (e.g., top 3 per age group)
- Identify and plot co-morbidity information from the patient records (e.g., hypertension, obesity, prediabetes, etc.) provided in the dataset. 
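
The medication ranking and the "top 3 per age group" breakdown could be prototyped as below. This is a plain-Python sketch on invented rows, not the dataset. In Spark, the equivalent would likely be a `groupBy(...).count()` followed by a rank over a window partitioned by age group.

```python
from collections import Counter, defaultdict

# Hypothetical (age_range, medication) prescription rows; not taken from the dataset.
prescriptions = (
    [("18-30", "acetaminophen")] * 2
    + [("18-30", "albuterol")] * 1
    + [("51-70", "lisinopril")] * 4
    + [("51-70", "acetaminophen")] * 3
    + [("51-70", "metformin")] * 2
    + [("51-70", "albuterol")] * 1
)

# Overall ranking by prescription frequency, most frequent first.
overall = Counter(med for _, med in prescriptions).most_common()

# One counter per age group, then the three most frequent medications in each.
per_group = defaultdict(Counter)
for group, med in prescriptions:
    per_group[group][med] += 1

top3 = {
    group: [med for med, _ in counts.most_common(3)]
    for group, counts in per_group.items()
}
```

Groups with fewer than three distinct medications simply return a shorter list, as `"18-30"` does here.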

- An independent group analysis: You need to develop and execute **THREE original analyses** that provide meaningful insights about COVID-19 patterns in this dataset. For each analysis:
  - Clearly state your analytical question/hypothesis
  - Justify why this analysis is valuable
  - Show your Spark code and methodology
  - Present results with appropriate visualizations
  
The analyses should be actionable, informative, and helpful with regard to the dataset.

# Plan for what the structure should look like after putting it into an LLM:
- We should turn this template into a JSON file.
- We use Pydantic models with the LLM's structured output: the encounter and each of its nested parts is a Pydantic class.
- We also want the data to contain codes for things like symptoms or encounter types instead of the English descriptions. *How can we do this when the codes are in the csv files? How can we provide the codes?*
- We should provide the codes as part of the prompt, but this takes two prompts. For example:
  1. In the first prompt, the LLM finds the medication names in the file (more generally, it extracts anything that does NOT depend on your database).
     - Between the two prompts, use FAISS similarity search to find the 10 items most similar to each extracted description, comparing embeddings rather than raw text. Then look up those entries in your "database" CSV and include them in your context. This is Retrieval-Augmented Generation.
  2. Submit your second prompt with those codes and descriptions in the context. This is only the SUBSET of your database relevant to the specific file, so you save tokens, and the LLM can return each item in your Medications object with its code.
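
The retrieval step and the second prompt might look roughly like this. To stay dependency-free, the sketch swaps FAISS for a brute-force cosine-similarity search, which returns the same kind of nearest-neighbour result on a small catalog. The helper names, the codes, the descriptions, and the 3-dimensional embeddings are all invented; real embeddings would come from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, catalog, k=10):
    """Return the k catalog rows whose embeddings are most similar to query_vec."""
    ranked = sorted(
        catalog, key=lambda row: cosine(query_vec, row["embedding"]), reverse=True
    )
    return ranked[:k]

def build_second_prompt(extracted_name, candidates):
    """Build prompt 2: the extracted name plus only the retrieved code/description pairs."""
    lines = [f"{row['code']}: {row['description']}" for row in candidates]
    return (
        f"Medication mentioned in the note: {extracted_name}\n"
        "Choose the best matching entry from these candidates "
        "and return its code and description:\n" + "\n".join(lines)
    )

# Toy catalog standing in for the medications CSV (all values are made up).
catalog = [
    {"code": "M001", "description": "Acetaminophen 325 MG Oral Tablet", "embedding": [0.9, 0.1, 0.0]},
    {"code": "M002", "description": "Albuterol inhaler", "embedding": [0.1, 0.9, 0.1]},
    {"code": "M003", "description": "Acetaminophen 500 MG Oral Tablet", "embedding": [0.8, 0.2, 0.1]},
    {"code": "M004", "description": "Lisinopril 10 MG Oral Tablet", "embedding": [0.0, 0.1, 0.9]},
]

# Pretend prompt 1 extracted "acetaminophen" and the embedding model returned this vector.
query_vec = [1.0, 0.1, 0.0]
candidates = top_k(query_vec, catalog, k=2)
prompt_2 = build_second_prompt("acetaminophen", candidates)
```

Only the retrieved rows reach the second prompt, which is where the token saving comes from.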

In [1]:
from pydantic import BaseModel, Field
from typing import List, Optional

class Medication(BaseModel):
    code: str 
    description: str

class Address(BaseModel):
    city: str
    state: str

class Demographics(BaseModel):
    name: str 
    date_of_birth: str 
    age: str
    gender: str
    address: Address
    insurance: str


In [None]:
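from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

# The parser supplies format instructions and validates the LLM output against a Pydantic model.
# Demographics is used only because it is defined above; swap in the full record model once it exists.
parser = PydanticOutputParser(pydantic_object=Demographics)

# Prompt prefix with the Pydantic output parser instructions; the input variable is the encounter text file.
prompt = PromptTemplate(
    template="Extract the patient information from this encounter note.\n{format_instructions}\n\n{encounter_text}",
    input_variables=["encounter_text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

api_key = ""  # fill in your OpenAI API key
model = ChatOpenAI(model="gpt-4o-mini", temperature=0, api_key=api_key)  # model name is a placeholder
chain = prompt | model | parser  # used instead of the deprecated LLMChain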

In [None]:
EncounterType:
    code
    description

Encounter:
    date
    time
    type: EncounterType
    provider_id
    facility_id

Address:
    city
    state

Demographics:
    name
    date_of_birth
    age
    gender
    address: Address
    insurance

Condition:
    code
    description

Medication:
    code
    description

Immunization:
    code
    description
    date: date

VitalMeasurement:
    code
    value: float
    unit

BloodPressure:
    systolic: VitalMeasurement
    diastolic: VitalMeasurement

CurrentVitals:
    temperature: VitalMeasurement
    heart_rate: VitalMeasurement
    blood_pressure: BloodPressure
    respiratory_rate: VitalMeasurement
    oxygen_saturation: VitalMeasurement
    weight: VitalMeasurement

BaselineVitals:
    date: date
    height: VitalMeasurement
    weight: VitalMeasurement
    bmi: VitalMeasurement
    bmi_percentile: VitalMeasurement

Vitals:
    current: CurrentVitals
    baseline: BaselineVitals

RespiratoryTest:
    code
    result

RespiratoryPanel:
    influenza_a: RespiratoryTest
    influenza_b: RespiratoryTest
    rsv: RespiratoryTest
    parainfluenza_1: RespiratoryTest
    parainfluenza_2: RespiratoryTest
    parainfluenza_3: RespiratoryTest
    rhinovirus: RespiratoryTest
    metapneumovirus: RespiratoryTest
    adenovirus: RespiratoryTest

Covid19Test:
    code
    description
    result

Laboratory:
    covid19: Covid19Test
    respiratory_panel: RespiratoryPanel

Procedure:
    code
    description
    date: date
    reasonCode
    reasonDescription

CarePlan:
    code
    description
    start: date
    stop: date
    reasonCode
    reasonDescription

PatientRecord:
    encounter: Encounter
    demographics: Demographics
    conditions: List[Condition]
    medications: List[Medication]
    immunizations: List[Immunization]
    vitals: Vitals
    laboratory: Laboratory
    procedures: List[Procedure]
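
Two of the outline entries above might translate to Pydantic classes roughly as follows. This is a partial sketch: only `VitalMeasurement` and `BloodPressure` are shown, the `str` types for `code` and `unit` are assumptions (the outline leaves them untyped), and the sample values are invented.

```python
from pydantic import BaseModel

class VitalMeasurement(BaseModel):
    code: str    # assumed to be a string code
    value: float
    unit: str    # assumed to be a free-text unit

class BloodPressure(BaseModel):
    systolic: VitalMeasurement
    diastolic: VitalMeasurement

# Invented sample values; Pydantic builds the nested objects from plain dicts.
bp = BloodPressure(
    systolic={"code": "SYS", "value": 120.0, "unit": "mm[Hg]"},
    diastolic={"code": "DIA", "value": 80.0, "unit": "mm[Hg]"},
)
```

The remaining entries would follow the same pattern, with `List[...]` fields for the repeated items in `PatientRecord`.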


In [None]:
# Example of what we want ChatGPT to output
{
    "encounter": {
        "date": 
        "time":
        "provider_id":
        "encounter_id":
    },
    …
    "medications": [
        {},
        {}
    ],
}