# Enhanced Aggregation Script for EHR Data based on Deep Learning Research
<b>Title: Multi-layer Representation Learning for Medical Concepts</b>

The paper introduces Med2Vec, a method for learning vector representations of medical codes and visits from EHR data using a multi-layer neural network. Instead of collapsing a patient's record into a single summary statistic, Med2Vec exploits both the co-occurrence of codes within a visit and the sequential order of visits over time.

<b>Key Methodologies:</b>

- Embedding Medical Codes:
  - Uses embedding layers to represent medical codes (diagnoses, procedures, medications) in a continuous vector space.
  - Captures similarities between codes based on their co-occurrence in patient visits.
- Sequential Modeling:
  - Med2Vec itself uses a feed-forward (multi-layer perceptron) network trained to predict the codes of neighboring visits; Recurrent Neural Networks (RNNs) are a common alternative for modeling the same visit sequences.
  - Models patient history as a sequence of visits, each containing multiple medical codes.
- Patient Representation:
  - Patient-level embeddings can be generated by aggregating visit embeddings.
  - Provides a dense representation that, depending on the aggregation used, can retain temporal and sequential information.
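
The code-to-visit embedding step described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the paper's implementation: the vocabulary size, the embedding dimension, the random weights, and the helper names `multi_hot` and `embed_visit` are all assumptions made for the example. Each visit is encoded as a multi-hot vector over the code vocabulary and projected through a ReLU layer.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 6   # assumed number of distinct medical codes
embed_dim = 4    # assumed embedding dimension

# Randomly initialised parameters; in a real model these are learned.
W_c = rng.normal(scale=0.1, size=(embed_dim, vocab_size))
b_c = np.zeros(embed_dim)


def multi_hot(code_ids, size):
    """Encode a visit (list of integer code IDs) as a multi-hot vector."""
    x = np.zeros(size)
    x[list(set(code_ids))] = 1.0
    return x


def embed_visit(code_ids):
    """Project a multi-hot visit vector into the embedding space with a ReLU."""
    x = multi_hot(code_ids, vocab_size)
    return np.maximum(W_c @ x + b_c, 0.0)


# One hypothetical patient with two visits.
patient = [[0, 2, 3], [1, 3]]
visit_vectors = np.stack([embed_visit(v) for v in patient])

# A simple patient-level summary: the mean of the visit embeddings.
patient_vector = visit_vectors.mean(axis=0)
print(visit_vectors.shape, patient_vector.shape)
```

Averaging is the simplest aggregation but discards visit order; feeding `visit_vectors` to a recurrent or attention-based model is one way to keep it.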

<b>Implementation in the Updated Script:</b>

- Visit-Level Aggregation:
  - Instead of aggregating all events at the patient level, the script now aggregates data at each visit (encounter).
  - Each patient's history is represented as a sequence of visits, preserving temporal order.
- Medical Code Embeddings:
  - Assigns unique integer identifiers to medical codes (conditions, medications, procedures, observations), prefixing each code with its type so that codes from different vocabularies cannot collide.
  - Prepares the data for embedding layers in deep learning models.
- Sequence Data Preparation:
  - Structures the data to be compatible with models like RNNs or Transformers.
  - Each patient's data is a list of visits, where each visit is a list of medical code IDs.
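
To make the target structure concrete, the sketch below runs the same visit-level grouping on a tiny in-memory dataset. The encounter and code tables are hypothetical stand-ins for the Synthea CSV files (the IDs and code strings are made up), and the column names mirror those used by the script.

```python
import pandas as pd

# Hypothetical encounters for two patients.
encounters = pd.DataFrame({
    'Id': ['e1', 'e2', 'e3'],
    'PATIENT': ['p1', 'p1', 'p2'],
    'START': pd.to_datetime(['2020-01-05', '2020-03-10', '2021-07-01']),
})

# Hypothetical codes recorded during those encounters.
codes = pd.DataFrame({
    'ENCOUNTER': ['e1', 'e1', 'e2', 'e3'],
    'UNIQUE_CODE': ['condition_C1', 'medication_M1',
                    'condition_C1', 'procedure_P1'],
})

# Map each distinct code to an integer ID in order of first appearance.
code_to_id = {code: idx for idx, code in enumerate(codes['UNIQUE_CODE'].unique())}
codes['CODE_ID'] = codes['UNIQUE_CODE'].map(code_to_id)

# Group codes by visit, then visits by patient in chronological order.
visit_codes = codes.groupby('ENCOUNTER')['CODE_ID'].apply(list).to_dict()
encounters = encounters.sort_values(['PATIENT', 'START'])
patient_sequences = {
    patient_id: [visit_codes[e] for e in group['Id'] if e in visit_codes]
    for patient_id, group in encounters.groupby('PATIENT')
}

print(patient_sequences)
```

Here `p1` ends up with two visits (the first containing two codes) and `p2` with a single visit, which is the nested list-of-lists shape the full script produces for every patient.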


In [2]:
# optimized_aggregation_script_with_observations.py

import pandas as pd
import numpy as np
from datetime import datetime
from collections import defaultdict
import os

# ------------------------------
# 1. Load Data
# ------------------------------

# 1.1 Load Patients Data
patients = pd.read_csv(r"E:\DataGen\synthea\output\csv\patients.csv", usecols=[
    'Id', 'BIRTHDATE', 'DEATHDATE', 'GENDER', 'RACE', 'ETHNICITY',
    'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE', 'INCOME'
])

# Convert BIRTHDATE and DEATHDATE to datetime without timezone
patients['BIRTHDATE'] = pd.to_datetime(patients['BIRTHDATE']).dt.tz_localize(None)
patients['DEATHDATE'] = pd.to_datetime(patients['DEATHDATE']).dt.tz_localize(None)

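# Calculate age in years. For deceased patients, measure up to the date of
# death rather than today so that AGE is not inflated.
current_date = pd.Timestamp(datetime.now())
age_end_date = patients['DEATHDATE'].fillna(current_date)
patients['AGE'] = (age_end_date - patients['BIRTHDATE']).dt.days / 365.25
# Patients with a missing birth date get AGE = 0
patients['AGE'] = patients['AGE'].fillna(0)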

# Calculate if patient is deceased
patients['DECEASED'] = patients['DEATHDATE'].notnull().astype(int)

# Drop unnecessary columns
patients.drop(columns=['BIRTHDATE', 'DEATHDATE'], inplace=True)

# ------------------------------
# 2. Prepare Visit-Level Data
# ------------------------------

# 2.1 Load Encounters Data (Visits)
encounters = pd.read_csv(r"E:\DataGen\synthea\output\csv\encounters.csv", usecols=[
    'Id', 'PATIENT', 'ENCOUNTERCLASS', 'START', 'STOP', 'REASONCODE', 'REASONDESCRIPTION'
])

# Convert START and STOP to datetime without timezone
encounters['START'] = pd.to_datetime(encounters['START']).dt.tz_localize(None)
encounters['STOP'] = pd.to_datetime(encounters['STOP']).dt.tz_localize(None)

# Sort encounters by patient and start date
encounters.sort_values(by=['PATIENT', 'START'], inplace=True)

# ------------------------------
# 3. Aggregate Codes at Each Visit
# ------------------------------

# 3.1 Load Conditions Data
conditions = pd.read_csv(r"E:\DataGen\synthea\output\csv\conditions.csv", usecols=[
    'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION'
])

# 3.2 Load Medications Data
medications = pd.read_csv(r"E:\DataGen\synthea\output\csv\medications.csv", usecols=[
    'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION'
])

# 3.3 Load Procedures Data
procedures = pd.read_csv(r"E:\DataGen\synthea\output\csv\procedures.csv", usecols=[
    'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION'
])

# 3.4 Load Observations Data
observations = pd.read_csv(r"E:\DataGen\synthea\output\csv\observations.csv", usecols=[
    'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION'
])

# 3.5 Combine All Codes into a Single DataFrame
codes = pd.concat([
    conditions[['ENCOUNTER', 'CODE', 'DESCRIPTION']].assign(TYPE='condition'),
    medications[['ENCOUNTER', 'CODE', 'DESCRIPTION']].assign(TYPE='medication'),
    procedures[['ENCOUNTER', 'CODE', 'DESCRIPTION']].assign(TYPE='procedure'),
    observations[['ENCOUNTER', 'CODE', 'DESCRIPTION']].assign(TYPE='observation')
], ignore_index=True)

# ------------------------------
# 4. Map Codes to Unique Identifiers
# ------------------------------

# Handle missing codes
codes['CODE'] = codes['CODE'].fillna('UNKNOWN')

# Create a unified code system
codes['UNIQUE_CODE'] = codes['TYPE'] + '_' + codes['CODE'].astype(str)

# Generate a mapping from UNIQUE_CODE to integer IDs
unique_codes = codes['UNIQUE_CODE'].unique()
code_to_id = {code: idx for idx, code in enumerate(unique_codes)}
id_to_code = {idx: code for code, idx in code_to_id.items()}

# Map codes to IDs
codes['CODE_ID'] = codes['UNIQUE_CODE'].map(code_to_id)

# ------------------------------
# 5. Build Visit Sequences for Each Patient (Optimised)
# ------------------------------

# Create a mapping from ENCOUNTER to CODE_IDs
encounter_code_map = codes.groupby('ENCOUNTER')['CODE_ID'].apply(list).to_dict()

# Create a mapping from PATIENT to ENCOUNTER IDs
patient_encounter_map = encounters.groupby('PATIENT')['Id'].apply(list).to_dict()

# Initialize a dictionary to hold patient sequences
patient_sequences = {}

# Build sequences
for patient_id, encounter_ids in patient_encounter_map.items():
    patient_visits = []
    for visit_id in encounter_ids:
        visit_codes = encounter_code_map.get(visit_id, [])
        if visit_codes:
            patient_visits.append(visit_codes)
    if patient_visits:
        patient_sequences[patient_id] = patient_visits


'''
# Initialize a dictionary to hold patient sequences
patient_sequences = defaultdict(list)

# Iterate over encounters to build sequences
for patient_id, group in encounters.groupby('PATIENT'):
    patient_visits = []
    for _, encounter in group.iterrows():
        visit_id = encounter['Id']
        visit_codes = codes.loc[codes['ENCOUNTER'] == visit_id, 'CODE_ID'].tolist()
        if visit_codes:
            patient_visits.append(visit_codes)
    if patient_visits:
        patient_sequences[patient_id] = patient_visits
'''
# ------------------------------
# 6. Prepare Data for Deep Learning Models
# ------------------------------

# Convert patient_sequences to a DataFrame
patient_sequence_df = pd.DataFrame([
    {'PATIENT': patient_id, 'SEQUENCE': visits}
    for patient_id, visits in patient_sequences.items()
])

# Merge with patient demographics
patient_data = patients.merge(patient_sequence_df, how='inner', left_on='Id', right_on='PATIENT')

# Drop redundant 'PATIENT' column
patient_data.drop(columns=['PATIENT'], inplace=True)

# ------------------------------
# 7. Save Processed Data
# ------------------------------

# Create a directory to save data if it doesn't exist
output_dir = 'Data'
os.makedirs(output_dir, exist_ok=True)

# Save code mappings
code_mappings = pd.DataFrame(list(code_to_id.items()), columns=['UNIQUE_CODE', 'CODE_ID'])
code_mappings.to_csv(os.path.join(output_dir, 'code_mappings.csv'), index=False)

# Save patient data with sequences
patient_data.to_pickle(os.path.join(output_dir, 'patient_data_sequences.pkl'))

print("Optimized aggregation complete. Data saved for deep learning models.")


Optimized aggregation complete. Data saved for deep learning models.
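
Before the saved sequences can be fed to an RNN or Transformer, the ragged visit lists usually need to be padded to a fixed shape. The following is one possible sketch and is not part of the script above: the helper `pad_patient_sequences` and the choice of padding ID are assumptions. Because `code_to_id` starts numbering at 0, the sketch pads with the vocabulary size (an ID no real code uses) rather than with 0.

```python
import numpy as np


def pad_patient_sequences(sequences, pad_id):
    """Pad nested visit lists to a dense (patients, visits, codes) array.

    Returns the padded array and a boolean mask marking real codes.
    """
    max_visits = max(len(visits) for visits in sequences)
    max_codes = max(len(codes) for visits in sequences for codes in visits)
    padded = np.full((len(sequences), max_visits, max_codes), pad_id, dtype=np.int64)
    for i, visits in enumerate(sequences):
        for j, codes in enumerate(visits):
            padded[i, j, :len(codes)] = codes
    mask = padded != pad_id
    return padded, mask


# Toy stand-in for patient_data['SEQUENCE'].tolist(); in practice the column
# would come from pd.read_pickle on the file saved by the script.
sequences = [[[0, 1], [0]], [[2]]]
vocab_size = 3          # stands in for len(code_to_id)
pad_id = vocab_size     # first ID not used by any real code

padded, mask = pad_patient_sequences(sequences, pad_id)
print(padded.shape)
print(mask.sum(axis=(1, 2)))  # number of real codes per patient
```

With this convention an embedding layer needs `vocab_size + 1` rows, and the mask can be used to exclude padded positions from pooling or attention.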
