## Convert Disease Trajectory Data to Delphi Binary Format

This notebook converts disease trajectory data (with age at diagnosis) to the binary format required by Delphi.

**Expected Input Format:**
- CSV file with patient ID column (e.g., 'eid') 
- Disease columns containing age at diagnosis (in years, float)
- Optional: demographic columns (sex, BMI, smoking, alcohol)

**Output Format:**
- Binary file with uint32 records: [patient_id, days_from_birth, label_index]
- Sorted by patient_id, label type, then time
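
As a rough illustration of the record layout described above, the sketch below writes three made-up records to a temporary file and reads them back. The file name `example.bin` and the record values are hypothetical; the only assumption taken from this notebook is that the file is a flat sequence of `uint32` values with three fields per record and no header.

```python
import tempfile
from pathlib import Path

import numpy as np

# Three hypothetical records: [patient_id, days_from_birth, label_index]
records = np.array(
    [[1001, 0, 2],
     [1001, 18452, 583],
     [1002, 0, 1]],
    dtype=np.uint32,
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.bin"  # hypothetical file name
    records.tofile(path)              # raw bytes, row-major, no header
    size_bytes = path.stat().st_size  # 3 records * 3 fields * 4 bytes each
    # The shape is not stored in the file, so it has to be restored on read
    restored = np.fromfile(path, dtype=np.uint32).reshape(-1, 3)

print(size_bytes)
print(restored)
```

Because the file carries no header, a reader has to know the dtype and the field count in advance; reading with a different dtype would silently produce wrong values.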

In [None]:
import pandas as pd
import numpy as np
import re
from pathlib import Path
from tqdm import tqdm

# =============================================================================
# CONFIGURATION - Modify these settings for your data
# =============================================================================

# Input file path - your disease trajectory or preprocessed data file
input_file = '../../data/preprocessed/disease_trajectory.csv'  # Change this to your file

# Demographic data file (if demographics are in a separate file, set this; otherwise None)
# Set to None if demographics are already in input_file
demographic_file = '../../data/ukb_respiratory_cohort_total.csv'

# Patient ID column name in your data
patient_id_col = 'eid'  # Common alternatives: 'patient_id', 'id', 'subject_id'

# Labels file for Delphi (defines label indices)
labels_file = 'labels.csv'

# Output settings
output_folder = 'ukb_respiratory_data'
train_proportion = 0.8  # 80% train, 20% validation

# =============================================================================
# DATA FORMAT CONFIGURATION
# =============================================================================

# Choose your data format:
# 'age_columns' - columns contain age at diagnosis (like disease_trajectory.csv)
# 'date_columns' - columns contain diagnosis dates (need DOB to calculate age)
data_format = 'age_columns'

# For 'age_columns' format: how are age values represented?
# 'years' - age in years (e.g., 65.5)
# 'days' - age in days from birth
age_unit = 'years'

# Column name patterns to identify disease/event columns
# These regex patterns help auto-detect which columns are disease columns
disease_column_patterns = [
    r'.*_age$',           # Columns ending with '_age' (like disease_trajectory.csv)
    r'^date_.*',          # Columns starting with 'date_'
    r'^icd_.*',           # Columns starting with 'icd_'
]

# =============================================================================
# OPTIONAL DEMOGRAPHIC COLUMNS (set to None if not available)
# =============================================================================
# UK Biobank field IDs (from natural_text_conversion.py):
#   - Sex: p31 (0=Female, 1=Male)
#   - Birth Year: p34
#   - BMI: p21001
#   - Smoking: p20116 (0=never, 1=previous, 2=current, -3=prefer not to answer)
#   - Alcohol: p1558 (1=daily, 2=3-4x/week, 3=1-2x/week, 4=1-3x/month, 5=special, 6=never)
#   - Ethnicity: p21000
#   - Height: p50
#
# Note: UK Biobank columns may have suffixes like _i0, _i1 (instance) or _a0, _a1 (array)
# Examples: p31_i0, p21001_i0_a0, etc.
# =============================================================================

# Sex column (expects: 0=Female, 1=Male, or 'Female'/'Male')
# UK Biobank field: p31 (with possible suffix like _i0)
sex_col = 'p31'  # Found in processed_data_complete.csv

# Birth year column (used to calculate age if data has dates instead of ages)
# UK Biobank field: p34
birth_year_col = 'p34'  # Found in processed_data_complete.csv

# Date of death column
death_col = 'p40000_i0'  # 'date_of_death'

# BMI column (will be categorized: <=22=low, >22 to 28=mid, >28=high)
# UK Biobank field: p21001
bmi_col = 'p21001_i0'  # Found in processed_data_complete.csv

# Smoking status column
# UK Biobank field: p20116
# Expected values: 0=never smoked, 1=previous smoker, 2=current smoker
smoking_col = 'p20116_i0'  # Found in processed_data_complete.csv

# Alcohol intake frequency column
# UK Biobank field: p1558
# Expected values: 1=daily, 2=3-4x/week, 3=1-2x/week, 4=1-3x/month, 5=special occasions, 6=never
alcohol_col = 'p1558_i0'  # Found in processed_data_complete.csv

# Ethnicity column (optional, for reference)
# UK Biobank field: p21000
ethnicity_col = 'p21000_i0'  # Found in processed_data_complete.csv

# =============================================================================
# ADVANCED: Custom column to ICD10 mapping
# =============================================================================

# If your column names don't contain ICD10 codes, provide a custom mapping
# Format: {'column_name': 'ICD10_code', ...}
# Leave as None to auto-extract ICD10 codes from column names
custom_column_to_icd_mapping = None

# Example:
# custom_column_to_icd_mapping = {
#     'diabetes_age': 'E10',
#     'hypertension_age': 'I10',
#     'asthma_age': 'J45',
# }

print(f"Configuration loaded:")
print(f"  Input file: {input_file}")
print(f"  Demographic file: {demographic_file}")
print(f"  Patient ID column: {patient_id_col}")
print(f"  Data format: {data_format}")
print(f"  Age unit: {age_unit}")
print(f"  Output folder: {output_folder}")
print(f"\nDemographic columns configured:")
print(f"  Sex: {sex_col}")
print(f"  BMI: {bmi_col}")
print(f"  Smoking: {smoking_col}")
print(f"  Alcohol: {alcohol_col}")
print(f"  Death: {death_col}")


Configuration loaded:
  Input file: ../../../data/preprocessed/disease_trajectory.csv
  Demographic file: ../../../data/ukb_respiratory_cohort_total.csv
  Patient ID column: eid
  Data format: age_columns
  Age unit: years
  Output folder: ukb_respiratory_data

Demographic columns configured:
  Sex: p31
  BMI: p21001_i0
  Smoking: p20116_i0
  Alcohol: p1558_i0
  Death: p40000_i0


## Load Labels and Build Mapping Functions

This cell:
1. Loads the Delphi labels file to get label indices
2. Creates functions to extract ICD10 codes from column names
3. Maps column names to label indices
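
The index offset in step 1 can be seen on a miniature example. The five-line labels text below is hypothetical and stands in for the real labels file; the sketch only assumes what this notebook states, namely one label per line with `Padding` first and mapped to index -1.

```python
import io

# Hypothetical miniature labels file (the real one is read from disk)
labels_text = "Padding\nNo\nFemale\nMale\nBMI_low\n"

label_dict = {}
for i, line in enumerate(io.StringIO(labels_text)):
    name = line.strip().split(' ')[0]  # first word on the line is the label name
    if name:
        label_dict[name] = i - 1       # line 0 (Padding) maps to index -1

# Reverse lookup, handy when decoding records back into label names
index_to_label = {idx: name for name, idx in label_dict.items()}

print(label_dict)
print(index_to_label[2])
```

With this offset, `Female` and `Male` land on indices 1 and 2, which matches the sample mappings printed by the cell below.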

In [5]:
# Load labels file and create label dictionary
# Labels file format: one label per line, first word is the label name
label_dict = {}
label_names = []
with open(labels_file, 'r') as f:
    for i, line in enumerate(f):
        label_name = line.strip().split(' ')[0]
        label_dict[label_name] = i - 1  # -1 because first label (Padding) is index -1
        label_names.append(label_name)

print(f"Loaded {len(label_names)} labels from {labels_file}")
print(f"First 15 labels: {label_names[:15]}")
print(f"Sample mappings: Padding={label_dict.get('Padding')}, Female={label_dict.get('Female')}, Male={label_dict.get('Male')}")


def extract_icd10_from_column(col_name):
    """
    Extract ICD10 code from column name.
    Handles various formats like:
    - 'date_e10_first_reported_..._age' -> 'E10'
    - 'date_j45_first_reported_..._age' -> 'J45'
    - 'icd_E10' -> 'E10'
    - 'E10_diabetes' -> 'E10'
    """
    # Pattern to match ICD10 codes (letter followed by 2 digits)
    # Look for patterns like _e10_, _E10_, e10_, E10_
    patterns = [
        r'[_\s]([a-zA-Z]\d{2})[_\s]',  # _E10_ or _e10_
        r'^([a-zA-Z]\d{2})[_\s]',       # E10_ at start
        r'[_\s]([a-zA-Z]\d{2})$',       # _E10 at end
        r'date_([a-zA-Z]\d{2})_',       # date_E10_
    ]
    
    for pattern in patterns:
        match = re.search(pattern, col_name, re.IGNORECASE)
        if match:
            return match.group(1).upper()
    return None


def get_label_index(col_name, custom_mapping=None):
    """
    Get the label index for a column name.
    
    Args:
        col_name: The column name from your data
        custom_mapping: Optional dict mapping column names to ICD10 codes
    
    Returns:
        Label index or None if not found
    """
    # First check custom mapping
    if custom_mapping and col_name in custom_mapping:
        icd_code = custom_mapping[col_name]
    else:
        # Try to extract ICD10 code from column name
        icd_code = extract_icd10_from_column(col_name)
    
    if icd_code is None:
        return None
    
    # Look up in label dictionary
    return label_dict.get(icd_code)


# Test the extraction on sample column names
test_cols = [
    'date_e10_first_reported_diabetes_mellitus_130708_age',
    'date_j45_first_reported_asthma_131494_age',
    'date_i10_first_reported_essential_primary_hypertension_131286_age'
]
print("\nTest ICD10 extraction:")
for col in test_cols:
    icd = extract_icd10_from_column(col)
    label_idx = get_label_index(col)
    print(f"  {col[:50]}... -> ICD10: {icd}, Label index: {label_idx}")


Loaded 1270 labels from labels.csv
First 15 labels: ['Padding', 'No', 'Female', 'Male', 'BMI_low', 'BMI_mid', 'BMI_high', 'Smoking_low', 'Smoking_mid', 'Smoking_high', 'Alcohol_low', 'Alcohol_mid', 'Alcohol_high', 'A00', 'A01']
Sample mappings: Padding=-1, Female=1, Male=2

Test ICD10 extraction:
  date_e10_first_reported_diabetes_mellitus_130708_a... -> ICD10: E10, Label index: 213
  date_j45_first_reported_asthma_131494_age... -> ICD10: J45, Label index: 602
  date_i10_first_reported_essential_primary_hyperten... -> ICD10: I10, Label index: 498


## Load Data and Identify Disease Columns

This cell:
1. Loads your data file
2. Identifies disease/event columns based on naming patterns
3. Maps columns to Delphi label indices
4. Shows which columns will be processed
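
When demographics come from a separate file, a left merge can silently duplicate rows if a patient ID appears more than once, and patients missing from the demographic file end up with empty values. The sketch below shows one possible way to surface both cases using the `validate` and `indicator` options of `DataFrame.merge`. The two small frames are made-up data, not part of the cohort.

```python
import pandas as pd

# Hypothetical trajectory and demographic tables keyed by 'eid'
trajectories = pd.DataFrame({
    "eid": [1, 2, 3],
    "date_j45_first_reported_asthma_131494_age": [40.5, None, 62.0],
})
demographics = pd.DataFrame({"eid": [1, 2], "p31": [0, 1]})

merged = trajectories.merge(
    demographics,
    on="eid",
    how="left",
    validate="one_to_one",  # raises MergeError if 'eid' repeats on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

# Patients present in the trajectory file but absent from the demographics
missing_demo = int((merged["_merge"] == "left_only").sum())
print(f"{len(merged)} rows, {missing_demo} without demographic data")
```

Dropping the `_merge` column afterwards keeps the merged frame identical in shape to what a plain left merge would produce.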

In [14]:
# Load the data file
print(f"Loading data from: {input_file}")
df = pd.read_csv(input_file, low_memory=False)
print(f"Loaded {len(df)} patients with {len(df.columns)} columns")

# Ensure patient ID column exists
if patient_id_col not in df.columns:
    raise ValueError(f"Patient ID column '{patient_id_col}' not found in data. "
                     f"Available columns: {list(df.columns[:10])}...")

# Load and merge demographic data if separate file is specified
if demographic_file:
    print(f"\nLoading demographic data from: {demographic_file}")
    # Only load the columns we need to save memory
    demo_cols_to_load = [patient_id_col]
    for col in [sex_col, birth_year_col, bmi_col, smoking_col, alcohol_col, ethnicity_col, death_col]:
        if col:
            demo_cols_to_load.append(col)
    
    # Read only required columns
    df_demo = pd.read_csv(demographic_file, usecols=demo_cols_to_load, low_memory=False)
    print(f"  Loaded {len(df_demo)} patients with demographic columns: {demo_cols_to_load[1:]}")
    
    # Merge demographic data
    df = df.merge(df_demo, on=patient_id_col, how='left')
    print(f"  Merged demographic data into main dataframe")

# Identify disease columns based on patterns
def is_disease_column(col_name):
    """Check if column matches disease column patterns."""
    if col_name == patient_id_col:
        return False
    for pattern in disease_column_patterns:
        if re.match(pattern, col_name, re.IGNORECASE):
            return True
    return False

all_columns = df.columns.tolist()
disease_columns = [col for col in all_columns if is_disease_column(col)]
print(f"\nFound {len(disease_columns)} potential disease columns")

# Map columns to label indices
column_to_label = {}
unmapped_columns = []
for col in disease_columns:
    label_idx = get_label_index(col, custom_column_to_icd_mapping)
    if label_idx is not None:
        column_to_label[col] = label_idx
    else:
        unmapped_columns.append(col)

print(f"Successfully mapped {len(column_to_label)} columns to Delphi labels")
if unmapped_columns:
    print(f"Warning: {len(unmapped_columns)} columns could not be mapped (no matching ICD10 code in labels)")
    print(f"  First 5 unmapped: {unmapped_columns[:5]}")

# Show some mappings
print("\nSample column-to-label mappings:")
for i, (col, label_idx) in enumerate(list(column_to_label.items())[:5]):
    icd = extract_icd10_from_column(col)
    print(f"  {col[:60]}... -> {icd} -> label {label_idx}")    

Loading data from: ../../../data/preprocessed/disease_trajectory.csv
Loaded 133842 patients with 1031 columns

Loading demographic data from: ../../../data/ukb_respiratory_cohort_total.csv
  Loaded 133842 patients with demographic columns: ['p31', 'p34', 'p21001_i0', 'p20116_i0', 'p1558_i0', 'p21000_i0', 'p40000_i0']
  Merged demographic data into main dataframe

Found 1030 potential disease columns
Successfully mapped 1028 columns to Delphi labels
  First 5 unmapped: ['date_e40_first_reported_kwashiorkor_130750_age', 'date_p77_first_reported_necrotising_enterocolitis_of_foetus_and_newborn_132408_age']

Sample column-to-label mappings:
  date_a00_first_reported_cholera_130000_age... -> A00 -> label 12
  date_a01_first_reported_typhoid_and_paratyphoid_fevers_13000... -> A01 -> label 13
  date_a02_first_reported_other_salmonella_infections_130004_a... -> A02 -> label 14
  date_a03_first_reported_shigellosis_130006_age... -> A03 -> label 15
  date_a04_first_reported_other_bacterial_intest

## Convert Data to Delphi Binary Format

This cell:
1. Converts age values to days from birth
2. Creates records in [patient_id, days, label_index] format
3. Optionally adds demographic labels (sex, BMI, smoking, alcohol)
4. Sorts and exports to binary format
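
Step 1 can also be done without a Python-level loop. The sketch below is an alternative, not the code used in this notebook: `ages_to_days` is a hypothetical helper that converts a whole column at once and truncates toward zero, which for non-negative ages gives the same result as calling `int()` on each value.

```python
import numpy as np
import pandas as pd


def ages_to_days(ages: pd.Series, unit: str = "years") -> pd.Series:
    """Convert a column of ages to whole days from birth, dropping missing values."""
    factors = {"years": 365.25, "days": 1.0}
    if unit not in factors:
        raise ValueError(f"Unknown age unit: {unit}")
    valid = ages.dropna()
    # astype on floats truncates toward zero, like int() on a single value
    return (valid * factors[unit]).astype(np.int64)


# Made-up ages in years; the index stands in for row positions in the data
ages = pd.Series([50.52, np.nan, 0.0, 61.7], index=[10, 11, 12, 13])
days = ages_to_days(ages)
print(days)
```

The returned series keeps the original index, so it can be aligned with the patient ID column to build the records.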

In [17]:
# Convert age to days from birth
def age_to_days(age_value, unit='years'):
    """Convert age to days from birth."""
    if pd.isna(age_value):
        return None
    if unit == 'years':
        return int(age_value * 365.25)
    elif unit == 'days':
        return int(age_value)
    else:
        raise ValueError(f"Unknown age unit: {unit}")

# Build data list: [patient_id, days_from_birth, label_index]
print("Converting disease data to Delphi format...")
data_list = []

for col, label_idx in tqdm(column_to_label.items()):
    # Get patient IDs and age values for non-null entries
    mask = df[col].notna()
    if mask.sum() == 0:
        continue
    
    patient_ids = df.loc[mask, patient_id_col].values
    ages = df.loc[mask, col].values
    
    # Convert ages to days
    days = np.array([age_to_days(a, age_unit) for a in ages])
    
    # Filter out any None values
    valid_mask = ~pd.isna(days)
    patient_ids = patient_ids[valid_mask]
    days = days[valid_mask].astype(int)
    
    if len(patient_ids) == 0:
        continue
    
    # Create records
    labels = np.full(len(patient_ids), label_idx, dtype=int)
    records = np.column_stack([patient_ids, days, labels])
    data_list.append(records)

print(f"Processed {len(data_list)} disease columns")

# Add demographic labels if columns are specified
# =============================================================================
# Sex labels: Female=1, Male=2 (line position in labels.csv minus 1, since Padding=-1)
# UK Biobank p31: 0=Female, 1=Male
if sex_col and sex_col in df.columns:
    print(f"Adding sex data from column: {sex_col}")
    sex_data = df[[patient_id_col, sex_col]].dropna()
    if len(sex_data) > 0:
        # Map sex values to label indices
        # Assuming 0=Female, 1=Male or 'Female'/'Male' strings
        sex_mapping = {
            0: label_dict.get('Female', 1),
            1: label_dict.get('Male', 2),
            'Female': label_dict.get('Female', 1),
            'Male': label_dict.get('Male', 2),
            'F': label_dict.get('Female', 1),
            'M': label_dict.get('Male', 2),
        }
        sex_labels = sex_data[sex_col].map(sex_mapping)
        valid_mask = sex_labels.notna()
        if valid_mask.sum() > 0:
            records = np.column_stack([
                sex_data.loc[valid_mask, patient_id_col].values,
                np.zeros(valid_mask.sum(), dtype=int),  # Time 0 for demographics
                sex_labels[valid_mask].values.astype(int)
            ])
            data_list.append(records)
            print(f"  Added {len(records)} sex records")

# BMI labels: BMI_low=3, BMI_mid=4, BMI_high=5 (based on labels.csv)
# UK Biobank p21001: BMI value (categorized: <=22=low, >22 to 28=mid, >28=high)
if bmi_col and bmi_col in df.columns:
    print(f"Adding BMI data from column: {bmi_col}")
    bmi_data = df[[patient_id_col, bmi_col]].dropna()
    if len(bmi_data) > 0:
        # Categorize BMI: <=22=low, >22 to 28=mid, >28=high
        bmi_values = bmi_data[bmi_col].values
        bmi_labels = np.where(bmi_values > 28, label_dict.get('BMI_high', 5),
                     np.where(bmi_values > 22, label_dict.get('BMI_mid', 4),
                              label_dict.get('BMI_low', 3)))
        records = np.column_stack([
            bmi_data[patient_id_col].values,
            np.zeros(len(bmi_data), dtype=int),
            bmi_labels
        ])
        data_list.append(records)
        print(f"  Added {len(records)} BMI records")

# Smoking labels: Smoking_low=6, Smoking_mid=7, Smoking_high=8
# UK Biobank p20116: 0=never smoked, 1=previous smoker, 2=current smoker
if smoking_col and smoking_col in df.columns:
    print(f"Adding smoking data from column: {smoking_col}")
    smoke_data = df[[patient_id_col, smoking_col]].dropna()
    smoke_data = smoke_data[smoke_data[smoking_col] != -3]  # Remove "prefer not to answer"
    if len(smoke_data) > 0:
        smoke_values = smoke_data[smoking_col].values
        # Map smoking status: 0=never(low), 1=previous(mid), 2=current(high)
        smoke_labels = np.where(smoke_values == 2, label_dict.get('Smoking_high', 8),   # current smoker
                       np.where(smoke_values == 1, label_dict.get('Smoking_mid', 7),    # previous smoker
                                label_dict.get('Smoking_low', 6)))                       # never smoked
        records = np.column_stack([
            smoke_data[patient_id_col].values,
            np.zeros(len(smoke_data), dtype=int),
            smoke_labels
        ])
        data_list.append(records)
        print(f"  Added {len(records)} smoking records")

# Alcohol labels: Alcohol_low=9, Alcohol_mid=10, Alcohol_high=11
# UK Biobank p1558: 1=daily, 2=3-4x/week, 3=1-2x/week, 4=1-3x/month, 5=special, 6=never
if alcohol_col and alcohol_col in df.columns:
    print(f"Adding alcohol data from column: {alcohol_col}")
    alcohol_data = df[[patient_id_col, alcohol_col]].dropna()
    alcohol_data = alcohol_data[alcohol_data[alcohol_col] != -3]  # Remove "prefer not to answer"
    if len(alcohol_data) > 0:
        alc_values = alcohol_data[alcohol_col].values
        # Map alcohol: 1=daily(high), 2-3=regular(mid), 4-6=occasional/never(low)
        alc_labels = np.where(alc_values == 1, label_dict.get('Alcohol_high', 11),       # daily
                     np.where(alc_values < 4, label_dict.get('Alcohol_mid', 10),         # 3-4x or 1-2x per week
                              label_dict.get('Alcohol_low', 9)))                          # monthly/special/never
        records = np.column_stack([
            alcohol_data[patient_id_col].values,
            np.zeros(len(alcohol_data), dtype=int),
            alc_labels
        ])
        data_list.append(records)
        print(f"  Added {len(records)} alcohol records")

# Death label if available
# Death dates are in format 'YYYY-MM-DD', need to convert to days from birth using birth year
if death_col and death_col in df.columns:
    print(f"Adding death data from column: {death_col}")
    # Need birth year to calculate age at death
    if birth_year_col and birth_year_col in df.columns:
        death_data = df[[patient_id_col, death_col, birth_year_col]].dropna()
        if len(death_data) > 0:
            # Parse death dates and calculate days from birth
            death_dates = pd.to_datetime(death_data[death_col], errors='coerce')
            birth_years = death_data[birth_year_col].values
            
            # Calculate age at death in days: (death_year - birth_year) * 365.25 + day_of_year
            # Approximation: only the birth year is known, so birth is treated as 1 January
            death_days = ((death_dates.dt.year - birth_years) * 365.25 + death_dates.dt.dayofyear).values
            
            valid_mask = ~pd.isna(death_days) & (death_days > 0)
            if valid_mask.sum() > 0:
                records = np.column_stack([
                    death_data.loc[valid_mask, patient_id_col].values,
                    death_days[valid_mask].astype(int),
                    # Look up 'Death' directly: a fallback of 1 would collide with the Female label
                    np.full(valid_mask.sum(), label_dict['Death'])
                ])
                data_list.append(records)
                print(f"  Added {len(records)} death records")
    else:
        print(f"  Warning: birth_year_col '{birth_year_col}' not found, cannot calculate age at death")

# Combine all data
if not data_list:
    raise ValueError("No data records created! Check your column mappings.")

data = np.vstack(data_list)
print(f"\nTotal records before filtering: {len(data)}")

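# Sort: by patient_id, then death events last, then by time
# (np.lexsort treats its last key as the primary sort key).
# This assumes 'Death' is the highest label index present in the data.
# This ensures proper ordering for Delphi training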
max_label = data[:, 2].max()
data = data[np.lexsort((data[:, 1], data[:, 2] == max_label, data[:, 0]))]

# Filter out negative times (invalid data)
data = data[data[:, 1] >= 0]
print(f"Records after removing negative times: {len(data)}")

# Remove duplicates (same patient, same label)
data_df = pd.DataFrame(data, columns=['patient_id', 'days', 'label'])
data_df = data_df.drop_duplicates(subset=['patient_id', 'label'])
data = data_df.values

print(f"Records after removing duplicates: {len(data)}")

# Convert to uint32 for binary output
data = data.astype(np.uint32)

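# Save full dataset (create the output directory first if it does not exist)
Path(f"../{output_folder}").mkdir(parents=True, exist_ok=True)
output_file = f"../{output_folder}/full.bin"
data.tofile(output_file)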
print(f"\nSaved full dataset: {output_file} ({len(data)} records)")

# Split into train and validation by patient (IDs in ascending order, not shuffled)
unique_ids = np.unique(data[:, 0])
split_idx = int(len(unique_ids) * train_proportion)
train_ids = unique_ids[:split_idx]

# Vectorized membership test instead of a per-record Python loop
train_mask = np.isin(data[:, 0], train_ids)
train_data = data[train_mask]
val_data = data[~train_mask]

train_file = f"../{output_folder}/train.bin"
val_file = f"../{output_folder}/val.bin"

train_data.tofile(train_file)
val_data.tofile(val_file)

print(f"Saved training data: {train_file} ({len(train_data)} records, {len(train_ids)} patients)")
print(f"Saved validation data: {val_file} ({len(val_data)} records, {len(unique_ids) - len(train_ids)} patients)")

# Summary statistics
print("\n=== Summary ===")
print(f"Total patients: {len(unique_ids)}")
print(f"Total disease events: {len(data)}")
print(f"Unique labels used: {len(set(data[:, 2]))}")
print(f"Train/Val split: {train_proportion*100:.0f}% / {(1-train_proportion)*100:.0f}%")


Converting disease data to Delphi format...


100%|██████████| 1028/1028 [00:24<00:00, 42.01it/s] 


Processed 998 disease columns
Adding sex data from column: p31
  Added 133842 sex records
Adding BMI data from column: p21001_i0
  Added 132583 BMI records
Adding smoking data from column: p20116_i0
  Added 132813 smoking records
Adding alcohol data from column: p1558_i0
  Added 133308 alcohol records
Adding death data from column: p40000_i0
  Added 34951 death records

Total records before filtering: 2785572
Records after removing negative times: 2785572
Records after removing duplicates: 2785572

Saved full dataset: ../ukb_respiratory_data/full.bin (2785572 records)
Saved training data: ../ukb_respiratory_data/train.bin (2229038 records, 107073 patients)
Saved validation data: ../ukb_respiratory_data/val.bin (556534 records, 26769 patients)

=== Summary ===
Total patients: 133842
Total disease events: 2785572
Unique labels used: 1010
Train/Val split: 80% / 20%
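
The sort in the conversion cell relies on `np.lexsort`, which takes its keys in reverse priority order: the last key is the primary one. The sketch below reproduces the same key order on a few made-up records. The label index 9 for death is hypothetical and chosen only for this example.

```python
import numpy as np

DEATH = 9  # hypothetical label index standing in for the death label

# Made-up records: [patient_id, days_from_birth, label_index]
data = np.array([
    [2, 500, 9],
    [1, 300, 4],
    [1, 0, 2],
    [2, 0, 1],
    [1, 300, 9],   # death record for patient 1
    [1, 400, 5],   # an event dated after the death record
])

# Keys from lowest to highest priority: time, is-death flag, patient_id
order = np.lexsort((data[:, 1], data[:, 2] == DEATH, data[:, 0]))
sorted_data = data[order]
print(sorted_data)
```

For patient 1 the death record comes last even though another event carries a later time, because the is-death flag outranks time in the key order.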


## Verify Output (Optional)
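
Beyond printing sample records, two structural properties of the file can be checked directly: patient IDs should be non-decreasing, and no patient should carry the same label twice. The sketch below is one possible check; `check_records` is a hypothetical helper and the small array is made-up data containing a deliberate duplicate.

```python
import numpy as np


def check_records(records: np.ndarray) -> dict:
    """Report whether records are grouped by patient and free of repeated labels."""
    # Cast before differencing: uint32 subtraction would wrap around on a decrease
    patient_ids = records[:, 0].astype(np.int64)
    pairs = records[:, [0, 2]]  # (patient_id, label_index) pairs
    n_unique_pairs = len(np.unique(pairs, axis=0))
    return {
        "patients_sorted": bool(np.all(np.diff(patient_ids) >= 0)),
        "no_duplicate_labels": n_unique_pairs == len(records),
    }


# Made-up records; the last two rows repeat the same patient and label
example = np.array(
    [[1, 0, 2], [1, 100, 50], [2, 0, 1], [2, 0, 1]],
    dtype=np.uint32,
)
result = check_records(example)
print(result)
```

The same function could be applied to the array read back from the binary file in the cell below.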


In [18]:
# Verify the output by reading back the binary file
print("=== Verification ===\n")

# Read back the data
verify_data = np.fromfile(output_file, dtype=np.uint32).reshape(-1, 3)
print(f"Read {len(verify_data)} records from {output_file}")

# Show sample records
print("\nFirst 10 records [patient_id, days_from_birth, label_index]:")
print(verify_data[:10])

# Show records for a single patient
sample_patient = verify_data[0, 0]
patient_records = verify_data[verify_data[:, 0] == sample_patient]
print(f"\nAll records for patient {sample_patient}:")
print(patient_records)

# Decode labels for the sample patient
print(f"\nDecoded events for patient {sample_patient}:")
for record in patient_records:
    pid, days, label_idx = record
    # Find label name
    label_name = label_names[label_idx + 1] if label_idx >= -1 else "Unknown"  # +1 because index -1 is Padding
    age_years = days / 365.25
    print(f"  Age {age_years:.1f} years ({days} days): {label_name}")

# Show label distribution
print("\nLabel distribution (top 20):")
unique_labels, counts = np.unique(verify_data[:, 2], return_counts=True)
sorted_idx = np.argsort(-counts)[:20]
for idx in sorted_idx:
    label_idx = unique_labels[idx]
    count = counts[idx]
    label_name = label_names[label_idx + 1] if label_idx >= -1 else "Unknown"
    print(f"  {label_name}: {count} records")


=== Verification ===

Read 2785572 records from ../ukb_respiratory_data/full.bin

First 10 records [patient_id, days_from_birth, label_index]:
[[1000113       0       2]
 [1000113       0       4]
 [1000113       0       7]
 [1000113       0      11]
 [1000113   18452     583]
 [1000113   20874     444]
 [1000113   22525     832]
 [1000113   23723     679]
 [1000113   23723     785]
 [1000113   25256    1268]]

All records for patient 1000113:
[[1000113       0       2]
 [1000113       0       4]
 [1000113       0       7]
 [1000113       0      11]
 [1000113   18452     583]
 [1000113   20874     444]
 [1000113   22525     832]
 [1000113   23723     679]
 [1000113   23723     785]
 [1000113   25256    1268]]

Decoded events for patient 1000113:
  Age 0.0 years (0 days): Male
  Age 0.0 years (0 days): BMI_mid
  Age 0.0 years (0 days): Smoking_mid
  Age 0.0 years (0 days): Alcohol_high
  Age 50.5 years (18452 days): J18
  Age 57.1 years (20874 days): H33
  Age 61.7 years (22525 days): M