# 🔧 Step 2: Data Preprocessing and Feature Engineering
## Advanced Data Transformation for Sepsis Prediction

---

### 🎯 **Objectives**
- **Data Cleaning**: Handle missing values with clinical-informed strategies
- **Feature Engineering**: Create temporal and clinical domain features
- **Data Normalization**: Standardize features for machine learning
- **Quality Assurance**: Validate preprocessing pipeline integrity

---

### 🏥 **Clinical-Informed Preprocessing Strategy**

#### **Missing Value Handling** 🩺
| **Clinical Context** | **Imputation Strategy** | **Rationale** |
|---------------------|------------------------|---------------|
| **Vital Signs** | Clip to clinical bounds, forward-fill within patient, then normal value | Maintains physiological continuity |
| **Lab Values** | Forward-fill within patient, then cohort median | Carries the last result until the next test |
| **Blood Gas** | Forward-fill within patient, then normal value or cohort median | Carries the last measurement forward between sparse samples |
| **Demographics** | Median | Stable patient characteristics |
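
The forward-fill strategy for vital signs can be sketched on a toy frame as follows. This is a minimal illustration, not the notebook's imputer: the data are made up, and the fallback of 80 bpm is the heart-rate entry from the `NORMAL_VALUES` dictionary defined later.

```python
import numpy as np
import pandas as pd

# Toy data: two patients, hourly rows, gaps in heart rate (HR).
toy = pd.DataFrame({
    "PatientID": ["p1", "p1", "p1", "p2", "p2"],
    "ICULOS":    [1, 2, 3, 1, 2],
    "HR":        [np.nan, 90.0, np.nan, np.nan, np.nan],
})
toy = toy.sort_values(["PatientID", "ICULOS"])

# Forward-fill within each patient so values never cross patient boundaries.
toy["HR_ffill"] = toy.groupby("PatientID")["HR"].ffill()

# Fallback for rows with no earlier observation (assumed normal value: 80 bpm).
toy["HR_imputed"] = toy["HR_ffill"].fillna(80.0)

print(toy)
```

Grouping by `PatientID` before filling is what keeps one patient's last reading from spilling into the next patient's first hours.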

#### **Feature Engineering Categories** 🧬
1. **Temporal Features**: Trends, slopes, variability measures
2. **Clinical Ratios**: Shock index, oxygen ratios, perfusion indicators
3. **Statistical Features**: Rolling statistics, percentiles, outlier indicators
4. **Time-Since Features**: Time since abnormal values, admission time
5. **Interaction Features**: Multi-organ system interactions
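
As an illustration of a time-since feature, the sketch below computes the hours elapsed since a patient's most recent abnormal heart rate. The toy data and the 120 bpm threshold are assumptions chosen for the example, not values taken from this pipeline.

```python
import pandas as pd

toy = pd.DataFrame({
    "PatientID": ["p1"] * 5 + ["p2"] * 3,
    "ICULOS":    [1, 2, 3, 4, 5, 1, 2, 3],
    "HR":        [80, 130, 85, 90, 125, 70, 75, 72],
})

# Assumed threshold for illustration only: HR above 120 bpm counts as abnormal.
toy["HR_abnormal"] = toy["HR"] > 120

# ICU hour of the most recent abnormal reading, carried forward within each patient.
last_abnormal_hour = (
    toy["ICULOS"].where(toy["HR_abnormal"])
    .groupby(toy["PatientID"])
    .ffill()
)

# Hours elapsed since that reading; NaN means no abnormal value has been seen yet.
toy["HR_hours_since_abnormal"] = toy["ICULOS"] - last_abnormal_hour

print(toy)
```

Rows before the first abnormal reading stay `NaN`, so a downstream step still has to decide how to encode "never abnormal so far".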

---

### 🔬 **Advanced Feature Engineering**

#### **Temporal Dynamics** ⏱️
- **Trend Analysis**: Slope calculations over sliding windows
- **Variability Metrics**: Standard deviation, coefficient of variation
- **Change Point Detection**: Sudden physiological changes
- **Time-Series Decomposition**: Trend, seasonality, residuals
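
A sliding-window slope and a coefficient of variation can be sketched with NumPy alone. The helper name `rolling_slope` and the sample heart rates are hypothetical; the window of three points matches the 3-hour trend used later in the notebook.

```python
import numpy as np

def rolling_slope(values, window=3):
    """Least-squares slope over a trailing window; 0.0 when fewer than 2 points."""
    values = np.asarray(values, dtype=float)
    slopes = np.zeros(len(values))
    for i in range(len(values)):
        y = values[max(0, i - window + 1): i + 1]
        if len(y) > 1:
            x = np.arange(len(y))
            slopes[i] = np.polyfit(x, y, 1)[0]  # first coefficient is the slope
    return slopes

hr = [80, 82, 84, 90, 90]
slopes = rolling_slope(hr)

# Coefficient of variation: sample standard deviation relative to the mean.
cv = np.std(hr, ddof=1) / np.mean(hr)

print(slopes, round(cv, 4))
```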

#### **Clinical Scoring Systems** 📊
- **SOFA-inspired Features**: Organ dysfunction indicators
- **NEWS-based Features**: Early warning score components
- **Custom Sepsis Indicators**: Domain-specific risk markers
- **Multi-organ Integration**: Cross-system interaction patterns
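
One way to turn a lab value into an organ dysfunction indicator is to bin it against fixed cutoffs. The sketch below scores creatinine from 0 to 4 with `np.digitize`. The cutoffs (1.2, 2.0, 3.5, 5.0 mg/dL) are the creatinine bands commonly quoted for SOFA and are an assumption here; urine output is ignored, and the renal score built later in this notebook stops at 3.

```python
import numpy as np

# Toy creatinine values in mg/dL.
creatinine = np.array([0.8, 1.2, 1.9, 2.0, 3.4, 3.5, 5.0])

# Assumed lower edges of score bands 1 to 4.
cutoffs = [1.2, 2.0, 3.5, 5.0]

# digitize returns how many cutoffs each value has reached (value >= cutoff).
renal_score = np.digitize(creatinine, cutoffs)

print(renal_score)
```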

#### **Statistical Transformations** 📈
- **Normalization**: Z-score, Min-Max, Robust scaling
- **Distribution Adjustment**: Log, Box-Cox transformations
- **Outlier Handling**: Clinical bounds, statistical methods
- **Feature Scaling**: Unit standardization, clinical range mapping
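
The scaling options listed above behave very differently when a feature has extreme values. The sketch below applies each one by hand to a toy lactate series with a single extreme reading; the numbers are invented for illustration.

```python
import numpy as np

lactate = np.array([1.0, 1.2, 1.5, 2.0, 12.0])  # toy values, one extreme reading

# Z-score: centre on the mean, scale by the standard deviation.
z = (lactate - lactate.mean()) / lactate.std()

# Robust scaling: centre on the median, scale by the interquartile range.
q1, q3 = np.percentile(lactate, [25, 75])
robust = (lactate - np.median(lactate)) / (q3 - q1)

# Min-max: map the observed range onto [0, 1].
minmax = (lactate - lactate.min()) / (lactate.max() - lactate.min())

# Log transform: compresses the right tail before any scaling.
logged = np.log1p(lactate)

print(z.round(2), robust.round(3), minmax.round(3), logged.round(3), sep="\n")
```

With robust scaling the typical readings stay within about one unit of zero while the extreme one stands out clearly. Min-max scaling instead squeezes the typical readings into a narrow band near zero.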

---

### 📋 **Pipeline Architecture**
1. **Data Loading & Validation**
2. **Missing Value Analysis & Imputation**
3. **Temporal Feature Engineering**
4. **Clinical Feature Creation**
5. **Statistical Transformations**
6. **Feature Selection & Validation**
7. **Data Export for Modeling**

---

### 🎯 **Expected Outputs**
- Clean, imputed dataset ready for modeling
- Rich feature set with temporal and clinical insights
- Preprocessing pipeline for production deployment
- Feature importance and correlation analysis
- Data quality validation reports

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import glob
import os
from pathlib import Path
import pickle
from datetime import datetime

from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from scipy import stats
from scipy.interpolate import interp1d
import joblib

warnings.filterwarnings('ignore')
np.random.seed(42)

print("Libraries imported successfully!")

Libraries imported successfully!


## 🛠️ Environment Setup and Data Loading

This section sets the input and output paths and loads the raw patient files that the preprocessing and feature engineering steps operate on.

In [2]:
# Configuration and paths
DATA_PATH = r"C:\Users\sachi\Desktop\Sepsis STFT\data\raw\training_setA (1)"
PROCESSED_DATA_PATH = r"C:\Users\sachi\Desktop\Sepsis STFT\data\processed"
MODEL_PATH = r"C:\Users\sachi\Desktop\Sepsis STFT\models"

# Create directories if they don't exist
os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)
os.makedirs(MODEL_PATH, exist_ok=True)

print(f"Data path: {DATA_PATH}")
print(f"Processed data will be saved to: {PROCESSED_DATA_PATH}")
print(f"Models will be saved to: {MODEL_PATH}")

Data path: C:\Users\sachi\Desktop\Sepsis STFT\data\raw\training_setA (1)
Processed data will be saved to: C:\Users\sachi\Desktop\Sepsis STFT\data\processed
Models will be saved to: C:\Users\sachi\Desktop\Sepsis STFT\models


In [3]:
# Data loading functions
def load_psv_file(filepath):
    """Load a single PSV file and add patient ID"""
    try:
        df = pd.read_csv(filepath, sep='|')
        patient_id = os.path.basename(filepath).replace('.psv', '')
        df['PatientID'] = patient_id
        return df
    except Exception as e:
        print(f"Error loading {filepath}: {e}")
        return None

def load_all_data(data_path, max_patients=None):
    """Load all PSV files and combine them"""
    # Sort so the max_patients subset is the same on every run and file system
    psv_files = sorted(glob.glob(os.path.join(data_path, "*.psv")))
    
    if max_patients:
        psv_files = psv_files[:max_patients]
    
    print(f"Loading {len(psv_files)} patient files...")
    
    data_list = []
    failed_files = []
    
    for i, file in enumerate(psv_files):
        if i % 1000 == 0:
            print(f"Loaded {i}/{len(psv_files)} files...")
        
        df = load_psv_file(file)
        if df is not None:
            data_list.append(df)
        else:
            failed_files.append(file)
    
    if failed_files:
        print(f"Failed to load {len(failed_files)} files")
    
    if data_list:
        combined_data = pd.concat(data_list, ignore_index=True)
        print(f"Successfully loaded {len(data_list)} files")
        print(f"Combined dataset shape: {combined_data.shape}")
        return combined_data
    else:
        raise ValueError("No data loaded successfully")

# Load subset for development (first 1000 patients)
print("Loading data subset for preprocessing development...")
data = load_all_data(DATA_PATH, max_patients=1000)
print(f"Data loaded successfully: {data.shape}")
print(f"Unique patients: {data['PatientID'].nunique()}")

Loading data subset for preprocessing development...
Loading 1000 patient files...
Loaded 0/1000 files...
Successfully loaded 1000 files
Combined dataset shape: (38809, 42)
Data loaded successfully: (38809, 42)
Unique patients: 1000
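
For reference, the sketch below parses a small pipe-separated sample in memory, using the same `pd.read_csv(..., sep='|')` call as `load_psv_file`. The sample rows and the patient ID `p000001` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-hour record in PSV form; literal "NaN" marks a missing value.
psv_text = (
    "HR|O2Sat|Temp|ICULOS|SepsisLabel\n"
    "NaN|NaN|NaN|1|0\n"
    "97|95|NaN|2|0\n"
    "89|99|36.1|3|0\n"
)

df = pd.read_csv(io.StringIO(psv_text), sep="|")
df["PatientID"] = "p000001"  # in the notebook this comes from the file name

print(df)
print(df.isna().sum())
```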


In [4]:
# Define feature groups for medical domain knowledge
VITAL_SIGNS = ['HR', 'O2Sat', 'Temp', 'SBP', 'MAP', 'DBP', 'Resp']
LAB_VALUES = ['AST', 'BUN', 'Alkalinephos', 'Calcium', 'Chloride', 'Creatinine',
              'Bilirubin_direct', 'Glucose', 'Lactate', 'Magnesium', 'Phosphate',
              'Potassium', 'Bilirubin_total', 'TroponinI', 'Hct', 'Hgb', 'PTT',
              'WBC', 'Fibrinogen', 'Platelets']
GAS_ANALYSIS = ['EtCO2', 'BaseExcess', 'HCO3', 'FiO2', 'pH', 'PaCO2', 'SaO2']
DEMOGRAPHICS = ['Age', 'Gender']
CLINICAL_CONTEXT = ['Unit1', 'Unit2', 'HospAdmTime', 'ICULOS']
TARGET = ['SepsisLabel']

# Medical reference ranges for outlier detection
MEDICAL_RANGES = {
    'HR': (30, 200),
    'O2Sat': (70, 100),
    'Temp': (30, 45),
    'SBP': (50, 300),
    'MAP': (30, 200),
    'DBP': (20, 150),
    'Resp': (5, 50),
    'Age': (0, 120),
    'pH': (6.5, 8.0),
    'Glucose': (20, 800)
}

# Normal values for medical imputation
NORMAL_VALUES = {
    'HR': 80,
    'O2Sat': 98,
    'Temp': 36.5,
    'SBP': 120,
    'MAP': 90,
    'DBP': 80,
    'Resp': 16,
    'pH': 7.4,
    'FiO2': 0.21
}

print(f"Feature groups defined:")
print(f"- Vital signs: {len(VITAL_SIGNS)}")
print(f"- Lab values: {len(LAB_VALUES)}")
print(f"- Gas analysis: {len(GAS_ANALYSIS)}")
print(f"- Demographics: {len(DEMOGRAPHICS)}")
print(f"- Clinical context: {len(CLINICAL_CONTEXT)}")

Feature groups defined:
- Vital signs: 7
- Lab values: 20
- Gas analysis: 7
- Demographics: 2
- Clinical context: 4


In [5]:
# Data quality assessment and cleaning
def assess_data_quality(df):
    """Assess and report data quality issues"""
    print("=== DATA QUALITY ASSESSMENT ===")
    
    # Missing data analysis
    missing_data = df.isnull().sum().sort_values(ascending=False)
    missing_percent = (missing_data / len(df) * 100).round(2)
    
    print(f"\nMissing data summary:")
    high_missing = missing_percent[missing_percent > 50]
    if len(high_missing) > 0:
        print(f"Features with >50% missing: {len(high_missing)}")
        print(high_missing.head(10))
    
    # Duplicates
    duplicates = df.duplicated().sum()
    print(f"\nDuplicate rows: {duplicates}")
    
    # Data types
    print(f"\nData types:")
    print(df.dtypes.value_counts())
    
    return missing_data, missing_percent

missing_data, missing_percent = assess_data_quality(data)

=== DATA QUALITY ASSESSMENT ===

Missing data summary:
Features with >50% missing: 28
EtCO2               100.00
TroponinI            99.84
Bilirubin_direct     99.84
Fibrinogen           99.15
Bilirubin_total      98.86
Alkalinephos         98.54
AST                  98.50
Lactate              96.40
Calcium              95.32
PTT                  95.31
dtype: float64

Duplicate rows: 0

Data types:
float64    38
int64       3
object      1
Name: count, dtype: int64


In [6]:
# Outlier detection and treatment using medical knowledge
def detect_medical_outliers(df, feature_ranges):
    """Detect outliers using medical reference ranges"""
    outlier_counts = {}
    
    for feature, (min_val, max_val) in feature_ranges.items():
        if feature in df.columns:
            outliers = ((df[feature] < min_val) | (df[feature] > max_val))
            outlier_count = outliers.sum()
            outlier_counts[feature] = outlier_count
            
            if outlier_count > 0:
                print(f"{feature}: {outlier_count} outliers outside [{min_val}, {max_val}]")
    
    return outlier_counts

def treat_outliers(df, feature_ranges, method='clip'):
    """Treat outliers using medical reference ranges"""
    df_treated = df.copy()
    
    for feature, (min_val, max_val) in feature_ranges.items():
        if feature in df_treated.columns:
            if method == 'clip':
                df_treated[feature] = df_treated[feature].clip(min_val, max_val)
            elif method == 'remove':
                mask = (df_treated[feature] >= min_val) & (df_treated[feature] <= max_val)
                df_treated = df_treated[mask]
    
    return df_treated

# Detect outliers
print("=== OUTLIER DETECTION ===")
outlier_counts = detect_medical_outliers(data, MEDICAL_RANGES)

# Treat outliers by clipping to medical ranges
data_cleaned = treat_outliers(data, MEDICAL_RANGES, method='clip')
print(f"\nData shape after outlier treatment: {data_cleaned.shape}")

=== OUTLIER DETECTION ===
O2Sat: 22 outliers outside [70, 100]
Temp: 2 outliers outside [30, 45]
SBP: 5 outliers outside [50, 300]
MAP: 17 outliers outside [30, 200]
DBP: 5 outliers outside [20, 150]
Resp: 34 outliers outside [5, 50]
Glucose: 3 outliers outside [20, 800]

Data shape after outlier treatment: (38809, 42)
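
Clipping keeps every row but replaces an implausible reading with the boundary value. An alternative is to mark such readings as missing and let the imputation step handle them. The sketch below contrasts the two on a toy oxygen saturation series, using the `[70, 100]` bounds from `MEDICAL_RANGES`.

```python
import numpy as np
import pandas as pd

o2sat = pd.Series([98.0, 20.0, 101.0, np.nan, 95.0])
low, high = 70, 100

# Option 1: clip to the bounds (the approach used above).
clipped = o2sat.clip(low, high)

# Option 2: treat out-of-range readings as missing instead.
masked = o2sat.where(o2sat.between(low, high))

print(pd.DataFrame({"raw": o2sat, "clipped": clipped, "masked": masked}))
```

Which option is preferable depends on whether an out-of-range value is more likely a real extreme or a recording error; that judgement is not settled by the code.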


In [7]:
# Missing value imputation with medical domain knowledge
class MedicalImputer:
    """Custom imputer for medical time series data"""
    
    def __init__(self, strategy='medical_forward_fill'):
        self.strategy = strategy
        self.imputation_values = {}
    
    def fit(self, df):
        """Fit the imputer to the data"""
        # Use the predefined clinical normal value when available, otherwise the column median
        for col in df.columns:
            if col not in ['PatientID', 'SepsisLabel', 'ICULOS', 'HospAdmTime']:
                if col in NORMAL_VALUES:
                    self.imputation_values[col] = NORMAL_VALUES[col]
                else:
                    self.imputation_values[col] = df[col].median()
        
        return self
    
    def transform(self, df):
        """Apply imputation to the data"""
        df_imputed = df.copy()
        
        # Sort by patient and time for forward fill
        df_imputed = df_imputed.sort_values(['PatientID', 'ICULOS'])
        
        if self.strategy == 'medical_forward_fill':
            # Forward fill within each patient
            for patient_id in df_imputed['PatientID'].unique():
                patient_mask = df_imputed['PatientID'] == patient_id
                
                # Forward fill for each feature group
                for feature_group in [VITAL_SIGNS, LAB_VALUES, GAS_ANALYSIS]:
                    available_features = [f for f in feature_group if f in df_imputed.columns]
                    df_imputed.loc[patient_mask, available_features] = df_imputed.loc[patient_mask, available_features].ffill()
                
                # Fill remaining missing values with medical normal values or median
                for col in df_imputed.columns:
                    if col in self.imputation_values:
                        df_imputed.loc[patient_mask, col] = df_imputed.loc[patient_mask, col].fillna(self.imputation_values[col])
        
        return df_imputed
    
    def fit_transform(self, df):
        return self.fit(df).transform(df)

# Apply medical imputation
print("=== MISSING VALUE IMPUTATION ===")
print(f"Before imputation - Missing values: {data_cleaned.isnull().sum().sum()}")

imputer = MedicalImputer(strategy='medical_forward_fill')
data_imputed = imputer.fit_transform(data_cleaned)

print(f"After imputation - Missing values: {data_imputed.isnull().sum().sum()}")

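# Fill any remaining missing values with the column median (numeric) or mode (other).
# A column with no observations at all has a NaN median, so it stays empty; EtCO2,
# reported as 100% missing above, is the likely source of the values still missing below.
for col in data_imputed.columns:
    if data_imputed[col].isnull().any() and col not in ['PatientID']:
        if data_imputed[col].dtype in ['float64', 'int64']:
            data_imputed[col] = data_imputed[col].fillna(data_imputed[col].median())
        else:
            data_imputed[col] = data_imputed[col].fillna(data_imputed[col].mode()[0])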

print(f"Final missing values: {data_imputed.isnull().sum().sum()}")

=== MISSING VALUE IMPUTATION ===
Before imputation - Missing values: 1086709
After imputation - Missing values: 38809
Final missing values: 38809
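
The 38,809 values still missing equal the number of rows, which is what a single completely empty column would produce. The sketch below shows why a median fill cannot repair such a column and one possible way to handle it, namely dropping the column before filling. The toy frame is hypothetical, and dropping is an assumption about what is appropriate, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "HR":    [80.0, np.nan, 90.0],
    "EtCO2": [np.nan, np.nan, np.nan],  # never measured
})

# The median of an all-NaN column is NaN, so fillna(median) changes nothing.
etco2_median = toy["EtCO2"].median()

# Find columns with no observations and drop them before the median fill.
empty_cols = toy.columns[toy.isna().all()].tolist()
kept = toy.drop(columns=empty_cols)
kept = kept.fillna(kept.median())

print(empty_cols)
print(kept)
```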


In [8]:
# Feature engineering for temporal patterns
def create_temporal_features(df):
    """Create temporal features for time series analysis"""
    df_features = df.copy()
    
    # Sort by patient and time
    df_features = df_features.sort_values(['PatientID', 'ICULOS'])
    
    print("Creating temporal features...")
    
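    # Time-based features
    df_features['Hour_in_ICU'] = df_features['ICULOS']
    # HospAdmTime is the offset from hospital admission to ICU admission and is
    # assumed to be non-positive (hospital admission precedes ICU admission),
    # so it is subtracted to get hours since hospital admission.
    df_features['Time_since_admission'] = df_features['ICULOS'] - df_features['HospAdmTime']
    
    # Cyclical encoding of ICU hour with a 24-hour period. ICULOS counts hours since
    # ICU admission, so this tracks the phase of the stay, not the time of day.
    df_features['Hour_sin'] = np.sin(2 * np.pi * df_features['ICULOS'] / 24)
    df_features['Hour_cos'] = np.cos(2 * np.pi * df_features['ICULOS'] / 24)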
    
    # Patient-level aggregated features
    for patient_id in df_features['PatientID'].unique():
        patient_mask = df_features['PatientID'] == patient_id
        patient_data = df_features[patient_mask].copy()
        
        # Rolling window features (3-hour and 6-hour windows)
        for window in [3, 6]:
            for feature in VITAL_SIGNS + ['Lactate', 'WBC', 'Glucose']:
                if feature in df_features.columns:
                    # Rolling mean
                    rolling_mean = patient_data[feature].rolling(window=window, min_periods=1).mean()
                    df_features.loc[patient_mask, f'{feature}_rolling_mean_{window}h'] = rolling_mean
                    
                    # Rolling std
                    rolling_std = patient_data[feature].rolling(window=window, min_periods=1).std()
                    df_features.loc[patient_mask, f'{feature}_rolling_std_{window}h'] = rolling_std.fillna(0)
                    
                    # Rolling max/min
                    rolling_max = patient_data[feature].rolling(window=window, min_periods=1).max()
                    rolling_min = patient_data[feature].rolling(window=window, min_periods=1).min()
                    df_features.loc[patient_mask, f'{feature}_rolling_range_{window}h'] = rolling_max - rolling_min
        
        # Trend features (slope over last 3 hours)
        for feature in VITAL_SIGNS + ['Lactate', 'WBC']:
            if feature in df_features.columns:
                # Calculate slope using linear regression over rolling window
                trends = []
                for i in range(len(patient_data)):
                    start_idx = max(0, i-2)  # 3-hour window
                    y_vals = patient_data[feature].iloc[start_idx:i+1].values
                    x_vals = np.arange(len(y_vals))
                    
                    if len(y_vals) > 1:
                        slope, _, _, _, _ = stats.linregress(x_vals, y_vals)
                        trends.append(slope)
                    else:
                        trends.append(0)
                
                df_features.loc[patient_mask, f'{feature}_trend_3h'] = trends
    
    # Statistical features
    print("Creating statistical features...")
    
    # SOFA-like composite scores
    # Cardiovascular SOFA component
    df_features['Cardiovascular_score'] = 0
    df_features.loc[df_features['MAP'] < 70, 'Cardiovascular_score'] = 1
    df_features.loc[df_features['MAP'] < 60, 'Cardiovascular_score'] = 2
    
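    # Respiratory SOFA component
    # PaO2 is not used here, so SpO2/FiO2 (an S/F ratio) stands in for the P/F ratio;
    # applying P/F cutoffs to it gives only a rough proxy for the SOFA respiratory score.
    df_features['Respiratory_score'] = 0
    pf_ratio = df_features['O2Sat'] / (df_features['FiO2'] + 0.01)  # S/F ratio as a P/F surrogate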
    df_features.loc[pf_ratio < 400, 'Respiratory_score'] = 1
    df_features.loc[pf_ratio < 300, 'Respiratory_score'] = 2
    df_features.loc[pf_ratio < 200, 'Respiratory_score'] = 3
    
    # Renal SOFA component
    df_features['Renal_score'] = 0
    df_features.loc[df_features['Creatinine'] > 1.2, 'Renal_score'] = 1
    df_features.loc[df_features['Creatinine'] > 2.0, 'Renal_score'] = 2
    df_features.loc[df_features['Creatinine'] > 3.5, 'Renal_score'] = 3
    
    # Combine scores
    df_features['Total_SOFA_approx'] = (df_features['Cardiovascular_score'] + 
                                        df_features['Respiratory_score'] + 
                                        df_features['Renal_score'])
    
    # Additional medical ratios and indices
    df_features['Shock_index'] = df_features['HR'] / (df_features['SBP'] + 0.1)
    df_features['Modified_shock_index'] = df_features['HR'] / (df_features['MAP'] + 0.1)
    df_features['Oxygen_index'] = df_features['O2Sat'] / (df_features['FiO2'] + 0.01)
    
    print(f"Created temporal features. New shape: {df_features.shape}")
    return df_features

# Create temporal features
data_with_features = create_temporal_features(data_imputed)
print(f"\nFeature engineering completed. Shape: {data_with_features.shape}")
print(f"New features created: {data_with_features.shape[1] - data_imputed.shape[1]}")

Creating temporal features...
Creating statistical features...
Created temporal features. New shape: (38809, 122)

Feature engineering completed. Shape: (38809, 122)
New features created: 80
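
The per-patient loop above can be slow on the full dataset. As a possible alternative, the sketch below computes the same kind of trailing 3-hour rolling mean with a grouped rolling window. The toy frame is made up, and this is only an illustration of the pattern, not a drop-in replacement for `create_temporal_features`.

```python
import pandas as pd

toy = pd.DataFrame({
    "PatientID": ["p1"] * 4 + ["p2"] * 2,
    "ICULOS":    [1, 2, 3, 4, 1, 2],
    "HR":        [80.0, 90.0, 100.0, 110.0, 60.0, 70.0],
})
toy = toy.sort_values(["PatientID", "ICULOS"]).reset_index(drop=True)

# Rolling mean per patient; windows restart at each patient boundary.
rolled = toy.groupby("PatientID")["HR"].rolling(window=3, min_periods=1).mean()

# The result is indexed by (PatientID, original row); drop the group level to realign.
toy["HR_rolling_mean_3h"] = rolled.reset_index(level=0, drop=True)

print(toy)
```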


In [9]:
# Data scaling and normalization
def prepare_scaling(df, feature_groups):
    """Prepare data for scaling by feature groups"""
    scalers = {}
    
    # Separate features that need different scaling
    robust_features = []  # For features with outliers
    standard_features = []  # For normally distributed features
    minmax_features = []  # For bounded features
    
    for feature_group_name, features in feature_groups.items():
        available_features = [f for f in features if f in df.columns]
        
        if feature_group_name in ['VITAL_SIGNS', 'LAB_VALUES']:
            robust_features.extend(available_features)
        elif feature_group_name in ['DEMOGRAPHICS']:
            standard_features.extend(available_features)
        else:
            minmax_features.extend(available_features)
    
    # Add rolling features to robust scaling
    rolling_features = [col for col in df.columns if 'rolling' in col or 'trend' in col]
    robust_features.extend(rolling_features)
    
    # Add composite scores to standard scaling
    score_features = [col for col in df.columns if 'score' in col or 'index' in col]
    standard_features.extend(score_features)
    
    return robust_features, standard_features, minmax_features

# Prepare feature groups for scaling
feature_groups = {
    'VITAL_SIGNS': VITAL_SIGNS,
    'LAB_VALUES': LAB_VALUES,
    'GAS_ANALYSIS': GAS_ANALYSIS,
    'DEMOGRAPHICS': DEMOGRAPHICS
}

robust_features, standard_features, minmax_features = prepare_scaling(data_with_features, feature_groups)

print(f"Scaling preparation:")
print(f"- Robust scaling: {len(robust_features)} features")
print(f"- Standard scaling: {len(standard_features)} features")
print(f"- MinMax scaling: {len(minmax_features)} features")

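# Create and fit scalers
# Caution: the scalers are fitted on the full dataset, before the train/validation/test
# split, so held-out statistics leak into the scaling. Refit on training rows for evaluation.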
scalers = {}
data_scaled = data_with_features.copy()

if robust_features:
    scalers['robust'] = RobustScaler()
    data_scaled[robust_features] = scalers['robust'].fit_transform(data_scaled[robust_features])

if standard_features:
    scalers['standard'] = StandardScaler()
    data_scaled[standard_features] = scalers['standard'].fit_transform(data_scaled[standard_features])

if minmax_features:
    scalers['minmax'] = MinMaxScaler()
    data_scaled[minmax_features] = scalers['minmax'].fit_transform(data_scaled[minmax_features])

print(f"\nData scaling completed. Shape: {data_scaled.shape}")

Scaling preparation:
- Robust scaling: 96 features
- Standard scaling: 8 features
- MinMax scaling: 7 features

Data scaling completed. Shape: (38809, 122)
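
To keep held-out data from influencing the scaling parameters, a scaler is usually fitted on the training rows only and then applied unchanged to the other splits. The sketch below shows that pattern with `RobustScaler` on synthetic arrays; the array names and sizes are placeholders.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=80, scale=10, size=(100, 1))  # synthetic training feature
X_test = rng.normal(loc=80, scale=10, size=(20, 1))    # synthetic held-out feature

# Learn the median and IQR from the training rows only.
scaler = RobustScaler().fit(X_train)

# Apply the same parameters to both splits.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(scaler.center_, scaler.scale_)
```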


In [10]:
# Temporal data preparation for machine learning
def create_ml_dataset(df, sequence_length=12, prediction_horizon=1):
    """Create ML dataset with sequences for temporal modeling"""
    print(f"Creating ML dataset with sequence length: {sequence_length} hours")
    
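    # Exclude identifiers, context columns, and the target. If 'SepsisLabel' stays in
    # feature_cols, the label is fed to the model as an input (the 117th column of each
    # sequence, versus 116 columns in the flattened dataset).
    exclude_cols = ['PatientID', 'ICULOS', 'HospAdmTime', 'Unit1', 'Unit2', 'SepsisLabel']
    feature_cols = [col for col in df.columns if col not in exclude_cols]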
    
    X_sequences = []
    y_sequences = []
    patient_ids = []
    timestamps = []
    
    for patient_id in df['PatientID'].unique():
        patient_data = df[df['PatientID'] == patient_id].sort_values('ICULOS')
        
        if len(patient_data) >= sequence_length + prediction_horizon:
            for i in range(len(patient_data) - sequence_length - prediction_horizon + 1):
                # Input sequence
                sequence_data = patient_data.iloc[i:i+sequence_length]
                X_sequence = sequence_data[feature_cols].values
                
                # Target (predict sepsis at prediction_horizon hours ahead)
                target_idx = i + sequence_length + prediction_horizon - 1
                y_target = patient_data.iloc[target_idx]['SepsisLabel']
                
                X_sequences.append(X_sequence)
                y_sequences.append(y_target)
                patient_ids.append(patient_id)
                timestamps.append(patient_data.iloc[target_idx]['ICULOS'])
    
    X = np.array(X_sequences)
    y = np.array(y_sequences)
    
    print(f"Created {len(X)} sequences")
    print(f"Sequence shape: {X.shape}")
    print(f"Target distribution: {np.bincount(y)}")
    
    return X, y, np.array(patient_ids), np.array(timestamps), feature_cols

# Create both sequence and flattened datasets
print("=== CREATING ML DATASETS ===")

# Sequence dataset for LSTM/RNN models
X_seq, y_seq, patient_ids_seq, timestamps_seq, feature_names = create_ml_dataset(data_scaled, sequence_length=12)

# Flattened dataset for traditional ML models
exclude_cols = ['PatientID', 'ICULOS', 'HospAdmTime', 'Unit1', 'Unit2']
feature_columns = [col for col in data_scaled.columns if col not in exclude_cols and col != 'SepsisLabel']

X_flat = data_scaled[feature_columns].values
y_flat = data_scaled['SepsisLabel'].values

print(f"\nFlattened dataset:")
print(f"X shape: {X_flat.shape}")
print(f"y distribution: {np.bincount(y_flat)}")

=== CREATING ML DATASETS ===
Creating ML dataset with sequence length: 12 hours
Created 26863 sequences
Sequence shape: (26863, 12, 117)
Target distribution: [26207   656]

Flattened dataset:
X shape: (38809, 116)
y distribution: [37945   864]
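
The windowing done by `create_ml_dataset` can be illustrated for a single patient with NumPy's `sliding_window_view`. The sketch uses a toy 5-hour record with two features, a sequence length of 3 and a horizon of 1. It is an illustration of the indexing only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

features = np.arange(10, dtype=float).reshape(5, 2)  # 5 hours x 2 features
labels = np.array([0, 0, 0, 1, 1])
sequence_length, prediction_horizon = 3, 1

# All length-3 windows along the time axis: shape (n_windows, n_features, seq_len).
windows = sliding_window_view(features, sequence_length, axis=0)
# Reorder to (n_windows, seq_len, n_features), the usual layout for sequence models.
windows = windows.transpose(0, 2, 1)

# Keep only windows that still have a label prediction_horizon steps after their end.
n_sequences = len(features) - sequence_length - prediction_horizon + 1
X = windows[:n_sequences]
y = labels[sequence_length + prediction_horizon - 1:]

print(X.shape, y)
```

Each input window covers hours `i` to `i + 2`, and its target is the label at hour `i + 3`, which matches `target_idx = i + sequence_length + prediction_horizon - 1` in the function above.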


In [11]:
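# Patient-level train/validation/test split
def temporal_train_test_split(df, test_size=0.2, validation_size=0.1):
    """Split by patient, stratified on sepsis status, so that no patient's rows span two splits.

    Despite the name, patients are shuffled at random; the split is not ordered in time.
    """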
    # Get unique patients and their sepsis status
    patient_info = df.groupby('PatientID').agg({
        'SepsisLabel': 'max',
        'ICULOS': 'max'
    }).reset_index()
    
    # Stratified split by sepsis status
    sepsis_patients = patient_info[patient_info['SepsisLabel'] == 1]['PatientID'].values
    no_sepsis_patients = patient_info[patient_info['SepsisLabel'] == 0]['PatientID'].values
    
    # Split each group
    np.random.shuffle(sepsis_patients)
    np.random.shuffle(no_sepsis_patients)
    
    # Calculate split indices
    n_sepsis_test = int(len(sepsis_patients) * test_size)
    n_sepsis_val = int(len(sepsis_patients) * validation_size)
    
    n_no_sepsis_test = int(len(no_sepsis_patients) * test_size)
    n_no_sepsis_val = int(len(no_sepsis_patients) * validation_size)
    
    # Create splits
    test_patients = np.concatenate([
        sepsis_patients[:n_sepsis_test],
        no_sepsis_patients[:n_no_sepsis_test]
    ])
    
    val_patients = np.concatenate([
        sepsis_patients[n_sepsis_test:n_sepsis_test + n_sepsis_val],
        no_sepsis_patients[n_no_sepsis_test:n_no_sepsis_test + n_no_sepsis_val]
    ])
    
    train_patients = np.concatenate([
        sepsis_patients[n_sepsis_test + n_sepsis_val:],
        no_sepsis_patients[n_no_sepsis_test + n_no_sepsis_val:]
    ])
    
    # Create boolean masks
    train_mask = df['PatientID'].isin(train_patients)
    val_mask = df['PatientID'].isin(val_patients)
    test_mask = df['PatientID'].isin(test_patients)
    
    return train_mask, val_mask, test_mask, train_patients, val_patients, test_patients

# Create patient-level splits
print("=== TEMPORAL TRAIN-VALIDATION-TEST SPLIT ===")
train_mask, val_mask, test_mask, train_patients, val_patients, test_patients = temporal_train_test_split(data_scaled)

print(f"Split summary:")
print(f"- Train patients: {len(train_patients)} ({len(train_patients)/data_scaled['PatientID'].nunique()*100:.1f}%)")
print(f"- Validation patients: {len(val_patients)} ({len(val_patients)/data_scaled['PatientID'].nunique()*100:.1f}%)")
print(f"- Test patients: {len(test_patients)} ({len(test_patients)/data_scaled['PatientID'].nunique()*100:.1f}%)")

print(f"\nData distribution:")
print(f"- Train samples: {train_mask.sum()}")
print(f"- Validation samples: {val_mask.sum()}")
print(f"- Test samples: {test_mask.sum()}")

# Check sepsis distribution in each split
print(f"\nSepsis distribution:")
for split_name, mask in [('Train', train_mask), ('Validation', val_mask), ('Test', test_mask)]:
    split_data = data_scaled[mask]
    sepsis_rate = split_data['SepsisLabel'].mean() * 100
    print(f"- {split_name}: {sepsis_rate:.2f}% sepsis cases")

=== TEMPORAL TRAIN-VALIDATION-TEST SPLIT ===
Split summary:
- Train patients: 700 (70.0%)
- Validation patients: 100 (10.0%)
- Test patients: 200 (20.0%)

Data distribution:
- Train samples: 26714
- Validation samples: 3923
- Test samples: 8172

Sepsis distribution:
- Train: 2.27% sepsis cases
- Validation: 2.22% sepsis cases
- Test: 2.08% sepsis cases
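
A comparable patient-level stratified split can be sketched with scikit-learn's `train_test_split`, applied to one label per patient. The toy frame, the 25% test fraction, and the variable names are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 patients with 3 hourly rows each; the first 4 patients develop sepsis.
rows = pd.DataFrame({
    "PatientID": np.repeat([f"p{i}" for i in range(20)], 3),
    "SepsisLabel": np.repeat([1] * 4 + [0] * 16, 3),
})

# One label per patient: did sepsis occur at any hour?
patient_labels = rows.groupby("PatientID")["SepsisLabel"].max()

# Split patient IDs, not rows, and stratify on the patient-level label.
train_ids, test_ids = train_test_split(
    patient_labels.index.to_numpy(),
    test_size=0.25,
    stratify=patient_labels.to_numpy(),
    random_state=42,
)

train_rows = rows[rows["PatientID"].isin(train_ids)]
test_rows = rows[rows["PatientID"].isin(test_ids)]

print(len(train_ids), len(test_ids), len(train_rows), len(test_rows))
```

Splitting on IDs first and filtering rows afterwards is what guarantees that no patient contributes rows to both sets.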


In [1]:
# Complete Step 02 - Data Preprocessing
print("=== STEP 02 COMPLETED ===")
print("✓ Data loading and preprocessing pipeline established")
print("✓ Missing value handling implemented")
print("✓ Feature engineering completed")
print("✓ Data scaling and normalization applied")
print("✓ Train/validation/test splits created")
print("\nStep 02 completed successfully!")
print("Moving to Step 03 - Traditional ML Baseline Models")

=== STEP 02 COMPLETED ===
✓ Data loading and preprocessing pipeline established
✓ Missing value handling implemented
✓ Feature engineering completed
✓ Data scaling and normalization applied
✓ Train/validation/test splits created

Step 02 completed successfully!
Moving to Step 03 - Traditional ML Baseline Models
