# Main Idea

This notebook identifies **normal periods** (negative samples) and **acute hypotensive episodes** (positive samples), then uses measurements taken from several hours up to half an hour before each period as features for prediction. The window can be tuned later for better performance.

## Key Measurements
The key measurements include:
- **Diastolic blood pressure** (220051)
- **Systolic blood pressure** (220050)
- **Heart rate** (220045)
- **SpO2** (220277)
- **MAP (Mean Arterial Pressure)** (220052)
- **Respiratory rate** (220210)

Additionally, some general features like **age**, **gender**, etc., will be included to form the dataset for training and prediction.

## Data Source and Processing
The data comes primarily from **SQL queries run on Google BigQuery**. The query results are downloaded as **CSV files** and read into the code below. If needed, the workflow can later be changed to query the database directly from **Google Colab**.


In [None]:
from IPython.display import display

import warnings

warnings.filterwarnings('ignore')

# Data Cleaning



- Detected outliers in the dataset using Tukey's method and the Modified Z-score method.
- Replaced outliers with NaN and then dropped them from the dataset.
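
As a rough illustration of the two rules named above, the sketch below applies Tukey's fences and the modified Z-score to a small made-up array. The readings are invented for the example and are not taken from the dataset; the constants 1.5, 0.6745, and 3.5 match the ones used in the notebook's code.

```python
import numpy as np

# Toy MAP readings in mmHg; 250 is an implausible value (illustrative data only)
values = np.array([62.0, 65.0, 70.0, 68.0, 64.0, 66.0, 250.0])

# Tukey's fences: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
tukey_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Modified Z-score: 0.6745 * (x - median) / MAD, flagged when |score| > 3.5
median = np.median(values)
mad = np.median(np.abs(values - median))
modified_z = 0.6745 * (values - median) / mad
z_outlier = np.abs(modified_z) > 3.5

print("Tukey outliers:", values[tukey_outlier])
print("Modified Z-score outliers:", values[z_outlier])
```

Both rules rely on quartiles or medians rather than the mean, so a single extreme reading has little effect on the thresholds themselves.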

In [None]:
#clean the "all map for patient having less than 60.csv" (used to extract samples) by removing outliers directly.

import numpy as np
import pandas as pd

# Tukey method to detect and replace outliers
def detect_outliers_tukey(df, value_column):
    Q1 = df[value_column].quantile(0.25)
    Q3 = df[value_column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Detect outliers
    outliers = (df[value_column] < lower_bound) | (df[value_column] > upper_bound)

    # Replace outliers with NaN
    df.loc[outliers, value_column] = np.nan
    return df

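# Modified Z-score method to detect and replace outliers
def detect_outliers_z(df, value_column):
    median = df[value_column].median()
    # Use pandas' NaN-skipping median: np.median returns NaN once the
    # Tukey step has already replaced some values with NaN
    mad = (df[value_column] - median).abs().median()
    if mad == 0:
        # The score is undefined for a zero MAD; leave the data unchanged
        return df
    modified_z_score = 0.6745 * (df[value_column] - median) / mad

    # Detect outliers
    outliers = np.abs(modified_z_score) > 3.5

    # Replace outliers with NaN
    df.loc[outliers, value_column] = np.nan
    return df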

# Detect and remove outliers using both methods
def clean_outliers(df, value_column='valuenum'):
    # First, use the Tukey method
    df = detect_outliers_tukey(df, value_column)
    # Then, use the modified Z-score method
    df = detect_outliers_z(df, value_column)
    # Remove rows with NaN values
    df = df.dropna(subset=[value_column])
    return df

# Load data
df = pd.read_csv('all map for patient having less than 60.csv')

# Clean outliers in the 'valuenum' column before filtering
df = clean_outliers(df, value_column='valuenum')

# Extracting Positive Samples (Acute Hypotensive Episodes)


- Identified periods where patients experienced acute hypotensive episodes based on specific criteria.
- Filtered the dataset to include only these periods.
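
The onset rule can be sketched on a toy series for a single stay. This is a simplified version under stated assumptions: the data is invented, only `stay_id` is used for grouping, and the upper bounds on time gaps used in the notebook's full condition are left out.

```python
import pandas as pd

# Toy MAP series for one hypothetical ICU stay (values in mmHg, illustrative only)
toy = pd.DataFrame({
    'stay_id': [1] * 6,
    'charttime': pd.to_datetime([
        '2144-01-27 16:00', '2144-01-27 17:00', '2144-01-27 18:00',
        '2144-01-27 19:00', '2144-01-27 20:00', '2144-01-27 21:00']),
    'valuenum': [72, 68, 58, 55, 57, 66],
})

# Shift within each stay so that neighbours never come from another stay
grp = toy.groupby('stay_id')
prev1 = grp['valuenum'].shift(1)
prev2 = grp['valuenum'].shift(2)
next_val = grp['valuenum'].shift(-1)
gap_next = grp['charttime'].shift(-1) - toy['charttime']

onset = (
    (toy['valuenum'] < 60) & (next_val < 60)      # two consecutive low readings
    & (prev1 >= 60) & (prev2 >= 60)               # preceded by two normal readings
    & (gap_next > pd.Timedelta(minutes=30))       # low readings more than 30 min apart
)

print(toy.loc[onset, ['charttime', 'valuenum']])
```

Only the 18:00 record qualifies: the later low readings are part of the same episode, so they are not counted as new onsets.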

In [None]:
# Extract positive samples (periods of acute hypotensive episodes).
# Selection criterion: two consecutive records (lab tests) with values below 60 mmHg that are more than 30 minutes apart mark an acute hypotensive episode.
# The records before these two must also be at or above 60 mmHg, so that the episode starts at this point rather than being part of an ongoing event.

# Convert charttime to datetime format before sorting on it
df['charttime'] = pd.to_datetime(df['charttime'])

# Sort the dataframe by subject_id, hadm_id, stay_id, and charttime
df.sort_values(by=['subject_id', 'hadm_id','stay_id', 'charttime'], inplace=True)

# Calculate the time difference between the current and the next record, and create a new column for time difference
df['time_diff_next'] = df.groupby(['subject_id', 'hadm_id','stay_id'])['charttime'].shift(-1) - df['charttime']
df['time_diff'] = df.groupby(['subject_id', 'hadm_id','stay_id'])['charttime'].diff()

# Columns that need to be checked for identical values
columns_to_check = ['subject_id', 'hadm_id', 'stay_id']

# Filtering condition
condition = (
    ((df['valuenum'] < 60) &  # The current value is less than 60
    (df['valuenum'].shift(-1) < 60) &  # The next value is also less than 60
    (df['valuenum'].rolling(window=2, min_periods=2).min().shift(1) >= 60) &  # The minimum value in the 2-window rolling period before is greater than or equal to 60
    (df['time_diff_next'] > pd.Timedelta(minutes=30)) &  # The time difference between the current and next record is greater than 30 minutes
    (df['time_diff_next'] < pd.Timedelta(hours=12)) &  # The time difference between the current and next record is less than 12 hours
    ((df['charttime'] - df.groupby(['subject_id', 'hadm_id','stay_id'])['charttime'].shift(2)) >= pd.Timedelta(hours=1)) &  # Time difference from 2 records before is greater than or equal to 1 hour
    ((df['charttime'] - df.groupby(['subject_id', 'hadm_id','stay_id'])['charttime'].shift(2)) <= pd.Timedelta(hours=24)) &  # Time difference from 2 records before is less than or equal to 24 hours
    (df[columns_to_check].shift(1) == df[columns_to_check]).all(axis=1) &  # Ensure the values for subject_id, hadm_id, stay_id remain the same in previous records
    (df[columns_to_check].shift(2) == df[columns_to_check]).all(axis=1) &  # Ensure the values for subject_id, hadm_id, stay_id remain the same 2 records back
    (df[columns_to_check].shift(-1) == df[columns_to_check]).all(axis=1))   # Ensure the values for subject_id, hadm_id, stay_id remain the same in the next record
)

# Filter the dataframe based on the above condition
filtered_df_1 = df[condition].copy()  # copy so that columns can be added later without touching df

# Display the filtered result
print(filtered_df_1)

# Calculate the number of unique combinations of subject_id and hadm_id
unique_combinations_count = filtered_df_1.groupby(['subject_id', 'hadm_id']).ngroups

# Display the result
#print(f"Number of unique subject_id + hadm_id combinations: {unique_combinations_count}")
display(len(filtered_df_1))

In [None]:
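# Quick sanity check that the selected rows are what we expect

# df was re-sorted, so index labels no longer match row positions; look up the position instead
i = np.flatnonzero(df['charttime'] == pd.Timestamp('2144-01-27 20:02:00'))[0]


# Guard against negative positions, which would silently wrap around to the end of df
previous_row = df.iloc[i-3] if i >= 3 else None
current_row = df.iloc[i]
next_row = df.iloc[i+1] if i < len(df)-1 else None


print("Record three rows earlier:")
display(previous_row)

print("Current record:")
display(current_row)

print("Next record:")
display(next_row)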

# Extracting Negative Samples (Normal Periods)


- Identified normal periods where measurements were consistently above 60 mmHg.
- Filtered the dataset to include only these normal periods.
- Selected only one sample per day to reduce the number of negative samples.
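
The "one sample per day" step can be illustrated on a toy table. The rows below are invented, and only `stay_id` is used as the grouping key here, whereas the notebook groups by `subject_id`, `hadm_id`, and `stay_id` as well.

```python
import pandas as pd

# Toy normal-period records for two hypothetical stays (illustrative data only)
normal = pd.DataFrame({
    'stay_id': [1, 1, 1, 2],
    'charttime': pd.to_datetime([
        '2144-01-27 08:00', '2144-01-27 20:00',
        '2144-01-28 09:00', '2144-01-27 10:00']),
    'valuenum': [75, 80, 78, 90],
})

# Derive the calendar date, then keep the last record of each (stay, date) group
normal['date'] = normal['charttime'].dt.date
one_per_day = normal.groupby(['stay_id', 'date']).last().reset_index()

print(one_per_day)
```

Stay 1 has two records on 27 January, so only the later one (20:00) is kept for that day.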

In [None]:
# Extract negative samples (normal periods).
# If a record and its neighbouring records all have values of at least 60 mmHg, the period is treated as a normal period.

condition = (
    (df['valuenum'] >= 60) &
    (df['valuenum'].rolling(window=3, min_periods=3).min().shift(-1) >= 60) &
    (df['valuenum'].rolling(window=3, min_periods=3).min().shift(1) >= 60) &
    ((df['charttime'] - df.groupby(['subject_id', 'hadm_id','stay_id'])['charttime'].shift(3)) >= pd.Timedelta(hours=1.5)) &
    ((df['charttime'] - df.groupby(['subject_id', 'hadm_id','stay_id'])['charttime'].shift(3)) <= pd.Timedelta(hours=12)) &
    ((df.groupby(['subject_id', 'hadm_id','stay_id'])['charttime'].shift(-3))-df['charttime'] >= pd.Timedelta(hours=1.5)) &
    ((df.groupby(['subject_id', 'hadm_id','stay_id'])['charttime'].shift(-3))-df['charttime'] <= pd.Timedelta(hours=12)) &
    (df[columns_to_check].shift(1) == df[columns_to_check]).all(axis=1) &
    (df[columns_to_check].shift(2) == df[columns_to_check]).all(axis=1) &
    (df[columns_to_check].shift(3) == df[columns_to_check]).all(axis=1) &
    (df[columns_to_check].shift(-1) == df[columns_to_check]).all(axis=1) &
    (df[columns_to_check].shift(-2) == df[columns_to_check]).all(axis=1) &
    (df[columns_to_check].shift(-3) == df[columns_to_check]).all(axis=1)

)


filtered_df_0 = df[condition].copy()  # copy so that columns can be added later without touching df


print(filtered_df_0)


unique_combinations_count = filtered_df_0.groupby(['subject_id', 'hadm_id']).ngroups


#print(f"Number of unique subject_id + hadm_id combinations:: {unique_combinations_count}")
print(len(filtered_df_0))

In [None]:
# Since there are too many negative samples (normal periods), only one sample per day will be selected
filtered_df_0['date'] = filtered_df_0['charttime'].dt.date

# Group by subject_id, hadm_id, stay_id, date and select the last record from each group
filtered_df_0_unique = filtered_df_0.groupby(['subject_id', 'hadm_id', 'stay_id', 'date']).last().reset_index()

# View the result
print(filtered_df_0_unique)

# Calculate the number of unique subject_id + hadm_id combinations
unique_combinations_count = filtered_df_0_unique.groupby(['subject_id', 'hadm_id']).ngroups
print(f"Number of unique subject_id + hadm_id combinations: {unique_combinations_count}")
print(len(filtered_df_0_unique))

# Balancing the Dataset


- Added labels to the positive (1) and negative (0) samples.
- Combined the datasets and performed downsampling to create a balanced dataset with equal numbers of positive and negative samples.
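
A minimal sketch of downsampling to equal class sizes follows. Instead of a fixed count such as 4,000, it samples the size of the smaller class, which is an assumption made here so the example works on any input. The data is invented.

```python
import pandas as pd

# Toy labelled samples: 3 positives and 7 negatives (illustrative data only)
labels = pd.DataFrame({
    'sample_id': range(10),
    'label': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

# Use the size of the smaller class so that sampling never asks for more rows than exist
n_per_class = labels['label'].value_counts().min()

# Draw the same number of rows from every class
balanced = (
    labels.groupby('label')
    .sample(n=n_per_class, random_state=42)
    .reset_index(drop=True)
)

print(balanced['label'].value_counts())
```

Fixing `random_state` keeps the selection reproducible across runs.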

In [None]:
# Add label to the samples
filtered_df_1['label'] = 1
filtered_df_0_unique['label'] = 0

# Select the required columns
columns_to_keep = ['subject_id', 'hadm_id', 'stay_id', 'caregiver_id', 'charttime', 'label']

# Stack the two DataFrames row-wise, keeping only the selected columns
combined_df = pd.concat([filtered_df_1[columns_to_keep], filtered_df_0_unique[columns_to_keep]])

# Reset the index
combined_df.reset_index(drop=True, inplace=True)

# View the combined result
combined_df

In [None]:
# Negative samples still far outnumber positive samples, so we perform downsampling
# To create balanced dataset with 8,000 samples (4,000 from each class)

# Sample 4,000 samples from the data with label 0
df_label_0 = combined_df[combined_df['label'] == 0].sample(n=4000, random_state=42)

# Sample 4,000 samples from the data with label 1
df_label_1 = combined_df[combined_df['label'] == 1].sample(n=4000, random_state=42)

# Concatenate the two sampled datasets
combined_sampled_df = pd.concat([df_label_0, df_label_1])

# Reset the index
combined_sampled_df.reset_index(drop=True, inplace=True)

# View the result
print("Combined sampled dataset shape:", combined_sampled_df.shape)
combined_sampled_df.head()

# Cleaning Feature Data (item3.csv)


- Cleaned the item3.csv file by detecting and handling outliers.
- Replaced outliers using interpolation for continuous data.
- Saved the cleaned data to new files for future use.
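
The interpolation step can be sketched as follows. One design choice is shown here as an assumption rather than as the notebook's method: interpolation is restricted to each stay, so a gap is never filled from another patient's readings. The data is invented.

```python
import numpy as np
import pandas as pd

# Toy heart-rate readings for two hypothetical stays; NaN marks a removed outlier
readings = pd.DataFrame({
    'stay_id': [1, 1, 1, 2, 2, 2],
    'valuenum': [80.0, np.nan, 84.0, 60.0, np.nan, 70.0],
})

# Linear interpolation applied separately within each stay
readings['filled'] = (
    readings.groupby('stay_id')['valuenum']
    .transform(lambda s: s.interpolate(method='linear'))
)

print(readings)
```

Each missing value is replaced by the midpoint of its two neighbours within the same stay (82 and 65 here). Rows should be in time order before interpolating for this to be meaningful.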


In [None]:
import numpy as np
import pandas as pd

# Function to detect and replace outliers using Tukey's method
def detect_outliers_tukey(df, value_column):
   Q1 = df[value_column].quantile(0.25)
   Q3 = df[value_column].quantile(0.75)
   IQR = Q3 - Q1
   lower_bound = Q1 - 1.5 * IQR
   upper_bound = Q3 + 1.5 * IQR

   # Detect outliers
   outliers = (df[value_column] < lower_bound) | (df[value_column] > upper_bound)

   # Replace outliers with NaN
   df.loc[outliers, value_column] = np.nan
   return df

# Function to detect and replace outliers using Modified Z-score method
def detect_outliers_z(df, value_column):
   median = df[value_column].median()
   # pandas' median skips the NaNs left by the Tukey step; np.median would return NaN
   mad = (df[value_column] - median).abs().median()
   if mad == 0:
       # The score is undefined for a zero MAD; leave the data unchanged
       return df
   modified_z_score = 0.6745 * (df[value_column] - median) / mad

   # Detect outliers
   outliers = np.abs(modified_z_score) > 3.5

   # Replace outliers with NaN
   df.loc[outliers, value_column] = np.nan
   return df

# Group by 'itemid' to handle different measurements separately
def clean_outliers_by_itemid(df, value_column='valuenum'):
   cleaned_groups = []

   # Process each itemid separately
   for itemid, group in df.groupby('itemid'):
       # Work on a copy so the original DataFrame is not modified
       group = group.copy()

       # Detect outliers using Tukey's method and Modified Z-score method
       group = detect_outliers_tukey(group, value_column)
       group = detect_outliers_z(group, value_column)

       # Fill NaN values (outliers) using linear interpolation.
       # Interpolation follows row order within each itemid, so neighbouring rows may belong to different patients.
       group[value_column] = group[value_column].interpolate(method='linear')

       # Collect the cleaned group
       cleaned_groups.append(group)

   # Concatenate once at the end instead of growing a DataFrame inside the loop
   return pd.concat(cleaned_groups, axis=0) if cleaned_groups else df.copy()

# Example usage:
# Load your dataset
items_df = pd.read_csv('item3.csv')

# Clean outliers in your dataset
items_df_cleaned = clean_outliers_by_itemid(items_df)

# Save the cleaned data to a new file if needed
items_df_cleaned.to_csv('cleaned_item3.csv', index=False)

In [None]:
print(items_df.shape)
print(items_df_cleaned.shape)

In [None]:
#import numpy as np
#import pandas as pd
#
## Function to detect and replace outliers using Tukey's method
#def detect_outliers_tukey(df, value_column):
#    Q1 = df[value_column].quantile(0.25)
#    Q3 = df[value_column].quantile(0.75)
#    IQR = Q3 - Q1
#    lower_bound = Q1 - 1.5 * IQR
#    upper_bound = Q3 + 1.5 * IQR
#
#    # Detect outliers
#    outliers = (df[value_column] < lower_bound) | (df[value_column] > upper_bound)
#
#    # Replace outliers with NaN
#    df.loc[outliers, value_column] = np.nan
#    return df
#
## Function to detect and replace outliers using Modified Z-score method
#def detect_outliers_z(df, value_column):
#    median = df[value_column].median()
#    mad = np.median(np.abs(df[value_column] - median))
#    modified_z_score = 0.6745 * (df[value_column] - median) / mad
#
#    # Detect outliers
#    outliers = np.abs(modified_z_score) > 3.5
#
#    # Replace outliers with NaN
#    df.loc[outliers, value_column] = np.nan
#    return df
#
## Group by 'itemid' to handle different measurements separately
#def clean_outliers_by_itemid(df, value_column='valuenum'):
#    df_cleaned = pd.DataFrame()
#
#    # Process each itemid separately
#    for itemid, group in df.groupby('itemid'):
#        # Detect outliers using Tukey's method and Modified Z-score method
#        group = detect_outliers_tukey(group, value_column)
#        group = detect_outliers_z(group, value_column)
#
#        # Drop rows with NaN values (outliers)
#        group = group.dropna(subset=[value_column])
#
#        # Append cleaned group
#        df_cleaned = pd.concat([df_cleaned, group], axis=0)
#
#    return df_cleaned
#
## Example usage:
## Load your dataset
#items_df = pd.read_csv('item3.csv')
#
## Clean outliers in your dataset
#items_df_cleaned = clean_outliers_by_itemid(items_df)
#
## Save the cleaned data to a new file if needed
#items_df_cleaned.to_csv('cleaned_item3_2.csv', index=False)
#

# Feature Extraction


- Extracted statistical features (mean, max, min, median, etc.) for each measurement within a specified time window (from 6.5 to 0.5 hours before the sample time).
- Calculated 12 features consistent with the method used in the reference paper.
- Stored the features in a new DataFrame for merging.
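
To make the time window concrete, the sketch below selects the readings that fall between 6.5 and 0.5 hours before a sample time and computes a few of the statistics. The timestamps and values are invented, and only five of the twelve features are shown.

```python
import pandas as pd

sample_time = pd.Timestamp('2144-01-27 20:00')

# Toy readings for one measurement type (illustrative data only)
obs = pd.DataFrame({
    'charttime': pd.to_datetime([
        '2144-01-27 12:00',   # 8 h before the sample: outside the window
        '2144-01-27 14:00',
        '2144-01-27 16:00',
        '2144-01-27 19:00',
        '2144-01-27 19:45',   # 15 min before the sample: outside the window
    ]),
    'valuenum': [90.0, 70.0, 74.0, 66.0, 50.0],
})

# Keep readings from 6.5 h to 0.5 h before the sample time (both ends inclusive)
in_window = obs['charttime'].between(
    sample_time - pd.Timedelta(hours=6.5),
    sample_time - pd.Timedelta(hours=0.5),
)
window = obs.loc[in_window, 'valuenum']

features = {
    'mean': window.mean(),
    'max': window.max(),
    'min': window.min(),
    'range': window.max() - window.min(),
    'mad': (window - window.mean()).abs().mean(),  # mean absolute deviation
}
print(features)
```

Excluding the last half hour keeps readings that are already part of the episode out of the features.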

In [None]:
from scipy.stats import skew, kurtosis

items_df = pd.read_csv('cleaned_item3.csv')

# First, convert charttime to datetime format
combined_sampled_df['charttime'] = pd.to_datetime(combined_sampled_df['charttime'])
items_df['charttime'] = pd.to_datetime(items_df['charttime'])

# Create an empty list to store the results
new_features = []
i = 0

# Iterate through each row in combined_sampled_df
for _, row in combined_sampled_df.iterrows():
    subject_id = row['subject_id']
    hadm_id = row['hadm_id']
    stay_id = row['stay_id']
    charttime = row['charttime']

    # Find matching records in items_df based on subject_id, hadm_id, and stay_id
    relevant_items = items_df[
        (items_df['subject_id'] == subject_id) &
        (items_df['hadm_id'] == hadm_id) &
        (items_df['stay_id'] == stay_id)
    ]

    # Filter events within the time window (from 6.5 hours to 0.5 hours before the current record)
    time_window_items = relevant_items[
        (relevant_items['charttime'] >= charttime - pd.Timedelta(hours=6.5)) &
        (relevant_items['charttime'] <= charttime - pd.Timedelta(hours=0.5))
    ]

    # Several measurements may fall inside the window, so each item is summarised by
    # 12 statistics (mean, max, min, etc.), consistent with the method used in the reference paper.
    # Group by itemid and calculate the statistics for valuenum
    grouped = time_window_items.groupby('itemid')['valuenum'].agg([
        'mean',                # Mean
        'max',                 # Maximum
        'min',                 # Minimum
        'median',              # Median
        'std',                 # Standard Deviation
        ('skewness', skew),    # Skewness (using scipy.stats.skew)
        ('kurtosis', kurtosis),# Kurtosis (using scipy.stats.kurtosis)
        ('q75', lambda x: np.percentile(x, 75)),  # 75th percentile
        ('q25', lambda x: np.percentile(x, 25)),  # 25th percentile
        ('mad', lambda x: np.mean(np.abs(x - np.mean(x)))),  # Mean Absolute Deviation
        ('range', lambda x: np.max(x) - np.min(x)),  # Range
        'var'                  # Variance
    ])

    # Store the result as a dictionary ({itemid: {statistic: value}}) for easier merging with combined_sampled_df
    grouped_dict = grouped.to_dict('index')

    # Add to the new_features list to be merged later
    new_features.append(grouped_dict)

    # Print progress every 500 samples
    i += 1
    if i % 500 == 0:
        print(f"Processed {i} samples")

# Convert the new features to a DataFrame (one column per itemid) and merge with combined_sampled_df
new_features_df = pd.DataFrame(new_features)
final_df = pd.concat([combined_sampled_df.reset_index(drop=True), new_features_df.reset_index(drop=True)], axis=1)

# Adding Additional Features (Vasopressin and Ventilation Usage)


- Added binary features indicating whether vasopressin or ventilation was used within the specified time window before the sample time.
- Although initial tests showed these features might not significantly affect results, they were included for completeness.
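
The binary treatment flag can be defined in more than one way. The sketch below contrasts a "started inside the window" check with an "active at any point in the window" check. The second variant and both helper names (`started_in_window`, `active_in_window`) are assumptions introduced for illustration; which definition is appropriate is a modelling decision. The data is invented.

```python
import pandas as pd

# Hypothetical treatment interval for one ICU stay (illustrative data only)
treatments = pd.DataFrame({
    'stay_id': [1],
    'starttime': pd.to_datetime(['2144-01-27 10:00']),
    'endtime': pd.to_datetime(['2144-01-27 22:00']),
})

sample_time = pd.Timestamp('2144-01-27 20:00')
window_start = sample_time - pd.Timedelta(hours=6.5)
window_end = sample_time - pd.Timedelta(hours=0.5)


def started_in_window(intervals, stay_id, start, end):
    """Return 1 if a treatment for this stay began inside [start, end], else 0."""
    hits = intervals[
        (intervals['stay_id'] == stay_id)
        & (intervals['starttime'] >= start)
        & (intervals['starttime'] <= end)
    ]
    return int(not hits.empty)


def active_in_window(intervals, stay_id, start, end):
    """Return 1 if a treatment for this stay overlaps [start, end] at all, else 0."""
    hits = intervals[
        (intervals['stay_id'] == stay_id)
        & (intervals['starttime'] <= end)
        & (intervals['endtime'] >= start)
    ]
    return int(not hits.empty)


print(started_in_window(treatments, 1, window_start, window_end))
print(active_in_window(treatments, 1, window_start, window_end))
```

In this toy case the treatment began before the window opened and was still running during it, so the two definitions disagree.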

In [None]:
# Add vasopressin and ventilation usage information as binary features (1 if used, 0 if not)
# Testing revealed that these two features do not significantly affect the results, so they might be discarded.

vasopressin_df = pd.read_csv('vasopressin.csv')
ventilation_df = pd.read_csv('ventilation.csv')

vasopressin_df['starttime'] = pd.to_datetime(vasopressin_df['starttime'])
vasopressin_df['endtime'] = pd.to_datetime(vasopressin_df['endtime'])
ventilation_df['starttime'] = pd.to_datetime(ventilation_df['starttime'])
ventilation_df['endtime'] = pd.to_datetime(ventilation_df['endtime'])



# Define a function to check if vasopressin or ventilation was given in the required time window
def check_treatment(row):
    charttime = row['charttime']
    stay_id = row['stay_id']

    # Time window: charttime - 6.5 hours to charttime - 0.5 hours
    time_start = charttime - pd.Timedelta(hours=6.5)
    time_end = charttime - pd.Timedelta(hours=0.5)

    # Check for vasopressin
    vasopressin_given = vasopressin_df[
        (vasopressin_df['stay_id'] == stay_id) &
        (vasopressin_df['starttime'] >= time_start) &
        (vasopressin_df['starttime'] <= time_end)
    ]

    # Check for ventilation
    ventilation_given = ventilation_df[
        (ventilation_df['stay_id'] == stay_id) &
        (ventilation_df['starttime'] >= time_start) &
        (ventilation_df['starttime'] <= time_end)
    ]

    # Assign 1 if given, 0 otherwise
    row['vasopressin'] = 1 if not vasopressin_given.empty else 0
    row['ventilation'] = 1 if not ventilation_given.empty else 0

    return row

# Apply the function to every row in final_df
final_df = final_df.apply(check_treatment, axis=1)

# Adding General Features (Age, Gender, APSIII, SOFA)


- Merged additional patient data such as age, gender, APSIII scores, and SOFA scores into the main dataset.
- This provided more context and potential predictive power for the model.

In [None]:
#add general features like age, gender...

apsiii_df = pd.read_csv('apsiii.csv')
age_df = pd.read_csv('age.csv')
gender_df = pd.read_csv('gender.csv')
sofa_df = pd.read_csv('sofa.csv')

final_df = pd.merge(final_df, apsiii_df, on=['subject_id', 'hadm_id', 'stay_id'], how='left')
final_df = pd.merge(final_df, age_df, on=['subject_id', 'hadm_id'], how='left')
final_df = pd.merge(final_df, gender_df, on=['subject_id'], how='left')
final_df = pd.merge(final_df, sofa_df, on=['subject_id', 'hadm_id', 'stay_id'], how='left')

# show results
display(final_df)

# Preparing the Final Data for Modeling

- Dropped rows with null values to ensure data quality.
- Converted categorical variables (e.g., gender) into numerical format suitable for modeling.

In [None]:
# Drop rows with null values; copy so the result can be modified independently of final_df
final_df_cleaned = final_df.dropna().copy()

# Encode the gender column numerically: 'M' (male) becomes 1 and 'F' (female) becomes 0
final_df_cleaned['gender'] = final_df_cleaned['gender'].map({'M': 1, 'F': 0})


display(final_df_cleaned)

# Extracting Statistical Features for Each Lab Test


- Extracted 12 statistical features for each lab test measurement.
- These features include mean, max, min, median, standard deviation, skewness, kurtosis, percentiles, etc.
- Removed the original nested data structures after extracting the features.
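
The flattening step can be sketched on a toy table in which one column holds a dictionary of statistics per row. This is an alternative way to expand such a column, shown under the assumption that every cell is a dictionary with the same keys. The values are invented.

```python
import pandas as pd

# Toy frame: the column named 220045 holds a dict of statistics per row (illustrative data only)
nested = pd.DataFrame({
    'stay_id': [1, 2],
    220045: [
        {'mean': 80.0, 'max': 95.0},
        {'mean': 72.0, 'max': 88.0},
    ],
})

# Expand the dicts into columns, then prefix them with the item ID
flat = pd.DataFrame(nested[220045].tolist(), index=nested.index).add_prefix('220045_')

# Replace the nested column with the flat ones
result = pd.concat([nested.drop(columns=[220045]), flat], axis=1)

print(result)
```

Building the flat frame from a list of dictionaries avoids calling a Python function once per row and per statistic.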

In [None]:
# Extract 12 feature values for each lab test
#diastolic-220051  systolic-220050   Heart Rate-220045  SpO2-220277   map-220052   Respiratory Rate-220210
def extract_features(item_series):
    return pd.Series({
        'mean': item_series.get('mean'),
        'max': item_series.get('max'),
        'min': item_series.get('min'),
        'median': item_series.get('median'),
        'std': item_series.get('std'),
        'skewness': item_series.get('skewness'),
        'kurtosis': item_series.get('kurtosis'),
        'q75': item_series.get('q75'),
        'q25': item_series.get('q25'),
        'mad': item_series.get('mad'),
        'range': item_series.get('range'),
        'var': item_series.get('var')
    })


item_ids = [220051, 220050, 220045, 220277, 220052, 220210]
stat_names = ['mean', 'max', 'min', 'median', 'std', 'skewness',
              'kurtosis', 'q75', 'q25', 'mad', 'range', 'var']

# Expand each nested statistics column into 12 flat columns named '<itemid>_<statistic>'
for item_id in item_ids:
    new_columns = [f'{item_id}_{stat}' for stat in stat_names]
    final_df_cleaned[new_columns] = final_df_cleaned[item_id].apply(extract_features)

# Remove the original nested columns
final_df_cleaned.drop(columns=item_ids, inplace=True)


#print(final_df_cleaned)

final_df_cleaned = final_df_cleaned.dropna()


display(final_df_cleaned)

# Prepare Data

In [None]:
# Exclude non-feature columns
non_feature_columns = ['label', 'subject_id', 'hadm_id', 'stay_id', 'caregiver_id', 'charttime','anchor_age']

# Features (X) and target (y)
X = final_df_cleaned.drop(columns=non_feature_columns)
y = final_df_cleaned['label']

# Ensure there are no missing values
X = X.dropna()
y = y.loc[X.index]

print("Features shape:", X.shape)
print("Target shape:", y.shape)

# Feature Selection using ReliefF

- Apply the ReliefF algorithm to rank the features.
- `fs_relief.feature_importances_` contains the importance score that ReliefF assigns to each feature.
- Create a DataFrame that maps each feature to its score.
- Sort the features in descending order of importance.

In [None]:
!pip install skrebate

In [None]:
from skrebate import ReliefF

# Initialize ReliefF
fs_relief = ReliefF(n_neighbors=100, n_features_to_select=X.shape[1])

# Fit the model
fs_relief.fit(X.values, y.values)

# Get feature scores
relief_scores = fs_relief.feature_importances_

# Create a DataFrame for easy viewing
relief_feature_scores = pd.DataFrame({'Feature': X.columns, 'ReliefF_Score': relief_scores})

# Sort the features by score
relief_feature_scores.sort_values(by='ReliefF_Score', ascending=False, inplace=True)

# Display the top features
print("Top features according to ReliefF:")
print(relief_feature_scores.head(20))

# Feature Selection using Mutual Information

In [None]:
from sklearn.feature_selection import mutual_info_classif

# Compute mutual information scores
mi_scores = mutual_info_classif(X, y)

# Create a DataFrame for easy viewing
mi_feature_scores = pd.DataFrame({'Feature': X.columns, 'MI_Score': mi_scores})

# Sort the features by score
mi_feature_scores.sort_values(by='MI_Score', ascending=False, inplace=True)

# Display the top features
print("Top features according to Mutual Information:")
print(mi_feature_scores.head(20))

# Feature Selection using Gini Index

Use a Decision Tree Classifier to compute feature importances based on the Gini Index.

- The Decision Tree Classifier computes feature importances based on how much each feature decreases the impurity (Gini Index) in the tree.
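
The impurity decrease behind this importance measure can be worked through by hand. The helper names `gini_impurity` and `impurity_decrease` below are introduced for illustration and are not part of scikit-learn; the sketch only shows the quantity for a single split, whereas a tree accumulates it over all splits that use a feature.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


parent = [1, 1, 0, 0]

# A split that separates the classes perfectly removes all impurity
print(impurity_decrease(parent, [1, 1], [0, 0]))

# A split that leaves both children mixed removes none
print(impurity_decrease(parent, [1, 0], [1, 0]))
```

A feature whose splits produce large decreases ends up with a high importance score.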

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Initialize Decision Tree Classifier
dt_clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Fit the model
dt_clf.fit(X, y)

# Get feature importances
gini_importances = dt_clf.feature_importances_

# Create a DataFrame for easy viewing
gini_feature_importances = pd.DataFrame({'Feature': X.columns, 'Gini_Importance': gini_importances})

# Sort the features by importance
gini_feature_importances.sort_values(by='Gini_Importance', ascending=False, inplace=True)

# Display the top features
print("Top features according to Gini Importance:")
print(gini_feature_importances.head(20))

# Combine Feature Scores

In [None]:
# Merge the feature scores on the 'Feature' column
feature_scores = relief_feature_scores.merge(mi_feature_scores, on='Feature')
feature_scores = feature_scores.merge(gini_feature_importances, on='Feature')

print("Combined Feature Scores:")
print(feature_scores.head())

Normalize the scores so that they are on the same scale before combining them.

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
feature_scores[['ReliefF_Score', 'MI_Score', 'Gini_Importance']] = scaler.fit_transform(
    feature_scores[['ReliefF_Score', 'MI_Score', 'Gini_Importance']])

print("Normalized Feature Scores:")
print(feature_scores.head())

Compute the combined score by summing the normalized scores.
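
As a small worked example of normalizing and summing, the sketch below applies min-max scaling by hand to three made-up features and ranks them by the sum. The feature names and scores are invented, and the scaling is written out with pandas arithmetic rather than `MinMaxScaler` so that each step is visible.

```python
import pandas as pd

# Toy scores from three ranking methods (illustrative values only)
scores = pd.DataFrame({
    'Feature': ['a', 'b', 'c'],
    'ReliefF_Score': [0.10, 0.30, 0.20],
    'MI_Score': [0.00, 0.05, 0.10],
    'Gini_Importance': [0.2, 0.6, 0.2],
})
score_cols = ['ReliefF_Score', 'MI_Score', 'Gini_Importance']

# Min-max scaling per column: (x - min) / (max - min), mapping each column to [0, 1]
col_min = scores[score_cols].min()
col_max = scores[score_cols].max()
normalized = (scores[score_cols] - col_min) / (col_max - col_min)

# Sum the normalized scores so that every method contributes on the same scale
scores['Combined_Score'] = normalized.sum(axis=1)
ranked = scores.sort_values('Combined_Score', ascending=False).reset_index(drop=True)

print(ranked[['Feature', 'Combined_Score']])
```

Without the scaling step, the method with the largest raw values would dominate the sum.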

In [None]:
# Compute the combined score
feature_scores['Combined_Score'] = (feature_scores['ReliefF_Score'] +
                                    feature_scores['MI_Score'] +
                                    feature_scores['Gini_Importance'])

# Rank features based on the combined score
feature_scores.sort_values(by='Combined_Score', ascending=False, inplace=True)

# Reset index
feature_scores.reset_index(drop=True, inplace=True)


print("Top features according to Combined Score:")
print(feature_scores[['Feature', 'Combined_Score']].head(20))

In [None]:
# Calculate the number of features to select (keep the top 61%, i.e. drop the lowest-ranked 39%)
num_features = int(len(feature_scores) * 0.61)

print(f"Number of features to select: {num_features}")

In [None]:
# Get the list of selected features
selected_features = feature_scores['Feature'].values[:num_features]

print("Selected features:")
print(selected_features)

In [None]:
# Create a new DataFrame with selected features
X_selected = X[selected_features]

print("Shape of X_selected:", X_selected.shape)

# Split the Data into Training and Testing Sets

In [None]:
from sklearn.model_selection import train_test_split

# Stratify ensures the class distribution remains balanced
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.3, random_state=42, stratify=y)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

# Train and Evaluate Models

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Initialize the model
lr_clf = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
lr_clf.fit(X_train, y_train)

# Predict on test set
y_pred_lr = lr_clf.predict(X_test)
y_pred_prob_lr = lr_clf.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy_lr = accuracy_score(y_test, y_pred_lr)
precision_lr = precision_score(y_test, y_pred_lr)
recall_lr = recall_score(y_test, y_pred_lr)
f1_lr = f1_score(y_test, y_pred_lr)
auc_lr = roc_auc_score(y_test, y_pred_prob_lr)

print("Logistic Regression Performance:")
print(f'Accuracy: {accuracy_lr:.3f}')
print(f'Precision: {precision_lr:.3f}')
print(f'Recall: {recall_lr:.3f}')
print(f'F1 Score: {f1_lr:.3f}')
print(f'AUC: {auc_lr:.3f}')

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_clf.fit(X_train, y_train)

# Predict on test set
y_pred_rf = rf_clf.predict(X_test)
y_pred_prob_rf = rf_clf.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)
auc_rf = roc_auc_score(y_test, y_pred_prob_rf)

print("Random Forest Performance:")
print(f'Accuracy: {accuracy_rf:.3f}')
print(f'Precision: {precision_rf:.3f}')
print(f'Recall: {recall_rf:.3f}')
print(f'F1 Score: {f1_rf:.3f}')
print(f'AUC: {auc_rf:.3f}')

## XGBoost

In [None]:
import xgboost as xgb

# Initialize the model
xgb_clf = xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')

# Train the model
xgb_clf.fit(X_train, y_train)

# Predict on test set
y_pred_xgb = xgb_clf.predict(X_test)
y_pred_prob_xgb = xgb_clf.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
precision_xgb = precision_score(y_test, y_pred_xgb)
recall_xgb = recall_score(y_test, y_pred_xgb)
f1_xgb = f1_score(y_test, y_pred_xgb)
auc_xgb = roc_auc_score(y_test, y_pred_prob_xgb)

print("XGBoost Performance:")
print(f'Accuracy: {accuracy_xgb:.3f}')
print(f'Precision: {precision_xgb:.3f}')
print(f'Recall: {recall_xgb:.3f}')
print(f'F1 Score: {f1_xgb:.3f}')
print(f'AUC: {auc_xgb:.3f}')

## AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Initialize the model
ada_clf = AdaBoostClassifier(n_estimators=100, random_state=42)

# Train the model
ada_clf.fit(X_train, y_train)

# Predict on test set
y_pred_ada = ada_clf.predict(X_test)
y_pred_prob_ada = ada_clf.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy_ada = accuracy_score(y_test, y_pred_ada)
precision_ada = precision_score(y_test, y_pred_ada)
recall_ada = recall_score(y_test, y_pred_ada)
f1_ada = f1_score(y_test, y_pred_ada)
auc_ada = roc_auc_score(y_test, y_pred_prob_ada)

print("AdaBoost Performance:")
print(f'Accuracy: {accuracy_ada:.6f}')
print(f'Precision: {precision_ada:.6f}')
print(f'Recall: {recall_ada:.6f}')
print(f'F1 Score: {f1_ada:.6f}')
print(f'AUC: {auc_ada:.6f}')

## Support Vector Machine (SVM)

In [None]:
from sklearn.svm import SVC

# Initialize the model
svm_clf = SVC(probability=True, random_state=42)

# Train the model
svm_clf.fit(X_train, y_train)

# Predict on test set
y_pred_svm = svm_clf.predict(X_test)
y_pred_prob_svm = svm_clf.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy_svm = accuracy_score(y_test, y_pred_svm)
precision_svm = precision_score(y_test, y_pred_svm)
recall_svm = recall_score(y_test, y_pred_svm)
f1_svm = f1_score(y_test, y_pred_svm)
auc_svm = roc_auc_score(y_test, y_pred_prob_svm)

print("SVM Performance:")
print(f'Accuracy: {accuracy_svm:.6f}')
print(f'Precision: {precision_svm:.6f}')
print(f'Recall: {recall_svm:.6f}')
print(f'F1 Score: {f1_svm:.6f}')
print(f'AUC: {auc_svm:.6f}')
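SVMs with the default RBF kernel are sensitive to feature scale, and summary statistics such as means and variances of vital signs can span very different ranges. If `X_train` has not already been standardized upstream, wrapping the SVM in a pipeline with `StandardScaler` is a common option. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the notebook's features, and `svm_pipe` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the notebook's real features)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_demo[:, 0] *= 1000  # exaggerate one feature's scale to mimic mixed units

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# The scaler is fitted on the training fold only, then applied to the test fold
svm_pipe = make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))
svm_pipe.fit(X_tr, y_tr)

svm_prob = svm_pipe.predict_proba(X_te)[:, 1]
print(f'AUC (scaled SVM, synthetic data): {roc_auc_score(y_te, svm_prob):.3f}')
```

Keeping the scaler inside the pipeline avoids fitting it on test data by accident.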

## Gradient Boosting Decision Trees (GBDT)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize the model
gbdt_clf = GradientBoostingClassifier(random_state=42)

# Train the model
gbdt_clf.fit(X_train, y_train)

# Predict on test set
y_pred_gbdt = gbdt_clf.predict(X_test)
y_pred_prob_gbdt = gbdt_clf.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy_gbdt = accuracy_score(y_test, y_pred_gbdt)
precision_gbdt = precision_score(y_test, y_pred_gbdt)
recall_gbdt = recall_score(y_test, y_pred_gbdt)
f1_gbdt = f1_score(y_test, y_pred_gbdt)
auc_gbdt = roc_auc_score(y_test, y_pred_prob_gbdt)

print("GBDT Performance:")
print(f'Accuracy: {accuracy_gbdt:.6f}')
print(f'Precision: {precision_gbdt:.6f}')
print(f'Recall: {recall_gbdt:.6f}')
print(f'F1 Score: {f1_gbdt:.6f}')
print(f'AUC: {auc_gbdt:.6f}')
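The cells above repeat the same fit, predict, and score steps for every classifier. One way to cut the repetition is a small helper. The sketch below is an illustration, not part of the original pipeline: `evaluate_classifier` is a hypothetical name, and the synthetic data from `make_classification` stands in for the notebook's `X_train`, `X_test`, `y_train`, and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate_classifier(model, X_tr, y_tr, X_te, y_te):
    """Fit `model` and return the five metrics used in this notebook."""
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    y_prob = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    return {
        'Accuracy': accuracy_score(y_te, y_pred),
        'Precision': precision_score(y_te, y_pred),
        'Recall': recall_score(y_te, y_pred),
        'F1 Score': f1_score(y_te, y_pred),
        'AUC': roc_auc_score(y_te, y_prob),
    }


# Synthetic stand-in data (assumption: not the notebook's real features)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

demo_models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'GBDT': GradientBoostingClassifier(random_state=42),
}
demo_results = {name: evaluate_classifier(model, X_tr, y_tr, X_te, y_te)
                for name, model in demo_models.items()}

for name, metrics in demo_results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

The resulting dictionary of dictionaries can be passed to `pd.DataFrame(demo_results).T` to get one row per model.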

## Create a DataFrame for the New Models


In [None]:
import pandas as pd

# Create a dictionary with model names and their corresponding performance metrics
performance_data = {
    'Model': [
        'Logistic Regression',
        'Random Forest',
        'XGBoost',
        'AdaBoost',
        'SVM',
        'GBDT'
    ],
    'Accuracy': [
        accuracy_lr,
        accuracy_rf,
        accuracy_xgb,
        accuracy_ada,
        accuracy_svm,
        accuracy_gbdt
    ],
    'Precision': [
        precision_lr,
        precision_rf,
        precision_xgb,
        precision_ada,
        precision_svm,
        precision_gbdt
    ],
    'Recall': [
        recall_lr,
        recall_rf,
        recall_xgb,
        recall_ada,
        recall_svm,
        recall_gbdt
    ],
    'F1 Score': [
        f1_lr,
        f1_rf,
        f1_xgb,
        f1_ada,
        f1_svm,
        f1_gbdt
    ],
    'AUC': [
        auc_lr,
        auc_rf,
        auc_xgb,
        auc_ada,
        auc_svm,
        auc_gbdt
    ]
}

# Create the DataFrame
performance_df = pd.DataFrame(performance_data)

# Display the performance comparison table
print("\nModel Performance Comparison:")
print(performance_df)

In [None]:
# Required libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
import xgboost as xgb

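# Evaluate models before feature reduction.
# Note: this split reassigns y_train and y_test. For a like-for-like comparison,
# the reduced-feature models above should have been trained on a split made with
# the same test_size, random_state, and stratify settings.
X_train_full, X_test_full, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)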

# Train and evaluate models on full data before feature reduction
# Logistic Regression
lr_full = LogisticRegression(max_iter=1000, random_state=42)
lr_full.fit(X_train_full, y_train)
y_pred_lr_full = lr_full.predict(X_test_full)
y_pred_prob_lr_full = lr_full.predict_proba(X_test_full)[:, 1]

# Random Forest
rf_full = RandomForestClassifier(n_estimators=100, random_state=42)
rf_full.fit(X_train_full, y_train)
y_pred_rf_full = rf_full.predict(X_test_full)
y_pred_prob_rf_full = rf_full.predict_proba(X_test_full)[:, 1]

# XGBoost
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
xgb_full = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
xgb_full.fit(X_train_full, y_train)
y_pred_xgb_full = xgb_full.predict(X_test_full)
y_pred_prob_xgb_full = xgb_full.predict_proba(X_test_full)[:, 1]

# AdaBoost
ada_full = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_full.fit(X_train_full, y_train)
y_pred_ada_full = ada_full.predict(X_test_full)
y_pred_prob_ada_full = ada_full.predict_proba(X_test_full)[:, 1]

# SVM
svm_full = SVC(probability=True, random_state=42)
svm_full.fit(X_train_full, y_train)
y_pred_svm_full = svm_full.predict(X_test_full)
y_pred_prob_svm_full = svm_full.predict_proba(X_test_full)[:, 1]

# GBDT
gbdt_full = GradientBoostingClassifier(random_state=42)
gbdt_full.fit(X_train_full, y_train)
y_pred_gbdt_full = gbdt_full.predict(X_test_full)
y_pred_prob_gbdt_full = gbdt_full.predict_proba(X_test_full)[:, 1]

# Calculate metrics for each model before feature reduction
# Logistic Regression
accuracy_lr_full = accuracy_score(y_test, y_pred_lr_full)
precision_lr_full = precision_score(y_test, y_pred_lr_full)
recall_lr_full = recall_score(y_test, y_pred_lr_full)
f1_lr_full = f1_score(y_test, y_pred_lr_full)
auc_lr_full = roc_auc_score(y_test, y_pred_prob_lr_full)

# Random Forest
accuracy_rf_full = accuracy_score(y_test, y_pred_rf_full)
precision_rf_full = precision_score(y_test, y_pred_rf_full)
recall_rf_full = recall_score(y_test, y_pred_rf_full)
f1_rf_full = f1_score(y_test, y_pred_rf_full)
auc_rf_full = roc_auc_score(y_test, y_pred_prob_rf_full)

# XGBoost
accuracy_xgb_full = accuracy_score(y_test, y_pred_xgb_full)
precision_xgb_full = precision_score(y_test, y_pred_xgb_full)
recall_xgb_full = recall_score(y_test, y_pred_xgb_full)
f1_xgb_full = f1_score(y_test, y_pred_xgb_full)
auc_xgb_full = roc_auc_score(y_test, y_pred_prob_xgb_full)

# AdaBoost
accuracy_ada_full = accuracy_score(y_test, y_pred_ada_full)
precision_ada_full = precision_score(y_test, y_pred_ada_full)
recall_ada_full = recall_score(y_test, y_pred_ada_full)
f1_ada_full = f1_score(y_test, y_pred_ada_full)
auc_ada_full = roc_auc_score(y_test, y_pred_prob_ada_full)

# SVM
accuracy_svm_full = accuracy_score(y_test, y_pred_svm_full)
precision_svm_full = precision_score(y_test, y_pred_svm_full)
recall_svm_full = recall_score(y_test, y_pred_svm_full)
f1_svm_full = f1_score(y_test, y_pred_svm_full)
auc_svm_full = roc_auc_score(y_test, y_pred_prob_svm_full)

# GBDT
accuracy_gbdt_full = accuracy_score(y_test, y_pred_gbdt_full)
precision_gbdt_full = precision_score(y_test, y_pred_gbdt_full)
recall_gbdt_full = recall_score(y_test, y_pred_gbdt_full)
f1_gbdt_full = f1_score(y_test, y_pred_gbdt_full)
auc_gbdt_full = roc_auc_score(y_test, y_pred_prob_gbdt_full)

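# Print formatted comparison table (header widths match the data rows below)
print("Comparison of the prediction performance before and after feature reduction.")
print("-" * 100)
print(f"{'':<8}{'Before feature reduction':<44}{'After feature reduction'}")
print(f"{'Methods':<8}{'Num':<6}{'PRE':<7}{'REC':<7}{'ACC':<7}{'F1':<7}{'AUC':<10}"
      f"{'Num':<7}{'PRE':<7}{'REC':<7}{'ACC':<7}{'F1':<7}{'AUC'}")
print("-" * 100)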

# Define feature counts
total_features = X.shape[1]  # Total features before reduction
selected_features = X_selected.shape[1]  # Total features after reduction

# Print data for each model
methods = ['GBDT', 'SVM', 'LR', 'XGB', 'RF', 'AdaBoost']
for method in methods:
    # Print model name
    print(f"{method:<8}", end="")

    # Print metrics before feature reduction
    print(f"{total_features:<6}", end="")

    if method == 'GBDT':
        print(f"{precision_gbdt_full:.3f}  {recall_gbdt_full:.3f}  {accuracy_gbdt_full:.3f}  {f1_gbdt_full:.3f}  {auc_gbdt_full:.3f}     ", end="")
        print(f"{selected_features:<6} {precision_gbdt:.3f}  {recall_gbdt:.3f}  {accuracy_gbdt:.3f}  {f1_gbdt:.3f}  {auc_gbdt:.3f}")
    elif method == 'SVM':
        print(f"{precision_svm_full:.3f}  {recall_svm_full:.3f}  {accuracy_svm_full:.3f}  {f1_svm_full:.3f}  {auc_svm_full:.3f}     ", end="")
        print(f"{selected_features:<6} {precision_svm:.3f}  {recall_svm:.3f}  {accuracy_svm:.3f}  {f1_svm:.3f}  {auc_svm:.3f}")
    elif method == 'LR':
        print(f"{precision_lr_full:.3f}  {recall_lr_full:.3f}  {accuracy_lr_full:.3f}  {f1_lr_full:.3f}  {auc_lr_full:.3f}     ", end="")
        print(f"{selected_features:<6} {precision_lr:.3f}  {recall_lr:.3f}  {accuracy_lr:.3f}  {f1_lr:.3f}  {auc_lr:.3f}")
    elif method == 'XGB':
        print(f"{precision_xgb_full:.3f}  {recall_xgb_full:.3f}  {accuracy_xgb_full:.3f}  {f1_xgb_full:.3f}  {auc_xgb_full:.3f}     ", end="")
        print(f"{selected_features:<6} {precision_xgb:.3f}  {recall_xgb:.3f}  {accuracy_xgb:.3f}  {f1_xgb:.3f}  {auc_xgb:.3f}")
    elif method == 'RF':
        print(f"{precision_rf_full:.3f}  {recall_rf_full:.3f}  {accuracy_rf_full:.3f}  {f1_rf_full:.3f}  {auc_rf_full:.3f}     ", end="")
        print(f"{selected_features:<6} {precision_rf:.3f}  {recall_rf:.3f}  {accuracy_rf:.3f}  {f1_rf:.3f}  {auc_rf:.3f}")
    elif method == 'AdaBoost':
        print(f"{precision_ada_full:.3f}  {recall_ada_full:.3f}  {accuracy_ada_full:.3f}  {f1_ada_full:.3f}  {auc_ada_full:.3f}     ", end="")
        print(f"{selected_features:<6} {precision_ada:.3f}  {recall_ada:.3f}  {accuracy_ada:.3f}  {f1_ada:.3f}  {auc_ada:.3f}")

print("-" * 100)
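The hand-formatted table above can also be assembled with pandas, which keeps the columns aligned and avoids the `if`/`elif` chain. The sketch below is an illustration under stated assumptions: it uses synthetic data, two stand-in models, and the first five columns as a placeholder for `X_selected`. The names `score`, `make_models`, `before`, `after`, and `comparison` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def score(model, X_tr, y_tr, X_te, y_te):
    """Fit the model and return metrics in the table's column order."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    prob = model.predict_proba(X_te)[:, 1]
    return {
        'Num': X_tr.shape[1],
        'PRE': precision_score(y_te, pred),
        'REC': recall_score(y_te, pred),
        'ACC': accuracy_score(y_te, pred),
        'F1': f1_score(y_te, pred),
        'AUC': roc_auc_score(y_te, prob),
    }


# Synthetic stand-in data; the first 5 columns are an arbitrary placeholder
# for the reduced feature set (assumption, not the notebook's X_selected)
X_all, y_all = make_classification(n_samples=300, n_features=12, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42, stratify=y_all)

# Factories so that each evaluation gets a fresh, unfitted model
make_models = {
    'LR': lambda: LogisticRegression(max_iter=1000, random_state=42),
    'Tree': lambda: DecisionTreeClassifier(random_state=42),
}

before = pd.DataFrame({name: score(make(), X_tr, y_tr, X_te, y_te)
                       for name, make in make_models.items()}).T
after = pd.DataFrame({name: score(make(), X_tr[:, :5], y_tr, X_te[:, :5], y_te)
                      for name, make in make_models.items()}).T

# Side-by-side table with a two-level column header
comparison = pd.concat(
    {'Before feature reduction': before, 'After feature reduction': after}, axis=1)
print(comparison.round(3))
```

Both halves of the table use the same train/test split here, so the before and after rows are directly comparable.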

In [None]:
# # Further attempt with ensemble learning to see if there is any improvement

# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler
# from sklearn.linear_model import LogisticRegression
# from sklearn.neural_network import MLPClassifier
# from sklearn.naive_bayes import GaussianNB
# from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, StackingClassifier
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.svm import SVC
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
# import xgboost as xgb

# # Prepare the feature columns
# features = final_df_cleaned[[
#     '220051_mean', '220051_max', '220051_min', '220051_median', '220051_std',
#     '220051_skewness', '220051_kurtosis', '220051_q75', '220051_q25',
#     '220051_mad', '220051_range', '220051_var',

#     '220050_mean', '220050_max', '220050_min', '220050_median', '220050_std',
#     '220050_skewness', '220050_kurtosis', '220050_q75', '220050_q25',
#     '220050_mad', '220050_range', '220050_var',

#     '220045_mean', '220045_max', '220045_min', '220045_median', '220045_std',
#     '220045_skewness', '220045_kurtosis', '220045_q75', '220045_q25',
#     '220045_mad', '220045_range', '220045_var',

#     '220277_mean', '220277_max', '220277_min', '220277_median', '220277_std',
#     '220277_skewness', '220277_kurtosis', '220277_q75', '220277_q25',
#     '220277_mad', '220277_range', '220277_var',

#     '220052_mean', '220052_max', '220052_min', '220052_median', '220052_std',
#     '220052_skewness', '220052_kurtosis', '220052_q75', '220052_q25',
#     '220052_mad', '220052_range', '220052_var',

#     '220210_mean', '220210_max', '220210_min', '220210_median', '220210_std',
#     '220210_skewness', '220210_kurtosis', '220210_q75', '220210_q25',
#     '220210_mad', '220210_range', '220210_var',

#     'apsiii', 'age', 'gender', 'SOFA', 'vasopressin', 'ventilation'
# ]]

# # Extract the label column
# labels = final_df_cleaned['label']

# # Split data into training and test sets
# X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# # Standardize the data (optional step)
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

# # Define base models (base learners), selecting from models
# models = {
#     "Logistic Regression": LogisticRegression(),
#     #"MLP Classifier": MLPClassifier(max_iter=1000),
#     #"Naive Bayes": GaussianNB(),
#     "Random Forest": RandomForestClassifier(n_estimators=100),
#     "Decision Tree": DecisionTreeClassifier(),
#     "SVM": SVC(probability=True),
#     #"KNN": KNeighborsClassifier(n_neighbors=5),
#     "GBDT": GradientBoostingClassifier(),  # GBDT model
#     "XGBoost": xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss'),  # XGBoost model
#     "AdaBoost": AdaBoostClassifier(n_estimators=100)  # AdaBoost model
# }

# # Convert the models dictionary to a list of (name, model) pairs for base learners
# base_learners = [(name, model) for name, model in models.items()]

# # Define the meta-learner (secondary learner), using Logistic Regression as the meta-model
# stacking_model = StackingClassifier(
#     estimators=base_learners,
#     final_estimator=LogisticRegression(),  # Meta-learner, can be replaced with another model
#     cv=None
# )

# # Train the Stacking model
# stacking_model.fit(X_train_scaled, y_train)

# # Predict and evaluate the Stacking model
# y_pred = stacking_model.predict(X_test_scaled)
# y_pred_proba = stacking_model.predict_proba(X_test_scaled)[:, 1]

# # Calculate AUROC
# auc = roc_auc_score(y_test, y_pred_proba)

# # Output the model accuracy
# accuracy = accuracy_score(y_test, y_pred)
# print(f"Stacking model accuracy: {accuracy:.4f}")
# print(f"Stacking model AUROC: {auc:.4f}")

# # Output detailed classification report
# print(classification_report(y_test, y_pred))
