# Pre-Processing Pipelines

## Clinical Risk Factor Pre-Processing

BSA is calculated with the Mosteller formula, $\sqrt{\frac{\text{Weight (kg)} \times \text{Height (cm)}}{3600}}$: https://www.registerednursern.com/body-surface-area-calculations-nursing-review/

BMI is calculated as $\frac{\text{Weight (kg)}}{\text{Height (m)}^{2}}$: https://www.registerednursern.com/bmi-calculation-formula-explained/
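
As a quick sanity check, the two formulas can be sketched in plain Python. The function names `bsa_mosteller` and `bmi` are illustrative and not part of the notebook; the inputs are the Weight and Height of Record 1911 from the CRF table.

```python
import math


def bsa_mosteller(weight_kg: float, height_cm: float) -> float:
    """Body surface area in m^2: sqrt(weight * height / 3600)."""
    return math.sqrt(weight_kg * height_cm / 3600)


def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index in kg/m^2; height is converted from cm to m first."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


# Record 1911: Weight 105 kg, Height 180 cm
print(round(bsa_mosteller(105, 180), 2))  # 2.29
print(round(bmi(105, 180), 2))            # 32.41
```

Both results agree with the BSA and BMI columns stored for that record, which suggests the dataset was built with the same formulas.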

In [20]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

%matplotlib inline

crf_path = "data/CRFs.csv"

In [21]:
df = pd.read_csv(crf_path)
df.head()

Unnamed: 0,Record,Gender,Age,Weight,Height,BSA,BMI,Smoker,SBP,DBP,IMT MAX,LVMi,EF,Vascular event
0,1911,M,56,105,180,2.29,32.41,yes,140.0,80.0,4.0,123.0,66.0,none
1,2012,M,72,83,169,1.97,29.06,no,130.0,75.0,,121.0,69.0,none
2,2019,F,80,80,165,1.91,29.38,no,177.0,75.0,2.5,164.0,56.0,none
3,2020,M,77,88,178,2.09,27.77,no,140.0,85.0,2.7,115.0,67.0,none
4,2025,F,66,80,174,1.97,26.42,no,110.0,65.0,1.5,98.0,66.0,none


### All Values Dataframe Pre-Processing

In [22]:
df_all_vals = df.copy()

In [23]:
df_all_vals.replace('n/a', np.nan, inplace=True)
df_all_vals.head()

Unnamed: 0,Record,Gender,Age,Weight,Height,BSA,BMI,Smoker,SBP,DBP,IMT MAX,LVMi,EF,Vascular event
0,1911,M,56,105,180,2.29,32.41,yes,140.0,80.0,4.0,123.0,66.0,none
1,2012,M,72,83,169,1.97,29.06,no,130.0,75.0,,121.0,69.0,none
2,2019,F,80,80,165,1.91,29.38,no,177.0,75.0,2.5,164.0,56.0,none
3,2020,M,77,88,178,2.09,27.77,no,140.0,85.0,2.7,115.0,67.0,none
4,2025,F,66,80,174,1.97,26.42,no,110.0,65.0,1.5,98.0,66.0,none
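
One thing worth checking: `pd.read_csv` treats the string `n/a` as a missing value by default, so the `replace` call above may already be a no-op. The sketch below uses a hypothetical two-row excerpt (not the real file) to show the default behaviour.

```python
import io

import pandas as pd

# Hypothetical excerpt; the second row uses the literal string 'n/a'
csv_text = "Record,IMT MAX,LVMi\n1911,4.0,123.0\n2012,n/a,121.0\n"

sample = pd.read_csv(io.StringIO(csv_text))

# 'n/a' is in pandas' default list of NA markers, so it is parsed as NaN
print(sample["IMT MAX"].isna().sum())
print(sample.dtypes)  # IMT MAX stays a float column
```

Running `df.isna().sum()` on the real dataframe directly after loading would confirm whether any `n/a` strings survive parsing.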


In [24]:
num_imputer = SimpleImputer(strategy='mean')
df_all_vals[['IMT MAX', 'LVMi', 'EF']] = num_imputer.fit_transform(df_all_vals[['IMT MAX', 'LVMi', 'EF']])

### Dropped Values Dataframe Pre-Processing

This dataframe drops IMT MAX, LVMi and EF and uses only the remaining columns.

In [25]:
df = df.drop(columns=['IMT MAX','LVMi','EF'])
df.head()

Unnamed: 0,Record,Gender,Age,Weight,Height,BSA,BMI,Smoker,SBP,DBP,Vascular event
0,1911,M,56,105,180,2.29,32.41,yes,140.0,80.0,none
1,2012,M,72,83,169,1.97,29.06,no,130.0,75.0,none
2,2019,F,80,80,165,1.91,29.38,no,177.0,75.0,none
3,2020,M,77,88,178,2.09,27.77,no,140.0,85.0,none
4,2025,F,66,80,174,1.97,26.42,no,110.0,65.0,none


For the all-values dataframe: the `n/a` value is standardized to `np.nan` so that it works with pandas/NumPy.

Gender, Smoker and Vascular event values need to be encoded.

In [26]:
df['Gender'] = df['Gender'].str.upper().map({'M':0, 'F': 1})
df['Smoker'] = df['Smoker'].str.upper().map({'NO': 0, 'YES': 1})
df['Vascular event'] = df['Vascular event'].astype('category')
df.head()

Unnamed: 0,Record,Gender,Age,Weight,Height,BSA,BMI,Smoker,SBP,DBP,Vascular event
0,1911,0,56,105,180,2.29,32.41,1,140.0,80.0,none
1,2012,0,72,83,169,1.97,29.06,0,130.0,75.0,none
2,2019,1,80,80,165,1.91,29.38,0,177.0,75.0,none
3,2020,0,77,88,178,2.09,27.77,0,140.0,85.0,none
4,2025,1,66,80,174,1.97,26.42,0,110.0,65.0,none
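
A caveat with `.map`: any value that is not a key in the mapping silently becomes NaN. The sketch below, on a hypothetical sample containing an unexpected `Y`, shows one way to list such values before modelling.

```python
import pandas as pd

# Hypothetical sample; 'Y' is not covered by the mapping
sample = pd.DataFrame({"Smoker": ["yes", "no", "Y", "NO"]})

encoded = sample["Smoker"].str.upper().map({"NO": 0, "YES": 1})

# Rows that failed to map are NaN; collect the original values for review
unmapped = sample.loc[encoded.isna(), "Smoker"].unique().tolist()
print(unmapped)  # ['Y']
```

An empty list means every value was covered by the mapping.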


In [27]:
num_imputer = SimpleImputer(strategy='mean')
df[['SBP','DBP']] = num_imputer.fit_transform(df[['SBP','DBP']])

#### Feature Engineering

Two new features are created: Pulse Pressure and BMI Category.

Pulse pressure is the difference between systolic and diastolic blood pressure. It can indicate health issues before symptoms develop and signal risk for certain diseases or conditions.

BMI Category follows the adult categories set out by the US Centers for Disease Control and Prevention (CDC). The categories are BMI ranges (below 18.5, 18.5 to below 25, 25 to below 30, 30 and above) labelled Underweight, Normal, Overweight and Obese. The Obese category is further divided into classes based on how far the value lies above the threshold of 30.

In [28]:
df['Pulse Pressure'] = df['SBP'] - df['DBP']
# df['BMI Category'] = pd.cut(df['BMI'], bins=[0, 18.5, 25, 30, np.inf], labels=['Underweight', 'Normal', 'Overweight', 'Obese'])

Pulse Pressure: https://my.clevelandclinic.org/health/body/21629-pulse-pressure

BMI Categories: https://www.cdc.gov/bmi/adult-calculator/bmi-categories.html
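
The commented-out `pd.cut` call uses the default `right=True`, which would place a BMI of exactly 18.5 in Underweight and exactly 25 in Normal. A minimal sketch, assuming the CDC boundaries should be closed on the left, passes `right=False` instead. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

bmi_values = pd.Series([17.0, 18.5, 24.9, 25.0, 30.0, 32.41])

# right=False gives the bins [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bmi_category = pd.cut(
    bmi_values,
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)
print(bmi_category.tolist())
# ['Underweight', 'Normal', 'Normal', 'Overweight', 'Obese', 'Obese']
```

If this feature is re-enabled, it has to be computed from the raw BMI values, before the BMI column is standardized.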

#### Numerical Features Normalization

A standard scaler is used here; a min-max scaler could also be tried.

In [29]:
scaler = StandardScaler()
num_cols = ['Age', 'Weight', 'Height', 'SBP', 'DBP', 'Pulse Pressure', 'BSA', 'BMI']
df[num_cols] = scaler.fit_transform(df[num_cols])
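
Note that the scaler here is fitted on the full dataframe before the train/test split, so test-set statistics influence the scaling. A possible alternative, sketched on synthetic data with illustrative names (`X_demo`, `demo_scaler`), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))  # synthetic stand-in for num_cols
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # mean and std learned from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # test rows reuse the training statistics
```

The same reasoning applies to the mean imputation of SBP and DBP.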

#### Mapping and One-Hot Encoding for BMI Category

In [30]:
# bmi_mapping = {'Underweight': 0, 'Normal': 1, 'Overweight': 2, 'Obese': 3}
# df['BMI Category'] = df['BMI Category'].map(bmi_mapping)
# df = pd.get_dummies(df, columns=['BMI Category'], prefix='BMI')
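
For reference, the sketch below runs the same mapping and one-hot steps on a small made-up frame to show which column names they produce.

```python
import pandas as pd

# Made-up sample; no 'Underweight' row is present
demo = pd.DataFrame({"BMI Category": ["Normal", "Obese", "Overweight", "Obese"]})
bmi_mapping = {"Underweight": 0, "Normal": 1, "Overweight": 2, "Obese": 3}

demo["BMI Category"] = demo["BMI Category"].map(bmi_mapping)
demo = pd.get_dummies(demo, columns=["BMI Category"], prefix="BMI")
print(demo.columns.tolist())  # ['BMI_1', 'BMI_2', 'BMI_3']
```

Only codes that occur in the data get a column, so no `BMI_0` appears here. Column names such as `BMI_1` and `BMI_3` in later outputs are consistent with a run in which this cell was active.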

#### Train-Test Dataset Splitting

The target attribute is separated from the remaining features.

The dataset is then split into training and testing sets with an 80:20 split. This ratio could change after data synthesis and dataset balancing.

In [31]:
X = df.drop(columns=['Record', 'Vascular event'])
y = df['Vascular event']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
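
The effect of `stratify` can be illustrated on a toy target. The data below is synthetic, and `random_state=42` is an assumption added only to make the sketch reproducible; the notebook's own call does not fix a seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target with an 80/20 class imbalance
y_demo = pd.Series(["none"] * 80 + ["stroke"] * 20)
X_demo = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The 20 test rows keep the 80/20 class ratio of the full data
print(y_te.value_counts().to_dict())  # {'none': 16, 'stroke': 4}
```

Without a fixed `random_state`, each run produces a different split, so the scores reported further down will vary between runs.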

#### Feature Selection

Feature importance is normalized by dividing every score by the maximum score, so the most important feature has a value of 1 and every other value is a fraction of that maximum.

##### K-Best Feature Selection

In [32]:
from sklearn.feature_selection import SelectKBest, f_classif

# Keep the 7 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=7)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
selected_features = X_train.columns[selector.get_support()]
print("Selected features:", selected_features)

# Normalize the F-scores of the selected features by the maximum score
feature_importances = selector.scores_[selector.get_support()]
feature_importances_df = pd.DataFrame({
    'Feature': selected_features,
    'Importance': feature_importances
})
feature_importances_df['Importance'] = feature_importances_df['Importance'] / feature_importances_df['Importance'].max()
feature_importances_df = feature_importances_df.sort_values(by='Importance', ascending=False)
feature_importances_df.head()

ValueError: could not convert string to float: 'Obese'
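
This error is raised when a string column reaches `f_classif`; it presumably comes from a run in which `BMI Category` still held labels such as `Obese`. A small check like the one below, on a hypothetical feature frame, lists non-numeric columns before feature selection.

```python
import pandas as pd

# Hypothetical feature frame with one leftover string column
X_demo = pd.DataFrame({
    "Age": [56, 72, 80],
    "BMI": [32.41, 29.06, 29.38],
    "BMI Category": ["Obese", "Overweight", "Overweight"],
})

# Columns that are not numeric must be encoded or dropped first
non_numeric = X_demo.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)  # ['BMI Category']
```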

##### Recursive Feature Elimination Feature Selection

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=7)
X_train_rfe = rfe.fit_transform(X_train, y_train)
X_test_rfe = rfe.transform(X_test)
selected_features_rfe = X_train.columns[rfe.support_]
print("Selected features using RFE:", selected_features_rfe)
model.fit(X_train_rfe, y_train)
feature_importances_rfe = model.coef_[0]
feature_importances_df_rfe = pd.DataFrame({
    'Feature': selected_features_rfe,
    'Importance': feature_importances_rfe
})
feature_importances_df_rfe['Importance'] = feature_importances_df_rfe['Importance'] / feature_importances_df_rfe['Importance'].max()
feature_importances_df_rfe = feature_importances_df_rfe.sort_values(by='Importance', ascending=False)
feature_importances_df_rfe.head()

Selected features using RFE: Index(['Gender', 'Age', 'BMI', 'SBP', 'DBP', 'BMI_1', 'BMI_3'], dtype='object')


Unnamed: 0,Feature,Importance
3,SBP,1.0
0,Gender,0.796224
2,BMI,0.546185
1,Age,0.475559
5,BMI_1,0.352893


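Some importance values from the RFE method are negative. The values are logistic regression coefficients (`model.coef_[0]`), which are signed, and for a multiclass target `coef_[0]` holds only the coefficients of the first class. How feature importance is measured for RFE needs to be reviewed.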
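
One possible alternative, offered as a sketch rather than a definitive fix, is to take the absolute value of the coefficients and average them over all classes before normalizing. The data comes from `make_classification` and the feature names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem standing in for the CRF data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# coef_ has one row per class; use magnitudes and average across classes
importance = np.abs(clf.coef_).mean(axis=0)
importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importance / importance.max(),
}).sort_values(by="Importance", ascending=False)
print(importance_df)
```

With this definition every importance lies between 0 and 1, which makes the values comparable with the K-Best and random forest tables.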

##### Selection From Model Training Using Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
import pandas as pd

rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
model = SelectFromModel(rf, prefit=True)
X_train_rf = model.transform(X_train)
X_test_rf = model.transform(X_test)
selected_features_rf = X_train.columns[model.get_support()]
print("Selected features using RandomForest:", selected_features_rf)
feature_importances = rf.feature_importances_
feature_importances_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importances
})
feature_importances_df['Importance'] = feature_importances_df['Importance'] / feature_importances_df['Importance'].max()
feature_importances_df = feature_importances_df.sort_values(by='Importance', ascending=False)
feature_importances_df.head()

Selected features using RandomForest: Index(['Age', 'Weight', 'Height', 'BSA', 'BMI', 'SBP', 'Pulse Pressure'], dtype='object')




Unnamed: 0,Feature,Importance
9,Pulse Pressure,1.0
7,SBP,0.845408
5,BMI,0.661961
1,Age,0.613875
4,BSA,0.574247


#### Model Testing

The pre-processed data is tested with a standard SVM classifier.

In [None]:
from sklearn import svm
from sklearn.metrics import classification_report, accuracy_score

# Initialize the model
model = svm.SVC(max_iter=1000)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.8571428571428571
Classification Report:
                        precision    recall  f1-score   support

myocardial infarction       0.00      0.00      0.00         2
                 none       0.86      1.00      0.92        24
               stroke       0.00      0.00      0.00         1
              syncope       0.00      0.00      0.00         1

             accuracy                           0.86        28
            macro avg       0.21      0.25      0.23        28
         weighted avg       0.73      0.86      0.79        28





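The 0.86 accuracy is misleading: the model predicts `none` for every test sample, so precision and recall are 0 for all three vascular event classes. The dataset needs to be balanced and additional data synthesized before the scores become meaningful.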
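
To make the imbalance visible in a single number, balanced accuracy (the mean of per-class recall) can be reported next to plain accuracy. The sketch below rebuilds the test-set class counts from the report above and pairs them with a constant `none` prediction; `y_true_demo` and `y_pred_demo` are illustrative names.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
)

# Class counts taken from the support column of the report above
y_true_demo = ["myocardial infarction"] * 2 + ["none"] * 24 + ["stroke"] + ["syncope"]
y_pred_demo = ["none"] * len(y_true_demo)  # always predict the majority class

acc = accuracy_score(y_true_demo, y_pred_demo)
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)
print("Accuracy:", acc)               # 24 / 28, about 0.857
print("Balanced accuracy:", bal_acc)  # 0.25

# zero_division=0 reports 0 for classes that are never predicted, without warnings
print(classification_report(y_true_demo, y_pred_demo, zero_division=0))
```

Passing `class_weight="balanced"` to `svm.SVC` is one inexpensive option to try before synthesizing data, although it may not be sufficient with so few event samples.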

## ECG Signal Pre-Processing

WFDB Documentation: https://wfdb.readthedocs.io/en/latest/index.html

In [19]:
import wfdb

ecg_data_path = "dataset"