# Train.ipynb
A notebook to kickstart training your own model. It was written on 7/25/2024 and aligned with Leo and JJ.

#### Import libraries/packages

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

#### Read the data

In [3]:
df = pd.read_excel('data/CARES_data.xlsx')

#### About the Dataframe

In [6]:
# Check the shape of the dataframe:
# 90,785 patients and 32 columns. As established later, 5 of the columns are
# dependent (target) variables and 27 are independent variables (features).
print(df.shape)

# Show the columns of the dataframe
print(df.columns)

# Print the first 5 rows of the dataframe
df.head()

(90785, 32)
Index(['Indexno', 'AGE', 'GENDER', 'RCRI score', 'Anemia category',
       'PreopEGFRMDRD', 'GradeofKidneydisease', 'DaysbetweenDeathandoperation',
       '@30daymortality', 'Preoptransfusionwithin30days', 'Intraop',
       'Postopwithin30days', 'Transfusionintraandpostop', 'AnaestypeCategory',
       'PriorityCategory', 'TransfusionIntraandpostopCategory', 'AGEcategory',
       'AGEcategoryOriginal', 'Mortality', 'thirtydaymortality',
       'SurgRiskCategory', 'RaceCategory', 'CVARCRICategory',
       'IHDRCRICategory', 'CHFRCRICategory', 'DMinsulinRCRICategory',
       'CreatinineRCRICategory', 'GradeofKidneyCategory',
       'Anemiacategorybinned', 'RDW15.7', 'ASAcategorybinned', 'ICUAdmgt24h'],
      dtype='object')


Unnamed: 0,Indexno,AGE,GENDER,RCRI score,Anemia category,PreopEGFRMDRD,GradeofKidneydisease,DaysbetweenDeathandoperation,@30daymortality,Preoptransfusionwithin30days,...,CVARCRICategory,IHDRCRICategory,CHFRCRICategory,DMinsulinRCRICategory,CreatinineRCRICategory,GradeofKidneyCategory,Anemiacategorybinned,RDW15.7,ASAcategorybinned,ICUAdmgt24h
0,2,48,FEMALE,,,,BLANK,,NO,0,...,#NULL!,#NULL!,#NULL!,#NULL!,no,#NULL!,#NULL!,#NULL!,I,no
1,5,36,FEMALE,,none,,BLANK,,NO,0,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,,<= 15.7,I,no
2,6,64,FEMALE,,mild,152.53857,g1,,NO,0,...,#NULL!,#NULL!,#NULL!,#NULL!,#NULL!,G1,Mild,<= 15.7,I,no
3,9,73,MALE,,moderate,117.231496,g1,,NO,0,...,#NULL!,#NULL!,#NULL!,#NULL!,no,G1,Moderate/Severe,<= 15.7,I,no
4,10,73,MALE,0.0,mild,98.651255,g1,59.0,NO,0,...,no,no,no,no,no,G1,Mild,>15.7,II,no


#### Data Preparation

The paper states only that "Outcome measures were death within 30 days after surgery and ICU admission", but our dataset also contains DaysbetweenDeathandoperation.

Additionally, @30daymortality and thirtydaymortality are the same thing by definition (verified by JJ, Leo, JS).
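The equivalence of the two mortality columns can also be checked directly on the data. The following is a minimal sketch on a made-up toy frame, not on the real dataset: the cell values, the `normalize` helper, and the deliberately planted disagreement in the last row are all assumptions for illustration. On the real dataframe, the same comparison would be run on `df` instead of `toy`.

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
toy = pd.DataFrame({
    "@30daymortality": ["NO", "YES", "NO", "NO"],
    "thirtydaymortality": ["No", "Yes", "No", "Yes"],  # last row disagrees on purpose
})

def normalize(col):
    """Hypothetical helper: compare labels ignoring case and surrounding spaces."""
    return col.astype(str).str.strip().str.lower()

a = normalize(toy["@30daymortality"])
b = normalize(toy["thirtydaymortality"])

# Rows where the two columns do not carry the same label
mismatch = toy[a != b]
n_mismatch = len(mismatch)

# Cross-tabulation: off-diagonal counts are disagreements
print(pd.crosstab(a, b))
print("Rows that disagree:", n_mismatch)
```

If the two columns really are identical by definition, the number of disagreeing rows on the real data should be zero, and one of the two columns can be treated as redundant.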

In [12]:
# Separate the data into X and Y

# Names of the target variable columns
target_variables = ['DaysbetweenDeathandoperation', '@30daymortality', 'thirtydaymortality', 'Mortality', 'ICUAdmgt24h']

# Y holds the target variable columns
Y = df[target_variables]

# X are the feature columns
X = df.drop(target_variables, axis=1)

# Check the shape of the X and Y
print('X_i, Features Shape are:',X.shape)
print('Y_i, Target Variables Shape are:',Y.shape)

X_i, Features Shape are: (90785, 27)
Y_i, Target Variables Shape are: (90785, 5)


Assuming all data preparation has been done, we can do a train/test/val split,
where train is for training the model, test is for hyperparameter tuning, and val is for validating the model on an unseen set.

Note: The paper did not use a separate test set for hyperparameter tuning, but Leo and JS agreed to include this split since the requirements PDF specified it.

In [15]:
# Split the data into training, testing and validation sets

# First, hold out 20% of all rows as the testing set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, 
                                                    test_size=0.2, 
                                                    random_state=2024)

# Then split the remaining training data into training and validation sets (20% of the remainder)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, 
                                                  test_size=0.2, 
                                                  random_state=2024)

In [17]:
# Let's check the shape of all the sets
print('X_train Shape:',X_train.shape)
print('Y_train Shape:',Y_train.shape)
print('---------------------------------')
print('X_test Shape:',X_test.shape)
print('Y_test Shape:',Y_test.shape)
print('---------------------------------')
print('X_val Shape:',X_val.shape)
print('Y_val Shape:',Y_val.shape)

X_train Shape: (58102, 27)
Y_train Shape: (58102, 5)
---------------------------------
X_test Shape: (18157, 27)
Y_test Shape: (18157, 5)
---------------------------------
X_val Shape: (14526, 27)
Y_val Shape: (14526, 5)
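The shapes above follow from applying `test_size=0.2` twice: once to all 90,785 rows and once to the rows that remain. The sketch below reproduces the arithmetic in plain Python. The `split_sizes` helper is hypothetical, and it assumes that the held-out part is rounded up to a whole row, which matches the shapes printed above.

```python
import math

def split_sizes(n_rows, test_size):
    """Hypothetical helper: return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out count is rounded up to a whole number of rows.
    """
    n_held_out = math.ceil(test_size * n_rows)
    return n_rows - n_held_out, n_held_out

n_total = 90785

# First split: hold out 20% of all rows as the test set
n_train_full, n_test = split_sizes(n_total, 0.2)

# Second split: hold out 20% of the remaining rows as the validation set
n_train, n_val = split_sizes(n_train_full, 0.2)

for name, n in [("train", n_train), ("test", n_test), ("val", n_val)]:
    print(f"{name}: {n} rows ({n / n_total:.1%} of all rows)")
```

The effective proportions are therefore roughly 64% train, 20% test, and 16% val, not 60/20/20, because the second 20% is taken from the remaining 80% of the data.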
