<div class="alert alert-block alert-danger"> Be careful: running this notebook overwrites the previously saved model. Change the model's name before running it.</div>


In [1]:
import pandas as pd

import random
from auxiliary_functions import write_joblib, load_joblib
from transformations import normaliztion, unbalanced
from models import XGBmodel, RFmodel, LGBMmodel, Voting, Meta

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
 
from collections import Counter




In [2]:
lag = 40
path_to_read = '/Users/luisescobar/Documents/Thesis/DataSets/Dictionary/03_01_Train_Val_Test'
path_to_save = f'/Users/luisescobar/Documents/Thesis/Models/TOP5_exp/xgb/lag_{lag}min'
#path_to_save = '/Users/luisescobar/Documents/Thesis/Models'
file_name_block = 'blocking_80.pkl'
file_name_no_block = 'no_blocking_80.pkl'

dict_block = load_joblib(path_to_read, file_name_block)
dict_no_block = load_joblib(path_to_read, file_name_no_block)

In [3]:
random.seed(42)
loaded_dict = {**dict_block, **dict_no_block}

items = list(loaded_dict.items())
random.shuffle(items)
loaded_dict = dict(items)

In [4]:
print(len(dict_block))
print(len(dict_no_block))
print(len(loaded_dict))

72
564
636


<div class="alert alert-block alert-danger"> Lines 8 and 9 must be commented out if we want to include 'DeltaP', 'TMP', 'TMPa'</div>





In [5]:
#Create a single DataFrame ready to train
dataframes_list = list(loaded_dict.values())

# Concatenate all DataFrames vertically. Each DataFrame's index starts at zero, so the index is reset after dropping columns
combined_df = pd.concat(dataframes_list, ignore_index=False)

# Drop highly correlated columns
#d_columns = ['Q_Blood_Pump','Q_Replacement','Q_Dialysate','Q_PBP','Q_Patient_Fluid_Removal', 'DeltaP', 'TMPa']
d_columns = ['P_Effluent','Q_Blood_Pump','Q_Replacement','Q_Dialysate','Q_PBP', 'Q_Patient_Fluid_Removal', 'TMPa']
combined_df.drop(d_columns, axis=1, inplace=True)

#combined_df = pd.concat(dataframes_list, ignore_index=True)
combined_df = combined_df.reset_index(drop=True)

combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1276750 entries, 0 to 1276749
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype  
---  ------      --------------    -----  
 0   P_Access    1276750 non-null  int64  
 1   P_Filter    1276750 non-null  int64  
 2   P_Return    1276750 non-null  int64  
 3   DeltaP      1276750 non-null  int64  
 4   TMP         1276750 non-null  float64
 5   Clotting_2  1276750 non-null  int64  
dtypes: float64(1), int64(5)
memory usage: 58.4 MB


Build the models, starting with Random Forest and XGBoost.
An autocorrelation matrix could show up to which lag a value keeps predictive power. However, autocorrelation is not a good approach here, since we are not using time-dependent data.
Instead, run the experiment for three lag values and see what happens:
* 10 min
* 30 min
* 40 min
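
The lag experiment listed above can be sketched on toy data. This is an illustration, not the notebook's own method: it assumes that each DataFrame in the dictionary is one independent recording with one row per minute, and it shifts each frame separately so that lagged values never come from a different recording. The helper name `add_lagged_features` and the toy values are hypothetical.

```python
import pandas as pd


def add_lagged_features(frames, lag, target="Clotting_2"):
    """Shift every feature column by `lag` rows within each frame, then concatenate."""
    lagged_frames = []
    for frame in frames:
        features = frame.drop(columns=[target])
        lagged = features.shift(lag).add_suffix(f"_lag_{lag}")
        lagged[target] = frame[target]
        # Rows without a full lag history contain NaN and are dropped per frame.
        lagged_frames.append(lagged.dropna())
    return pd.concat(lagged_frames, ignore_index=True)


# Two toy recordings (hypothetical values), five rows each.
frame_a = pd.DataFrame({"P_Filter": [1, 2, 3, 4, 5], "Clotting_2": [0, 0, 0, 1, 1]})
frame_b = pd.DataFrame({"P_Filter": [10, 20, 30, 40, 50], "Clotting_2": [0, 0, 0, 0, 1]})

for lag in (1, 2):
    lagged_df = add_lagged_features([frame_a, frame_b], lag)
    print(f"lag={lag}: shape={lagged_df.shape}")
```

Shifting the already concatenated frame instead would pair the first `lag` rows of each recording with values from the end of the previous one, which may or may not matter for the final metrics.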

Does the use of
df_lagged = combined_df.dropna()
erase information from the original DataFrame?
No. dropna() returns a new DataFrame, so the rows containing NaN are removed only from df_lagged; combined_df is left unchanged.
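
A quick check of this behaviour on a toy DataFrame (the column names and values are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"P_Filter": [1.0, 2.0, 3.0]})
df["P_Filter_lag_1"] = df["P_Filter"].shift(1)  # the first row becomes NaN

# dropna() returns a new DataFrame; df itself is not modified
# unless inplace=True is passed.
df_lagged = df.dropna()

print(len(df), len(df_lagged))
print(df["P_Filter_lag_1"].isna().sum())
```

The original frame still has all its rows, including the one with NaN, while the new frame is one row shorter.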

In [6]:
#Baseline model
#The baseline model classifies all the points as zero, since most of them are zero. Therefore, it never recognizes blocking events
count_of_ones = combined_df['Clotting_2'].sum()
accuracy_baseline = (len(combined_df)-count_of_ones)/len(combined_df)
print(f'Total number of blocking points {count_of_ones}')
print(f'Baseline model accuracy {accuracy_baseline}')

Total number of blocking points 4867
Baseline model accuracy 0.9961879772860779
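
The baseline figures can be reproduced by hand from the two counts printed above. The sketch below also computes the recall on blocking points and the balanced accuracy of the all-zero classifier, which are not printed by the notebook but follow directly from the same counts.

```python
# Counts taken from the output above.
n_total = 1_276_750
n_blocking = 4_867

# An all-zero classifier is right on every non-blocking point
# and wrong on every blocking point.
true_negatives = n_total - n_blocking
true_positives = 0

accuracy = (true_positives + true_negatives) / n_total
recall_blocking = true_positives / n_blocking
specificity = true_negatives / (n_total - n_blocking)

# Balanced accuracy is the mean of the two per-class recalls.
balanced_accuracy = (recall_blocking + specificity) / 2

print(f"accuracy          = {accuracy:.6f}")
print(f"recall (blocking) = {recall_blocking:.2f}")
print(f"balanced accuracy = {balanced_accuracy:.2f}")
```

The accuracy is very high only because blocking points are rare; the recall of zero is what shows that the baseline never detects a blocking event.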


Parameters that were varied while building the model:
* test
* balance
* unbalanced_percentage
* test_size, used in X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True). Values tried: 0.2, 0.1, 0.05
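
To get a feeling for what the three test_size values mean for a rare class, the sketch below splits a synthetic data set with 0.4 % positives and counts how many positives land in the test set. The data are random, and the `stratify=y` and `random_state=42` arguments are assumptions added here for reproducibility; the notebook's own call does not use them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 10_000
X = rng.normal(size=(n_samples, 3))
y = np.zeros(n_samples, dtype=int)
y[:40] = 1  # 0.4 % positives, similar in spirit to the imbalance in this notebook


def positives_in_test(test_size):
    """Return (number of positives, number of rows) in the test part of a stratified split."""
    _, _, _, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=True, stratify=y, random_state=42
    )
    return int(y_test.sum()), len(y_test)


for test_size in (0.2, 0.1, 0.05):
    n_pos, n_rows = positives_in_test(test_size)
    print(f"test_size={test_size}: {n_pos} positives in {n_rows} test rows")
```

With stratification the positive ratio is the same in every split, so a smaller test_size mainly reduces the absolute number of blocking points available to estimate recall and precision.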

In [7]:
my_dict = {}
test = False
balance = True
model = 'XGB'  # Must be 'RF', 'XGB', 'LGBM', 'VOTE' or 'META'
file_name = 'xgb_model.pkl'
unbalanced_method = 'adasyn'
unbalanced_percentage = 0.1
test_size = 0.1

# Lag values
#lags = [10, 30, 40]
lags = [lag]


# Iterate through each lag and create lagged features
for lag in lags:
    print(f"\nTraining model for lag {lag}")

    # Create lagged features for each column except the target column
    for column in combined_df.columns:
        if column != 'Clotting_2':  # Skip the target column
            combined_df[f'{column}_lag_{lag}'] = combined_df[column].shift(lag)
    
    # Remove rows with NaN values (due to shifting)
    df_lagged = combined_df.dropna()

    # Prepare features (X) and target (y)
    # Drop original columns and only use lagged features
    lagged_columns = [col for col in df_lagged.columns if 'lag_' in col]
    X = df_lagged[lagged_columns]
    y = df_lagged['Clotting_2']  # Target variable
    
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=True)

    if test is True:
        # Use a smaller sample for testing purposes
        X_train = X_train.sample(frac=0.01, random_state=42)  # Use only frac% of the data
        y_train = y_train.loc[X_train.index]
        X_test = X_test.sample(frac=0.01, random_state=42)
        y_test = y_test.loc[X_test.index]
        print('**** Running test ****')
    else:
        print('**** This in NOT a test ****')
    
    # The scaler is saved in path_to_save
    X_train_scaled, X_test_scaled = normaliztion(X_train, X_test, path_to_save)

    if balance is True:
        print(f'Balance percentage {unbalanced_percentage}')
        method = unbalanced(unbalanced_method, unbalanced_percentage)
        X_train_resampled, y_train_resampled = method.fit_resample(X_train_scaled, y_train)
        # Check class distribution after resampling
        print(f"Original class distribution: {Counter(y_train)}")
        print(f"Resampled class distribution: {Counter(y_train_resampled)}")
    else:
        #X_train_resampled =  X_train_scaled
        #y_train_resampled = y_train
        X_train_resampled, y_train_resampled = X_train_scaled, y_train
        print(f"Original class distribution: {Counter(y_train)}")
        

    

    # Map each model name to its training function from models.py
    trainers = {'XGB': XGBmodel, 'RF': RFmodel, 'LGBM': LGBMmodel, 'VOTE': Voting, 'META': Meta}
    if model not in trainers:
        raise ValueError(f"Unknown model '{model}', expected one of {list(trainers)}")

    # Train and save the model. The fitted estimator gets its own name so that
    # the string in `model` is still valid on the next lag iteration.
    trained_model = trainers[model](X_train_resampled, y_train_resampled)
    write_joblib(trained_model, path_to_save, file_name)

    y_pred = trained_model.predict(X_test_scaled)
    
    my_dict[f'y_pred_{lag}'] = y_pred
    my_dict[f'y_test_{lag}'] = y_test.tolist()
    
    
    
    # ROC-AUC Score
    roc_auc = roc_auc_score(y_test, y_pred)
    
    # Confusion Matrix
    conf_matrix = confusion_matrix(y_test, y_pred)
    
    # Classification Report
    class_report = classification_report(y_test, y_pred)
    
    # Print the metrics
    print(f"ROC-AUC Score: {roc_auc}")
    print(f"Confusion Matrix:\n{conf_matrix}")
    print(f"Classification Report:\n{class_report}")
    print(f'Total number of blocking points {y_test.sum()}')
    print(f'Length of the test vector {len(y_test)}')
    print(f'Accuracy baseline model {(len(y_test)-y_test.sum())/len(y_test)}')

    # Clean up the DataFrame to prepare for the next lag
    # (by dropping the lagged columns created for the current lag)
    combined_df = combined_df.drop(columns=[col for col in combined_df.columns if f'lag_{lag}' in col])


Training model for lag 40
**** This in NOT a test ****
Balance percentage 0.1
Original class distribution: Counter({0: 1144655, 1: 4384})
Resampled class distribution: Counter({0: 1144655, 1: 114278})
Fitting 3 folds for each of 50 candidates, totalling 150 fits
Best parameters found: {'alpha': 0.6454722959071678, 'colsample_bytree': 0.6708442717628196, 'lambda': 0.9404585843529143, 'learning_rate': 0.2961785731007762, 'max_delta_step': 0, 'max_depth': 8, 'min_child_weight': 3, 'n_estimators': 479, 'scale_pos_weight': 37, 'subsample': 0.9713274250350902}
ROC-AUC Score: 0.9680844632203867
Confusion Matrix:
[[126706    482]
 [    29    454]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    127188
           1       0.49      0.94      0.64       483

    accuracy                           1.00    127671
   macro avg       0.74      0.97      0.82    127671
weighted avg       1.00      1.00      1.00    127671

T
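
The numbers in the classification report and the ROC-AUC score can be re-derived from the confusion matrix alone. The sketch below does this arithmetic; it only uses the four counts printed above.

```python
# Confusion matrix from the output above:
# rows are true classes, columns are predicted classes.
tn, fp = 126_706, 482
fn, tp = 29, 454

precision = tp / (tp + fp)
recall = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)   # true negative rate
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions the ROC curve has a single operating point,
# so the area under it reduces to the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"auc_from_labels={auc_from_labels:.4f}")
```

The last value agrees with the reported ROC-AUC, which indicates that the score was computed from predicted labels. A threshold-independent ROC-AUC would need class scores instead, for example from `predict_proba`, assuming the saved model exposes that method.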

<div class="alert alert-block alert-info"> <b>NOTE</b> Evaluate how the data looks after the train/test split </div>
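
One possible way to carry out this check is to compare the class ratio and the feature means on both sides of the split. The helper `split_summary` below is hypothetical and is shown on random toy data with made-up column names; in the notebook it would be called with the real X_train, X_test, y_train and y_test.

```python
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_summary(X_train, X_test, y_train, y_test):
    """Return row counts, positive ratios and per-feature means for train and test."""
    rows = {}
    for name, X_part, y_part in (("train", X_train, y_train), ("test", X_test, y_test)):
        counts = Counter(y_part)
        row = {"n_rows": len(y_part), "positive_ratio": counts[1] / len(y_part)}
        row.update(X_part.mean().add_prefix("mean_").to_dict())
        rows[name] = row
    return pd.DataFrame(rows).T


# Toy data standing in for the lagged features and the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 2)), columns=["P_Filter_lag_40", "TMP_lag_40"])
y = pd.Series((rng.random(1000) < 0.05).astype(int), name="Clotting_2")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True, random_state=42
)
summary = split_summary(X_train, X_test, y_train, y_test)
print(summary)
```

Large differences between the two rows would suggest that the test set is not representative of the training data.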