<a href="https://colab.research.google.com/github/MatheusMRech/ETL-Dataiku-DSS/blob/master/MATLAB_Major_Adverse_Cardiovascular_Events_%2B_Machine_Learning_Algorithm_PARTE_2_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Predicting Major Adverse Cardiovascular Events after Liver Transplantation Using Machine Learning*

**The methods section is described below the code**

In [None]:

% Data used: https://www.dropbox.com/s/n1g65lfzvz6d48a/mace_new.csv?dl=0


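% Required toolboxes. MATLAB loads installed toolboxes automatically, so no
% `pkg load` calls are needed (that syntax is Octave-only):
%   Statistics and Machine Learning Toolbox (fitcensemble, fitglm, cvpartition)
%   Bioinformatics Toolbox (knnimpute)
%   Deep Learning Toolbox (dividerand)
% SMOTE is not part of any toolbox; a user-supplied smote function is assumed.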

% Load your patient data as a structure array
response = jsondecode(fileread('path_to_your_json_file'));
df = struct2table(response.mace);

% Select the desired columns
columns = {'age', 'race', 'sex', 'weight', 'height', 'leve', 'ascitis', 'pbe', 'shp', 'ppl', 'portalveintromb', 'eps', 'hda_ve', ...
    'shr', 'atb_24h', 'inpatient_48h', 'hd_pre', 'chc', 'hb', 'ht', 'leuc', 'plaq', 'bt', 'bd', 'cr', 'ur', 'tp', 'rni', ...
    'na', 'k', 'alb', 'tgo', 'tgp', 'ggt', 'fa', 'afp', 'group', 'rh', 'glic', 'smoker', 'icc', 'angioplast', 'dlp', 'hf', ...
    'hypertension', 'iam', 'stroke', 'dm', 'fe', 'ae', 'ved', 'ves', 'trocavalvar', 'no_invasive_method', 'dinamic_alteration'};

% Extract the target first: 'mace' is not among the selected feature columns
y = df.mace;
df = df(:, columns);

% Convert the categorical variables to numerical if necessary
% df = dummyvar(df, 'sex', 'race', ...);

% All remaining columns are features
X = df;

% Impute missing values using kNN
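% knnimpute expects a numeric matrix and borrows values from the nearest-neighbour
% columns, so convert the table and transpose to impute across patients (rows)
X = knnimpute(table2array(X)', 10)';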

% Split the data into training (60%), validation (20%), and testing (20%) sets
[trainInd, valInd, testInd] = dividerand(height(df), 0.6, 0.2, 0.2);
X_train = X(trainInd, :);
y_train = y(trainInd);
X_val = X(valInd, :);
y_val = y(valInd);
X_test = X(testInd, :);
y_test = y(testInd);

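% Handle class imbalance using SMOTE (training set only).
% smote is not a built-in MATLAB function; a user-supplied implementation with
% this signature is assumed.
[X_train_resampled, y_train_resampled] = smote(X_train, y_train, 1);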

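% Build and train the boosted-tree model. MATLAB has no native XGBoost, so
% fitcensemble stands in for it here. AdaBoostM1 is the binary-class method
% (AdaBoostM2 is intended for three or more classes).
xgb = fitcensemble(X_train_resampled, y_train_resampled, 'Method', 'AdaBoostM1', ...
    'Learners', templateTree('MaxNumSplits', 5), 'NumLearningCycles', 100);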

% Build and train the logistic regression model
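% The iteration limit is passed through 'Options' rather than as a direct argument
glm = fitglm(X_train_resampled, y_train_resampled, 'Distribution', 'binomial', 'Link', 'logit', ...
    'Options', statset('MaxIter', 1000));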

% Perform 5-fold cross-validation to find the best ensemble weights
n_splits = 5;
cv = cvpartition(y_train_resampled, 'KFold', n_splits);
best_weights = [0, 0];
best_auc = 0;

for weight_xgb = linspace(0, 1, 11)
    weight_glm = 1 - weight_xgb;
    auc_sum = 0;
    
    for k = 1:n_splits
        train_idx = training(cv, k);
        val_idx = test(cv, k);
        
        X_train_k = X_train_resampled(train_idx, :);
        y_train_k = y_train_resampled(train_idx);
        X



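# Python version of the ensemble-weight search. The setup mirrors the MATLAB
# loop above. Assumptions: xgb and glm are scikit-learn-compatible classifiers
# (for example xgboost.XGBClassifier and sklearn.linear_model.LogisticRegression)
# and the X_* / y_* variables are NumPy arrays.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (roc_auc_score, roc_curve, accuracy_score,
                             precision_score, recall_score, f1_score)

n_splits = 5
skf = StratifiedKFold(n_splits=n_splits)
best_weights = (0.0, 1.0)
best_auc = 0.0

for weight_xgb in np.linspace(0, 1, 11):
    weight_glm = 1 - weight_xgb
    auc_sum = 0.0

    for train_index, val_index in skf.split(X_train_resampled, y_train_resampled):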
        X_train_cv, X_val_cv = X_train_resampled[train_index], X_train_resampled[val_index]
        y_train_cv, y_val_cv = y_train_resampled[train_index], y_train_resampled[val_index]

        xgb.fit(X_train_cv, y_train_cv)
        glm.fit(X_train_cv, y_train_cv)

        y_val_pred_xgb = xgb.predict_proba(X_val_cv)[:, 1]
        y_val_pred_glm = glm.predict_proba(X_val_cv)[:, 1]

        y_val_pred_ensemble = weight_xgb * y_val_pred_xgb + weight_glm * y_val_pred_glm
        auc_sum += roc_auc_score(y_val_cv, y_val_pred_ensemble)

    auc_avg = auc_sum / n_splits
    if auc_avg > best_auc:
        best_auc = auc_avg
        best_weights = (weight_xgb, weight_glm)

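# Apply the best ensemble weights. Refit both models on the full resampled
# training set first: after the loop they are fitted on the last CV fold only.
weight_xgb, weight_glm = best_weights
xgb.fit(X_train_resampled, y_train_resampled)
glm.fit(X_train_resampled, y_train_resampled)
y_val_pred_xgb = xgb.predict_proba(X_val)[:, 1]
y_val_pred_glm = glm.predict_proba(X_val)[:, 1]
y_val_pred_ensemble = weight_xgb * y_val_pred_xgb + weight_glm * y_val_pred_glm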

# Evaluate the performance of the ensemble model
auc_val = roc_auc_score(y_val, y_val_pred_ensemble)
fpr_val, tpr_val, _ = roc_curve(y_val, y_val_pred_ensemble)

plt.figure()
plt.plot(fpr_val, tpr_val, label='ROC curve (area = %0.2f)' % auc_val)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve - Validation Set')
plt.legend(loc="lower right")
plt.show()

# Apply the ensemble model to the test set
y_test_pred_xgb = xgb.predict_proba(X_test)[:, 1]
y_test_pred_glm = glm.predict_proba(X_test)[:, 1]
y_test_pred_ensemble = weight_xgb * y_test_pred_xgb + weight_glm * y_test_pred_glm

# Calculate performance metrics on the test set
auc_test = roc_auc_score(y_test, y_test_pred_ensemble)
accuracy = accuracy_score(y_test, y_test_pred_ensemble.round())
precision = precision_score(y_test, y_test_pred_ensemble.round())
recall = recall_score(y_test, y_test_pred_ensemble.round())
f1 = f1_score(y_test, y_test_pred_ensemble.round())

print("Test set metrics:")
print("AUC: ", auc_test)
print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1-score: ", f1)

#MODEL INTERPRETABILITY - VERY IMPORTANT

LIME (Local Interpretable Model-agnostic Explanations) can provide model interpretability for an XGBoost model. It explains the predictions of any classifier or regression model by approximating it locally with an interpretable model.

The implementation provided by the authors of the LIME paper is a Python package, not a MATLAB toolbox. It is available on GitHub:

https://github.com/marcotcr/lime/tree/master/lime

Because it is Python code, it cannot simply be added to the MATLAB path. The MATLAB snippets below assume a hypothetical `LimeTabularExplainer` wrapper that mirrors the Python API. MATLAB's Statistics and Machine Learning Toolbox also provides its own `lime` function.

Here's an example of how LIME would be used with your XGBoost model:

First, you need to define a prediction function that takes in a data matrix and outputs the predicted probabilities for each instance. For your XGBoost model, the function should look like this:



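function pred_probs = xgb_predict_fn(X)
    % Load the saved boosted-tree model
    load('xgb_model.mat', 'xgb');

    % predict returns the labels and one score column per class;
    % keep the column for the second (positive) class.
    % Note: AdaBoost scores are not calibrated probabilities.
    [~, scores] = predict(xgb, X);
    pred_probs = scores(:, 2);
end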

Next, you need to instantiate a LimeTabularExplainer object and use it to explain individual predictions. Here's an example:

% Import LIME
import lime.*;

% Choose an instance for which you want to explain the prediction
instance_idx = 1;
instance = X_test(instance_idx, :);

% Create a LimeTabularExplainer object
explainer = LimeTabularExplainer(X_train, 'PredictFcn', @xgb_predict_fn);

% Explain the prediction for the chosen instance
num_features = 5; % Number of features to display in the explanation
explanation = explainer.explain_instance(instance, 'NumFeatures', num_features);

% Display the explanation
disp(explanation);





**MATERIALS AND METHODS**

3. Methodology
In this study, we aim to predict the occurrence of major adverse cardiovascular events (MACE) in patients after liver transplantation using clinical and demographic features. We developed a machine learning pipeline using supervised learning techniques, including logistic regression and gradient-boosted decision trees (XGBoost), in the Python programming language.

3.1. Data preprocessing
We first loaded the dataset and selected relevant features. The dataset was preprocessed by converting categorical variables to numerical values and imputing missing values using k-nearest neighbors (kNN) imputation. The dataset was then divided into three subsets: 60% for training, 20% for validation, and 20% for testing.
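The kNN imputation step can be sketched as follows. This is a minimal, dependency-free illustration rather than the pipeline's actual code: the function name `knn_impute`, the toy values, and the choice to average the donors' values are assumptions, and in practice a library routine would normally be used.

```python
import math

def knn_impute(rows, k=10):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distance is Euclidean over the columns both rows have observed,
    averaged over the number of shared columns.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that have column j observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Toy data (hypothetical): age, creatinine, sodium; None marks a missing value.
patients = [
    [54.0, 1.0, 138.0],
    [55.0, None, 139.0],
    [70.0, 3.0, 130.0],
    [56.0, 1.2, 137.0],
]
filled = knn_impute(patients, k=2)
print(filled[1])
```

The missing creatinine value is filled from the two most similar patients. Features on very different scales would usually be standardised first so that no single column dominates the distance.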

3.2. Handling class imbalance
Due to the imbalance in the MACE distribution, we utilized the Synthetic Minority Over-sampling Technique (SMOTE) to balance the classes in the training data.
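The idea behind SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbours. The sketch below is a simplified illustration under assumed names (`smote`, `n_new`, `k`) and toy data; it is not the implementation used in the study.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base point within the minority class.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class samples with two features each.
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_points = smote(minority, n_new=6, k=2)
print(len(new_points))
```

Every synthetic point lies on a segment between two real minority samples, which is why resampling is applied to the training data only: validation and test sets should contain real patients.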

3.3. Model training
Two models, logistic regression and XGBoost, were trained using the preprocessed and resampled training data. The XGBoost hyperparameters were tuned to optimize the model's performance.

3.4. Ensemble model
A 5-fold cross-validation was performed to find the best combination of ensemble weights for logistic regression and XGBoost models, aiming to improve the overall performance. The ensemble model was created by combining the predictions of the two models with the determined weights.
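The weight search can be illustrated with a small sketch. It assumes the per-fold validation probabilities of both models are already available; the function names, the 11-point grid helper, and the toy fold data are illustrative assumptions, and AUC is computed directly from its pairwise definition to keep the example free of dependencies.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative (ties count one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_ensemble_weight(folds, n_steps=11):
    """folds: list of (y_true, p_xgb, p_glm) tuples, one per validation fold.
    Returns (weight_xgb, weight_glm, mean_auc) for the best grid point."""
    best = (0.0, 1.0, -1.0)
    for step in range(n_steps):
        w_xgb = step / (n_steps - 1)
        w_glm = 1.0 - w_xgb
        aucs = []
        for y_true, p_xgb, p_glm in folds:
            blended = [w_xgb * a + w_glm * b for a, b in zip(p_xgb, p_glm)]
            aucs.append(roc_auc(y_true, blended))
        mean_auc = sum(aucs) / len(aucs)
        if mean_auc > best[2]:
            best = (w_xgb, w_glm, mean_auc)
    return best

# Two hypothetical folds in which each model alone misranks one pair.
folds = [
    ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1], [0.6, 0.7, 0.2, 0.65]),
    ([0, 1, 0, 1], [0.2, 0.8, 0.6, 0.5], [0.5, 0.6, 0.1, 0.7]),
]
w_xgb, w_glm, mean_auc = best_ensemble_weight(folds)
print(w_xgb, w_glm, mean_auc)
```

In this toy data each model reaches an AUC of 0.75 on its own, while a blend ranks every positive above every negative, which is the situation in which weighting two models pays off.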

3.5. Model evaluation
The performance of the ensemble model was evaluated using various metrics such as ROC-AUC, accuracy, precision, recall, and F1-score. The model's performance was assessed on both the validation and test datasets to estimate its generalization to unseen data.
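As a worked illustration of the threshold-based metrics, the sketch below derives accuracy, precision, recall, and F1-score from predicted probabilities. The function name, the 0.5 cut-off, and the toy labels are assumptions for the example; ROC-AUC is threshold-free and is not covered here.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall and F1 after thresholding probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical test labels and ensemble probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]
metrics = classification_metrics(y_true, y_prob)
print(metrics)
```

With an imbalanced outcome such as MACE, accuracy alone can look good for a model that rarely predicts an event, so precision and recall for the positive class are the more informative figures.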

3.6. Feature importance and model interpretability
To gain insights into the most important features contributing to the predictions, we analyzed feature importance in the XGBoost model. Additionally, we investigated the relationship between the features and the predicted outcomes using partial dependence plots.
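A partial dependence curve averages the model's prediction over the dataset while one feature is forced to each value on a grid. The sketch below shows the computation with a hypothetical stand-in model (`toy_model`) and made-up rows; it illustrates the technique only and is not the study's code.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value
    while every other column keeps its observed value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

# Hypothetical stand-in for a fitted model: predicted risk rises with
# feature 0 (for example age) and with feature 1 (a binary comorbidity flag).
def toy_model(row):
    return min(1.0, 0.01 * row[0] + 0.1 * row[1])

rows = [[50, 0], [60, 1], [70, 0]]
curve = partial_dependence(toy_model, rows, feature=0, grid=[40, 60, 80])
print(curve)
```

Plotting `curve` against the grid gives the partial dependence plot for that feature. The curve describes the model's average behaviour, not a causal effect.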

In summary, we developed a machine learning pipeline to predict MACE in patients using logistic regression and XGBoost models. The models were trained using a balanced training dataset and combined through an ensemble approach. The performance of the ensemble model was evaluated using various metrics, and feature importance analysis was conducted to understand the main drivers of the predictions.

Cross-validation and evaluation of the XGBoost and Logistic Regression models FOR OUT-OF-SAMPLE ESTIMATION

1. `StratifiedKFold`: The `StratifiedKFold` object is used to create a stratified cross-validation. Stratified cross-validation ensures that the class proportions are maintained in each train and validation set.

2. Validating the models and finding the best weight combination: The code uses a loop to test different weight combinations to create an ensemble model from the XGBoost and Logistic Regression models. The AUC (Area under the ROC curve) metric is calculated for each weight combination, and then the combination with the highest average AUC across the cross-validation splits is selected as the best.

3. Applying the best weights: After finding the best weight combination, it is applied to create an ensemble model of the XGBoost and Logistic Regression models. The predicted probabilities from both models are combined using the best weights found.

4. Evaluation of the ensemble model on the validation set: The AUC, ROC curve, and plot are calculated and displayed for the validation set.

5. Applying the ensemble model on the test set: The final ensemble model is applied to the test set, and the predicted probabilities are calculated.

6. Performance metrics on the test set: The AUC, accuracy, precision, recall, and F1 score are calculated for the test set, and the results are displayed.

After adding this part of the code, you will have a complete pipeline that covers data cleaning and preparation, modeling, cross-validation, and model evaluation.
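To make the stratification in step 1 concrete, the following dependency-free sketch builds folds that keep the class proportions of the full label vector. The function name and the 10-positive, 40-negative toy labels are assumptions; scikit-learn's `StratifiedKFold` is what the pipeline itself relies on.

```python
import random

def stratified_kfold(y, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs whose class proportions match y."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, val_idx

# Hypothetical labels: 10 events among 50 patients (20% positive).
y = [1] * 10 + [0] * 40
splits = list(stratified_kfold(y, n_splits=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx), sum(y[i] for i in val_idx))
```

Each validation fold contains the same share of events as the full set, so the AUC computed on a fold is never undefined for lack of positive cases.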

___________________________________________**ANOTHER VERSION**_____________________________

**Here is a complete MATLAB script that combines all of the previous snippets and performs the necessary steps for your project. The code first preprocesses the data, handles class imbalance, trains a boosted-tree model (called XGBoost here, although the code uses `fitcensemble` with AdaBoostM1), and then uses LIME for model interpretability. The description for incorporating this in your methods section is provided below the code.**

In [None]:
% Load the data
% Assuming that the data is in a CSV file called 'patient_data.csv'
data = readtable('patient_data.csv');

% Preprocess the data
% Perform necessary preprocessing steps, such as converting categorical variables to numerical, scaling, and imputing missing values

% Split the data into features (X) and target (y)
X = data(:, 1:end-1);
y = data.mace;

% Split the data into training (60%), validation (20%), and testing (20%) sets
rng(42); % Set random seed for reproducibility
cv = cvpartition(y, 'Holdout', 0.4);
X_train = X(training(cv), :);
y_train = y(training(cv), :);
X_temp = X(test(cv), :);
y_temp = y(test(cv), :);

cv_temp = cvpartition(y_temp, 'Holdout', 0.5);
X_val = X_temp(training(cv_temp), :);
y_val = y_temp(training(cv_temp), :);
X_test = X_temp(test(cv_temp), :);
y_test = y_temp(test(cv_temp), :);

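% Handle class imbalance using SMOTE.
% smote is not a built-in MATLAB function; a user-supplied implementation that
% returns a struct with these fields is assumed. The result is stored under a
% different name so the variable does not shadow the function.
resampled = smote(X_train, y_train, 'Verbose', true);
X_train_resampled = resampled.X_resampled;
y_train_resampled = resampled.Y_resampled;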

% Train the boosted-tree model (AdaBoostM1 via fitcensemble; MATLAB has no native XGBoost)
xgb = fitcensemble(X_train_resampled, y_train_resampled, 'Method', 'AdaBoostM1', 'Learners', templateTree('MaxNumSplits', 5));

% Save the XGBoost model
save('xgb_model.mat', 'xgb');

% Model interpretability with LIME
import lime.*;

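% Define a prediction function for the boosted-tree model.
% In older MATLAB releases, local functions in a script must be placed at the end of the file.
function pred_probs = xgb_predict_fn(X)
    load('xgb_model.mat', 'xgb');
    % predict returns the labels and one score column per class;
    % keep the column for the second (positive) class
    [~, scores] = predict(xgb, X);
    pred_probs = scores(:, 2);
end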

% Choose an instance for which you want to explain the prediction
instance_idx = 1;
instance = X_test(instance_idx, :);

% Create a LimeTabularExplainer object
explainer = LimeTabularExplainer(X_train, 'PredictFcn', @xgb_predict_fn);

% Explain the prediction for the chosen instance
num_features = 5;
explanation = explainer.explain_instance(instance, 'NumFeatures', num_features);

% Display the explanation
disp(explanation);



For your methods section, you can include the following description:

In our study, we used MATLAB to preprocess the data, handle class imbalance using SMOTE, train a boosted decision-tree model, and analyse model interpretability with LIME. We first loaded the data and preprocessed it by converting categorical variables to numerical values, scaling, and imputing missing values. The data were then split into training (60%), validation (20%), and testing (20%) sets.

To handle class imbalance, we applied the Synthetic Minority Over-sampling Technique (SMOTE) to the training set. We trained the boosted model (referred to as XGBoost in the code) using the AdaBoostM1 method with decision trees as weak learners. Each tree was limited to at most five splits.

For model interpretability, we used LIME (Local Interpretable Model-agnostic Explanations), a method that explains the predictions of any classifier by approximating it locally with an interpretable model. We defined a prediction function for our trained model and created a LimeTabularExplainer object using the training data and our prediction function. We then selected an instance from the test set and used the `explain_instance` method to generate an explanation for the model's prediction on that instance. The explanation highlights the top `num_features` features contributing to the prediction, providing insight into the local behavior of the model.

In addition to LIME, other interpretability methods such as SHAP (SHapley Additive exPlanations) and partial dependence plots (PDP) can be used. These methods describe the global behavior of the model, feature importances, and the marginal effect of features on the predicted outcomes. Combining them gives a more complete picture of the model and its predictions and supports trust in its outputs.
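The local-approximation idea behind LIME can be sketched without any library: perturb the instance, weight each perturbed sample by its proximity to the instance, and fit a weighted linear model to the black-box predictions. Everything below (function names, Gaussian perturbation, exponential kernel, the linear `black_box`) is an illustrative assumption and not the API of the `lime` package.

```python
import math
import random

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def explain_instance(predict, instance, n_samples=500, scale=1.0,
                     kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around `instance`.
    Returns [intercept, coef_1, ..., coef_d]."""
    rng = random.Random(seed)
    d = len(instance)
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xtwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        # Perturb the instance with Gaussian noise.
        sample = [v + rng.gauss(0.0, scale) for v in instance]
        dist2 = sum((s - v) ** 2 for s, v in zip(sample, instance))
        weight = math.exp(-dist2 / kernel_width ** 2)  # closer samples count more
        features = [1.0] + sample
        target = predict(sample)
        # Accumulate the weighted normal equations X'WX and X'Wy.
        for i in range(d + 1):
            xtwy[i] += weight * features[i] * target
            for j in range(d + 1):
                xtwx[i][j] += weight * features[i] * features[j]
    return solve_linear_system(xtwx, xtwy)

# Hypothetical black box that happens to be linear, so the surrogate
# should recover its coefficients.
def black_box(x):
    return 1.0 + 2.0 * x[0] - 3.0 * x[1]

coefs = explain_instance(black_box, [0.5, -1.0])
print([round(c, 6) for c in coefs])
```

For a real classifier the surrogate coefficients only describe the model near the chosen instance. Their signs and magnitudes are what a LIME explanation reports as the local feature contributions.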