In [1]:
import numpy as np
from sklearn.metrics import roc_curve, auc
from joblib import load
import pandas as pd

One of the outputs of the machine learning pipeline is a .joblib file containing the trained classifier. It is a serialized copy of the fitted model, typically including its hyperparameters and any feature transformations bundled with it, which makes the model easy to store, share, and reuse within the pipeline or across applications.

The line below shows how simple it is to load the model.

In [2]:
clf = load('/home/sergiov/PycharmProjects/ICB_Response_Model/scripts/model.joblib')
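
To make the save and load round trip concrete, here is a minimal sketch. It assumes a toy LogisticRegression and a temporary file (toy_clf, toy_model.joblib, X_toy and y_toy are all hypothetical stand-ins), not the classifier or path used in this notebook.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# Toy data and model: stand-ins for the real training set and classifier.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
toy_clf = LogisticRegression().fit(X_toy, y_toy)

# Serialize the fitted model to disk, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_model.joblib"  # hypothetical file name
    dump(toy_clf, path)
    restored = load(path)

# The restored object is a fitted estimator that predicts like the original.
same_predictions = np.array_equal(toy_clf.predict(X_toy), restored.predict(X_toy))
print(type(restored).__name__, same_predictions)
```

Because joblib files are built on pickle, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.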

After loading the model, we need to ensure that the dataset aligns with the model. This alignment ensures that the input features are in the same format and undergo the same transformations as the training data. These preprocessing steps may include:

Feature Scaling/Normalization: Ensuring that numerical features are scaled to the same range as during training (e.g., using Min-Max scaling or z-score normalization).

Handling Missing Values: Imputing missing values using the same strategy as during training, or ensuring that the dataset has no missing values if the model does not tolerate them.

Encoding Categorical Variables: Converting categorical variables into a format compatible with the model, such as one-hot encoding or label encoding.

Feature Selection: If feature selection was performed during training, ensuring that only the selected features are present in the dataset.

Aligning the dataset with the model in this way ensures that the input matches what the model expects. Without it, the model may reject the input or, worse, return predictions that look valid but are not.
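
One common way to keep these steps aligned is to bundle them with the classifier in a scikit-learn Pipeline, so the same imputation and scaling learned at training time are reapplied at prediction time. The sketch below assumes this pattern for illustration only; whether the model loaded above was built this way is not stated, and the column names feat_a and feat_b, the data, and the chosen steps are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data with missing values in both columns.
train = pd.DataFrame({
    "feat_a": [0.1, 1.2, np.nan, 2.3, 3.1, 0.4],
    "feat_b": [10.0, 12.0, 9.0, np.nan, 15.0, 8.0],
})
y_train = np.array([0, 0, 0, 1, 1, 0])

# Imputation and scaling are fitted together with the classifier,
# so their learned statistics travel with the model.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(train, y_train)

# New data goes through the same fitted transformations automatically.
new = pd.DataFrame({"feat_a": [np.nan, 2.0], "feat_b": [11.0, np.nan]})
proba = pipe.predict_proba(new)[:, 1]
print(proba)
```

With this design, the caller only needs to supply the raw columns in the expected layout; the preprocessing itself does not have to be repeated by hand.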

In [3]:
model_7_variables = ["TMB_zscore", "CCND1", "PD1.zscore", "PDL1.zscore", "HLA-I.GSVA", "IFNg_Ayers.GSVA",
                          "Stroma_EMT.GSVA", "T_cell_inflamed.GSVA", "TGF_beta.GSVA", "Macrophages M1",
                          "T cells CD4 memory activated", "T cells CD8", "T cells regulatory (Tregs)", "APM_8.GSVA",
                          "t.spec.lncRNA.GSVA"]

data = pd.read_csv('/datasets/sergio/Integrated_data/df_WES+RNA_response.csv')

# Keep only the variables used during training and drop rows with missing values
data = data[model_7_variables + ['Response']].dropna()
X = data[model_7_variables]

# Convert R/NR to binary
y_true = np.where(data['Response'] == 'R', 1, 0)
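
The cell above assumes that every required column exists and that Response only contains 'R' or 'NR'; note that np.where maps any value other than 'R' to 0, including typos. A minimal sketch of explicit checks follows, using a hypothetical toy frame (feat_a, feat_b and the variable names are illustrative, not the real dataset).

```python
import numpy as np
import pandas as pd

required = ["feat_a", "feat_b"]  # hypothetical stand-in for the model's variable list
toy = pd.DataFrame({
    "feat_a": [0.5, np.nan, 1.5, 2.0],
    "feat_b": [1.0, 2.0, 3.0, 4.0],
    "Response": ["R", "NR", "NR", "R"],
    "extra": [1, 2, 3, 4],
})

# Fail early with a clear message if an expected column is absent.
missing_cols = [c for c in required + ["Response"] if c not in toy.columns]
if missing_cols:
    raise KeyError(f"Missing columns: {missing_cols}")

# Drop incomplete rows and record how many were lost.
subset = toy[required + ["Response"]]
clean = subset.dropna()
n_dropped = len(subset) - len(clean)

# Guard against labels that would be silently encoded as 0.
unexpected = set(clean["Response"]) - {"R", "NR"}
if unexpected:
    raise ValueError(f"Unexpected Response labels: {unexpected}")

y = (clean["Response"] == "R").astype(int).to_numpy()
print(n_dropped, y.tolist())
```

Reporting the number of dropped rows also makes it clear how much of the dataset the evaluation actually covers.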

Once the dataset is in the format the model expects, we can use the model to predict on it and evaluate the predictions against the known responses.

In [4]:
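# Column 1 of predict_proba is the probability of the second class in clf.classes_,
# assumed here to be the responder class (label 1).
y_pred = clf.predict_proba(X)[:, 1]

# Build the ROC curve from the predicted probabilities and compute the area under it
fpr, tpr, _ = roc_curve(y_true, y_pred)
roc_auc = auc(fpr, tpr)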

In [5]:
roc_auc

0.8188735573597409
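
To illustrate what this number measures, here is a small sketch on four hypothetical samples (the labels and scores are made up and unrelated to the dataset above). It computes the AUC from the ROC curve, as in the cell above, and checks it against roc_auc_score, which gives the same value in one call.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities for four samples.
y_true_toy = np.array([0, 0, 1, 1])
y_score_toy = np.array([0.1, 0.4, 0.35, 0.8])

# Route 1: build the ROC curve, then integrate it.
fpr_toy, tpr_toy, _ = roc_curve(y_true_toy, y_score_toy)
auc_from_curve = auc(fpr_toy, tpr_toy)

# Route 2: compute the area directly from labels and scores.
auc_direct = roc_auc_score(y_true_toy, y_score_toy)

print(auc_from_curve, auc_direct)  # both 0.75
```

The AUC is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. Here three of the four positive/negative pairs are ranked correctly, hence 0.75.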