# Make predictions using the classifiers trained with TCGA data


The *PredictThroughClassifierModel.ipynb* notebook accepts tumor RNA-seq data and classifies each patient into a high- or low-risk group using the CL-XGBoost classifiers trained with TCGA data. The trained models are in the Box folder CL4CaPro/CL4CaPro_Models (https://miami.box.com/s/ylmvqynbtchx5xhof0quaeu9w62mxaca). Place this notebook in the directory that contains the CL4CaPro_Models folder, because the cells below load the models from './CL4CaPro_Models/Classifier Models/'.


### Select a cancer type
To make predictions with a model trained on data from a single cancer type, assign the variable Cancer one of the following names: BLCA, BRCA, CESC, COAD, HNSC, KIRC, KIRP, LGG, LIHC, LUAD, LUSC, OV, PRAD, SARC, SKCM, STAD, THCA, or UCEC.

To make predictions with a model trained on data from a group of cancer types, assign the variable Cancer one of the following names: CCPRC, GG, or SM.

In [2]:
Cancer = 'BLCA'
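
A misspelled name would otherwise surface only when the model folder is looked up, so it can help to validate Cancer right after setting it. The following is a minimal sketch: the names are copied from the two lists above, while the helper `check_cancer_name` and the two set names are hypothetical and not part of the notebook.

```python
# Names copied from the two lists above. The helper and the set names are
# illustrative assumptions, not part of the original notebook.
SINGLE_CANCER_MODELS = {
    "BLCA", "BRCA", "CESC", "COAD", "HNSC", "KIRC", "KIRP", "LGG", "LIHC",
    "LUAD", "LUSC", "OV", "PRAD", "SARC", "SKCM", "STAD", "THCA", "UCEC",
}
GROUP_MODELS = {"CCPRC", "GG", "SM"}


def check_cancer_name(name):
    """Return 'single' or 'group' for a supported name; raise ValueError otherwise."""
    if name in SINGLE_CANCER_MODELS:
        return "single"
    if name in GROUP_MODELS:
        return "group"
    raise ValueError("Unsupported value for Cancer: {!r}".format(name))


Cancer = "BLCA"
model_kind = check_cancer_name(Cancer)
print(model_kind)
```

The check is case-sensitive on purpose, since the name is later used verbatim in folder and file paths.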

### Place your input file in the same path as this notebook

The format of the input file, e.g., *BLCA_predict_input.csv*, is as follows:
1. The data starts with six columns to collect clinical information.
2. The first column, named 'patient bar', contains identifying information for the patient (can be blank for both tasks).
3. The second column, 'PFI', gives the progression-free interval (PFI) status: censored ('0') or uncensored ('1').
4. The third column, 'PFItime', represents the progression-free interval time in days.
5. The fourth column, 'gen_id', refers to the type of cancer, such as BLCA, BRCA, etc.
6. The fifth and sixth columns can include any information or be blank.
7. The remaining columns contain gene expression values, and the header of each column is the corresponding gene ID.
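
The layout above can be checked before running the rest of the notebook. The sketch below reads only the header row and verifies the positions of the three columns that the later cells access by name ('PFI', 'PFItime', 'gen_id'). The function `check_input_header`, the two 'extra' column names, and the gene IDs in the sample are hypothetical.

```python
import csv
import io

N_CLINICAL_COLUMNS = 6
# Zero-based positions of the columns that the later cells refer to by name.
EXPECTED_NAMES = {1: "PFI", 2: "PFItime", 3: "gen_id"}


def check_input_header(header):
    """Return a list of problems found in the header row (empty if none)."""
    problems = []
    if len(header) <= N_CLINICAL_COLUMNS:
        problems.append("no gene expression columns after the six clinical columns")
    for position, expected in EXPECTED_NAMES.items():
        found = header[position] if position < len(header) else None
        if found != expected:
            problems.append(
                "column {} should be {!r}, found {!r}".format(position + 1, expected, found)
            )
    return problems


# Made-up sample: one patient and two placeholder gene IDs.
sample_csv = (
    "patient bar,PFI,PFItime,gen_id,extra_1,extra_2,GENE_A,GENE_B\n"
    "P001,1,420,BLCA,,,5.1,0.3\n"
)
header = next(csv.reader(io.StringIO(sample_csv)))
print(check_input_header(header))
```

For a real file, the header row could be read the same way from an open file handle instead of the in-memory sample.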

In [1]:
input_pth = 'BLCA_predict_input.csv' # your input data file

### Read Input and Check

In [None]:
import pandas as pd
input_df = pd.read_csv(input_pth)
input_df

### Add Risk Classifier Labels

In [None]:
import os
import numpy as np

threshold = 3 * 365  # three years, in days

# Keep the selected cancer type, then drop patients who were censored
# at or before the threshold (only uncensored or longer follow-up rows remain).
data_get = input_df[input_df.gen_id == Cancer]
data_get = data_get[(data_get.PFI == 1) | (data_get.PFItime > threshold)].copy()

# Rename the fifth column to 'predicted_label' and fill it with the binary label:
# 0 if PFItime is below the threshold, 1 otherwise.
data_get = data_get.rename(columns={data_get.columns[4]: 'predicted_label'})
data_get['predicted_label'] = np.where(data_get['PFItime'] < threshold, 0, 1)

os.makedirs('./Dataset', exist_ok=True)  # make sure the output folder exists
data_get.to_csv('./Dataset/CancerRNA_Prediction_{}_Risk_2.txt'.format(Cancer), index=None)
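
To make the rule in the cell above concrete, the sketch below applies it to single patients. A patient censored at or before the three-year threshold is dropped, presumably because the three-year outcome is unknown; every remaining patient gets label 0 if PFItime is below the threshold and 1 otherwise. The helper `risk_label` is hypothetical and only mirrors the cell above.

```python
THRESHOLD_DAYS = 3 * 365  # same threshold as in the cell above


def risk_label(pfi, pfi_time, threshold=THRESHOLD_DAYS):
    """Return 0 or 1 for a kept patient, or None if the patient is filtered out."""
    keep = (pfi == 1) or (pfi_time > threshold)
    if not keep:
        return None  # censored at or before the threshold
    return 0 if pfi_time < threshold else 1


# Made-up (PFI, PFItime) pairs covering the four combinations.
examples = [(1, 200), (0, 200), (0, 1500), (1, 1500)]
for pfi, pfi_time in examples:
    print(pfi, pfi_time, risk_label(pfi, pfi_time))
```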

### Generate contrastive learning features based on the public cancer model

#### Get model path

In [None]:
import os

def find_clcp_folder_name(directory):
    """Return the name of the first entry in `directory` that starts with 'CLCP'."""
    for folder_name in os.listdir(directory):
        if folder_name.startswith('CLCP'):
            return folder_name
    # Fail here instead of returning a message that would be joined into a path
    raise FileNotFoundError('No CLCP folder found in {}'.format(directory))

# Directory that holds the classifier model for the selected cancer type
directory_to_search = './CL4CaPro_Models/Classifier Models/{}'.format(Cancer)
clcp_folder_name = find_clcp_folder_name(directory_to_search)
model_pth = os.path.join(directory_to_search, clcp_folder_name)

#### Generate features

In [None]:
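# The CLCP folder name encodes the model hyperparameters as underscore-separated fields
para = clcp_folder_name.split('_')
input_dim = para[1]
model_n_hidden_1 = para[2]
model_out_dim = para[3]
feat_dim = para[5]
lr = para[7]
l2_rate = para[9]
round = para[11]       # note: shadows the built-in round() for the rest of the notebook
seed = para[13]
batch_size = para[-3]
device = 0             # value passed to --gpu_device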
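
Positional indexing raises a bare IndexError if the folder name has fewer fields than expected. As a sketch, the fields can be collected into a dictionary after a length check. The index positions are copied from the cell above; the function `parse_clcp_name` and the placeholder folder name are hypothetical, and the real naming scheme is not shown here.

```python
# Positions of each hyperparameter in the underscore-separated folder name,
# copied from the cell above.
FIELD_INDEX = {
    "input_dim": 1,
    "model_n_hidden_1": 2,
    "model_out_dim": 3,
    "feat_dim": 5,
    "lr": 7,
    "l2_rate": 9,
    "round": 11,
    "seed": 13,
    "batch_size": -3,
}
MIN_FIELDS = 14  # index 13 must exist


def parse_clcp_name(folder_name):
    """Split a CLCP folder name and return its fields as a dict of strings."""
    parts = folder_name.split("_")
    if not folder_name.startswith("CLCP") or len(parts) < MIN_FIELDS:
        raise ValueError("Unexpected CLCP folder name: {!r}".format(folder_name))
    return {field: parts[index] for field, index in FIELD_INDEX.items()}


# Placeholder name: every token is a stand-in, not a real model folder.
placeholder = "CLCP_IN_H1_OUT_k4_FEAT_k6_LR_k8_L2_k10_ROUND_k12_SEED_BATCH_k15_k16"
params = parse_clcp_name(placeholder)
print(params)
```

The values stay strings because they are only passed through to the command line in the next cell.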

In [None]:
! python GenerateFeatures_Predict.py --layer_name feat --model_in_dim {input_dim} --dim_1_list {model_n_hidden_1} \
                                     --dim_2_list {model_out_dim} --dim_3_list {feat_dim} --batch_size {batch_size} \
                                     --l2_rate {l2_rate} --seed {seed} --round {round} --gpu_device {device} \
                                     --learning_rate_list {lr} --task Risk --model_pth "{model_pth}" \
                                     --cancer_group {Cancer}

#### Predict Results

Predict Risk

In [None]:
from xgboost import XGBClassifier

# Initialize a model instance
loaded_classifier_model = XGBClassifier()

# Load the model from the file
loaded_classifier_model.load_model('./CL4CaPro_Models/Classifier Models/{}/classifier_model.json'.format(Cancer))

predict_input_df = pd.read_csv('Features/PredictFeature_{}.txt'.format(Cancer))
X = predict_input_df.iloc[:, 6:]

predictions = loaded_classifier_model.predict(X)
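
The `predictions` array holds one 0/1 label per row of the feature file. A minimal sketch of turning those labels into a per-patient report follows. Mapping label 0 to the high-risk group is an assumption, inferred from the labelling cell, where 0 marks patients whose PFItime is below the three-year threshold; the function `label_patients` and the sample IDs are hypothetical.

```python
# Assumption: label 0 = high risk, label 1 = low risk (inferred, not confirmed by the source).
RISK_GROUP = {0: "high-risk", 1: "low-risk"}


def label_patients(patient_ids, predicted_labels):
    """Pair each patient ID with the risk group of its predicted label."""
    if len(patient_ids) != len(predicted_labels):
        raise ValueError("patient_ids and predicted_labels differ in length")
    return [
        (pid, RISK_GROUP[int(label)])
        for pid, label in zip(patient_ids, predicted_labels)
    ]


# Made-up IDs and labels; in the notebook these would presumably come from
# the first column of predict_input_df and from predictions.
report = label_patients(["P001", "P002", "P003"], [0, 1, 1])
for pid, group in report:
    print(pid, group)
```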

Calculate AUC for multiple patients

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_curve, roc_auc_score

prob_predictions = loaded_classifier_model.predict_proba(X)[:, 1]

y = predict_input_df['predicted_label']

# Compute the ROC curve and the area under it for the binary labels
fpr, tpr, _ = roc_curve(y, prob_predictions)
auc_roc = roc_auc_score(y, prob_predictions)
print(auc_roc)

# Plot
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % auc_roc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

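f1 = f1_score(y, predictions)
accuracy = accuracy_score(y, predictions)
precision = precision_score(y, predictions)
recall = recall_score(y, predictions)
print('F1: {:.3f}, accuracy: {:.3f}, precision: {:.3f}, recall: {:.3f}'.format(f1, accuracy, precision, recall))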
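
As a cross-check on what the AUC measures: it equals the probability that a randomly chosen patient with label 1 receives a higher predicted probability than a randomly chosen patient with label 0, with ties counted as one half. The sketch below computes that quantity directly in pure Python. It illustrates the definition and is not how scikit-learn computes the score; the function `pairwise_auc` and the sample values are hypothetical. The AUC is undefined unless both labels are present, which is why it is calculated for multiple patients.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (label 1, label 0) pairs ranked correctly; ties count 0.5."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC is undefined unless both labels 0 and 1 are present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities for four patients:
# 3 of the 4 (label 1, label 0) pairs are ordered correctly.
example_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

The double loop is quadratic in the number of patients, so this form is only suitable for small checks.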