# Radiomics for beginners
  
<b>'Data curation is the most laborious and boring step of the radiomics workflow' - anyone who has ever worked with radiomics</b>

## Introduction

In this script, you will use NSCLC-Radiomics, an open-source dataset of 422 non-small cell lung cancer (NSCLC) patients hosted on TCIA (https://wiki.cancerimagingarchive.net/display/Public/NSCLC-Radiomics), also known as MAASTRO-Lung-1 (Aerts, H. J. W. L., Wee, L., Rios Velazquez, E., Leijenaar, R. T. H., Parmar, C., Grossmann, P., Carvalho, S., Bussink, J., Monshouwer, R., Haibe-Kains, B., Rietveld, D., Hoebers, F., Rietbergen, M. M., Leemans, C. R., Dekker, A., Quackenbush, J., Gillies, R. J., Lambin, P. (2019). Data From NSCLC-Radiomics [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2015.PF0M9REI), released under the Attribution-NonCommercial 3.0 Unported license (https://creativecommons.org/licenses/by-nc/3.0/). You will build a classifier to differentiate contrast-enhanced from non-contrast-enhanced lung CT scans using radiomic features extracted from the gross tumor volume (GTV). Incredibly, these data were originally used, and still are, without this differentiation, leading to poor model performance for clinical outcomes. Such classification models can help you curate your dataset or add a new label to your list of predictors.

<b>Brief description</b>

Iodinated contrast agents can be used for obtaining a CT image depending on the clinical question that one wants to address. The purpose of such agents can be to enhance the contrast resolution between a lesion/ischemic or occluded region and the normally perfused surrounding tissues.
Sometimes contrast-enhanced CT is desirable, but the patient has an absolute contraindication for contrast, such as:
* allergy to intravenous contrast media,
* pregnancy (particularly in the first trimester),
* severe renal impairment,
* hyperthyroidism/goitre (IV contrast might induce thyrotoxic crisis), etc.    

The task of assessing whether an image is contrast-enhanced is <b>fairly easy in early/late arterial phases</b> (major arteries appear as hyperdense, white structures immediately after contrast bolus is administered through a peripheral vein), but may become <b>more challenging in the late portal phase</b> (blood supply from arteries reaches the liver and from there goes back into the venous system through the portal vein) and delayed venous phases (most contrast washes out from the tissue and is filtered through the kidneys).<br>  

| ![picture](https://drive.google.com/uc?export=view&id=1Kw47_0GUdvFwPjvjcGW588SGMXuV9TpF)      | ![picture](https://drive.google.com/uc?export=view&id=1w288Lda5tI6_4sy3sFepgGX4HB0YYm-z)                                                                                         |
|:-----------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| Patient without contrast-enhancement and with a mass in the right upper lobe<br>(red circle) | Patient with contrast enhancement (upper red circle); <br>both the major chambers/arteries of the heart as well as the aorta (arrow) <br>appear as hyperdense white structures |

<b>Problem</b>  

Usually information about the contrast is <b>not included in the DICOM tags</b>, and manual assessment of the CT for the purpose of data curation can take considerable time. In this task, we address this problem by using radiomics of the GTV to classify the images. The ground truth labelling was performed by a radiology expert.

<b>Notebook structure</b>  

This Python script will take you through the radiomics analysis steps where you'll learn how to read in the features, select the features, train the models and display the results:
 - installing packages and importing libraries,  
 - preparation of the imaging data,
 - feature extraction,
 - reading the data and assigning outcomes,
 - exploring the data,
 - removing highly correlated features,
 - selecting features using recursive feature elimination,
 - creating the classification models, displaying the results.

## Getting started

This is an interactive Python notebook. To run it, you don't need to install anything on your PC since the script is executed in the cloud. In the left panel, you can see the 'Files' folder, which contains the data. Results of the script execution will be saved in this folder.  
  
The dataset is open-source, so you can download the data to have a look at it. If you don't have any software to open DICOM files, you can download and install 3D Slicer (https://www.slicer.org/), RadiAnt viewer (https://www.radiantviewer.com/), or MicroDicom viewer (https://www.microdicom.com/downloads.html). As the feature extraction process takes time, we prepared .csv tables with pre-extracted features to work with. They are available in the same shared folder. All the data will be pulled from the shared Google Drive into the temporary environment of this notebook. How to get these data is explained further below.  

Please note that all the files you pull or upload to this notebook, as well as the files produced while executing the script, are automatically deleted as soon as you end the session.

First of all, the required Python packages have to be imported. Some of them are not installed in the environment of this notebook, so they need to be installed with the '!pip install <i>name-of-the-package</i>' command. We recommend that you get acquainted with the documentation and help for these packages. For example, google 'python sklearn' and you will get to the documentation quickly. Importing libraries is a necessary step in most programming languages, not only Python.

In [None]:
# installing some packages, which are not part of the present Google Colab environment

!pip install precision-medicine-toolbox

In [None]:
import os
import numpy as np
import pandas as pd
from pmtool.AnalysisBox import AnalysisBox
from pmtool.ToolBox import ToolBox
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, classification_report, confusion_matrix, roc_curve, roc_auc_score
import xgboost as xgb



In [None]:
!gdown --id 15CPMLABDUtDpsnoFBqSu5P4d8b7tBiod
!unzip AI4I2022_Radiomics_Beginners_Data_sample15.zip

## Feature extraction
There are many software packages to perform this step, each with its strengths and weaknesses. The feature extraction step takes the longest time, so we are happy to help you with it. We will supply you with "something we prepared earlier": the lists of features already extracted with Radiomics software, and the outcomes (CE/non-CE). The data is already split into train and validation sets. Therefore, you can skip this step or return to it later. If you want to skip this step, please move to the 'Reading the data' step.  

If you want to extract the features yourself anyway, you can execute the following code. Here we perform feature extraction with PyRadiomics (https://pyradiomics.readthedocs.io/en/latest/) through the precision-medicine-toolbox (https://github.com/primakov/precision-medicine-toolbox) interface. Both are open-source Python packages.  

First of all, we need to convert the DICOM data into NRRD. But we will start with unzipping the imaging archive. Pulling the shared files and unzipping (pay attention to the 'Files' tab of this notebook):

In [None]:
# listing the DICOM dataset parameters:
parameters = {'data_path': '/content/AI4I2022_Radiomics_Beginners_Data_sample15/data_dicom', # path to your DICOM data
              'data_type': 'dcm', # original data format: DICOM
              'multi_rts_per_pat': False}   # when False, it will look only
                                            # for 1 rtstruct in the patient folder,
                                            # this will speed up the process,
                                            # if you have more than 1 rtstruct per patient,
                                            # set it to True

# initializing the dataset object:
data_dcms = ToolBox(**parameters)

# getting the description for the first 10 files:
dataset_description = data_dcms.get_dataset_description('CT')
dataset_description.head(10)

We can plot some scanning parameters:

In [None]:
sns.set(context='poster', style='whitegrid')

study_date = sorted([ 'Nan' if x=='' or x=='NaN' else str(x[0:4]) for x in list(dataset_description['SeriesDate'])])[2:]
conv_kernel =['Nan' if x=='' or x=='NaN' else x for x in list(dataset_description['ConvolutionKernel'])]
tube_current =[-1 if x=='' or x=='NaN' else x for x in list(dataset_description['XRayTubeCurrent'])]
exposure =[-1 if x=='' or x=='NaN' else x for x in list(dataset_description['Exposure'])]
ps = sorted([(x[0]) for x in list(filter(lambda x: x != 'NaN', dataset_description['PixelSpacing'].values))])
sl_th = sorted([str(x)[0:3] for x in list(filter(lambda x: x != 'NaN', dataset_description['SliceThickness'].values))])
figures,descriptions = [study_date,conv_kernel,tube_current,exposure,ps,sl_th],['Study Date','Conv Kernel','Tube Current','Exposure','Pixel spacing','Slice Thickness']

fig,ax = plt.subplots(2,3,figsize=(25,15))
for i in range(2):
    for j in range(3):
        ax[i,j].hist(figures.pop(0),alpha=0.7)
        ax[i,j].set_title(descriptions.pop(0),fontsize=20)

plt.suptitle('Imaging parameters in the dataset')
plt.show()

As PyRadiomics does not support DICOM, we convert the images and masks into NRRD format:

In [None]:
data_dcms.convert_to_nrrd('/content/', 'gtv')

The data is saved in a temporary folder and will be lost as soon as you restart the notebook! We can read the NRRD data and make sure the conversion was correct and the images are co-aligned with the masks:

In [None]:
# initializing the newly created NRRD dataset:
data_nrrds = ToolBox(data_path = '/content/converted_nrrds',
                     data_type='nrrd')

# saving snapshots to JPEG files
data_nrrds.get_jpegs('/content/')

A quick look at the contours:

In [None]:
from ipywidgets import interact
import numpy as np
from PIL import Image

def browse_images(images,names):
    n = len(images)
    def view_image(i):
        plt.figure(figsize=(20,10))
        plt.imshow(images[i])#, cmap=plt.cm.gray_r, interpolation='nearest')
        plt.title('Slice: %s' % names[i])
        plt.axis('off')
        plt.show()
    interact(view_image, i=(0,n-1))

for pat,_ in data_nrrds:
    _,file_struct = [*os.walk(os.path.join('/content/images_quick_check/',pat))]
    root,images = file_struct[0],file_struct[2]
    imgs =[np.array(Image.open(os.path.join(root,img))) for img in images]
    print(pat)
    browse_images(imgs,images)
    break

In the precision-medicine-toolbox, we are using PyRadiomics software (https://pyradiomics.readthedocs.io/en/latest/) to extract the features. You can read the full documentation for the currently stable version: https://pyradiomics.readthedocs.io/_/downloads/en/stable/pdf/.

We are using a PyRadiomics parameters file customized for CT data:

In [None]:
parameters = "/content/AI4I2022_Radiomics_Beginners_Data/example_ct_parameters.yaml"
features = data_nrrds.extract_features(parameters)

Let's have a look at the dataframe with the features:

In [None]:
features.head(10)

Now you know how to extract features. Nevertheless, the prepared files also contain an expert's outcome (CE/non-CE); therefore, from now on, we recommend using those .csv files. Moreover, we selected only one ROI per patient (GTV), and the patients were split into training and testing sets. The key difference is that the present script uses PyRadiomics for feature extraction, whereas the prepared files were produced with Radiomics software (https://radiomics.bio/).

## Reading the data

For this step, you will need to specify the files containing the features and outcomes (in this case 0 or 1 for CE/non-CE). We have already cleaned the dataset to remove low quality CTs, and separated the data into training (N=253) and testing (N=106) groups.
It is important that apart from initial dataset cleaning, all the work is only performed in the training dataset and the test set remains untouched.

<b>NOTE: the separator character, text quotation, and decimal point character change depending on your country settings. If this step fails, add those settings to the read_csv function. Also, the outputs from different radiomics software packages can differ wildly, so some thought is necessary when doing this at home.</b>  
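If the default parsing fails, the separator, decimal character, and quote character can be passed to `read_csv` explicitly. The following is a minimal sketch, assuming a hypothetical export that uses semicolons as separators and commas as decimal points; the column name `Feature_A` and all values are invented for illustration and are not taken from the workshop files.

```python
import io
import pandas as pd

# Hypothetical export: ';' as separator, ',' as decimal point, '"' as text quotation
raw = 'General_PatientID;Outcome;Feature_A\n"P001";1;0,52\n"P002";0;1,75\n'

# Passing the regional settings explicitly instead of relying on the defaults
df = pd.read_csv(io.StringIO(raw), sep=';', decimal=',', quotechar='"')

# Feature_A should now be parsed as a floating point column, not as text
print(df.dtypes)
```

With a real file, the in-memory string would simply be replaced by the file path.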



In [None]:
data_train = pd.read_csv("/content/AI4I2022_Radiomics_Beginners_Data_sample15/data_features/WS2022_Beginners_TrainingSet.csv")
data_test = pd.read_csv("/content/AI4I2022_Radiomics_Beginners_Data_sample15/data_features/WS2022_Beginners_ValidationSet.csv")

# let's have a look at our dataframe:
data_test

<b>Getting the features and defining the outcome</b>

It's a good time to look at the .csv files using Google Sheets or similar software. If the data looks jumbled up, try "importing from text". Ask a helper if you need assistance with this. Let's get the list of the features from the .csv header first:

In [None]:
features = ...........# Print features names

print (features)

Separate the features and the outcomes:
Create one dataframe for features and one for outcomes from the .csv data loaded above.




In [None]:
outcome_train = ........
outcome_test = ........
data_train = ........
data_test = .........

# checking how the features dataframe looks like


## Exploratory data analysis

Before building the models, it is always useful to perform the exploratory data analysis to have a look at the data, understand the distribution of the features, and notice some data errors. To perform these steps, we will be using precision-medicine-toolbox (https://github.com/primakov/precision-medicine-toolbox), an open-source Python package, developed in the D-Lab, Maastricht University.  
  
Remember that for now, everything happens in the training data. Time to discuss why this is so important!  

In [None]:
# let's list the parameters of our features dataset
parameters = {
    'feature_path': "/content/AI4I2022_Radiomics_Beginners_Data_sample15/data_features/WS2022_Beginners_TrainingSet.csv", # path to csv/xls file with features
    'outcome_path': "/content/AI4I2022_Radiomics_Beginners_Data_sample15/data_features/WS2022_Beginners_TrainingSet.csv", #path to csv/xls file with outcome (in our case - the same file)
    'patient_column': 'General_PatientID', # name of column with patient ID
    'patient_in_outcome_column': 'General_PatientID', # name of column with patient ID in clinical data file
    'outcome_column': 'Outcome' # name of outcome column
}

# initializing the dataset, containing features
fs = AnalysisBox(**parameters)

Plotting feature distributions for the first 80 features to get a visual understanding of how the features are distributed in each class. For this and the next steps, interactive .html reports will be generated. You can find them on your Google Drive in the same folder as the original .csv file.

In [None]:
fs.plot_distribution(fs._......)

Plotting the correlation matrix for the first 80 features to have an idea about how many features are inter-correlated. Why is it important to know about the mutual feature correlation?

In [None]:
fs............(fs._feature_column[:80])

Performing Mann-Whitney U-test (with False Discovery Rate correction) to see if the feature distributions in classes are statistically different (for the first 80 features).

In [None]:
...................................
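For intuition about what this test involves, here is a minimal sketch on synthetic data. It does not use the precision-medicine-toolbox; it assumes SciPy's `mannwhitneyu` for the per-feature test and a hand-written Benjamini-Hochberg procedure as the False Discovery Rate correction (one common choice, which may differ from what the toolbox applies). The helper name `benjamini_hochberg` and the data are hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature table: 3 features, 40 samples per class
class_0 = rng.normal(0.0, 1.0, size=(40, 3))
class_1 = rng.normal(0.0, 1.0, size=(40, 3))
class_1[:, 0] += 2.0  # only the first feature truly differs between the classes

# One two-sided Mann-Whitney U-test per feature
p_values = np.array([
    mannwhitneyu(class_0[:, j], class_1[:, j], alternative='two-sided').pvalue
    for j in range(class_0.shape[1])
])

def benjamini_hochberg(p):
    """Return Benjamini-Hochberg adjusted p-values (FDR correction)."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity, starting from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

p_adjusted = benjamini_hochberg(p_values)
print(p_adjusted)
```

The correction matters because testing many features at once inflates the number of false positives if raw p-values are compared against 0.05.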


Plotting univariate ROC curves and calculating AUC scores to get an estimate of the predictive power of each individual feature (for the first 80 features). Can we just build a classifier based on the best feature?

In [None]:
auc_th = 0.5 # Modify
fs.plot_univariate_roc(............, auc_threshold=auc_th)
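A univariate AUC can also be computed directly by treating the raw feature values as classifier scores. The sketch below assumes synthetic data and scikit-learn's `roc_auc_score`; the feature names are hypothetical and the toolbox may compute its report differently.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)  # synthetic binary outcome, 50 samples per class

# Two hypothetical features: one shifted by the outcome, one pure noise
features = {
    'informative': rng.normal(0.0, 1.0, 100) + 2.0 * y,
    'noise': rng.normal(0.0, 1.0, 100),
}

for name, values in features.items():
    auc = roc_auc_score(y, values)
    # A feature inversely related to the outcome gives AUC < 0.5, so
    # max(auc, 1 - auc) reports discriminative power regardless of direction
    print(name, round(max(auc, 1 - auc), 2))
```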

Performing volumetric analysis to understand if our features are correlated to ROI volume and if the volume itself is a good predictor for our task. Why is it important to know if the features are correlated to volume?

In [None]:
corr_th = 0.5 # Modify
fs.volume_analysis(volume_feature='Shape_volume', corr_threshold=corr_th)
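The idea behind the volume check can be sketched with a plain Spearman correlation between each feature and the ROI volume. This is an illustration on synthetic data using `scipy.stats.spearmanr`, not the toolbox's own implementation; the feature names are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
volume = rng.uniform(1.0, 100.0, 60)  # synthetic ROI volumes

# Hypothetical features: one driven by volume, one independent of it
volume_driven = volume ** 2 + rng.normal(0.0, 1.0, 60)
independent = rng.normal(0.0, 1.0, 60)

for name, values in [('volume_driven', volume_driven), ('independent', independent)]:
    rho, p_value = spearmanr(volume, values)
    print(name, round(rho, 2))
```

A feature that is almost a monotonic function of volume adds little beyond the volume itself, which is why such features deserve a second look.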

Writing the basic statistics for every feature into a spreadsheet file (.xlsx) and having a look at the resulting dataframe.

In [None]:
fs.calculate_basic_stats(volume_feature='Shape_volume')

print('Basic statistics for each feature')
pd.read_excel('/content/AI4I2022_Radiomics_Beginners_Data_sample15/data_features/WS2022_Beginners_TrainingSet_basic_stats.xlsx')

## Features reduction and selection

We will remove columns with NA values and remove features with (near)zero variance, as they do not have value for classification and needlessly increase computation and dimensionality.

First, we will deal with the missing (NA) and zero variance values. Should we drop rows or columns of the dataframe?

In [None]:
features_non_nan = data_train.dropna(axis=1).columns
print ('Number of features without missing values: ', len(..............))

features_non_zero_var = data_train[features_non_nan].loc[:,
                                                         data_train[features_non_nan].std() > 0.3].columns
print ('Number of features with non-zero variance: ', len(..............))

Removing highly correlated features is a controversial step aimed at reducing the dimensionality of the feature space. Highly correlated features needlessly inflate the dimensionality of the feature space. The idea is that highly correlated features can be grouped together and represented by one representative feature. For feature pairs with a high Spearman correlation (r > 0.9), the feature with the highest mean correlation with all remaining features is removed.
Opponents of this step argue that just because features are correlated doesn't mean that they don't individually increase the model's performance.
Time to discuss the pros and cons of this step in more detail!

<b>If you have time at the end of today's workshop, the cutoff is certainly a variable that can be played with.</b>

In [None]:
def selectNonIntercorrelated(df_in, ftrs, corr_th):

    # selection of the features, which are not 'highly intercorrelated' (correlation is defined by Spearman coefficient);
    # pairwise correlation between all the features is calculated,
    # from each pair of features, considered as intercorrelated,
    # the feature with the higher mean absolute Spearman correlation coefficient
    # (averaged over all the features) is dropped;
    # note: this implementation runs once on the full input dataframe,
    # no bootstrapping or stratified resampling is performed

    # input:
    # df_in - input dataframe, containing feature values (dataframe, columns = features, rows = observations),
    # ftrs - list of dataframe features, used in analysis (list of feature names - string variables),
    # corr_th - threshold for Spearman correlation coefficient, defining each pair of features as intercorrelated (float)

    # output:
    # non_intercorrelated_features - list of names of features, which did not appear as inter-correlated

    corr_matrix = df_in[ftrs].corr(method='spearman').abs()  # correlate only the features under analysis
    mean_absolute_corr = corr_matrix.mean()
    intercorrelated_features_set = []
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))  # np.bool was removed in NumPy 1.24
    high_corrs = upper.where(upper > corr_th).dropna(how='all', axis=1).dropna(how='all', axis=0)

    for feature in high_corrs.columns:
        mean_absolute_main = mean_absolute_corr[feature]
        correlated_with_feature = high_corrs[feature].index[pd.notnull(high_corrs[feature])]
        for each_correlated_feature in correlated_with_feature:
            mean_absolute = mean_absolute_corr[each_correlated_feature]
            if mean_absolute_main > mean_absolute:
                if feature not in intercorrelated_features_set:
                    intercorrelated_features_set.append(feature)
            else:
                if each_correlated_feature not in intercorrelated_features_set:
                    intercorrelated_features_set.append(each_correlated_feature)

    non_intercorrelated_features_set = [e for e in ftrs if e not in intercorrelated_features_set]

    print ('Non intercorrelated features: ', non_intercorrelated_features_set)

    return non_intercorrelated_features_set

We will use a threshold of 0.9 for the absolute value of Spearman's correlation.

In [None]:
features_non_intercorrelated = selectNonIntercorrelated(data_train, features_non_zero_var, 0.9)
print ('Number of non-intercorrelated features: ', len(...................))

We will perform feature selection using Recursive Feature Elimination (RFE). In this step, feature selection is based on the outcome, so simple models are built and those features that contribute the least to the model are removed recursively. Here you can edit the parameters.

<b>You might have heard of variable normalization. Why are we not normalizing the variables (e.g. Z-score normalization)?</b>
Hint: e.g. https://stackoverflow.com/questions/57339104/is-normalization-necessary-for-randomforest
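One way to explore this question empirically is to train the same forest on raw and on rescaled features and compare the predictions. The sketch below assumes synthetic data from `make_classification` and a simple rescaling (every feature multiplied by 4); it is an experiment to reason about, not a proof.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification problem (not the workshop data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The same forest configuration trained on raw and on rescaled features
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X * 4.0, y)

# Fraction of samples for which both forests give the same prediction
agreement = np.mean(rf_raw.predict(X) == rf_scaled.predict(X * 4.0))
print('Fraction of identical predictions:', agreement)
```

Tree splits depend on the ordering of feature values rather than on their scale, which is worth keeping in mind when interpreting the result.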

<b>How many features do we need to select with RFE?
Time to discuss the pros and cons of using many versus few features.</b>  
  
There are some rules of thumb on how many features we need in the end:  
* $int(\frac{N_{samples}}{10})$ (Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H. T. (2012). Learning from data (Vol. 4, p. 4). New York: AMLBook.)  
* $\sqrt{N_{samples}}$ (Hua, J., Xiong, Z., Lowey, J., Suh, E., & Dougherty, E. R. (2005). Optimal number of features as a function of sample size for various classification rules. Bioinformatics, 21(8), 1509-1515.)
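As a worked example of the two rules, assume a hypothetical dataset of 200 samples (deliberately not the workshop dataset). Rounding the square root down to an integer is an assumption made here; the rule itself does not specify the rounding.

```python
import math

n_samples = 200  # hypothetical sample size, not the workshop dataset

n_features_abu_mostafa = n_samples // 10   # int(N / 10)
n_features_hua = math.isqrt(n_samples)     # sqrt(N), rounded down

print(n_features_abu_mostafa, n_features_hua)  # prints: 20 14
```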

In [None]:
print ('Number of samples in training dataset: ', len(outcome_train))
print ('Number of features to select according to Abu-Mostafa: ', ................)
print ('Number of features to select according to Hua: ', .....................)

Let's go for the lower number of features since our dataset is not large. Below we implement Recursive Feature Elimination, RFE (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html), based on Random Forest Classifier, RFC (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
number_selected_features = 6 # Modify
estimator = RandomForestClassifier(n_estimators=100, random_state=0)  # integer seed; np.random.seed(0) returns None
selector = RFE(estimator, n_features_to_select=number_selected_features, step=1)
selector = selector.fit(data_train[features_non_intercorrelated], outcome_train) # Modify outcome_train

And then select the features 'supported' by the algorithm.

In [None]:
support = selector.get_support()
selected_features_set = data_train[features_non_intercorrelated].loc[:, support].columns.tolist()

print (selected_features_set)

<b>How many features give the best performance?</b>

Time to discuss why we don't just use 100 features!  
  
Another option is to rank the features and then build the classifier by adding one feature at each step and checking its performance. We will evaluate the performance with the balanced accuracy score.

In [None]:
# ranking the features based on RFE results

features_ranks = pd.DataFrame({'Features': features_non_intercorrelated, 'Ranks': selector.ranking_})
features_ranks.sort_values(by='Ranks', inplace = True)

# taking one best feature first, building the RFC, estimating the performance;
# adding +1 next feature, repeating the steps to estimate the performance

ftrs_number_tuning = []
acc_tuning = []

for i in range (1, len(features_ranks) + 1):  # +1 so that the full feature set is evaluated too

    ftrs_number_tuning.append(i)
    estimator_tuning = RandomForestClassifier(n_estimators=100, random_state=0)  # integer seed for reproducibility
    estimator_tuning.fit(data_train[features_ranks['Features'][:i]], outcome_train) # Modify outcome train
    outcome_pred_tuning = estimator_tuning.predict(data_test[features_ranks['Features'][:i]])
    acc_tuning.append(balanced_accuracy_score(outcome_test, outcome_pred_tuning))

# plotting the results
plt.plot(ftrs_number_tuning, acc_tuning)
plt.xlabel('Number of features')
plt.ylabel('RFC performance')
plt.title('RFC performance depending on number of selected features')
plt.show()

What can we conclude from the plot? What are the downsides of the presented implementation? Is it correct to train and evaluate the model on the same samples? Is the selected performance metric correct? What other metrics can we use?

## Modeling



### MODEL 1: RFC

We started with RFC while performing RFE, so let us train this model with the selected features and evaluate it on the test data.  

Training the model (it's possible to vary the parameters):

In [None]:
rfc = RandomForestClassifier(n_estimators=100, random_state=0)  # integer seed; np.random.seed(0) returns None
rfc.fit(data_train[selected_features_set], outcome_train) # Modify outcome train

Prediction for the testing set:

In [None]:
outcome_pred_rfc = rfc.predict(data_test[selected_features_set]) # Modify what to test on

Performance reporting on some key classification scores:

In [None]:
print (classification_report(........., ............)) # Modify what to report on

Are precision and recall informative metrics? Why don't we report accuracy as a key metric? In which cases is accuracy not a suitable score?  
  
Another popular metric for classification is the Receiver Operating Characteristic (ROC) curve and the area under the curve (AUC). We will calculate true positive rates (TPR) and false positive rates (FPR) while varying the classification threshold and plot the curve.

In [None]:
fpr, tpr, _ = roc_curve(outcome_test, rfc.predict_proba(data_test[selected_features_set])[:, 1])
roc_auc = roc_auc_score(outcome_test, rfc.predict_proba(data_test[selected_features_set])[:, 1])

plt.plot(fpr, tpr)
plt.title('RFC ROC curve (AUC = {})'.format(roc_auc))
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()
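Each point on a ROC curve is one (FPR, TPR) pair obtained at a single threshold. The following sketch computes such pairs by hand on four toy samples; the helper `tpr_fpr` and the data are hypothetical and only illustrate what `roc_curve` does for every threshold.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])            # toy labels
scores = np.array([0.1, 0.4, 0.35, 0.8])   # toy predicted probabilities

def tpr_fpr(y_true, scores, threshold):
    """True and false positive rates when scores >= threshold count as positive."""
    pred = scores >= threshold
    tpr = np.sum(pred & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum(pred & (y_true == 0)) / np.sum(y_true == 0)
    return tpr, fpr

for th in (0.5, 0.3):
    print(th, tpr_fpr(y_true, scores, th))
```

At a threshold of 0.5 only one of the two positives is detected and there are no false alarms (TPR 0.5, FPR 0.0); lowering the threshold to 0.3 detects both positives at the cost of one false alarm (TPR 1.0, FPR 0.5).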

In which cases does this metric not give a correct representation of the model performance? Is AUC always the best metric?
Most certainly not! Especially for unbalanced datasets, AUC can be meaningless.
http://www.davidsbatista.net/blog/2018/08/19/NLP_Metrics/
"With imbalanced classes, it’s easy to get a high accuracy without actually making useful predictions. So, accuracy as an evaluation metrics makes sense only if the class labels are uniformly distributed"    

To have a better understanding of model behaviour, we will plot a confusion matrix.

In [None]:
cm = confusion_matrix(..............., .........) # Modify what to report on
f = sns.heatmap(cm, annot=True)
plt.title('Confusion matrix for RFC')
plt.show()
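A confusion matrix is especially useful with imbalanced classes, where plain accuracy can look good while one class is missed entirely. The sketch below uses toy labels (95 negatives, 5 positives, not the workshop data) and a trivial "classifier" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 95 + [1] * 5)   # toy labels: 95 % negatives
y_pred = np.zeros(100, dtype=int)       # always predicts the majority class

print(accuracy_score(y_true, y_pred))            # 0.95
print(balanced_accuracy_score(y_true, y_pred))   # 0.5
```

Balanced accuracy averages the recall of both classes, so missing every positive case pulls it down to chance level even though 95 % of the predictions are correct.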



### MODEL 2: XGBoost

The second model we will build is XGBoost, because the algorithm has recently been successful in many machine learning competitions.

<b>XGBoost - what is it? It's always best to understand your models. </b>

https://xgboost.readthedocs.io/en/stable/python/python_intro.html#

Train the model: here you can choose the maximum number of boosting iterations, a balance between computing time and accuracy.

Presenting the data in the appropriate format for the library.

In [None]:
dtrain = xgb.DMatrix(...........[selected_features_set], label=.........) # Modify the labels
dtest = xgb.DMatrix(............[selected_features_set], label=.........) # Modify the labels

Defining the parameters: the learning objective is logistic regression for binary classification with probability output, the metric is the area under precision-recall curve (why not ROC AUC?).

In [None]:
param = {'objective': 'binary:logistic', 'eval_metric': 'aucpr'}

We will train the model on the training set and evaluate on test set.

In [None]:
evallist = [(dtest, 'eval'), (dtrain, 'train')]

Training the model (number of iterations can be changed here!) and calculating outcomes for the test set.

In [None]:
num_round = 10
bst = xgb.train(param, dtrain, num_round, evallist)

outcome_pred_xgb = bst.predict(dtest)

Classification report:

In [None]:
print (classification_report(........, ............>0.5)) # Modify what to report on

ROC and ROC AUC:

In [None]:
fpr, tpr, _ = roc_curve(......., ..........) # This can be modified as an AUC curve was constructed already above
roc_auc = roc_auc_score(........, outcome_pred_xgb)

plt.plot(fpr, tpr)
plt.title('XGBoost ROC curve (AUC = {})'.format(roc_auc))
plt.show()

Create and display the confusion matrix and derived values for the XGBoost model.

In [None]:
cm = confusion_matrix(........., outcome_pred_xgb>0.5) # This can be modified as a confusion matrix was constructed already above
f = sns.heatmap(cm, annot=True)
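The derived values mentioned above follow directly from the four cells of a binary confusion matrix. This is a minimal sketch on toy labels and predictions (hypothetical values, not model output), relying on scikit-learn's layout `[[TN, FP], [FN, TP]]` for binary labels 0 and 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # toy labels
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])  # toy predictions

# For binary labels, the flattened matrix is ordered TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # recall of the positive class
specificity = tn / (tn + fp)   # recall of the negative class
ppv = tp / (tp + fp)           # positive predictive value (precision)
npv = tn / (tn + fn)           # negative predictive value
print(sensitivity, specificity, ppv, npv)
```

For real model output, the thresholded predictions (for example `outcome_pred_xgb > 0.5`) would take the place of the toy `y_pred`.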

<b>Almost done! You will now compare the performance of the models.</b>
  
Which classifier is better?  

To compare ROC AUC scores, we will perform a permutation test on the probabilities obtained on the test set for both classifiers.

In [None]:
# adapted from https://stackoverflow.com/questions/52373318/how-to-compare-roc-auc-scores-of-different-binary-classifiers
#-and-assess-statist

def permutation_test_between_clfs(y_test, pred_proba_1, pred_proba_2, nsamples=100):
    auc_differences = []
    auc1 = roc_auc_score(y_test.ravel(), pred_proba_1.ravel())
    auc2 = roc_auc_score(y_test.ravel(), pred_proba_2.ravel())
    observed_difference = auc1 - auc2
    for _ in range(nsamples):
        mask = np.random.randint(2, size=len(pred_proba_1.ravel()))
        p1 = np.where(mask, pred_proba_1.ravel(), pred_proba_2.ravel())
        p2 = np.where(mask, pred_proba_2.ravel(), pred_proba_1.ravel())
        auc1 = roc_auc_score(y_test.ravel(), p1)
        auc2 = roc_auc_score(y_test.ravel(), p2)
        auc_differences.append(auc1 - auc2)
    return observed_difference, np.mean(np.array(auc_differences) >= observed_difference)  # list converted to array for the element-wise comparison

print ('Difference, p-value: ',
       permutation_test_between_clfs(outcome_test,
                                     outcome_pred_xgb,
                                     rfc.predict_proba(data_test[selected_features_set])[:,1]))

After this test, which classifier is better?  
  
After this you can go back and "finetune" the models by changing the parameters you feel comfortable with. See if you can increase the model performance. Can you beat the other groups?