# XGBoost Overview

XGBoost is a gradient boosting implementation with several refinements that make it both more accurate and faster than traditional gradient boosting. Most importantly, it builds CART-style trees and includes regularization directly in its training objective, which controls model complexity. It also uses efficient split-finding strategies (for example, histogram-based approximations) that speed up the greedy search for the best split at each node.
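To make the "built-in regularization" concrete, the objective and split gain can be written out as below. This follows the standard formulation from the XGBoost documentation, with the optional L1 term omitted; the symbols ($T$ leaves, leaf weights $w_j$, gradient and Hessian sums $G$ and $H$ over the left and right children) are introduced here and are not defined elsewhere in this notebook.

$$
\mathrm{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2
$$

$$
\mathrm{Gain} = \tfrac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma
$$

A split is only worth making when its gain is positive, so a larger $\gamma$ or $\lambda$ leads to smaller, more conservative trees.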

## Why XGBoost is effective for this dataset

XGBoost should be effective for this dataset because our standalone decision trees already performed well, and XGBoost uses decision trees as its base learners. In addition, one of the biggest challenges of this dataset is class imbalance, and XGBoost exposes parameters that control class weights. Because these weights scale each data point's contribution to the loss, they influence both which splits the trees choose and how much each example matters, which should help balance precision and recall across our two classes.
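As a rough illustration of the class-weight idea, the sketch below computes the negative-to-positive ratio that the XGBoost documentation suggests as a starting value for `scale_pos_weight`. The helper name `suggested_scale_pos_weight` is hypothetical, and the label list is rebuilt from the class counts in the test-split report further down rather than from the real training labels.

```python
from collections import Counter


def suggested_scale_pos_weight(labels, positive=1):
    """Ratio of negative to positive examples (a common starting value, not a tuned one)."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = sum(counts.values()) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos


# Assumption: illustrative labels using the test-split class counts
# (1502 benign, 686 pathogenic), not the actual training labels.
labels = [0] * 1502 + [1] * 686
spw = suggested_scale_pos_weight(labels)
print(round(spw, 2))
```

In the notebook itself, the ratio would more naturally be computed from `ytr` and passed as `scale_pos_weight=spw` when constructing `xgb.XGBClassifier`.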

# Preparing the data ...

In [26]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, f1_score
from sklearn.preprocessing import OrdinalEncoder
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
data = pd.read_csv('../data/humsavar_dbnsfp53_complete.csv')
# Drop irrelevant columns (too many unique values, or 'cheating' because they are derived from the label)
data = data.drop(columns=['Gene', 'Entry', 'FTId', 'AA_change', 'Category', 'dbSNP', 'Disease', 'rs_dbSNP'])
# Update the values to all be numeric
label_mapping = {'Benign': 0, 'Pathogenic': 1}
data['Label'] = data['Label'].map(label_mapping)
data.head()

Unnamed: 0,Label,chr,pos,ref,alt,SIFT_score,SIFT_pred,Polyphen2_HDIV_score,Polyphen2_HDIV_pred,CADD_raw,CADD_phred,REVEL_score
0,1,1,93998027,A,G,0.049,D,0.765,P,4.435338,25.3,0.86
1,0,1,93998061,C,G,0.053,T,0.975,D,2.412747,18.61,0.503
2,0,1,93998061,C,T,0.268,T,0.061,B,1.492491,14.14,0.313
3,1,1,94000836,T,C,1.0,T,0.051,B,2.740497,20.1,0.577
4,1,1,94000866,C,G,0.0,D,1.0,D,4.731917,26.5,0.937


In [8]:
# Encode chromosomes as integers ('X' -> 23, 'Y' -> 24)
chrom_mapping = {
    **{str(i): i for i in range(1, 23)},  # "1"–"22"
    "X": 23,
    "Y": 24
}
data["chr"] = data["chr"].map(chrom_mapping)
data['chr'].describe()

count    10940.000000
mean        10.453382
std          6.880055
min          1.000000
25%          5.000000
50%         11.000000
75%         16.000000
max         23.000000
Name: chr, dtype: float64
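One thing worth checking with a dictionary-based `map` is whether any raw value falls outside the mapping, because `Series.map` silently turns such values into `NaN`. The sketch below is a plain-Python check, not part of the original notebook; the helper name `unmapped_labels` and the sample values are hypothetical.

```python
chrom_mapping = {
    **{str(i): i for i in range(1, 23)},  # "1" to "22"
    "X": 23,
    "Y": 24,
}


def unmapped_labels(values, mapping):
    """Return the raw values (as repr strings) that have no key in the mapping."""
    # Keys are looked up as-is, so the integer 7 does not match the string key "7".
    return sorted({repr(v) for v in values if v not in mapping})


# Hypothetical sample: "MT", the integer 7, and "chr3" would all become NaN.
sample = ["1", "22", "X", "MT", 7, "chr3"]
print(unmapped_labels(sample, chrom_mapping))
```

Running the same check on `data["chr"]` before mapping (for example via `data["chr"].unique()`) would show whether anything needs cleaning first. In the output above, the maximum of 23 indicates that no `Y` variants are present in this dataset.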

In [9]:
# SIFT, PolyPhen-2, and CADD each have two columns, so let's keep only the more detailed one
data = data.drop(columns=['SIFT_pred', 'Polyphen2_HDIV_pred', 'CADD_phred'])
data.head()

Unnamed: 0,Label,chr,pos,ref,alt,SIFT_score,Polyphen2_HDIV_score,CADD_raw,REVEL_score
0,1,1,93998027,A,G,0.049,0.765,4.435338,0.86
1,0,1,93998061,C,G,0.053,0.975,2.412747,0.503
2,0,1,93998061,C,T,0.268,0.061,1.492491,0.313
3,1,1,94000836,T,C,1.0,0.051,2.740497,0.577
4,1,1,94000866,C,G,0.0,1.0,4.731917,0.937


In [12]:
# Keep chromosomes ordinally encoded, but use one-hot encoding for ref and alt
dummies = pd.get_dummies(data[['ref', 'alt']])
dummies.head()

Unnamed: 0,ref_A,ref_C,ref_G,ref_T,alt_A,alt_C,alt_G,alt_T
0,True,False,False,False,False,False,True,False
1,False,True,False,False,False,False,True,False
2,False,True,False,False,False,False,False,True
3,False,False,False,True,False,True,False,False
4,False,True,False,False,False,False,True,False
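For readers unfamiliar with one-hot encoding, the sketch below reproduces in plain Python what the indicator columns above represent for a single variant. The function name `one_hot_variant` is hypothetical, and the fixed base list assumes only A, C, G, and T occur, as in the columns shown.

```python
BASES = ["A", "C", "G", "T"]  # assumption: the only alleles present, matching the columns above


def one_hot_variant(ref, alt, bases=BASES):
    """Build the ref_* and alt_* indicator flags for one variant."""
    row = {f"ref_{b}": ref == b for b in bases}
    row.update({f"alt_{b}": alt == b for b in bases})
    return row


# Row 0 of the dataset has ref=A and alt=G.
row0 = one_hot_variant("A", "G")
print([name for name, flag in row0.items() if flag])
```

Exactly one `ref_*` flag and one `alt_*` flag is set per row, so the encoding imposes no artificial ordering on the bases.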


In [13]:
data = pd.concat([data, dummies], axis=1)
data.head()

Unnamed: 0,Label,chr,pos,ref,alt,SIFT_score,Polyphen2_HDIV_score,CADD_raw,REVEL_score,ref_A,ref_C,ref_G,ref_T,alt_A,alt_C,alt_G,alt_T
0,1,1,93998027,A,G,0.049,0.765,4.435338,0.86,True,False,False,False,False,False,True,False
1,0,1,93998061,C,G,0.053,0.975,2.412747,0.503,False,True,False,False,False,False,True,False
2,0,1,93998061,C,T,0.268,0.061,1.492491,0.313,False,True,False,False,False,False,False,True
3,1,1,94000836,T,C,1.0,0.051,2.740497,0.577,False,False,False,True,False,True,False,False
4,1,1,94000866,C,G,0.0,1.0,4.731917,0.937,False,True,False,False,False,False,True,False


In [14]:
# Drop the non-encoded columns
data = data.drop(columns=['ref', 'alt'])
data.head()

Unnamed: 0,Label,chr,pos,SIFT_score,Polyphen2_HDIV_score,CADD_raw,REVEL_score,ref_A,ref_C,ref_G,ref_T,alt_A,alt_C,alt_G,alt_T
0,1,1,93998027,0.049,0.765,4.435338,0.86,True,False,False,False,False,False,True,False
1,0,1,93998061,0.053,0.975,2.412747,0.503,False,True,False,False,False,False,True,False
2,0,1,93998061,0.268,0.061,1.492491,0.313,False,True,False,False,False,False,False,True
3,1,1,94000836,1.0,0.051,2.740497,0.577,False,False,False,True,False,True,False,False
4,1,1,94000866,0.0,1.0,4.731917,0.937,False,True,False,False,False,False,True,False


In [15]:
y = data['Label']
X = data.drop(columns=['Label'])

In [18]:
Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=0.8, random_state=0)
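Given the class imbalance, it may be worth keeping the class proportions identical in the training and test sets. In scikit-learn this is done by passing `stratify=y` to `train_test_split`. The sketch below shows the underlying idea in plain Python on hypothetical toy labels; the function `stratified_split` is illustrative and is not what the notebook actually ran.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_size=0.8, seed=0):
    """Return (train_idx, test_idx) with each class split at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        cut = round(len(indices) * train_size)    # per-class cut point
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx


# Hypothetical toy labels with a 70/30 imbalance.
labels = [0] * 700 + [1] * 300
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts[0], test_counts[1])
```

With stratification, the test set keeps the same 70/30 ratio as the full toy label list, whereas a purely random split only matches it approximately.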

## Baseline: No parameter tuning; equal class weights

In [24]:
xgb_base = xgb.XGBClassifier(random_state=0, objective='binary:logistic')
xgb_base.fit(Xtr, ytr)

Parameter,Value
objective,'binary:logistic'
base_score,
booster,
callbacks,
colsample_bylevel,
colsample_bynode,
colsample_bytree,
device,
early_stopping_rounds,
enable_categorical,False


In [27]:
# Test
y_hat = xgb_base.predict(Xte)
print('ACCURACY:')
print(accuracy_score(yte, y_hat))
print('F1:')
print(f1_score(yte, y_hat))
report = classification_report(yte, y_hat)
print(report)

ACCURACY:
0.9442413162705667
F1:
0.908955223880597
              precision    recall  f1-score   support

           0       0.95      0.97      0.96      1502
           1       0.93      0.89      0.91       686

    accuracy                           0.94      2188
   macro avg       0.94      0.93      0.93      2188
weighted avg       0.94      0.94      0.94      2188
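To see how the reported numbers relate to each other, the sketch below recomputes them from confusion-matrix counts. The counts are not printed by the notebook; they are inferred from the reported F1, accuracy, and class supports (609 true positives, 45 false positives, 77 false negatives, 1457 true negatives), so treat them as a consistency check rather than as notebook output.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 (for the positive class) and overall accuracy."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Assumption: counts inferred from the report above, with pathogenic (1) as the positive class.
m = binary_metrics(tp=609, fp=45, fn=77, tn=1457)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

The 77 false negatives are pathogenic variants predicted as benign, which is the error type that class weighting is meant to reduce.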



Even the untuned XGBClassifier gives strong results: about 94% accuracy and an F1 of 0.91, with 0.93 precision and 0.89 recall for the pathogenic class. It beats the Naive Bayes model on both precision and recall for that class.