<a href="https://colab.research.google.com/github/LarsBentsen/CourseDSAIStatisticalLearning/blob/main/binary_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AIDS Virus infection classification

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, balanced_accuracy_score, roc_auc_score, recall_score, RocCurveDisplay
from imblearn.over_sampling import SMOTE

## Load the AIDS classification dataset

In [None]:
df = pd.read_csv('https://github.com/LarsBentsen/CourseDSAIStatisticalLearning/blob/main/data/AIDS_Classification.csv?raw=true')
print("number of rows: ", df.shape[0])
df.head()

In [None]:
# check missing values
df.info()

In [None]:
df.describe()

In [None]:
# hold out a random 20% of the rows as the test set

np.random.seed(666)
test_indxs = np.random.choice(np.arange(df.shape[0]), size=df.shape[0] // 5, replace=False)
df_test = df.iloc[test_indxs]
df = df.drop(test_indxs)

In [None]:
# look at the response: infected
plt.figure(figsize=(8,6))
# the response is binary, so a count plot shows the class sizes directly
sns.countplot(x=df['infected'])
plt.title("The distribution of response: Infected")
plt.ylabel("")
plt.show()

In [None]:
# check class imbalance: ratio of infected (1) to non-infected (0) rows in the training set
df[df['infected'] == 1].shape[0] / df[df['infected'] == 0].shape[0] 

There are many options for data pre-processing, and the right choice depends on the data and the analysis. For training purposes, we skip straight to model fitting.

In [None]:
# separate the features (X) from the response (y)

y_train = df['infected']
X_train = df.drop('infected',axis = 1)

y_test = df_test['infected']
X_test = df_test.drop('infected',axis = 1)

## Random Forest modelling

In [None]:
# Random Forest
rfc = RandomForestClassifier() # default hyperparameter values
model = rfc.fit(X_train,y_train)

Random Forest has many tunable hyperparameters; see the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier) for the full list. For imbalanced binary classification, the parameter `class_weight` is of particular interest: it sets the weight associated with each class. If it is not given, all classes have weight one.
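As an illustration of `class_weight`, the sketch below fits a Random Forest with `class_weight="balanced"` on synthetic data. The names `X_demo`, `y_demo` and `rfc_balanced` are assumptions introduced for this example and stand in for the real training data; whether the balanced setting helps on the AIDS dataset has to be checked empirically.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data (assumption: a stand-in for X_train, y_train)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)

# "balanced" weights are n_samples / (n_classes * count_per_class),
# so the minority class receives the larger weight
classes = np.unique(y_demo)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
print(dict(zip(classes, weights)))

# Fit a forest that applies these weights internally
rfc_balanced = RandomForestClassifier(class_weight="balanced", random_state=0)
rfc_balanced.fit(X_demo, y_demo)
```

A dictionary such as `class_weight={0: 1, 1: 3}` can be passed instead when the weights should be chosen by hand.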

In [None]:
# evaluate model

y_pred = model.predict(X_test)               # hard class labels
y_score = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1
accuracy = accuracy_score(y_test, y_pred)
bacc = balanced_accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)  # AUC is computed from scores, not hard labels
sensitivity = recall_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Sensitivity:", sensitivity)
print("Balanced accuracy:", bacc)
print("AUC:", auc)

# confusion matrix (rows: true class, columns: predicted class)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
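The label-based metrics in the cell above all follow from the four counts in the confusion matrix. The sketch below makes that explicit on a small hand-made example; `y_true_demo` and `y_pred_demo` are hypothetical labels, not model output.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    recall_score,
)

# Hypothetical labels: 6 negatives and 4 positives
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# scikit-learn convention: confusion_matrix(y_true, y_pred),
# rows are true classes and columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

sensitivity_demo = tp / (tp + fn)   # recall of the positive class
specificity_demo = tn / (tn + fp)   # recall of the negative class
balanced_demo = (sensitivity_demo + specificity_demo) / 2
accuracy_demo = (tp + tn) / (tp + tn + fp + fn)

# The hand-computed values agree with the library functions
assert np.isclose(sensitivity_demo, recall_score(y_true_demo, y_pred_demo))
assert np.isclose(balanced_demo, balanced_accuracy_score(y_true_demo, y_pred_demo))
assert np.isclose(accuracy_demo, accuracy_score(y_true_demo, y_pred_demo))
```

Because balanced accuracy averages the per-class recalls, it is not inflated by a classifier that mostly predicts the majority class, which is why it is reported alongside plain accuracy here.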

In [None]:
# plot ROC curve from predicted probabilities (hard labels give only one operating point)
RocCurveDisplay.from_estimator(model, X_test, y_test)

## XGBoost modelling

In [None]:
# XGBoost

clf = xgb.XGBClassifier() # default hyperparameter values
model = clf.fit(X_train, y_train, eval_set=[(X_test, y_test)])


XGBoost has many tunable hyperparameters; see the [XGBoost parameter documentation](https://xgboost.readthedocs.io/en/stable/parameter.html) for the full list. For imbalanced binary classification, the parameter `scale_pos_weight` is of particular interest: it controls the balance of positive and negative weights. The default value is 1, while a typical value to consider is sum(negative instances) / sum(positive instances).
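A minimal sketch of computing that typical value from 0/1 labels follows. The helper `scale_pos_weight_from_labels` and the array `y_demo` are hypothetical names introduced for illustration. Note that this ratio (negatives over positives) is the inverse of the ratio printed in the class-imbalance check earlier in the notebook.

```python
import numpy as np

def scale_pos_weight_from_labels(y):
    """Return (number of negatives) / (number of positives) for 0/1 labels."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("y contains no positive instances")
    return n_neg / n_pos

# Hypothetical labels: 75 negatives and 25 positives
y_demo = np.array([0] * 75 + [1] * 25)
spw = scale_pos_weight_from_labels(y_demo)
print(spw)  # 3.0

# With xgboost imported as xgb, the value could then be passed as:
# clf = xgb.XGBClassifier(scale_pos_weight=spw)
```

In the notebook, the same helper would be called on `y_train`, so that the weight is derived from the training data only.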

In [None]:
# evaluate model

y_pred = model.predict(X_test)               # hard class labels
y_score = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1
accuracy = accuracy_score(y_test, y_pred)
bacc = balanced_accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)  # AUC is computed from scores, not hard labels
sensitivity = recall_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Sensitivity:", sensitivity)
print("Balanced accuracy:", bacc)
print("AUC:", auc)

# confusion matrix (rows: true class, columns: predicted class)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
# plot ROC curve from predicted probabilities (hard labels give only one operating point)
RocCurveDisplay.from_estimator(model, X_test, y_test)

## Synthetic Minority Over-sampling Technique (SMOTE)
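SMOTE tackles class imbalance by oversampling the minority class with synthetic samples, each created by interpolating between a minority sample and one of its nearest minority-class neighbours. For details, read the original article [here](https://arxiv.org/pdf/1106.1813).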
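To make the interpolation idea concrete, here is a simplified NumPy sketch. It is not the `imblearn` implementation: the function `smote_sketch`, its parameters, and the toy array `X_min` are assumptions made for illustration, and it handles only numeric features of a single minority class.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points from minority samples X_min (needs >= 2 rows)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances between minority samples
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new point
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the connecting segment
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=6, k=2)
print(X_new.shape)  # (6, 2)
```

Since every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already covered by the minority class.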

In [None]:
# Apply SMOTE to the training set only (the test set stays untouched)
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

In [None]:
# Random Forest
rfc = RandomForestClassifier() # default hyperparameter values
model = rfc.fit(X_res,y_res)

In [None]:
# evaluate model

y_pred = model.predict(X_test)               # hard class labels
y_score = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1
accuracy = accuracy_score(y_test, y_pred)
bacc = balanced_accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)  # AUC is computed from scores, not hard labels
sensitivity = recall_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Sensitivity:", sensitivity)
print("Balanced accuracy:", bacc)
print("AUC:", auc)

# confusion matrix (rows: true class, columns: predicted class)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
# plot ROC curve from predicted probabilities (hard labels give only one operating point)
RocCurveDisplay.from_estimator(model, X_test, y_test)

In [None]:
# XGBoost

clf = xgb.XGBClassifier() # default hyperparameter values
model = clf.fit(X_res, y_res, eval_set=[(X_test, y_test)])


In [None]:
# evaluate model

y_pred = model.predict(X_test)               # hard class labels
y_score = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1
accuracy = accuracy_score(y_test, y_pred)
bacc = balanced_accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)  # AUC is computed from scores, not hard labels
sensitivity = recall_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Sensitivity:", sensitivity)
print("Balanced accuracy:", bacc)
print("AUC:", auc)

# confusion matrix (rows: true class, columns: predicted class)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
# plot ROC curve from predicted probabilities (hard labels give only one operating point)
RocCurveDisplay.from_estimator(model, X_test, y_test)