# <h1><center>Higgs Boson Classification</center></h1>
## Part-2
### Model training

After splitting and scaling the data we are ready to train our models. We use GPU-accelerated implementations for training and testing: the Random Forest model from the cuML library and XGBoost with GPU support. We will train both on our complete dataset as well as on the dimensionality-reduced set and observe the difference in the performance of the models. We will also try to tune our models by changing the hyperparameters.

In [37]:
#Import the RandomForest classifier from cuML library
from cuml.ensemble import RandomForestClassifier as cuRF

For our first random forest model we will use the default hyperparameters. Below is a description of the hyperparameters that we set in our model.

1- **n_estimators:** The number of trees in the forest. Increasing this number typically improves the performance of the classifier, but also increases computation time.

2- **max_depth:** The maximum depth of each tree in the forest. This parameter controls the complexity of the trees, and increasing it can lead to overfitting.

3- **n_bins:** The number of bins used when splitting continuous features. A higher number of bins can lead to better accuracy, but also increases computation time.

4- **n_streams:** The number of parallel streams used to build the trees. This parameter can be set to the number of available GPU streams for faster computation.

5- **max_samples:** The fraction of the training samples drawn to build each tree. This parameter controls the amount of randomness in the forest, and setting it to 1 means that each tree is built from a sample as large as the full training set.

6- **split_criterion:** The criterion used to select the best feature to split on at each node. A value of 0 indicates the Gini impurity criterion, while a value of 1 indicates the entropy criterion.

7- **random_state:** The random seed used for reproducibility.

In [11]:
# Default Random Forest params for our first model
cu_rf_params = {
    'n_estimators'    : 100,
    'max_depth'       : 16,
    'n_bins'          : 128,
    'n_streams'       : 4,
    'max_samples'     : 1,
    'split_criterion' : 0,
    'random_state'    : 123
}
cu_rf = cuRF(**cu_rf_params)

We chose cuML for this step because we wanted to leverage the GPU to improve performance and cut down on processing time. We will also time the training run. Using the scaled training set, we will train our model.

In [12]:
%%time

cu_rf.fit(X_train_scaled, y_train)

CPU times: user 1min 35s, sys: 254 ms, total: 1min 35s
Wall time: 26.6 s


RandomForestClassifier()

The trained model is applied to the test set using its predict method to produce predictions.

In [14]:
%%time

# using the predict method on test set
y_pred = cu_rf.predict(X_test_scaled)

CPU times: user 2min 57s, sys: 172 ms, total: 2min 57s
Wall time: 2min 57s


We can determine how many of the test predictions were correct by comparing them to the actual labels. The accuracy score function does this for us.

In [15]:
print('Accuracy score: ', accuracy_score(y_test, y_pred))

Accuracy score:  0.7331298589706421
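
Accuracy is simply the fraction of test labels that the model predicted correctly. The following sketch recomputes it by hand on a small, made-up pair of label lists (the values are illustrative and not taken from the Higgs data), assuming both inputs are plain Python lists of 0/1 labels:

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Illustrative labels only (not from the notebook's dataset)
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

toy_accuracy = accuracy(toy_true, toy_pred)
print("Accuracy score:", toy_accuracy)  # 6 of 8 correct -> 0.75
```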


This model's accuracy of 73% is a reasonable starting point. To test whether changing the hyperparameters boosts performance, we can undertake hyperparameter tuning.

### Hyper-parameter tuning

Our model was previously trained using default parameters; however, we may now change the parameters to see whether we obtain any useful results. 

Increasing the number of bins in the Random Forest model can improve its ability to capture nonlinear relationships between the features and the target variable, as it allows for more granular splitting of the data. However, this can also increase the risk of overfitting, especially if the data is noisy or the number of samples is small.

Increasing the max depth of the trees in the Random Forest model can improve its ability to capture complex interactions between the features and the target variable. However, it can also increase the risk of overfitting, especially if the data is noisy or the number of samples is small.

Increasing the number of estimators in the Random Forest model can improve its ability to generalize to new data by reducing the variance of the model. However, this improvement in performance may come at the cost of increased computational complexity and longer training time.

We can also change the random state, which changes the random sampling used when building the trees, to check that the result does not depend on one particular seed.
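
When several hyperparameters are varied at once, it can help to enumerate the candidate configurations systematically instead of editing the dictionary by hand. The sketch below is one possible way to do that with the standard library. The search space and the helper name `candidate_configs` are hypothetical, and the values are examples rather than recommendations.

```python
from itertools import product

# Hypothetical search space; the values are examples, not recommendations.
search_space = {
    "n_estimators": [100, 500],
    "max_depth": [16, 20],
    "n_bins": [128, 150],
}

# Settings held fixed across all candidates (same values as the first model).
fixed_params = {
    "n_streams": 4,
    "max_samples": 1,
    "split_criterion": 0,
    "random_state": 123,
}


def candidate_configs(space, fixed):
    """Yield one complete parameter dict per combination in the search space."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        config = dict(fixed)
        config.update(zip(names, values))
        yield config


configs = list(candidate_configs(search_space, fixed_params))
print(len(configs), "candidate configurations")
for config in configs[:2]:
    # Each dict could then be unpacked into the classifier, e.g. cuRF(**config)
    print(config)
```

Because every extra value multiplies the number of candidates, and each candidate here takes minutes to train and predict, a small grid like this is usually the practical limit.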

In [12]:
# cuml Random Forest params
cu_rf_params_2 = {
    'n_estimators'   : 500, # increased from 100 to 500
    'max_depth'      : 20,  # changed from 16 to 20
    'n_bins'         : 150, # increased the bins from 128 to 150
    'n_streams'      : 4,   # default
    'max_samples'    : 1,   # default
    'split_criterion': 0,   # default
    'random_state'   : 786  # changed the random seed to vary the sampling used to build the trees
}
cu_rf_2 = cuRF(**cu_rf_params_2)

In [13]:
%%time

trained_model=cu_rf_2.fit(X_train_scaled, y_train)

CPU times: user 9min 19s, sys: 2.5 s, total: 9min 21s
Wall time: 2min 34s


In [39]:
%%time

y_pred_2 = cu_rf_2.predict(X_test_scaled)

CPU times: user 25min 17s, sys: 1.47 s, total: 25min 18s
Wall time: 25min 17s


In [40]:
print('Accuracy score: ', accuracy_score(y_test, y_pred_2))

Accuracy score:  0.7444262504577637


We can observe that altering the hyperparameter values improves accuracy slightly (from 0.733 to 0.744), albeit at the cost of about 22 more minutes of prediction time.
To evaluate whether the performance changes further, let's try a different random seed together with more trees and more bins.

In [42]:
%%time
# cuml Random Forest params
cu_rf_params_3 = {
    'n_estimators'   : 600, # increased the number of trees from 500 to 600
    'max_depth'      : 20,  # unchanged from the second experiment
    'n_bins'         : 180, # increased the bins from 150 to 180
    'n_streams'      : 4,   # CUDA stream to use for parallel processing on GPU, default is 4
    'max_samples'    : 1,   # Percentage of input data to be considered for each tree, default is 1
    'split_criterion': 0,   # Split algorithm, default is 0 for gini impurity
    'random_state'   : 72   # Seed used for Random Number Generator
}
cu_rf_3 = cuRF(**cu_rf_params_3)
cu_rf_3.fit(X_train_scaled, y_train)

CPU times: user 11min 8s, sys: 2.71 s, total: 11min 11s
Wall time: 3min 16s


RandomForestClassifier()

In [43]:
%%time
y_pred_3 = cu_rf_3.predict(X_test_scaled)

CPU times: user 31min 48s, sys: 1.14 s, total: 31min 49s
Wall time: 31min 48s


In [44]:
print('Accuracy score: ', accuracy_score(y_test, y_pred_3))

Accuracy score:  0.7445030808448792


The accuracy is almost the same as in the second experiment, so we can assume that without further processing of the data we will not be able to change the accuracy of the model substantially.

In [None]:
from joblib import dump
# Save the fitted second model (cu_rf_2) to disk
dump(trained_model, 'RF.model')

### Train Random Forest with PCA components

Now let's train the random forest algorithm with the PCA components and see the effect on the performance. 

In [15]:
cu_rf_params_pca = {
    'n_estimators'    : 500, 
    'max_depth'       : 20, 
    'n_bins'          : 150, 
    'n_streams'       : 4, 
    'max_samples'     : 1, 
    'split_criterion' : 0, 
    'random_state'    : 786
}
# initialise RF object
cu_rf_PCA = cuRF(**cu_rf_params_pca)

Fit the model on the PCA components.

In [16]:
%%time

cu_rf_PCA.fit(components, y_train)

CPU times: user 5min 58s, sys: 1.14 s, total: 5min 59s
Wall time: 1min 38s


RandomForestClassifier()

Because there are fewer features, training takes less time. To make predictions on the test set, we also need to convert it into PCA components.

In [17]:
X_test_PCA = pca.transform(X_test_scaled)
X_test_PCA.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.087741 | -0.21769 | -0.298753 |
| 1 | 0.532698 | 2.316275 | -0.554948 |
| 2 | -1.379118 | -1.403355 | 1.91752 |
| 3 | 0.948131 | 0.459213 | 1.628539 |
| 4 | 2.381529 | 1.415918 | -1.045559 |
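
The test set has to be projected with the same transformation that was learned from the training data, otherwise the two sets would live in different coordinate systems. The sketch below illustrates this with plain NumPy on synthetic data. It assumes the notebook's `pca` object was fitted on the training set, and the array names here are hypothetical stand-ins rather than the notebook's variables.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the scaled train and test sets (illustrative only)
X_train_toy = rng.normal(size=(200, 5))
X_test_toy = rng.normal(size=(50, 5))

n_components = 3

# "Fit": learn the mean and the principal axes from the training data only
train_mean = X_train_toy.mean(axis=0)
_, _, vt = np.linalg.svd(X_train_toy - train_mean, full_matrices=False)
axes = vt[:n_components]  # shape (3, 5); each row is one principal axis

# "Transform": apply the same mean and axes to both sets
train_components = (X_train_toy - train_mean) @ axes.T
test_components = (X_test_toy - train_mean) @ axes.T

print(train_components.shape, test_components.shape)
```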


On the test set produced using PCA, the .predict() function will be used to get the predictions.

In [18]:
%%time
# using the predict method on test set
pred_pca = cu_rf_PCA.predict(X_test_PCA)

CPU times: user 18min 38s, sys: 541 ms, total: 18min 38s
Wall time: 18min 37s


In [19]:
print('Accuracy score: ', accuracy_score(y_test, pred_pca))

Accuracy score:  0.580947995185852


Accuracy drops from about 0.74 to 0.58 when the model is trained on the PCA components. The three components evidently discard much of the information the classifier needs, so for this dataset the shorter training time does not justify the loss in accuracy.

## XGBoost

Extreme Gradient Boosting, or XGBoost, is a powerful supervised machine learning technique. Random Forest and XGBoost are among the strongest traditional machine learning algorithms, and on tabular data they often produce results comparable to those of neural networks. XGBoost can be trained on the GPU alongside RAPIDS, and it parallelizes well and scales to large datasets.

In [12]:
import xgboost as xgb

In XGBoost, a DMatrix is a data structure that is used to represent the input data for training or prediction. It is essentially a memory-optimized format that is designed to efficiently store and access large datasets, and it is used by the XGBoost library for training and predicting with gradient boosting models.

The DMatrix format is designed to handle both dense and sparse data, and it supports a variety of input formats. When creating a DMatrix, you typically provide the input data along with any associated labels or weights that are required for training or prediction.

DMatrix keeps memory usage low by storing sparse data in a compressed form that records only the entries that are present, so they can be accessed quickly during training. Dense data is accepted directly and converted to XGBoost's internal representation when the DMatrix is built.

Once a DMatrix has been created, it can be used to train an XGBoost model using the xgboost.train() function or to make predictions using the predict() method of the trained Booster. The DMatrix format provides a convenient way to handle large datasets efficiently in XGBoost, which can be especially important when working with high-dimensional data or when training on a large number of samples.
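
To make the idea of storing only the non-zero entries concrete, the sketch below builds a compressed sparse row (CSR) layout by hand. It is a toy illustration of compressed sparse storage in general, written in plain Python, and is not XGBoost's internal code. The helper names `to_csr` and `csr_row` are hypothetical.

```python
def to_csr(rows):
    """Convert a dense list of rows into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the stored value
                indices.append(col)   # the column it came from
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr


def csr_row(data, indices, indptr, i, n_cols):
    """Rebuild dense row i from the CSR arrays."""
    row = [0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        row[indices[k]] = data[k]
    return row


dense = [
    [0, 0, 3.5, 0],
    [1.2, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 4.0, 0, 2.1],
]
data, indices, indptr = to_csr(dense)
print(data)     # [3.5, 1.2, 4.0, 2.1]
print(indices)  # [2, 0, 1, 3]
print(indptr)   # [0, 1, 2, 2, 4]
```

Only 4 of the 16 values are stored here, and each row can still be recovered from the three arrays.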

#### Converting cuDF data to DMatrix format:

Our data is in a cuDF dataframe. We need to convert it to a DMatrix object for the GPU-optimized XGBoost model.

In [13]:
%%time

d_train = xgb.DMatrix(X_train_scaled, label=y_train)
d_validation = xgb.DMatrix(X_test_scaled, label=y_test)

CPU times: user 109 ms, sys: 37.6 ms, total: 146 ms
Wall time: 184 ms


We need to set the parameters for the XGBoost model now.

In [14]:
#  the parameters for the model
params = {
    # we are not setting any parameters here as we just want to run the model on the default values
}

# general params
general_params = {'silent': 1} # for verbosity
params.update(general_params)

# booster params
n_gpus = 1  
booster_params = {}

if n_gpus != 0:
    booster_params['tree_method'] = 'gpu_hist'
    booster_params['n_gpus'] = n_gpus   
params.update(booster_params)

# learning task params
learning_task_params = {}
learning_task_params['eval_metric'] = 'auc'
learning_task_params['objective'] = 'binary:logistic'
    
params.update(learning_task_params)
print(params)

{'silent': 1, 'tree_method': 'gpu_hist', 'n_gpus': 1, 'eval_metric': 'auc', 'objective': 'binary:logistic'}


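The code above assembles the parameter dictionary that controls the behavior of the XGBoost model. The model itself is created later, when the dictionary is passed to the training function.

The **params** dictionary is used to store the parameters for the model. The first set of parameters are general_params, which includes a single parameter intended to control the verbosity of the model output. Note that the training log later in this notebook warns that 'silent' might not be used: recent XGBoost releases use the **verbosity** parameter instead.

The **booster_params** dictionary contains parameters that are specific to the XGBoost booster, which is the algorithm that is used to build the model. The n_gpus variable controls whether the model will be trained on a GPU or CPU. If **n_gpus** is set to a value other than 0, then the tree_method parameter is set to 'gpu_hist' and the n_gpus parameter is set to the number of GPUs to use. If n_gpus is set to 0, then the model will be trained on the CPU. The same training log warning shows that the 'n_gpus' entry itself is not recognised by this XGBoost version, so it is the tree_method setting that actually selects GPU training.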

The **learning_task_params** dictionary contains parameters that are specific to the learning task, such as the evaluation metric and the objective function. In this case, the evaluation metric is set to 'auc', which is the area under the ROC curve, and the objective function is set to 'binary:logistic', which is used for binary classification problems.

Finally, the **params** dictionary is updated with the values from general_params, booster_params, and learning_task_params, and the resulting dictionary is printed to the console. This dictionary can then be used to create an instance of the XGBoost model and to train or make predictions with it.

**eval_list** is a list of tuples, where each tuple contains a DMatrix object to evaluate on and a string that names that dataset in the training log. The metric itself is set by the eval_metric parameter.

In XGBoost, **num_round** refers to the number of boosting rounds to perform during training. Each boosting round corresponds to adding a new tree to the ensemble model. The purpose of num_round is to control the complexity of the model and to prevent overfitting to the training data.

Setting num_round too low may result in underfitting, where the model is too simple to capture the patterns in the data. On the other hand, setting num_round too high may result in overfitting, where the model becomes too complex and fits the noise in the training data, resulting in poor generalization performance on new, unseen data.

Therefore, num_round should be chosen based on a trade-off between model complexity and generalization performance. Typically, a large value of num_round is chosen and then early stopping is used to prevent overfitting. Early stopping involves monitoring the performance of the model on a validation set after each boosting round, and stopping the training process when the performance on the validation set no longer improves.
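
The early stopping rule described above can be sketched in a few lines of plain Python. This is only an illustration of the logic, with made-up validation scores. In XGBoost itself the same behaviour is requested through the `early_stopping_rounds` argument of `xgb.train`. The function name and the `patience` parameter below are hypothetical.

```python
def best_round_with_early_stopping(scores, patience):
    """Return (best_round, stop_round) for a metric where higher is better.

    Training stops once `patience` rounds have passed without a new best score.
    """
    best_score = float("-inf")
    best_round = -1
    for round_idx, score in enumerate(scores):
        if score > best_score:
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # Never triggered: training ran for all available rounds
    return best_round, len(scores) - 1


# Hypothetical validation AUC per boosting round (illustrative values)
val_auc = [0.74, 0.76, 0.78, 0.79, 0.795, 0.794, 0.793, 0.795, 0.792, 0.791]

best, stopped = best_round_with_early_stopping(val_auc, patience=3)
print(f"best round: {best}, stopped at round: {stopped}")
```

With these values the best score is reached at round 4, and training stops at round 7 after three rounds without improvement.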

In [15]:
eval_list = [(d_validation, 'validation'), (d_train, 'train')]
num_round = 100

In [22]:
%%time

model = xgb.train(params, d_train, num_round, eval_list)

Parameters: { "n_gpus", "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[0]	validation-auc:0.74269	train-auc:0.74290
[1]	validation-auc:0.75468	train-auc:0.75489
[2]	validation-auc:0.76418	train-auc:0.76437
[3]	validation-auc:0.76987	train-auc:0.77006
[4]	validation-auc:0.77461	train-auc:0.77488
[5]	validation-auc:0.77925	train-auc:0.77950
[6]	validation-auc:0.78417	train-auc:0.78443
[7]	validation-auc:0.78715	train-auc:0.78743
[8]	validation-auc:0.78967	train-auc:0.78994
[9]	validation-auc:0.79127	train-auc:0.79161
[10]	validation-auc:0.79349	train-auc:0.79384
[11]	validation-auc:0.79480	train-auc:0.79515
[12]	validation-auc:0.79607	train-auc:0.79641
[13]	validation-auc:0.79752	train-auc:0.79787
[14]	validation-auc:0.79888	train-auc:0.799

In [24]:
model_pred = model.predict(d_validation)

By default, the predictions made by XGBoost are probabilities. To calculate the accuracy score, we can round them to the nearest 0 or 1, then convert them to binary class values.

In [25]:
model_predictions = [round(value) for value in model_pred]
model_predictions = np.array(model_predictions)
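
Rounding is equivalent to applying a decision threshold of 0.5 to the predicted probabilities. The sketch below makes the threshold explicit so that it can be changed. The function name and the probabilities are hypothetical and only illustrate the conversion.

```python
def to_class_labels(probabilities, threshold=0.5):
    """Map predicted probabilities to 0/1 labels using a decision threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


# Illustrative probabilities, not real model output
toy_probabilities = [0.08, 0.49, 0.51, 0.93, 0.30]

default_labels = to_class_labels(toy_probabilities)
lenient_labels = to_class_labels(toy_probabilities, threshold=0.25)

print(default_labels)  # [0, 0, 1, 1, 0]
print(lenient_labels)  # [0, 1, 1, 1, 1]
```

Lowering the threshold labels more events as positive, which trades precision for recall. The accuracy reported in this notebook corresponds to the default threshold of 0.5.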

In [26]:
print('Accuracy score: ', accuracy_score(y_test, model_predictions))

Accuracy score:  0.7417985200881958


We can observe that the accuracy of the Random Forest classifier and XGBoost is not much different. We may experiment with other hyperparameters to see if the accuracy improves.
The ROC AUC score, which evaluates the predicted probabilities across all thresholds rather than at a single cutoff, is often a more informative measure of model performance, and it is easy to compute from the XGBoost output.

In [27]:
print('ROC AUC score: ', roc_auc_score(y_test, model_pred))

ROC AUC score:  0.8233003616333008


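An AUC of about 0.82 means that the model ranks a randomly chosen positive example above a randomly chosen negative example roughly 82% of the time, which is a reasonable result for a model with default settings.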
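
The AUC can be read as the fraction of (positive, negative) pairs that the model ranks in the right order. The sketch below computes it directly from that definition in plain Python on a tiny made-up example. It compares every pair, so it is meant for illustration and would be too slow for the full test set.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative labels and scores (not from the notebook's dataset)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = roc_auc(toy_labels, toy_scores)
print("ROC AUC score:", toy_auc)  # 3 of 4 pairs ordered correctly -> 0.75
```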

##### Let's try changing the hyperparameters a bit to see if it makes any difference

1-  **silent**:1 : Intended to set XGBoost to silent mode, so that no messages are printed to the console during training. Recent XGBoost releases use the **verbosity** parameter instead, which is why the training log warns that this parameter might not be used.

2-  **tree_method**:**gpu_hist**: Specifies the tree construction method to use during training. In this case,   **gpu_hist** indicates that the GPU histogram algorithm should be used for improved training speed.

3-  **n_gpus**: -1: Intended to specify the number of GPUs to use during training, with -1 meaning all available GPUs. The training log warns that this parameter might not be used by the installed XGBoost version.

4-  **eval_metric**:**auc**: Specifies the evaluation metric to use during training. In this case,   **auc** indicates that the area under the receiver operating characteristic curve (AUC-ROC) should be used as the evaluation metric.

5-  **objective**:**binary:logistic**: Specifies the objective function to optimize during training. In this case,   **binary:logistic** indicates logistic regression for binary classification, so the model outputs probabilities.

6-  **max_depth**:15 : Specifies the maximum depth of each decision tree in the ensemble model. Increasing this value can increase the model complexity, but may also increase the risk of overfitting.

7-  **reg_lambda**:5 : Specifies the L2 regularization parameter for the leaf weights of the decision trees. Increasing this value can help prevent overfitting by penalizing large weights.

8-  **scale_pos_weight**:2 : Specifies the scaling factor for the positive class in binary classification problems. This can be useful for imbalanced datasets where one class has many more samples than the other.

9-  **gamma**:1: Specifies the minimum loss reduction required to make a further partition on a leaf node of the tree. Increasing this value can help prevent overfitting by reducing the number of splits in the tree.
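
For **scale_pos_weight**, the XGBoost documentation suggests the ratio of negative to positive samples as a typical starting value. The sketch below computes that ratio for a made-up label list. The helper name is hypothetical, and the labels are synthetic, so the result says nothing about the class balance of the Higgs data.

```python
def suggested_scale_pos_weight(labels):
    """Return count(negative) / count(positive) for a sequence of 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive samples in labels")
    return n_neg / n_pos


# Synthetic, deliberately imbalanced labels: 750 negatives and 250 positives
imbalanced_labels = [0] * 750 + [1] * 250

weight = suggested_scale_pos_weight(imbalanced_labels)
print("suggested scale_pos_weight:", weight)  # 750 / 250 -> 3.0
```

If the classes are roughly balanced the ratio is close to 1, in which case a value such as 2 pushes the model toward predicting the positive class more often.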



In [28]:
params_2 = {
    'silent': 1,
    'tree_method': 'gpu_hist',
    'n_gpus': -1,
    'eval_metric': 'auc', 
    'objective': 'binary:logistic',
    'max_depth': 15,
    'reg_lambda': 5,
    'scale_pos_weight': 2, 
    'gamma': 1
}

In [35]:
%%time

model_2 = xgb.train(params_2, d_train, num_round, eval_list)

Parameters: { "n_gpus", "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[0]	validation-auc:0.79511	train-auc:0.80975
[1]	validation-auc:0.80457	train-auc:0.82164
[2]	validation-auc:0.81000	train-auc:0.82940
[3]	validation-auc:0.81359	train-auc:0.83514
[4]	validation-auc:0.81645	train-auc:0.84004
[5]	validation-auc:0.81880	train-auc:0.84426
[6]	validation-auc:0.82090	train-auc:0.84860
[7]	validation-auc:0.82270	train-auc:0.85209
[8]	validation-auc:0.82432	train-auc:0.85544
[9]	validation-auc:0.82571	train-auc:0.85867
[10]	validation-auc:0.82715	train-auc:0.86210
[11]	validation-auc:0.82820	train-auc:0.86471
[12]	validation-auc:0.82919	train-auc:0.86798
[13]	validation-auc:0.83013	train-auc:0.86943
[14]	validation-auc:0.83080	train-auc:0.870

In [36]:
model_pred_2 = model_2.predict(d_validation)

In [37]:
model_predictions_2 = [round(value) for value in model_pred_2]
model_predictions_2 = np.array(model_predictions_2)

In [38]:
print('Accuracy score: ', accuracy_score(y_test, model_predictions_2))

Accuracy score:  0.7475994229316711


In [39]:
print('ROC AUC score: ', roc_auc_score(y_test, model_pred_2))

ROC AUC score:  0.8440521955490112


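The accuracy and AUC scores show that model performance has improved slightly (accuracy from 0.742 to 0.748, AUC from 0.823 to 0.844). The training AUC also pulls ahead of the validation AUC in the log above, which suggests that the deeper trees are starting to overfit. We might need to do further processing of the data to get more accurate results.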
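
One way to judge whether the tuned model is starting to overfit is to track the gap between training and validation AUC per boosting round. The sketch below does this for the rounds 0 to 13 of the second model's log shown above. The values are copied from that log, and the variable names are hypothetical.

```python
# AUC values for rounds 0-13, copied from the training log of the second model
validation_auc = [0.79511, 0.80457, 0.81000, 0.81359, 0.81645, 0.81880, 0.82090,
                  0.82270, 0.82432, 0.82571, 0.82715, 0.82820, 0.82919, 0.83013]
train_auc = [0.80975, 0.82164, 0.82940, 0.83514, 0.84004, 0.84426, 0.84860,
             0.85209, 0.85544, 0.85867, 0.86210, 0.86471, 0.86798, 0.86943]

# Gap between training and validation AUC for each round
gaps = [round(t - v, 5) for t, v in zip(train_auc, validation_auc)]

for round_idx in (0, len(gaps) - 1):
    print(f"round {round_idx}: train - validation = {gaps[round_idx]:.5f}")
```

The gap grows from about 0.015 in round 0 to about 0.039 in round 13. A widening gap of this kind is a common sign that additional rounds fit the training data more than they help on unseen data.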

### XGBoost with PCA components

Let's try the PCA components for XGBoost model training.

In [22]:
d_train_pca = xgb.DMatrix(components, label=y_train)
d_validation_pca = xgb.DMatrix(X_test_PCA, label=y_test)

In [23]:
# model training settings
eval_list_pca = [(d_validation_pca, 'validation'), (d_train_pca, 'train')]

In [28]:
num_round=100

In [29]:
%%time

best_pca = xgb.train(params, d_train_pca, num_round, eval_list_pca)

Parameters: { "n_gpus", "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[0]	validation-auc:0.59148	train-auc:0.59152
[1]	validation-auc:0.59339	train-auc:0.59339
[2]	validation-auc:0.59469	train-auc:0.59471
[3]	validation-auc:0.59666	train-auc:0.59664
[4]	validation-auc:0.59694	train-auc:0.59695
[5]	validation-auc:0.59748	train-auc:0.59748
[6]	validation-auc:0.59782	train-auc:0.59784
[7]	validation-auc:0.59794	train-auc:0.59798
[8]	validation-auc:0.59805	train-auc:0.59810
[9]	validation-auc:0.59833	train-auc:0.59840
[10]	validation-auc:0.59837	train-auc:0.59846
[11]	validation-auc:0.59847	train-auc:0.59859
[12]	validation-auc:0.59857	train-auc:0.59871
[13]	validation-auc:0.59863	train-auc:0.59881
[14]	validation-auc:0.59866	train-auc:0.598

In [31]:
pred_pca = best_pca.predict(d_validation_pca)

In [32]:
print('ROC AUC score: ', roc_auc_score(y_test, pred_pca))

ROC AUC score:  0.5986874103546143


XGBoost with PCA components performs about as poorly as the Random Forest model with PCA components (an AUC of about 0.60), so we need to consider more preprocessing or increase the number of PCA components.
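
A common way to decide how many PCA components to keep is to look at the cumulative explained variance and pick the smallest number that reaches a chosen target. The sketch below shows this with plain NumPy on synthetic data. The function name, the 0.95 target, and the data are assumptions made for illustration, and the result does not describe the Higgs features.

```python
import numpy as np


def components_for_variance(X, target=0.95):
    """Return the smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`, plus the cumulative ratios."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # First index where the cumulative ratio is >= target, as a 1-based count
    k = int(np.searchsorted(cumulative, target) + 1)
    return min(k, len(cumulative)), cumulative


rng = np.random.default_rng(42)
# Synthetic data: 10 features with decreasing spread (illustrative only)
X_toy = rng.normal(size=(500, 10)) * np.linspace(3.0, 0.5, 10)

k, cumulative = components_for_variance(X_toy, target=0.95)
print("components needed:", k)
print("cumulative explained variance:", np.round(cumulative, 3))
```

Explained variance measures how much of the spread in the features is kept, not how useful the components are for classification, so the chosen number of components should still be checked against validation accuracy or AUC.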