Model Training done by Ng Aik Keong Sebastian 2400871

Importing of Libraries

In [1]:
import numpy as np
import pandas as pd
#import category_encoders as ce

from sklearn.metrics import classification_report, log_loss, accuracy_score
from sklearn.model_selection import train_test_split, ShuffleSplit, GridSearchCV, cross_val_score, RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

from scipy.sparse import hstack, csr_matrix

from hyperopt import fmin, tpe, hp, Trials

from skopt import BayesSearchCV

from xgboost import XGBClassifier

#import pickle
from joblib import dump, load
import os

Reading the CSV data file (in the data folder)

In [2]:
df = pd.read_csv("../data/cleanedData.csv")

Separate the target (label) from the predictor columns

In [3]:
y = df.label

The vectoriser converts text strings into numerical features, as the model cannot take string values as input

In [4]:
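# Convert the text column into TF-IDF features (at most 10,000 terms)
tfidf_vectorizer = TfidfVectorizer(max_features=10000, max_df=0.95, min_df=2)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['fullContent'])

# Dense copy of the TF-IDF matrix; it is not used when building X
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Numeric feature columns, stored sparsely so they can be stacked with the TF-IDF matrix
sparse_features = csr_matrix(df[["urls", "totalLength", "generalConsumer", "govDomain", "eduDomain", "orgDomain", "netDomain", "otherDomain", "html", "punctuationCount"]].values)

# Combine both feature blocks column-wise into the final predictor matrix
X = hstack([sparse_features, tfidf_matrix])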
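
To make the vectoriser step more concrete, here is a minimal pure-Python sketch of how a TF-IDF weight can be computed. It assumes the smoothed IDF formula and L2 normalisation that scikit-learn uses by default, and it tokenises on whitespace only, which is a simplification of what `TfidfVectorizer` does. The function name `tfidf_sketch` and the sample documents are hypothetical and are not part of the notebook; the `max_features`, `max_df` and `min_df` filters are not modelled.

```python
import math
from collections import Counter


def tfidf_sketch(docs):
    """Return one {term: weight} dict per document, L2-normalised (illustrative only)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in doc_freq.items()}

    rows = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        weights = {term: count * idf[term] for term, count in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


# Hypothetical sample texts, not taken from cleanedData.csv
docs = ["win a free prize now", "meeting moved to friday", "free prize inside"]
rows = tfidf_sketch(docs)
print(rows[0])
```

In the first document, "win" appears in only one of the three texts while "free" appears in two, so "win" receives the larger weight.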

Split the full dataset into a training set (80%) and a testing set (20%)
<br>
An 80:20 ratio is a widely used convention for train/test splits

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

Creating an XGBoost Classifier Model
<br>
This will be my base model, using a typical baseline configuration
<br>
Training the Model

In [7]:
clf = XGBClassifier(max_depth=3, n_estimators=1000, learning_rate=0.01)
clf.fit(X_train, y_train)

Evaluating the model’s performance

In [8]:
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy: ", accuracy)

Model accuracy:  0.9713253556107474


In [9]:
y_pred2 = clf.predict(X_train)
accuracy2 = accuracy_score(y_train, y_pred2)
print("Model accuracy:", accuracy2)

Model accuracy: 0.9766877398961391
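
Both cells above print the same label, but the first score is measured on the test set and the second on the training set. As a small sketch using only the two values reported above, the gap between them can be computed directly; a small gap suggests the base model is not overfitting heavily. The variable names are hypothetical.

```python
# Scores copied from the two outputs above
test_accuracy = 0.9713253556107474   # accuracy on X_test
train_accuracy = 0.9766877398961391  # accuracy on X_train

# A large positive gap would point towards overfitting on the training data
gap = train_accuracy - test_accuracy
print(f"Train/test accuracy gap: {gap:.4f}")
```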


Now that the base model is complete, I will move on to modifications such as hyperparameter tuning, early stopping, feature engineering, cross-validation, and addressing potential overfitting

Tuning Hyperparameters
<br>
1. Change max_depth from 3 to 4 or 5; it cannot be increased too much, as that can cause overfitting
<br>
2. Change learning_rate from 0.01 to 0.02 or 0.05 while reducing n_estimators proportionally, for example to 500 or 200
<br>
3. Use subsample and colsample_bytree

ChatGPT:
I was unsure how to tune my hyperparameters efficiently, so I prompted GPT with the following:
<br>
How to tune hyperparameters more efficiently instead of trying each parameter one by one
<br>
I was given the following four methods:
<br>
1. Grid Search (Exhaustive Search)
2. Randomized Search (Random Search)
3. Bayesian Optimization with scikit-optimize (skopt)
4. Hyperopt (Another Bayesian Optimization Tool)

Attempting Grid Search for Hyperparameter Tuning

In [45]:
param_grid = {
    'max_depth': [3, 4, 5],
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

In [47]:
grid_search = GridSearchCV(estimator=XGBClassifier(), param_grid=param_grid, cv=3, scoring='accuracy', verbose=1)

grid_search.fit(X_train, y_train)
print("Best parameters found: ", grid_search.best_params_)
print("Best accuracy: ", grid_search.best_score_)

Fitting 3 folds for each of 108 candidates, totalling 324 fits


KeyboardInterrupt: 
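
The candidate and fit counts in the log above follow directly from the size of the grid, which helps explain why the exhaustive search ran long enough to be interrupted. A minimal sketch of the arithmetic, using only the standard library (the variable names `n_candidates` and `total_fits` are hypothetical):

```python
import math

# Same grid as defined in the notebook
param_grid = {
    'max_depth': [3, 4, 5],
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}
cv_folds = 3

# Grid search tries every combination, so the count is the product of the list lengths
n_candidates = math.prod(len(values) for values in param_grid.values())

# Each candidate is trained once per cross-validation fold
total_fits = n_candidates * cv_folds
print(n_candidates, total_fits)
```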

Attempting Bayesian Optimization with scikit-optimize (skopt) for Hyperparameter Tuning

In [32]:
param_space = {
    'max_depth': (3, 10),
    'n_estimators': (100, 1000),
    'learning_rate': (0.01, 0.3, 'log-uniform'),
    'subsample': (0.5, 1.0),
    'colsample_bytree': (0.5, 1.0),
    'reg_alpha': (0.0, 1.0),
    'reg_lambda': (0.0, 1.0)
}
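
The `'log-uniform'` prior on `learning_rate` means values are drawn uniformly on a logarithmic scale, so small learning rates are explored more often than under a plain uniform prior. The following is a rough standard-library sketch of that idea, not skopt's own sampler; the helper `sample_log_uniform` and the seed are assumptions for illustration.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)  # arbitrary seed, unrelated to the notebook's random_state
samples = [sample_log_uniform(0.01, 0.3, rng) for _ in range(1000)]

# Share of draws below 0.1; a uniform prior on [0.01, 0.3] would give roughly 0.31
below = sum(s < 0.1 for s in samples) / len(samples)
print(f"Share of samples below 0.1: {below:.2f}")
```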

In [37]:
bayes_search = BayesSearchCV(estimator=XGBClassifier(), search_spaces=param_space, n_iter=10, cv=3, random_state=42)

bayes_search.fit(X_train, y_train)
print("Best parameters found: ", bayes_search.best_params_)
print("Best accuracy: ", bayes_search.best_score_)

Best parameters found:  OrderedDict({'colsample_bytree': 0.7224162561505759, 'learning_rate': 0.22754356809600707, 'max_depth': 4, 'n_estimators': 490, 'reg_alpha': 0.1879551863673486, 'reg_lambda': 0.45366534380629897, 'subsample': 0.5777240270252717})
Best accuracy:  0.9909686246882138


Attempting Random Search
<br>
I chose random search because grid search is exhaustive and impractical for larger parameter spaces, whereas random search evaluates only a specified number of random combinations, which makes it faster and more efficient in this scenario

These are the values I chose to search through at random for the combination that results in the best-performing model

In [13]:
param_dist = {
    'max_depth': [3, 4, 5, 6],
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0]
}
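
To illustrate what random search does with this dictionary, here is a minimal sketch that enumerates every possible combination and then picks 10 of them at random without repetition. It uses only the standard library and is not how `RandomizedSearchCV` is implemented internally; the seed below will not reproduce the combinations scikit-learn selects.

```python
import itertools
import random

# Same candidate values as defined in the notebook
param_dist = {
    'max_depth': [3, 4, 5, 6],
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0]
}

# Every combination an exhaustive grid search would have to evaluate
names = list(param_dist)
all_combinations = [dict(zip(names, values))
                    for values in itertools.product(*param_dist.values())]

# Random search evaluates only a fixed number of them
rng = random.Random(42)  # arbitrary seed for this sketch
candidates = rng.sample(all_combinations, 10)

print(len(all_combinations), "possible combinations,", len(candidates), "sampled")
```

With 3-fold cross-validation, sampling 10 candidates means 30 fits instead of one fit per fold for every possible combination.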

1. estimator is the model that I want to tune
2. param_distributions is the dictionary defined above, which contains the hyperparameters and their candidate values
3. n_iter is the number of random parameter combinations to try
4. cv is the number of cross-validation folds
5. verbose is the verbosity level for logging progress
6. random_state is the random seed for reproducibility

In [14]:
random_search = RandomizedSearchCV(
    estimator=XGBClassifier(),
    param_distributions=param_dist,
    n_iter=10,
    cv=3,
    verbose=1,
    random_state=42
)

random_search.fit(X_train, y_train)
print("Best parameters found: ", random_search.best_params_)
print("Best accuracy: ", random_search.best_score_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best parameters found:  {'subsample': 0.9, 'n_estimators': 1000, 'max_depth': 4, 'learning_rate': 0.2, 'colsample_bytree': 0.8}
Best accuracy:  0.9915895315467363


Serializing and deserializing the Python machine learning model, as it takes 25 minutes to run and rerunning it each time would be inefficient
<br>
I opted for joblib rather than pickle because joblib is designed to handle objects containing large NumPy arrays more efficiently

In [6]:
# Define where the fitted search objects are saved, creating the folder if needed
folder_path = '../model/'
randomSearch_model_filename = folder_path + 'XGBoost_random_sebastian.joblib'
bayesSearch_model_filename = folder_path + 'XGBoost_bayes_sebastian.joblib'
os.makedirs(folder_path, exist_ok=True)

Serializing the model

In [41]:
dump(random_search, randomSearch_model_filename)
dump(bayes_search, bayesSearch_model_filename)

['../model/XGBoost_bayes_sebastian.joblib']

Deserializing the model

In [7]:
random_search = load(randomSearch_model_filename)
bayes_search = load(bayesSearch_model_filename)

Fitting and making predictions using the saved hyperparameters

In [16]:
print("Best Hyperparameters: " + str(random_search.best_params_))
random_search_best_params = random_search.best_params_
random_search_xgb_best_model = XGBClassifier(**random_search_best_params)
random_search_xgb_best_model.fit(X_train, y_train)
y_pred = random_search_xgb_best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the XGBoost model using Random Search: ", accuracy)

Best Hyperparameters: {'subsample': 0.9, 'n_estimators': 1000, 'max_depth': 4, 'learning_rate': 0.2, 'colsample_bytree': 0.8}
Accuracy of the new XGBoost model:  0.9922104312485889


In [18]:
print("Best Hyperparameters: " + str(bayes_search.best_params_))
bayes_search_best_params = bayes_search.best_params_
bayes_search_xgb_best_model = XGBClassifier(**bayes_search_best_params)
bayes_search_xgb_best_model.fit(X_train, y_train)
y_pred = bayes_search_xgb_best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the XGBoost model using Bayes Search: ", accuracy)

Best Hyperparameters: OrderedDict({'colsample_bytree': 0.7224162561505759, 'learning_rate': 0.22754356809600707, 'max_depth': 4, 'n_estimators': 490, 'reg_alpha': 0.1879551863673486, 'reg_lambda': 0.45366534380629897, 'subsample': 0.5777240270252717})
Accuracy of the XGBoost model using Bayes Search:  0.9916459697448634
