Model Training done by Ng Aik Keong Sebastian 2400871

Importing of Libraries

In [1]:
import numpy as np
import pandas as pd
import category_encoders as ce

from sklearn.metrics import classification_report, log_loss, accuracy_score
from sklearn.model_selection import train_test_split, ShuffleSplit, GridSearchCV, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer

from scipy.sparse import hstack, csr_matrix

from hyperopt import fmin, tpe, hp, Trials

from skopt import BayesSearchCV

from xgboost import XGBClassifier

Reading the CSV data file (in the data folder)

In [2]:
df = pd.read_csv("../data/cleanedData.csv")

Separate the target (label) from the predictor columns

In [3]:
y = df.label

The vectoriser converts the text strings into numerical TF-IDF values, as the model cannot take string values as input

In [4]:
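# Build TF-IDF features from the text: keep at most 10,000 terms, dropping terms
# that appear in more than 95% of documents or in fewer than 2 documents
tfidf_vectorizer = TfidfVectorizer(max_features=10000, max_df=0.95, min_df=2)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['fullContent'])

# Dense copy of the TF-IDF matrix for inspection; it is not used to build X below
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Numeric and flag columns, converted to a sparse matrix so they can be stacked
sparse_features = csr_matrix(df[["urls", "totalLength", "generalConsumer", "govDomain", "eduDomain", "orgDomain", "netDomain", "otherDomain", "html", "punctuationCount"]].values)

# Combine both feature sets column-wise; X stays sparse
X = hstack([sparse_features, tfidf_matrix])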

Split the full dataset into a training set (80%) and a testing set (20%)
<br>
An 80:20 split is a common convention in machine learning

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

Creating an XGBoost Classifier Model
<br>
Training the Model

In [6]:
clf = XGBClassifier(max_depth=3, n_estimators=1000, learning_rate=0.01)
clf.fit(X_train, y_train)

KeyboardInterrupt: 

Evaluating the model’s performance

In [7]:
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy: ", accuracy)

Model accuracy:  0.9714382479114925


In [8]:
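# Accuracy on the training set, for comparison with the test accuracy above;
# a large gap between the two would suggest overfitting
y_pred2 = clf.predict(X_train)
accuracy2 = accuracy_score(y_train, y_pred2)
print("Model accuracy:", accuracy2)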

Model accuracy: 0.9765184014450214


Now that the base model is complete, I will move on to modifications such as hyperparameter tuning, early stopping, feature engineering, cross-validation, and addressing potential overfitting

Tuning Hyperparameters
<br>
1. Change max_depth from 3 to 4 or 5; it should not be increased too much, as deeper trees can cause overfitting
<br>
2. Change the learning rate from 0.01 to 0.02 or 0.05 while reducing the number of estimators proportionally, for example to 500 or 200
<br>
3. Use subsample and colsample_bytree
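
The proportional trade-off in point 2 can be written as a small calculation. This is a minimal sketch, assuming the rule of thumb is to keep learning_rate × n_estimators constant at the baseline value (0.01 × 1000 = 10); the helper `scaled_estimators` is hypothetical and only illustrates the arithmetic.

```python
# Baseline from the notebook: learning_rate=0.01 with n_estimators=1000
BASE_LR = 0.01
BASE_ESTIMATORS = 1000


def scaled_estimators(learning_rate):
    """Return the n_estimators that keeps learning_rate * n_estimators constant."""
    budget = BASE_LR * BASE_ESTIMATORS  # 0.01 * 1000 = 10
    return round(budget / learning_rate)


for lr in (0.01, 0.02, 0.05):
    print(f"learning_rate={lr}: n_estimators={scaled_estimators(lr)}")
```

This reproduces the pairs mentioned above (0.02 with 500, 0.05 with 200). It is only a starting point, since the best combination still has to be confirmed on held-out data.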

ChatGPT:
I was unsure how to tune my hyperparameters efficiently, so I prompted GPT with the following:
<br>
How to tune hyperparameters more efficiently instead of trying each parameter one by one
<br>
The response suggested four methods:
<br>
1. Grid Search (Exhaustive Search)
2. Randomized Search (Random Search)
3. Bayesian Optimization with scikit-optimize (skopt)
4. Hyperopt (Another Bayesian Optimization Tool)
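
The difference between the first two methods can be sketched with scikit-learn's `ParameterGrid` and `ParameterSampler` helpers. The grid values and `n_iter=5` below are illustrative assumptions, not settings used in this notebook.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Illustrative grid; the values are assumptions chosen for the example
grid = {
    "max_depth": [3, 4, 5],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.8, 1.0],
}

# Grid search evaluates every combination
all_candidates = list(ParameterGrid(grid))

# Randomized search evaluates only a fixed number of randomly drawn combinations
sampled = list(ParameterSampler(grid, n_iter=5, random_state=0))

print(len(all_candidates), "grid candidates")
print(len(sampled), "randomly sampled candidates")
for params in sampled:
    print(params)
```

Randomized search trades completeness for a training budget that is fixed in advance, which matters when each fit is expensive.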

Attempting Grid Search for Hyperparameter Tuning

In [9]:
param_grid = {
    'max_depth': [3, 4, 5],
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

In [10]:
clf = XGBClassifier()

# Set up the grid search
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='accuracy', verbose=1)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters and score
print("Best parameters found: ", grid_search.best_params_)
print("Best accuracy: ", grid_search.best_score_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


KeyboardInterrupt: 
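
The fit count reported in the output above can be checked by hand. A minimal sketch, assuming the same grid and `cv=5` as in the cells above, counts the combinations using only the standard library:

```python
from itertools import product

param_grid = {
    "max_depth": [3, 4, 5],
    "n_estimators": [100, 500, 1000],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}
cv_folds = 5

# One candidate per combination of values: 3 * 3 * 3 * 2 * 2
n_candidates = len(list(product(*param_grid.values())))
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)  # 108 540
```

With up to 1000 trees per fit, 540 fits is a substantial amount of training, which may be why the run was interrupted.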

Attempting Bayesian Optimization with scikit-optimize (skopt) for Hyperparameter Tuning

In [20]:
param_space = {
    'max_depth': (3, 10),
    'n_estimators': (100, 1000),
    'learning_rate': (0.01, 0.3, 'log-uniform'),
    'subsample': (0.5, 1.0),
    'colsample_bytree': (0.5, 1.0),
    'reg_alpha': (0.0, 1.0),
    'reg_lambda': (0.0, 1.0)
}
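
The `'log-uniform'` prior on `learning_rate` means values are spread evenly on a logarithmic scale, so small learning rates are explored as thoroughly as large ones. The sketch below illustrates that idea with the standard library only. It is not skopt's internal code, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
samples = [sample_log_uniform(0.01, 0.3, rng) for _ in range(10_000)]

# Under a log-uniform prior, about half the draws fall below the geometric mean
geometric_mean = math.sqrt(0.01 * 0.3)  # about 0.055
share_below = sum(s < geometric_mean for s in samples) / len(samples)

print(f"min={min(samples):.4f}, max={max(samples):.4f}")
print(f"share below {geometric_mean:.3f}: {share_below:.2f}")
```

Under a plain uniform prior on the same range, only about 15% of draws would fall below 0.055, so the small learning rates would be explored far less often.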

In [None]:
# Initialize the XGBoost classifier
clf = XGBClassifier()

# Set up Bayesian optimization search
bayes_search = BayesSearchCV(estimator=clf, search_spaces=param_space, n_iter=32, cv=5, random_state=42)

# Fit the model
bayes_search.fit(X_train, y_train)

# Print the best parameters and score
print("Best parameters found: ", bayes_search.best_params_)
print("Best accuracy: ", bayes_search.best_score_)