## Training Model using XGBoost

This notebook explores the following techniques to improve the performance of the XGBoost model in detecting fraudulent job postings:

- **Varying `scale_pos_weight`:** Adjusts the balance between positive (fraudulent) and negative (non-fraudulent) weights to handle the class imbalance in the dataset.

- **Threshold Adjustment:** Helps to balance precision and recall by modifying the cutoff point at which a job posting is classified as fraudulent.

- **Stratified K-Fold Cross-Validation:** Ensures that each fold in cross-validation maintains the same proportion of fraudulent and non-fraudulent jobs as in the original dataset, providing a more reliable evaluation.

- **Hyperparameter Tuning:** Used to find the optimal set of model parameters (such as learning rate, tree depth, and number of estimators) to improve overall model performance.


In [20]:
import pandas as pd
import xgboost as xgb
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, auc
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from collections import Counter
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler

# Load and Preprocess the Data

In [21]:
df = pd.read_csv('../dataset/data_cleaned.csv')  # replace with your own path

In [22]:
df.head()

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,has_location,has_employment_type,has_required_experience,has_required_education,has_industry,has_function,...,city_ wilmington,city_ woodbridge,city_ woodruff,city_ worcester,city_ İstanbul,city_ Αthens,city_ Αθήνα,city_ ΕΛΛΗΝΙΚΟ,city_ 마포구 동교동,city_Unknown
0,0,1,0,0,1,1,1,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,1,0,0,1,1,1,0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,1,0,0,1,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,1,0,0,1,1,1,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,1,1,0,1,1,1,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
categorical_features = list(df.select_dtypes(include=['object']).columns)
numeric_features = list(df.select_dtypes(include=['int64', 'float64']).columns)
if 'fraudulent' in numeric_features:
    numeric_features.remove('fraudulent')

In [24]:
categorical_features

[]

In [25]:
def create_preprocessor(categorical_features, numeric_features):
    transformers = []

    transformers.append(
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
    )

    transformers.append(('scaler', StandardScaler(), numeric_features))
    
    return ColumnTransformer(transformers=transformers, remainder='drop')

preprocessor = create_preprocessor(categorical_features, numeric_features)

X = df[categorical_features + numeric_features]
y = df['fraudulent']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Base XGBoost Model

A baseline XGBoost model with default hyperparameters serves as a benchmark, so that the effect of each technique applied later can be measured against it.

Run time: 2.5 mins

In [26]:
# Build the XGBoost model
model = xgb.XGBClassifier(eval_metric='aucpr', random_state=42)

# Create the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred, digits=4))

# Print confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(cm)

# Calculate and print the ROC AUC score
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]  # Get probability for the positive class
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f'ROC AUC Score: {roc_auc:.4f}')

# Calculate and print the AUPRC (Area Under Precision-Recall Curve)
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
auprc = auc(recall, precision)
print(f'AUPRC: {auprc:.4f}')

Classification Report:
              precision    recall  f1-score   support

           0     0.9907    0.9988    0.9947      3403
           1     0.9724    0.8150    0.8868       173

    accuracy                         0.9899      3576
   macro avg     0.9815    0.9069    0.9408      3576
weighted avg     0.9898    0.9899    0.9895      3576

Confusion Matrix:
[[3399    4]
 [  32  141]]
ROC AUC Score: 0.9937
AUPRC: 0.9507


# Adjusting scale_pos_weight in XGBoost

scale_pos_weight controls the balance between positive and negative weights, which is useful for imbalanced datasets. Setting it above 1 makes the algorithm give more weight to the minority (fraudulent) class during training.
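As a rough illustration of where the candidate values come from, the sketch below computes the common negatives-to-positives heuristic and a small grid around it. The class counts are copied from the training-split distribution printed later in this notebook; the multipliers mirror the ones used in the next cell and are a starting point, not a recommendation.

```python
from collections import Counter

# Class counts copied from the training split reported later in this notebook
class_counts = Counter({0: 13611, 1: 693})

# Common starting heuristic: (number of negatives) / (number of positives)
scale_pos_weight_base = class_counts[0] / class_counts[1]

# Grid around the heuristic: weaker (<1) and stronger (>1) weighting
multipliers = [1 / 5, 1 / 3, 1, 3, 5]
candidates = [round(scale_pos_weight_base * m, 2) for m in multipliers]

print(f"base = {scale_pos_weight_base:.2f}")
print(candidates)
```

Values below the base weight the minority class less aggressively, which tends to trade some recall for precision.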

In [27]:
# Define scale_pos_weight values to test
counter = Counter(y_train)
scale_pos_weight_base = counter[0] / counter[1]

scale_pos_weights = [scale_pos_weight_base, scale_pos_weight_base * 3, scale_pos_weight_base * 5, scale_pos_weight_base / 3, scale_pos_weight_base / 5]

results = []

# Iterate over different scale_pos_weight values
for spw in scale_pos_weights:
    print(f"\nTesting scale_pos_weight = {spw:.2f}")
    
    # Build the XGBoost model with the current scale_pos_weight
    model = xgb.XGBClassifier(eval_metric='aucpr', random_state=42, scale_pos_weight=spw)
    
    # Create the pipeline
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    
    # Train the pipeline
    pipeline.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = pipeline.predict(X_test)
    
    # Print confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print('Confusion Matrix:')
    print(cm)
    
    # Calculate and print the ROC AUC score
    y_pred_proba = pipeline.predict_proba(X_test)[:, 1]  # Get probability for the positive class
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    print(f'ROC AUC Score: {roc_auc:.4f}')
    
    # Calculate and print the AUPRC (Area Under Precision-Recall Curve)
    precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
    auprc = auc(recall, precision)
    print(f'AUPRC: {auprc:.4f}')
    
    # Store the result
    results.append({
        'scale_pos_weight': spw,
        'roc_auc': roc_auc,
        'auprc': auprc
    })

# Create a DataFrame to display results
df_results = pd.DataFrame(results)

print("\nEvaluation Metrics for Different scale_pos_weight Values:")
print(df_results)

# Find the best scale_pos_weight based on AUPRC
best_spw = df_results.loc[df_results['auprc'].idxmax()]
print(f"\nBest scale_pos_weight based on AUPRC: {best_spw['scale_pos_weight']:.2f} with AUPRC: {best_spw['auprc']:.4f}")



Testing scale_pos_weight = 19.64
Confusion Matrix:
[[3383   20]
 [  19  154]]
ROC AUC Score: 0.9926
AUPRC: 0.9431

Testing scale_pos_weight = 58.92
Confusion Matrix:
[[3365   38]
 [  25  148]]
ROC AUC Score: 0.9898
AUPRC: 0.9283

Testing scale_pos_weight = 98.20
Confusion Matrix:
[[3359   44]
 [  19  154]]
ROC AUC Score: 0.9900
AUPRC: 0.9287

Testing scale_pos_weight = 6.55
Confusion Matrix:
[[3394    9]
 [  24  149]]
ROC AUC Score: 0.9939
AUPRC: 0.9526

Testing scale_pos_weight = 3.93
Confusion Matrix:
[[3397    6]
 [  29  144]]
ROC AUC Score: 0.9935
AUPRC: 0.9431

Evaluation Metrics for Different scale_pos_weight Values:
   scale_pos_weight   roc_auc     auprc
0         19.640693  0.992577  0.943083
1         58.922078  0.989803  0.928315
2         98.203463  0.989987  0.928705
3          6.546898  0.993882  0.952600
4          3.928139  0.993520  0.943123

Best scale_pos_weight based on AUPRC: 6.55 with AUPRC: 0.9526


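Compared with the baseline AUPRC of 0.9507, only scale_pos_weight = 6.55 improved AUPRC (to 0.9526); the other values scored lower. Every weight tested caught more fraudulent postings than the baseline (141 of 173) but also produced more false positives. We therefore select the weight with the highest AUPRC, which is scale_pos_weight = 6.55.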
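To make the precision/recall trade-off behind these numbers easier to see, the following sketch recomputes fraud-class precision, recall, and F1 from the confusion matrices printed above. The helper name `fraud_class_metrics` is hypothetical; the matrices are copied from the output and follow scikit-learn's `[[TN, FP], [FN, TP]]` layout.

```python
# Confusion matrices copied from the output above, keyed by scale_pos_weight.
# Layout follows scikit-learn: [[TN, FP], [FN, TP]].
confusion_by_spw = {
    3.93: [[3397, 6], [29, 144]],
    6.55: [[3394, 9], [24, 149]],
    19.64: [[3383, 20], [19, 154]],
    58.92: [[3365, 38], [25, 148]],
    98.20: [[3359, 44], [19, 154]],
}


def fraud_class_metrics(cm):
    """Return (precision, recall, f1) for the positive (fraudulent) class."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


metrics = {spw: fraud_class_metrics(cm) for spw, cm in confusion_by_spw.items()}
for spw, (p, r, f1) in metrics.items():
    print(f"spw={spw:>6.2f}  precision={p:.4f}  recall={r:.4f}  f1={f1:.4f}")
```

These are metrics at the default 0.5 threshold, whereas AUPRC summarizes all thresholds, so the two rankings need not agree.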

# Stacking Ensemble Model

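Our dataset has high dimensionality, which might lead to the curse of dimensionality. To reduce the risk of overfitting, the XGBoost model below adds L1 and L2 regularization and is combined with logistic regression and a random forest in a stacking ensemble.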

In [34]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, auc
import xgboost as xgb

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Preprocessing pipeline (categorical + numeric)
preprocessor = create_preprocessor(categorical_features, numeric_features)

### Step 1: XGBoost with Regularization ###
xgb_model = xgb.XGBClassifier(
    eval_metric='aucpr',
    random_state=42,
    scale_pos_weight=6.55,  # Adjust if necessary based on class imbalance
    reg_alpha=1,  # L1 regularization (try different values)
    reg_lambda=10  # L2 regularization (try different values)
)

### Step 2: Stacking Ensemble Model ###
# Base models for stacking
base_models = [
    ('lr', LogisticRegression(max_iter=500)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('xgb', xgb_model)
]

# StackingClassifier with Logistic Regression as the final estimator
stacked_model = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5
)

# Create the pipeline with preprocessing and stacking
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', stacked_model)
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred, digits=4))

# Print confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(cm)

# Calculate and print the ROC AUC score
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]  # Get probability for the positive class
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f'ROC AUC Score: {roc_auc:.4f}')

# Calculate and print the AUPRC (Area Under Precision-Recall Curve)
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
auprc = auc(recall, precision)
print(f'AUPRC: {auprc:.4f}')


Classification Report:
              precision    recall  f1-score   support

           0     0.9930    1.0000    0.9965      3403
           1     1.0000    0.8613    0.9255       173

    accuracy                         0.9933      3576
   macro avg     0.9965    0.9306    0.9610      3576
weighted avg     0.9933    0.9933    0.9931      3576

Confusion Matrix:
[[3403    0]
 [  24  149]]
ROC AUC Score: 0.9968
AUPRC: 0.9735


# Undersampling

In [35]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, auc
import xgboost as xgb
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Check the class distribution before undersampling
print(f'Original class distribution: {Counter(y_train)}')

# Undersample the majority class using RandomUnderSampler
under_sampler = RandomUnderSampler(sampling_strategy=0.33, random_state=42)  # Undersample to a 1:3 ratio
X_train_resampled, y_train_resampled = under_sampler.fit_resample(X_train, y_train)

# Check the class distribution after undersampling
print(f'Resampled class distribution: {Counter(y_train_resampled)}')

# Preprocessing pipeline (categorical + numeric)
preprocessor = create_preprocessor(categorical_features, numeric_features)

### Step 1: XGBoost with Regularization ###
xgb_model = xgb.XGBClassifier(
    eval_metric='aucpr',
    random_state=42,
    scale_pos_weight=6.55,  # Carried over from the earlier experiment; the resampled data is already about 3:1
    reg_alpha=1,  # L1 regularization (try different values)
    reg_lambda=10  # L2 regularization (try different values)
)

### Step 2: Stacking Ensemble Model ###
# Base models for stacking
base_models = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('xgb', xgb_model)
]

# StackingClassifier with Logistic Regression as the final estimator
stacked_model = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5
)

# Create the pipeline with preprocessing and stacking
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', stacked_model)
])

# Train the pipeline on the undersampled data
pipeline.fit(X_train_resampled, y_train_resampled)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred, digits=4))

# Print confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(cm)

# Calculate and print the ROC AUC score
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]  # Get probability for the positive class
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f'ROC AUC Score: {roc_auc:.4f}')

# Calculate and print the AUPRC (Area Under Precision-Recall Curve)
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
auprc = auc(recall, precision)
print(f'AUPRC: {auprc:.4f}')


Original class distribution: Counter({0: 13611, 1: 693})
Resampled class distribution: Counter({0: 2100, 1: 693})
Classification Report:
              precision    recall  f1-score   support

           0     0.9950    0.9880    0.9914      3403
           1     0.7919    0.9017    0.8432       173

    accuracy                         0.9838      3576
   macro avg     0.8934    0.9448    0.9173      3576
weighted avg     0.9851    0.9838    0.9843      3576

Confusion Matrix:
[[3362   41]
 [  17  156]]
ROC AUC Score: 0.9946
AUPRC: 0.9500


# SMOTENC

In [30]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, auc
import xgboost as xgb
from imblearn.over_sampling import SMOTENC
from collections import Counter

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Check the class distribution before SMOTENC
print(f'Original class distribution: {Counter(y_train)}')

# Specify categorical feature indices for SMOTENC (based on your dataset structure)
categorical_feature_indices = [X.columns.get_loc(col) for col in categorical_features]

# Apply SMOTENC for oversampling
smote_nc = SMOTENC(categorical_features=categorical_feature_indices, random_state=42, sampling_strategy=0.3)
X_train_resampled, y_train_resampled = smote_nc.fit_resample(X_train, y_train)

# Check the class distribution after SMOTENC
print(f'Resampled class distribution: {Counter(y_train_resampled)}')

# Preprocessing pipeline (categorical + numeric)
preprocessor = create_preprocessor(categorical_features, numeric_features)

### Step 1: XGBoost with Regularization ###
xgb_model = xgb.XGBClassifier(
    eval_metric='aucpr',
    random_state=42,
    scale_pos_weight=6.546898,  # Carried over from the earlier experiment
    reg_alpha=1,  # L1 regularization (try different values)
    reg_lambda=10  # L2 regularization (try different values)
)

### Step 2: Stacking Ensemble Model ###
# Base models for stacking
base_models = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('xgb', xgb_model)
]

# StackingClassifier with Logistic Regression as the final estimator
stacked_model = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5
)

# Create the pipeline with preprocessing and stacking
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', stacked_model)
])

# Train the pipeline on the resampled data
pipeline.fit(X_train_resampled, y_train_resampled)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred, digits=4))

# Print confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(cm)

# Calculate and print the ROC AUC score
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]  # Get probability for the positive class
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f'ROC AUC Score: {roc_auc:.4f}')

# Calculate and print the AUPRC (Area Under Precision-Recall Curve)
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
auprc = auc(recall, precision)
print(f'AUPRC: {auprc:.4f}')


Original class distribution: Counter({0: 13611, 1: 693})


ValueError: SMOTE-NC is not designed to work only with numerical features. It requires some categorical features.
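The error follows from `categorical_features` being an empty list in this dataset, so SMOTENC receives no categorical indices. One possible guard, sketched below, picks the oversampler by feature mix. The helper `choose_oversampler_name` is hypothetical and only returns a class name; it assumes the `SMOTE`, `SMOTENC`, and `SMOTEN` classes available in recent imbalanced-learn versions.

```python
def choose_oversampler_name(categorical_feature_indices, n_features):
    """Suggest an imbalanced-learn oversampler class name for a feature mix.

    Hypothetical helper: it only returns the class name and leaves the
    import and instantiation to the caller.
    """
    n_categorical = len(categorical_feature_indices)
    if n_categorical == 0:
        return "SMOTE"    # all-numeric data, as in this notebook
    if n_categorical == n_features:
        return "SMOTEN"   # all-categorical data
    return "SMOTENC"      # mixed numeric and categorical data


# This notebook's situation: no categorical columns were detected
print(choose_oversampler_name([], n_features=100))
```

With the returned name, the matching class could then be imported from `imblearn.over_sampling` and used in place of the `SMOTENC(...)` call.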

# Considering a range of scale_pos_weight

In imbalanced datasets like ours, experimenting with a range of scale_pos_weight values helps balance the importance of the minority class (fraudulent jobs) against the majority class (non-fraudulent jobs). This keeps the model from becoming biased towards the majority class and improves its ability to detect fraudulent jobs (i.e., higher recall).

# Selecting Classification Threshold

From a scale_pos_weight experiment, we observed that AUPRC peaks at approximately 0.9309 when scale_pos_weight is 23.57. (This differs from the run shown earlier in this notebook, where scale_pos_weight = 6.55 gave the highest AUPRC of 0.9526.)

Another possible action is to adjust the classification threshold. Instead of using the default probability threshold of 0.5, we can experiment with different cutoff points at which a job posting is classified as fraudulent, in order to balance precision and recall.

In [36]:
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]

thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
for threshold in thresholds:
    y_pred_adjusted = (y_pred_proba >= threshold).astype(int)
    report = classification_report(y_test, y_pred_adjusted, output_dict=True, digits=4)
    precision = report['1']['precision']
    recall = report['1']['recall']
    f1_score = report['1']['f1-score']
    print(f"Threshold: {threshold}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1-Score: {f1_score:.4f}")


Threshold: 0.3, Precision: 0.6653, Recall: 0.9422, F1-Score: 0.7799
Threshold: 0.4, Precision: 0.7285, Recall: 0.9306, F1-Score: 0.8173
Threshold: 0.5, Precision: 0.7919, Recall: 0.9017, F1-Score: 0.8432
Threshold: 0.6, Precision: 0.8432, Recall: 0.9017, F1-Score: 0.8715
Threshold: 0.7, Precision: 0.8988, Recall: 0.8728, F1-Score: 0.8856


From the threshold experiment above, we can see that as the threshold increases, precision increases, which means that fewer non-fraudulent jobs are incorrectly classified as fraudulent.

Recall, however, decreases (from 0.9422 at 0.3 to 0.8728 at 0.7), which means that more fraudulent jobs are missed.

As we need to find a balance between precision and recall, and assuming that the costs of false positives and false negatives are equal, we pick the threshold where the F1-score is highest. Among the thresholds tested, that is 0.7 (F1 = 0.8856). Because 0.7 is the largest value tested, F1 may continue to rise beyond it.

The area under the precision-recall curve is not used for this choice because it summarizes performance across all thresholds and therefore does not change with the threshold.
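A finer threshold sweep can be sketched with NumPy alone. The labels and scores below are made-up toy values (an assumption for illustration, not notebook data), and `f1_by_threshold` is a hypothetical helper. In practice it is usually safer to pick the threshold on a validation split rather than on the test set.

```python
import numpy as np


def f1_by_threshold(y_true, y_score, thresholds):
    """Return (threshold, precision, recall, f1) tuples for the positive class."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rows = []
    for t in thresholds:
        y_pred = (y_score >= t).astype(int)
        tp = int(np.sum((y_pred == 1) & (y_true == 1)))
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((float(t), precision, recall, f1))
    return rows


# Toy labels and predicted probabilities (illustrative only)
y_true_demo = [0, 0, 0, 0, 1, 0, 1, 1]
y_score_demo = [0.05, 0.15, 0.35, 0.45, 0.55, 0.65, 0.85, 0.95]

rows = f1_by_threshold(y_true_demo, y_score_demo, np.linspace(0.1, 0.9, 9))
best_threshold, best_p, best_r, best_f1 = max(rows, key=lambda row: row[3])
print(f"best threshold={best_threshold:.1f}, F1={best_f1:.4f}")
```

Extending the grid past 0.7 in the same way would show whether F1 on the real predictions has actually peaked.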

# Integrating our findings from the scale_pos_weight and threshold experiments

In [91]:
# Optimal scale_pos_weight
optimal_scale_pos_weight = 23.57

model = xgb.XGBClassifier(
    eval_metric='aucpr',
    random_state=42,
    scale_pos_weight=optimal_scale_pos_weight,
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

pipeline.fit(X_train, y_train)

y_pred_proba = pipeline.predict_proba(X_test)[:, 1]

# AUPRC does not depend on the threshold, so compute it once
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
auprc = auc(recall, precision)

# Compare the previous threshold (0.5) with the optimal threshold (0.7)
for threshold in [0.5, 0.7]:
    y_pred_at_threshold = (y_pred_proba >= threshold).astype(int)

    print(f"Classification Report at Threshold {threshold}:")
    print(classification_report(y_test, y_pred_at_threshold, digits=4))

    cm = confusion_matrix(y_test, y_pred_at_threshold)
    print(f"Confusion Matrix at Threshold {threshold}:")
    print(cm)

    print(f'AUPRC: {auprc:.4f}')

Classification Report at Threshold 0.5:
              precision    recall  f1-score   support

           0     0.9950    0.9885    0.9917      3403
           1     0.8000    0.9017    0.8478       173

    accuracy                         0.9843      3576
   macro avg     0.8975    0.9451    0.9198      3576
weighted avg     0.9855    0.9843    0.9848      3576

Confusion Matrix at Threshold 0.5:
[[3364   39]
 [  17  156]]
AUPRC: 0.9309
Classification Report at Threshold 0.7:
              precision    recall  f1-score   support

           0     0.9918    0.9953    0.9935      3403
           1     0.9006    0.8382    0.8683       173

    accuracy                         0.9877      3576
   macro avg     0.9462    0.9167    0.9309      3576
weighted avg     0.9874    0.9877    0.9875      3576

Confusion Matrix at Threshold 0.7:
[[3387   16]
 [  28  145]]
AUPRC: 0.9309


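At the optimal threshold of 0.7, the F1-score for the fraudulent class rises from 0.8478 to 0.8683: precision improves from 0.8000 to 0.9006, while recall drops from 0.9017 to 0.8382. AUPRC is unchanged at 0.9309 because it does not depend on the threshold.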

# StratifiedKFold

Using StratifiedKFold ensures that each fold in cross-validation maintains the same proportion of fraudulent and non-fraudulent jobs as in the entire dataset. This gives a more reliable evaluation on our highly imbalanced data, since no fold ends up with too few fraudulent examples to train or validate on.
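The effect can be checked on a small synthetic example. The sketch below uses made-up labels with 5% positives (an assumption for illustration, not the notebook's data) and compares the positive rate per test fold under plain `KFold` and `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels (illustrative): 5% positives, sorted so that an unshuffled
# plain KFold places every positive in the last fold
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the split


def positive_rate_per_fold(splitter):
    """Fraction of positives in each test fold produced by the splitter."""
    return [
        float(y_demo[test_idx].mean())
        for _, test_idx in splitter.split(X_demo, y_demo)
    ]


plain_rates = positive_rate_per_fold(KFold(n_splits=5))
stratified_rates = positive_rate_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print("KFold:          ", plain_rates)
print("StratifiedKFold:", stratified_rates)
```

Shuffling a plain `KFold` would reduce the skew but would not guarantee equal class proportions per fold.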

In [33]:
from sklearn.metrics import precision_recall_curve, auc
from sklearn.model_selection import StratifiedKFold
import numpy as np

n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

aucpr_scores = []

for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"\nFold {fold + 1}/{n_splits}")
    
    X_train_fold, X_test_fold = X.iloc[train_index], X.iloc[test_index]
    y_train_fold, y_test_fold = y.iloc[train_index], y.iloc[test_index]
    
    scale_pos_weight_fold = 6.546898
    print(f"Scale Pos Weight for Fold {fold + 1}: {scale_pos_weight_fold:.2f}")
    
    model_fold = xgb.XGBClassifier(
        eval_metric='aucpr',
        random_state=42,
        scale_pos_weight=scale_pos_weight_fold
    )
    
    pipeline_fold = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model_fold)
    ])
    
    pipeline_fold.fit(X_train_fold, y_train_fold)
    
    y_pred_proba_fold = pipeline_fold.predict_proba(X_test_fold)[:, 1]
    
    # Calculate Precision-Recall curve and AUPRC
    precision, recall, _ = precision_recall_curve(y_test_fold, y_pred_proba_fold)
    aucpr = auc(recall, precision)
    aucpr_scores.append(aucpr)
    
    print(f"AUPRC for Fold {fold + 1}: {aucpr:.4f}")

# Calculate and print average AUPRC across all folds
avg_aucpr = np.mean(aucpr_scores)
print(f"\nAverage AUPRC across all folds: {avg_aucpr:.4f}")



Fold 1/5
Scale Pos Weight for Fold 1: 6.55
AUPRC for Fold 1: 0.9414

Fold 2/5
Scale Pos Weight for Fold 2: 6.55
AUPRC for Fold 2: 0.9286

Fold 3/5
Scale Pos Weight for Fold 3: 6.55
AUPRC for Fold 3: 0.9340

Fold 4/5
Scale Pos Weight for Fold 4: 6.55
AUPRC for Fold 4: 0.9570

Fold 5/5
Scale Pos Weight for Fold 5: 6.55
AUPRC for Fold 5: 0.9143

Average AUPRC across all folds: 0.9350


# Using SMOTE for imbalanced dataset

We considered using SMOTE for our dataset. However, our data is high-dimensional because of the TF-IDF features. SMOTE may be less effective in such spaces because it relies on computing nearest neighbors, which can be unreliable in high dimensions.
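The nearest-neighbor concern can be illustrated with a small NumPy experiment. This sketch uses uniform random points (an assumption; real TF-IDF features are sparse and may behave differently) and measures how close the nearest neighbor's distance is to the farthest one as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(42)


def nearest_to_farthest_ratio(n_points, n_dims):
    """Ratio of the smallest to the largest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return float(distances.min() / distances.max())


ratios = {d: nearest_to_farthest_ratio(500, d) for d in (2, 20, 200, 2000)}
for n_dims, ratio in ratios.items():
    print(f"dims={n_dims:>4}  nearest/farthest distance ratio={ratio:.3f}")
```

A ratio approaching 1 means the "nearest" neighbors are barely closer than any other point, which weakens the interpolation step that SMOTE depends on.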

# Hyperparameter tuning

In this section, we try different hyperparameter combinations to test whether they can improve our AUPRC.

This section of code takes about 4 hours to run. Please take note before running it.

In [101]:
scale_pos_weight = 23.57
print(f"Global Scale Pos Weight: {scale_pos_weight:.2f}")

param_grid = {
    'model__max_depth': [3, 5, 7],
    'model__learning_rate': [0.01, 0.05, 0.1],
    'model__n_estimators': [100, 200, 300],
    'model__gamma': [0, 0.1, 0.3],
}

model = xgb.XGBClassifier(
    eval_metric='aucpr',  
    random_state=42,
    scale_pos_weight=scale_pos_weight
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Custom scoring function to calculate AUPRC
def custom_auprc(estimator, X, y_true):
    y_pred_proba = estimator.predict_proba(X)[:, 1]
    precision, recall, _ = precision_recall_curve(y_true, y_pred_proba)
    return auc(recall, precision)

# Use custom_auprc without make_scorer (direct function)
random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_grid,
    n_iter=50,
    scoring=custom_auprc,  # Custom AUPRC function without make_scorer
    cv=skf,
    n_jobs=1,  # Run without parallelization to avoid pickling issues
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)

print("Best Parameters:")
print(random_search.best_params_)
print(f"Best AUPRC Score from Cross-Validation: {random_search.best_score_:.4f}")

# Use the best pipeline after hyperparameter tuning
best_pipeline = random_search.best_estimator_

# Apply the optimal threshold for final predictions
optimal_threshold = 0.7
y_pred_proba = best_pipeline.predict_proba(X_test)[:, 1]
y_pred_optimal = (y_pred_proba >= optimal_threshold).astype(int)

# Print classification report and confusion matrix
print(f"\nClassification Report at Threshold {optimal_threshold}:")
print(classification_report(y_test, y_pred_optimal, digits=4))

cm = confusion_matrix(y_test, y_pred_optimal)
print(f"Confusion Matrix at Threshold {optimal_threshold}:")
print(cm)


Global Scale Pos Weight: 23.57
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Parameters:
{'model__n_estimators': 300, 'model__max_depth': 7, 'model__learning_rate': 0.1, 'model__gamma': 0.1}
Best AUPRC Score from Cross-Validation: 0.9117

Classification Report at Threshold 0.7:
              precision    recall  f1-score   support

           0     0.9909    0.9962    0.9936      3403
           1     0.9161    0.8208    0.8659       173

    accuracy                         0.9877      3576
   macro avg     0.9535    0.9085    0.9297      3576
weighted avg     0.9873    0.9877    0.9874      3576

Confusion Matrix at Threshold 0.7:
[[3390   13]
 [  31  142]]
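The fit count in the log above can be reproduced with a little arithmetic, which also shows how much of the grid the random search samples. The grid, `n_iter`, and `n_splits` values are copied from the cell above; the per-fit time is only a rough estimate derived from the stated 4-hour run time.

```python
from itertools import product

# Search space copied from the hyperparameter tuning cell
param_grid = {
    'model__max_depth': [3, 5, 7],
    'model__learning_rate': [0.01, 0.05, 0.1],
    'model__n_estimators': [100, 200, 300],
    'model__gamma': [0, 0.1, 0.3],
}
n_iter = 50    # candidates sampled by RandomizedSearchCV
n_splits = 5   # StratifiedKFold folds

n_candidates_total = len(list(product(*param_grid.values())))
n_fits = n_iter * n_splits
coverage = n_iter / n_candidates_total

# Rough estimate only: assumes the whole 4-hour run is spent on these fits
approx_seconds_per_fit = 4 * 3600 / n_fits

print(f"grid size: {n_candidates_total}, fits: {n_fits}, coverage: {coverage:.0%}")
print(f"approx. seconds per fit: {approx_seconds_per_fit:.1f}")
```

Since 50 of the 81 combinations are sampled, an exhaustive grid search would need roughly 1.6 times as many fits.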
