## Training Model using XGBoost

This notebook explores the following techniques to improve the performance of the XGBoost model in detecting fraudulent job postings:

- **Various `scale_pos_weight` values:** Adjusts the balance between positive (fraudulent) and negative (non-fraudulent) weights to handle the class imbalance in the dataset.

- **Threshold Adjustment:** Helps to balance precision and recall by modifying the cutoff point at which a job posting is classified as fraudulent.

- **Stratified K-Fold Cross-Validation:** Ensures that each fold in cross-validation maintains the same proportion of fraudulent and non-fraudulent jobs as in the original dataset, providing a more reliable evaluation.

- **Hyperparameter Tuning:** Used to find the optimal set of model parameters (such as learning rate, tree depth, and number of estimators) to improve overall model performance.


In [1]:
import pandas as pd
import xgboost as xgb
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, auc
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from collections import Counter
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.feature_selection import RFE

# Load and Preprocess the Data

In [2]:
df = pd.read_csv('../dataset/data_cleaned.csv')  # Replace with the path to your own copy of the dataset

In [3]:
df.head()

Unnamed: 0,title,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,...,required_education,industry,function,fraudulent,country,state,city,description_cleaned,requirements_cleaned,benefits_cleaned
0,Senior Sales Professionals,Sales/Marketing,Not Provided,Not Provided,Do YOU have the sales skills or entrepreneuria...,Not Provided,Not Provided,0,0,0,...,High School or equivalent,Unknown,Sales,1,US,IN,Indianapolis,sale skill entrepreneurial drive join usflexko...,not provided,not provided
1,Medical Surgical RN,Not Provided,Not Provided,Not Provided,Find more jobs at #URL_4708e598004bb0a85bf09f9...,Not Provided,Not Provided,0,0,0,...,Unknown,Hospital & Health Care,Unknown,1,US,CA,Unknown,find job url_ebbabffeeccaffdcfedebeeffeapply l...,not provided,not provided
2,Senior Mechanical Design Engineer,Not Provided,Not Provided,Aker Solutions is a global provider of product...,Corporate overviewAker Solutions is a global p...,Qualifications &amp; personal attributes :Degr...,We offer :• Friendly colleagues in an industry...,0,1,0,...,Unknown,Oil & Energy,Engineering,1,US,TX,Houston,corporate overviewaker solution global provide...,qualification amp personal attribute degree me...,offer friendly colleague industry bright futur...
3,franciscan st. francis health,Not Provided,Not Provided,Not Provided,Apply using below link#URL_ff6a6560a6c8ffc9abc...,Not Provided,Not Provided,0,0,0,...,Unknown,Hospital & Health Care,Unknown,1,US,IN,Indianapolis,apply using linkurl_ffaacffcabcecadddaddeddedf...,not provided,not provided
4,Director of Peri-Anesthesia,Not Provided,Not Provided,Not Provided,Apply using below link directly#URL_af5a535903...,Not Provided,Not Provided,0,0,0,...,Unknown,Hospital & Health Care,Unknown,1,US,MA,Unknown,apply using link directlyurl_afaaacceabcdfaeda...,not provided,not provided


In [4]:
text_features = [
    'description_cleaned', 'requirements_cleaned', 'benefits_cleaned', 'company_profile'
]

categorical_features = [
    'title', 'department', 'employment_type', 'required_experience',
    'required_education', 'industry', 'function', 'country', 'state', 'city'
]

numeric_features = ['telecommuting', 'has_company_logo', 'has_questions']

y = df['fraudulent']

df[text_features] = df[text_features].fillna('')

# df[categorical_features] = df[categorical_features].fillna('Unknown')

# Ensure numerical features have no missing values
df[numeric_features] = df[numeric_features].fillna(0)

In [6]:
def create_preprocessor(text_features, categorical_features, numeric_features):
    transformers = []

    # Text features: Apply TfidfVectorizer to each text feature separately
    for feature in text_features:
        tfidf = TfidfVectorizer(stop_words='english', max_features=1000)
        transformers.append((f'tfidf_{feature}', tfidf, feature))

    # Categorical features: Apply OneHotEncoder (returning dense output to avoid issues with XGBoost)
    transformers.append(
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
    )

    # Numerical features: Pass through without changes
    transformers.append(('passthrough', 'passthrough', numeric_features))

    return ColumnTransformer(transformers=transformers, remainder='drop')

preprocessor = create_preprocessor(text_features, categorical_features, numeric_features)

In [8]:
X = df[text_features + categorical_features + numeric_features]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Base XGBoost Model

A base XGBoost model with default hyperparameters serves as a benchmark for judging whether the techniques applied later actually improve on it.

In [5]:
model = xgb.XGBClassifier(eval_metric='aucpr', random_state=42)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

print('Classification Report:')
print(classification_report(y_test, y_pred, digits=4))

cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(cm)

y_pred_proba = pipeline.predict_proba(X_test)[:, 1] 
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
auprc = auc(recall, precision)
print(f'AUPRC: {auprc:.4f}') 

Classification Report:
              precision    recall  f1-score   support

           0     0.9884    0.9974    0.9928      3403
           1     0.9366    0.7688    0.8444       173

    accuracy                         0.9863      3576
   macro avg     0.9625    0.8831    0.9186      3576
weighted avg     0.9858    0.9863    0.9857      3576

Confusion Matrix:
[[3394    9]
 [  40  133]]
AUPRC: 0.9231


# Adjusting scale_pos_weight in XGBoost

scale_pos_weight in XGBoost controls the balance of positive and negative weights, which is useful for imbalanced datasets. Setting it above 1 makes the algorithm give more weight to the minority (fraudulent) class during training.
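To make the effect of the weight concrete, here is a minimal conceptual sketch of how a positive-class weight scales the gradient and Hessian of the logistic loss for a single example. The function name `weighted_logloss_grad_hess` is hypothetical and the code is an illustration of the idea, not XGBoost's actual implementation.

```python
import math

def weighted_logloss_grad_hess(y_true, raw_score, scale_pos_weight=1.0):
    """Gradient and Hessian of the weighted logistic loss for one example.

    Positive examples (y_true == 1) have both terms multiplied by
    scale_pos_weight; negative examples are left unchanged.
    """
    p = 1.0 / (1.0 + math.exp(-raw_score))  # sigmoid of the raw margin
    weight = scale_pos_weight if y_true == 1 else 1.0
    grad = weight * (p - y_true)
    hess = weight * p * (1.0 - p)
    return grad, hess

# A fraudulent posting that the model currently scores at p = 0.5
g_plain, h_plain = weighted_logloss_grad_hess(1, 0.0, scale_pos_weight=1.0)
g_weighted, h_weighted = weighted_logloss_grad_hess(1, 0.0, scale_pos_weight=19.64)

# A non-fraudulent posting with the same score is unaffected by the weight
g_neg, h_neg = weighted_logloss_grad_hess(0, 0.0, scale_pos_weight=19.64)

print(f"Positive, weight 1:     grad={g_plain:.2f}")
print(f"Positive, weight 19.64: grad={g_weighted:.2f}")
print(f"Negative, weight 19.64: grad={g_neg:.2f}")
```

Under this view, each misclassified fraudulent posting pulls the next tree about as hard as roughly twenty non-fraudulent ones, which is why recall on the minority class tends to rise.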

In [73]:
from collections import Counter

counter = Counter(y_train)
scale_pos_weight = counter[0] / counter[1]

print(f'Scale Pos Weight: {scale_pos_weight:.2f}')

model = xgb.XGBClassifier(
    eval_metric='aucpr',
    random_state=42,
    scale_pos_weight=scale_pos_weight
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

print('Classification Report:')
print(classification_report(y_test, y_pred, digits=4))

cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(cm)

y_pred_proba = pipeline.predict_proba(X_test)[:, 1] 
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
auprc = auc(recall, precision)
print(f'AUPRC: {auprc:.4f}')  

Scale Pos Weight: 19.64
Classification Report:
              precision    recall  f1-score   support

           0     0.9944    0.9903    0.9923      3403
           1     0.8235    0.8902    0.8556       173

    accuracy                         0.9855      3576
   macro avg     0.9090    0.9402    0.9239      3576
weighted avg     0.9861    0.9855    0.9857      3576

Confusion Matrix:
[[3370   33]
 [  19  154]]
AUPRC: 0.9302


Results: 

Recall for the fraudulent class rises from 0.7688 to 0.8902, so the model now detects a higher percentage of the actual fraudulent jobs. Precision for the fraudulent class falls from 0.9366 to 0.8235, so the model incorrectly labels more non-fraudulent jobs as fraudulent.

In business terms, we would have to spend more resources vetting flagged postings because of the additional false positives, but we would also detect more fraudulent job postings, which might lead to higher trust in our business. This trade-off could be quantified in dollars: the risk of a fraudulent job versus the additional resources needed to vet the false positives.
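As a rough sketch of how that trade-off could be quantified, the snippet below compares the two confusion matrices reported above under per-error costs. The cost figures (`cost_fp`, `cost_fn`) and the helper `total_cost` are hypothetical placeholders for illustration; real values would have to come from the business.

```python
# Confusion matrices copied from the two runs above: [[TN, FP], [FN, TP]]
baseline = [[3394, 9], [40, 133]]    # default scale_pos_weight
weighted = [[3370, 33], [19, 154]]   # scale_pos_weight = 19.64

def total_cost(cm, cost_per_false_positive, cost_per_missed_fraud):
    """Cost of one confusion matrix under hypothetical per-error costs."""
    false_positives = cm[0][1]
    false_negatives = cm[1][0]
    return (false_positives * cost_per_false_positive
            + false_negatives * cost_per_missed_fraud)

extra_reviews = weighted[0][1] - baseline[0][1]   # additional false positives
frauds_caught = baseline[1][0] - weighted[1][0]   # fewer false negatives

# The weighted model is cheaper whenever one missed fraud costs more than
# this many manual reviews.
break_even_ratio = extra_reviews / frauds_caught

# Hypothetical costs, for illustration only
cost_fp, cost_fn = 10.0, 200.0
print(f"Extra reviews: {extra_reviews}, extra frauds caught: {frauds_caught}")
print(f"Break-even cost ratio (missed fraud / review): {break_even_ratio:.2f}")
print(f"Baseline cost: {total_cost(baseline, cost_fp, cost_fn):.0f}")
print(f"Weighted cost: {total_cost(weighted, cost_fp, cost_fn):.0f}")
```

With these matrices the weighted model trades 24 extra reviews for 21 additional detections, so it pays off as soon as a missed fraud is judged more than slightly costlier than one manual review.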

# Considering a range of scale_pos_weight

In imbalanced datasets like ours, experimenting with a range of scale_pos_weight values helps us balance the importance of the minority (fraudulent jobs) and majority (non-fraudulent jobs) classes. The goal is a model that is not biased towards the majority class and is better at detecting fraudulent jobs (i.e., has higher recall).

In [82]:
counter = Counter(y_train)
scale_pos_weight_base = counter[0] / counter[1]

scale_pos_weights = [1, scale_pos_weight_base, scale_pos_weight_base * 1.2, 
                     scale_pos_weight_base * 1.5, scale_pos_weight_base * 2, scale_pos_weight_base * 3]

results = []

for spw in scale_pos_weights:
    print(f"\nTesting scale_pos_weight = {spw:.2f}")
    
    model = xgb.XGBClassifier(
        eval_metric='aucpr',
        random_state=42,
        scale_pos_weight=spw
    )
    
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    
    pipeline.fit(X_train, y_train)
    
    y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
    
    precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
    aucpr = auc(recall, precision)
    
    results.append({
        'scale_pos_weight': spw,
        'aucpr': aucpr
    })
    
    print(f"AUPRC: {aucpr:.4f}")

df_results = pd.DataFrame(results)

print("\nEvaluation Metrics (AUPRC) for Different scale_pos_weight Values:")
print(df_results)

best_spw = df_results.loc[df_results['aucpr'].idxmax()]
print(f"\nBest scale_pos_weight based on AUPRC: {best_spw['scale_pos_weight']:.2f} with AUPRC: {best_spw['aucpr']:.4f}")


Testing scale_pos_weight = 1.00
AUPRC: 0.9231

Testing scale_pos_weight = 19.64
AUPRC: 0.9302

Testing scale_pos_weight = 23.57
AUPRC: 0.9309

Testing scale_pos_weight = 29.46
AUPRC: 0.9213

Testing scale_pos_weight = 39.28
AUPRC: 0.9196

Testing scale_pos_weight = 58.92
AUPRC: 0.9090

Evaluation Metrics (AUPRC) for Different scale_pos_weight Values:
   scale_pos_weight     aucpr
0          1.000000  0.923075
1         19.640693  0.930178
2         23.568831  0.930896
3         29.461039  0.921295
4         39.281385  0.919640
5         58.922078  0.909044

Best scale_pos_weight based on AUPRC: 23.57 with AUPRC: 0.9309


# Selecting Classification Threshold

From the experiment above, we observed that AUPRC peaks at approximately 0.9309 when scale_pos_weight is 23.57.

Another possible action is to adjust the classification threshold. Instead of using the default probability threshold of 0.5, we can experiment with different thresholds. This changes the cutoff point at which a job posting is classified as fraudulent and so shifts the balance between precision and recall.

In [85]:
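# Note: `pipeline` is whichever pipeline was fitted most recently. In a
# top-to-bottom run, that is the last model from the loop above
# (scale_pos_weight = 3 x base, about 58.92), not the best-scoring one.
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]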

thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
for threshold in thresholds:
    y_pred_adjusted = (y_pred_proba >= threshold).astype(int)
    report = classification_report(y_test, y_pred_adjusted, output_dict=True, digits=4)
    precision = report['1']['precision']
    recall = report['1']['recall']
    f1_score = report['1']['f1-score']
    print(f"Threshold: {threshold}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1-Score: {f1_score:.4f}")


Threshold: 0.3, Precision: 0.5993, Recall: 0.9249, F1-Score: 0.7273
Threshold: 0.4, Precision: 0.6569, Recall: 0.9075, F1-Score: 0.7621
Threshold: 0.5, Precision: 0.6909, Recall: 0.8786, F1-Score: 0.7735
Threshold: 0.6, Precision: 0.7449, Recall: 0.8439, F1-Score: 0.7913
Threshold: 0.7, Precision: 0.8161, Recall: 0.8208, F1-Score: 0.8184


From the threshold experiment above, we can see that as the threshold increases, precision increases, which means that fewer non-fraudulent jobs are incorrectly classified as fraudulent.

Recall, however, decreases, which means that more fraudulent jobs are missed.

As we need to find a balance between precision and recall, and assuming that the risk and reward are equal, we pick the threshold with the highest F1-score. Among the five thresholds tested, that is 0.7. Since 0.7 is also the largest value tried, higher thresholds might score better still.

The area under the precision-recall curve is not used here because it summarizes performance across all thresholds and so does not change with the chosen cutoff.
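A finer search than a fixed grid of five values is possible by trying every distinct predicted probability as a candidate threshold. The sketch below does this in plain Python on made-up toy data; `f1_at_threshold` and `best_f1_threshold` are hypothetical helper names, and in practice the scores would ideally come from a validation split rather than the test set.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1-score of the positive class when predicting 1 for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def best_f1_threshold(y_true, scores):
    """Try every distinct score as a threshold and return (threshold, f1)."""
    candidates = sorted(set(scores))
    return max(((t, f1_at_threshold(y_true, scores, t)) for t in candidates),
               key=lambda pair: pair[1])

# Toy labels and predicted probabilities (made up for illustration)
y_toy = [0, 0, 0, 0, 1, 0, 1, 1]
p_toy = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.80, 0.90]

threshold, f1 = best_f1_threshold(y_toy, p_toy)
print(f"Best threshold: {threshold}, F1: {f1:.4f}")
```

On the toy data the best cutoff is 0.40, where all three positives are caught at the price of one false positive.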

# Integrating the findings from the scale_pos_weight and threshold experiments

In [91]:
# Optimal scale_pos_weight
optimal_scale_pos_weight = 23.57

model = xgb.XGBClassifier(
    eval_metric='aucpr',
    random_state=42,
    scale_pos_weight=optimal_scale_pos_weight,
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

pipeline.fit(X_train, y_train)

y_pred_proba = pipeline.predict_proba(X_test)[:, 1]

# Previous threshold of 0.5
optimal_threshold = 0.5
y_pred_optimal = (y_pred_proba >= optimal_threshold).astype(int)

print(f"Classification Report at Threshold {optimal_threshold}:")
print(classification_report(y_test, y_pred_optimal, digits=4))

cm = confusion_matrix(y_test, y_pred_optimal)
print(f"Confusion Matrix at Threshold {optimal_threshold}:")
print(cm)

y_pred_proba = pipeline.predict_proba(X_test)[:, 1] 
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
auprc = auc(recall, precision)
print(f'AUPRC: {auprc:.4f}')  

# Optimal threshold of 0.7
optimal_threshold = 0.7
y_pred_optimal = (y_pred_proba >= optimal_threshold).astype(int)

print(f"Classification Report at Threshold {optimal_threshold}:")
print(classification_report(y_test, y_pred_optimal, digits=4))

cm = confusion_matrix(y_test, y_pred_optimal)
print(f"Confusion Matrix at Threshold {optimal_threshold}:")
print(cm)

y_pred_proba = pipeline.predict_proba(X_test)[:, 1] 
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
auprc = auc(recall, precision)
print(f'AUPRC: {auprc:.4f}')  

Classification Report at Threshold 0.5:
              precision    recall  f1-score   support

           0     0.9950    0.9885    0.9917      3403
           1     0.8000    0.9017    0.8478       173

    accuracy                         0.9843      3576
   macro avg     0.8975    0.9451    0.9198      3576
weighted avg     0.9855    0.9843    0.9848      3576

Confusion Matrix at Threshold 0.5:
[[3364   39]
 [  17  156]]
AUPRC: 0.9309
Classification Report at Threshold 0.7:
              precision    recall  f1-score   support

           0     0.9918    0.9953    0.9935      3403
           1     0.9006    0.8382    0.8683       173

    accuracy                         0.9877      3576
   macro avg     0.9462    0.9167    0.9309      3576
weighted avg     0.9874    0.9877    0.9875      3576

Confusion Matrix at Threshold 0.7:
[[3387   16]
 [  28  145]]
AUPRC: 0.9309


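Raising the threshold from 0.5 to 0.7 improves the F1-score for the fraudulent class from 0.8478 to 0.8683: precision rises from 0.8000 to 0.9006 while recall drops from 0.9017 to 0.8382. AUPRC stays at 0.9309 because it does not depend on the threshold.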
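The reported precision, recall and F1 values can be re-derived directly from the two confusion matrices above, which makes the trade-off easy to check by hand. This is a small verification sketch; `positive_class_metrics` is a hypothetical helper name.

```python
def positive_class_metrics(cm):
    """Precision, recall and F1 for class 1 from [[TN, FP], [FN, TP]]."""
    fp, fn, tp = cm[0][1], cm[1][0], cm[1][1]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrices reported above for scale_pos_weight = 23.57
cm_at_05 = [[3364, 39], [17, 156]]
cm_at_07 = [[3387, 16], [28, 145]]

for name, cm in [("0.5", cm_at_05), ("0.7", cm_at_07)]:
    precision, recall, f1 = positive_class_metrics(cm)
    print(f"Threshold {name}: precision={precision:.4f}, "
          f"recall={recall:.4f}, f1={f1:.4f}")
```

Moving from 0.5 to 0.7 removes 23 false positives (39 to 16) at the cost of 11 additional missed frauds (17 to 28), which is what lifts F1.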

# StratifiedKFold

Using StratifiedKFold ensures that each fold in cross-validation maintains the same proportion of fraudulent and non-fraudulent jobs as in the entire dataset. This provides a more reliable evaluation for our highly imbalanced data and prevents individual folds from being skewed even further towards the majority class during training and validation.
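The idea behind stratification can be illustrated without scikit-learn. The sketch below deals the indices of each class round-robin across folds so that every fold keeps the same class ratio. It is a simplified stand-in for StratifiedKFold (no shuffling), and `stratified_folds` and the toy labels are hypothetical.

```python
from collections import Counter

def stratified_folds(labels, n_splits):
    """Assign each index to a fold so class proportions stay roughly equal.

    Indices of each class are dealt round-robin across the folds.
    """
    folds = [[] for _ in range(n_splits)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

# Toy labels with about 5% positives (5 of 100), a similar imbalance
# to the test-set support shown earlier (173 of 3576)
toy_labels = [1 if i % 20 == 0 else 0 for i in range(100)]

for fold_number, fold in enumerate(stratified_folds(toy_labels, 5), start=1):
    counts = Counter(toy_labels[i] for i in fold)
    print(f"Fold {fold_number}: {counts[0]} negatives, {counts[1]} positives")
```

With a plain random split, a fold could by chance receive no positives at all, which would make AUPRC for that fold meaningless.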

In [92]:
from sklearn.metrics import precision_recall_curve, auc
from sklearn.model_selection import StratifiedKFold
import numpy as np

n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

aucpr_scores = []

for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"\nFold {fold + 1}/{n_splits}")
    
    X_train_fold, X_test_fold = X.iloc[train_index], X.iloc[test_index]
    y_train_fold, y_test_fold = y.iloc[train_index], y.iloc[test_index]
    
    scale_pos_weight_fold = 23.57
    print(f"Scale Pos Weight for Fold {fold + 1}: {scale_pos_weight_fold:.2f}")
    
    model_fold = xgb.XGBClassifier(
        eval_metric='aucpr',
        random_state=42,
        scale_pos_weight=scale_pos_weight_fold
    )
    
    pipeline_fold = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model_fold)
    ])
    
    pipeline_fold.fit(X_train_fold, y_train_fold)
    
    y_pred_proba_fold = pipeline_fold.predict_proba(X_test_fold)[:, 1]
    
    # Calculate Precision-Recall curve and AUPRC
    precision, recall, _ = precision_recall_curve(y_test_fold, y_pred_proba_fold)
    aucpr = auc(recall, precision)
    aucpr_scores.append(aucpr)
    
    print(f"AUPRC for Fold {fold + 1}: {aucpr:.4f}")

# Calculate and print average AUPRC across all folds
avg_aucpr = np.mean(aucpr_scores)
print(f"\nAverage AUPRC across all folds: {avg_aucpr:.4f}")



Fold 1/5
Scale Pos Weight for Fold 1: 23.57
AUPRC for Fold 1: 0.9313

Fold 2/5
Scale Pos Weight for Fold 2: 23.57
AUPRC for Fold 2: 0.9079

Fold 3/5
Scale Pos Weight for Fold 3: 23.57
AUPRC for Fold 3: 0.9375

Fold 4/5
Scale Pos Weight for Fold 4: 23.57
AUPRC for Fold 4: 0.9329

Fold 5/5
Scale Pos Weight for Fold 5: 23.57
AUPRC for Fold 5: 0.9172

Average AUPRC across all folds: 0.9254


# Using SMOTE for imbalanced dataset

We considered using SMOTE for our dataset. However, our data is high-dimensional because of the TF-IDF features. SMOTE may not be as effective in such spaces because it relies on computing nearest neighbors, which can be unreliable in high dimensions.
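The concern about nearest neighbors can be illustrated with a small experiment: as the number of dimensions grows, the nearest and farthest points from a query become almost equally distant. The sketch below uses uniformly random points, which is an assumption for illustration only (real TF-IDF vectors are sparse and behave differently in detail), and `nearest_to_farthest_ratio` is a hypothetical helper.

```python
import math
import random

def nearest_to_farthest_ratio(n_points, n_dims, seed=0):
    """Ratio of nearest to farthest Euclidean distance from one query point.

    Points are drawn uniformly from the unit hypercube. A ratio close to 1
    means the "nearest" neighbour is barely closer than the farthest one.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return min(distances) / max(distances)

for dims in (2, 20, 200, 2000):
    ratio = nearest_to_farthest_ratio(n_points=200, n_dims=dims)
    print(f"{dims:>5} dims: nearest/farthest = {ratio:.3f}")
```

The preprocessor here can produce up to 1000 TF-IDF columns for each of the four text features, plus the one-hot columns, so neighbors chosen for interpolation may not be meaningfully "close".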

# Hyperparameter tuning

In this section, we try different hyperparameter values to test whether they can improve our AUPRC.

This section of code takes 4 hours to run. Please take note before running it.

In [101]:
scale_pos_weight = 23.57
print(f"Global Scale Pos Weight: {scale_pos_weight:.2f}")

param_grid = {
    'model__max_depth': [3, 5, 7],
    'model__learning_rate': [0.01, 0.05, 0.1],
    'model__n_estimators': [100, 200, 300],
    'model__gamma': [0, 0.1, 0.3],
}

model = xgb.XGBClassifier(
    eval_metric='aucpr',  
    random_state=42,
    scale_pos_weight=scale_pos_weight
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Custom scoring function to calculate AUPRC
def custom_auprc(estimator, X, y_true):
    y_pred_proba = estimator.predict_proba(X)[:, 1]
    precision, recall, _ = precision_recall_curve(y_true, y_pred_proba)
    return auc(recall, precision)

# Use custom_auprc without make_scorer (direct function)
random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_grid,
    n_iter=50,
    scoring=custom_auprc,  # Custom AUPRC function without make_scorer
    cv=skf,
    n_jobs=1,  # Run without parallelization to avoid pickling issues
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)

print("Best Parameters:")
print(random_search.best_params_)
print(f"Best AUPRC Score from Cross-Validation: {random_search.best_score_:.4f}")

# Use the best pipeline after hyperparameter tuning
best_pipeline = random_search.best_estimator_

# Apply the optimal threshold for final predictions
optimal_threshold = 0.7
y_pred_proba = best_pipeline.predict_proba(X_test)[:, 1]
y_pred_optimal = (y_pred_proba >= optimal_threshold).astype(int)

# Print classification report and confusion matrix
print(f"\nClassification Report at Threshold {optimal_threshold}:")
print(classification_report(y_test, y_pred_optimal, digits=4))

cm = confusion_matrix(y_test, y_pred_optimal)
print(f"Confusion Matrix at Threshold {optimal_threshold}:")
print(cm)


Global Scale Pos Weight: 23.57
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Parameters:
{'model__n_estimators': 300, 'model__max_depth': 7, 'model__learning_rate': 0.1, 'model__gamma': 0.1}
Best AUPRC Score from Cross-Validation: 0.9117

Classification Report at Threshold 0.7:
              precision    recall  f1-score   support

           0     0.9909    0.9962    0.9936      3403
           1     0.9161    0.8208    0.8659       173

    accuracy                         0.9877      3576
   macro avg     0.9535    0.9085    0.9297      3576
weighted avg     0.9873    0.9877    0.9874      3576

Confusion Matrix at Threshold 0.7:
[[3390   13]
 [  31  142]]
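To put the search above in perspective, the following sketch counts the size of the parameter grid and how much of it a 50-candidate random search covers. The grid values are copied from `param_grid`; the sampling uses Python's `random` module purely for illustration and will not reproduce the exact candidates that RandomizedSearchCV drew.

```python
import itertools
import random

# Same candidate values as the param_grid used for the search
param_grid = {
    'model__max_depth': [3, 5, 7],
    'model__learning_rate': [0.01, 0.05, 0.1],
    'model__n_estimators': [100, 200, 300],
    'model__gamma': [0, 0.1, 0.3],
}

all_combinations = list(itertools.product(*param_grid.values()))
n_iter, n_splits = 50, 5

# When every parameter is given as a list, candidates are drawn
# without replacement, so no combination is evaluated twice.
rng = random.Random(42)
sampled = rng.sample(all_combinations, n_iter)

total_fits = n_iter * n_splits
coverage = n_iter / len(all_combinations)
print(f"Grid size: {len(all_combinations)} combinations")
print(f"Sampled: {len(set(sampled))} distinct candidates ({coverage:.0%} of the grid)")
print(f"Total fits: {total_fits}")
```

The 250 fits match the log line printed by the search, and an exhaustive grid search over all 81 combinations would need 405 fits with the same five folds.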
