## Training Model using XGBoost

This notebook explores the following techniques to improve the performance of the XGBoost model in detecting fraudulent job postings:

- **Various `scale_pos_weight` values:** Adjust the balance between positive (fraudulent) and negative (non-fraudulent) weights to handle the class imbalance in the dataset.

- **Threshold Adjustment:** Helps to balance precision and recall by modifying the cutoff point at which a job posting is classified as fraudulent.

- **Stratified K-Fold Cross-Validation:** Ensures that each fold in cross-validation maintains the same proportion of fraudulent and non-fraudulent jobs as in the original dataset, providing a more reliable evaluation.

- **Hyperparameter Tuning:** Used to find the optimal set of model parameters (such as learning rate, tree depth, and number of estimators) to improve overall model performance.
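
As an illustration of threshold adjustment, the sketch below shows how lowering the cutoff trades precision for recall. The labels and scores are made-up toy values (not this notebook's data), and `precision_recall_at` is a hypothetical helper written only for this example.

```python
def precision_recall_at(y_true, y_proba, threshold):
    """Precision and recall when scores >= threshold are flagged as fraudulent."""
    y_pred = [1 if p >= threshold else 0 for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels and predicted fraud probabilities (illustrative values only).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_proba = [0.05, 0.10, 0.20, 0.45, 0.40, 0.30, 0.70, 0.90]

for threshold in (0.5, 0.35):
    precision, recall = precision_recall_at(y_true, y_proba, threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
# threshold=0.50  precision=1.00  recall=0.67
# threshold=0.35  precision=0.75  recall=1.00
```

On a real model the cutoff would normally be chosen on the validation set rather than the test set, so that the test metrics stay unbiased.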


In [17]:
import pandas as pd
import xgboost as xgb
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, auc
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from collections import Counter
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler

# Load and Preprocess the Data

In [18]:
df = pd.read_csv('../dataset/data_cleaned_2.csv')  # Replace with your own path

In [19]:
df.head()

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,has_location,has_employment_type,has_required_experience,has_required_education,has_industry,has_function,...,city_ wilmington,city_ woodbridge,city_ woodruff,city_ worcester,city_ İstanbul,city_ Αthens,city_ Αθήνα,city_ ΕΛΛΗΝΙΚΟ,city_ 마포구 동교동,city_Unknown
0,0,1,0,0,1,1,1,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,1,0,0,1,1,1,0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,1,0,0,1,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,1,0,0,1,1,1,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,1,1,0,1,1,1,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
categorical_features = list(df.select_dtypes(include=['object']).columns)
numeric_features = list(df.select_dtypes(include=['int64', 'float64']).columns)
if 'fraudulent' in numeric_features:
    numeric_features.remove('fraudulent')

In [21]:
categorical_features

[]

In [42]:
def create_preprocessor(categorical_features, numeric_features):
    transformers = []

    transformers.append(
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
    )

    transformers.append(('scaler', StandardScaler(), numeric_features))
    
    return ColumnTransformer(transformers=transformers, remainder='drop')

preprocessor = create_preprocessor(categorical_features, numeric_features)

X = df[categorical_features + numeric_features]
y = df['fraudulent']

# Step 1: Hold out 20% of the data as the test set
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Step 2: Split the remaining 80% into train (60% of total) and validation (20% of total) sets
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val)

print(f"Training set size: {X_train.shape[0]}")
print(f"Validation set size: {X_val.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

Training set size: 10728
Validation set size: 3576
Test set size: 3576
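
The sizes printed above correspond to a 60/20/20 split. A quick arithmetic check, as a sketch that uses only the printed sizes (`round` is a simplification here; the products happen to be whole numbers):

```python
n_total = 10728 + 3576 + 3576          # sizes printed above
n_test = round(n_total * 0.2)          # first split: test_size=0.2
n_train_val = n_total - n_test
n_val = round(n_train_val * 0.25)      # second split: test_size=0.25 of the remainder
n_train = n_train_val - n_val

print(n_train, n_val, n_test)
# 10728 3576 3576
print([round(n / n_total, 2) for n in (n_train, n_val, n_test)])
# [0.6, 0.2, 0.2]
```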


# Base XGBoost Model

**Key Techniques**:  
Baseline XGBoost model for benchmarking.

**Key Discoveries**:  
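The baseline reaches high precision on the fraudulent class (0.9701) but a recall of only 0.7514: 43 of the 173 fraudulent postings in the test set are missed. The model might therefore benefit from techniques that address class imbalance.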

In [43]:
# Build the XGBoost model
model = xgb.XGBClassifier(eval_metric='aucpr', random_state=42)

# Create the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred, digits=4))

# Print confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(cm)

# Calculate and print the ROC AUC score
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]  # Get probability for the positive class
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f'ROC AUC Score: {roc_auc:.4f}')

# Calculate and print the AUPRC (Area Under Precision-Recall Curve)
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
auprc = auc(recall, precision)
print(f'AUPRC: {auprc:.4f}')

Classification Report:
              precision    recall  f1-score   support

           0     0.9875    0.9988    0.9931      3403
           1     0.9701    0.7514    0.8469       173

    accuracy                         0.9869      3576
   macro avg     0.9788    0.8751    0.9200      3576
weighted avg     0.9867    0.9869    0.9861      3576

Confusion Matrix:
[[3399    4]
 [  43  130]]
ROC AUC Score: 0.9918
AUPRC: 0.9324
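
The AUPRC above is obtained by trapezoidal integration of the precision-recall curve (`auc(recall, precision)`). scikit-learn also provides `average_precision_score`, which summarizes the same curve as a step function and avoids the interpolation that can make the trapezoidal estimate optimistic. The sketch below computes both on made-up toy scores (not the notebook's predictions), so the two can be compared side by side:

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy labels and scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.65, 0.70, 0.90])

precision, recall, _ = precision_recall_curve(y_true, y_score)

auprc_trapezoid = auc(recall, precision)                   # method used in this notebook
avg_precision = average_precision_score(y_true, y_score)   # step-wise alternative

print(f"Trapezoidal AUPRC: {auprc_trapezoid:.4f}")
print(f"Average precision: {avg_precision:.4f}")
```

The two numbers are usually close but not identical, so it helps to use a single definition consistently when comparing models.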


# Adjusting scale_pos_weight in XGBoost

**Key Techniques**:  
`scale_pos_weight` is a commonly used XGBoost parameter that controls the balance between positive and negative weights, which is useful for imbalanced datasets. Increasing it makes the algorithm give more weight to the minority (positive) class during training.

**Key Discoveries**:  
Only the smallest weight tested improved on the baseline AUPRC of 0.9324: scale_pos_weight = 3.93 (the negative-to-positive ratio divided by 5) reached 0.9387, while the other weights scored lower. We therefore select the weight with the highest AUPRC, which is scale_pos_weight = 3.93.
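
The candidate weights are all derived from the negative-to-positive ratio of the training set. As a worked example, the sketch below reproduces them from the class counts that appear in the resampling outputs of this notebook (10,208 non-fraudulent and 520 fraudulent training samples); the dictionary keys are labels chosen only for this illustration.

```python
from collections import Counter

# Training-set class counts, taken from the resampling outputs in this notebook.
class_counts = Counter({0: 10208, 1: 520})

base = class_counts[0] / class_counts[1]  # negatives per positive
candidates = {
    "base": base,
    "base * 3": base * 3,
    "base * 5": base * 5,
    "base / 3": base / 3,
    "base / 5": base / 5,
}

for name, value in candidates.items():
    print(f"{name:>8}: {value:.6f}")
#     base: 19.630769
# base * 3: 58.892308
# base * 5: 98.153846
# base / 3: 6.543590
# base / 5: 3.926154
```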

In [44]:
# Define scale_pos_weight values to test
counter = Counter(y_train)
scale_pos_weight_base = counter[0] / counter[1]

scale_pos_weights = [scale_pos_weight_base, scale_pos_weight_base * 3, scale_pos_weight_base * 5, scale_pos_weight_base / 3, scale_pos_weight_base / 5]

results = []

# Iterate over different scale_pos_weight values
for spw in scale_pos_weights:
    
    # Build the XGBoost model with the current scale_pos_weight
    model = xgb.XGBClassifier(eval_metric='aucpr', random_state=42, scale_pos_weight=spw)
    
    # Create the pipeline
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    
    # Train the pipeline
    pipeline.fit(X_train, y_train)
    
    # Calculate the ROC AUC score
    y_pred_proba = pipeline.predict_proba(X_test)[:, 1]  # Get probability for the positive class
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Calculate the AUPRC (Area Under Precision-Recall Curve)
    precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
    auprc = auc(recall, precision)
    
    # Store the result
    results.append({
        'scale_pos_weight': spw,
        'roc_auc': roc_auc,
        'auprc': auprc
    })

# Create a DataFrame to display results
df_results = pd.DataFrame(results)

print("\nEvaluation Metrics for Different scale_pos_weight Values:")
print(df_results)

# Find the best scale_pos_weight based on AUPRC
best_spw = df_results.loc[df_results['auprc'].idxmax()]
print(f"\nBest scale_pos_weight based on AUPRC: {best_spw['scale_pos_weight']:.2f} with AUPRC: {best_spw['auprc']:.4f}")



Evaluation Metrics for Different scale_pos_weight Values:
   scale_pos_weight   roc_auc     auprc
0         19.630769  0.986156  0.922435
1         58.892308  0.985022  0.911373
2         98.153846  0.984719  0.919927
3          6.543590  0.991497  0.931075
4          3.926154  0.991539  0.938688

Best scale_pos_weight based on AUPRC: 3.93 with AUPRC: 0.9387


# Dealing with Class Imbalance

## Undersampling

**Key Techniques**:  
Undersampling is used to balance the dataset by reducing the number of majority class samples. This method helps ensure the model is trained on a dataset with a more even distribution between the positive and negative classes, allowing it to better recognize patterns related to the minority class.

**Key Discoveries**:  
Most undersampling results fall below the baseline model’s AUPRC of 0.9324. This is likely because discarding majority-class samples removes information that the model needs to identify certain patterns.
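
To make the `sampling_strategy` values concrete: for undersampling, a float is the desired ratio of minority to majority samples after resampling, so the number of majority samples kept is the minority count divided by that ratio. The sketch below is a simplified stand-in for `RandomUnderSampler` (the helper `undersample_majority` is hypothetical), using the training-set class counts from this notebook:

```python
import random
from collections import Counter

def undersample_majority(labels, ratio, seed=42):
    """Return sorted indices keeping all minority samples and a random
    subset of majority samples so that minority / majority ~= ratio."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == 1]
    majority_idx = [i for i, y in enumerate(labels) if y == 0]
    n_keep = int(len(minority_idx) / ratio)  # majority samples to retain
    kept_majority = rng.sample(majority_idx, n_keep)
    return sorted(minority_idx + kept_majority)

# Same class counts as the training set in this notebook.
labels = [0] * 10208 + [1] * 520

for ratio in (0.1, 0.3, 0.5):
    kept_idx = undersample_majority(labels, ratio)
    counts = Counter(labels[i] for i in kept_idx)
    print(f"ratio={ratio}: kept {counts[0]} majority, {counts[1]} minority")
```

At a ratio of 0.5, only 1,040 of the 10,208 non-fraudulent postings remain, which shows how much majority-class data is discarded.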

In [50]:
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# Define sampling strategies to test
sampling_strategies = [0.1, 0.2, 0.3, 0.4, 0.5]  # Test different undersampling ratios
results = []

# Loop through each sampling strategy
for strategy in sampling_strategies:
    # Apply undersampling with the current strategy
    under_sampler = RandomUnderSampler(sampling_strategy=strategy, random_state=42)
    X_train_resampled, y_train_resampled = under_sampler.fit_resample(X_train, y_train)

    # Check the class distribution after undersampling
    print(f'Sampling strategy {strategy} - Resampled class distribution: {Counter(y_train_resampled)}')

    # Define scale_pos_weight values to test
    counter = Counter(y_train_resampled)
    scale_pos_weight_base = counter[0] / counter[1]

    scale_pos_weights = [1, scale_pos_weight_base, scale_pos_weight_base * 3, scale_pos_weight_base * 5, scale_pos_weight_base / 3, scale_pos_weight_base / 5]

    for weight in scale_pos_weights:
        # Build the XGBoost model
        model = xgb.XGBClassifier(eval_metric='aucpr', random_state=42, scale_pos_weight=weight)

        # Create the pipeline
        pipeline = Pipeline(steps=[
            ('preprocessor', preprocessor),  # Assuming `preprocessor` is already defined for your data
            ('model', model)
        ])

        # Train the pipeline on the undersampled data
        pipeline.fit(X_train_resampled, y_train_resampled)

        # Calculate the ROC AUC score
        y_pred_proba = pipeline.predict_proba(X_test)[:, 1]  # Get probability for the positive class
        roc_auc = roc_auc_score(y_test, y_pred_proba)

        # Calculate the AUPRC (Area Under Precision-Recall Curve)
        precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
        auprc = auc(recall, precision)

        # Store the results
        results.append({'Pos Weight': weight, 'Sampling Strategy': strategy, 'ROC AUC': roc_auc, 'AUPRC': auprc})

# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Print the results table
print("Results for Different Sampling Strategies:")
print(results_df)

# Find and print the best sampling strategy based on AUPRC
best_strategy = results_df.loc[results_df['AUPRC'].idxmax()]
print("\nBest Sampling Strategy:")
print(best_strategy)


Sampling strategy 0.1 - Resampled class distribution: Counter({0: 5200, 1: 520})
Sampling strategy 0.2 - Resampled class distribution: Counter({0: 2600, 1: 520})
Sampling strategy 0.3 - Resampled class distribution: Counter({0: 1733, 1: 520})
Sampling strategy 0.4 - Resampled class distribution: Counter({0: 1300, 1: 520})
Sampling strategy 0.5 - Resampled class distribution: Counter({0: 1040, 1: 520})
Results for Different Sampling Strategies:
    Pos Weight  Sampling Strategy   ROC AUC     AUPRC
0     1.000000                0.1  0.991893  0.931355
1    10.000000                0.1  0.991273  0.935042
2    30.000000                0.1  0.992239  0.924283
3    50.000000                0.1  0.988611  0.923860
4     3.333333                0.1  0.992064  0.929902
5     2.000000                0.1  0.990641  0.926801
6     1.000000                0.2  0.991976  0.919709
7     5.000000                0.2  0.992985  0.924733
8    15.000000                0.2  0.991639  0.924293
9    25.0000

## SMOTE

**Key Techniques**:  
SMOTE (Synthetic Minority Over-sampling Technique) increases the number of minority class samples by generating synthetic examples. The method aims to create a more balanced dataset that could help the model better identify minority class patterns. Sampling strategies from 0.35 to 0.6 were tested in combination with various scale_pos_weight values to find the best performance.

**Key Discoveries**:  
Applying SMOTE is reported to improve results, which might indicate that the model handles class imbalance better. The best reported configuration is a sampling strategy of 0.35 with a scale_pos_weight of 0.952, achieving an AUPRC of 0.9550 and a ROC AUC of 0.9951. Note that the output captured below shows lower values for strategy 0.35 (its best AUPRC is 0.9344, at scale_pos_weight = 1), so the reported figures appear to come from a different run. Higher sampling strategies, such as 0.6, led to a drop in AUPRC, indicating that oversampling beyond a certain point may bring diminishing returns or overfitting.
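
The core idea of SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. The code below is a simplified illustration under that assumption, not imbalanced-learn's implementation; `smote_like_samples` and the toy points are hypothetical.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy 2-D minority samples (illustrative values only).
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9)]
new_points = smote_like_samples(minority, n_new=3)
print(new_points)

# With sampling_strategy=0.35 and 10,208 majority samples, the minority class
# is grown to int(0.35 * 10208) samples, starting from 520 real ones.
n_minority_target = int(0.35 * 10208)
print(n_minority_target, n_minority_target - 520)
# 3572 3052
```

Because every synthetic point is a blend of two existing fraudulent postings, SMOTE adds no new information; it only densifies the region the minority class already occupies.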

In [51]:
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, roc_auc_score, precision_recall_curve, auc

# Define sampling strategies to test
sampling_strategies = [0.35, 0.4, 0.5, 0.6]  # Target minority-to-majority ratios after resampling
results = []

# Loop through each sampling strategy
for strategy in sampling_strategies:
    # Apply SMOTE with the current strategy
    smote = SMOTE(sampling_strategy=strategy, random_state=42)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

    # Check the class distribution after resampling
    print(f'Sampling strategy {strategy} - Resampled class distribution: {Counter(y_train_resampled)}')

    # Define scale_pos_weight values to test
    counter = Counter(y_train_resampled)
    scale_pos_weight_base = counter[0] / counter[1]

    scale_pos_weights = [1, scale_pos_weight_base, scale_pos_weight_base * 3, scale_pos_weight_base * 5, scale_pos_weight_base / 3, scale_pos_weight_base / 5]

    for weight in scale_pos_weights:
        # Build the XGBoost model
        model = xgb.XGBClassifier(eval_metric='aucpr', random_state=42, scale_pos_weight=weight)

        # Create the pipeline
        pipeline = Pipeline(steps=[
            ('preprocessor', preprocessor),  # Assuming `preprocessor` is already defined for your data
            ('model', model)
        ])

        # Train the pipeline on the resampled data
        pipeline.fit(X_train_resampled, y_train_resampled)

        # Calculate the ROC AUC score
        y_pred_proba = pipeline.predict_proba(X_test)[:, 1]  # Get probability for the positive class
        roc_auc = roc_auc_score(y_test, y_pred_proba)

        # Calculate the AUPRC (Area Under Precision-Recall Curve)
        precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
        auprc = auc(recall, precision)

        # Store the results
        results.append({'Pos Weight': weight, 'Sampling Strategy': strategy, 'ROC AUC': roc_auc, 'AUPRC': auprc})

# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Print the results table
print("Results for Different Sampling Strategies:")
print(results_df)

# Find and print the best sampling strategy based on AUPRC
best_strategy = results_df.loc[results_df['AUPRC'].idxmax()]
print("\nBest Sampling Strategy:")
print(best_strategy)


Sampling strategy 0.35 - Resampled class distribution: Counter({0: 10208, 1: 3572})
Sampling strategy 0.4 - Resampled class distribution: Counter({0: 10208, 1: 4083})
Sampling strategy 0.5 - Resampled class distribution: Counter({0: 10208, 1: 5104})
Sampling strategy 0.6 - Resampled class distribution: Counter({0: 10208, 1: 6124})
Results for Different Sampling Strategies:
    Pos Weight  Sampling Strategy   ROC AUC     AUPRC
0     1.000000               0.35  0.991400  0.934437
1     2.857783               0.35  0.988871  0.930387
2     8.573348               0.35  0.989565  0.927188
3    14.288914               0.35  0.984996  0.916382
4     0.952594               0.35  0.991145  0.924868
5     0.571557               0.35  0.988747  0.930997
6     1.000000               0.40  0.990581  0.922622
7     2.500122               0.40  0.989768  0.931134
8     7.500367               0.40  0.986535  0.914503
9    12.500612               0.40  0.985124  0.919765
10    0.833374               0

# Stratified K-Fold Cross Validation
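
Stratification keeps the class proportions of each fold close to those of the full dataset, which matters when the positive class is rare. A minimal sketch on synthetic labels (5% positives, chosen only for illustration) shows that `StratifiedKFold` spreads the positives evenly across the validation folds:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 95 negatives and 5 positives (illustrative only).
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_positive_counts = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]

print(fold_positive_counts)
# [1, 1, 1, 1, 1]
```

With a plain, unstratified split, some folds could end up with no fraudulent postings at all, which would make the per-fold AUPRC unreliable or undefined.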

In [41]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, precision_recall_curve, auc
from imblearn.pipeline import Pipeline as ImbPipeline  # Use imblearn's Pipeline to integrate SMOTE

def auprc_score(y_true, y_pred_proba):
    precision, recall, _ = precision_recall_curve(y_true, y_pred_proba)
    return auc(recall, precision)

# Set up stratified k-fold cross-validation
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Experiment configurations
experiments = [
    {'use_smote': True, 'pos_weight': 0.952551, 'description': 'With SMOTE, With pos_weight'},
    {'use_smote': True, 'pos_weight': 1.0, 'description': 'With SMOTE, Without pos_weight'},
    {'use_smote': False, 'pos_weight': 0.952551, 'description': 'Without SMOTE, With pos_weight'},
    {'use_smote': False, 'pos_weight': 1.0, 'description': 'Without SMOTE, Without pos_weight'}
]

results = []

for experiment in experiments:
    # Build the XGBoost model with or without pos_weight
    model = xgb.XGBClassifier(eval_metric='aucpr', random_state=42, scale_pos_weight=experiment['pos_weight'])

    # Create the pipeline with or without SMOTE
    if experiment['use_smote']:
        pipeline = ImbPipeline(steps=[
            ('preprocessor', preprocessor),  # Assuming `preprocessor` is already defined for your data
            ('smote', SMOTE(sampling_strategy=0.35, random_state=42)),
            ('model', model)
        ])
    else:
        pipeline = ImbPipeline(steps=[
            ('preprocessor', preprocessor),
            ('model', model)
        ])

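    # Run cross-validation
    # Note: `needs_proba` is deprecated in scikit-learn >= 1.4; newer versions use response_method='predict_proba'
    cv_scores_auprc = cross_val_score(pipeline, X_train, y_train, cv=kf, 
                                      scoring=make_scorer(auprc_score, needs_proba=True))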
    
    # Store results
    results.append({
        'Description': experiment['description'],
        'AUPRC Scores': cv_scores_auprc,
        'Mean AUPRC': np.mean(cv_scores_auprc)
    })

# Print results for each experiment
for result in results:
    print(f"Experiment: {result['Description']}")
    print(f"Cross-validated AUPRC Scores: {result['AUPRC Scores']}")
    print(f"Mean AUPRC Score: {result['Mean AUPRC']:.4f}")
    print("\n")




Experiment: With SMOTE, With pos_weight
Cross-validated AUPRC Scores: [0.94522022 0.90863379 0.91580369 0.90951941 0.90774601]
Mean AUPRC Score: 0.9174


Experiment: With SMOTE, Without pos_weight
Cross-validated AUPRC Scores: [0.94628992 0.91595282 0.91165832 0.89947483 0.90333678]
Mean AUPRC Score: 0.9153


Experiment: Without SMOTE, With pos_weight
Cross-validated AUPRC Scores: [0.93274399 0.91163103 0.91856448 0.90451142 0.90335677]
Mean AUPRC Score: 0.9142


Experiment: Without SMOTE, Without pos_weight
Cross-validated AUPRC Scores: [0.94069722 0.91297013 0.92283387 0.90870756 0.90928725]
Mean AUPRC Score: 0.9189




# Stacking Ensemble Model

Our dataset is high-dimensional (for example, it contains one one-hot encoded column per city), which might lead to the curse of dimensionality.
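
For readers new to stacking: each base model produces out-of-fold predictions on the training data, and a final estimator is trained on those predictions instead of the raw features. The sketch below spells this out by hand on synthetic data; it is a rough approximation of what `StackingClassifier(cv=5)` automates, and the dataset and model settings are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Small synthetic imbalanced dataset (not the job postings data).
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

base_models = [
    LogisticRegression(max_iter=500),
    RandomForestClassifier(n_estimators=25, random_state=42),
]

# Out-of-fold positive-class probabilities: one column per base model.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# The final estimator learns how to combine the base models' predictions.
meta_model = LogisticRegression().fit(meta_features, y)
print(meta_features.shape, meta_model.coef_)
```

Using out-of-fold predictions keeps the final estimator from seeing predictions the base models made on their own training rows, which would otherwise reward overfit base models.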

In [49]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, auc
from imblearn.pipeline import Pipeline  # Use imblearn's Pipeline for SMOTE
from imblearn.over_sampling import SMOTE

# Preprocessing pipeline (categorical + numeric)
preprocessor = create_preprocessor(categorical_features, numeric_features)

# Function to test different final estimators, hyperparameters, and SMOTE strategies
def test_stacking_model(final_estimators, scale_weights, alphas, lambdas, smote_strategies):
    for smote_strategy in smote_strategies:
        for final_estimator in final_estimators:
            for scale_pos_weight in scale_weights:
                for alpha in alphas:
                    for reg_lambda in lambdas:
                        print(f'Testing: SMOTE={smote_strategy}, final_estimator={final_estimator}, scale_pos_weight={scale_pos_weight}, alpha={alpha}, reg_lambda={reg_lambda}')
                        
                        # Create the XGBoost model with varying hyperparameters
                        xgb_model = xgb.XGBClassifier(
                            eval_metric='aucpr',
                            random_state=42,
                            scale_pos_weight=scale_pos_weight,
                            alpha=alpha,
                            reg_lambda=reg_lambda
                        )
                        
                        # Base models for stacking
                        base_models = [
                            ('lr', LogisticRegression(max_iter=500)),
                            ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
                            ('xgb', xgb_model)
                        ]
                        
                        # StackingClassifier with varying final estimator
                        stacked_model = StackingClassifier(
                            estimators=base_models,
                            final_estimator=final_estimator,
                            cv=5
                        )
                        
                        # Create the pipeline with or without SMOTE
                        if smote_strategy is not None:
                            smote = SMOTE(sampling_strategy=smote_strategy, random_state=42)
                            pipeline = Pipeline(steps=[
                                ('preprocessor', preprocessor),
                                ('smote', smote),
                                ('model', stacked_model)
                            ])
                        else:
                            pipeline = Pipeline(steps=[
                                ('preprocessor', preprocessor),
                                ('model', stacked_model)
                            ])
                        
                        # Train the pipeline
                        pipeline.fit(X_train, y_train)
                        
                        # Get predicted probabilities for the positive class
                        y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
                        
                        # Calculate and print the AUPRC (Area Under Precision-Recall Curve)
                        precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
                        auprc = auc(recall, precision)
                        print(f'AUPRC: {auprc:.4f}')

# Define the final estimators, scale weights, alphas, lambdas, and SMOTE strategies to test
final_estimators = [LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=200, random_state=42)]
scale_weights = [1, 3.92]
alphas = [0.1, 1]
lambdas = [1, 10]
smote_strategies = [None, 0.30, 0.5]  # None for no SMOTE; floats are the target minority-to-majority ratio

# Run the testing function
test_stacking_model(final_estimators, scale_weights, alphas, lambdas, smote_strategies)


Testing: SMOTE=None, final_estimator=LogisticRegression(max_iter=1000), scale_pos_weight=1, alpha=0.1, reg_lambda=1
AUPRC: 0.9670
Testing: SMOTE=None, final_estimator=LogisticRegression(max_iter=1000), scale_pos_weight=1, alpha=0.1, reg_lambda=10
AUPRC: 0.9691
Testing: SMOTE=None, final_estimator=LogisticRegression(max_iter=1000), scale_pos_weight=1, alpha=1, reg_lambda=1
AUPRC: 0.9666
Testing: SMOTE=None, final_estimator=LogisticRegression(max_iter=1000), scale_pos_weight=1, alpha=1, reg_lambda=10
AUPRC: 0.9674
Testing: SMOTE=None, final_estimator=LogisticRegression(max_iter=1000), scale_pos_weight=3.92, alpha=0.1, reg_lambda=1
AUPRC: 0.9697
Testing: SMOTE=None, final_estimator=LogisticRegression(max_iter=1000), scale_pos_weight=3.92, alpha=0.1, reg_lambda=10
AUPRC: 0.9699
Testing: SMOTE=None, final_estimator=LogisticRegression(max_iter=1000), scale_pos_weight=3.92, alpha=1, reg_lambda=1
AUPRC: 0.9666
Testing: SMOTE=None, final_estimator=LogisticRegression(max_iter=1000), scale_pos_we

In [52]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, auc, average_precision_score
from imblearn.pipeline import Pipeline  # Use imblearn's Pipeline for SMOTE
import xgboost as xgb
import joblib  # For saving the model

# Preprocessing pipeline (categorical + numeric)
preprocessor = create_preprocessor(categorical_features, numeric_features)

# Set the best known parameters
smote_strategy = None
final_estimator = LogisticRegression(max_iter=1000)
scale_pos_weight = 3.92
alpha = 0.1
reg_lambda = 10

# Create the XGBoost model with the best known parameters
xgb_model = xgb.XGBClassifier(
    eval_metric='aucpr',
    random_state=42,
    scale_pos_weight=scale_pos_weight,
    alpha=alpha,
    reg_lambda=reg_lambda
)

# Define base classifiers
base_models = [
    ('lr', LogisticRegression(max_iter=500)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('xgb', xgb_model)
]

# StackingClassifier with the best final estimator
stacked_model = StackingClassifier(
    estimators=base_models,
    final_estimator=final_estimator,
    cv=5
)

# Create the pipeline without SMOTE (since SMOTE strategy is None)
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', stacked_model)
])

# Train the pipeline with the training data
pipeline.fit(X_train, y_train)

# Evaluate the model on the validation set and calculate AUPRC
y_pred_proba = pipeline.predict_proba(X_val)[:, 1]  # Get probability for the positive class
precision, recall, _ = precision_recall_curve(y_val, y_pred_proba)
auprc = auc(recall, precision)

# Print the AUPRC score
print(f'Best AUPRC on Validation Set: {auprc:.4f}')

# Save the best model to a file
joblib.dump(pipeline, 'best_stacking_model.pkl')

# Load and use the best model later if needed
# best_model = joblib.load('best_stacking_model.pkl')
# y_pred_prob = best_model.predict_proba(new_data)[:, 1]


Best AUPRC on Validation Set: 0.9582


['best_stacking_model.pkl']