## Step 0: Import Necessary Libraries

### Overview

This cell imports the libraries used throughout the notebook: **pandas** and **numpy** for data handling; **matplotlib** and **seaborn** for plotting; **scikit-learn** for data splitting, metrics, one-hot encoding, logistic regression and random forests; **imbalanced-learn** for SMOTE and its sampler-aware `Pipeline`; and **xgboost** and **lightgbm** for gradient boosting. It also suppresses all Python warnings (which hides deprecation notices as well) and sets the seaborn `whitegrid` style. No data is loaded or modelled yet.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score, f1_score, precision_recall_curve, confusion_matrix, roc_curve
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier, RandomForestClassifier

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

import xgboost as xgb
import lightgbm as lgb

import warnings
warnings.filterwarnings('ignore')

sns.set(style='whitegrid')


## Step 1: Load the Datasets

### Overview

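This cell reads `train.csv` and `test.csv` from the Kaggle input directory `predict-the-success-of-bank-telemarketing` into two pandas DataFrames, `train_df` and `test_df`.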

In [None]:
train_df = pd.read_csv('/kaggle/input/predict-the-success-of-bank-telemarketing/train.csv')
test_df = pd.read_csv('/kaggle/input/predict-the-success-of-bank-telemarketing/test.csv')


## Step 2: Handle Missing Values

### Overview

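This cell replaces missing values in four categorical columns (`job`, `education`, `contact`, `poutcome`) with the placeholder category `'unknown'`, in both the training and test sets.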

In [None]:
categorical_cols_with_missing = ['job', 'education', 'contact', 'poutcome']
train_df[categorical_cols_with_missing] = train_df[categorical_cols_with_missing].fillna('unknown')
test_df[categorical_cols_with_missing] = test_df[categorical_cols_with_missing].fillna('unknown')
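
Before filling, it can help to confirm which columns actually contain missing values. The following is a minimal sketch on a toy frame; `toy_df` and its values are invented for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df (values are made up)
toy_df = pd.DataFrame({
    'job': ['admin.', np.nan, 'technician'],
    'education': ['tertiary', 'secondary', np.nan],
    'balance': [100, 250, -30],
})

cols = ['job', 'education']

# Count missing values per column before filling
missing_before = toy_df[cols].isna().sum()

# Same strategy as the cell above: treat "missing" as its own category
toy_df[cols] = toy_df[cols].fillna('unknown')

# Count again to confirm nothing is left unfilled
missing_after = toy_df[cols].isna().sum()
print(missing_before.to_dict(), missing_after.to_dict())
```

Keeping `'unknown'` as a separate category lets the one-hot encoder give it its own indicator column instead of discarding those rows.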


## Step 3: Feature Engineering

### Overview

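This cell derives new features in both data sets. It parses `last contact date` and extracts the month, year, day of week and day of month, plus a `weekday`/`weekend` flag, then drops the original date column. It maps the training target from `yes`/`no` to 1/0, adds a binary `pdays_contacted` flag (0 when `pdays` is -1, otherwise 1), builds four interaction features by concatenating pairs of columns as strings, and adds log-transformed copies of five skewed numeric columns. The log shift is computed from the training minimum and reused for the test set.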

In [None]:
train_df['last contact date'] = pd.to_datetime(train_df['last contact date'])
test_df['last contact date'] = pd.to_datetime(test_df['last contact date'])

for df in [train_df, test_df]:
    df['contact_month'] = df['last contact date'].dt.month
    df['contact_year'] = df['last contact date'].dt.year
    df['contact_dayofweek'] = df['last contact date'].dt.dayofweek
    df['contact_day'] = df['last contact date'].dt.day
    df['contact_period'] = df['contact_dayofweek'].apply(lambda x: 'weekend' if x >= 5 else 'weekday')

train_df.drop(['last contact date'], axis=1, inplace=True)
test_df.drop(['last contact date'], axis=1, inplace=True)

train_df['target'] = train_df['target'].map({'yes': 1, 'no': 0})

train_df['pdays_contacted'] = train_df['pdays'].apply(lambda x: 0 if x == -1 else 1)
test_df['pdays_contacted'] = test_df['pdays'].apply(lambda x: 0 if x == -1 else 1)

interaction_features = {
    'job_marital': ['job', 'marital'],
    'job_education': ['job', 'education'],
    'housing_loan': ['housing', 'loan'],
    'campaign_outcome': ['campaign', 'poutcome']
}

for new_col, cols in interaction_features.items():
    train_df[new_col] = train_df[cols[0]].astype(str) + '_' + train_df[cols[1]].astype(str)
    test_df[new_col] = test_df[cols[0]].astype(str) + '_' + test_df[cols[1]].astype(str)

skewed_features = ['balance', 'duration', 'campaign', 'pdays', 'previous']
for col in skewed_features:
    min_val = train_df[col].min()
    train_df[col + '_log'] = train_df[col].apply(lambda x: np.log(x + abs(min_val) + 1))
    test_df[col + '_log'] = test_df[col].apply(lambda x: np.log(x + abs(min_val) + 1))
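
Because the shift is taken from the training minimum, a test value below that minimum would make the argument of the logarithm zero or negative and produce `-inf` or `NaN`. The sketch below shows one possible safeguard: the same formula, vectorised, with test values clipped at the training minimum. `shifted_log` is a hypothetical helper and the arrays are made-up values, not part of the original notebook.

```python
import numpy as np

def shifted_log(values, train_min):
    """Compute log(x + |train_min| + 1), the formula used in the cell above.

    Hypothetical helper: values below train_min are clipped to train_min
    first, so the argument of the logarithm never drops below 1.
    """
    values = np.asarray(values, dtype=float)
    shift = abs(train_min) + 1
    return np.log(np.clip(values, train_min, None) + shift)

# Made-up balances; the test set contains a value below the training minimum
train_balance = np.array([-30.0, 0.0, 120.0])
test_balance = np.array([-50.0, 10.0])

train_min = train_balance.min()
train_log = shifted_log(train_balance, train_min)
test_log = shifted_log(test_balance, train_min)
print(train_log, test_log)
```

Clipping is only one option; whether it is appropriate depends on how often test values fall outside the training range.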


## Step 4: Encoding Categorical Variables

### Overview

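This cell one-hot encodes the categorical columns, including the interaction features and the `contact_month` and `contact_dayofweek` date parts. The encoder is fitted on the training set only and then applied to the test set; with `handle_unknown="ignore"`, a category that was not seen during fitting is encoded as all zeros. The encoded columns replace the original categorical columns in both DataFrames.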

In [None]:
categorical_features = [
    'job', 'marital', 'education', 'default', 'housing', 'loan',
    'contact', 'poutcome', 'contact_period', 'campaign_outcome',
    'job_marital', 'job_education', 'housing_loan',
    'contact_month', 'contact_dayofweek'
]

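# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 (use `sparse=False` on older versions)
onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")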

train_encoded = onehot_encoder.fit_transform(train_df[categorical_features])
test_encoded = onehot_encoder.transform(test_df[categorical_features])

train_encoded_df = pd.DataFrame(train_encoded, columns=onehot_encoder.get_feature_names_out(categorical_features))
test_encoded_df = pd.DataFrame(test_encoded, columns=onehot_encoder.get_feature_names_out(categorical_features))

train_encoded_df.index = train_df.index
test_encoded_df.index = test_df.index

train_df = pd.concat([train_df.drop(columns=categorical_features), train_encoded_df], axis=1)
test_df = pd.concat([test_df.drop(columns=categorical_features), test_encoded_df], axis=1)


## Step 5: Define Features and Target

### Overview

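This cell separates the predictors from the label. Every column of `train_df` except `target` becomes a feature, and the same column list is selected from `test_df` so that `X` and `X_test` share the same columns in the same order.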

In [None]:
features = [col for col in train_df.columns if col != 'target']
X = train_df[features]
y = train_df['target']
X_test = test_df[features]
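
Selecting `test_df[features]` raises a `KeyError` if any training column is absent from the test set. A more defensive alternative, sketched below on invented toy frames, is `DataFrame.reindex`, which keeps the training column order, drops extra columns and fills absent ones. Filling with 0 is an assumption that only makes sense for one-hot indicator columns.

```python
import pandas as pd

# Toy frames, made up for illustration
toy_train = pd.DataFrame({
    'age': [30, 40],
    'job_admin.': [1.0, 0.0],
    'job_nurse': [0.0, 1.0],
    'target': [0, 1],
})
toy_test = pd.DataFrame({'job_admin.': [0.0], 'age': [25], 'extra': [9]})

toy_features = [c for c in toy_train.columns if c != 'target']

# Keep the training column order, drop 'extra', add the absent 'job_nurse' as 0
toy_X_test = toy_test.reindex(columns=toy_features, fill_value=0)
print(toy_X_test)
```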


## Step 6: Handle Class Imbalance using SMOTE

### Overview

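This cell creates the SMOTE oversampler that the model pipelines use to synthesise additional minority-class examples. No resampling happens here; it is applied only when a pipeline is fitted.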

In [None]:
smote = SMOTE(random_state=42)
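
SMOTE creates each synthetic example by interpolating between a minority-class point and one of its nearest minority-class neighbours. The sketch below is a simplified NumPy illustration of that idea only; it is not imbalanced-learn's implementation, and `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Simplified illustration of SMOTE-style interpolation (not imblearn's code)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances between minority points
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of every point
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick random base points and, for each, one of its k neighbours
    base_idx = rng.integers(0, n, size=n_new)
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    # Place each new point at a random position on the connecting segment
    gaps = rng.random((n_new, 1))
    return X_minority[base_idx] + gaps * (X_minority[nb_idx] - X_minority[base_idx])

# Made-up minority points at the corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority, n_new=5)
print(synthetic)
```

Because the interpolation mixes feature values, the resulting one-hot columns can take fractional values after resampling.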


## Step 7: Define Models with Best Parameters

### Overview

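This cell instantiates XGBoost, LightGBM and random forest classifiers with fixed hyper-parameters (the search that produced them is not shown) and wraps each one in an imbalanced-learn pipeline that applies SMOTE before the classifier. No training happens yet.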

In [None]:
# Initialize classifiers with the best parameters
xgb_best = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='auc',
    use_label_encoder=False,
    random_state=42,
    n_jobs=-1,
    subsample=0.6,
    n_estimators=550,
    max_depth=6,
    learning_rate=0.05,
    gamma=0.2,
    colsample_bytree=0.8
)

lgb_best = lgb.LGBMClassifier(
    objective='binary',
    random_state=42,
    n_jobs=-1,
    subsample=1.0,
    reg_lambda=0.1,
    reg_alpha=0.5,
    num_leaves=50,
    n_estimators=300,
    max_depth=-1,
    learning_rate=0.05,
    colsample_bytree=0.6
)

rf_best = RandomForestClassifier(
    n_estimators=500,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
    random_state=42,
    n_jobs=-1
)

# Create pipelines that apply SMOTE before each of the three classifiers
xgb_pipeline = ImbPipeline([
    ('smote', smote),
    ('classifier', xgb_best)
])

lgb_pipeline = ImbPipeline([
    ('smote', smote),
    ('classifier', lgb_best)
])

rf_pipeline = ImbPipeline([
    ('smote', smote),
    ('classifier', rf_best)
])

print("Models initialized with best parameters.")


## Step 8: Split Data for Validation

### Overview

This cell holds out 20% of the training data as a validation set. The split is stratified on the target, so both parts keep the same class ratio.

In [None]:
X_train_full, X_valid_full, y_train_full, y_valid_full = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


## Step 9: Train Models on Training Split

### Overview

This cell fits the three pipelines on the 80% training split. SMOTE resamples only the data passed to `fit`, so the validation split is left untouched.

In [None]:
print("\nTraining XGBoost...")
xgb_pipeline.fit(X_train_full, y_train_full)
print("XGBoost training completed.")

print("\nTraining LightGBM...")
lgb_pipeline.fit(X_train_full, y_train_full)
print("LightGBM training completed.")

print("\nTraining Random Forest...")
rf_pipeline.fit(X_train_full, y_train_full)
print("Random Forest training completed.")


## Step 10: Predict Probabilities on Validation Data

### Overview

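This cell computes each model's predicted probability of the positive class on the validation split. The random forest probabilities are computed here but are not used by the ensembles in Step 11.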

In [None]:
print("\nPredicting probabilities on validation data...")
xgb_valid_proba = xgb_pipeline.predict_proba(X_valid_full)[:, 1]
lgb_valid_proba = lgb_pipeline.predict_proba(X_valid_full)[:, 1]
rf_valid_proba = rf_pipeline.predict_proba(X_valid_full)[:, 1]


## Step 11: Ensemble Strategies

### Overview

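This cell combines the XGBoost and LightGBM validation predictions in five ways: an average weighted by each model's ROC AUC, a simple average, soft voting (identical to the simple average here), hard voting (with two voters and floor division, a row is positive only when both models predict positive), and stacking with a logistic regression meta-model. The meta-model is fitted on the validation predictions and then scored on those same rows, so its validation score is optimistic.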

In [None]:
print("\nCalculating ROC AUC scores for ensemble weighting...")
xgb_roc_auc = roc_auc_score(y_valid_full, xgb_valid_proba)
lgb_roc_auc = roc_auc_score(y_valid_full, lgb_valid_proba)

total_auc = xgb_roc_auc + lgb_roc_auc
xgb_weight = xgb_roc_auc / total_auc
lgb_weight = lgb_roc_auc / total_auc

ensemble_weighted_proba = (xgb_valid_proba * xgb_weight) + (lgb_valid_proba * lgb_weight)
ensemble_simple_proba = (xgb_valid_proba + lgb_valid_proba) / 2
ensemble_soft_proba = ensemble_simple_proba.copy()

xgb_valid_pred = xgb_pipeline.predict(X_valid_full)
lgb_valid_pred = lgb_pipeline.predict(X_valid_full)

ensemble_hard_pred = (xgb_valid_pred + lgb_valid_pred) // 2
ensemble_hard_proba = ensemble_hard_pred

stack_X = np.vstack((xgb_valid_proba, lgb_valid_proba)).T

meta_model = LogisticRegression(random_state=42)
meta_model.fit(stack_X, y_valid_full)

ensemble_stacking_proba = meta_model.predict_proba(stack_X)[:, 1]
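
To make the arithmetic of the averaging and voting schemes concrete, here is a small NumPy sketch. All probabilities, votes and AUC values are made up for illustration and do not come from the notebook's models.

```python
import numpy as np

# Made-up validation probabilities and AUC values, for illustration only
xgb_p = np.array([0.10, 0.40, 0.80])
lgb_p = np.array([0.20, 0.30, 0.60])
xgb_auc, lgb_auc = 0.80, 0.78

# AUC-proportional weights always sum to 1
w_xgb = xgb_auc / (xgb_auc + lgb_auc)
w_lgb = lgb_auc / (xgb_auc + lgb_auc)

weighted = w_xgb * xgb_p + w_lgb * lgb_p
simple = (xgb_p + lgb_p) / 2

# With two voters, floor-dividing the vote sum by 2 is a logical AND
xgb_vote = np.array([0, 1, 1])
lgb_vote = np.array([0, 0, 1])
hard = (xgb_vote + lgb_vote) // 2

print(w_xgb, w_lgb)
print(weighted, simple, hard)
```

When the two AUC values are close, both weights land near 0.5, so the weighted and simple averages differ only marginally.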


## Step 12: Compile Ensemble Predictions and Calculate F1 Macro Scores

### Overview

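This cell converts each ensemble's probabilities to class labels using a fixed threshold of 0.325 (hard voting already yields labels) and reports the macro-averaged F1 score on the validation split.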

In [None]:
ensemble_methods = {
    'Weighted Average': ensemble_weighted_proba,
    'Simple Average': ensemble_simple_proba,
    'Soft Voting': ensemble_soft_proba,
    'Hard Voting': ensemble_hard_proba,
    'Stacking': ensemble_stacking_proba
}

manual_threshold = 0.325

f1_scores_dict = {}

print("\nCalculating F1 Macro Scores for ensemble methods:")
for method, proba in ensemble_methods.items():
    if method == 'Hard Voting':
        pred = proba
    else:
        pred = (proba >= manual_threshold).astype(int)
    f1 = f1_score(y_valid_full, pred, average='macro')
    f1_scores_dict[method] = f1
    print(f"F1 Macro Score (Validation) for {method}: {f1:.6f} with Threshold: {manual_threshold}")
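
The 0.325 threshold above is set by hand. One way such a value could be chosen is to sweep candidate thresholds and keep the one with the highest macro F1. The sketch below does this with a small NumPy implementation of binary macro F1; `macro_f1_binary`, `best_threshold` and the toy arrays are hypothetical and not part of the original notebook.

```python
import numpy as np

def macro_f1_binary(y_true, y_pred):
    """Macro F1 for 0/1 labels: unweighted mean of the per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for cls in (0, 1):
        tp = np.sum((y_true == cls) & (y_pred == cls))
        fp = np.sum((y_true != cls) & (y_pred == cls))
        fn = np.sum((y_true == cls) & (y_pred != cls))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members scores 0
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

def best_threshold(y_true, proba, candidates):
    """Return the (threshold, score) pair with the highest macro F1."""
    scored = [(t, macro_f1_binary(y_true, (proba >= t).astype(int)))
              for t in candidates]
    return max(scored, key=lambda pair: pair[1])

# Made-up labels and probabilities, for illustration only
y_toy = np.array([0, 0, 0, 1, 1])
p_toy = np.array([0.12, 0.22, 0.47, 0.42, 0.88])

threshold, score = best_threshold(y_toy, p_toy, np.arange(0.05, 0.95, 0.05))
print(threshold, score)
```

A threshold tuned on the same rows that are used for reporting tends to overstate performance, so a separate fold or cross-validation is usually safer for this search.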


## Step 13: Visualization of F1 Macro Scores Across Ensemble Methods

### Overview

This cell plots the validation macro F1 score of each ensemble method as an annotated bar chart.

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x=list(f1_scores_dict.keys()), y=list(f1_scores_dict.values()), palette='viridis')
plt.ylabel('F1 Macro Score')
plt.title('F1 Macro Score Comparison Across Ensemble Methods')
plt.ylim(0, 1)
for index, value in enumerate(f1_scores_dict.values()):
    plt.text(index, value + 0.01, f"{value:.4f}", ha='center')
plt.show()


## Step 14: Feature Importance Visualizations

### Overview

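This cell defines a helper that plots the top 20 feature importances of a fitted pipeline and applies it to all three models. The scales are not comparable across models: the XGBoost plot uses the `weight` importance type (how often a feature is used in a split), LightGBM's default `feature_importances_` are split counts, and the random forest's are impurity-based.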

In [None]:
def plot_feature_importance(model, model_name, top_n=20):
    if model_name == 'XGBoost':
        booster = model.named_steps['classifier'].get_booster()
        importance = booster.get_score(importance_type='weight')
        importance_df = pd.DataFrame({
            'feature': list(importance.keys()),
            'importance': list(importance.values())
        }).sort_values(by='importance', ascending=False).head(top_n)
    elif model_name == 'LightGBM':
        importance_df = pd.DataFrame({
            'feature': model.named_steps['classifier'].feature_name_,
            'importance': model.named_steps['classifier'].feature_importances_
        }).sort_values(by='importance', ascending=False).head(top_n)
    elif model_name == 'Random Forest':
        importance_df = pd.DataFrame({
            'feature': model.named_steps['classifier'].feature_names_in_,
            'importance': model.named_steps['classifier'].feature_importances_
        }).sort_values(by='importance', ascending=False).head(top_n)
    else:
        raise ValueError("Model name not recognized. Use 'XGBoost', 'LightGBM', or 'Random Forest'.")
    
    plt.figure(figsize=(10, 8))
    sns.barplot(x='importance', y='feature', data=importance_df, palette='viridis')
    plt.title(f'Top {top_n} Feature Importances - {model_name}')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout()
    plt.show()

print("\nPlotting Feature Importances...")
plot_feature_importance(xgb_pipeline, 'XGBoost')
plot_feature_importance(lgb_pipeline, 'LightGBM')
plot_feature_importance(rf_pipeline, 'Random Forest')


## Step 15: Additional Performance Visualizations

### Overview

This cell plots a ROC curve for every ensemble that outputs probabilities (hard voting is skipped) and a confusion matrix for all five ensembles, using the validation split and the 0.325 threshold.

In [None]:
def plot_roc_curve_custom(y_true, y_scores, model_name):
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc_score(y_true, y_scores):.3f})')
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curve - {model_name}')
    plt.legend(loc='lower right')
    plt.grid(True)
    plt.show()

print("\nPlotting ROC Curves for Ensemble Methods...")
for method, proba in ensemble_methods.items():
    if method != 'Hard Voting':
        plot_roc_curve_custom(y_valid_full, proba, method)

def plot_confusion_matrix_custom(y_true, y_pred, model_name):
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title(f'Confusion Matrix - {model_name}')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.show()

print("\nPlotting Confusion Matrices for Ensemble Methods...")
for method, proba in ensemble_methods.items():
    if method == 'Hard Voting':
        pred = proba
    else:
        pred = (proba >= manual_threshold).astype(int)
    plot_confusion_matrix_custom(y_valid_full, pred, method)


## Step 16: Retrain Models on Full Data

### Overview

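This cell refits the three pipelines on the entire training set so that the final models use all labelled data. The ensemble weights and the stacking meta-model from Step 11 are not refitted.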

In [None]:
print("\nRetraining models on the full dataset...")
xgb_pipeline.fit(X, y)
lgb_pipeline.fit(X, y)
rf_pipeline.fit(X, y)
print("Retraining completed.")


## Step 17: Ensemble Methods on Test Data

### Overview

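This cell predicts on the test set and builds the same five ensembles, reusing the AUC weights and the meta-model obtained on the validation split. No score is computed here.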

In [None]:
print("\nPredicting probabilities on test data...")
xgb_test_proba = xgb_pipeline.predict_proba(X_test)[:, 1]
lgb_test_proba = lgb_pipeline.predict_proba(X_test)[:, 1]
rf_test_proba = rf_pipeline.predict_proba(X_test)[:, 1]

ensemble_weighted_test_proba = (xgb_test_proba * xgb_weight) + (lgb_test_proba * lgb_weight)
ensemble_simple_test_proba = (xgb_test_proba + lgb_test_proba) / 2
ensemble_soft_test_proba = ensemble_simple_test_proba.copy()
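# Use a test-specific name so the validation variable `ensemble_hard_pred` is not overwritten
ensemble_hard_test_pred = (xgb_pipeline.predict(X_test) + lgb_pipeline.predict(X_test)) // 2
ensemble_hard_test_proba = ensemble_hard_test_pred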
stack_X_test = np.vstack((xgb_test_proba, lgb_test_proba)).T
ensemble_stacking_test_proba = meta_model.predict_proba(stack_X_test)[:, 1]

test_ensemble_methods = {
    'Weighted Average': ensemble_weighted_test_proba,
    'Simple Average': ensemble_simple_test_proba,
    'Soft Voting': ensemble_soft_test_proba,
    'Hard Voting': ensemble_hard_test_proba,
    'Stacking': ensemble_stacking_test_proba
}


## Step 18: Prepare Submission for the Best Ensemble Method

### Overview

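This cell selects the ensemble with the highest validation macro F1, thresholds its test probabilities at 0.325 (unless it is hard voting, which already yields labels), maps the labels back to `yes`/`no`, and writes `submission.csv`. Row identifiers come from `client_id` if that column exists, otherwise from the DataFrame index.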

In [None]:
best_method = max(f1_scores_dict, key=f1_scores_dict.get)
best_f1_score = f1_scores_dict[best_method]
print(f"\nBest Ensemble Method: {best_method} with F1 Macro Score: {best_f1_score:.6f}")

best_test_proba = test_ensemble_methods[best_method]

if best_method == 'Hard Voting':
    best_test_pred = best_test_proba
else:
    best_test_pred = (best_test_proba >= manual_threshold).astype(int)

pred_mapped = np.where(best_test_pred == 1, 'yes', 'no')

if 'client_id' in test_df.columns:
    identifier = test_df['client_id']
else:
    identifier = test_df.index

submission = pd.DataFrame({
    'id': identifier,
    'target': pred_mapped
})

submission_filename = 'submission.csv'

submission.to_csv(submission_filename, index=False)
print(f"\nSubmission File Created for the Best Ensemble Method ({best_method}): {submission_filename}")
print(submission.head())