# Bank Marketing Prediction — ML Coursework

**Objective:** Predict whether a client will subscribe to a term deposit (`y`: yes/no) based on a Portuguese bank's telemarketing campaign data.

**Dataset:** [UCI Bank Marketing Dataset](https://archive.ics.uci.edu/dataset/222/bank+marketing) — 41,188 records, 20 input features + 1 target.

---

## Table of Contents
1. [Data Loading & Exploration](#1)
2. [Task A — Preprocessing](#2)
   - 2.1 Missing Values & Outliers
   - 2.2 Feature Engineering & Duration Handling
   - 2.3 Scaling / Standardisation
3. [Task B — Model Building (LR & SVM)](#3)
4. [Task C — Discussion & Comparison](#4)
5. [Model Export for Deployment](#5)

---
## 0. Imports & Configuration

In [None]:
# Core
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='whitegrid', palette='muted', font_scale=1.1)
%matplotlib inline

# Preprocessing & Modelling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import (
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    roc_auc_score, roc_curve, accuracy_score, f1_score,
    precision_score, recall_score
)

# Model persistence
import joblib
import json
import os

RANDOM_STATE = 42
TEST_SIZE = 0.2

# Create output directories
os.makedirs('figures', exist_ok=True)
os.makedirs('model', exist_ok=True)

print('All imports loaded successfully.')

<a id='1'></a>
## 1. Data Loading & Exploration

In [None]:
# Load dataset (semicolon-separated)
df = pd.read_csv('data/bank-additional-full.csv', sep=';')
print(f'Dataset shape: {df.shape}')
df.head()

In [None]:
# Basic info
df.info()

In [None]:
# Statistical summary
df.describe(include='all').T

In [None]:
# Target variable distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Count plot
sns.countplot(x='y', data=df, ax=axes[0], palette='viridis')
axes[0].set_title('Target Variable Distribution (Count)')
axes[0].set_xlabel('Subscribed (y)')
axes[0].set_ylabel('Count')

# Percentage
df['y'].value_counts(normalize=True).plot.pie(
    autopct='%1.1f%%', ax=axes[1], colors=['#3498db', '#e74c3c'],
    startangle=90, explode=[0, 0.05]
)
axes[1].set_title('Target Variable Distribution (%)')
axes[1].set_ylabel('')

plt.tight_layout()
plt.savefig('figures/target_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

no_count = df['y'].value_counts()['no']
yes_count = df['y'].value_counts()['yes']
print(f'\nClass imbalance ratio: {no_count / yes_count:.2f}:1 (no:yes)')

<a id='2'></a>
## 2. Task A — Preprocessing

### 2.1 Missing Values & Outliers

In [None]:
# Check for explicit missing values
print('=== Null values per column ===')
print(df.isnull().sum())
print(f'\nTotal nulls: {df.isnull().sum().sum()}')

In [None]:
# Check for 'unknown' values (implicit missing values in this dataset)
print('=== "unknown" counts per categorical column ===')
cat_cols = df.select_dtypes(include='object').columns
for col in cat_cols:
    unknown_count = (df[col] == 'unknown').sum()
    if unknown_count > 0:
        pct = unknown_count / len(df) * 100
        print(f'  {col:15s}: {unknown_count:6d} ({pct:.2f}%)')

In [None]:
# ASSUMPTION: We treat 'unknown' as a valid category rather than imputing.
# Rationale: 
#   - 'unknown' in default (8,597 / 20.9%) is too large to drop.
#   - The bank recorded 'unknown' deliberately; it carries information.
#   - The OneHotEncoder will create a separate indicator for 'unknown'.

# Check for duplicates
n_dup = df.duplicated().sum()
print(f'Duplicate rows: {n_dup}')
if n_dup > 0:
    df = df.drop_duplicates()
    print(f'After removing duplicates: {df.shape}')

In [None]:
# Outlier analysis for numeric columns
numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
print(f'Numeric columns: {numeric_cols}\n')

fig, axes = plt.subplots(3, 4, figsize=(18, 12))
axes = axes.flatten()

for i, col in enumerate(numeric_cols):
    sns.boxplot(y=df[col], ax=axes[i], color='#5dade2')
    axes[i].set_title(col, fontsize=11)

# Hide unused subplots
for j in range(len(numeric_cols), len(axes)):
    axes[j].set_visible(False)

plt.suptitle('Box Plots — Outlier Detection for Numeric Features', fontsize=14, y=1.01)
plt.tight_layout()
plt.savefig('figures/outlier_boxplots.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Outlier summary using IQR method
print('=== Outlier Summary (IQR Method) ===')
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = ((df[col] < lower) | (df[col] > upper)).sum()
    pct = outliers / len(df) * 100
    if outliers > 0:
        print(f'  {col:20s}: {outliers:5d} outliers ({pct:.2f}%)')

# ASSUMPTION: We do NOT remove outliers.
# Rationale:
#   - Outliers in 'campaign', 'age', etc. represent real client behaviour.
#   - StandardScaler puts all numeric features on a common scale for LR and SVM,
#     but it does not cap extreme values (its mean and std are outlier-sensitive).
#   - Removing outliers would lose valuable data from an already imbalanced dataset.
print('\n=> Decision: Retain outliers. StandardScaler rescales them but does not cap them.')
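
To see how an extreme value behaves under scaling, the toy sketch below compares `StandardScaler` with `RobustScaler` on a made-up column. The values are illustrative and not taken from the dataset, and `RobustScaler` is not used anywhere else in this notebook; it appears here only as a point of comparison.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Hypothetical column: four typical values and one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

z_std = StandardScaler().fit_transform(x).ravel()   # (x - mean) / std
z_rob = RobustScaler().fit_transform(x).ravel()     # (x - median) / IQR

# Spread of the four typical values after each transform
inlier_spread_std = z_std[3] - z_std[0]
inlier_spread_rob = z_rob[3] - z_rob[0]

print('StandardScaler:', np.round(z_std, 3))
print('RobustScaler:  ', np.round(z_rob, 3))
print(f'Inlier spread: standard={inlier_spread_std:.3f}, robust={inlier_spread_rob:.3f}')
```

On this toy column the extreme value inflates the mean and standard deviation, so the typical values are compressed into a narrow band under `StandardScaler`, while the median/IQR transform keeps them spread out.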

### 2.2 Feature Engineering & Duration Handling

**IMPORTANT — Duration Leakage Note:**

> `duration` is the last contact duration in seconds. It highly affects the output target. However, `duration` is **not known before a call is performed**, and after the call ends, `y` is already known. Therefore, `duration` should only be included for **benchmark purposes** and must be **discarded for a realistic predictive model**.

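We will build **three versions**:
1. **Benchmark model**: includes `duration` (to show its predictive power)
2. **Realistic model**: excludes `duration` but keeps the other campaign and economic features
3. **Deployment model**: uses only the 7 client-profile form fields (the model that is exported)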

In [None]:
# Encode target variable
le = LabelEncoder()
df['y_encoded'] = le.fit_transform(df['y'])  # no=0, yes=1
target_map = dict(zip(le.classes_, le.transform(le.classes_)))
print(f'Target encoding: {target_map}')

# Define feature groups
# --- Features for DEPLOYMENT (7 client-profile form fields; no duration, campaign, or economic features) ---
DEPLOYMENT_FEATURES = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan']

# --- All features for BENCHMARK model ---
ALL_FEATURES = [c for c in df.columns if c not in ['y', 'y_encoded']]

# --- Realistic model features (exclude duration but keep other campaign/economic features) ---
REALISTIC_FEATURES = [c for c in ALL_FEATURES if c != 'duration']

print(f'\nDeployment features ({len(DEPLOYMENT_FEATURES)}): {DEPLOYMENT_FEATURES}')
print(f'Realistic features  ({len(REALISTIC_FEATURES)}): {REALISTIC_FEATURES}')
print(f'All features        ({len(ALL_FEATURES)}): {ALL_FEATURES}')
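
The leakage rule can also be enforced in code rather than by convention. The sketch below is one possible guard; `assert_no_leakage` and `LEAKY_COLUMNS` are hypothetical names, and the column list is typed out from the UCI documentation so the example runs without the CSV file.

```python
# Column names as listed in the UCI bank-additional documentation (assumed
# here so the sketch runs without loading the CSV)
ALL_COLUMNS = [
    'age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
    'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
    'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
    'cons.conf.idx', 'euribor3m', 'nr.employed',
]
LEAKY_COLUMNS = {'duration'}  # only known after the call has ended


def assert_no_leakage(features, leaky=LEAKY_COLUMNS):
    """Hypothetical guard: raise if a feature list contains a post-call column."""
    found = sorted(set(features) & set(leaky))
    if found:
        raise ValueError(f'Leaky features present: {found}')
    return True


realistic = [c for c in ALL_COLUMNS if c not in LEAKY_COLUMNS]
deployment = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan']

assert_no_leakage(realistic)
assert_no_leakage(deployment)
print(f'{len(realistic)} realistic features, {len(deployment)} deployment features')

# The benchmark feature set is expected to fail the check
try:
    assert_no_leakage(ALL_COLUMNS)
except ValueError as err:
    print(err)
```

Calling such a guard just before fitting any pipeline intended for production would make an accidental `duration` column fail loudly.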

In [None]:
# Correlation heatmap for numeric features
plt.figure(figsize=(14, 10))
corr_cols = numeric_cols + ['y_encoded']
corr_matrix = df[corr_cols].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',
            cmap='RdBu_r', center=0, vmin=-1, vmax=1,
            linewidths=0.5, square=True)
plt.title('Correlation Matrix — Numeric Features + Target', fontsize=14)
plt.tight_layout()
plt.savefig('figures/correlation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()

print('\n=== Correlation with target (y_encoded) ===')
target_corr = corr_matrix['y_encoded'].drop('y_encoded').sort_values(ascending=False)
print(target_corr)

### 2.3 Scaling / Standardisation

We use `sklearn.pipeline.Pipeline` + `ColumnTransformer` to ensure preprocessing is **identical** during training and deployment.

In [None]:
def build_preprocessor(feature_list):
    """
    Build a ColumnTransformer that:
      - StandardScaler on numeric features
      - OneHotEncoder on categorical features
    """
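    # Numeric dtypes (any int/float width) are scaled; every other column is
    # treated as categorical. Note: this reads dtypes from the global `df`.
    num_features = [f for f in feature_list if pd.api.types.is_numeric_dtype(df[f])]
    cat_features = [f for f in feature_list if f not in num_features]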

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), num_features),
            ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_features)
        ],
        remainder='drop'
    )
    return preprocessor, num_features, cat_features

print('Preprocessor builder ready.')
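
Two encoder behaviours matter for deployment: `'unknown'` is encoded like any other category, and `handle_unknown='ignore'` maps a value never seen during fitting to an all-zero row. The toy sketch below illustrates both on a made-up `marital`-style column; the values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training column; 'unknown' is an ordinary category here
train = np.array([['married'], ['single'], ['unknown']], dtype=object)
# At prediction time, 'widowed' was never seen during fit
new = np.array([['single'], ['unknown'], ['widowed']], dtype=object)

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)
encoded = enc.transform(new).toarray()  # default output is sparse

print(enc.categories_[0].tolist())
print(encoded)
```

The all-zero row means an unseen category is silently tolerated, not rejected, which is convenient for a form-driven demo but worth validating upstream.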

In [None]:
# Demonstrate the effect of scaling on the DEPLOYMENT feature set
deploy_preprocessor, deploy_num, deploy_cat = build_preprocessor(DEPLOYMENT_FEATURES)

print(f'Numeric features to scale: {deploy_num}')
print(f'Categorical features to encode: {deploy_cat}')

# Show before/after scaling for numeric feature 'age'
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before scaling
axes[0].hist(df['age'], bins=40, color='#e74c3c', edgecolor='white', alpha=0.8)
axes[0].set_title('Age — Before Scaling (Raw)', fontsize=13)
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Frequency')
age_mean = df['age'].mean()
age_std = df['age'].std()
axes[0].axvline(age_mean, color='black', linestyle='--', label=f'Mean={age_mean:.1f}')
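# Mark mean ± 1 std; a line at x=std alone has no meaning on the age axis
axes[0].axvline(age_mean - age_std, color='gray', linestyle=':', label=f'Mean ± Std (Std={age_std:.1f})')
axes[0].axvline(age_mean + age_std, color='gray', linestyle=':')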
axes[0].legend()

# After scaling
scaler_demo = StandardScaler()
age_scaled = scaler_demo.fit_transform(df[['age']])
axes[1].hist(age_scaled, bins=40, color='#2ecc71', edgecolor='white', alpha=0.8)
axes[1].set_title('Age — After StandardScaler', fontsize=13)
axes[1].set_xlabel('Scaled Age (z-score)')
axes[1].set_ylabel('Frequency')
scaled_mean = age_scaled.mean()
scaled_std = age_scaled.std()
axes[1].axvline(scaled_mean, color='black', linestyle='--', label=f'Mean={scaled_mean:.4f}')
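# Mark mean ± 1 std on the standardised axis
axes[1].axvline(scaled_mean - scaled_std, color='gray', linestyle=':', label=f'Mean ± Std (Std={scaled_std:.4f})')
axes[1].axvline(scaled_mean + scaled_std, color='gray', linestyle=':')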
axes[1].legend()

plt.suptitle('Effect of StandardScaler on Numeric Feature', fontsize=15, y=1.02)
plt.tight_layout()
plt.savefig('figures/scaling_effect_age.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Show scaling effect across ALL numeric features (before vs after)
real_preprocessor, real_num, real_cat = build_preprocessor(REALISTIC_FEATURES)

fig, axes = plt.subplots(2, 1, figsize=(16, 8))

# Before
df[real_num].boxplot(ax=axes[0], vert=False, patch_artist=True,
                     boxprops=dict(facecolor='#e74c3c', alpha=0.6))
axes[0].set_title('Numeric Features — BEFORE Scaling', fontsize=13)
axes[0].set_xlabel('Raw Value')

# After
scaler_all = StandardScaler()
df_scaled = pd.DataFrame(scaler_all.fit_transform(df[real_num]), columns=real_num)
df_scaled.boxplot(ax=axes[1], vert=False, patch_artist=True,
                  boxprops=dict(facecolor='#2ecc71', alpha=0.6))
axes[1].set_title('Numeric Features — AFTER StandardScaler', fontsize=13)
axes[1].set_xlabel('Standardised Value (z-score)')

plt.suptitle('Effect of Scaling / Standardisation on All Numeric Features', fontsize=15, y=1.02)
plt.tight_layout()
plt.savefig('figures/scaling_effect_all.png', dpi=150, bbox_inches='tight')
plt.show()

print('\n=== Before Scaling ===')
print(df[real_num].describe().loc[['mean', 'std', 'min', 'max']].T)
print('\n=== After Scaling ===')
print(df_scaled.describe().loc[['mean', 'std', 'min', 'max']].T)
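
The demonstration cells above fit the scaler on the full dataset purely for illustration. Inside a `Pipeline`, the scaler is fitted on whatever data is passed to `fit`, so the test set is transformed with training statistics. A minimal sketch of that behaviour on synthetic ages (not the real column) might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=(200, 1)).astype(float)  # synthetic ages

train, test = train_test_split(ages, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(train)   # statistics come from the training rows only
train_z = scaler.transform(train)
test_z = scaler.transform(test)        # reuses the training mean and std

print(f'Train mean/std after scaling: {train_z.mean():.3f} / {train_z.std():.3f}')
print(f'Test  mean/std after scaling: {test_z.mean():.3f} / {test_z.std():.3f}')
```

The test-set mean and standard deviation will generally be close to, but not exactly, 0 and 1, which is expected when the statistics come from the training rows.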

<a id='3'></a>
## 3. Task B — Model Building

**Train/Test Split:** 80/20 with stratification to preserve class distribution.

We build three pipelines for each model type:
1. **Realistic model** (excludes `duration`): uses every feature known before the call
2. **Benchmark model** (includes `duration`): shows duration's influence
3. **Deployment model** (7 form fields only): the variant exported in Section 5

In [None]:
# Prepare target
y = df['y_encoded']

# =============================
# SPLIT 1: Realistic features
# =============================
X_real = df[REALISTIC_FEATURES]
X_real_train, X_real_test, y_real_train, y_real_test = train_test_split(
    X_real, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y
)
print(f'Realistic split — Train: {X_real_train.shape}, Test: {X_real_test.shape}')

# =============================
# SPLIT 2: All features (benchmark)
# =============================
X_all = df[ALL_FEATURES]
X_all_train, X_all_test, y_all_train, y_all_test = train_test_split(
    X_all, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y
)
print(f'Benchmark split  — Train: {X_all_train.shape}, Test: {X_all_test.shape}')

# =============================
# SPLIT 3: Deployment features (7 form fields only)
# =============================
X_deploy = df[DEPLOYMENT_FEATURES]
X_deploy_train, X_deploy_test, y_deploy_train, y_deploy_test = train_test_split(
    X_deploy, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y
)
print(f'Deployment split — Train: {X_deploy_train.shape}, Test: {X_deploy_test.shape}')

train_dist = np.bincount(y_real_train)
test_dist = np.bincount(y_real_test)
print(f'\nTarget distribution (train): {train_dist} | (test): {test_dist}')
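
The three splits above share the same `y`, `random_state`, and `stratify` argument, so they should select the same rows. The toy sketch below checks that property, along with the preserved class ratio, on synthetic data; the arrays and the 90/10 imbalance are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)      # synthetic 90/10 imbalance
X_a = np.arange(100).reshape(-1, 1)    # row ids
X_b = X_a * 10                         # a second "view" of the same rows

Xa_tr, Xa_te, ya_tr, ya_te = train_test_split(
    X_a, y, test_size=0.2, random_state=42, stratify=y)
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(
    X_b, y, test_size=0.2, random_state=42, stratify=y)

# If the same rows were chosen, the second view is just 10x the first
same_rows = np.array_equal(Xa_te * 10, Xb_te)
print('Same test rows in both splits:', same_rows)
print('Positives in train/test:', ya_tr.sum(), ya_te.sum())
```

Because the rows line up, metrics for the realistic, benchmark, and deployment variants are computed on the same test clients and can be compared directly.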

In [None]:
def build_pipeline(features, model, model_name):
    """Build a complete sklearn Pipeline: preprocessor -> model."""
    preprocessor, num_f, cat_f = build_preprocessor(features)
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    return pipe

def evaluate_model(pipe, X_test, y_test, model_name, dataset_label):
    """Evaluate a trained pipeline and return metrics dict."""
    y_pred = pipe.predict(X_test)
    
    # Probabilities for ROC-AUC (if available)
    try:
        y_prob = pipe.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test, y_prob)
    except AttributeError:
        y_prob = pipe.decision_function(X_test)
        roc_auc = roc_auc_score(y_test, y_prob)
    
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    sep = '=' * 60
    print(f'\n{sep}')
    print(f'{model_name} — {dataset_label}')
    print(f'{sep}')
    print(f'Accuracy:  {acc:.4f}')
    print(f'Precision: {prec:.4f}')
    print(f'Recall:    {rec:.4f}')
    print(f'F1 Score:  {f1:.4f}')
    print(f'ROC-AUC:   {roc_auc:.4f}')
    print('\nClassification Report:')
    print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))
    
    return {
        'model': model_name, 'dataset': dataset_label,
        'accuracy': acc, 'precision': prec, 'recall': rec,
        'f1': f1, 'roc_auc': roc_auc,
        'y_pred': y_pred, 'y_prob': y_prob
    }

print('Pipeline builder & evaluator ready.')
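
For reference, the metrics reported by `evaluate_model` can be derived by hand from a confusion matrix. The counts below are hypothetical and chosen only to show how accuracy can look healthy while recall stays low on imbalanced data.

```python
# Hypothetical confusion-matrix counts for 1,000 test clients
tn, fp, fn, tp = 850, 50, 60, 40

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted 'yes', how many were right
recall = tp / (tp + fn)      # of actual 'yes', how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f'Accuracy:  {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall:    {recall:.4f}')
print(f'F1 Score:  {f1:.4f}')
```

With these made-up counts, accuracy is 0.89 even though only 40 of the 100 actual subscribers are found.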

### 3.1 Logistic Regression

In [None]:
# --- Logistic Regression: Realistic (no duration) ---
lr_real_pipe = build_pipeline(
    REALISTIC_FEATURES,
    LogisticRegression(max_iter=1000, class_weight='balanced', random_state=RANDOM_STATE),
    'LR'
)
lr_real_pipe.fit(X_real_train, y_real_train)
lr_real_results = evaluate_model(lr_real_pipe, X_real_test, y_real_test, 'Logistic Regression', 'Realistic (no duration)')

In [None]:
# --- Logistic Regression: Benchmark (with duration) ---
lr_bench_pipe = build_pipeline(
    ALL_FEATURES,
    LogisticRegression(max_iter=1000, class_weight='balanced', random_state=RANDOM_STATE),
    'LR'
)
lr_bench_pipe.fit(X_all_train, y_all_train)
lr_bench_results = evaluate_model(lr_bench_pipe, X_all_test, y_all_test, 'Logistic Regression', 'Benchmark (with duration)')

### 3.2 Support Vector Machine (SVM)

In [None]:
# --- SVM: Realistic (no duration) ---
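# probability=True enables predict_proba (fitted via internal 5-fold cross-validation, so training is slower).
# ROC-AUC alone could also be computed from decision_function, which evaluate_model falls back to.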
svm_real_pipe = build_pipeline(
    REALISTIC_FEATURES,
    SVC(kernel='rbf', class_weight='balanced', probability=True, random_state=RANDOM_STATE, C=1.0),
    'SVM'
)
svm_real_pipe.fit(X_real_train, y_real_train)
svm_real_results = evaluate_model(svm_real_pipe, X_real_test, y_real_test, 'SVM (RBF)', 'Realistic (no duration)')

In [None]:
# --- SVM: Benchmark (with duration) ---
svm_bench_pipe = build_pipeline(
    ALL_FEATURES,
    SVC(kernel='rbf', class_weight='balanced', probability=True, random_state=RANDOM_STATE, C=1.0),
    'SVM'
)
svm_bench_pipe.fit(X_all_train, y_all_train)
svm_bench_results = evaluate_model(svm_bench_pipe, X_all_test, y_all_test, 'SVM (RBF)', 'Benchmark (with duration)')
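
ROC-AUC depends only on how the scores rank the samples, which is why `evaluate_model` can fall back to `decision_function` when `predict_proba` is unavailable. The sketch below illustrates this with made-up margins; the logistic squashing is a stand-in for any increasing transform and is not the calibration that `SVC` fits internally.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
margins = np.array([-2.0, -0.5, 0.3, 1.2, 0.1, -0.1])  # made-up decision_function outputs
squashed = 1.0 / (1.0 + np.exp(-margins))               # increasing map, ranking unchanged

auc_margins = roc_auc_score(y_true, margins)
auc_squashed = roc_auc_score(y_true, squashed)

print(f'AUC from raw margins:     {auc_margins:.4f}')
print(f'AUC from squashed scores: {auc_squashed:.4f}')
```

If only ROC-AUC were needed, training with `probability=False` would likely be faster; `probability=True` matters when calibrated-looking probabilities are shown to a user.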

### 3.3 Deployment Model (7 Features Only)

In [None]:
# Train LR on deployment features (7 form fields)
lr_deploy_pipe = build_pipeline(
    DEPLOYMENT_FEATURES,
    LogisticRegression(max_iter=1000, class_weight='balanced', random_state=RANDOM_STATE),
    'LR'
)
lr_deploy_pipe.fit(X_deploy_train, y_deploy_train)
lr_deploy_results = evaluate_model(lr_deploy_pipe, X_deploy_test, y_deploy_test, 'Logistic Regression', 'Deployment (7 features)')

# Train SVM on deployment features (7 form fields)
svm_deploy_pipe = build_pipeline(
    DEPLOYMENT_FEATURES,
    SVC(kernel='rbf', class_weight='balanced', probability=True, random_state=RANDOM_STATE, C=1.0),
    'SVM'
)
svm_deploy_pipe.fit(X_deploy_train, y_deploy_train)
svm_deploy_results = evaluate_model(svm_deploy_pipe, X_deploy_test, y_deploy_test, 'SVM (RBF)', 'Deployment (7 features)')

### 3.4 Visualisation — Confusion Matrices & ROC Curves

In [None]:
# Confusion Matrices
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

results_list = [
    (lr_real_results, y_real_test, 'LR — Realistic'),
    (lr_bench_results, y_all_test, 'LR — Benchmark'),
    (lr_deploy_results, y_deploy_test, 'LR — Deployment'),
    (svm_real_results, y_real_test, 'SVM — Realistic'),
    (svm_bench_results, y_all_test, 'SVM — Benchmark'),
    (svm_deploy_results, y_deploy_test, 'SVM — Deployment'),
]

for idx, (res, y_t, title) in enumerate(results_list):
    ax = axes[idx // 3, idx % 3]
    cm = confusion_matrix(y_t, res['y_pred'])
    ConfusionMatrixDisplay(cm, display_labels=['No', 'Yes']).plot(ax=ax, cmap='Blues', colorbar=False)
    ax.set_title(title, fontsize=12)

plt.suptitle('Confusion Matrices — All Model Variants', fontsize=15, y=1.02)
plt.tight_layout()
plt.savefig('figures/confusion_matrices.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# ROC Curves
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

roc_groups = [
    ('Realistic (no duration)', lr_real_results, svm_real_results, y_real_test),
    ('Benchmark (with duration)', lr_bench_results, svm_bench_results, y_all_test),
    ('Deployment (7 features)', lr_deploy_results, svm_deploy_results, y_deploy_test),
]

for idx, (label, lr_res, svm_res, y_t) in enumerate(roc_groups):
    ax = axes[idx]
    
    # LR
    fpr_lr, tpr_lr, _ = roc_curve(y_t, lr_res['y_prob'])
    lr_auc = lr_res['roc_auc']
    ax.plot(fpr_lr, tpr_lr, label=f'LR (AUC={lr_auc:.3f})', linewidth=2)
    
    # SVM
    fpr_svm, tpr_svm, _ = roc_curve(y_t, svm_res['y_prob'])
    svm_auc = svm_res['roc_auc']
    ax.plot(fpr_svm, tpr_svm, label=f'SVM (AUC={svm_auc:.3f})', linewidth=2)
    
    # Diagonal
    ax.plot([0, 1], [0, 1], 'k--', alpha=0.4, label='Random')
    
    ax.set_title(label, fontsize=12)
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.legend(fontsize=10)
    ax.grid(alpha=0.3)

plt.suptitle('ROC Curves — LR vs SVM', fontsize=15, y=1.02)
plt.tight_layout()
plt.savefig('figures/roc_curves.png', dpi=150, bbox_inches='tight')
plt.show()

<a id='4'></a>
## 4. Task C — Discussion & Comparison

In [None]:
# Summary comparison table
all_results = [lr_real_results, lr_bench_results, lr_deploy_results,
               svm_real_results, svm_bench_results, svm_deploy_results]

summary_rows = []
for r in all_results:
    summary_rows.append({
        'Model': r['model'],
        'Dataset': r['dataset'],
        'Accuracy': round(r['accuracy'], 4),
        'Precision': round(r['precision'], 4),
        'Recall': round(r['recall'], 4),
        'F1': round(r['f1'], 4),
        'ROC-AUC': round(r['roc_auc'], 4)
    })

summary_df = pd.DataFrame(summary_rows)

print('\n' + '=' * 60)
print('           MODEL COMPARISON SUMMARY')
print('=' * 60)
display(summary_df.style.set_properties(**{'text-align': 'center'}))

### Discussion

#### 1. Duration Leakage Impact
- **Benchmark models** (with `duration`) achieve significantly higher ROC-AUC and recall than realistic models.
- This confirms the dataset authors' warning: `duration` is a **post-hoc variable** — it is only known *after* the call, at which point `y` is already determined.
- **Including `duration` in production would constitute data leakage**, as future (unknown) information is used to predict the outcome.
- The benchmark models exist purely to demonstrate this effect and should **never** be deployed.

#### 2. LR vs SVM Comparison

| Aspect | Logistic Regression | SVM (RBF) |
|--------|--------------------|-----------|
| **Speed** | Fast to train (~seconds) | Slower (~minutes on 41K rows) |
| **Interpretability** | High — coefficients show feature importance | Low — kernel-based, black box |
| **Handling Imbalance** | `class_weight='balanced'` works well | Same mechanism, similar effect |
| **Non-linearity** | Linear decision boundary in the encoded feature space | RBF kernel captures non-linear patterns |
| **Scalability** | Scales well to large datasets | O(n^2) to O(n^3) complexity — poor scaling |
| **Deployment** | Lightweight model file | Larger model file (stores support vectors) |

#### 3. Class Imbalance
- The dataset is **heavily imbalanced** (~88.7% 'no' vs 11.3% 'yes').
- A trivial classifier that always predicts 'no' would already reach ~89% accuracy, so an unweighted model can score well on accuracy while missing most subscribers.
- We use `class_weight='balanced'` to penalise misclassification of the minority class.
- **F1 and Recall** are more appropriate metrics than accuracy for this problem.
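
The numbers behind these points can be reproduced in a few lines. The sketch below uses synthetic labels in the stated 88.7% / 11.3% proportion, not the real target column, and applies the `'balanced'` heuristic as documented by scikit-learn, `n_samples / (n_classes * np.bincount(y))`.

```python
import numpy as np

# Synthetic labels matching the stated ~88.7% / 11.3% split
y = np.array([0] * 887 + [1] * 113)

counts = np.bincount(y)
majority_accuracy = counts.max() / counts.sum()   # always predict 'no'

# 'balanced' heuristic: n_samples / (n_classes * np.bincount(y))
balanced_weights = len(y) / (len(counts) * counts)

print(f'Majority-class accuracy: {majority_accuracy:.3f}')
print(f'Balanced weights (no, yes): {np.round(balanced_weights, 3)}')
```

Under this weighting each class contributes the same total weight to the loss, so an error on a 'yes' client costs roughly eight times as much as an error on a 'no' client.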

#### 4. Deployment Model Selection
- **Logistic Regression** is selected for deployment because:
  1. Performance comparable to or better than SVM on the deployment features (see the summary table above).
  2. Significantly faster inference — critical for a web-based POC.
  3. Smaller model file, which is easier to containerise.
  4. Better interpretability — important for a banking use case.

#### 5. Limitations
- The deployment model uses only 7 client-profile features (no campaign or economic context).
- This limits predictive power but satisfies the form requirement.
- In a real-world system, campaign and economic features would be injected server-side.

<a id='5'></a>
## 5. Model Export for Deployment

In [None]:
# Save the deployment pipeline (LR with 7 features)
model_path = 'model/lr_deployment_pipeline.joblib'
joblib.dump(lr_deploy_pipe, model_path)
model_size = os.path.getsize(model_path) / 1024
print(f'Deployment model saved: {model_path} ({model_size:.1f} KB)')

# Save feature names for validation
feature_info = {
    'features': DEPLOYMENT_FEATURES,
    'categorical_values': {
        col: sorted(df[col].unique().tolist())
        for col in DEPLOYMENT_FEATURES if df[col].dtype == 'object'
    }
}
with open('model/feature_info.json', 'w') as f:
    json.dump(feature_info, f, indent=2)

print('Feature info saved: model/feature_info.json')
print('\n=== Feature Info ===')
print(json.dumps(feature_info, indent=2))

In [None]:
# Quick sanity check — test the saved model
loaded_pipe = joblib.load(model_path)

# Simulate a form submission
sample_input = pd.DataFrame([{
    'age': 35,
    'job': 'admin.',
    'marital': 'married',
    'education': 'university.degree',
    'default': 'no',
    'housing': 'yes',
    'loan': 'no'
}])

prediction = loaded_pipe.predict(sample_input)[0]
proba = loaded_pipe.predict_proba(sample_input)[0]

result_label = 'YES' if prediction == 1 else 'NO'
sample_dict = sample_input.to_dict(orient='records')[0]
print(f'Sample input: {sample_dict}')
print(f'Prediction:   {result_label}')
print(f'Probability:  No={proba[0]:.3f}, Yes={proba[1]:.3f}')
print('\nModel loads and predicts correctly; ready for deployment!')
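
Because `handle_unknown='ignore'` silently encodes unseen categories as zeros, a web form would probably benefit from checking submissions against `feature_info.json` before calling the model. The sketch below shows one way to do that; `validate_submission` is a hypothetical helper, and `example_info` is a trimmed-down stand-in for the saved file.

```python
def validate_submission(payload, feature_info, numeric_fields=('age',)):
    """Hypothetical helper: return a list of problems found in a form payload."""
    problems = []
    # Every expected feature must be present
    for name in feature_info['features']:
        if name not in payload:
            problems.append(f'missing field: {name}')
    # Categorical values must be ones the encoder saw during training
    for name, allowed in feature_info['categorical_values'].items():
        if name in payload and payload[name] not in allowed:
            problems.append(f'unexpected value for {name}: {payload[name]!r}')
    # Numeric fields must actually be numbers
    for name in numeric_fields:
        if name in payload and not isinstance(payload[name], (int, float)):
            problems.append(f'{name} must be numeric')
    return problems


# Trimmed-down stand-in for model/feature_info.json (two fields only)
example_info = {
    'features': ['age', 'housing'],
    'categorical_values': {'housing': ['no', 'unknown', 'yes']},
}

print(validate_submission({'age': 35, 'housing': 'yes'}, example_info))
print(validate_submission({'age': 'old', 'housing': 'maybe'}, example_info))
print(validate_submission({'housing': 'no'}, example_info))
```

An empty list means the payload can be wrapped in a one-row `DataFrame` and passed to `loaded_pipe.predict`, as in the sanity check above.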