# MLOps-Flavoured Churn Project – API + Batch Scoring

This notebook turns a churn model into something closer to a **real service**:

- Train a churn model on the **Bank Customer Churn** dataset.
- Package the model as a reusable **scikit-learn pipeline**.
- **Persist** the model artifact to disk (for reuse outside the notebook).
- Sketch a small **FastAPI** service to serve online predictions.
- Implement a simple **batch scoring** script for CSV files.

The goal is not full production MLOps but a clean, realistic
**deployment-ready shape** that you can extend later (e.g. Docker, CI/CD, monitoring).

We assume the dataset is available at:

```text
data/Churn_Modelling.csv
```


## 1. Imports and configuration

We will use:

- `pandas`, `numpy` – data handling.
- `scikit-learn` – model and pipeline.
- `joblib` – model persistence.
- `FastAPI`, `pydantic` – API sketch (code as text you can move into a .py file).

Only pandas, numpy, scikit-learn and joblib are required to run this notebook;
the FastAPI code here is **example code** meant for a separate service.


In [None]:
from __future__ import annotations

from pathlib import Path
from typing import Dict, List

import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

import joblib

# Optional: only needed if you want to run the API code from this notebook
try:  # noqa: SIM105
    import fastapi  # type: ignore[import]
    import pydantic  # type: ignore[import]
except ImportError:
    fastapi = None
    pydantic = None

RANDOM_STATE: int = 42
np.random.seed(RANDOM_STATE)

DATA_PATH: Path = Path("data") / "Churn_Modelling.csv"
ARTIFACTS_DIR: Path = Path("artifacts")
MODEL_PATH: Path = ARTIFACTS_DIR / "bank_churn_model.joblib"

if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"Data file not found at {DATA_PATH.resolve()}. "
        "Please download the Bank Customer Churn CSV and place it under the 'data/' directory."
    )

ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)


## 2. Load and clean the churn dataset

We reuse the **Bank Customer Churn** dataset used in previous projects.

Columns (typical):

- IDs: `RowNumber`, `CustomerId`, `Surname`.
- Features: `CreditScore`, `Geography`, `Gender`, `Age`, `Tenure`,
  `Balance`, `NumOfProducts`, `HasCrCard`, `IsActiveMember`, `EstimatedSalary`.
- Target: `Exited` (1 = churn, 0 = stayed).

For the model we:

- Drop pure identifier columns.
- Make sure `Exited` is integer 0/1.


In [None]:
def load_bank_churn_data(path: Path) -> pd.DataFrame:
    """Load the bank customer churn dataset.

    Args:
        path: Path to the CSV file.

    Returns:
        DataFrame with the bank churn data.
    """
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path!s}")

    df: pd.DataFrame = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"Loaded DataFrame is empty: {path!s}")

    return df


def clean_bank_churn_data(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Clean the bank churn dataset.

    - Drop identifier columns (if present).
    - Ensure `Exited` exists and is integer 0/1.

    Args:
        raw_df: Raw bank churn DataFrame.

    Returns:
        Cleaned DataFrame.
    """
    df = raw_df.copy()

    id_cols: List[str] = ["RowNumber", "CustomerId", "Surname"]
    drop_cols: List[str] = [c for c in id_cols if c in df.columns]

    if drop_cols:
        df = df.drop(columns=drop_cols)
        print(f"Dropped identifier columns: {drop_cols}")

    if "Exited" not in df.columns:
        raise ValueError("Target column 'Exited' not found in DataFrame.")

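    df["Exited"] = df["Exited"].astype(int)

    # The cast alone does not guarantee a binary target, so check the values.
    unexpected = sorted(set(df["Exited"].unique().tolist()) - {0, 1})
    if unexpected:
        raise ValueError(f"'Exited' must contain only 0/1, found: {unexpected}")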

    return df


raw_df: pd.DataFrame = load_bank_churn_data(DATA_PATH)
df: pd.DataFrame = clean_bank_churn_data(raw_df)

print("Shape:", df.shape)
display(df.head())


## 3. Define the training pipeline

We build a scikit-learn **Pipeline** that includes:

- Preprocessing:
  - `StandardScaler` for numeric features.
  - `OneHotEncoder` for categorical features (`Geography`, `Gender`).
- Model:
  - `RandomForestClassifier` as a solid baseline.

We keep everything inside a single pipeline so it can be saved and loaded
as **one artifact**.


In [None]:
TARGET_COL: str = "Exited"

if TARGET_COL not in df.columns:
    raise KeyError(f"Target column {TARGET_COL!r} not found in DataFrame.")

X: pd.DataFrame = df.drop(columns=[TARGET_COL])
y: pd.Series = df[TARGET_COL]

categorical_cols: List[str] = [c for c in ["Geography", "Gender"] if c in X.columns]
numeric_cols: List[str] = [c for c in X.columns if c not in categorical_cols]

print("Categorical columns:", categorical_cols)
print("Numeric columns:", numeric_cols)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=RANDOM_STATE,
)

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)

numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ]
)

model = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_split=4,
    min_samples_leaf=2,
    random_state=RANDOM_STATE,
    n_jobs=-1,
)

churn_pipeline: Pipeline = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("clf", model),
    ]
)


### 3.1 Train and evaluate

We fit the pipeline and compute simple metrics:

- Accuracy.
- ROC-AUC.
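
To make the second metric concrete: ROC-AUC equals the probability that a randomly chosen churner gets a higher score than a randomly chosen non-churner, with ties counted as half. The sketch below computes that pairwise definition in pure Python on made-up scores. `pairwise_roc_auc` is a hypothetical helper for illustration only; it is not how `roc_auc_score` is implemented.

```python
from typing import Sequence


def pairwise_roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative.")

    # A pair scores 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, purely for illustration.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
print(pairwise_roc_auc(toy_labels, toy_scores))  # 3 of the 4 pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the metric complements accuracy when one class of `Exited` dominates.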


In [None]:
churn_pipeline.fit(X_train, y_train)

y_pred_test = churn_pipeline.predict(X_test)
y_proba_test = churn_pipeline.predict_proba(X_test)[:, 1]

acc: float = accuracy_score(y_test, y_pred_test)
roc_auc: float = roc_auc_score(y_test, y_proba_test)

print(f"Test accuracy: {acc:.3f}")
print(f"Test ROC-AUC:  {roc_auc:.3f}")


The exact numbers are not the main focus here; our goal is a **reasonable**
model with a clean deployment shape.

Next we **persist** the pipeline as a single artifact.


## 4. Persisting the model artifact

We save the entire `churn_pipeline` using `joblib`:

- This includes preprocessing and model.
- It can be loaded later by any Python process with compatible versions of
  scikit-learn and its dependencies. Joblib files are pickle-based, so only
  load artifacts you trust.
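
Because loading depends on library versions, it can help to record them next to the artifact. The following is a minimal sketch using only the standard library. The helper names and the `.versions.json` sidecar convention are assumptions for illustration, not part of joblib or scikit-learn.

```python
import json
import platform
from importlib import metadata
from pathlib import Path
from typing import Dict, Iterable


def collect_versions(packages: Iterable[str]) -> Dict[str, str]:
    """Return installed versions for the given distributions."""
    versions: Dict[str, str] = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def write_version_sidecar(model_path: Path, packages: Iterable[str]) -> Path:
    """Write `<model file name>.versions.json` next to the model artifact."""
    sidecar = model_path.with_name(model_path.name + ".versions.json")
    sidecar.parent.mkdir(parents=True, exist_ok=True)
    sidecar.write_text(json.dumps(collect_versions(packages), indent=2), encoding="utf-8")
    return sidecar


sidecar_path = write_version_sidecar(
    Path("artifacts") / "bank_churn_model.joblib",
    ["scikit-learn", "joblib", "pandas", "numpy"],
)
print(sidecar_path)
```

A serving process could compare this file against its own environment at startup and warn on a mismatch.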


In [None]:
def save_model(model: BaseEstimator, path: Path) -> None:
    """Save a scikit-learn model or pipeline to disk using joblib.

    Args:
        model: Fitted estimator or pipeline.
        path: File path to save the artifact.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, path)
    print(f"Model saved to {path.resolve()}")


def load_model(path: Path) -> BaseEstimator:
    """Load a scikit-learn model or pipeline from disk.

    Args:
        path: Path to the joblib artifact.

    Returns:
        Loaded estimator.
    """
    if not path.exists():
        raise FileNotFoundError(f"Model file not found: {path!s}")
    loaded = joblib.load(path)
    return loaded


save_model(churn_pipeline, MODEL_PATH)

# Quick smoke test: load and predict
loaded_pipeline = load_model(MODEL_PATH)
proba_check = loaded_pipeline.predict_proba(X_test.iloc[:5])[:, 1]
print("Sample predicted churn probabilities (loaded model):", proba_check)


At this point we have a **portable artifact**:

```text
artifacts/bank_churn_model.joblib
```

Everything else (API, batch scoring, etc.) should consume this artifact,
not retrain the model.


## 5. Sketching an online API with FastAPI

We now sketch a minimal **FastAPI** service that:

- Loads `bank_churn_model.joblib` at startup.
- Exposes a `/predict` endpoint.
- Accepts a JSON **list of customers** (a single customer is sent as a one-element list).
- Returns churn probabilities.

> The API code below is provided as a **string** so you can write it to
> `app/main.py` in a separate project, then run with `uvicorn`.


In [None]:
api_code = '''
from __future__ import annotations

from pathlib import Path
from typing import List

import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

# Paths
ARTIFACTS_DIR = Path("artifacts")
MODEL_PATH = ARTIFACTS_DIR / "bank_churn_model.joblib"

# Load model at startup
if not MODEL_PATH.exists():
    raise FileNotFoundError(f"Model artifact not found at {MODEL_PATH.resolve()}")

model = joblib.load(MODEL_PATH)

app = FastAPI(title="Bank Churn Prediction API")


class CustomerFeatures(BaseModel):
    CreditScore: float
    Geography: str
    Gender: str
    Age: int
    Tenure: int
    Balance: float
    NumOfProducts: int
    HasCrCard: int
    IsActiveMember: int
    EstimatedSalary: float


class PredictionResponse(BaseModel):
    churn_proba: float


@app.get("/health")
async def health() -> dict:
    """Health check endpoint."""
    return {"status": "ok"}


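@app.post("/predict", response_model=List[PredictionResponse])
def predict(customers: List[CustomerFeatures]) -> List[PredictionResponse]:
    """Predict churn probability for one or more customers.

    Declared with plain `def` so FastAPI runs the CPU-bound model call in its
    thread pool instead of blocking the event loop.

    Args:
        customers: List of customer feature payloads.

    Returns:
        List of PredictionResponse objects with churn probabilities.
    """
    # Nothing to score: avoid building an empty, column-less DataFrame.
    if not customers:
        return []

    # Pydantic v2 renamed .dict() to .model_dump(); support both versions.
    data = [c.model_dump() if hasattr(c, "model_dump") else c.dict() for c in customers]
    df = pd.DataFrame(data)

    proba = model.predict_proba(df)[:, 1]

    return [PredictionResponse(churn_proba=float(p)) for p in proba]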

'''

print(api_code)
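
Instead of copying the printed text by hand, the string can be written straight to a file. The sketch below assumes a hypothetical helper `write_api_module`. The demo uses a placeholder string and a temporary directory, so in the notebook you would pass the real `api_code` and your own project path.

```python
import tempfile
from pathlib import Path


def write_api_module(code: str, target: Path) -> Path:
    """Write the API source to `target`, creating parent directories."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # Strip the leading newline left by the triple-quoted string.
    target.write_text(code.lstrip("\n"), encoding="utf-8")
    return target


# Placeholder standing in for the notebook's `api_code` string.
placeholder_code = "\nfrom fastapi import FastAPI\n\napp = FastAPI()\n"

# Temporary directory for the demo; use e.g. Path("bank-churn-service") in practice.
demo_root = Path(tempfile.mkdtemp())
written = write_api_module(placeholder_code, demo_root / "app" / "main.py")
print(written)
```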


### 5.1 How to use this API code

1. Create a small project structure, for example:

```text
bank-churn-service/
├── app/
│   └── main.py          # paste the FastAPI code here
├── artifacts/
│   └── bank_churn_model.joblib
└── requirements.txt
```

2. Save the printed `api_code` into `app/main.py`.
3. Install dependencies (example):

```bash
pip install fastapi uvicorn joblib scikit-learn pandas
```

4. Run the API from the project root, since the model path is relative to the working directory:

```bash
uvicorn app.main:app --reload --port 8000
```

5. Example request (JSON body for `/predict`):

```json
[
  {
    "CreditScore": 600,
    "Geography": "France",
    "Gender": "Male",
    "Age": 40,
    "Tenure": 3,
    "Balance": 60000.0,
    "NumOfProducts": 2,
    "HasCrCard": 1,
    "IsActiveMember": 1,
    "EstimatedSalary": 50000.0
  }
]
```

Response (example):

```json
[
  {"churn_proba": 0.27}
]
```
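
The request above can also be sent from Python. This is a minimal client sketch using only the standard library. It assumes the service runs locally on port 8000 as in step 4, and `build_predict_request` is a hypothetical helper name. The network call is left commented out so the snippet runs without a live server.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_predict_request(
    customers: List[Dict[str, Any]],
    url: str = "http://localhost:8000/predict",
) -> urllib.request.Request:
    """Build a POST request whose JSON body is a list of customer payloads."""
    body = json.dumps(customers).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


customer = {
    "CreditScore": 600,
    "Geography": "France",
    "Gender": "Male",
    "Age": 40,
    "Tenure": 3,
    "Balance": 60000.0,
    "NumOfProducts": 2,
    "HasCrCard": 1,
    "IsActiveMember": 1,
    "EstimatedSalary": 50000.0,
}
request = build_predict_request([customer])

# Uncomment once the service is running locally:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(json.loads(response.read()))
```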

From here you can add:

- Request logging.
- Input validation / defaults.
- Authentication.
- Monitoring hooks.


## 6. Batch scoring script

In many churn use cases you want to score **batches of customers** offline.

We implement a simple function that:

- Reads a CSV file with the same feature columns used in training.
- Loads the persisted pipeline.
- Adds predicted churn probabilities.
- Writes a new CSV with results.

This is the basis for a **daily churn scoring job**.
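
Since the CSV must carry the same feature columns used in training, a cheap header check can fail fast before the model is even loaded. The sketch below reads only the first line of the file. `EXPECTED_FEATURES` mirrors the feature list from section 2, and `missing_feature_columns` is a hypothetical helper; the demo file is made up.

```python
import csv
import tempfile
from pathlib import Path
from typing import List, Sequence

# Feature columns as listed in section 2 of this notebook.
EXPECTED_FEATURES: List[str] = [
    "CreditScore", "Geography", "Gender", "Age", "Tenure",
    "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary",
]


def missing_feature_columns(
    csv_path: Path,
    expected: Sequence[str] = EXPECTED_FEATURES,
) -> List[str]:
    """Return expected columns absent from the CSV header (empty list if none)."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Demo with a deliberately incomplete, made-up file.
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = Path(tmp) / "batch.csv"
    demo_csv.write_text(
        "CustomerId,CreditScore,Geography,Gender,Age\n1,600,France,Male,40\n",
        encoding="utf-8",
    )
    missing = missing_feature_columns(demo_csv)

print(missing)
```

A scoring job could raise an error when the returned list is non-empty, rather than waiting for the pipeline to fail.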


In [None]:
def batch_score_csv(
    model_path: Path,
    input_csv: Path,
    output_csv: Path,
    id_column: str | None = None,
) -> pd.DataFrame:
    """Score a batch of customers in a CSV file using a saved churn model.

    Args:
        model_path: Path to the joblib model artifact.
        input_csv: Path to the input CSV with customer features.
        output_csv: Path where the output CSV will be saved.
        id_column: Optional name of an identifier column to keep.

    Returns:
        DataFrame with predictions (also saved to `output_csv`).
    """
    if not input_csv.exists():
        raise FileNotFoundError(f"Input CSV not found: {input_csv!s}")

    model: BaseEstimator = load_model(model_path)

    data = pd.read_csv(input_csv)

    # Optionally separate out id column
    id_series = None
    if id_column is not None and id_column in data.columns:
        id_series = data[id_column]

    X_batch = data.copy()
    if TARGET_COL in X_batch.columns:
        X_batch = X_batch.drop(columns=[TARGET_COL])

    proba = model.predict_proba(X_batch)[:, 1]

    result_df = pd.DataFrame({"churn_proba": proba})

    if id_series is not None:
        result_df.insert(0, id_column, id_series)

    output_csv.parent.mkdir(parents=True, exist_ok=True)
    result_df.to_csv(output_csv, index=False)

    print(f"Saved batch predictions to {output_csv.resolve()}")
    return result_df


# Example usage (uncomment and adjust paths when running in your environment):
# input_demo = Path("data") / "Churn_Modelling.csv"
# output_demo = Path("data") / "Churn_Modelling_scored.csv"
# scored_batch_df = batch_score_csv(MODEL_PATH, input_demo, output_demo, id_column="CustomerId")
# display(scored_batch_df.head())


You can wrap this function in a script and schedule it as a **cron job**, or
orchestrate it with tools like Airflow, Prefect, or any other workflow system.

The key point: it **loads** the model artifact instead of retraining, which
gives you stable scoring behaviour across runs.
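
A scheduler needs a command to call, so one possible shape for the wrapper is a small `argparse` entry point. This is a sketch under assumptions: the flag names are illustrative, and the import of `batch_score_csv` is commented out because its module path (`churn.scoring`) is hypothetical.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse command-line options for a batch scoring run."""
    parser = argparse.ArgumentParser(description="Score a CSV of customers for churn.")
    parser.add_argument(
        "--model",
        type=Path,
        default=Path("artifacts") / "bank_churn_model.joblib",
        help="Path to the joblib model artifact.",
    )
    parser.add_argument("--input", type=Path, required=True, help="Input CSV with customer features.")
    parser.add_argument("--output", type=Path, required=True, help="Where to write the scored CSV.")
    parser.add_argument("--id-column", default=None, help="Identifier column to carry into the output.")
    return parser.parse_args(argv)


def main(argv: Optional[List[str]] = None) -> int:
    """Entry point: parse options, then hand over to the scoring function."""
    args = parse_args(argv)
    # Assumption: batch_score_csv has been moved into an importable module.
    # from churn.scoring import batch_score_csv  # hypothetical module path
    # batch_score_csv(args.model, args.input, args.output, id_column=args.id_column)
    print(f"Would score {args.input} -> {args.output} using {args.model}")
    return 0


# Made-up paths, passed explicitly so the demo does not read sys.argv.
demo_argv = ["--input", "data/new_customers.csv", "--output", "out/scored.csv", "--id-column", "CustomerId"]
demo_args = parse_args(demo_argv)
print(demo_args.id_column)
```

In a real `batch/score_churn.py` you would end the file with `if __name__ == "__main__": raise SystemExit(main())` so that cron or a workflow tool sees a proper exit code.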


## 7. Minimal project layout suggestion

Putting everything together, a minimal MLOps-ish project structure could be:

```text
bank-churn-mlops/
├── data/
│   └── Churn_Modelling.csv           # training data (usually not in prod image)
├── artifacts/
│   └── bank_churn_model.joblib       # trained model artifact
├── notebooks/
│   └── 01_train_and_export.ipynb     # this notebook, or similar
├── app/
│   └── main.py                       # FastAPI service using the artifact
├── batch/
│   └── score_churn.py                # wrapper around batch_score_csv
├── requirements.txt
└── README.md
```

Typical lifecycle:

1. **Training phase** (offline / dev or scheduled):
   - Run the training notebook or script → update `bank_churn_model.joblib`.
2. **Serving phase**:
   - FastAPI service loads the latest artifact at startup.
   - Batch jobs call the same artifact for offline scoring.
3. **Monitoring phase**:
   - Track model performance over time (not covered here, but natural next step).


## 8. Next possible extensions

To turn this into a more complete MLOps project you could add:

- **Versioning** of model artifacts (timestamps, MLflow, or simple naming).
- **Config management** (YAML/JSON for paths, model hyperparameters, etc.).
- **Dockerfile** to containerise the FastAPI app and model artifact.
- **CI/CD** to:
  - Run tests.
  - Rebuild and deploy the service when a new model is pushed.
- **Monitoring & logging**:
  - Log input distributions and prediction summaries.
  - Monitor drift and performance degradation.
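
As an illustration of the "simple naming" option for versioning, the sketch below builds timestamped artifact names and picks the newest one. Both helpers and the naming pattern are assumptions, not an established convention, and `now` is assumed to be a UTC datetime.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def versioned_model_path(artifacts_dir: Path, now: Optional[datetime] = None) -> Path:
    """Build a timestamped artifact path inside `artifacts_dir`."""
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    return artifacts_dir / f"bank_churn_model_{stamp}.joblib"


def latest_model_path(artifacts_dir: Path) -> Optional[Path]:
    """Return the newest timestamped artifact, or None if there is none."""
    # Fixed-width UTC timestamps sort chronologically as plain strings.
    candidates = sorted(artifacts_dir.glob("bank_churn_model_*.joblib"))
    return candidates[-1] if candidates else None


example = versioned_model_path(
    Path("artifacts"),
    datetime(2024, 1, 31, 9, 30, tzinfo=timezone.utc),
)
print(example.name)  # bank_churn_model_20240131T093000Z.joblib
```

Training would save to `versioned_model_path(...)`, while the API and batch job would resolve `latest_model_path(...)` at startup, which keeps older artifacts available for rollback.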

This notebook gives you a **starting point**: a clean pipeline, a persisted
artifact, an example API, and a batch scoring path you can plug into your own
stack.
