## Federated Logistic Regression with BKR and XXLLNC Data

This notebook demonstrates how early-warning analyses can be performed across multiple data environments using federated learning. The objective is to estimate logistic regression models that explain outcomes in debt-signal follow-up processes, without centralising or sharing raw individual-level data.

The analyses focus on understanding which factors influence contact outcomes after a debt signal, such as demographic characteristics, debt indicators, signal source, and outreach activities. These questions are central to early-warning research and policy evaluation.

### Data and Federated Setup

The data used in this notebook is assumed to be distributed across the **BKR** and **XXLLNC** data environments. Both platforms are software systems for collecting, storing, and analysing early-warning data. Each holds different but complementary variables, such as signal characteristics, dossier information, and contact actions.

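Using the **vantage6** framework, the logistic regression algorithm is sent to each data environment. Local computations are performed there, and only aggregated model updates are returned. At no point are raw records exchanged or combined across organisations. In this notebook the setup is simulated locally with vantage6's `MockAlgorithmClient`, with one CSV file standing in for each environment.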

This setup demonstrates that meaningful statistical analyses for early-warning policy and practice can be executed across organisational boundaries while preserving data separation and governance requirements.

### Overview of the Experiments

Two federated logistic regression experiments are conducted, each analysing a different stage of the early-warning follow-up process.

**Experiment 1: Contact established (dossier level)**  
This experiment analyses whether contact is established at least once for a dossier. The outcome variable is `is_contact_gelegd`, and the unit of analysis is the dossier (case). The model includes demographic indicators, debt characteristics, signal source, and the total number of contact attempts. This experiment captures overall follow-up effectiveness and identifies which dossiers are more likely to result in successful contact.

**Experiment 2: Successful contact attempt (contact level)**  
This experiment focuses on individual outreach actions. The outcome variable is `is_succesvol`, indicating whether a specific contact attempt was successful. The unit of analysis is the contact attempt. In addition to dossier-level characteristics, the model includes contact-specific predictors such as contact type and attempt order. This experiment explains which outreach actions are most effective once contact attempts have started.

### What This Demonstrates

- **Methodological feasibility**  
  Logistic regression analyses commonly used in early-warning research can be performed without centralising individual-level data.

- **Cross-platform collaboration**  
  Federated learning enables joint model estimation across BKR and XXLLNC environments, while each platform retains full control over its data.

- **Multi-level analysis**  
  The federated approach naturally supports analyses at both the dossier level and the contact-attempt level within the same framework.

The following sections present the two experiments in detail, including their federated execution and results.

### Experiment 1: Federated Analysis of Contact Establishment (Dossier Level)

This experiment analyses whether contact is established at least once for a dossier after an early-warning signal is received. The outcome variable, `is_contact_gelegd`, equals 1 if any contact attempt for a dossier was successful, and 0 otherwise.

The unit of analysis is the **dossier (case)**. Each dossier aggregates information about the citizen, the debt signal, and the overall outreach effort, including the total number of contact attempts. The model estimates how dossier-level characteristics influence the probability that contact is established at all.
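The aggregation from contact attempts to dossiers described above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual preprocessing: the `dossier_id` column and the toy values are assumptions, while `is_succesvol`, `is_contact_gelegd`, and `n_pogingen` are the variable names used in the experiments.

```python
import pandas as pd

# Toy contact-attempt log (hypothetical data; `dossier_id` is an assumed key column).
contacts = pd.DataFrame(
    {
        "dossier_id": [1, 1, 1, 2, 2, 3],
        "is_succesvol": [0, 0, 1, 0, 0, 1],
    }
)

# One row per dossier: contact counts as established if any attempt succeeded,
# and n_pogingen is the total number of attempts recorded for that dossier.
dossiers = (
    contacts.groupby("dossier_id")
    .agg(
        is_contact_gelegd=("is_succesvol", "max"),
        n_pogingen=("is_succesvol", "size"),
    )
    .reset_index()
)

print(dossiers)
```

Taking the maximum of a 0/1 column is one simple way to express "at least one successful attempt"; the real pipeline may derive the outcome differently.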

The analysis is performed in a **federated learning setting**, where data remains distributed across BKR and XXLLNC environments. Each environment computes local updates of the logistic regression model on its own data, and only aggregated model parameters are shared and combined. No individual-level records are exchanged.
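To make the idea of local updates and aggregated parameters concrete, the sketch below runs a simplified federated logistic regression on synthetic data. How `v6_logistic_regression_py` actually aggregates is not shown in this notebook, so the row-count-weighted averaging, the learning rate, and the stopping rule are assumptions made for illustration. The names `max_iter` and `delta` mirror the notebook's parameters, but their exact semantics in the library may differ.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def local_update(w, X, y, lr=0.5, steps=5):
    """Run a few gradient steps on one environment's data.

    Only the updated parameters and the row count leave the environment.
    """
    w = w.copy()
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w, len(y)


def federated_fit(nodes, n_features, max_iter=100, delta=1e-4):
    """Average local parameters (weighted by row count) until they stabilise."""
    w = np.zeros(n_features)
    for iteration in range(1, max_iter + 1):
        updates = [local_update(w, X, y) for X, y in nodes]
        total_rows = sum(n for _, n in updates)
        w_new = sum(w_k * n for w_k, n in updates) / total_rows
        if np.max(np.abs(w_new - w)) < delta:
            return w_new, iteration
        w = w_new
    return w, max_iter


# Toy data standing in for two environments (synthetic, not the notebook's data).
rng = np.random.default_rng(0)
true_w = np.array([-0.5, 1.0])  # intercept and one slope


def make_node(n_rows):
    X = np.column_stack([np.ones(n_rows), rng.normal(size=n_rows)])
    y = (rng.random(n_rows) < sigmoid(X @ true_w)).astype(float)
    return X, y


nodes = [make_node(400), make_node(600)]
w_fed, n_iter = federated_fit(nodes, n_features=2)
print(f"Stopped after {n_iter} rounds: intercept={w_fed[0]:.3f}, slope={w_fed[1]:.3f}")
```

The aggregation step only ever sees parameter vectors and row counts, which is the property the federated setup relies on.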

All categorical variables are represented as **one-hot encoded indicators**, and the model is estimated using federated logistic regression. Results are reported using regression coefficients and odds ratios to support interpretation.
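As a reading aid for the results, the sketch below shows how a coefficient translates into an odds ratio and a predicted probability. The intercept and the `n_pogingen` coefficient are copied from the Experiment 1 output in this notebook (rounded to four decimals). The helper `predicted_probability` is hypothetical and assumes all other predictors are 0, that is, a dossier in every reference category.

```python
import numpy as np

# Values copied from the Experiment 1 output (rounded to four decimals).
intercept = -0.7613
beta_n_pogingen = 0.5003

# Odds ratio: multiplicative change in the odds of contact per extra attempt.
odds_ratio = np.exp(beta_n_pogingen)


def predicted_probability(n_pogingen: int) -> float:
    """Probability of contact with every other predictor set to 0."""
    z = intercept + beta_n_pogingen * n_pogingen
    return float(1.0 / (1.0 + np.exp(-z)))


print(f"Odds ratio for n_pogingen: {odds_ratio:.4f}")
for n in (1, 2, 3):
    print(f"n_pogingen={n}: p = {predicted_probability(n):.3f}")
```

Because the inputs are rounded, the recomputed odds ratio can differ from the reported table in the last decimal.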

This experiment addresses the core question in early-warning follow-up analysis:  
**Which dossiers are more likely to result in successful contact?**

In [23]:
# ============================================================
# Experiment 1 (Dossier-level): Federated LR on is_contact_gelegd
# Single-cell version (clean output + nice printing)
# ============================================================

import io
import contextlib
import logging
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from vantage6.algorithm.tools.mock_client import MockAlgorithmClient

from v6_logistic_regression_py.helper import initialize_model


# ----------------------------
# 1) Federated runner (Mock)
# ----------------------------
def run_federated_lr(
    data_dir: Path,
    predictors: list[str],
    outcome: str,
    classes: list,
    max_iter: int = 100,
    delta: float = 1e-4,
    suppress_logs: bool = True,
    file1: str = "data1_dossiers.csv",
    file2: str = "data2_dossiers.csv",
):
    # Silence log records at INFO level and below; stdout/stderr are captured separately further down
    if suppress_logs:
        logging.disable(logging.INFO)

    dataset_1 = {"database": data_dir / file1, "db_type": "csv"}
    dataset_2 = {"database": data_dir / file2, "db_type": "csv"}

    client = MockAlgorithmClient(
        datasets=[[dataset_1], [dataset_2]],
        module="v6_logistic_regression_py",
    )

    orgs = client.organization.list()
    org_ids = [o["id"] for o in orgs]

    buf = io.StringIO()
    with contextlib.redirect_stdout(buf), contextlib.redirect_stderr(buf):
        # ---- train ----
        master_task = client.task.create(
            input_={
                "master": True,
                "method": "master",
                "kwargs": {
                    "org_ids": org_ids,
                    "predictors": predictors,
                    "outcome": outcome,
                    "classes": classes,
                    "max_iter": max_iter,
                    "delta": delta,
                },
            },
            organizations=[org_ids[0]],
        )
        results = client.result.get(master_task.get("id"))

        model = initialize_model(LogisticRegression, results["model_attributes"])
        iteration = results["iteration"]

        # ---- validate ----
        val_task = client.task.create(
            input_={
                "master": False,
                "method": "run_validation",
                "kwargs": {
                    "parameters": [model.intercept_.tolist(), model.coef_.tolist()],
                    "classes": classes,
                    "predictors": predictors,
                    "outcome": outcome,
                },
            },
            organizations=[org_ids[0]],
        )
        val = client.result.get(val_task.get("id"))

    if suppress_logs:
        logging.disable(logging.NOTSET)

    return {
        "client": client,
        "org_ids": org_ids,
        "predictors": predictors,
        "outcome": outcome,
        "classes": classes,
        "model": model,
        "iteration": iteration,
        "accuracy": val["score"],
        "confusion_matrix": val["confusion_matrix"],
        "muted_output": buf.getvalue(),
    }


# ----------------------------
# 2) Pretty printer
# ----------------------------
def _print_header(title: str):
    print("\n" + "=" * 70)
    print(title)
    print("=" * 70 + "\n")


def _safe_div(a: float, b: float) -> float:
    return float(a / b) if b else 0.0


def pretty_print_federated_lr(
    fed: dict,
    experiment_title: str,
    sort_by_odds_ratio: bool = True,
):
    orgs = len(fed["org_ids"])
    iteration = fed["iteration"]
    model = fed["model"]
    predictors = fed["predictors"]
    accuracy = fed["accuracy"]
    cm = np.array(fed["confusion_matrix"])

    print(f"\n🔗 Number of participating organizations: {orgs}\n")

    _print_header(f"📌 {experiment_title}")
    print(f"🔢 Number of iterations: {iteration}\n")
    print(f"🔣 Intercept (β₀): {model.intercept_[0]:.4f}\n")

    coefs = model.coef_[0]
    df_coef = pd.DataFrame(
        {
            "Predictor": predictors,
            "Coefficient (β)": coefs,
            "Odds Ratio (expβ)": np.exp(coefs),
        }
    )

    if sort_by_odds_ratio:
        df_coef = df_coef.sort_values("Odds Ratio (expβ)", ascending=False)

    print("📊 Coefficients" + (" (sorted by Odds Ratio):" if sort_by_odds_ratio else ":"))
    print(df_coef.to_string(index=False, float_format=lambda x: f"{x:0.4f}"))

    _print_header("✅ Model Validation Results")
    print(f"Accuracy: {accuracy:.4f}\n")

    if cm.shape == (2, 2):
        tn, fp, fn, tp = cm.ravel()
        precision_1 = _safe_div(tp, tp + fp)
        recall_1 = _safe_div(tp, tp + fn)
        f1_1 = _safe_div(2 * precision_1 * recall_1, precision_1 + recall_1)

        df_cm = pd.DataFrame(
            cm,
            index=["Actual: 0 (no contact)", "Actual: 1 (contact)"],
            columns=["Predicted: 0", "Predicted: 1"],
        )
        print("📉 Confusion Matrix:")
        print(df_cm.to_string())
        print("")
        print(f"Precision (class=1): {precision_1:.4f}")
        print(f"Recall    (class=1): {recall_1:.4f}")
        print(f"F1-score  (class=1): {f1_1:.4f}")
    else:
        print("📉 Confusion Matrix:")
        print(cm)

    print("\n✨ Done! Experiment 1 results displayed cleanly.\n")


# ----------------------------
# 3) Run Experiment 1
# ----------------------------
DATA_DIR = Path("./v6_logistic_regression_py/local")

predictors_exp1 = [
    "aandeel_schuldproblematiek_cat_5_7",
    "aandeel_schuldproblematiek_cat_7_10",
    "gemeente_grootte_cat_25001_50000",
    "gemeente_grootte_cat_50001_100000",
    "gemeente_grootte_cat_gt_100000",
    "leeftijd_cat_26_45",
    "leeftijd_cat_46_65",
    "leeftijd_cat_66_plus",
    "meetbureau_XXLLNC",
    "n_pogingen",
    "schuldbedrag_cat_100_250",
    "schuldbedrag_cat_250_500",
    "schuldbedrag_cat_500_1000",
    "schuldbedrag_cat_gt_2000",
    "schuldtype_woon_enkel",
    "schuldtype_woon_meervoudig",
    "schuldtype_woon_zorg",
]

fed1 = run_federated_lr(
    data_dir=DATA_DIR,
    predictors=predictors_exp1,
    outcome="is_contact_gelegd",
    classes=[0, 1],
    file1="data1_dossiers.csv",
    file2="data2_dossiers.csv",
    max_iter=100,
    delta=1e-4,
    suppress_logs=True,
)

pretty_print_federated_lr(
    fed1,
    experiment_title="Federated Logistic Regression - Model Summary (Experiment 1: is_contact_gelegd)",
    sort_by_odds_ratio=True,
)

# Optional: inspect muted logs if needed
# print(fed1["muted_output"])


🔗 Number of participating organizations: 2


📌 Federated Logistic Regression - Model Summary (Experiment 1: is_contact_gelegd)

🔢 Number of iterations: 23

🔣 Intercept (β₀): -0.7613

📊 Coefficients (sorted by Odds Ratio):
                          Predictor  Coefficient (β)  Odds Ratio (expβ)
                         n_pogingen           0.5003             1.6493
aandeel_schuldproblematiek_cat_7_10           0.1126             1.1192
           schuldbedrag_cat_gt_2000           0.0870             1.0909
               leeftijd_cat_66_plus           0.0792             1.0824
              schuldtype_woon_enkel           0.0219             1.0221
 aandeel_schuldproblematiek_cat_5_7           0.0065             1.0065
                 leeftijd_cat_46_65          -0.0533             0.9481
          schuldbedrag_cat_500_1000          -0.0702             0.9322
                 leeftijd_cat_26_45          -0.1105             0.8954
               schuldtype_woon_zorg

### Experiment 2: Federated Analysis of Successful Contact Attempts (Contact Level)

This experiment analyses individual contact attempts and focuses on the outcome variable `is_succesvol`, which indicates whether a specific outreach action (for example a letter, phone call, or home visit) was successful.

The unit of analysis is the **contact attempt**. As a result, a single dossier can contribute multiple observations to the dataset. In addition to dossier-level characteristics, the model includes contact-specific predictors such as contact type and attempt order, allowing for a more fine-grained analysis of outreach effectiveness.

The federated learning setup is identical to that used in Experiment 1. Data remains distributed across the BKR and XXLLNC environments, where local model updates are computed. Only aggregated parameter updates are exchanged, ensuring that detailed contact logs are never shared between environments.

All predictors are numeric, with categorical variables represented using **one-hot encoding**. Model results are presented using regression coefficients and odds ratios to support interpretation.
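The contact-level predictors can be sketched as below: an attempt counter per dossier and indicator columns for contact type, with `brief` (letter) dropped so that it serves as the baseline category. This is a minimal illustration on toy data, not the notebook's actual preprocessing. The `dossier_id`, `datum`, and `contactsoort` source columns are assumptions, and the toy data does not cover every contact type in the predictor list.

```python
import pandas as pd

# Toy contact log (hypothetical data; `dossier_id` and `datum` are assumed columns).
contacts = pd.DataFrame(
    {
        "dossier_id": [1, 1, 2, 2, 2],
        "datum": pd.to_datetime(
            ["2024-01-05", "2024-01-12", "2024-02-01", "2024-02-08", "2024-02-15"]
        ),
        "contactsoort": ["brief", "telefoon", "brief", "email", "huisbezoek"],
    }
)

# Attempt order within each dossier, starting at 1.
contacts = contacts.sort_values(["dossier_id", "datum"])
contacts["attempt_nr"] = contacts.groupby("dossier_id").cumcount() + 1

# One indicator column per contact type, then drop "brief" so it becomes the baseline.
dummies = pd.get_dummies(contacts["contactsoort"], prefix="contactsoort", dtype=int)
dummies = dummies.drop(columns="contactsoort_brief")

encoded = pd.concat([contacts[["dossier_id", "attempt_nr"]], dummies], axis=1)
print(encoded)
```

With `brief` as the omitted category, each `contactsoort_*` coefficient is read relative to sending a letter.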

This experiment addresses a complementary question in early-warning follow-up analysis:  
**Which types of outreach actions are most effective once contact attempts begin?**

In [24]:
# ============================================================
# EXPERIMENT 2 - Contact-attempt level outcome: is_succesvol
# Uses: data1_contacts.csv, data2_contacts.csv
# Federated logistic regression via MockAlgorithmClient (vantage6)
# ============================================================

from pathlib import Path
import io
import contextlib
import logging

import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from vantage6.algorithm.tools.mock_client import MockAlgorithmClient
from v6_logistic_regression_py.helper import initialize_model


# ----------------------------
# Pretty printing utilities
# ----------------------------
def print_header(title: str):
    print("\n" + "=" * 70)
    print(title)
    print("=" * 70 + "\n")


def coef_table(predictors, coefs):
    df = pd.DataFrame({
        "Predictor": predictors,
        "Coefficient (β)": coefs,
        "Odds Ratio (expβ)": np.exp(coefs),
    }).sort_values("Odds Ratio (expβ)", ascending=False)
    return df.to_string(index=False, float_format=lambda x: f"{x:0.4f}")


def classification_metrics_from_cm(cm):
    cm = np.array(cm, dtype=float)
    tn, fp = cm[0, 0], cm[0, 1]
    fn, tp = cm[1, 0], cm[1, 1]
    precision = tp / (tp + fp) if (tp + fp) else np.nan
    recall = tp / (tp + fn) if (tp + fn) else np.nan
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else np.nan
    return precision, recall, f1


# ----------------------------
# Federated runner
# ----------------------------
def run_federated_lr_contacts(
    data_dir: Path,
    predictors: list[str],
    outcome: str,
    classes: list,
    max_iter: int = 100,
    delta: float = 1e-4,
    suppress_logs: bool = True,
):
    # mute "info > ..." from vantage6
    if suppress_logs:
        logging.disable(logging.INFO)

    dataset_1 = {"database": data_dir / "data1_contacts.csv", "db_type": "csv"}
    dataset_2 = {"database": data_dir / "data2_contacts.csv", "db_type": "csv"}

    client = MockAlgorithmClient(
        datasets=[[dataset_1], [dataset_2]],
        module="v6_logistic_regression_py",
    )

    orgs = client.organization.list()
    org_ids = [o["id"] for o in orgs]

    buf = io.StringIO()
    with contextlib.redirect_stdout(buf), contextlib.redirect_stderr(buf):
        # ---- train ----
        master_task = client.task.create(
            input_={
                "master": True,
                "method": "master",
                "kwargs": {
                    "org_ids": org_ids,
                    "predictors": predictors,
                    "outcome": outcome,
                    "classes": classes,
                    "max_iter": max_iter,
                    "delta": delta,
                },
            },
            organizations=[org_ids[0]],
        )
        results = client.result.get(master_task.get("id"))

        model = initialize_model(LogisticRegression, results["model_attributes"])
        iteration = results["iteration"]

        # ---- validate ----
        val_task = client.task.create(
            input_={
                "master": False,
                "method": "run_validation",
                "kwargs": {
                    "parameters": [model.intercept_.tolist(), model.coef_.tolist()],
                    "classes": classes,
                    "predictors": predictors,
                    "outcome": outcome,
                },
            },
            organizations=[org_ids[0]],
        )
        val = client.result.get(val_task.get("id"))

    if suppress_logs:
        logging.disable(logging.NOTSET)

    return {
        "org_ids": org_ids,
        "predictors": predictors,
        "outcome": outcome,
        "classes": classes,
        "model": model,
        "iteration": iteration,
        "accuracy": val["score"],
        "confusion_matrix": val["confusion_matrix"],
        "muted_output": buf.getvalue(),
    }


# ============================================================
# 1) CONFIG: point to your local folder
# ============================================================
DATA_DIR = Path("./v6_logistic_regression_py/local")

# ============================================================
# 2) EXPERIMENT 2 SETTINGS (CONTACT LEVEL)
# Outcome: is_succesvol (per contact attempt)
# Predictors: dossier-level + contactsoort dummies + attempt_nr
# ============================================================
outcome = "is_succesvol"
classes = [0, 1]

predictors = [
    # ---- dossier-level covariates (already one-hot in your contacts file) ----
    "leeftijd_cat_26_45",
    "leeftijd_cat_46_65",
    "leeftijd_cat_66_plus",

    "meetbureau_XXLLNC",  # (or meetbureau_OVERIG depending on your generator/export)

    "schuldbedrag_cat_100_250",
    "schuldbedrag_cat_250_500",
    "schuldbedrag_cat_500_1000",
    "schuldbedrag_cat_gt_2000",

    "gemeente_grootte_cat_25001_50000",
    "gemeente_grootte_cat_50001_100000",
    "gemeente_grootte_cat_gt_100000",

    "aandeel_schuldproblematiek_cat_5_7",
    "aandeel_schuldproblematiek_cat_7_10",

    "schuldtype_woon_enkel",
    "schuldtype_woon_meervoudig",
    "schuldtype_woon_zorg",

    # ---- contact-attempt covariates ----
    "attempt_nr",  # (if present; otherwise use n_pogingen or drop it)
    "contactsoort_email",
    "contactsoort_sms_whatsapp",
    "contactsoort_telefoon",
    "contactsoort_huisbezoek",
    # NOTE: "brief" is the implicit baseline if you did one-hot encoding without it.
]

# ============================================================
# 3) RUN FEDERATED TRAINING + VALIDATION
# ============================================================
fed2 = run_federated_lr_contacts(
    data_dir=DATA_DIR,
    predictors=predictors,
    outcome=outcome,
    classes=classes,
    max_iter=100,
    delta=1e-4,
    suppress_logs=True,
)

print(f"🔗 Number of participating organizations: {len(fed2['org_ids'])}")


# ============================================================
# 4) NICE OUTPUT
# ============================================================
model = fed2["model"]
iteration = fed2["iteration"]
accuracy = fed2["accuracy"]
cm = fed2["confusion_matrix"]

print_header("📌 Federated Logistic Regression - Model Summary (Experiment 2: is_succesvol)")
print(f"🔢 Number of iterations: {iteration}\n")
print(f"🔣 Intercept (β₀): {model.intercept_[0]:.4f}\n")
print("📊 Coefficients (sorted by Odds Ratio):")
print(coef_table(predictors, model.coef_[0]))

print_header("✅ Model Validation Results")
print(f"Accuracy: {accuracy:.4f}\n")

cm_arr = np.array(cm)
df_cm = pd.DataFrame(
    cm_arr,
    index=["Actual: 0 (not successful)", "Actual: 1 (successful)"],
    columns=["Predicted: 0", "Predicted: 1"],
)
print("📉 Confusion Matrix:")
print(df_cm.to_string())

precision, recall, f1 = classification_metrics_from_cm(cm_arr)
print(f"\nPrecision (class=1): {precision:.4f}")
print(f"Recall    (class=1): {recall:.4f}")
print(f"F1-score  (class=1): {f1:.4f}")

print("\n✨ Done! Experiment 2 results displayed cleanly.\n")

🔗 Number of participating organizations: 2

📌 Federated Logistic Regression - Model Summary (Experiment 2: is_succesvol)

🔢 Number of iterations: 44

🔣 Intercept (β₀): -0.7643

📊 Coefficients (sorted by Odds Ratio):
                          Predictor  Coefficient (β)  Odds Ratio (expβ)
              contactsoort_telefoon           0.6657             1.9459
            contactsoort_huisbezoek           0.3530             1.4234
aandeel_schuldproblematiek_cat_7_10           0.0957             1.1005
                 contactsoort_email           0.0798             1.0831
               leeftijd_cat_66_plus           0.0611             1.0630
           schuldbedrag_cat_gt_2000           0.0405             1.0413
                         attempt_nr           0.0233             1.0236
 aandeel_schuldproblematiek_cat_5_7           0.0057             1.0057
              schuldtype_woon_enkel          -0.0118             0.9882
                 leeftijd_cat_46_65       