# 05 — Wine Risk Calculator (Low Technical Quality)

This notebook builds a **risk calculator** that predicts the probability that a wine has **low technical quality** from its chemical composition.

**Risk definition (binary target)**

- `risk = 1` if `quality ≤ 5` (High risk: low quality)
- `risk = 0` if `quality > 5` (Low risk)
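
As a minimal sketch of this rule, assuming the raw data sits in a pandas DataFrame with an integer `quality` column (the four-row frame below is made up for illustration):

```python
import pandas as pd

# Hypothetical quality scores, for illustration only.
df = pd.DataFrame({"quality": [3, 5, 6, 8]})

# risk = 1 when quality <= 5, otherwise 0.
df["risk"] = (df["quality"] <= 5).astype(int)

print(df)
```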

**Business objective**

- Minimize **false negatives** (missing a low-quality wine is more costly than raising a false alarm).
- Use **calibrated probabilities** and a **recall-oriented threshold**.
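
One way to pick a recall-oriented threshold, sketched under the assumption that validation labels and calibrated probabilities are available as arrays: take the highest threshold whose recall on the High risk class still meets a target. The function name `pick_recall_threshold`, the recall targets, and the toy arrays are illustrative assumptions, not values from this project.

```python
import numpy as np

def pick_recall_threshold(y_true, proba, target_recall=0.90):
    """Return the highest threshold whose recall on class 1 is >= target_recall."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    n_pos = int((y_true == 1).sum())
    if n_pos == 0:
        raise ValueError("y_true contains no positive (risk=1) samples")

    # Candidate thresholds: every distinct predicted probability, highest first.
    for t in np.sort(np.unique(proba))[::-1]:
        flagged = proba >= t
        recall = (flagged & (y_true == 1)).sum() / n_pos
        if recall >= target_recall:
            return float(t)
    # Only reached if target_recall > 1: fall back to flagging every sample.
    return float(proba.min())

# Toy validation data (made up): 4 positives, 4 negatives.
y_val = [1, 1, 1, 0, 0, 1, 0, 0]
p_val = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1]

print(pick_recall_threshold(y_val, p_val, target_recall=0.75))
print(pick_recall_threshold(y_val, p_val, target_recall=1.0))
```

A stricter recall target pushes the threshold down, so more wines are flagged and fewer low-quality wines slip through.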

---

## Outputs

- `predict_risk()` function:
  - returns calibrated probability of **High risk**
  - returns risk label (`"High risk"` / `"Low risk"`)
  - returns decision threshold used
- Example scenarios (3–5 hypothetical wines)

In [1]:
# --- Imports
import os
import json
import joblib
import yaml
import numpy as np
import pandas as pd
from typing import Dict, Any, Tuple

pd.set_option("display.max_columns", 50)

## 1) Load configuration

We load file paths from `config.yaml` to avoid hardcoded paths.

In [2]:
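# --- Load config.yaml
from pathlib import Path

# Try the notebooks/ layout first, then the project root as the working directory.
_candidates = ["../config.yaml", "config.yaml"]
CONFIG_PATH = next((p for p in _candidates if Path(p).is_file()), _candidates[0])
with open(CONFIG_PATH, "r") as f:
    config = yaml.safe_load(f) or {}  # an empty YAML file loads as None

config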


## 2) Load trained artifacts

We load:

- Calibrated model pipeline (Logistic Regression + preprocessing + calibration)
- Saved threshold(s) for risk decisioning

Artifacts are expected to exist from `04_risk_modeling.ipynb`.
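
For orientation, here is a hedged sketch of how artifacts of this shape could be produced. It is not the code from `04_risk_modeling.ipynb`: the synthetic data, the sigmoid calibration, the `selected_threshold` key with its 0.35 value, and the temporary directory are all assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 11 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 11))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Preprocessing + logistic regression, wrapped in cross-validated calibration.
base = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)
calibrated.fit(X, y)

# Persist the model and a thresholds dict, then reload them as this notebook does.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "logreg_risk_pipeline_calibrated.joblib")
thresholds_path = os.path.join(out_dir, "risk_thresholds.joblib")
joblib.dump(calibrated, model_path)
joblib.dump({"selected_threshold": 0.35}, thresholds_path)  # illustrative value

reloaded_model = joblib.load(model_path)
reloaded_thresholds = joblib.load(thresholds_path)
print(reloaded_model.predict_proba(X[:2]).shape, reloaded_thresholds)
```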

In [None]:
# --- Resolve artifact paths
# Expected keys in config.yaml (recommended):
#   artifacts:
#     calibrated_model_path: "models/logreg_risk_pipeline_calibrated.joblib"
#     thresholds_path: "models/risk_thresholds.joblib"
#
# If your config uses different keys, update this mapping.
# Assumption: relative paths are relative to the directory holding config.yaml
# (the project root), so they are resolved against it rather than the notebook's cwd.
from pathlib import Path

PROJECT_ROOT = Path(CONFIG_PATH).parent

def _get_artifact_path(cfg: dict, key: str, fallback: str) -> str:
    # Looks for cfg["artifacts"][key], else uses fallback; absolute paths pass through unchanged
    path = (cfg.get("artifacts") or {}).get(key, fallback)
    return str(PROJECT_ROOT / path)

CALIBRATED_MODEL_PATH = _get_artifact_path(
    config, "calibrated_model_path", "models/logreg_risk_pipeline_calibrated.joblib"
)
THRESHOLDS_PATH = _get_artifact_path(
    config, "thresholds_path", "models/risk_thresholds.joblib"
)

CALIBRATED_MODEL_PATH, THRESHOLDS_PATH

In [None]:
# --- Load artifacts
risk_model = joblib.load(CALIBRATED_MODEL_PATH)
thresholds = joblib.load(THRESHOLDS_PATH)

type(risk_model), thresholds

## 3) Define required input schema

The model expects the 11 physicochemical features of the Wine Quality dataset, with exactly these column names:

- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol

We provide a helper to validate inputs and enforce consistent column order.

In [None]:
FEATURES = [
    "fixed acidity",
    "volatile acidity",
    "citric acid",
    "residual sugar",
    "chlorides",
    "free sulfur dioxide",
    "total sulfur dioxide",
    "density",
    "pH",
    "sulphates",
    "alcohol",
]

def validate_and_build_df(x: Dict[str, Any]) -> pd.DataFrame:
    """Validate a single wine input dict and return a 1-row DataFrame in the correct feature order."""
    missing = [c for c in FEATURES if c not in x]
    extra = [c for c in x.keys() if c not in FEATURES]

    if missing:
        raise ValueError(f"Missing required feature(s): {missing}")
    if extra:
        raise ValueError(f"Unexpected feature(s): {extra}. Allowed: {FEATURES}")

    # Convert to numeric and build dataframe
    row = {c: float(x[c]) for c in FEATURES}
    return pd.DataFrame([row], columns=FEATURES)

# Quick check
validate_and_build_df({c: 1 for c in FEATURES}).head()

## 4) Risk calculator function

- Uses **calibrated probability**: `P(risk=1 | chemistry)`
- Applies the saved decision threshold (recall-oriented)
- Returns a user-friendly output dictionary
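
To make the effect of a recall-oriented threshold concrete, the sketch below labels the same probabilities under two thresholds. The helper `label_from_probability`, the probabilities, and the 0.35 threshold are hypothetical; the actual operating threshold comes from the saved artifact.

```python
def label_from_probability(p: float, threshold: float) -> str:
    """Apply the decision rule: High risk when p >= threshold."""
    return "High risk" if p >= threshold else "Low risk"

probabilities = [0.20, 0.40, 0.60]  # made-up calibrated probabilities

for threshold in (0.50, 0.35):
    labels = [label_from_probability(p, threshold) for p in probabilities]
    print(threshold, labels)
```

Lowering the threshold from 0.50 to 0.35 flags the borderline 0.40 wine as well, trading more false alarms for fewer missed low-quality wines.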

In [None]:
import warnings

def get_operating_threshold(thresholds_obj: Any, default: float = 0.5) -> float:
    """Extract the chosen threshold from the saved thresholds artifact."""
    # Common patterns: dict with a key like "selected_threshold" or "threshold"
    if isinstance(thresholds_obj, dict):
        for k in ["selected_threshold", "threshold", "best_threshold", "recall_threshold"]:
            if k in thresholds_obj:
                return float(thresholds_obj[k])
    # If thresholds is a float already
    if isinstance(thresholds_obj, (float, int, np.floating, np.integer)):
        return float(thresholds_obj)
    # If stored as tuple/list (threshold, metrics)
    if isinstance(thresholds_obj, (list, tuple)) and len(thresholds_obj) > 0:
        if isinstance(thresholds_obj[0], (float, int, np.floating, np.integer)):
            return float(thresholds_obj[0])
    # Nothing recognizable: warn, because a silent default may not be recall-oriented
    warnings.warn(f"No threshold found in artifact; falling back to default={default}")
    return float(default)


OPERATING_THRESHOLD = get_operating_threshold(thresholds, default=0.5)
OPERATING_THRESHOLD

In [None]:
def predict_risk(
    wine_features: Dict[str, Any],
    model=risk_model,
    threshold: float = OPERATING_THRESHOLD,
) -> Dict[str, Any]:
    """Predict low-quality risk using calibrated probabilities.

    Parameters
    ----------
    wine_features : dict
        A dictionary with the 11 chemical features.
    model : sklearn-compatible pipeline
        Trained pipeline providing predict_proba.
    threshold : float
        Decision threshold for classifying as High risk.

    Returns
    -------
    dict
        - risk_probability: calibrated probability of High risk (risk=1)
        - risk_label: 'High risk' if prob >= threshold else 'Low risk'
        - threshold: threshold used
    """
    X = validate_and_build_df(wine_features)
    proba_high_risk = float(model.predict_proba(X)[:, 1][0])
    label = "High risk" if proba_high_risk >= threshold else "Low risk"

    return {
        "risk_probability": proba_high_risk,
        "risk_label": label,
        "threshold": float(threshold),
    }


# Smoke test
example_input = {
    "fixed acidity": 7.4,
    "volatile acidity": 0.70,
    "citric acid": 0.00,
    "residual sugar": 1.9,
    "chlorides": 0.076,
    "free sulfur dioxide": 11.0,
    "total sulfur dioxide": 34.0,
    "density": 0.9978,
    "pH": 3.51,
    "sulphates": 0.56,
    "alcohol": 9.4,
}

predict_risk(example_input)
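
`predict_risk()` reads column 1 of `predict_proba`, which is the High risk class when the model's classes are ordered `[0, 1]`. A defensive variant, sketched here with a stand-in model (`StubModel` and `positive_class_proba` are hypothetical names), looks the column up through the scikit-learn `classes_` attribute instead:

```python
import numpy as np

class StubModel:
    """Hypothetical stand-in exposing the scikit-learn classifier interface."""
    classes_ = np.array([0, 1])

    def predict_proba(self, X):
        # Fixed probabilities per row: P(class 0) = 0.3, P(class 1) = 0.7.
        return np.tile([0.3, 0.7], (len(X), 1))

def positive_class_proba(model, X, positive_label=1):
    """Return P(positive_label) per row, locating the column via model.classes_."""
    classes = list(model.classes_)
    if positive_label not in classes:
        raise ValueError(f"{positive_label!r} not in model classes {classes}")
    col = classes.index(positive_label)
    return model.predict_proba(X)[:, col]

print(positive_class_proba(StubModel(), [[0.0] * 11]))
```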

## 5) Batch scoring helper (optional)

Score multiple wines at once from a list of dicts.

In [None]:
def score_wines(
    wines: list,
    model=risk_model,
    threshold: float = OPERATING_THRESHOLD
) -> pd.DataFrame:
    """Score a list of wine dicts and return a tidy DataFrame."""
    rows = []
    for i, w in enumerate(wines, start=1):
        out = predict_risk(w, model=model, threshold=threshold)
        rows.append({"wine_id": i, **out, **w})
    df = pd.DataFrame(rows)
    # Put summary columns first
    front = ["wine_id", "risk_probability", "risk_label", "threshold"]
    return df[front + FEATURES]
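
`score_wines()` calls the model once per wine, which is fine for a handful of rows. For larger batches, a vectorized variant can score a whole DataFrame with a single `predict_proba` call. This is a sketch: `score_frame` and `ConstantModel` are hypothetical names, and the stub model exists only so the example runs without the trained artifact.

```python
import numpy as np
import pandas as pd

FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

class ConstantModel:
    """Hypothetical stub: predicts P(High risk) = 0.6 for every row."""
    def predict_proba(self, X):
        p = np.full(len(X), 0.6)
        return np.column_stack([1 - p, p])

def score_frame(df: pd.DataFrame, model, threshold: float) -> pd.DataFrame:
    """Score all rows at once and append probability, label, and threshold."""
    missing = [c for c in FEATURES if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required feature(s): {missing}")
    X = df[FEATURES].astype(float)  # enforce column order and numeric dtype
    proba = model.predict_proba(X)[:, 1]
    out = df.copy()
    out["risk_probability"] = proba
    out["risk_label"] = np.where(proba >= threshold, "High risk", "Low risk")
    out["threshold"] = float(threshold)
    return out

# Three identical placeholder wines, just to exercise the function.
wines = pd.DataFrame([{c: 1.0 for c in FEATURES}] * 3)
print(score_frame(wines, ConstantModel(), threshold=0.5)[["risk_probability", "risk_label"]])
```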

## 6) Example scenarios (3–5 wines)

These are **hypothetical inputs** intended to demonstrate usage.

Interpretation note:
- This model estimates **probability of low technical quality** according to the dataset definition.
- It reflects **associations learned from the training data**, not causation.

In [None]:
example_wines = [
    # A) Higher volatile acidity + higher total SO2 + lower alcohol (often higher risk)
    {
        "fixed acidity": 7.2,
        "volatile acidity": 0.85,
        "citric acid": 0.02,
        "residual sugar": 2.0,
        "chlorides": 0.080,
        "free sulfur dioxide": 10.0,
        "total sulfur dioxide": 85.0,
        "density": 0.9980,
        "pH": 3.45,
        "sulphates": 0.55,
        "alcohol": 9.0,
    },
    # B) Lower volatile acidity + higher alcohol (often lower risk)
    {
        "fixed acidity": 7.5,
        "volatile acidity": 0.35,
        "citric acid": 0.30,
        "residual sugar": 2.4,
        "chlorides": 0.050,
        "free sulfur dioxide": 18.0,
        "total sulfur dioxide": 45.0,
        "density": 0.9955,
        "pH": 3.25,
        "sulphates": 0.70,
        "alcohol": 12.5,
    },
    # C) Mid-range profile
    {
        "fixed acidity": 8.0,
        "volatile acidity": 0.55,
        "citric acid": 0.20,
        "residual sugar": 2.2,
        "chlorides": 0.065,
        "free sulfur dioxide": 15.0,
        "total sulfur dioxide": 60.0,
        "density": 0.9968,
        "pH": 3.30,
        "sulphates": 0.60,
        "alcohol": 10.5,
    },
    # D) Higher chlorides + higher total SO2
    {
        "fixed acidity": 6.8,
        "volatile acidity": 0.65,
        "citric acid": 0.10,
        "residual sugar": 1.8,
        "chlorides": 0.095,
        "free sulfur dioxide": 12.0,
        "total sulfur dioxide": 95.0,
        "density": 0.9982,
        "pH": 3.48,
        "sulphates": 0.52,
        "alcohol": 9.6,
    },
    # E) Higher alcohol + higher sulphates (often associated with lower risk) with moderate acidity
    {
        "fixed acidity": 7.1,
        "volatile acidity": 0.40,
        "citric acid": 0.25,
        "residual sugar": 2.1,
        "chlorides": 0.055,
        "free sulfur dioxide": 20.0,
        "total sulfur dioxide": 50.0,
        "density": 0.9959,
        "pH": 3.28,
        "sulphates": 0.78,
        "alcohol": 12.0,
    },
]

scored = score_wines(example_wines)
scored.sort_values("risk_probability", ascending=False)

## 7) Export-ready snippet

Use this snippet in a script or API:

- Load artifacts
- Call `predict_risk()` with a JSON-like dict of features

In [None]:
export_snippet = '''
import joblib
import pandas as pd

risk_model = joblib.load("models/logreg_risk_pipeline_calibrated.joblib")
thresholds = joblib.load("models/risk_thresholds.joblib")

FEATURES = [
    "fixed acidity","volatile acidity","citric acid","residual sugar","chlorides",
    "free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol"
]

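def validate_and_build_df(x):
    # Same checks as the notebook version: reject missing and unexpected keys.
    missing = [c for c in FEATURES if c not in x]
    extra = [c for c in x if c not in FEATURES]
    if missing:
        raise ValueError(f"Missing required feature(s): {missing}")
    if extra:
        raise ValueError(f"Unexpected feature(s): {extra}. Allowed: {FEATURES}")
    row = {c: float(x[c]) for c in FEATURES}
    return pd.DataFrame([row], columns=FEATURES)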

def get_operating_threshold(thresholds_obj, default=0.5):
    if isinstance(thresholds_obj, dict):
        for k in ["selected_threshold", "threshold", "best_threshold", "recall_threshold"]:
            if k in thresholds_obj:
                return float(thresholds_obj[k])
    if isinstance(thresholds_obj, (float, int)):
        return float(thresholds_obj)
    return float(default)

OPERATING_THRESHOLD = get_operating_threshold(thresholds, default=0.5)

def predict_risk(wine_features, model=risk_model, threshold=OPERATING_THRESHOLD):
    X = validate_and_build_df(wine_features)
    p = float(model.predict_proba(X)[:, 1][0])
    label = "High risk" if p >= threshold else "Low risk"
    return {"risk_probability": p, "risk_label": label, "threshold": float(threshold)}
'''
print(export_snippet)
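
As a hedged sketch of API use, the function below accepts a JSON string, delegates to a `predict_risk`-style callable, and turns validation errors into an error payload. `handle_request` and `fake_predict` are hypothetical; in a real service the callable would be the `predict_risk()` defined in the snippet above, and the numbers returned by `fake_predict` are placeholders.

```python
import json
from typing import Any, Callable, Dict

def handle_request(
    body: str,
    predict_fn: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> str:
    """Parse a JSON body, score it, and return a JSON response string."""
    try:
        features = json.loads(body)
        if not isinstance(features, dict):
            raise ValueError("Request body must be a JSON object of features")
        result = predict_fn(features)
        return json.dumps({"ok": True, "result": result})
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        return json.dumps({"ok": False, "error": str(exc)})

def fake_predict(features: Dict[str, Any]) -> Dict[str, Any]:
    """Placeholder for predict_risk(); returns fixed, made-up values."""
    if "alcohol" not in features:
        raise ValueError("Missing required feature(s): ['alcohol']")
    return {"risk_probability": 0.42, "risk_label": "High risk", "threshold": 0.35}

print(handle_request('{"alcohol": 9.4}', fake_predict))
print(handle_request("{}", fake_predict))
print(handle_request("not json", fake_predict))
```

Catching `ValueError` covers both malformed JSON and the errors raised by `validate_and_build_df()`, so callers always receive a well-formed JSON response.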