# 05 — Wine Risk Calculator (Low Technical Quality)

This notebook builds a **risk calculator** that predicts the probability that a wine has **low technical quality**, based on its chemical composition.

**Risk definition (binary target)**

- `risk = 1` if `quality ≤ 5` (High risk: low quality)
- `risk = 0` if `quality > 5` (Low risk)
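
The target definition above amounts to a one-line mapping. The following is a minimal sketch, assuming `quality` is an integer score as in the Wine Quality dataset. The helper name `quality_to_risk` and the sample scores are hypothetical and not part of this notebook.

```python
def quality_to_risk(quality: int, cutoff: int = 5) -> int:
    """Map a quality score to the binary target: 1 = high risk, 0 = low risk."""
    return 1 if quality <= cutoff else 0


# Hypothetical quality scores, for illustration only
scores = [3, 5, 6, 8]
labels = [quality_to_risk(q) for q in scores]
print(labels)  # [1, 1, 0, 0]
```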

**Business objective**

- Minimize **false negatives** (missing a low-quality wine is more costly than raising a false alarm).
- Use **model probability estimates** and a **recall-oriented threshold**.
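
One way to read "recall-oriented threshold" is to pick the highest cutoff that still catches a target share of the truly low-quality wines on held-out data. The sketch below illustrates that idea only. It is not necessarily how this project's frozen threshold was chosen, and the function name `threshold_for_recall`, the recall targets, and the toy data are all assumptions.

```python
from typing import Sequence


def threshold_for_recall(
    y_true: Sequence[int],
    probs: Sequence[float],
    target_recall: float = 0.9,
) -> float:
    """Return the highest threshold t whose recall at `prob >= t` meets the target."""
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")

    # Scores of the true positives (risk=1), highest first
    positives = sorted((p for y, p in zip(y_true, probs) if y == 1), reverse=True)
    if not positives:
        raise ValueError("y_true contains no positive (risk=1) samples")

    # Using the k-th highest positive score as the threshold captures
    # at least k positives, so recall is at least k / n_positives.
    for k, t in enumerate(positives, start=1):
        if k / len(positives) >= target_recall:
            return t
    return positives[-1]  # not reached for a valid target_recall


# Toy validation data, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
probs = [0.95, 0.80, 0.60, 0.30, 0.40, 0.10, 0.05, 0.20]
print(threshold_for_recall(y_true, probs, target_recall=0.8))  # 0.3
```

Lowering the threshold trades more false alarms for fewer missed low-quality wines, which matches the cost asymmetry stated above.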

---

## Outputs

- `predict_risk()` function, which returns:
  - the probability of **high risk** (`risk_probability`)
  - the risk label (`"HIGH RISK"` / `"LOW RISK"`)
  - the decision threshold used
  - a short, business-friendly interpretation message
- Example scenarios (3–5 hypothetical wines)

## 1) Imports and settings

In [None]:
import joblib
import numpy as np
import pandas as pd
from pathlib import Path
from typing import Dict, Any


pd.set_option("display.max_columns", 50)

## 2) Load trained artifacts


In [5]:
# --- Load final model and frozen threshold (inference only)

MODEL_PATH = Path("../models/histgb_risk_model.joblib")
THRESH_PATH = Path("../models/risk_thresholds.joblib")

model = joblib.load(MODEL_PATH)
thresholds = joblib.load(THRESH_PATH)

# Fall back to 0.288 only if the key is missing from the thresholds artifact
FROZEN_THRESHOLD = float(thresholds.get("histgb_frozen_threshold", 0.288))

type(model), FROZEN_THRESHOLD



(sklearn.ensemble._hist_gradient_boosting.gradient_boosting.HistGradientBoostingClassifier,
 0.288)
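
The `.get(..., 0.288)` call falls back to the default without warning if the key is absent, which could hide a stale or mismatched artifact. A stricter reader is sketched below as one possible alternative. The helper name `get_frozen_threshold` is hypothetical, and a plain dict stands in for the loaded artifact.

```python
from typing import Any, Mapping


def get_frozen_threshold(
    thresholds: Mapping[str, Any],
    key: str = "histgb_frozen_threshold",
) -> float:
    """Read the frozen threshold, failing loudly if it is missing or out of range."""
    if key not in thresholds:
        raise KeyError(f"Threshold key {key!r} not found; available: {sorted(thresholds)}")
    value = float(thresholds[key])
    if not 0.0 < value < 1.0:
        raise ValueError(f"Threshold must lie strictly between 0 and 1, got {value}")
    return value


# Stand-in for the loaded artifact (the real one comes from joblib.load)
print(get_frozen_threshold({"histgb_frozen_threshold": 0.288}))  # 0.288
```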

## 3) Define required input schema

The model expects the 11 physicochemical features of the Wine Quality dataset, named exactly as follows:

- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol

We provide a helper to validate inputs and enforce consistent column order.

In [7]:
FEATURES = [
    "fixed acidity",
    "volatile acidity",
    "citric acid",
    "residual sugar",
    "chlorides",
    "free sulfur dioxide",
    "total sulfur dioxide",
    "density",
    "pH",
    "sulphates",
    "alcohol",
]

def validate_and_build_df(x: Dict[str, Any]) -> pd.DataFrame:
    """Validate a single wine input dict and return a 1-row DataFrame in the correct feature order."""
    missing = [c for c in FEATURES if c not in x]
    extra = [c for c in x.keys() if c not in FEATURES]

    if missing:
        raise ValueError(f"Missing required feature(s): {missing}")
    if extra:
        raise ValueError(f"Unexpected feature(s): {extra}. Allowed: {FEATURES}")

    # Convert to numeric and build dataframe
    row = {c: float(x[c]) for c in FEATURES}
    return pd.DataFrame([row], columns=FEATURES)

# Quick check
validate_and_build_df({c: 1 for c in FEATURES}).head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
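
The helper checks feature names and converts values with `float()`, but `float()` also accepts NaN. Since `HistGradientBoostingClassifier` has native support for missing values, a NaN input would likely be scored rather than rejected. The sketch below shows one possible extra check. `find_non_finite` is a hypothetical helper, not part of the notebook.

```python
import math
from typing import Any, Dict, List

FEATURES: List[str] = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]


def find_non_finite(x: Dict[str, Any]) -> List[str]:
    """Return the features whose value is NaN or infinite after float conversion."""
    return [c for c in FEATURES if c in x and not math.isfinite(float(x[c]))]


# Illustrative input with one missing measurement
wine = {c: 1 for c in FEATURES}
wine["pH"] = float("nan")
print(find_non_finite(wine))  # ['pH']
```

Whether to reject such inputs or let the model handle them is a design choice that depends on how missing lab measurements should be treated.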


## 4) Risk calculator function


In [8]:
def predict_risk(
    wine_features: Dict[str, Any],
    model,
    threshold: float = FROZEN_THRESHOLD,
) -> Dict[str, Any]:
    """
    Predict technical low-quality risk (risk=1) from chemical composition (inference only).

    Returns:
      - risk_probability: P(risk=1)
      - risk_decision: "HIGH RISK" if prob >= threshold else "LOW RISK"
      - threshold_used: frozen threshold applied
      - interpretation: business-friendly message
    """
    X = validate_and_build_df(wine_features)
    prob_high_risk = float(model.predict_proba(X)[:, 1][0])

    decision = "HIGH RISK" if prob_high_risk >= threshold else "LOW RISK"

    interpretation = (
        "Flag for preventive quality review before market release."
        if decision == "HIGH RISK"
        else "No preventive flag based on current chemical profile."
    )

    return {
        "risk_probability": prob_high_risk,
        "risk_decision": decision,
        "threshold_used": float(threshold),
        "interpretation": interpretation,
    }


# Smoke test
example_input = {
    "fixed acidity": 7.4,
    "volatile acidity": 0.70,
    "citric acid": 0.00,
    "residual sugar": 1.9,
    "chlorides": 0.076,
    "free sulfur dioxide": 11.0,
    "total sulfur dioxide": 34.0,
    "density": 0.9978,
    "pH": 3.51,
    "sulphates": 0.56,
    "alcohol": 9.4,
}

predict_risk(example_input, model=model)


{'risk_probability': 0.9290120091238108,
 'risk_decision': 'HIGH RISK',
 'threshold_used': 0.288,
 'interpretation': 'Flag for preventive quality review before market release.'}

## 5) Batch scoring helper (optional)

Score multiple wines at once from a list of dicts.

In [9]:
def score_wines(
    wines: list,
    model,
    threshold: float = FROZEN_THRESHOLD
) -> pd.DataFrame:
    """
    Score a list of wine feature dictionaries using the final HistGB risk model.
    Inference only (no retraining, no threshold tuning).
    """
    rows = []
    for i, w in enumerate(wines, start=1):
        out = predict_risk(w, model=model, threshold=threshold)
        rows.append(
            {
                "wine_id": i,
                **out,
                **w,
            }
        )

    df = pd.DataFrame(rows)

    # Order columns: decision outputs first, then features
    front = [
        "wine_id",
        "risk_probability",
        "risk_decision",
        "threshold_used",
        "interpretation",
    ]

    return df[front + FEATURES]
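
Because `predict_risk()` raises `ValueError` on an invalid input, `score_wines()` stops at the first bad record. For larger batches it may be preferable to collect errors and keep going. The following is a sketch of that pattern under stated assumptions: `score_safely` and `fake_predict` are hypothetical names, and the stand-in scorer replaces the real model call.

```python
from typing import Any, Callable, Dict, List, Tuple


def score_safely(
    wines: List[Dict[str, Any]],
    predict_fn: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Score each wine, collecting validation errors instead of stopping at the first."""
    results: List[Dict[str, Any]] = []
    errors: List[Dict[str, Any]] = []
    for i, wine in enumerate(wines, start=1):
        try:
            results.append({"wine_id": i, **predict_fn(wine)})
        except (ValueError, TypeError) as exc:
            errors.append({"wine_id": i, "error": str(exc)})
    return results, errors


# Stand-in scorer for illustration; in the notebook this could be
# lambda w: predict_risk(w, model=model)
def fake_predict(wine: Dict[str, Any]) -> Dict[str, Any]:
    if "alcohol" not in wine:
        raise ValueError("Missing required feature(s): ['alcohol']")
    return {"risk_probability": 0.5}


results, errors = score_safely([{"alcohol": 9.4}, {}], fake_predict)
print(len(results), len(errors))  # 1 1
```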


## 6) Example scenarios (3–5 wines)

These are **hypothetical inputs** intended to demonstrate usage.

Interpretation note:
- This model estimates **probability of low technical quality** according to the dataset definition.
- It reflects **associations learned from the training data**, not causation.

In [10]:
example_wines = [
    # A) Higher volatile acidity + higher total SO2 + lower alcohol (often higher risk)
    {
        "fixed acidity": 7.2,
        "volatile acidity": 0.85,
        "citric acid": 0.02,
        "residual sugar": 2.0,
        "chlorides": 0.080,
        "free sulfur dioxide": 10.0,
        "total sulfur dioxide": 85.0,
        "density": 0.9980,
        "pH": 3.45,
        "sulphates": 0.55,
        "alcohol": 9.0,
    },
    # B) Lower volatile acidity + higher alcohol (often lower risk)
    {
        "fixed acidity": 7.5,
        "volatile acidity": 0.35,
        "citric acid": 0.30,
        "residual sugar": 2.4,
        "chlorides": 0.050,
        "free sulfur dioxide": 18.0,
        "total sulfur dioxide": 45.0,
        "density": 0.9955,
        "pH": 3.25,
        "sulphates": 0.70,
        "alcohol": 12.5,
    },
    # C) Mid-range profile
    {
        "fixed acidity": 8.0,
        "volatile acidity": 0.55,
        "citric acid": 0.20,
        "residual sugar": 2.2,
        "chlorides": 0.065,
        "free sulfur dioxide": 15.0,
        "total sulfur dioxide": 60.0,
        "density": 0.9968,
        "pH": 3.30,
        "sulphates": 0.60,
        "alcohol": 10.5,
    },
    # D) Higher chlorides + higher total SO2
    {
        "fixed acidity": 6.8,
        "volatile acidity": 0.65,
        "citric acid": 0.10,
        "residual sugar": 1.8,
        "chlorides": 0.095,
        "free sulfur dioxide": 12.0,
        "total sulfur dioxide": 95.0,
        "density": 0.9982,
        "pH": 3.48,
        "sulphates": 0.52,
        "alcohol": 9.6,
    },
    # E) Higher alcohol + higher sulphates (often protective) with moderate acidity
    {
        "fixed acidity": 7.1,
        "volatile acidity": 0.40,
        "citric acid": 0.25,
        "residual sugar": 2.1,
        "chlorides": 0.055,
        "free sulfur dioxide": 20.0,
        "total sulfur dioxide": 50.0,
        "density": 0.9959,
        "pH": 3.28,
        "sulphates": 0.78,
        "alcohol": 12.0,
    },
]

scored = score_wines(example_wines, model=model)
scored.sort_values("risk_probability", ascending=False)

| | wine_id | risk_probability | risk_decision | threshold_used | interpretation | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | 0.943176 | HIGH RISK | 0.288 | Flag for preventive quality review before mark... | 6.8 | 0.65 | 0.1 | 1.8 | 0.095 | 12.0 | 95.0 | 0.9982 | 3.48 | 0.52 | 9.6 |
| 0 | 1 | 0.829064 | HIGH RISK | 0.288 | Flag for preventive quality review before mark... | 7.2 | 0.85 | 0.02 | 2.0 | 0.08 | 10.0 | 85.0 | 0.998 | 3.45 | 0.55 | 9.0 |
| 2 | 3 | 0.253594 | LOW RISK | 0.288 | No preventive flag based on current chemical p... | 8.0 | 0.55 | 0.2 | 2.2 | 0.065 | 15.0 | 60.0 | 0.9968 | 3.3 | 0.6 | 10.5 |
| 4 | 5 | 0.004915 | LOW RISK | 0.288 | No preventive flag based on current chemical p... | 7.1 | 0.4 | 0.25 | 2.1 | 0.055 | 20.0 | 50.0 | 0.9959 | 3.28 | 0.78 | 12.0 |
| 1 | 2 | 0.001838 | LOW RISK | 0.288 | No preventive flag based on current chemical p... | 7.5 | 0.35 | 0.3 | 2.4 | 0.05 | 18.0 | 45.0 | 0.9955 | 3.25 | 0.7 | 12.5 |
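
To see how much the decision depends on the threshold, the probabilities from the table above can be re-labelled at a different cutoff. In this short sketch the probabilities are copied from the output, while the 0.25 cutoff and the helper name `flagged` are purely hypothetical.

```python
from typing import Dict, List

# wine_id -> risk_probability, copied from the scored table above
probs: Dict[int, float] = {4: 0.943176, 1: 0.829064, 3: 0.253594, 5: 0.004915, 2: 0.001838}


def flagged(probabilities: Dict[int, float], threshold: float) -> List[int]:
    """Return the sorted wine_ids whose probability meets or exceeds the threshold."""
    return sorted(wine_id for wine_id, p in probabilities.items() if p >= threshold)


print(flagged(probs, 0.288))  # [1, 4]
print(flagged(probs, 0.25))   # [1, 3, 4]
```

Wine 3 sits just below the frozen threshold, so a slightly lower cutoff would flag it as well. Borderline cases like this are where the threshold choice matters most.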


## 7) Export-ready snippet

Use this snippet in a script or API:

- Load artifacts
- Call `predict_risk()` with a JSON-like dict of features

In [None]:
export_snippet = '''
import joblib
import pandas as pd
from pathlib import Path

# --- Load final artifacts (inference only)
MODEL_PATH = Path("models/histgb_risk_model.joblib")
THRESH_PATH = Path("models/risk_thresholds.joblib")

model = joblib.load(MODEL_PATH)
thresholds = joblib.load(THRESH_PATH)

FROZEN_THRESHOLD = float(thresholds.get("histgb_frozen_threshold", 0.288))

# --- Expected feature schema
FEATURES = [
    "fixed acidity","volatile acidity","citric acid","residual sugar","chlorides",
    "free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol"
]

def validate_and_build_df(x: dict) -> pd.DataFrame:
    """Validate a single wine input dict and return a 1-row DataFrame in the correct feature order."""
    missing = [c for c in FEATURES if c not in x]
    extra = [c for c in x.keys() if c not in FEATURES]

    if missing:
        raise ValueError(f"Missing required feature(s): {missing}")
    if extra:
        raise ValueError(f"Unexpected feature(s): {extra}. Allowed: {FEATURES}")

    row = {c: float(x[c]) for c in FEATURES}
    return pd.DataFrame([row], columns=FEATURES)

def predict_risk(wine_features: dict, model, threshold: float = FROZEN_THRESHOLD) -> dict:
    """
    Predict technical low-quality risk (risk=1) from chemical composition.

    Returns:
      - risk_probability: P(risk=1)
      - risk_decision: "HIGH RISK" if prob >= threshold else "LOW RISK"
      - threshold_used: frozen threshold applied
      - interpretation: business-friendly message
    """
    X = validate_and_build_df(wine_features)
    p = float(model.predict_proba(X)[:, 1][0])

    decision = "HIGH RISK" if p >= threshold else "LOW RISK"
    interpretation = (
        "Flag for preventive quality review before market release."
        if decision == "HIGH RISK"
        else "No preventive flag based on current chemical profile."
    )

    return {
        "risk_probability": p,
        "risk_decision": decision,
        "threshold_used": float(threshold),
        "interpretation": interpretation,
    }
'''
print(export_snippet)
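
When the snippet sits behind an API, the features usually arrive as a JSON string. The following is a minimal sketch of a wrapper that parses the payload, calls a scoring function, and turns validation errors into a JSON error response. `handle_payload` and `stub_predict` are hypothetical names; in practice the scoring function could be `lambda f: predict_risk(f, model=model)`.

```python
import json
from typing import Any, Callable, Dict


def handle_payload(
    payload: str,
    predict_fn: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> str:
    """Parse a JSON request body, score it, and return a JSON response string."""
    try:
        features = json.loads(payload)
        if not isinstance(features, dict):
            raise ValueError("Payload must be a JSON object of feature names to values")
        body = {"ok": True, "result": predict_fn(features)}
    except (ValueError, TypeError) as exc:  # json.JSONDecodeError subclasses ValueError
        body = {"ok": False, "error": str(exc)}
    return json.dumps(body)


# Stand-in scorer so the sketch runs without the model artifacts
def stub_predict(features: Dict[str, Any]) -> Dict[str, Any]:
    return {"risk_decision": "HIGH RISK", "risk_probability": 0.93}


good = json.loads(handle_payload('{"alcohol": 9.4}', stub_predict))
bad = json.loads(handle_payload("not json", stub_predict))
print(good["ok"], bad["ok"])  # True False
```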
