# Notebook 04_infer_single_sample.ipynb

This notebook performs local inference on one or more real estate samples using a trained regression model. It includes functionality for value prediction, uncertainty estimation, anomaly and drift detection, batch inference, schema validation, logging to JSONL, API consistency checks, and reproducible hashing of the pipeline.

## System Architecture Summary

This notebook performs robust and traceable property value inference using a pre-trained ML pipeline. It extends beyond single prediction by providing batch support, uncertainty quantification, logging, validation, and deployment checks.

**Model Usage**
- Loads and applies trained LightGBM regressor
- Supports both single and batch inference

**Validation**
- Schema and range checks for inputs
- Output validated against strict schema

**Monitoring**
- Logs predictions and system metrics with timestamps
- Tracks latency, confidence bounds, and model version

**Robustness Tools**
- Sensitivity analysis
- Consistency checks with deployed APIs
- Model artifact hashing for audit integrity

This notebook is suitable for production-grade inference, model auditing, and API testing in real estate asset tokenization pipelines.

## 01. Imports & Paths

### Technical Overview
Initializes the environment by importing all required libraries and defining file paths for model, metadata, logs, and input samples.

### Implementation Details
- Imports: `os`, `json`, `jsonschema`, `datetime`, `pathlib`, `hashlib`, `pandas`, `scipy.stats`, `numpy`, `joblib`, `time`, `requests`, `warnings`
- Paths are set using `Path()` for:
 - Model: `value_regressor_v1.joblib`
 - Metadata: `value_regressor_v1_meta.json`
 - Output log: `../data/predictions_log.jsonl`
- Asserts that the model and metadata files exist, then loads both and reads the expected feature lists from the metadata

### Purpose
Prepares the environment and directory structure for performing inference and tracking outputs.

### Output
Prints a confirmation once the model and metadata paths are verified.

In [1]:
import os
import json
from jsonschema import validate, ValidationError
from datetime import datetime
from pathlib import Path
import hashlib
import pandas as pd
import scipy.stats as st
import numpy as np
import joblib
import time
import requests
import warnings

ASSET_TYPE = "property"
MODEL_VERSION = "v1"
MODEL_DIR = Path(f"../models/{ASSET_TYPE}")
PIPELINE_PATH = MODEL_DIR / f"value_regressor_{MODEL_VERSION}.joblib"
META_PATH = MODEL_DIR / f"value_regressor_{MODEL_VERSION}_meta.json"
LOG_PATH = Path("../data/predictions_log.jsonl")
API_BASE = "http://127.0.0.1:8000"  # base URL of the local FastAPI service
COMPARE_WITH_API = True  # set to False to skip the HTTP comparison in section 11

assert PIPELINE_PATH.exists(), f"Missing pipeline file: {PIPELINE_PATH}"
assert META_PATH.exists(), f"Missing metadata file: {META_PATH}"
print("Loaded model + metadata paths OK.")

pipeline = joblib.load(PIPELINE_PATH)
with META_PATH.open("r", encoding="utf-8") as f:
    model_meta = json.load(f)

categorical_expected = model_meta["features_categorical"]
numeric_expected = model_meta["features_numeric"]
ALL_EXPECTED = categorical_expected + numeric_expected

Loaded model + metadata paths OK.


## 02. Validation Utilities

### Technical Overview
Defines utility functions for validating input schema and acceptable feature ranges.

### Implementation Details
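- `autofill_derived()`: Derives `age_years` from `year_built` when it is missing
- `validate_input_record()`: Checks that the record contains all expected features and, with `strict=True`, rejects unexpected keys
- `detect_anomalies()`: Flags `size_m2` outside the 20 to 500 range or `year_built` before 1800
- Each function handles one record; batch callers apply them per record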

### Purpose
Guarantees the input conforms to the model's expectations before inference.

### Output
No direct output; `validate_input_record()` raises `ValueError` on invalid input and `detect_anomalies()` returns a boolean.

In [5]:
def autofill_derived(record: dict) -> dict:
    """If age_years missing but year_built present, derive it."""
    if "age_years" not in record and "year_built" in record:
        record = {
            **record,
            "age_years": datetime.utcnow().year - int(record["year_built"]),
        }
    return record


def validate_input_record(record: dict, strict=True):
    """
    Validates that all expected features are present.
    If strict=True, rejects extra keys.
    Auto-fills derived features if possible.
    Raises ValueError on problems.
    """
    record = autofill_derived(record)
    missing = [f for f in ALL_EXPECTED if f not in record]
    extras = [f for f in record if f not in ALL_EXPECTED]
    if missing:
        raise ValueError(f"Missing required features: {missing}")
    if strict and extras:
        raise ValueError(f"Unexpected extra features: {extras}")
    return record


ANOMALY_RULES = {"size_m2": {"min": 20, "max": 500}, "year_built": {"min": 1800}}


def detect_anomalies(record: dict) -> bool:
    if record.get("size_m2", 0) < 20 or record.get("size_m2", 0) > 500:
        return True
    if record.get("year_built", 2000) < 1800:
        return True
    return False
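
The `ANOMALY_RULES` table above is defined, but `detect_anomalies()` hard-codes the same thresholds. A minimal sketch of a rule-driven variant follows. The helper name `find_rule_violations` is hypothetical, and unlike the original it skips features that are absent from the record instead of substituting default values.

```python
# Hypothetical helper: evaluates a record against a bounds table shaped like ANOMALY_RULES.
ANOMALY_RULES = {"size_m2": {"min": 20, "max": 500}, "year_built": {"min": 1800}}


def find_rule_violations(record: dict, rules: dict = ANOMALY_RULES) -> list:
    """Return a human-readable message for every bound the record violates."""
    violations = []
    for feature, bounds in rules.items():
        if feature not in record:
            continue  # assumption: presence is enforced separately by schema validation
        value = record[feature]
        if "min" in bounds and value < bounds["min"]:
            violations.append(f"{feature}={value} is below the minimum {bounds['min']}")
        if "max" in bounds and value > bounds["max"]:
            violations.append(f"{feature}={value} is above the maximum {bounds['max']}")
    return violations


print(find_rule_violations({"size_m2": 95, "year_built": 1999}))   # no violations
print(find_rule_violations({"size_m2": 650, "year_built": 1750}))  # two violations
```

Returning the list of violations rather than a single boolean keeps the reason for each flag available for logging; `bool(find_rule_violations(record))` recovers the yes/no answer.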

## 03. Sample Single Property

### Technical Overview
Defines a sample input property and validates it for inference.

### Implementation Details
- Defines the sample property inline as a Python dict
- Validates it with `validate_input_record(..., strict=True)`, which checks the feature schema
- Derives `age_years` from `year_built` during validation

### Purpose
Prepares a single property input for prediction.

### Output
No direct output; the validated record is stored in `sample_property`.

In [6]:
sample_property = {
    "location": "Milan",
    "size_m2": 95,
    "rooms": 4,
    "bathrooms": 2,
    "year_built": 1999,
    "floor": 2,
    "building_floors": 6,
    "has_elevator": 1,
    "has_garden": 0,
    "has_balcony": 1,
    "garage": 1,
    "energy_class": "B",
    "humidity_level": 50.0,
    "temperature_avg": 20.5,
    "noise_level": 40,
    "air_quality_index": 70,
}

sample_property = validate_input_record(sample_property, strict=True)

## 04. Load Pipeline & Metadata

### Technical Overview
Loads the pre-trained model and associated metadata file for consistent and versioned inference.

### Implementation Details
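- Uses `joblib.load()` for the model
- Parses metadata from JSON
- Extracts the categorical and numeric feature lists
- Defines `load_model_with_fallback()`, which loads an earlier version when the primary model file is missing (defined here but not called)

### Purpose
Ensures the correct pipeline is used for consistent predictions and auditing.

### Output
Prints the number and names of the expected features.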

In [2]:
def load_model_with_fallback(primary_version="v1", fallback_version="v0"):
    try:
        return joblib.load(
            f"../models/property/value_regressor_{primary_version}.joblib"
        )
    except FileNotFoundError:
        print(f"⚠️ Primary model not found. Falling back to {fallback_version}")
        return joblib.load(
            f"../models/property/value_regressor_{fallback_version}.joblib"
        )


pipeline = joblib.load(PIPELINE_PATH)
with META_PATH.open("r", encoding="utf-8") as f:
    model_meta = json.load(f)

categorical_expected = model_meta["features_categorical"]
numeric_expected = model_meta["features_numeric"]
ALL_EXPECTED = categorical_expected + numeric_expected

print("Expected features (sum):", len(ALL_EXPECTED))
print("Expected features:", ALL_EXPECTED)

Expected features (sum): 17
Expected features: ['location', 'energy_class', 'has_elevator', 'has_garden', 'has_balcony', 'garage', 'size_m2', 'rooms', 'bathrooms', 'year_built', 'floor', 'building_floors', 'humidity_level', 'temperature_avg', 'noise_level', 'air_quality_index', 'age_years']

## 05. Local Prediction

### Technical Overview
Applies the model to predict property value, estimates uncertainty, and records inference time.

### Implementation Details
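- `predict()` is used for model inference
- Repeats the prediction `n_simulations` times and builds a Student-t confidence interval (`scipy.stats.t.ppf`) around the mean
- Measures latency in milliseconds
- Reports the standard deviation of the repeated predictions as the uncertainty; for a deterministic pipeline this is zero, so the interval collapses to the point prediction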

### Purpose
Provides a robust and explainable single prediction with uncertainty and latency profiling.

### Output
Displays:
- Predicted value (k€)
- Confidence interval
- Prediction latency
- Uncertainty estimate

In [86]:
def predict_with_confidence(
    record: dict, n_simulations: int = 100, confidence: float = 0.95
):
    df = pd.DataFrame([record])[ALL_EXPECTED]
    preds = [pipeline.predict(df)[0] for _ in range(n_simulations)]
    mean_pred = np.mean(preds)
    std_pred = np.std(preds)

    ci_margin = st.t.ppf((1 + confidence) / 2, df=n_simulations - 1) * (
        std_pred / np.sqrt(n_simulations)
    )
    lower_bound = mean_pred - ci_margin
    upper_bound = mean_pred + ci_margin

    return {
        "prediction": float(mean_pred),
        "confidence_interval": (round(lower_bound, 2), round(upper_bound, 2)),
        "uncertainty": round(std_pred, 2),
    }

In [87]:
# Register the filter before predicting so it applies to this cell's predict() calls.
warnings.filterwarnings("ignore", message="X does not have valid feature names")

# perf_counter() is a monotonic clock intended for measuring elapsed time.
start = time.perf_counter()
confidence_output = predict_with_confidence(
    sample_property, n_simulations=100, confidence=0.95
)
end = time.perf_counter()

pred_value = confidence_output["prediction"]
conf_interval = confidence_output["confidence_interval"]
uncertainty = confidence_output["uncertainty"]
latency_ms = round((end - start) * 1000, 2)

print(
    f"[LOCAL] Predicted valuation_k: {pred_value:.2f} k€ ± {uncertainty:.2f} (CI: {conf_interval[0]:.2f} – {conf_interval[1]:.2f}) in {latency_ms} ms"
)

[LOCAL] Predicted valuation_k: 5.38 k€ ± 0.00 (CI: 5.38 – 5.38) in 641.45 ms
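
The `± 0.00` above arises because repeated predictions on the same input are identical for a deterministic pipeline. One common alternative is an empirical interval built from held-out residuals. The sketch below assumes such residuals are available; the function name `residual_interval` and the residual values are invented for illustration and are not part of the notebook.

```python
import math


def residual_interval(point_pred: float, residuals: list, confidence: float = 0.95):
    """Shift empirical residual quantiles by the point prediction.

    Assumes residuals = y_true - y_pred on a held-out set (hypothetical input here).
    Quantile indices are rounded outward, so the interval errs on the wide side.
    """
    if not residuals:
        raise ValueError("At least one residual is required")
    ordered = sorted(residuals)
    alpha = (1.0 - confidence) / 2.0
    last = len(ordered) - 1
    lo_idx = math.floor(alpha * last)
    hi_idx = math.ceil((1.0 - alpha) * last)
    return point_pred + ordered[lo_idx], point_pred + ordered[hi_idx]


# Made-up residuals in k€, evenly spaced from -1.0 to 1.0, for illustration only.
demo_residuals = [i / 10 for i in range(-10, 11)]
low, high = residual_interval(5.0, demo_residuals, confidence=0.95)
print(f"interval: {low:.2f} to {high:.2f}")
```

Unlike the t-interval over repeated calls, the width here reflects how far the model's predictions have historically been from observed values, so it does not shrink to zero for a deterministic model.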


In [88]:
anomaly_detected = detect_anomalies(sample_property)
if anomaly_detected:
    print("⚠️ Anomaly detected in input property!")
else:
    print("✅ No anomalies detected.")

✅ No anomalies detected.


In [89]:
# Feature drift
def check_feature_drift(record: dict, baseline_stats: dict):
    for feature, value in record.items():
        if feature in baseline_stats:
            mean, std = baseline_stats[feature]
            if std == 0:
                continue
            z_score = abs((value - mean) / std)
            if z_score > 3:
                return True, f"Feature {feature} drift detected"
    return False, None

In [90]:
baseline_stats = model_meta.get("feature_stats", {})
drift_detected, drift_msg = check_feature_drift(sample_property, baseline_stats)
print(f"Drift: {drift_detected} | {drift_msg or 'No significant drift'}")

Drift: False | No significant drift
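
`check_feature_drift()` returns as soon as one feature exceeds a z-score of 3 and would fail on a non-numeric value. The following sketch, under the assumption that baseline statistics are stored as `(mean, std)` pairs, reports every drifting feature instead. The name `drifting_features` and the baseline numbers are hypothetical.

```python
def drifting_features(record: dict, baseline_stats: dict, z_threshold: float = 3.0) -> dict:
    """Return {feature: z_score} for numeric features beyond the threshold.

    baseline_stats maps feature -> (mean, std), mirroring how check_feature_drift
    unpacks model_meta["feature_stats"]; the numbers used below are invented.
    """
    flagged = {}
    for feature, (mean, std) in baseline_stats.items():
        value = record.get(feature)
        # Skip missing or non-numeric values and degenerate baselines.
        if isinstance(value, bool) or not isinstance(value, (int, float)) or std == 0:
            continue
        z_score = abs((value - mean) / std)
        if z_score > z_threshold:
            flagged[feature] = round(z_score, 2)
    return flagged


baseline = {"size_m2": (100.0, 30.0), "noise_level": (45.0, 10.0)}  # hypothetical stats
print(drifting_features({"size_m2": 95, "noise_level": 40}, baseline))   # {}
print(drifting_features({"size_m2": 250, "noise_level": 40}, baseline))  # {'size_m2': 5.0}
```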


## 06. Output Schema Builder

### Technical Overview
Builds the output dictionary using a consistent schema for logging and API matching.

### Implementation Details
- Output includes: predicted value, confidence bounds, latency, uncertainty, flags for anomaly/drift
- Ensures consistent keys across notebooks and APIs

### Purpose
Standardizes result formatting for downstream processing and logging.

### Output
Returns dict with structured prediction results.

In [91]:
def build_output_schema(
    asset_id: str,
    asset_type: str,
    valuation_k: float,
    model_meta: dict,
    condition_score: float = None,
    risk_score: float = None,
    anomaly: bool = False,
    needs_review: bool = False,
    extra_metrics: dict = None,
):
    out = {
        "asset_id": asset_id,
        "asset_type": asset_type,
        "timestamp": datetime.utcnow().isoformat(timespec="seconds") + "Z",
        "metrics": {"valuation_base_k": round(float(valuation_k), 3)},
        "flags": {"anomaly": anomaly, "needs_review": needs_review},
        "model_meta": {
            "value_model_version": model_meta.get("model_version"),
            "value_model_name": model_meta.get("model_class"),
        },
        "offchain_refs": {"detail_report_hash": None, "sensor_batch_hash": None},
    }
    if condition_score is not None:
        out["metrics"]["condition_score"] = round(float(condition_score), 3)
    if risk_score is not None:
        out["metrics"]["risk_score"] = round(float(risk_score), 3)
    if extra_metrics:
        for k, v in extra_metrics.items():
            out["metrics"][k] = float(v)
    return out


single_output = build_output_schema(
    asset_id="asset_manual_0001",
    asset_type=ASSET_TYPE,
    valuation_k=pred_value,
    model_meta=model_meta,
    anomaly=anomaly_detected,
    needs_review=drift_detected,
    extra_metrics={
        "uncertainty": confidence_output["uncertainty"],
        "confidence_low_k": confidence_output["confidence_interval"][0],
        "confidence_high_k": confidence_output["confidence_interval"][1],
        "latency_ms": latency_ms,
    },
)

single_output

{'asset_id': 'asset_manual_0001',
 'asset_type': 'property',
 'timestamp': '2025-07-22T18:49:56Z',
 'metrics': {'valuation_base_k': 5.382,
  'uncertainty': 0.0,
  'confidence_low_k': 5.38,
  'confidence_high_k': 5.38,
  'latency_ms': 641.45},
 'flags': {'anomaly': False, 'needs_review': False},
 'model_meta': {'value_model_version': 'v1',
  'value_model_name': 'LGBMRegressor'},
 'offchain_refs': {'detail_report_hash': None, 'sensor_batch_hash': None}}
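
The timestamp above is produced with `datetime.utcnow()`, which is deprecated as of Python 3.12. A minimal sketch of a timezone-aware equivalent that yields the same `YYYY-MM-DDTHH:MM:SSZ` format is shown below; the helper name `utc_timestamp` is hypothetical.

```python
from datetime import datetime, timezone


def utc_timestamp() -> str:
    """ISO 8601 UTC timestamp with second precision and a trailing 'Z'."""
    now = datetime.now(timezone.utc)
    # isoformat() on an aware datetime ends in "+00:00"; swap it for the "Z" suffix.
    return now.isoformat(timespec="seconds").replace("+00:00", "Z")


print(utc_timestamp())
```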

## 07. Batch Inference

### Technical Overview
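Builds a batch of samples and performs inference on them using the same pipeline.

### Implementation Details
- Builds a small batch inline by varying `location`, `size_m2`, `has_garden`, and `energy_class` on the sample property
- Validates each record, then predicts the whole batch in a single `predict()` call
- Wraps each prediction in the output schema and appends it to a list of outputs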

### Purpose
Scales inference to batch settings, useful for large-scale evaluations or testing.

### Output
Displays predictions for each row.

In [92]:
batch_samples = [
    sample_property,
    {**sample_property, "location": "Rome", "size_m2": 120, "energy_class": "C"},
    {
        **sample_property,
        "location": "Florence",
        "size_m2": 70,
        "has_garden": 1,
        "energy_class": "A",
    },
    {**sample_property, "location": "Turin", "size_m2": 150, "energy_class": "D"},
]

validated_batch = [validate_input_record(r, strict=True) for r in batch_samples]
# Select columns in the expected order, as predict_with_confidence() does.
df_batch = pd.DataFrame(validated_batch)[ALL_EXPECTED]
batch_preds = pipeline.predict(df_batch)

batch_outputs = [
    build_output_schema(
        asset_id=f"asset_batch_{i:03}",
        asset_type=ASSET_TYPE,
        valuation_k=float(val),
        model_meta=model_meta,
    )
    for i, val in enumerate(batch_preds, start=1)
]

warnings.filterwarnings("ignore", message="X does not have valid feature names")
pd.DataFrame(
    [
        {"asset_id": o["asset_id"], "valuation_k": o["metrics"]["valuation_base_k"]}
        for o in batch_outputs
    ]
)

| | asset_id | valuation_k |
|---|---|---|
| 0 | asset_batch_001 | 5.382 |
| 1 | asset_batch_002 | 5.709 |
| 2 | asset_batch_003 | 4.938 |
| 3 | asset_batch_004 | 6.011 |


## 08. Logging JSON

### Technical Overview
Logs predictions and system metadata to JSONL files for auditing and monitoring.

### Implementation Details
- Writes each prediction to `predictions_log.jsonl`
- Records model version, latency, uncertainty, anomaly/drift flags to `monitoring_log.jsonl`
- Adds `_logged_at` timestamp

### Purpose
Maintains traceable and time-stamped logs for model monitoring and analysis.

### Output
Confirmation prints showing successful logging.

In [93]:
def append_jsonl(record: dict, path: Path):
    record = {**record, "_logged_at": datetime.utcnow().isoformat() + "Z"}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


# Predictions log
append_jsonl(single_output, LOG_PATH)
for o in batch_outputs:
    append_jsonl(o, LOG_PATH)
print(f"Appended {1 + len(batch_outputs)} predictions to {LOG_PATH}")

# Monitoring log
monitoring_entry = {
    "asset_id": "asset_manual_0001",
    "latency_ms": latency_ms,
    "valuation_k": pred_value,
    "uncertainty": uncertainty,
    "confidence_low_k": conf_interval[0],
    "confidence_high_k": conf_interval[1],
    "anomaly": anomaly_detected,
    "drift_detected": drift_detected,
    "model_version": model_meta.get("model_version"),
    "model_class": model_meta.get("model_class"),
}

append_jsonl(monitoring_entry, Path("../data/monitoring_log.jsonl"))
print(
    f"Appended monitoring log: "
    f"asset_id={monitoring_entry['asset_id']} | "
    f"latency={monitoring_entry['latency_ms']} ms | "
    f"valuation={monitoring_entry['valuation_k']}k ±{monitoring_entry['uncertainty']}k"
)

Appended 5 predictions to ..\data\predictions_log.jsonl
Appended monitoring log: asset_id=asset_manual_0001 | latency=641.45 ms | valuation=5.381800997964413k ±0.0k
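
Because each log line is a standalone JSON object, the logs can be read back line by line for monitoring. The sketch below writes two invented records to a temporary file (so the real logs are untouched) and summarizes them; `read_jsonl` is a hypothetical helper, not part of the notebook.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path: Path) -> list:
    """Parse one JSON object per non-empty line."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo on a temporary file; the records are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_log = Path(tmp) / "monitoring_log.jsonl"
    demo_records = [
        {"asset_id": "demo_1", "latency_ms": 120.0, "anomaly": False},
        {"asset_id": "demo_2", "latency_ms": 80.0, "anomaly": True},
    ]
    with demo_log.open("a", encoding="utf-8") as f:
        for rec in demo_records:
            f.write(json.dumps(rec) + "\n")

    entries = read_jsonl(demo_log)

mean_latency = sum(e["latency_ms"] for e in entries) / len(entries)
flagged_ids = [e["asset_id"] for e in entries if e["anomaly"]]
print(f"{len(entries)} entries | mean latency {mean_latency:.1f} ms | anomalies: {flagged_ids}")
```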


## 09. Utility: Single Prediction Function (Reuse)

### Technical Overview
Defines a reusable function that encapsulates single prediction logic with validation and formatting.

### Implementation Details
- Wraps input validation, prediction, and output schema construction; uncertainty, anomaly, and drift fields are left at their defaults
- Returns structured output for any single input

### Purpose
Facilitates reuse in scripts or APIs with consistent logic.

### Output
Structured prediction dictionary for given input.

In [94]:
def predict_asset(record: dict, asset_id: str, asset_type: str = ASSET_TYPE):
    rec = validate_input_record(record, strict=True)
    df_in = pd.DataFrame([rec])
    val = float(pipeline.predict(df_in)[0])
    return build_output_schema(
        asset_id=asset_id, asset_type=asset_type, valuation_k=val, model_meta=model_meta
    )


warnings.filterwarnings("ignore", message="X does not have valid feature names")
test_output = predict_asset(sample_property, asset_id="asset_function_test")
test_output

{'asset_id': 'asset_function_test',
 'asset_type': 'property',
 'timestamp': '2025-07-22T18:49:59Z',
 'metrics': {'valuation_base_k': 5.382},
 'flags': {'anomaly': False, 'needs_review': False},
 'model_meta': {'value_model_version': 'v1',
  'value_model_name': 'LGBMRegressor'},
 'offchain_refs': {'detail_report_hash': None, 'sensor_batch_hash': None}}

## 10. Sensitivity Check (vary size_m2)

### Technical Overview
Performs a sensitivity analysis on the `size_m2` feature to observe its impact on predicted value.

### Implementation Details
- Varies `size_m2` across a fixed list of values (60 to 210 m²)
- Runs validation and prediction at each step
- Collects the results in a table of valuation vs. size

### Purpose
Assesses model robustness and feature impact on valuation.

### Output
Table of predicted valuation for each `size_m2` value.

In [95]:
sizes = [60, 90, 130, 170, 210]
size_variations = []
for s in sizes:
    rec = {**sample_property, "size_m2": s}
    rec = validate_input_record(rec, strict=True)
    val = float(pipeline.predict(pd.DataFrame([rec]))[0])
    size_variations.append({"size_m2": s, "prediction_k": round(val, 3)})

warnings.filterwarnings("ignore", message="X does not have valid feature names")
pd.DataFrame(size_variations)

| | size_m2 | prediction_k |
|---|---|---|
| 0 | 60 | 4.938 |
| 1 | 90 | 5.382 |
| 2 | 130 | 5.891 |
| 3 | 170 | 6.096 |
| 4 | 210 | 6.096 |
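
The table shows identical predictions at 170 and 210 m², which is consistent with the piecewise-constant response of a tree ensemble, although the cause is not established here. A small sketch that quantifies the trend from the tabulated values, using finite-difference slopes between consecutive sizes:

```python
# Values copied from the sensitivity table above.
sizes = [60, 90, 130, 170, 210]
predictions_k = [4.938, 5.382, 5.891, 6.096, 6.096]

# Finite-difference slope (k€ per m²) between consecutive sizes.
slopes = [
    (predictions_k[i + 1] - predictions_k[i]) / (sizes[i + 1] - sizes[i])
    for i in range(len(sizes) - 1)
]

is_monotonic = all(slope >= 0 for slope in slopes)
# First size after which the prediction stops changing, if any.
plateau_from = next((sizes[i] for i, slope in enumerate(slopes) if slope == 0), None)

print("slopes:", [round(s, 4) for s in slopes])
print("monotonic non-decreasing:", is_monotonic)
print("flat from size_m2 =", plateau_from)
```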


## 11. Compare With API Prediction Consistency

### Technical Overview
Compares notebook prediction with value returned from the deployed API to ensure consistency.

### Implementation Details
- Sends the `sample_property` record via HTTP POST to `/predict/{ASSET_TYPE}`
- Parses the API response and reads `metrics.valuation_base_k`
- Computes the absolute difference from the local prediction

### Purpose
Ensures model parity across local and deployed environments.

### Output
Prints both predictions and their absolute difference, or the reason the comparison was skipped.

In [96]:
if COMPARE_WITH_API:
    try:
        api_resp = requests.post(
            f"{API_BASE}/predict/{ASSET_TYPE}", json=sample_property, timeout=5
        )
        if api_resp.status_code == 200:
            api_json = api_resp.json()
            api_pred = api_json["metrics"]["valuation_base_k"]
            delta = abs(api_pred - pred_value)
            print(
                f"[API] Pred: {api_pred:.3f} k€ | Local: {pred_value:.3f} k€ | Δ={delta:.4f}"
            )
        else:
            print(
                f"[API] Request failed status={api_resp.status_code} body={api_resp.text}"
            )
    except Exception as e:
        print(f"[API] Compare skipped: {e}")

[API] Compare skipped: HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: /predict/property (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000200460DE8D0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
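
When the API is reachable, the comparison needs a tolerance: `build_output_schema()` rounds `valuation_base_k` to three decimals, while the local `pred_value` is unrounded. The sketch below assumes the API applies the same rounding; the function names and tolerance values are illustrative assumptions rather than values taken from the notebook.

```python
import math


def relative_difference(api_pred: float, local_pred: float) -> float:
    """|api - local| scaled by the larger magnitude; 0.0 when both are zero."""
    scale = max(abs(api_pred), abs(local_pred))
    return 0.0 if scale == 0 else abs(api_pred - local_pred) / scale


def predictions_match(
    api_pred: float, local_pred: float, rel_tol: float = 1e-6, abs_tol: float = 5e-4
) -> bool:
    """abs_tol=5e-4 absorbs rounding to three decimals (an assumed tolerance)."""
    return math.isclose(api_pred, local_pred, rel_tol=rel_tol, abs_tol=abs_tol)


print(relative_difference(5.382, 5.382))
print(predictions_match(5.382, 5.381800997964413))  # differs only by rounding
print(predictions_match(5.382, 5.40))               # genuine mismatch
```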


## 12. Hash Pipeline File (Audit)

### Technical Overview
Generates a hash digest of the model binary for audit and version integrity.

### Implementation Details
- Uses `hashlib.sha256()` on model file
- Computes and prints hex digest

### Purpose
Provides reproducible identifier for the model artifact.

### Output
Hash value for model file.

In [97]:
def file_sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


print("Model file hash (sha256, first 16 chars):", file_sha256(PIPELINE_PATH)[:16])

Model file hash (sha256, first 16 chars): a1b31cca9488495b
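
For auditing, the digest is most useful when compared against a previously recorded value. The sketch below demonstrates that check on a temporary file; `verify_artifact` is a hypothetical helper, and where the recorded digest would be stored (for example in the metadata JSON) is left as an assumption.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Stream the file in chunks so large artifacts are not loaded into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Compare the file's digest with a recorded one (case-insensitive)."""
    return file_sha256(path) == expected_sha256.lower()


# Demo on a throwaway file; the bytes stand in for a real model artifact.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "demo_model.bin"
    artifact.write_bytes(b"demo model bytes")
    recorded = file_sha256(artifact)
    unchanged_ok = verify_artifact(artifact, recorded)

    artifact.write_bytes(b"tampered bytes")
    tampered_ok = verify_artifact(artifact, recorded)

print(unchanged_ok, tampered_ok)  # True False
```

Comparing the full 64-character digest, rather than the 16-character prefix printed above, avoids accepting a file that only shares the prefix.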


## 13. Schema Validation

### Technical Overview
Validates the model's prediction output (`single_output`) against a formal JSON Schema definition to ensure full structural and semantic compatibility with downstream systems (e.g., APIs, on-chain consumers).

### Implementation Details
- Loads the formal schema file from `../schemas/output_schema_def.json` using `Path`
- Applies `jsonschema.validate()` to enforce structure, data types, and required properties
- Optionally compares the top-level keys of `single_output` with `output_example.json` to detect mismatches

### Purpose
Guarantees that the inference output complies with the defined data contract and is ready for integration with API responses, validators, and blockchain publishing.

### Output
Prints a success message if validation passes, or the specific validation error if it fails. Also compares keys with example output and reports any structural mismatch.

In [112]:
# Define paths
schema_def_path = Path("../schemas/output_schema_def.json")
example_path = Path("../schemas/output_example.json")

# Load and validate against strict JSON schema
if schema_def_path.exists():
    with schema_def_path.open("r", encoding="utf-8") as f:
        schema_def = json.load(f)
    try:
        validate(instance=single_output, schema=schema_def)
        print("✅ Strict schema validation passed.")
    except ValidationError as e:
        print("❌ Strict schema validation failed:", e.message)
else:
    print(f"❌ File not found: {schema_def_path}")

# Load and compare against output example structure
if example_path.exists():
    with example_path.open("r", encoding="utf-8") as f:
        example = json.load(f)

    diff_keys = set(single_output.keys()) ^ set(example.keys())
    if not diff_keys:
        print("✅ single_output matches example structure.")
    else:
        print("⚠️ Mismatch with example keys:", diff_keys)
else:
    print(f"❌ File not found: {example_path}")

✅ Strict schema validation passed.
⚠️ Mismatch with example keys: {'_logged_at'}
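
The comparison above uses a symmetric difference of top-level keys, so it does not say which side lacks a key and does not look inside nested objects such as `metrics`. A sketch of a nested, directional comparison follows; the helper names and the miniature records are invented for illustration.

```python
def key_paths(obj, prefix: str = "") -> set:
    """Collect dotted paths for every key in a nested dict."""
    paths = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(value, path)
    return paths


def diff_structure(output: dict, example: dict) -> dict:
    """Report which key paths each side lacks relative to the other."""
    out_paths, ex_paths = key_paths(output), key_paths(example)
    return {
        "missing_in_output": sorted(ex_paths - out_paths),
        "extra_in_output": sorted(out_paths - ex_paths),
    }


# Invented miniature records, for illustration only.
demo_output = {"asset_id": "a1", "metrics": {"valuation_base_k": 5.382}}
demo_example = {
    "asset_id": "x",
    "metrics": {"valuation_base_k": 1.0, "risk_score": 0.2},
    "_logged_at": "2025-01-01T00:00:00Z",
}
print(diff_structure(demo_output, demo_example))
```

In the run above, `_logged_at` is reported because `append_jsonl()` adds that field to a copy of the record, so `single_output` itself never contains it; a directional report would make that visible as a key missing from the output.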


## 14. Test API via HTTP Request

### Technical Overview
Demonstrates how to invoke the FastAPI inference endpoint locally using a real sample JSON, optionally triggering the publication on the Algorand blockchain (TestNet).

### Implementation Details
- Method: `requests.post(...)` with `application/json` payload
- URL: `http://localhost:8000/predict/property?publish=true`
- Input: `../data/sample_property.json` (must match expected schema)
- Output: Parsed response with metrics, blockchain TX info, schema validation, etc.
- Non-2xx responses are reported with their status code and response body

### Purpose
To verify the end-to-end API pipeline including model prediction, metadata enrichment, and on-chain publishing, using the same logic served by the FastAPI backend.

### Output
- Printed prediction response (JSON)
- TX hash and ASA ID if `publish=true` and the blockchain interaction succeeds

In [1]:
# Imports repeated here so the cell also runs in a fresh kernel.
import json
from pathlib import Path

import requests

sample_path = Path("../data/sample_property.json")
sample_payload = json.loads(sample_path.read_text(encoding="utf-8"))

url = "http://localhost:8000/predict/property?publish=true"
# The 30-second timeout is an assumed value; without one the call can block indefinitely.
response = requests.post(url, json=sample_payload, timeout=30)

if response.ok:
    print("✅ API Call Success")
    result = response.json()
    print(json.dumps(result, indent=2))
else:
    print(f"❌ API Call Failed: {response.status_code}")
    print(response.text)