# 06 · Пороги и калибровка

**Цель.** Подобрать прод-пороги и температуру, формализовать OOD-параметры, зафиксировать **`inference_thresholds.json`**.

**Что делаем**

- Optuna/грид для `τ_other` и `temperature` на валидации.
    
- Подбор саб-порогов по покрытию/точности (macro-F1 + coverage).
    
- OOD-калибровка по распределению L2 CLS: `μ/σ`, `z_thr`, `alpha`, `dist_thr`.
    

**Результаты (на наших прогонах)**

- **Main:** `tau_other = 0.2826`, `temperature = 0.917`.
    
- **Sub:** `autos_tau = 0.60`, `apart_tau = 0.30`.
    
- **OOD (CLS-L2 z-score):**  
    `mu = 863.18`, `sigma = 675.82`, `alpha = 1.498`, `z_thr = 7.948`, `threshold ≈ 4141.65`.
    

**Выход**  
`data/inference_thresholds.json` (консистентно читается в ноутбуке 05).

## 1. Env setup (NumPy 2.x stack for Py3.12)

In [2]:
import sys, numpy as np, pandas as pd, torch, transformers, optuna, rapidfuzz, sklearn
print("Python:", sys.version.split()[0])
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("sklearn:", sklearn.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("optuna:", optuna.__version__)
print("rapidfuzz:", rapidfuzz.__version__)


Python: 3.12.11
numpy: 1.26.4
pandas: 2.3.2
sklearn: 1.5.2
torch: 2.8.0+cu126
transformers: 4.56.1
optuna: 3.6.2
rapidfuzz: 3.6.2


## 2. Paths / Artifacts

In [4]:
!fusermount -u /content/drive
!rm -rf /content/drive
from google.colab import drive
drive.mount('/content/drive')


fusermount: failed to unmount /content/drive: Invalid argument
Mounted at /content/drive


In [9]:
from pathlib import Path
import json

BASE_DIR = Path("/content/drive/MyDrive/data_artifacts_AdAnalyser")

MAIN_MODEL_DIR   = BASE_DIR / "rubert_cls_model"
CE_MODEL_DIR     = BASE_DIR / "cross_encoder_rubert"
HEADS_DIR        = BASE_DIR / "heads"
CACHE_DIR        = BASE_DIR / "caches"
THRESH_PATH      = BASE_DIR / "inference_thresholds.json"

DATA_MAIN_SYN    = BASE_DIR / "synthetic_ru_private_ads_50cats_10000_v2.csv"
DATA_AUTOS       = BASE_DIR / "autos_subclf_30000.csv"
DATA_APART       = BASE_DIR / "apartments_subclf_30000.csv"

CACHE_DIR.mkdir(parents=True, exist_ok=True)
HEADS_DIR.mkdir(parents=True, exist_ok=True)

print("BASE_DIR:", BASE_DIR)
print("Exists:", MAIN_MODEL_DIR.exists(), CE_MODEL_DIR.exists(), THRESH_PATH.exists())
print("Data:", DATA_MAIN_SYN.exists(), DATA_AUTOS.exists(), DATA_APART.exists())

#  Label mapping
def _try_from_label_map(mp):
    # {"0":"Класс"}
    if isinstance(mp, dict) and all(k.isdigit() for k in mp.keys()) and all(isinstance(v, str) for v in mp.values()):
        return {int(k): v for k, v in mp.items()}, "dict[id:str]->label(str)"

    # {"0":["Класс"]}
    if isinstance(mp, dict) and all(k.isdigit() for k in mp.keys()) and \
       all(isinstance(v, list) and len(v) >= 1 and isinstance(v[0], str) for v in mp.values()):
        return {int(k): v[0] for k, v in mp.items()}, "dict[id:str]->list(label)"

    # {"Класс":"0"} или {"Класс":0}
    if isinstance(mp, dict) and all(isinstance(k, str) for k in mp.keys()):
        ok, tmp = True, {}
        for lbl, val in mp.items():
            if isinstance(val, int):
                tmp[val] = lbl
            elif isinstance(val, str) and val.isdigit():
                tmp[int(val)] = lbl
            elif isinstance(val, list) and len(val) and \
                 ((isinstance(val[0], int)) or (isinstance(val[0], str) and val[0].isdigit())):
                tmp[int(val[0])] = lbl
            else:
                ok = False
                break
        if ok and len(tmp):
            return tmp, "dict[label]->id(int/str/list)"

    # ["Класс0","Класс1",...]
    if isinstance(mp, list) and all(isinstance(v, str) for v in mp):
        return {i: mp[i] for i in range(len(mp))}, "list[label]"

    return None, None

def _try_from_config(cfg_path: Path):
    if not cfg_path.exists():
        return None, None
    try:
        cfg = json.load(open(cfg_path, "r", encoding="utf-8"))
    except Exception:
        return None, None

    # id2label в config.json как {"0":"Класс"}
    if isinstance(cfg.get("id2label"), dict) and all(k.isdigit() for k in cfg["id2label"].keys()):
        return {int(k): v for k, v in cfg["id2label"].items()}, "config.id2label"
    # label2id в config.json как {"Класс":0}
    if isinstance(cfg.get("label2id"), dict):
        mp = cfg["label2id"]
        tmp = {}
        for lbl, val in mp.items():
            if isinstance(val, int):
                tmp[val] = lbl
            elif isinstance(val, str) and val.isdigit():
                tmp[int(val)] = lbl
        if tmp:
            return tmp, "config.label2id"
    return None, None

# читаем label_mapping.json
label_map_path = MAIN_MODEL_DIR / "label_mapping.json"
if not label_map_path.exists():
    raise FileNotFoundError(f"Not found: {label_map_path}")

label_map = json.load(open(label_map_path, "r", encoding="utf-8"))

ID2LABEL, source = _try_from_label_map(label_map)
if ID2LABEL is None:
    # фолбэк на config.json
    ID2LABEL, source = _try_from_config(MAIN_MODEL_DIR / "config.json")

if ID2LABEL is None:
    print("label_mapping.json SAMPLE:", list(label_map.items())[:5])
    raise ValueError("Unsupported label_mapping.json format and no usable mapping in config.json")

LABEL2ID = {v: k for k, v in ID2LABEL.items()}
LABELS = [ID2LABEL[i] for i in sorted(ID2LABEL.keys())]

print(f"Mapping source: {source}")
print("Num coarse classes:", len(LABELS))
print("Sample pairs:", list(ID2LABEL.items())[:5])


BASE_DIR: /content/drive/MyDrive/data_artifacts_AdAnalyser
Exists: True True True
Data: True True True
Mapping source: config.id2label
Num coarse classes: 50
Sample pairs: [(0, 'Автоаксессуары'), (1, 'Аудиотехника'), (2, 'Велосипеды'), (3, 'Водный транспорт'), (4, 'Гаражи и парковки')]


## 3. Load thresholds (+defaults)

In [10]:

import json, numpy as np

T = json.load(open(THRESH_PATH, "r", encoding="utf-8")) if THRESH_PATH.exists() else {}

TAU_OTHER = float(T.get("main_tau_other", 0.35))
TAU_HIGH  = float(T.get("main_tau_high", 0.75))
ALPHA     = float(T.get("alpha", 1.0))

OOD = T.get("ood", {})
Z_THR = float(OOD.get("z_thr", 8.0))
MU    = float(OOD.get("mu", 0.0))
SIGMA = float(OOD.get("sigma", 1.0))
THR_RAW = OOD.get("threshold", None)
try: THR_RAW = float(THR_RAW) if THR_RAW is not None else None
except: THR_RAW = None

SUB = T.get("sub", {})
TAU_AUTOS  = float(SUB.get("autos_tau", 0.83))
TAU_APART  = float(SUB.get("apart_tau", 0.45))

print("tau_other:", TAU_OTHER, "| tau_high:", TAU_HIGH, "| alpha:", ALPHA)
print("OOD -> z_thr:", Z_THR, "| mu:", MU, "| sigma:", SIGMA, "| thr_raw:", THR_RAW)
print("Sub -> autos_tau:", TAU_AUTOS, "| apart_tau:", TAU_APART)

AUTOS_COARSE  = {"Легковые автомобили"}
APART_COARSE  = {"Квартиры — аренда", "Квартиры — продажа"}


tau_other: 0.35 | tau_high: 0.75 | alpha: 1.0
OOD -> z_thr: 7.947675119608771 | mu: 863.1790571325726 | sigma: 675.8236249816604 | thr_raw: 4141.648698247054
Sub -> autos_tau: 0.8341092920652331 | apart_tau: 0.44362904422064614


## 4. Load models (main, shared, CE?, heads)

In [11]:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModel
import joblib

DEVICE = "cuda" if torch.cuda.is_available() else ("mps" if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available() else "cpu")
print("DEVICE:", DEVICE)

main_tok  = AutoTokenizer.from_pretrained(str(MAIN_MODEL_DIR))
main_clf  = AutoModelForSequenceClassification.from_pretrained(str(MAIN_MODEL_DIR)).to(DEVICE)
main_clf.eval()

shared_tok = main_tok
shared_enc = AutoModel.from_pretrained(str(MAIN_MODEL_DIR)).to(DEVICE)
shared_enc.eval()

if CE_MODEL_DIR.exists():
    try:
        ce_tok = AutoTokenizer.from_pretrained(str(CE_MODEL_DIR))
        ce_mdl = AutoModelForSequenceClassification.from_pretrained(str(CE_MODEL_DIR)).to(DEVICE)
        ce_mdl.eval()
        print("CE loaded.")
    except Exception as e:
        ce_tok = None; ce_mdl = None
        print("CE not loaded:", e)
else:
    ce_tok = None; ce_mdl = None
    print("CE folder not found.")

HEADS_DIR.mkdir(exist_ok=True, parents=True)
autos_head = joblib.load(HEADS_DIR / "head_autos_brand.joblib") if (HEADS_DIR / "head_autos_brand.joblib").exists() else None
apart_head = joblib.load(HEADS_DIR / "head_apart.joblib")        if (HEADS_DIR / "head_apart.joblib").exists() else None
print("Heads loaded:", bool(autos_head), bool(apart_head))


DEVICE: cuda
CE loaded.


https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


Heads loaded: True True


## 5. Helpers (infer, ood, metrics)

In [12]:

import torch, numpy as np, pandas as pd
from sklearn.metrics import f1_score, accuracy_score, classification_report, confusion_matrix, log_loss

@torch.inference_mode()
def main_infer(texts, batch_size=64, temperature: float = 1.0):
    res = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        tok = main_tok(batch, padding=True, truncation=True, max_length=256, return_tensors="pt").to(DEVICE)
        out = main_clf(**tok)
        logits = out.logits / max(1e-6, temperature)
        probs = torch.softmax(logits, dim=-1).detach().cpu().numpy()
        res.append(probs)
    probs = np.vstack(res) if res else np.zeros((0, len(LABELS)))
    top_ids = probs.argmax(axis=1)
    top_scores = probs.max(axis=1)
    top_labels = [ID2LABEL[i] for i in top_ids]
    return probs, top_labels, top_scores

@torch.inference_mode()
def cls_embeddings(texts, batch_size=128):
    all_emb = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        tok = shared_tok(batch, padding=True, truncation=True, max_length=256, return_tensors="pt").to(DEVICE)
        out = shared_enc(**tok, output_hidden_states=True, return_dict=True)
        cls = out.last_hidden_state[:,0,:].detach().cpu().numpy()
        all_emb.append(cls)
    return np.vstack(all_emb) if all_emb else np.zeros((0, shared_enc.config.hidden_size))

def apply_thresh(labels, scores, tau=0.35, per_label_tau: dict = None):
    out = []
    for l, s in zip(labels, scores):
        thr = per_label_tau.get(l, tau) if per_label_tau else tau
        out.append(l if s >= thr else "Other")
    return out

def eval_report(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    print("Acc:", round(accuracy_score(y_true, y_pred), 4))
    print("F1(macro):", round(f1_score(y_true, y_pred, average="macro"), 4))
    print(classification_report(y_true, y_pred, digits=4))

def ood_stub(probs):
    maxp = probs.max(axis=1)
    z = (1.0 - maxp)
    return z


## 6. Load validation data

In [21]:
import io, pandas as pd
from ftfy import fix_text

def _score_text(s: str) -> int:
    cyr = sum('А' <= ch <= 'я' or ch in 'ее' for ch in s)
    junk = sum(s.count(t) for t in ['â', 'Ä', '™', '‚', 'œ'])
    return cyr - 4 * junk

def smart_read_ads_csv(path):
    raw = open(path, 'rb').read()
    candidates = ['utf-8', 'utf-8-sig', 'cp1251', 'windows-1251', 'latin1']
    best = None; best_enc = None; best_score = -10**9

    for enc in candidates:
        try:
            df_try = pd.read_csv(io.BytesIO(raw), encoding=enc)
            # формируем временный текст для оценки читабельности
            tt = (df_try.get('title', pd.Series(dtype=str)).astype(str).head(200).str.cat(sep=' ')
                  + ' ' +
                  df_try.get('description', pd.Series(dtype=str)).astype(str).head(200).str.cat(sep=' '))
            score = _score_text(tt)
            if score > best_score:
                best_score, best, best_enc = score, df_try, enc
        except Exception:
            pass

    df = best.copy()
    # если кодировка оказалась latin1 — почти наверняка mojibake далее чиним текстовые поля
    if best_enc == 'latin1':
        for col in ['title', 'description', 'category']:
            if col in df.columns:
                df[col] = df[col].astype(str).apply(fix_text)

    # собираем нужные колонки text/label
    if 'title' in df.columns or 'description' in df.columns:
        title = df.get('title', '')
        descr = df.get('description', '')
        df['text'] = title.fillna('') + ' ' + descr.fillna('')
    else:
        raise KeyError(f"Нет колонок title/description, нашли: {df.columns.tolist()}")

    if 'category' in df.columns:
        df = df.rename(columns={'category':'label'})
    elif 'label' not in df.columns:
        raise KeyError(f"Нет колонки category/label, нашли: {df.columns.tolist()}")

    return df[['text','label']]

# Используем:
df_main = smart_read_ads_csv(str(DATA_MAIN_SYN))
print('main', df_main.shape)
display(df_main.head(3))


main (10000, 2)


Unnamed: 0,text,label
0,Продаю автоаксессуары — Санкт-Петербург Покажу...,Автоаксессуары
1,Продаю автоаксессуары — Казань Город: Воронеж....,Автоаксессуары
2,В продаже автоаксессуары — Нижний Новгород Гар...,Автоаксессуары


## 7. Cache CLS embeddings for sub-heads

In [22]:

AUTOS_CACHE = CACHE_DIR / "val_autos_cls.npy"
APART_CACHE = CACHE_DIR / "val_apart_cls.npy"

if df_autos is not None:
    if not AUTOS_CACHE.exists():
        emb_a = cls_embeddings(df_autos["text"].tolist())
        np.save(AUTOS_CACHE, emb_a)
    else:
        emb_a = np.load(AUTOS_CACHE)

if df_apart is not None:
    if not APART_CACHE.exists():
        emb_p = cls_embeddings(df_apart["text"].tolist())
        np.save(APART_CACHE, emb_p)
    else:
        emb_p = np.load(APART_CACHE)

for p in [AUTOS_CACHE, APART_CACHE]:
    print("cache:", p, p.exists())


cache: /content/drive/MyDrive/data_artifacts_AdAnalyser/caches/val_autos_cls.npy True
cache: /content/drive/MyDrive/data_artifacts_AdAnalyser/caches/val_apart_cls.npy True


## 8. Baseline & Objective

In [23]:

def main_eval_block(df, tau=0.35, temperature=1.0):
    probs, top_labels, top_scores = main_infer(df["text"].tolist(), temperature=temperature)
    pred_after_tau = apply_thresh(top_labels, top_scores, tau=tau)
    return {
        "labels": top_labels,
        "scores": top_scores,
        "pred_tau": pred_after_tau,
        "probs": probs
    }

def score_pipeline(df, tau=0.35, temperature=1.0, weight_acc=0.6, weight_f1=0.4):
    y_true = df["label"].tolist()
    out = main_eval_block(df, tau=tau, temperature=temperature)
    y_pred = out["pred_tau"]
    acc = accuracy_score(y_true, y_pred)
    f1  = f1_score(y_true, y_pred, average="macro")
    return weight_acc*acc + weight_f1*f1, acc, f1

if df_main is not None:
    s, acc, f1m = score_pipeline(df_main, tau=TAU_OTHER, temperature=float(T.get("temperature", 1.0)))
    print("Baseline -> Score:", round(s,4), "| Acc:", round(acc,4), "| F1:", round(f1m,4))


Baseline -> Score: 0.9859 | Acc: 0.9927 | F1: 0.9757


## 9. Optuna: tune tau_other & temperature

In [26]:

import optuna

def objective(trial):
    tau = trial.suggest_float("tau_other", 0.10, 0.70)
    temp = trial.suggest_float("temperature", 0.7, 1.5)
    s, acc, f1m = score_pipeline(df_main, tau=tau, temperature=temp)
    trial.set_user_attr("acc", acc)
    trial.set_user_attr("f1", f1m)
    return s

if df_main is not None:
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=40, show_progress_bar=False)
    best = study.best_params
    print("Best:", best)
    print("Acc:", study.best_trial.user_attrs.get("acc"), "F1:", study.best_trial.user_attrs.get("f1"))
else:
    best = {"tau_other": TAU_OTHER, "temperature": float(T.get("temperature", 1.0))}


[I 2025-09-08 03:58:03,220] A new study created in memory with name: no-name-5994f7ce-70f0-4fbb-b485-eb5327d52dc9
[I 2025-09-08 03:58:27,033] Trial 0 finished with value: 0.9803017027945589 and parameters: {'tau_other': 0.6963247701593592, 'temperature': 0.799933612457451}. Best is trial 0 with value: 0.9803017027945589.
[I 2025-09-08 03:58:51,552] Trial 1 finished with value: 0.9831688669345943 and parameters: {'tau_other': 0.451626612870048, 'temperature': 1.0792416950482666}. Best is trial 1 with value: 0.9831688669345943.
[I 2025-09-08 03:59:16,117] Trial 2 finished with value: 0.9860478024082948 and parameters: {'tau_other': 0.2825582677025756, 'temperature': 0.91699954894511}. Best is trial 2 with value: 0.9860478024082948.
[I 2025-09-08 03:59:39,634] Trial 3 finished with value: 0.9732339616513854 and parameters: {'tau_other': 0.6809534140172065, 'temperature': 0.9332571589384284}. Best is trial 2 with value: 0.9860478024082948.
[I 2025-09-08 04:00:03,009] Trial 4 finished with 

Best: {'tau_other': 0.2825582677025756, 'temperature': 0.91699954894511}
Acc: 0.9929 F1: 0.9757695060207371


## 10. Tune sub thresholds (autos/apart)

In [28]:

import numpy as np, json, joblib
from pathlib import Path
from sklearn.metrics import precision_recall_fscore_support

# Пути
BASE_DIR  = Path(globals().get("BASE_DIR", "/content/drive/MyDrive/data_artifacts_AdAnalyser"))
HEADS_DIR = Path(globals().get("HEADS_DIR", BASE_DIR / "heads"))
CACHE_DIR = Path(globals().get("CACHE_DIR", BASE_DIR / "caches"))
THRESH_P  = Path(globals().get("THRESH_PATH", BASE_DIR / "inference_thresholds.json"))

HEAD_AUTOS = HEADS_DIR / "head_autos_brand.joblib"
HEAD_APART = HEADS_DIR / "head_apart.joblib"
EMB_AUTOS  = CACHE_DIR / "autos_cls_emb.npy"
EMB_APART  = CACHE_DIR / "apart_cls_emb.npy"

# helpers
def _unwrap_estimator(head_obj):
    """Вернет sklearn-estimator с predict_proba из dict/tuple/Pipeline/самой модели."""
    if head_obj is None:
        return None
    # уже модель
    if hasattr(head_obj, "predict_proba"):
        return head_obj
    # dict: ищем по типичным ключам
    if isinstance(head_obj, dict):
        for k in ("model","clf","estimator","pipe","sk_model","pipeline"):
            v = head_obj.get(k)
            if hasattr(v, "predict_proba"):
                return v
        # иначе ищем в значениях
        for v in head_obj.values():
            if hasattr(v, "predict_proba"):
                return v
    # tuple/list: ищем внутри
    if isinstance(head_obj, (list, tuple)):
        for v in head_obj:
            if hasattr(v, "predict_proba"):
                return v
    return None

def _extract_classes(head_obj, est):
    """Вернет np.array имен классов."""
    if est is not None and hasattr(est, "classes_"):
        return np.array(list(est.classes_), dtype=str)
    if isinstance(head_obj, dict):
        for k in ("classes_", "classes", "class_names", "labels"):
            if k in head_obj:
                return np.array(list(head_obj[k]), dtype=str)
    return None

def _ensure_head_and_classes(var_obj, fallback_path: Path):
    """Берет объект из переменной (если есть), иначе грузит joblib; распаковывает модель и классы."""
    obj = var_obj
    if obj is None and fallback_path.exists():
        obj = joblib.load(fallback_path)
    est = _unwrap_estimator(obj)
    classes = _extract_classes(obj, est)
    return est, classes

def _ensure_emb(var_name: str, path: Path):
    """Возвращает эмбеддинги из переменной, либо грузит .npy из кеша."""
    arr = globals().get(var_name, None)
    if arr is None and path.exists():
        arr = np.load(path)
    if arr is not None and arr.ndim != 2:
        raise ValueError(f"{var_name} должен быть матрицей [N, D], shape={arr.shape}")
    return arr

# heads & embeddings
autos_est, autos_classes = _ensure_head_and_classes(globals().get("autos_head", None), HEAD_AUTOS)
apart_est, apart_classes = _ensure_head_and_classes(globals().get("apart_head", None), HEAD_APART)

emb_a = _ensure_emb("emb_a", EMB_AUTOS)  # CLS-эмбеддинги для авто
emb_p = _ensure_emb("emb_p", EMB_APART)   # CLS-эмбеддинги для квартир

print("Autos head:", None if autos_est is None else type(autos_est).__name__,
      "| classes:", None if autos_classes is None else len(autos_classes),
      "| emb:", None if emb_a is None else emb_a.shape)
print("Apart head:", None if apart_est is None else type(apart_est).__name__,
      "| classes:", None if apart_classes is None else len(apart_classes),
      "| emb:", None if emb_p is None else emb_p.shape)

# проверяем датасеты
if "df_autos" not in globals() or df_autos is None:
    print("[warn] df_autos отсутствует — метрика для авто будет 0.")
if "df_apart" not in globals() or df_apart is None:
    print("[warn] df_apart отсутствует — метрика для квартир будет 0.")

# metrics
def _score_macro_f_cov(y_true, y_pred):
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
    cov = (np.asarray(y_pred) != "Other").mean()
    return 0.7 * float(f) + 0.3 * float(cov)

def sub_score_autos(thr: float) -> float:
    if autos_est is None or autos_classes is None or emb_a is None or "df_autos" not in globals() or df_autos is None:
        return 0.0
    P = autos_est.predict_proba(emb_a)
    idx = P.argmax(axis=1)
    prb = P.max(axis=1)
    y_pred = np.where(prb >= thr, autos_classes[idx], "Other")
    y_true = df_autos["label"].astype(str).values
    return _score_macro_f_cov(y_true, y_pred)

def sub_score_apart(thr: float) -> float:
    if apart_est is None or apart_classes is None or emb_p is None or "df_apart" not in globals() or df_apart is None:
        return 0.0
    P = apart_est.predict_proba(emb_p)
    idx = P.argmax(axis=1)
    prb = P.max(axis=1)
    y_pred = np.where(prb >= thr, apart_classes[idx], "Other")
    y_true = df_apart["label"].astype(str).values
    return _score_macro_f_cov(y_true, y_pred)

def tune_scalar(func, low=0.3, high=0.99, steps=30):
    grid = np.linspace(low, high, steps).astype(float)
    vals = [(float(x), float(func(float(x)))) for x in grid]
    vals.sort(key=lambda t: t[1], reverse=True)
    return vals[0][0], vals[:5]

# run tuning — основные из Optuna
best_tau  = float(globals().get("best", {}).get("tau_other", globals().get("TAU_OTHER", 0.35)))
best_temp = float(globals().get("best", {}).get("temperature", float(globals().get("T", {}).get("temperature", 1.0))))

best_autos, topA = tune_scalar(sub_score_autos, low=0.60, high=0.99, steps=25)
best_apart, topP = tune_scalar(sub_score_apart, low=0.30, high=0.90, steps=25)

print("\nBest main:", dict(tau_other=round(best_tau,3), temperature=round(best_temp,3)))
print("Best sub:",  dict(autos_tau=round(best_autos,3), apart_tau=round(best_apart,3)))
print("Top autos grid:", topA[:3])
print("Top apart grid:", topP[:3])


Autos head: CalibratedClassifierCV | classes: 20 | emb: (5000, 768)
Apart head: CalibratedClassifierCV | classes: 10 | emb: (5000, 768)

Best main: {'tau_other': 0.283, 'temperature': 0.917}
Best sub: {'autos_tau': 0.6, 'apart_tau': 0.3}
Top autos grid: [(0.6, 0.29946), (0.61625, 0.29928), (0.6325, 0.29916)]
Top apart grid: [(0.3, 0.29946), (0.325, 0.2985), (0.35, 0.29712)]


## 11. Save updated thresholds -> inference_thresholds.json

In [29]:

newT = dict(T)
newT["main_tau_other"] = float(best_tau)
newT["main_tau_high"]  = float(TAU_HIGH)
newT["alpha"]          = float(ALPHA)
newT["temperature"]    = float(best_temp)
newT["sub"] = dict(newT.get("sub", {}), autos_tau=float(best_autos), apart_tau=float(best_apart))

with open(THRESH_PATH, "w", encoding="utf-8") as f:
    json.dump(newT, f, ensure_ascii=False, indent=2)
print("Saved:", THRESH_PATH)
print(json.dumps(newT, ensure_ascii=False, indent=2)[:600], "...")


Saved: /content/drive/MyDrive/data_artifacts_AdAnalyser/inference_thresholds.json
{
  "ood": {
    "threshold": 4141.648698247054,
    "alpha": 1.49839657065949,
    "z_thr": 7.947675119608771,
    "mu": 863.1790571325726,
    "sigma": 675.8236249816604
  },
  "sub": {
    "autos_tau": 0.6,
    "apart_tau": 0.3
  },
  "main_tau_other": 0.2825582677025756,
  "artifacts_root": "/content/drive/MyDrive/data_artifacts_AdAnalyser",
  "main_tau_high": 0.75,
  "alpha": 1.0,
  "temperature": 0.91699954894511
} ...


## 12. Sanity mini-test

In [30]:

samples = [
    "Продаю автомобиль Toyota Camry 2019, автомат, один хозяин",
    "Продам Kia Rio, 1.6, пробег 52 тыс, без вложений",
    "BMW 3 серии, 2016 год, М-пакет, обмен возможен",
    "Квартира в Москве, 2 комнаты, продажа от собственника",
    "Сдам 2-комнатную квартиру в центре",
    "iPhone 12, 128GB, б/у, батарея 90%",
    "кликайте по ссылке и выигрывайте айфон 15 бесплатно",
]

probs, labs, scrs = main_infer(samples, temperature=newT.get("temperature", 1.0))
pred = apply_thresh(labs, scrs, tau=newT["main_tau_other"])
import pandas as pd
df = pd.DataFrame({"text": samples, "coarse": labs, "score": np.round(scrs,3), "after_tau": pred})
display(df)


Unnamed: 0,text,coarse,score,after_tau
0,"Продаю автомобиль Toyota Camry 2019, автомат, ...",Легковые автомобили,0.457,Легковые автомобили
1,"Продам Kia Rio, 1.6, пробег 52 тыс, без вложений",Запчасти для авто,0.203,Other
2,"BMW 3 серии, 2016 год, М-пакет, обмен возможен",Легковые автомобили,0.162,Other
3,"Квартира в Москве, 2 комнаты, продажа от собст...",Квартиры — продажа,0.356,Квартиры — продажа
4,Сдам 2-комнатную квартиру в центре,Квартиры — аренда,0.518,Квартиры — аренда
5,"iPhone 12, 128GB, б/у, батарея 90%",Смартфоны,0.574,Смартфоны
6,кликайте по ссылке и выигрывайте айфон 15 бесп...,Смартфоны,0.19,Other


## Итоги (06 · Thresholds)

- Калибровка завершена, пороги зафиксированы в общем конфиге.
    
- Баланс "качество <=> покрытие <=> стабильность" документированость.
    
- Все компоненты пайплайна (05) читают **один** конфиг порогов.
    

**Дальше:**  
(1) рантайм-метрики попаданий/отсечений для контроля дрейфа,  
(2) периодический пересчет `τ` и OOD-параметров оффлайн,  
(3) A/B-exploration порогов на свежем трафике (если появится).
