# 04 · Sub-Heads: lightweight models on embeddings

**Goal.** Replace the heavy BERT heads with **thin models** (LogReg/CalibratedClassifierCV/LightGBM; the cells below use a single-hidden-layer `MLPClassifier` wrapped in `CalibratedClassifierCV`) on top of **frozen embeddings** for sub-classification (car brands, apartment type/number of rooms).

**What we do**

- Extract embeddings (frozen main encoder).
    
- Train 2 heads:  
    • **Autos (brands)**: ~20 classes.  
    • **Apartments**: {rent/sale, 1–4 rooms, studio}.
    
- Calibrate probabilities and choose per-head sub-thresholds.
    
- Compare against the previous heavy heads -> close quality at much lower latency.
    

**Pros**

- Much lighter in production and easier to extend with new sub-tasks.
    
- No need to retrain the main encoder.
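
The "calibrate probabilities, choose sub-thresholds" step can be illustrated with a small sketch. It assumes that, for each validation example, we have the head's top-class confidence and a flag saying whether the prediction was correct. The function name `pick_tau`, the target precision, and the toy arrays are hypothetical and not taken from this notebook.

```python
import numpy as np

def pick_tau(conf, correct, target_precision=0.95):
    """Return (tau, coverage): the smallest threshold whose accepted
    predictions reach the target precision, or (None, 0.0) if none does."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    for tau in np.unique(conf):              # candidate thresholds, ascending
        keep = conf >= tau                   # predictions we would accept
        precision = correct[keep].mean()
        if precision >= target_precision:
            return float(tau), float(keep.mean())
    return None, 0.0

# Toy data (hypothetical): top-class confidence and whether the prediction was right.
conf    = [0.35, 0.55, 0.62, 0.80, 0.91, 0.97]
correct = [False, False, True, True, True, True]
tau, coverage = pick_tau(conf, correct, target_precision=0.95)
print(tau, coverage)
```

Picking the smallest qualifying threshold keeps coverage as high as possible for the requested precision. Other criteria (for example, maximizing F1) would fit the same loop.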

## 1. Colab setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:

import os, json, math, time, warnings, numpy as np, pandas as pd
from pathlib import Path
warnings.filterwarnings("ignore")

BASE_DIR = Path("/content/drive/MyDrive/data_artifacts_AdAnalyser")
MODEL_DIR = BASE_DIR / "rubert_cls_model"
AUTOS_CSV = BASE_DIR / "autos_subclf_30000.csv"
APART_CSV = BASE_DIR / "apartments_subclf_30000.csv"
CACHES    = BASE_DIR / "caches"
HEADS     = BASE_DIR / "heads"

CACHES.mkdir(parents=True, exist_ok=True)

HEADS.mkdir(parents=True, exist_ok=True)

print("BASE_DIR:", BASE_DIR)
print("Exists:", MODEL_DIR.exists(), AUTOS_CSV.exists(), APART_CSV.exists())


BASE_DIR: /content/drive/MyDrive/data_artifacts_AdAnalyser
Exists: True True True


## 2. Load shared encoder (RuBERT)

In [None]:

import torch
from transformers import AutoTokenizer, AutoModel
DEVICE = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
print("DEVICE:", DEVICE)
tok = AutoTokenizer.from_pretrained(str(MODEL_DIR))
enc = AutoModel.from_pretrained(str(MODEL_DIR)).to(DEVICE); enc.eval()
HIDDEN = enc.config.hidden_size if hasattr(enc, "config") else 768
print("Hidden size:", HIDDEN)


DEVICE: cuda
Hidden size: 768


## 3. Helpers

In [None]:

@torch.no_grad()
def cls_embeddings(texts, batch_size=256, max_length=160):
    """Return an (n_texts, hidden) array of [CLS] vectors from the frozen encoder."""
    outs = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        inputs = tok(batch, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
        inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
        out = enc(**inputs)
        # The first token of the last layer is the [CLS] representation.
        cls = out.last_hidden_state[:, 0, :].detach().cpu().numpy()
        outs.append(cls)
    return np.vstack(outs)

def load_csv(path, text_col="text", label_col="label"):
    """Read a CSV, drop rows with a missing text or label, return columns `text` and `label`."""
    df = pd.read_csv(path).dropna(subset=[text_col, label_col]).reset_index(drop=True)
    return df[[text_col, label_col]].rename(columns={text_col: "text", label_col: "label"})


## 4. Data & CLS caches

In [None]:

df_autos  = load_csv(AUTOS_CSV)
df_apart  = load_csv(APART_CSV)
autos_cache = CACHES / "autos_cls_emb.npy"
apart_cache = CACHES / "apart_cls_emb.npy"
if autos_cache.exists():
    X_autos = np.load(autos_cache)
else:
    X_autos = cls_embeddings(df_autos["text"].tolist(), batch_size=256, max_length=160)
    np.save(autos_cache, X_autos)
if apart_cache.exists():
    X_apart = np.load(apart_cache)
else:
    X_apart = cls_embeddings(df_apart["text"].tolist(), batch_size=256, max_length=160)
    np.save(apart_cache, X_apart)
y_autos = df_autos["label"].astype(str).values
y_apart = df_apart["label"].astype(str).values
print("Autos:", X_autos.shape, len(y_autos))
print("Apart:", X_apart.shape, len(y_apart))


Autos: (30000, 768) 30000
Apart: (30000, 768) 30000
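
The cache logic above reuses a saved `.npy` file whenever it exists, so a cache written for an older version of the CSV would be loaded silently. A minimal sketch of a guard, assuming a row-count check is enough to detect a stale cache: the helper name `load_or_compute` and the stand-in `fake_embed` are hypothetical, and in the notebook `cls_embeddings` would play the role of `embed_fn`.

```python
import tempfile
from pathlib import Path
import numpy as np

def load_or_compute(cache_path, texts, embed_fn):
    """Load cached embeddings if the row count matches `texts`, else recompute and save."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        X = np.load(cache_path)
        if X.shape[0] == len(texts):
            return X
        # Otherwise the cache is stale (the data changed), so fall through and recompute.
    X = embed_fn(texts)
    np.save(cache_path, X)
    return X

# Stand-in for cls_embeddings (hypothetical): one 4-dim vector per text.
def fake_embed(texts):
    return np.ones((len(texts), 4), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_cls_emb.npy"
    first = load_or_compute(path, ["a", "b", "c"], fake_embed)       # computed and saved
    again = load_or_compute(path, ["a", "b", "c"], fake_embed)       # loaded from cache
    grown = load_or_compute(path, ["a", "b", "c", "d"], fake_embed)  # stale, recomputed
    shapes = (first.shape, again.shape, grown.shape)
print(shapes)
```

A row-count check does not catch edits that keep the number of rows unchanged. Storing a hash of the texts next to the array would be a stricter variant.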


## 5. Train head · Autos (brand)

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import classification_report, f1_score
import joblib
# Stratified 85/15 split on the cached CLS embeddings.
Xtr, Xva, ytr_raw, yva_raw = train_test_split(X_autos, y_autos, test_size=0.15, random_state=42, stratify=y_autos)
le_autos = LabelEncoder().fit(y_autos)
ytr = le_autos.transform(ytr_raw); yva = le_autos.transform(yva_raw)
# Fit the scaler on the training part only.
sc_autos = StandardScaler().fit(Xtr)
Xtr_s = sc_autos.transform(Xtr); Xva_s = sc_autos.transform(Xva)
mlp_autos = MLPClassifier(hidden_layer_sizes=(256,), activation="relu", solver="adam",
                          learning_rate_init=1e-3, max_iter=80, early_stopping=True,
                          n_iter_no_change=5, random_state=42, batch_size=512, verbose=False).fit(Xtr_s, ytr)
# cv="prefit" keeps the trained MLP fixed and fits only the sigmoid calibrator.
# Caveat: the calibrator is fit on the same validation split that is scored below,
# so the reported F1 is optimistic.
cal_autos = CalibratedClassifierCV(mlp_autos, method="sigmoid", cv="prefit").fit(Xva_s, yva)
pred = cal_autos.predict(Xva_s)
print("Autos · F1(macro):", round(f1_score(yva, pred, average="macro"), 4))
print(classification_report(yva, pred, target_names=le_autos.classes_))
# Persist everything needed at inference time: calibrated model, label encoder, scaler.
joblib.dump({"model": cal_autos, "label_encoder": le_autos, "scaler": sc_autos,
             "meta": {"created": time.strftime("%Y-%m-%d %H:%M:%S"), "hidden": int(HIDDEN)}},
            str(HEADS / "head_autos_brand.joblib"))
print("Saved:", HEADS / "head_autos_brand.joblib")


Autos · F1(macro): 1.0
              precision    recall  f1-score   support

        Audi       1.00      1.00      1.00       225
         BMW       1.00      1.00      1.00       225
   Chevrolet       1.00      1.00      1.00       225
        Ford       1.00      1.00      1.00       225
       Honda       1.00      1.00      1.00       225
     Hyundai       1.00      1.00      1.00       225
         Kia       1.00      1.00      1.00       225
        Lada       1.00      1.00      1.00       225
       Lexus       1.00      1.00      1.00       225
       Mazda       1.00      1.00      1.00       225
    Mercedes       1.00      1.00      1.00       225
  Mitsubishi       1.00      1.00      1.00       225
      Nissan       1.00      1.00      1.00       225
        Opel       1.00      1.00      1.00       225
     Peugeot       1.00      1.00      1.00       225
     Renault       1.00      1.00      1.00       225
       Skoda       1.00      1.00      1.00       225
    

## 6. Train head · Apartments

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import f1_score
import joblib
# Same recipe as the autos head: stratified 85/15 split, scaler fit on train only.
Xtr, Xva, ytr_raw, yva_raw = train_test_split(X_apart, y_apart, test_size=0.15, random_state=42, stratify=y_apart)
le_apart = LabelEncoder().fit(y_apart)
ytr = le_apart.transform(ytr_raw); yva = le_apart.transform(yva_raw)
sc_apart = StandardScaler().fit(Xtr)
Xtr_s = sc_apart.transform(Xtr); Xva_s = sc_apart.transform(Xva)
mlp_apart = MLPClassifier(hidden_layer_sizes=(256,), activation="relu", solver="adam",
                          learning_rate_init=1e-3, max_iter=80, early_stopping=True,
                          n_iter_no_change=5, random_state=42, batch_size=512, verbose=False).fit(Xtr_s, ytr)
# Caveat: the calibrator is fit on the same validation split that is scored below,
# so the reported F1 is optimistic.
cal_apart = CalibratedClassifierCV(mlp_apart, method="sigmoid", cv="prefit").fit(Xva_s, yva)
pred = cal_apart.predict(Xva_s)
print("Apart · F1(macro):", round(f1_score(yva, pred, average="macro"), 4))
# Persist the calibrated model together with its label encoder and scaler.
joblib.dump({"model": cal_apart, "label_encoder": le_apart, "scaler": sc_apart,
             "meta": {"created": time.strftime("%Y-%m-%d %H:%M:%S"), "hidden": int(HIDDEN)}},
            str(HEADS / "head_apart.joblib"))
print("Saved:", HEADS / "head_apart.joblib")


Apart · F1(macro): 0.9998
Saved: /content/drive/MyDrive/data_artifacts_AdAnalyser/heads/head_apart.joblib
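
Both heads above fit the calibrator on the validation split and then report F1 on that same split, so the scores can look better than they would on unseen data. One way to avoid this is to keep three disjoint parts: train, calibration, and test. The sketch below only produces the index sets, using NumPy. The function name `three_way_split` and the 15%/15% fractions are assumptions, not something the notebook defines.

```python
import numpy as np

def three_way_split(y, cal_frac=0.15, test_frac=0.15, seed=42):
    """Stratified split of indices into disjoint train / calibration / test parts."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    train, cal, test = [], [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))  # shuffle within the class
        n_cal = int(round(len(idx) * cal_frac))
        n_test = int(round(len(idx) * test_frac))
        cal.extend(idx[:n_cal])
        test.extend(idx[n_cal:n_cal + n_test])
        train.extend(idx[n_cal + n_test:])
    return np.array(train), np.array(cal), np.array(test)

# Toy labels (hypothetical): 3 classes, 100 examples each.
y = np.repeat(["Audi", "BMW", "Kia"], 100)
tr, ca, te = three_way_split(y)
print(len(tr), len(ca), len(te))
# Intended use: fit the head on tr, fit the calibrator on ca, report metrics on te.
```

With such a split, the MLP would be trained on `X[tr]`, the calibrator fit on `X[ca]`, and F1 reported on `X[te]` only.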


## 7. Quick sanity check

In [None]:

import joblib, numpy as np, pandas as pd
autos_head = joblib.load(HEADS / "head_autos_brand.joblib")
apart_head = joblib.load(HEADS / "head_apart.joblib")
samples = [
    "Продаю Toyota Camry 2019, автомат, один хозяин",
    "Опель Астра 2012, АКПП, хорошее состояние",
    "Квартира, продажа, 2-комнатная, центр",
    "Сдам 1-к квартиру у метро"
]
E = cls_embeddings(samples, batch_size=32, max_length=160)
Ea = autos_head["scaler"].transform(E); Pa = autos_head["model"].predict_proba(Ea)
Ia = Pa.argmax(axis=1); La = autos_head["label_encoder"].inverse_transform(Ia); Sa = Pa.max(axis=1)
Ep = apart_head["scaler"].transform(E); Pp = apart_head["model"].predict_proba(Ep)
Ip = Pp.argmax(axis=1); Lp = apart_head["label_encoder"].inverse_transform(Ip); Sp = Pp.max(axis=1)
out = [{"text": t, "auto_brand": la, "auto_p": float(sa), "apart": lp, "apart_p": float(sp)}
       for t, la, sa, lp, sp in zip(samples, La, Sa, Lp, Sp)]
pd.DataFrame(out)


| | text | auto_brand | auto_p | apart | apart_p |
|---|---|---|---|---|---|
| 0 | Продаю Toyota Camry 2019, автомат, один хозяин | Toyota | 0.995556 | студия_продажа | 0.997635 |
| 1 | Опель Астра 2012, АКПП, хорошее состояние | Opel | 0.993844 | студия_продажа | 0.997641 |
| 2 | Квартира, продажа, 2-комнатная, центр | Lexus | 0.944216 | аренда_4 | 0.926142 |
| 3 | Сдам 1-к квартиру у метро | Kia | 0.529952 | аренда_1 | 0.997666 |
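
The sanity check applies both heads to every text, so car ads also receive an apartment label and apartment ads receive a brand. In a pipeline, the coarse category would presumably select the head, and the sub-threshold would decide whether the label is kept. A minimal sketch under those assumptions: the coarse category names `"autos"` and `"apartments"` and the function `route_sub_label` are hypothetical, while the threshold values are the ones quoted in this notebook's summary.

```python
# Sub-thresholds as quoted in this notebook's summary.
TAUS = {"autos": 0.60, "apartments": 0.30}

def route_sub_label(coarse, head_outputs, taus=TAUS):
    """Pick the head matching the coarse category and apply its threshold.

    head_outputs maps a coarse category to (label, probability).
    Returns (sub_label, alert): sub_label is None when no head applies or the
    probability is below the threshold; alert flags only the second case.
    """
    if coarse not in taus:
        return None, False                 # no sub-head for this category
    label, p = head_outputs[coarse]
    if p >= taus[coarse]:
        return label, False
    return None, True                      # a sub-label was expected but rejected

# Rows 1 and 3 of the sanity-check output above.
row1 = {"autos": ("Opel", 0.993844), "apartments": ("студия_продажа", 0.997641)}
row3 = {"autos": ("Kia", 0.529952), "apartments": ("аренда_1", 0.997666)}
print(route_sub_label("autos", row1))
print(route_sub_label("apartments", row3))
print(route_sub_label("autos", row3))
```

The last call shows the effect of the threshold: a brand predicted at 0.53 for an apartment ad falls below `autos_tau` and is dropped rather than reported.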


## 8. Paths summary

In [None]:

print("Caches:"); print(" -", CACHES / "autos_cls_emb.npy"); print(" -", CACHES / "apart_cls_emb.npy")
print("Heads:");  print(" -", HEADS / "head_autos_brand.joblib"); print(" -", HEADS / "head_apart.joblib")


Caches:
 - /content/drive/MyDrive/data_artifacts_AdAnalyser/caches/autos_cls_emb.npy
 - /content/drive/MyDrive/data_artifacts_AdAnalyser/caches/apart_cls_emb.npy
Heads:
 - /content/drive/MyDrive/data_artifacts_AdAnalyser/heads/head_autos_brand.joblib
 - /content/drive/MyDrive/data_artifacts_AdAnalyser/heads/head_apart.joblib


## Summary (04 · Sub-Heads)

- Trained thin heads for **car brands** and **apartments**.
    
- High validation quality (macro-F1 of 1.0 for autos, 0.9998 for apartments) **at low cost** makes them a justified replacement for the old BERT heads.
    
- Final sub-thresholds (from 06): `autos_tau = 0.60`, `apart_tau = 0.30`.
    

**Next:**  
(1) careful **normalization** of labels (brand/number of rooms),  
(2) extensibility: adding new sub-features without involving the RuBERT model,  
(3) unit-sanity: if coarse is {autos/apartments} but `sub_label=None` -> alert.
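
Item (1) could start from a small alias table plus a room-count parser. The sketch below is only an illustration: the alias entries, the canonical forms (`"studio"`, `"1"` to `"4"`), and the function names `normalize_brand` and `normalize_rooms` are hypothetical and do not reflect the label scheme stored in the CSV files.

```python
import re

# Hypothetical alias table: lowercase spelling variants -> canonical brand.
BRAND_ALIASES = {
    "opel": "Opel", "опель": "Opel",
    "mercedes": "Mercedes", "mercedes-benz": "Mercedes", "мерседес": "Mercedes",
    "lada": "Lada", "лада": "Lada", "ваз": "Lada",
}

def normalize_brand(raw):
    """Map a raw brand string to its canonical form, or None if unknown."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return BRAND_ALIASES.get(key)

def normalize_rooms(raw):
    """Map a room description to 'studio' or a digit string '1'..'4', else None."""
    text = raw.strip().lower()
    if "студия" in text or "studio" in text:
        return "studio"
    # A standalone digit 1-4 followed by "к" (as in "2-комнатная", "1-к", "3к").
    m = re.search(r"\b([1-4])\s*-?\s*к", text)
    return m.group(1) if m else None

print(normalize_brand(" Опель "), normalize_rooms("2-комнатная"), normalize_rooms("Студия"))
```

Returning `None` for unknown values, rather than guessing, lets the unit-sanity alert from item (3) catch labels that the table does not cover yet.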