<a href="https://colab.research.google.com/github/Aliaksandr-Borsuk/Recommender_Systems_project/blob/main/notebooks/07_two_tower_hybrid_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparation


**Goal:**
- Implement and evaluate a hybrid two-tower model in PyTorch that combines:
  - a collaborative signal (user ID → collaborative embedding),
  - a content signal (item features: genres, title embeddings).

**Data:**
- we use the warm split from 251021_173655,
- all interactions in test_warm take part in the evaluation, with no filtering by rating.


**Model idea: Two-Tower Hybrid**
- User Tower:
  - an embedding looked up by user_id (as in NCF): the collaborative part.
- Item Tower:
  - genres → multi-hot → dense,
  - title → a precomputed text embedding (for example, averaged BERT or TF-IDF; this notebook uses TF-IDF followed by TruncatedSVD),
  - everything is concatenated and passed through an MLP → a content-aware item embedding.

Score: the dot product user_emb ⋅ item_emb of the L2-normalized embeddings (that is, cosine similarity).  
Training: BCE loss.

This is a hybrid model that can, in principle:

- recommend new items (provided their features are available),
- outperform NCF thanks to regularization through content.

## 01. Clone the repository and mount Google Drive.

In [None]:
!rm -rf /content/Recommender_Systems_project
!git clone https://github.com/Aliaksandr-Borsuk/Recommender_Systems_project

Cloning into 'Recommender_Systems_project'...
remote: Enumerating objects: 154, done.
remote: Counting objects: 100% (154/154), done.
remote: Compressing objects: 100% (132/132), done.
remote: Total 154 (delta 81), reused 52 (delta 14), pack-reused 0 (from 0)
Receiving objects: 100% (154/154), 693.10 KiB | 6.48 MiB/s, done.
Resolving deltas: 100% (81/81), done.


In [None]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 02. Imports

In [None]:
import sys
sys.path.append("/content/Recommender_Systems_project/src")

import numpy as np
import pandas as pd
import pickle
import time
from pathlib import Path
from datetime import datetime
from pprint import pprint

from scipy.sparse import csr_matrix, load_npz
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer, normalize, StandardScaler
from sklearn.decomposition import TruncatedSVD

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.optim.lr_scheduler import StepLR

from recommender.data_io import train_test_reader                 # reads the splits saved by 001_data_and_eda_1m_proba
from recommender.metrics import model_evaluation                  # evaluates a model
from recommender.results_logger import save_experiment_results    # saves the results

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
torch.manual_seed(RANDOM_STATE)

DATA = Path("/content/drive/MyDrive/Colab Notebooks/data/")
PROCESSED = DATA / "processed"
RESULTS_DIR = DATA / "results"
TOP_K = 10

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('DEVICE =', DEVICE)

DEVICE = cuda


In [None]:
# take a look at the previously saved results
saved_results = pd.read_csv(RESULTS_DIR /'all_experiments_results.csv')
saved_results

Unnamed: 0,model_name,hit_rate@10,precision@10,recall@10,ndcg@10,map@10,coverage@10,timestamp,evaluation_date
0,Most_Popular,0.51555,0.132895,0.015591,0.133171,0.068656,0.002731,20251022_194247,"2025-10-22T19:42:47.576490,,,,,,,,,"
1,itemKNN_tfidf_k=312,0.812201,0.329067,0.040669,0.345263,0.233586,0.074823,20251108_084840,"2025-11-08T08:48:40.783074,,,,,,,,,"
2,userKNN_bmp25_k=991,0.836124,0.341029,0.043678,0.361278,0.244691,0.0639,20251108_084943,"2025-11-08T08:49:43.731391,,,,,,,,,"
3,truncated_svd_n_comp=5_n_iter=42,0.851675,0.348206,0.045528,0.365538,0.24731,0.091207,20251116_190936,"2025-11-16T19:09:36.863512,,,,,,,,,"
4,als_factors=5_iter=21_alpha=0.6_reg=0.02,0.840909,0.340191,0.043517,0.358804,0.24451,0.095576,20251210_110349,"2025-12-10T11:03:49.859474,,,,,,,,,"
5,ease_lambda=108727,0.838517,0.338517,0.042712,0.356757,0.243356,0.045603,20251211_185547,"2025-12-11T18:55:47.668555,,,,,,,,,"
6,slim_alpha_0.47_l1_ratio_0.14,0.825359,0.32201,0.041529,0.342581,0.225772,0.046969,20251214_171141,
7,NCF_BPR,0.814593,0.31555,0.037944,0.328339,0.219007,0.036865,20251228_162516,2025-12-28T16:25:16.373247
8,NCF_BCE,0.801435,0.311364,0.037039,0.324943,0.217343,0.047117,20251228_162658,2025-12-28T16:26:58.143923


## 03. Load train, test, and metadata.



In [None]:
train_test_path = '/content/drive/MyDrive/Colab Notebooks/data/processed/251021_173655'

train_warm, test_warm, meta_warm = train_test_reader(train_test_path)

pprint(meta_warm, width=80, compact=False)
print(f'\ntrain shape : {train_warm.shape}')
print(f'test shape  : {test_warm.shape}')
print( '\n', '*'*50, '\ntrain.head')
display(train_warm.head(3))
print('\n', '*'*50, '\ntest.head')
display(test_warm.head(3))

{'columns': ['user_id', 'item_id', 'rating', 'timestamp', 'title', 'genres'],
 'created_at': '2025-10-21T17:37:00.607645',
 'min_test_interactions': 10,
 'min_train_interactions': 5,
 'n_items': 3662,
 'n_test_users': 836,
 'n_train_users': 5392,
 'test_shape': [94842, 6],
 'time_treshold': '2000-12-02T14:52:18',
 'train_shape': [800142, 6]}

train shape : (800142, 6)
test shape  : (94842, 6)

 ************************************************** 
train.head


Unnamed: 0,user_id,item_id,rating,timestamp,title,genres
0,635,1251,4,975768620,8 1/2 (1963),Drama
1,635,3948,4,975768294,Meet the Parents (2000),Comedy
2,635,1270,4,975768106,Back to the Future (1985),Comedy|Sci-Fi



 ************************************************** 
test.head


Unnamed: 0,user_id,item_id,rating,timestamp,title,genres
0,635,3789,5,975768788,"Pawnbroker, The (1965)",Drama
1,635,2987,5,979141847,Who Framed Roger Rabbit? (1988),Adventure|Animation|Film-Noir
2,635,2988,4,975769007,Melvin and Howard (1980),Drama


## 04. Load the implicit interaction matrix

In [None]:
# Load the saved artifacts: the interaction matrix and the ID <-> index mappings
input_dir = PROCESSED/"artifacts"

# Load the interaction matrix
train_matrix = load_npz(input_dir / "train_matrix.npz")

# Load the mapping dictionaries
with open(input_dir / "user2index.pkl", "rb") as f:
    user2index = pickle.load(f)

with open(input_dir / "item2index.pkl", "rb") as f:
    item2index = pickle.load(f)

with open(input_dir / "index2user.pkl", "rb") as f:
    index2user = pickle.load(f)

with open(input_dir / "index2item.pkl", "rb") as f:
    index2item = pickle.load(f)

assert isinstance(train_matrix, csr_matrix), "train_matrix must be a csr_matrix"

# Replace the raw IDs with matrix indices
test_mapped = test_warm.assign(
    user_id = test_warm["user_id"].map(user2index),
    item_id = test_warm["item_id"].map(item2index)
)
assert test_mapped.isna().sum().sum() == 0, 'Unknown users or items in the test set!'

# Group the test items per user: {user_idx: {item_idx, ...}}
test_dict = test_mapped.groupby('user_id')['item_id'].apply(set).to_dict()

# Indices of all items that occur in train
all_items = set(train_warm['item_id'].map(item2index).dropna().astype(int).unique())
n_users, n_items = train_matrix.shape

## 05. Prepare content features for items

In [None]:
def build_item_features(train_df, test_df, item2index,
                        max_features=5000, n_components=16,
                        random_state=RANDOM_STATE):
    # merge train and test, keep one row per item, drop rows with missing fields
    items_df = pd.concat([train_df, test_df]) \
                 .drop_duplicates('item_id')[['item_id', 'title', 'genres']] \
                 .dropna() \
                 .sort_values('item_id')

    items_df['item_idx'] = items_df['item_id'].map(item2index).astype(int)
    items_df = items_df.set_index('item_idx').sort_index()

    # genres: multi-hot encode with MultiLabelBinarizer, then L2-normalize each row
    items_df['genre_list'] = items_df['genres'].str.split('|')
    mlb = MultiLabelBinarizer()
    genre_features = mlb.fit_transform(items_df['genre_list'])
    genre_features_norm = normalize(genre_features, norm='l2')

    # titles: strip the parenthesized year -> TF-IDF -> SVD -> StandardScaler
    items_df['title_clean'] = (
        items_df['title']
        .str.replace(r'\([^)]*\)', '', regex=True)
        .str.strip()
    )
    tfidf = TfidfVectorizer(max_features=max_features, stop_words='english')
    title_tfidf = tfidf.fit_transform(items_df['title_clean'])

    svd = TruncatedSVD(n_components=n_components, random_state=random_state)
    title_svd = svd.fit_transform(title_tfidf)

    scaler = StandardScaler()
    title_features_norm = scaler.fit_transform(title_svd)

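    # concatenate genre and title features
    item_features = np.hstack([genre_features_norm, title_features_norm])

    # Row i must describe item index i, because the model looks features up by index
    assert np.array_equal(items_df.index.to_numpy(), np.arange(len(items_df))), \
        "item_features rows are not aligned with item indices"

    print("Item features shape:", item_features.shape)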

    return items_df, item_features, mlb, tfidf, svd, scaler


In [None]:
items_df, item_features, mlb, tfidf, svd, scaler = build_item_features(
                        train_warm, test_warm, item2index,
                        max_features=5000, n_components=16, random_state=42)

Item features shape: (3662, 34)
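
The genre half of the feature matrix can be illustrated without scikit-learn. The sketch below is a minimal NumPy re-implementation of the same idea (multi-hot encoding with an alphabetically sorted vocabulary, then row-wise L2 normalization); the helper name `encode_genres` and the three toy genre strings are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def encode_genres(genre_strings):
    """Multi-hot encode '|'-separated genre strings and L2-normalize each row."""
    genre_lists = [g.split("|") for g in genre_strings]
    # Sorted vocabulary, mirroring how MultiLabelBinarizer orders its classes
    vocab = sorted({g for genres in genre_lists for g in genres})
    col = {g: j for j, g in enumerate(vocab)}

    multi_hot = np.zeros((len(genre_lists), len(vocab)), dtype=np.float64)
    for i, genres in enumerate(genre_lists):
        for g in genres:
            multi_hot[i, col[g]] = 1.0

    # Row-wise L2 normalization: a film with n genres gets 1/sqrt(n) per genre
    norms = np.linalg.norm(multi_hot, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows untouched
    return vocab, multi_hot / norms

# Toy input (hypothetical), taken from the genre strings shown in train.head
vocab, features = encode_genres(["Drama", "Comedy", "Comedy|Sci-Fi"])
print(vocab)
print(features.round(3))
```

With 16 SVD components, the reported width of 34 implies 18 genre columns in the real data.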


## 06. DataLoader and Negative Sampling

In [None]:
class ImplicitCFDataset(Dataset):
    def __init__(self, train_matrix, num_negatives=1):
        super().__init__()
        self.users, self.items = train_matrix.nonzero()
        self.num_users, self.num_items = train_matrix.shape
        self.num_negatives = num_negatives

    def __len__(self):
        return len(self.users)

    def __getitem__(self, idx):
        user = self.users[idx]
        pos_item = self.items[idx]
        # Simplified negative sampling: sampled items are not guaranteed to be true negatives,
        # so a known positive may occasionally be drawn. We accept that for the sake of speed.
        neg_items = np.random.randint(0, self.num_items, size=self.num_negatives)
        return user, pos_item, neg_items

def collate_fn(batch):
    '''
    batch is a list of tuples: [(u1, i1+, [i1-]), (u2, i2+, [i2-]), ...]
    '''
    users, pos_items, neg_items = zip(*batch)
    users = torch.LongTensor(users)
    pos_items = torch.LongTensor(pos_items)
    # flatten the per-sample negative arrays into one 1-D tensor
    neg_items = torch.LongTensor(np.concatenate(neg_items))
    return users, pos_items, neg_items # all tensors are flat (1-D)
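
The dataset above samples negatives uniformly and may occasionally return a true positive. If that noise ever matters, one option is rejection sampling against the user's known positives. The sketch below is a minimal, self-contained illustration; `sample_negatives` and the toy positive set are hypothetical names and values, not code from the repository.

```python
import numpy as np

def sample_negatives(user_pos_items, num_items, num_negatives, rng):
    """Draw `num_negatives` distinct item indices the user has not interacted with."""
    if len(user_pos_items) + num_negatives > num_items:
        raise ValueError("not enough unseen items to sample from")
    negatives = []
    while len(negatives) < num_negatives:
        candidate = int(rng.integers(0, num_items))
        # Reject known positives and duplicates within this draw
        if candidate not in user_pos_items and candidate not in negatives:
            negatives.append(candidate)
    return np.array(negatives, dtype=np.int64)

rng = np.random.default_rng(42)
positives = {0, 2, 5}  # toy set of items one user has interacted with
negs = sample_negatives(positives, num_items=10, num_negatives=4, rng=rng)
print(negs)
```

For a CSR matrix, the positives of user `u` are `indices[indptr[u]:indptr[u + 1]]`, so the per-user sets could be built once in `__init__`. The trade-off is slower sampling; on a sparse matrix the uniform shortcut hits a positive only rarely.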

## 07. Two-Tower Hybrid model


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerHybrid(nn.Module):
    def __init__(self, n_users, item_features, emb_dim=32, item_hidden=(128, 64), dropout=0.2, use_sigmoid=True):
        super().__init__()
        # User tower: trainable embeddings
        self.user_emb = nn.Embedding(n_users, emb_dim)

        # Item tower: fixed content features + MLP
        self.register_buffer("item_features", torch.FloatTensor(item_features))
        layers = []
        in_f = item_features.shape[1]
        for out_f in item_hidden:
            layers.append(nn.Linear(in_f, out_f))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))
            in_f = out_f
        layers.append(nn.Linear(in_f, emb_dim))
        self.item_mlp = nn.Sequential(*layers)

        self.use_sigmoid = use_sigmoid
        self._init_weights()

    def _init_weights(self):
        nn.init.normal_(self.user_emb.weight, std=0.01)
        for m in self.item_mlp:
            if isinstance(m, nn.Linear):
                nn.init.xavier_normal_(m.weight)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)

    # Encode users and items separately (both outputs are L2-normalized)
    def encode_users(self, user_ids):
        return F.normalize(self.user_emb(user_ids), dim=1)

    def encode_items(self, item_ids):
        item_feat = self.item_features[item_ids]
        item_vec = self.item_mlp(item_feat)
        return F.normalize(item_vec, dim=1)

    # Main forward pass: cosine similarity, optionally passed through a sigmoid
    def forward(self, user_ids, item_ids):
        user_vec = self.encode_users(user_ids)
        item_vec = self.encode_items(item_ids)
        scores = (user_vec * item_vec).sum(1)
        return torch.sigmoid(scores) if self.use_sigmoid else scores

    # Prediction at inference time (no gradient tracking)
    def predict(self, user_ids, item_ids):
        with torch.no_grad():
            return self.forward(user_ids, item_ids)


## 09. Training (BCE)

In [None]:
def bpr_loss(pos_scores, neg_scores):
    # logsigmoid is numerically more stable than log(sigmoid(x))
    return -F.logsigmoid(pos_scores - neg_scores).mean()

bce_loss_fn = torch.nn.BCELoss()

def train_two_tower_fast(model, train_matrix, mode="BPR", epochs=20,
                         batch_size=2048, lr=0.001, weight_decay=1e-4,
                         step_size = 5, gamma = 0.5, num_negatives=4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay = weight_decay)

    scheduler = StepLR(optimizer, step_size = step_size , gamma = gamma)
    dataset = ImplicitCFDataset(train_matrix, num_negatives=num_negatives)  # fast, approximate negative sampling
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

    model.train()
    for epoch in range(epochs):
        epoch_loss = 0.0
        for users, pos_items, neg_items in dataloader:
            users, pos_items, neg_items = users.to(DEVICE), pos_items.to(DEVICE), neg_items.to(DEVICE)
            optimizer.zero_grad()

            if mode == "BPR":
                model.use_sigmoid = False
                # Repeat each positive score so it lines up with its num_negatives negatives
                pos_scores = model(users, pos_items).repeat_interleave(num_negatives)
                neg_scores = model(users.repeat_interleave(num_negatives), neg_items)
                loss = bpr_loss(pos_scores, neg_scores)

            elif mode == "BCE":
                model.use_sigmoid = True
                pos_scores = model(users, pos_items)
                pos_labels = torch.ones_like(pos_scores)
                neg_scores = model(users.repeat_interleave(num_negatives), neg_items)
                neg_labels = torch.zeros_like(neg_scores)
                scores = torch.cat([pos_scores, neg_scores])
                labels = torch.cat([pos_labels, neg_labels])
                loss = bce_loss_fn(scores, labels)

            else:
                raise ValueError(f"Unknown mode: {mode!r} (expected 'BPR' or 'BCE')")

            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        scheduler.step()

        # epoch_loss is the sum of the per-batch mean losses, not a per-sample average
        print(f"Epoch {epoch+1}/{epochs}, {mode} Loss: {epoch_loss:.4f}")

In [None]:
# Hyperparameters
EPOCHS = 30
BATCH_SIZE = 2048
LR = 0.0005
DROPOUT=0.2
EMB_DIM = 32
ITEM_HIDDEN = [128, 64, 32]
NUM_NEGATIVES = 4
WEIGHT_DECAY = 1e-3
STEP_SIZE = 5
GAMMA = 0.777

# measure the training time
start = time.time()

model = TwoTowerHybrid(n_users, item_features, emb_dim=EMB_DIM,
                       item_hidden=ITEM_HIDDEN,dropout=DROPOUT,
                       use_sigmoid=True).to(DEVICE)
train_two_tower_fast(model, train_matrix, mode="BCE", epochs=EPOCHS,
                     batch_size=BATCH_SIZE, lr=LR,weight_decay= WEIGHT_DECAY,
                     step_size = STEP_SIZE, gamma = GAMMA,
                     num_negatives=NUM_NEGATIVES)

end = time.time()
study_model_time = end - start

print('\n')
print('*'*70)
print(f"Elapsed time: {study_model_time:.2f} seconds")

Epoch 1/30, BCE Loss: 206.0659
Epoch 2/30, BCE Loss: 198.6790
Epoch 3/30, BCE Loss: 196.6499
Epoch 4/30, BCE Loss: 196.3108
Epoch 5/30, BCE Loss: 196.1110
Epoch 6/30, BCE Loss: 195.8658
Epoch 7/30, BCE Loss: 195.7346
Epoch 8/30, BCE Loss: 195.6737
Epoch 9/30, BCE Loss: 195.6682
Epoch 10/30, BCE Loss: 195.6181
Epoch 11/30, BCE Loss: 195.4738
Epoch 12/30, BCE Loss: 195.4104
Epoch 13/30, BCE Loss: 195.3878
Epoch 14/30, BCE Loss: 195.3903
Epoch 15/30, BCE Loss: 195.2860
Epoch 16/30, BCE Loss: 195.2274
Epoch 17/30, BCE Loss: 195.1501
Epoch 18/30, BCE Loss: 195.1827
Epoch 19/30, BCE Loss: 195.2069
Epoch 20/30, BCE Loss: 195.1568
Epoch 21/30, BCE Loss: 195.0162
Epoch 22/30, BCE Loss: 195.0289
Epoch 23/30, BCE Loss: 195.0463
Epoch 24/30, BCE Loss: 195.0283
Epoch 25/30, BCE Loss: 194.9904
Epoch 26/30, BCE Loss: 194.9081
Epoch 27/30, BCE Loss: 194.9021
Epoch 28/30, BCE Loss: 194.8696
Epoch 29/30, BCE Loss: 194.9227
Epoch 30/30, BCE Loss: 194.8980


***********************************************
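
The logged loss flattens near 195 after a few epochs. Part of the explanation is structural: the model applies a sigmoid to a cosine similarity, which lies in [-1, 1], so the predicted probability can never leave a narrow band and the BCE has a hard floor. The arithmetic below is a rough check, assuming one positive per training row (800,142) and therefore `ceil(800142 / 2048)` batches per epoch; the true number of non-zeros in `train_matrix` is not printed in the notebook.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Cosine similarity is bounded to [-1, 1], so sigmoid(score) stays within these limits
p_min, p_max = sigmoid(-1.0), sigmoid(1.0)

# Best case: every positive scores +1 and every negative scores -1
best_bce = -math.log(p_max)  # equals -log(1 - p_min) by symmetry
# Uninformative case: every pair scores 0, so the predicted probability is 0.5
chance_bce = -math.log(0.5)

# Assumption: one positive per training row (800,142) and BATCH_SIZE = 2048
n_batches = math.ceil(800_142 / 2048)
observed_mean = 194.8980 / n_batches  # the logged epoch loss is a sum over batches

print(f"sigmoid range: [{p_min:.3f}, {p_max:.3f}]")
print(f"best possible mean BCE: {best_bce:.4f}")
print(f"chance-level mean BCE:  {chance_bce:.4f}")
print(f"observed mean BCE:      {observed_mean:.4f} over {n_batches} batches")
```

Under these assumptions the floor is about 0.313 and chance level about 0.693, with the final epoch at roughly 0.50 per batch. A common remedy in two-tower setups is to divide the cosine score by a temperature (or multiply by a learnable scale) before the sigmoid, which widens the reachable probability range; that variant was not tried here.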

## 08. Inference

In [None]:
def recommend_hybrid(model, user_ids, train_matrix, k=10):
    model.eval()
    recs = {}
    with torch.no_grad():
        # Raw MLP outputs for all items (not L2-normalized, unlike encode_items)
        all_item_embs = model.item_mlp(model.item_features)  # (n_items, emb_dim)
        for user in user_ids:
            user_emb = model.user_emb(torch.LongTensor([user]).to(DEVICE))  # (1, emb_dim)
            scores = (user_emb @ all_item_embs.T).squeeze()  # (n_items,)
            # Exclude items the user already interacted with in train
            seen = train_matrix[user].toarray().squeeze().astype(bool)
            scores[seen] = -1e9
            topk = torch.topk(scores, k).indices.cpu().tolist()
            recs[user] = topk
    return recs
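
One detail worth checking: training scores L2-normalized vectors (`encode_users` and `encode_items` both call `F.normalize`), whereas `recommend_hybrid` ranks by the raw `user_emb` and `item_mlp` outputs. The user norm does not change a single user's ranking, but item norms can. The NumPy sketch below uses hypothetical toy vectors to show how the two rankings can diverge; `top_k` and `l2_normalize` are illustrative helpers, not repository code.

```python
import numpy as np

def top_k(scores, seen_mask, k):
    """Return indices of the k highest-scoring unseen items, best first."""
    masked = np.where(seen_mask, -np.inf, scores)
    return np.argsort(-masked)[:k]

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

# Toy embeddings (hypothetical values): item 1 has a large norm but a worse direction
user_vec = np.array([1.0, 0.0])
item_vecs = np.array([
    [0.9, 0.1],   # item 0: well aligned with the user, small norm
    [3.0, 4.0],   # item 1: less aligned, large norm
    [0.0, 1.0],   # item 2: orthogonal to the user
    [2.0, 0.0],   # item 3: perfectly aligned, but already seen
])
seen = np.array([False, False, False, True])

raw_rank = top_k(item_vecs @ user_vec, seen, k=2)
cos_rank = top_k(l2_normalize(item_vecs) @ l2_normalize(user_vec), seen, k=2)
print("raw dot product:", raw_rank.tolist())
print("cosine:         ", cos_rank.tolist())
```

In the notebook, the analogous change would be to build the item matrix with `model.encode_items(...)` over all item indices. The metrics reported below were computed with the raw, un-normalized version, so the effect of that change on this dataset is not known.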

## 10. Evaluation

In [None]:
recs =  recommend_hybrid(model, list(test_dict.keys()), train_matrix, k=TOP_K)
result = model_evaluation(recs, test_dict, all_items, k=TOP_K, model_name='TwoTowerHybrid')
display(result)

Unnamed: 0,hit_rate@10,precision@10,recall@10,ndcg@10,map@10,coverage@10
TwoTowerHybrid,0.76555,0.275,0.032428,0.294001,0.189015,0.028673
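
`model_evaluation` comes from the repository's `recommender.metrics` module and its implementation is not shown here. For orientation, the sketch below computes three of the reported metrics under their common textbook definitions; the function name and toy data are assumptions, and the repository's version may differ in details such as how users without recommendations are handled.

```python
def hit_precision_recall_at_k(recs, test_dict, k=10):
    """Average HitRate@k, Precision@k and Recall@k over users present in both dicts."""
    hits, precisions, recalls = [], [], []
    for user, relevant in test_dict.items():
        if user not in recs or not relevant:
            continue
        top = recs[user][:k]
        n_hit = sum(1 for item in top if item in relevant)
        hits.append(1.0 if n_hit > 0 else 0.0)
        precisions.append(n_hit / k)
        recalls.append(n_hit / len(relevant))
    n = len(hits)
    if n == 0:
        return {"hit_rate": 0.0, "precision": 0.0, "recall": 0.0}
    return {
        "hit_rate": sum(hits) / n,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
    }

# Toy data (hypothetical): two users, k = 3
recs = {0: [5, 1, 7], 1: [2, 3, 4]}
test_dict = {0: {1, 7, 9, 11}, 1: {8}}
print(hit_precision_recall_at_k(recs, test_dict, k=3))
```

Under this definition Recall@10 is capped at 10 divided by the number of test items per user. With 94,842 test interactions over 836 users (about 113 each on average), recall of a few percent is expected for every model in the table, not only this one.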


## 11. Saving the results

In [None]:
results_data, json_file, csv_file = save_experiment_results(
                                        result=result,
                                        model_name="TTH",
                                        meta=meta_warm,
                                        results_dir = RESULTS_DIR
                                    )

Result appended to the existing CSV file
JSON result saved as: TTH_20260102_075229.json
CSV with all experiments: all_experiments_results.csv
All results are in: /content/drive/MyDrive/Colab Notebooks/data/results

EXPERIMENT SUMMARY
Model: TTH
Timestamp: 20260102_075229
Evaluation date: 2026-01-02T07:52:29
Train size: 800,142
Test size: 94,842
Users in test: 836
Unique items: 3662
HitRate@10: 76.6%
precision@10: 27.50%
recall@10: 3.24%
ndcg@10: 29.40%
map@10: 18.90%
Coverage@10: 2.87%

Latest experiments (10 total):


Unnamed: 0,model_name,hit_rate@10,precision@10,recall@10,ndcg@10,map@10,coverage@10,timestamp,evaluation_date
5,ease_lambda=108727,0.838517,0.338517,0.042712,0.356757,0.243356,0.045603,20251211_185547,"2025-12-11T18:55:47.668555,,,,,,,,,"
6,slim_alpha_0.47_l1_ratio_0.14,0.825359,0.32201,0.041529,0.342581,0.225772,0.046969,20251214_171141,
7,NCF_BPR,0.814593,0.31555,0.037944,0.328339,0.219007,0.036865,20251228_162516,2025-12-28T16:25:16.373247
8,NCF_BCE,0.801435,0.311364,0.037039,0.324943,0.217343,0.047117,20251228_162658,2025-12-28T16:26:58.143923
9,TwoTowerHybrid,0.76555,0.275,0.032428,0.294001,0.189015,0.028673,20260102_075229,2026-01-02T07:52:29.633199


## 12. Conclusions:
**TwoTowerHybrid** achieved:

- HitRate@10 ~ 76.6%

- Precision@10 ~ 27.5%

- Recall@10 ~ 3.2%

- NDCG@10 ~ 29.4%

- Coverage@10 ~ 2.9%

This is worse than every other model in the results table except Most_Popular, on all six metrics. Coverage@10 is the lowest among the personalized models.

**Likely causes:**

- The user tower is weak (an ID embedding only).

- The content features are limited (genres + TF‑IDF on titles).

- A dot product does not model complex interactions.

- Coverage is low: the model concentrates on popular films.

**However:**

- The model can recommend new items when their features are available.

- The architecture scales and is easy to extend (for example, by adding BERT embeddings or aggregating the user's history).

- New features are easy to integrate.
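
As an illustration of the "aggregate the user's history" extension mentioned above, the sketch below builds user vectors by mean-pooling the embeddings of the items each user interacted with, instead of learning a free ID embedding. This is one possible design, not something the notebook implements; the function name and toy arrays are hypothetical.

```python
import numpy as np

def history_user_vectors(interactions, item_embs):
    """Mean-pool the embeddings of each user's interacted items, then L2-normalize.

    interactions: binary (n_users, n_items) array; item_embs: (n_items, dim) array.
    """
    counts = interactions.sum(axis=1, keepdims=True)
    counts = np.maximum(counts, 1)  # avoid division by zero for empty histories
    pooled = (interactions @ item_embs) / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norms, 1e-12)

# Toy data (hypothetical): 2 users, 3 items, 2-dimensional item embeddings
interactions = np.array([[1, 1, 0],
                         [0, 0, 1]], dtype=np.float64)
item_embs = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [0.6, 0.8]])
user_vecs = history_user_vectors(interactions, item_embs)
print(user_vecs.round(3))
```

A user tower built this way shares parameters with the item tower, so it could also produce a vector for a user who was not seen during training, as long as some history exists.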
