In this seminar, you've explored a basic implementation of the Deep Structured Semantic Model (DSSM).

Your task is to **improve this model** in one or more of the following directions:

### ‚úÖ Model Improvements
- [ ] Replace MLP towers with Transformer or RNN encoders or etc. (5 –±–∞–ª–ª–æ–≤)
- [x] Use different triplet loss. (3 –±–∞–ª–ª–∞)
- [x] Add dropout, batch normalization, or layer norm. (3 –±–∞–ª–ª–∞)
- [x] Integrate embeddings instead of one-hot vectors. (5 –±–∞–ª–ª–æ–≤)
- [ ] Visualize similarity distribution for positive vs. negative pairs. (5 –±–∞–ª–ª–æ–≤)

### ‚úÖ Evaluation & Analysis
- [x] Visualize embeddings using t-SNE or UMAP. (3 –±–∞–ª–ª–æ–≤)
- [x] Develop and improve beyond accuracy metrics. (5 –±–∞–ª–ª–æ–≤)

### üìÑ Deliverables
- [x] Explain what you changed and why in the final markdown cell. (3 –±–∞–ª–ª–∞)
- [x] Keep code modular, clean, and well-documented. (3 –±–∞–ª–ª–∞)

### üìù Production
- create service based on DSSM vectors with ANN. (8 –±–∞–ª–ª–æ–≤)

### üìù Leaderboard
- Improve score from UserKNN via DSSM (8 –±–∞–ª–ª–æ–≤)


–ú–∞–∫—Å–∏–º—É–º –±–∞–ª–ª–æ–≤, –∫–æ—Ç–æ—Ä—ã–µ –º–æ–∂–Ω–æ –ø–æ–ª—É—á–∏—Ç—å - 25.

In [1]:
# ----------------------
# 2. IMPORTS AND SETUP
# ----------------------
import os
import requests
import zipfile
from tqdm.auto import tqdm
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import OneHotEncoder
from collections import Counter
import warnings
import umap
warnings.filterwarnings("ignore")

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# ----------------------
# 3. DOWNLOAD AND LOAD DATA
# ----------------------
def download_and_extract():
    url = 'https://github.com/irsafilo/KION_DATASET/raw/f69775be31fa5779907cf0a92ddedb70037fb5ae/data_original.zip'
    filename = 'kion_train.zip'

    response = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        total = int(response.headers.get('content-length', 0))
        progress = tqdm(response.iter_content(1024 * 1024),
                        f"Downloading {filename}",
                        total=total // (1024 * 1024), unit='MB')
        for chunk in progress:
            f.write(chunk)

    with zipfile.ZipFile(filename, 'r') as zip_ref:
        zip_ref.extractall("data")
    os.remove(filename)

if not os.path.exists("data/data_original"):
    download_and_extract()



In [3]:
# ----------------------
# 4. DATA PREPROCESSING
# ----------------------
interactions_df = pd.read_csv('data/data_original/interactions.csv', parse_dates=["last_watch_dt"])
users_df = pd.read_csv('data/data_original/users.csv')
items_df = pd.read_csv('data/data_original/items.csv')

# Label Encoding
–í–º–µ—Å—Ç–æ –∏—Å–ø–æ–ª—å–∑–æ–≤–∞–Ω–∏—è one-hot –≤–µ–∫—Ç–æ—Ä–æ–≤ –∑–∞–∫–æ–¥–∏—Ä—É–µ–º –∫–∞–∂–¥—É—é —Ñ–∏—á—É

In [4]:
# Input:
# - user_df - –∏—Å—Ö–æ–¥–Ω—ã–π –¥–∞—Ç–∞—Ñ—Ä–µ–π–º —Å —é–∑–µ—Ä–∞–º–∏

# Oupput:
# - user_df - –æ—Ç—Ñ–∏–ª—å—Ç—Ä–æ–≤–∞–Ω–Ω—ã–π –¥–∞—Ç–∞—Ñ—Ä–µ–π–º

# –ü—Ä–µ–æ–±—Ä–∞–∑–æ–≤–∞–Ω–∏—è:
# - –§–∏–ª—å—Ç—Ä—É–µ–º –Ω–µ –Ω—É–∂–Ω—ã–µ —Ñ–∏—á–∏ –≤ –¥–∞—Ç–∞—Ñ—Ä–µ–π–º–µ

from sklearn.preprocessing import LabelEncoder

# –ó–∞–ø–æ–ª–Ω—è–µ–º —Å—Ç—Ä–æ–∫–∏ "unknown"
for col in ['income', 'sex', 'age']:
    users_df[col] = users_df[col].fillna('unknown')

USERS_FEATURES = ["age", "income", "sex", "kids_flg"]
users_df = users_df[USERS_FEATURES + ["user_id"]]

# Encoding
user_encoders = {}
for feat in USERS_FEATURES:
    le = LabelEncoder()
    users_df[feat] = le.fit_transform(users_df[feat])
    user_encoders[feat] = le

users_df.head()

Unnamed: 0,age,income,sex,kids_flg,user_id
0,1,4,2,1,973171
1,0,2,2,0,962099
2,3,3,1,0,1047345
3,3,2,1,0,721985
4,2,4,1,0,704055


In [5]:
# Input:
# - items_df - –∏—Å—Ö–æ–¥–Ω—ã–π –¥–∞—Ç–∞—Ñ—Ä–µ–π–º —Å –∞–π—Ç–µ–º–∞–º–∏

# Oupput:
# - items_df - –æ—Ç—Ñ–∏–ª—å—Ç—Ä–æ–≤–∞–Ω–Ω—ã–π –¥–∞—Ç–∞—Ñ—Ä–µ–π–º

# –ü—Ä–µ–æ–±—Ä–∞–∑–æ–≤–∞–Ω–∏—è:
# - –§–∏–ª—å—Ç—Ä—É–µ–º –Ω–µ –Ω—É–∂–Ω—ã–µ —Ñ–∏—á–∏ –≤ –¥–∞—Ç–∞—Ñ—Ä–µ–π–º–µ
# - –ò–∑–±–∞–≤–ª—è–µ–º—Å—è –æ—Ç NaN –∑–Ω–∞—á–µ–Ω–∏–π

ITEMS_FEATURES = ['content_type', 'release_year', 'for_kids', 'age_rating', 'studios', 'countries', 'directors']

# –ó–∞–ø–æ–ª–Ω—è–µ–º —Å—Ç—Ä–æ–∫–∏ "unknown"
for col in ['content_type', 'studios', 'countries', 'directors']:
    items_df[col] = items_df[col].fillna('unknown')

# –ó–∞–ø–æ–ª–Ω—è–µ–º —á–∏—Å–ª–æ–≤—ã–µ —Ñ–∏—á–∏ —Å–ø–µ—Ü–∏–∞–ª—å–Ω—ã–º –∑–Ω–∞—á–µ–Ω–∏–µ–º (-1)
for col in ['release_year', 'for_kids', 'age_rating']:
    items_df[col] = items_df[col].fillna(-1)

items_df = items_df[ITEMS_FEATURES + ['item_id']]

item_encoders = {}
for feat in ITEMS_FEATURES:
    le = LabelEncoder()
    items_df[feat] = le.fit_transform(items_df[feat])
    item_encoders[feat] = le

items_df.head()

Unnamed: 0,content_type,release_year,for_kids,age_rating,studios,countries,directors,item_id
0,0,86,0,4,33,258,5671,10711
1,0,98,0,4,33,421,6546,2508
2,0,95,0,4,33,298,95,10716
3,0,99,0,4,33,57,7735,7868
4,0,62,0,3,34,419,1544,16268


–î–ª—è –∫–∞–∂–¥–æ–π —Ñ–∏—á–∏ –ø—Ä–æ–ø–∏—à–µ–º —Ä–∞–∑–º–µ—Ä–Ω–æ—Å—Ç—å —ç–º–±–µ–¥–∏–Ω–≥–æ–≤

In [7]:
items_categorical_size = {feat: items_df[feat].nunique() for feat in ITEMS_FEATURES}
items_features_info = [
    (items_categorical_size['content_type'], 8),
    (items_categorical_size['release_year'], 8),
    (items_categorical_size['for_kids'], 2),
    (items_categorical_size['age_rating'], 8),
    (items_categorical_size['studios'], 16),
    (items_categorical_size['countries'], 8),
    (items_categorical_size['directors'], 16),
]

In [8]:
users_categorical_size = {feat: users_df[feat].nunique() for feat in USERS_FEATURES}
users_features_info = [
    (users_categorical_size['age'], 16),
    (users_categorical_size['income'], 16),
    (users_categorical_size['sex'], 2),
    (users_categorical_size['kids_flg'], 2),
]

# –ü—Ä–µ-–ø—Ä–æ—Ü–µ—Å—Å–∏–Ω–≥
–í —ç—Ç–æ–π —Å–µ–∫—Ü–∏–∏ –ø—Ä–æ–∏—Å—Ö–æ–¥—è—Ç –±–∞–∑–æ–≤—ã–µ –ø—Ä–µ–æ–±—Ä–∞–∑–æ–≤–∞–Ω–∏—è, –∫–∞–∫ –±—ã–ª–æ –ø–æ–∫–∞–∑–∞–Ω–æ –Ω–∞ –ª–µ–∫—Ü–∏–∏. –ò—Å–∫–ª—é—á–µ–Ω–∏–µ–º –ª–∏—à—å —è–≤–ª—è–µ—Ç—Å—è —Ç–æ, —á—Ç–æ —Ñ–∏—á–∏ –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª–µ–π –∏ –∞–π—Ç–µ–º–æ–≤ –Ω–µ –ø—Ä–µ–æ–±—Ä–∞–∑–æ–≤—ã–≤–∞—é—Ç—Å—è –∫–∞–∫ one-hot –≤–µ–∫—Ç–æ—Ä–∞, –ø–æ—Ç–æ–º—É —á—Ç–æ —è –±—É–¥—É –∏—Å–ø–æ–ª—å–∑–æ–≤–∞—Ç—å
–æ–±—É—á–∞–µ–º—ã–µ —ç–º–±–µ–¥–¥–∏–Ω–≥–∏

In [9]:
# Input:
# - interactions_df ‚Äî —Å—ã—Ä—ã–µ –≤–∑–∞–∏–º–æ–¥–µ–π—Å—Ç–≤–∏—è –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª–µ–π —Å —Ñ–∏–ª—å–º–∞–º–∏

# Output:
# - –û—á–∏—â–µ–Ω–Ω—ã–π interactions_df, —Å–æ–¥–µ—Ä–∂–∞—â–∏–π:
#   - –¢–æ–ª—å–∫–æ –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª–µ–π, –ø—Ä–æ—Å–º–æ—Ç—Ä–µ–≤—à–∏—Ö –±–æ–ª–µ–µ 10 —Ñ–∏–ª—å–º–æ–≤.
#   - –¢–æ–ª—å–∫–æ —Ñ–∏–ª—å–º—ã, –ø—Ä–æ—Å–º–æ—Ç—Ä–µ–Ω–Ω—ã–µ –±–æ–ª–µ–µ —á–µ–º 10 –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª—è–º–∏.
#   - –¢–æ–ª—å–∫–æ –≤–∑–∞–∏–º–æ–¥–µ–π—Å—Ç–≤–∏—è, –≥–¥–µ –ø—Ä–æ—Å–º–æ—Ç—Ä —Ñ–∏–ª—å–º–∞ –ø—Ä–µ–≤—ã—à–∞–µ—Ç 10%.

# –ü—Ä–µ–æ–±—Ä–∞–∑–æ–≤–∞–Ω–∏—è:
# - –£–¥–∞–ª–µ–Ω—ã –≤–∑–∞–∏–º–æ–¥–µ–π—Å—Ç–≤–∏—è —Å watched_pct ‚â§ 10%.
# - –û—Å—Ç–∞–≤–ª–µ–Ω—ã —Ç–æ–ª—å–∫–æ –∞–∫—Ç–∏–≤–Ω—ã–µ –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª–∏ (—Å –±–æ–ª–µ–µ —á–µ–º 10 –≤–∑–∞–∏–º–æ–¥–µ–π—Å—Ç–≤–∏—è–º–∏).
# - –û—Å—Ç–∞–≤–ª–µ–Ω—ã —Ç–æ–ª—å–∫–æ –ø–æ–ø—É–ª—è—Ä–Ω—ã–µ —Ñ–∏–ª—å–º—ã (—Å –±–æ–ª–µ–µ —á–µ–º 10 –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª—è–º–∏).
# - –ò—Ç–æ–≥–æ–≤—ã–π –¥–∞—Ç–∞—Å–µ—Ç —É–º–µ–Ω—å—à–µ–Ω –≤ –æ–±—ä–µ–º–µ, –Ω–æ –æ—á–∏—â–µ–Ω –¥–ª—è –ø–æ–≤—ã—à–µ–Ω–∏—è –∫–∞—á–µ—Å—Ç–≤–∞ –æ–±—É—á–µ–Ω–∏—è –º–æ–¥–µ–ª–∏.


interactions_df = interactions_df[interactions_df.watched_pct > 10]
valid_users = []
c = Counter(interactions_df.user_id)
for user_id, entries in c.most_common():
    if entries > 10:
        valid_users.append(user_id)
valid_items = []
c = Counter(interactions_df.item_id)
for item_id, entries in c.most_common():
    if entries > 10:
        valid_items.append(item_id)

interactions_df = interactions_df[interactions_df.user_id.isin(valid_users)]
interactions_df = interactions_df[interactions_df.item_id.isin(valid_items)]

In [10]:
# Input:
# - interactions_df ‚Äî –æ—á–∏—â–µ–Ω–Ω—ã–µ –≤–∑–∞–∏–º–æ–¥–µ–π—Å—Ç–≤–∏—è –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª–µ–π –∏ —Ñ–∏–ª—å–º–æ–≤.
# - users_df ‚Äî —Ç–∞–±–ª–∏—Ü–∞ —Å –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª—è–º–∏ –∏ –∏—Ö –ø—Ä–∏–∑–Ω–∞–∫–∞–º–∏ (–ù–ï one-hot).
# - items_df ‚Äî —Ç–∞–±–ª–∏—Ü–∞ —Å —Ñ–∏–ª—å–º–∞–º–∏ –∏ –∏—Ö –ø—Ä–∏–∑–Ω–∞–∫–∞–º–∏ (–ù–ï one-hot).

# Output:
# - interactions_df, items_df –∏ users_df ‚Äî —Å–∏–Ω—Ö—Ä–æ–Ω–∏–∑–∏—Ä–æ–≤–∞–Ω–Ω—ã–µ —Ç–∞–±–ª–∏—Ü—ã, —Å–æ–¥–µ—Ä–∂–∞—â–∏–µ —Ç–æ–ª—å–∫–æ –æ–±—â–∏—Ö –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª–µ–π –∏ —Ñ–∏–ª—å–º—ã.

# –ü—Ä–µ–æ–±—Ä–∞–∑–æ–≤–∞–Ω–∏—è:
# - –ù–∞–π–¥–µ–Ω—ã –ø–µ—Ä–µ—Å–µ—á–µ–Ω–∏—è –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª–µ–π –∏ —Ñ–∏–ª—å–º–æ–≤, –ø—Ä–∏—Å—É—Ç—Å—Ç–≤—É—é—â–∏—Ö –æ–¥–Ω–æ–≤—Ä–µ–º–µ–Ω–Ω–æ –≤ interactions_df –∏ users_df/items_df.
# - –£–¥–∞–ª–µ–Ω—ã –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª–∏ –∏ —Ñ–∏–ª—å–º—ã, –æ—Ç—Å—É—Ç—Å—Ç–≤—É—é—â–∏–µ –≤ –æ–±–æ–∏—Ö —Å–æ–æ—Ç–≤–µ—Ç—Å—Ç–≤—É—é—â–∏—Ö –¥–∞—Ç–∞—Å–µ—Ç–∞—Ö.
# - –ì–∞—Ä–∞–Ω—Ç–∏—Ä–æ–≤–∞–Ω–∞ –∫–æ–Ω—Å–∏—Å—Ç–µ–Ω—Ç–Ω–æ—Å—Ç—å –º–µ–∂–¥—É –≤–∑–∞–∏–º–æ–¥–µ–π—Å—Ç–≤–∏—è–º–∏ –∏ –ø—Ä–∏–∑–Ω–∞–∫–æ–≤—ã–º–∏ —Ç–∞–±–ª–∏—Ü–∞–º–∏ (users_df –∏ items_df).


common_users = set(interactions_df.user_id.unique()).intersection(set(users_df.user_id.unique()))
common_items = set(interactions_df.item_id.unique()).intersection(set(items_df.item_id.unique()))

print(len(common_users))
print(len(common_items))

interactions_df = interactions_df[interactions_df.item_id.isin(common_items)]
interactions_df = interactions_df[interactions_df.user_id.isin(common_users)]

items_df = items_df[items_df.item_id.isin(common_items)]
users_df = users_df[users_df.user_id.isin(common_users)]

65974
6901


In [11]:
common_users = set(interactions_df.user_id.unique()).intersection(set(users_df.user_id.unique()))
common_items = set(interactions_df.item_id.unique()).intersection(set(items_df.item_id.unique()))

print(len(common_users))
print(len(common_items))

interactions_df = interactions_df[interactions_df.item_id.isin(common_items)]
interactions_df = interactions_df[interactions_df.user_id.isin(common_users)]

items_df = items_df[items_df.item_id.isin(common_items)]
users_df = users_df[users_df.user_id.isin(common_users)]

65974
6897


# –ü–æ—Å—Ç-–ø—Ä–æ—Ü–µ—Å—Å–∏–Ω–≥ : train –∏ test

### –í—Ä–µ–º–µ–Ω–Ω–æ–π —Å–ø–ª–∏—Ç –¥–ª—è –ø–æ—Å—Ç—Ä–æ–µ–Ω–∏—è interaction –º–∞—Ç—Ä–∏—Ü:
- –î–µ–ª–∏–º interactions_df –ø–æ –¥–∞—Ç–µ –Ω–∞ train/test –ø–æ –ø–æ—Å–ª–µ–¥–Ω–∏–º N –¥–Ω–µ–π.
- –ö–æ–¥–∏—Ä—É–µ–º user_id –∏ item_id –≤ train —á–µ—Ä–µ–∑ –∫–∞—Ç–µ–≥–æ—Ä–∏–∞–ª—å–Ω—ã–µ uid –∏ iid.
- –ü—Ä–∏–º–µ–Ω—è–µ–º —Ç–æ—Ç –∂–µ –º–∞–ø–ø–∏–Ω–≥ –∫ test.
- –°—Ç—Ä–æ–∏–º interaction –º–∞—Ç—Ä–∏—Ü—ã –æ—Ç–¥–µ–ª—å–Ω–æ –¥–ª—è train –∏ test.
- –¢–∞–∫–∏–º –æ–±—Ä–∞–∑–æ–º, –æ–±—É—á–∞–µ–º –º–æ–¥–µ–ª—å –Ω–∞ –∏—Å—Ç–æ—Ä–∏–∏, –∞ —Ç–µ—Å—Ç–∏—Ä—É–µ–º –Ω–∞ –ø—Ä–µ–¥—Å–∫–∞–∑–∞–Ω–∏–∏ –±—É–¥—É—â–∏—Ö –∏–Ω—Ç–µ—Ä–µ—Å–æ–≤ –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª—è.


In [12]:
# –†–∞–∑–±–∏–≤–∞–µ–º –¥–∞—Ç–∞—Å–µ—Ç –Ω–∞ —Ç—Ä–µ–π–Ω –∏ —Ç–µ—Å—Ç–æ–≤—É—é —á–∞—Å—Ç—å
N_DAYS = 7
max_date = interactions_df['last_watch_dt'].max()

train_df = interactions_df[interactions_df['last_watch_dt'] <= max_date - pd.Timedelta(days=N_DAYS)]
test_df = interactions_df[interactions_df['last_watch_dt'] > max_date - pd.Timedelta(days=N_DAYS)]


# –¢–æ–ª—å–∫–æ —é–∑–µ—Ä—ã, –∫–æ—Ç–æ—Ä—ã–µ –±—ã–ª–∏ –≤ train
test_df = test_df[test_df['user_id'].isin(train_df['user_id'])]
# –¢–æ–ª—å–∫–æ –∞–π—Ç–µ–º—ã, –∫–æ—Ç–æ—Ä—ã–µ –±—ã–ª–∏ –≤ train
test_df = test_df[test_df['item_id'].isin(train_df['item_id'])]

In [13]:

# –∫–∞—Ç–µ–≥–æ—Ä–∏–∞–ª—å–Ω–æ –∑–∞–∫–æ–¥–∏—Ä—É–µ–º user_id –∏ item_id –ø–æ train
train_df['uid'] = train_df['user_id'].astype('category').cat.codes
train_df['iid'] = train_df['item_id'].astype('category').cat.codes

# –°–¥–µ–ª–∞–µ–º –º–∞–ø–ø–∏–Ω–≥ user_id –∏ item_id –≤ uid –∏ iid

uid_to_user_id = dict(zip(train_df['uid'], train_df['user_id']))
iid_to_item_id = dict(zip(train_df['iid'], train_df['item_id']))

user_id_to_uid = dict(zip(train_df['user_id'], train_df['uid']))
item_id_to_iid = dict(zip(train_df['item_id'], train_df['iid']))

# –°–æ–∑–¥–∞–¥–∏–º –∫–æ–ª–æ–Ω–∫–∏ uid –∏ iid –≤ test_df –∏ –ø—Ä–∏–º–µ–Ω–∏–º –º–∞–ø–ø–∏–Ω–≥

test_df['uid'] = test_df['user_id'].map(user_id_to_uid)
test_df['iid'] = test_df['item_id'].map(item_id_to_iid)

print(f"Test: {test_df.shape}")
print(f"Train: {train_df.shape}")

Test: (83707, 7)
Train: (1375329, 7)


In [14]:
# –û—Å—Ç–∞–≤–∏–º —Ç–æ–ª—å–∫–æ —Ç–µ –∞–π—Ç–µ–º—ã –∫–æ—Ç–æ—Ä—ã–µ –µ—Å—Ç—å –≤ train

items_df = items_df[items_df['item_id'].isin(train_df['item_id'])]
items_df = items_df.set_index('item_id').loc[train_df['item_id'].unique()].reset_index()

# –£–±–µ–¥–∏–º—Å—è, —á—Ç–æ –∫–æ–ª–∏—á–µ—Å—Ç–≤–æ —É–Ω–∏–∫–∞–ª—å–Ω—ã—Ö –∞–π—Ç–æ–º–æ–≤ –≤ items_df —Å–æ–≤–ø–∞–¥–∞–µ—Ç —Å –∫–æ–ª–∏—á–µ—Å—Ç–≤–æ–º —É–Ω–∏–∫–∞–ª—å–Ω—ã—Ö –∞–π—Ç–µ–º–æ–≤ –≤ train_df
assert items_df.item_id.nunique() == train_df.item_id.nunique()

In [15]:
# –û—Å—Ç–∞–≤–∏–º —Ç–æ–ª—å–∫–æ —Ç–µ—Ö —é–∑–µ—Ä–æ–≤ –∫–æ—Ç–æ—Ä—ã–µ –µ—Å—Ç—å –≤ train

users_df = users_df[users_df['user_id'].isin(train_df['user_id'])]
users_df = users_df.set_index('user_id').loc[train_df['user_id'].unique()].reset_index()

# –£–±–µ–¥–∏–º—Å—è, —á—Ç–æ –∫–æ–ª–∏—á–µ—Å—Ç–≤–æ —É–Ω–∏–∫–∞–ª—å–Ω—ã—Ö —é–∑–µ—Ä–æ–≤ –≤ users_df —Å–æ–≤–ø–∞–¥–∞–µ—Ç —Å –∫–æ–ª–∏—á–µ—Å—Ç–≤–æ–º —É–Ω–∏–∫–∞–ª—å–Ω—ã—Ö —é–∑–µ—Ä–æ–≤ –≤ train_df
assert users_df.user_id.nunique() == train_df.user_id.nunique()

In [16]:
# –ü—Ä–æ–≤–µ—Ä–∏–º, —á—Ç–æ –º–∞–ø–ø–∏–Ω–≥ –±—ã–ª –∫–æ—Ä—Ä–µ–∫—Ç–Ω—ã–π
assert test_df[test_df['uid'] == 4375].user_id.values[0] == train_df[train_df['uid'] == 4375].user_id.values[0]

–°–¥–µ–ª–∞–µ–º –º–∞—Ç—Ä–∏—Ü—É –∏–Ω—Ç–µ—Ä–∞–∫—Ü–∏—è –¥–ª—è train –∏ test. –†–∞–∑–º–µ—Ä–Ω–æ—Å—Ç—å –æ–¥–∏–Ω–∞–∫–æ–≤–∞—è

In [17]:
import numpy as np

n_users = train_df['uid'].nunique()
n_items = train_df['iid'].nunique()

# ==========================

train_vec = np.zeros((n_users, n_items))
for uid, iid in zip(train_df['uid'], train_df['iid']):
    train_vec[uid, iid] += 1

# –Ω–æ—Ä–º–∞–ª–∏–∑–∞—Ü–∏—è
train_vec = train_vec / train_vec.sum(axis=1, keepdims=True)

# ==========================

# –¥–ª—è test
test_vec = np.zeros((n_users, n_items))
for uid, iid in zip(test_df['uid'], test_df['iid']):
    test_vec[uid, iid] += 1

test_vec = test_vec / test_vec.sum(axis=1, keepdims=True)

In [18]:
print(f"Train interaction matrix shape : {train_vec.shape}")
print(f"Test interaction matrix shape : {test_vec.shape}")

Train interaction matrix shape : (65792, 6862)
Test interaction matrix shape : (65792, 6862)


In [20]:
print("Train: –†–∞–∑–º–µ—Ä —É–Ω–∏–∫–∞–ª—å–Ω—ã—Ö —é–∑–µ—Ä–æ–≤", train_df.uid.nunique())
print("Test: –†–∞–∑–º–µ—Ä —É–Ω–∏–∫–∞–ª—å–Ω—ã—Ö —é–∑–µ—Ä–æ–≤", test_df.uid.nunique())


Train: –†–∞–∑–º–µ—Ä —É–Ω–∏–∫–∞–ª—å–Ω—ã—Ö —é–∑–µ—Ä–æ–≤ 65792
Test: –†–∞–∑–º–µ—Ä —É–Ω–∏–∫–∞–ª—å–Ω—ã—Ö —é–∑–µ—Ä–æ–≤ 25676


In [21]:
N_FACTORS = 128

ITEM_MODEL_SHAPE = (items_df.drop(["item_id"], axis=1).shape[1], )
USER_META_MODEL_SHAPE = (users_df.drop(["user_id"], axis=1).shape[1], )

USER_INTERACTION_MODEL_SHAPE = (train_vec.shape[1], )

print(f"N_FACTORS: {N_FACTORS}")
print(f"ITEM_MODEL_SHAPE: {ITEM_MODEL_SHAPE}")
print(f"USER_META_MODEL_SHAPE: {USER_META_MODEL_SHAPE}")
print(f"USER_INTERACTION_MODEL_SHAPE: {USER_INTERACTION_MODEL_SHAPE}")

N_FACTORS: 128
ITEM_MODEL_SHAPE: (7,)
USER_META_MODEL_SHAPE: (4,)
USER_INTERACTION_MODEL_SHAPE: (6862,)


# –ú–æ–¥–µ–ª—å

In [22]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ItemModel(nn.Module):
    def __init__(self, items_features_info, hidden_dim=128, output_dim=128, dropout_rate=0.2):
        """
        Args:
            items_features_info: —Å–ø–∏—Å–æ–∫ –∫–æ—Ä—Ç–µ–∂–µ–π (n_categories, embedding_dim) –¥–ª—è –∫–∞–∂–¥–æ–≥–æ –ø—Ä–∏–∑–Ω–∞–∫–∞.
            hidden_dim: —Ä–∞–∑–º–µ—Ä —Å–∫—Ä—ã—Ç–æ–≥–æ —Å–ª–æ—è.
            output_dim: —Ä–∞–∑–º–µ—Ä –∏—Ç–æ–≥–æ–≤–æ–≥–æ —ç–º–±–µ–¥–¥–∏–Ω–≥–∞ –∞–π—Ç–µ–º–∞.
        """
        super(ItemModel, self).__init__()

        # Embedding —Å–ª–æ–∏ –¥–ª—è –≤—Å–µ—Ö –∫–∞—Ç–µ–≥–æ—Ä–∏–∞–ª—å–Ω—ã—Ö –ø—Ä–∏–∑–Ω–∞–∫–æ–≤
        self.embeddings = nn.ModuleList([
            nn.Embedding(num_categories, emb_dim) for num_categories, emb_dim in items_features_info
        ])

        self.emb_total_dim = sum(emb_dim for _, emb_dim in items_features_info)

        # MLP –ø–æ—Å–ª–µ —ç–º–±–µ–¥–¥–∏–Ω–≥–æ–≤
        self.mlp = nn.Sequential(
            nn.Linear(self.emb_total_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, item_features):
        """
        item_features: [batch_size, num_features] (–∏–Ω–¥–µ–∫—Å—ã –∫–∞—Ç–µ–≥–æ—Ä–∏–∞–ª—å–Ω—ã—Ö –ø—Ä–∏–∑–Ω–∞–∫–æ–≤)
        """
        emb_list = []
        for i, emb_layer in enumerate(self.embeddings):
            emb_list.append(emb_layer(item_features[:, i]))

        x = torch.cat(emb_list, dim=-1)
        x = self.mlp(x)
        return x


In [24]:
class UserModel(nn.Module):
    def __init__(self,
                 categorical_feat_info,
                 interaction_input_dim,
                 hidden_dim=128,
                 output_dim=128,
                 dropout_rate=0.2):
        super(UserModel, self).__init__()

        # Embedding layers –¥–ª—è –≤—Å–µ—Ö –∫–∞—Ç–µ–≥–æ—Ä–∏–∞–ª—å–Ω—ã—Ö –ø—Ä–∏–∑–Ω–∞–∫–æ–≤
        self.embeddings = nn.ModuleList([
            nn.Embedding(num_categories, emb_dim) for num_categories, emb_dim in categorical_feat_info
        ])

        self.emb_total_dim = sum(emb_dim for _, emb_dim in categorical_feat_info)

        # MLP –¥–ª—è –ø—Ä–∏–∑–Ω–∞–∫–æ–≤ –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª—è (meta features)
        self.meta_mlp = nn.Sequential(
            nn.Linear(self.emb_total_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        )

        # MLP –¥–ª—è –≤–∑–∞–∏–º–æ–¥–µ–π—Å—Ç–≤–∏–π –ø–æ–ª—å–∑–æ–≤–∞—Ç–µ–ª—è
        self.interaction_mlp = nn.Sequential(
            nn.Linear(interaction_input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        )

        # –§–∏–Ω–∞–ª—å–Ω—ã–π —Å–ª–æ–π –ø–æ—Å–ª–µ —Å–∫–ª–µ–π–∫–∏
        self.fc_final = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, meta_features, interaction_features):
        """
        meta_features: [batch_size, num_categorical_features] (–∏–Ω–¥–µ–∫—Å—ã –ø—Ä–∏–∑–Ω–∞–∫–æ–≤)
        interaction_features: [batch_size, interaction_input_dim] (float32 –ø—Ä–∏–∑–Ω–∞–∫–∏)
        """

        # –û–±—Ä–∞–±–æ—Ç–∫–∞ –º–µ—Ç–∞-—Ñ–∏—á–µ–π —á–µ—Ä–µ–∑ —ç–º–±–µ–¥–¥–∏–Ω–≥–∏
        emb_list = []
        for i, emb_layer in enumerate(self.embeddings):
            emb_list.append(emb_layer(meta_features[:, i]))

        meta_embedded = torch.cat(emb_list, dim=-1)

        # –û–±—Ä–∞–±–æ—Ç–∫–∞ —á–µ—Ä–µ–∑ MLP
        meta_vec = self.meta_mlp(meta_embedded)

        # –û–±—Ä–∞–±–æ—Ç–∫–∞ –≤–∑–∞–∏–º–æ–¥–µ–π—Å—Ç–≤–∏–π
        interaction_vec = self.interaction_mlp(interaction_features)

        # –°–∫–ª–µ–π–∫–∞ –º–µ—Ç–∞ + –≤–∑–∞–∏–º–æ–¥–µ–π—Å—Ç–≤–∏–π
        combined = torch.cat([meta_vec, interaction_vec], dim=1)

        # –§–∏–Ω–∞–ª—å–Ω—ã–π —Å–ª–æ–π
        user_embedding = self.fc_final(combined)

        return user_embedding


# Loss

In [25]:
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, alpha=0.4):
    # –ù–æ—Ä–º–∞–ª–∏–∑—É–µ–º –≤–µ–∫—Ç–æ—Ä–∞
    anchor = F.normalize(anchor, p=2, dim=1)
    positive = F.normalize(positive, p=2, dim=1)
    negative = F.normalize(negative, p=2, dim=1)

    # –ö–æ—Å–∏–Ω—É—Å–Ω–æ–µ —Ä–∞—Å—Å—Ç–æ—è–Ω–∏–µ: 1 - cos_sim
    pos_sim = torch.sum(anchor * positive, dim=1)
    neg_sim = torch.sum(anchor * negative, dim=1)

    basic_loss = neg_sim - pos_sim + alpha
    loss = torch.clamp(basic_loss, min=0.0)
    return loss.mean()

# Dataset

In [26]:
# Define the dataset
class RecSysDataset(Dataset):
    def __init__(self, items, users, interactions, uids: list[int]):
        self.items = items
        self.users = users
        self.interactions = interactions
        self.uids = uids

    def __len__(self):
        return len(self.uids)

    def __getitem__(self, idx):
        uid = self.uids[idx]
        pos_i = np.random.choice(range(self.interactions.shape[1]), p=self.interactions[uid])
        neg_i = np.random.choice(range(self.interactions.shape[1]))
        uid_meta = self.users.iloc[uid].values
        uid_interaction = self.interactions[uid]
        pos = self.items.iloc[pos_i].values
        neg = self.items.iloc[neg_i].values

        return torch.tensor(uid_meta, dtype=torch.int), torch.tensor(uid_interaction, dtype=torch.float32), torch.tensor(pos, dtype=torch.int), torch.tensor(neg, dtype=torch.int), uid

# –ò–Ω–∏—Ü–∏–∞–ª–∏–∑–∞—Ü–∏—è –ø–µ—Ä–µ–¥ —Ç—Ä–µ–Ω–∏—Ä–æ–≤–∫–æ–π

In [27]:
i2v = ItemModel(items_features_info)
u2v = UserModel(users_features_info, USER_INTERACTION_MODEL_SHAPE[0])
optimizer = optim.Adam(list(i2v.parameters()) + list(u2v.parameters()), lr=0.001)

dataset = RecSysDataset(items=items_df.drop(["item_id"], axis=1), users=users_df.drop(["user_id"], axis=1), interactions=train_vec, uids=train_df.uid.unique())
test_dataset = RecSysDataset(items=items_df.drop("item_id", axis=1), users=users_df.drop("user_id", axis=1), interactions=test_vec, uids=sorted(test_df.uid.unique()))

dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Accuracy function

In [28]:
def map_at_k(reco_df: pd.DataFrame, ground_truth_df: pd.DataFrame, k: int = 10) -> float:
    """
    Calculate Mean Average Precision at K.
    
    Parameters:
        reco_df: pd.DataFrame with columns [user_id, item_id, rank] ‚Äî predicted top-K items.
        ground_truth_df: pd.DataFrame with columns [user_id, item_id] ‚Äî actual relevant items.
        k: cutoff for recommendations.
    
    Returns:
        float: MAP@K score
    """

    # –ü—Ä–∏–≤–µ–¥—ë–º ground truth –≤ —Å—Ç—Ä—É–∫—Ç—É—Ä—É {user_id: set(item_ids)}
    gt_dict = ground_truth_df.groupby("user_id")["item_id"].apply(set).to_dict()

    # –ü—Ä–∏–≤–µ–¥—ë–º —Ä–µ–∫–æ–º–µ–Ω–¥–∞—Ü–∏–∏ –≤ —Å—Ç—Ä—É–∫—Ç—É—Ä—É {user_id: [item_ids]}
    reco_df = reco_df[reco_df["rank"] <= k]
    reco_dict = reco_df.sort_values(["user_id", "rank"]).groupby("user_id")["item_id"].apply(list).to_dict()

    average_precisions = []

    for user_id, pred_items in reco_dict.items():
        if user_id not in gt_dict:
            continue  # —é–∑–µ—Ä–∞ –Ω–µ—Ç –≤ ground truth

        true_items = gt_dict[user_id]
        if not true_items:
            continue

        num_hits = 0
        score = 0.0

        for i, item in enumerate(pred_items):
            if item in true_items:
                num_hits += 1
                precision_at_i = num_hits / (i + 1)
                score += precision_at_i

        if num_hits > 0:
            ap = score / min(len(true_items), k)
            average_precisions.append(ap)

    return np.mean(average_precisions) if average_precisions else 0.0


In [29]:
import torch
import torch.nn.functional as F
import pandas as pd

def evaluate_map_k(model_user, model_item, test_dataloader, test_df, uid_to_user_id, iid_to_item_id, k=10):
    model_user.eval()
    model_item.eval()

    # –ü—Ä–µ–¥–≤–∞—Ä–∏—Ç–µ–ª—å–Ω–æ —Å—á–∏—Ç–∞–µ–º —ç–º–±–µ–¥–¥–∏–Ω–≥–∏ –≤—Å–µ—Ö –∞–π—Ç–µ–º–æ–≤
    item_feats = torch.tensor(items_df.drop("item_id", axis=1).values, dtype=torch.int)
    with torch.no_grad():
        item_emb = model_item(item_feats)
        item_emb = F.normalize(item_emb, dim=1)

    reco = []

    with torch.no_grad():
        for batch in test_dataloader:
            uid_meta, uid_interaction, _, _, batch_uids = batch

            user_emb = model_user(uid_meta, uid_interaction)
            user_emb = F.normalize(user_emb, dim=1)

            sim = torch.matmul(user_emb, item_emb.T)  # [batch_size, n_items]
            _, topk_indices = torch.topk(sim, k=k, dim=1)

            for i, item_idxs in enumerate(topk_indices):
                real_uid = batch_uids[i].item()  # uid –∏–∑ –¥–∞—Ç–∞—Å–µ—Ç–∞
                real_user_id = uid_to_user_id[real_uid]

                for rank, iid in enumerate(item_idxs):
                    reco.append({
                        "user_id": real_user_id,
                        "item_id": iid_to_item_id[iid.item()],
                        "rank": rank + 1
                    })

    reco_df = pd.DataFrame(reco)
    ground_truth = test_df[["user_id", "item_id"]].drop_duplicates()

    score = map_at_k(reco_df, ground_truth, k=k)
    return score


# Training

In [30]:
import logging

# Set up logging
logging.basicConfig(filename='training.log', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

losses = []
metrics = []

# Training loop
for epoch in range(2):
    for idx, batch in enumerate(dataloader):
        uid_meta, uid_interaction, pos, neg, _ = batch
        optimizer.zero_grad()
        anchor = u2v(uid_meta, uid_interaction)
        positive = i2v(pos)
        negative = i2v(neg)
        loss = cosine_triplet_loss(anchor, positive, negative)
        loss.backward()
        optimizer.step()

        losses.append(loss.item())
        print(f"Epoch {epoch + 1}, Batch {idx}, Loss: {loss.item()}")
        
        if idx % 100 == 0:
            mapk = map_score = evaluate_map_k(
                        model_user=u2v,
                        model_item=i2v,
                        test_dataloader=test_dataloader,
                        test_df=test_df,
                        uid_to_user_id=uid_to_user_id,
                        iid_to_item_id=iid_to_item_id,
                        k=10
                    )
            metrics.append(mapk)
            print(f"MAP@10: {mapk:.4f}")

    print(f"Epoch {epoch + 1}, Epoch Loss: {loss.item()}")

Epoch 1, Batch 0, Loss: 0.37432432174682617
MAP@10: 0.1187
Epoch 1, Batch 1, Loss: 0.38911646604537964
Epoch 1, Batch 2, Loss: 0.40407219529151917
Epoch 1, Batch 3, Loss: 0.4013963043689728
Epoch 1, Batch 4, Loss: 0.4049313962459564
Epoch 1, Batch 5, Loss: 0.4036625623703003
Epoch 1, Batch 6, Loss: 0.37378084659576416
Epoch 1, Batch 7, Loss: 0.39249545335769653
Epoch 1, Batch 8, Loss: 0.3652392625808716
Epoch 1, Batch 9, Loss: 0.335674911737442
Epoch 1, Batch 10, Loss: 0.37144309282302856
Epoch 1, Batch 11, Loss: 0.40704619884490967
Epoch 1, Batch 12, Loss: 0.36516064405441284
Epoch 1, Batch 13, Loss: 0.40018022060394287
Epoch 1, Batch 14, Loss: 0.4063614308834076
Epoch 1, Batch 15, Loss: 0.3496641516685486
Epoch 1, Batch 16, Loss: 0.30386215448379517
Epoch 1, Batch 17, Loss: 0.37570780515670776
Epoch 1, Batch 18, Loss: 0.38620081543922424
Epoch 1, Batch 19, Loss: 0.3442865312099457
Epoch 1, Batch 20, Loss: 0.40338319540023804
Epoch 1, Batch 21, Loss: 0.35248076915740967
Epoch 1, Batch

–ü–æ—Å—Ç—Ä–æ–∏–º –≥—Ä–∞—Ñ–∏–∫ –õ–æ—Å—Å–∞ –∏ –º–µ—Ç—Ä–∏–∫–∏

In [None]:
loss_steps = list(range(1, len(losses) + 1))
metric_steps = list(range(1, len(metrics) + 1))

def moving_average(data, window_size=50):
    return np.convolve(data, np.ones(window_size)/window_size, mode='valid')

losses_smooth = moving_average(losses, window_size=50)
smooth_steps = loss_steps[:len(losses_smooth)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.plot(loss_steps, losses, color='red', alpha=0.4, label='Raw Losses')
ax1.plot(smooth_steps, losses_smooth, color='darkred', linewidth=2, label='Smoothed Trend')
ax1.set_title('Losses')
ax1.set_xlabel('Step / Epoch')
ax1.set_ylabel('Loss')
ax1.legend()
ax1.grid(True)

# –ì—Ä–∞—Ñ–∏–∫ Metrics
ax2.plot(metric_steps, metrics, color='blue')
ax2.set_title('MAP@10')
ax2.set_xlabel('Steps (100 batches)')
ax2.set_ylabel('Metric')
ax2.grid(True)

plt.tight_layout()
plt.show()