# VK Recommendation Model

This notebook demonstrates a machine learning pipeline for training a recommendation model using the RecBole library. The model predicts group recommendations for VK social network users based on their past interactions.

## Steps Covered:
- GPU Setup Verification
- Library Imports and Dataset Loading
- Data Preparation for RecBole
- Model Training
- Predictions


In [1]:
!nvidia-smi -L

# Check if the notebook is using GPU with correct configurations.
# Display available GPUs (if any) for use by PyTorch/RecBole.

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-0e435c7a-545e-84f2-3563-fa616fa547f4)


## Step 1: Library Imports
We begin by importing all the necessary libraries and packages for:
- Data handling (Pandas)
- Memory management
- RecBole configuration, dataset handling, model, trainer, and utilities

We also check whether RecBole, PyTorch, Ray, and kmeans-pytorch are installed, and install any that are missing.

In [2]:
import pandas as pd
import os
import gc
import logging
from logging import getLogger
from typing import List, Tuple
from collections import defaultdict

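# Install necessary libraries if missing
lst = !pip list
# Skip blank lines and lowercase names so the membership checks below are robust
avail_libs = set(x.split()[0].lower() for x in lst if x.strip())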
if 'recbole' not in avail_libs:
    !pip install recbole
if 'torch' not in avail_libs:
    !pip install torch
if 'ray' not in avail_libs:
    !pip install ray
if 'kmeans-pytorch' not in avail_libs:
    !pip install kmeans-pytorch

from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.model.sequential_recommender import GRU4Rec
from recbole.trainer import Trainer
from recbole.utils import init_logger
from recbole.utils.case_study import full_sort_topk
import torch

# Set device based on GPU availability
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

Collecting recbole
  Downloading recbole-1.2.0-py3-none-any.whl.metadata (1.4 kB)
Collecting colorlog==4.7.2 (from recbole)
  Downloading colorlog-4.7.2-py2.py3-none-any.whl.metadata (9.9 kB)
Collecting colorama==0.4.4 (from recbole)
  Downloading colorama-0.4.4-py2.py3-none-any.whl.metadata (14 kB)
Collecting thop>=0.1.1.post2207130030 (from recbole)
  Downloading thop-0.1.1.post2209072238-py3-none-any.whl.metadata (2.7 kB)
Downloading recbole-1.2.0-py3-none-any.whl (2.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 29.4 MB/s eta 0:00:00
Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Downloading colorlog-4.7.2-py2.py3-none-any.whl (10 kB)
Downloading thop-0.1.1.post2209072238-py3-none-any.whl (15 kB)
Installing collected packages: colorlog, colorama, thop, recbole
  Attempting uninstall: colorlog
    Found existing installation: colorlog 6.8.2
    Uninstalling colorlog-6.8.2:
      Successfully uninst
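
The install check above parses the text printed by `pip list`. As an alternative, here is a minimal sketch that queries installed-package metadata through the standard library's `importlib.metadata`. The helper name `missing_packages` is hypothetical and is not part of the original notebook.

```python
from importlib import metadata


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises PackageNotFoundError if absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Distribution names taken from the install cell above.
required = ["recbole", "torch", "ray", "kmeans-pytorch"]
to_install = missing_packages(required)
print("Packages to install:", to_install)
```

Each name left in `to_install` could then be passed to `pip install`, as the original cell does.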

In [3]:
# Check GPU count
torch.cuda.device_count()

1

## Step 2: Data Loading and Preprocessing
We load the training dataset, map interaction status codes to numeric ratings, and keep the most recent record for each customer/community pair. The final dataframe includes `customer_id`, `community_id`, `status`, and `join_request_date` (used as the timestamp) for each user interaction.

In [4]:
df_train = pd.read_csv("/kaggle/input/vk-recsys-train/train_df.tsv", sep='\t')

# Map customer interaction types to numerical ratings
df_train['status'] = df_train['status'].map({'Y': 1, 'I': 1, 'B': 2, 'F': 2, 'P': 2, 'R': 3, 'A': 4})

# Fill missing join_request_date with zeroes
df_train['join_request_date'] = df_train['join_request_date'].fillna(0)

# Sort by join_request_date and drop duplicates for consistent records
df_train = df_train.sort_values('join_request_date').drop_duplicates(subset=['customer_id', 'community_id'], keep='last')

# Keep at most the 7 most recent interactions per customer (matches MAX_ITEM_LIST_LENGTH below)
df_train = df_train.groupby('customer_id').tail(7)
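
To make the effect of these steps concrete, the sketch below runs the same operations on a small made-up dataframe. The column names and the status mapping come from the cell above. The rows are invented, and the per-customer limit is lowered to 2 to keep the example small. Two things stand out: a status code missing from the mapping becomes `NaN`, and a missing date filled with 0 sorts as the oldest record.

```python
import pandas as pd

# Made-up interactions; "X" is deliberately absent from the status mapping.
toy = pd.DataFrame({
    "customer_id":       ["u1", "u1", "u1", "u2"],
    "community_id":      ["g1", "g1", "g2", "g1"],
    "status":            ["Y", "A", "R", "X"],
    "join_request_date": [10.0, 20.0, None, 5.0],
})

status_map = {'Y': 1, 'I': 1, 'B': 2, 'F': 2, 'P': 2, 'R': 3, 'A': 4}
toy["status"] = toy["status"].map(status_map)  # unmapped codes become NaN
toy["join_request_date"] = toy["join_request_date"].fillna(0)

# Keep the latest record for each (customer, community) pair.
toy = (
    toy.sort_values("join_request_date")
       .drop_duplicates(subset=["customer_id", "community_id"], keep="last")
)

# Keep the most recent interactions per customer (2 here instead of 7).
toy = toy.groupby("customer_id").tail(2)
print(toy)
```

For the pair (`u1`, `g1`), only the later record with rating 4 survives deduplication.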

### Creating RecBole-Compatible Data
To prepare the data for RecBole, we create the interaction (`.inter`) and item (`.item`) files that store user-item interactions and item metadata, respectively. These files are saved in a specific format for RecBole's dataset functions.

In [5]:
# Create INTER file for interactions
df_inter = df_train[['customer_id', 'community_id', 'status', 'join_request_date']]
df_inter.columns = ['user_id:token', 'item_id:token', 'rating:float', 'timestamp:float']

# Save as a tab-separated file under <data_path>/<dataset>/<dataset>.inter
new_folder = '/kaggle/working/vk_data/full_train'
os.makedirs(new_folder, exist_ok=True)

new_csv = os.path.join(new_folder, 'full_train.inter')
if not os.path.exists(new_csv):
    df_inter.to_csv(new_csv, sep='\t', index=False)

# Create ITEM file for item details.
# df_train has one row per interaction, so keep a single row per community.
df_item = df_train[['community_id', 'description', 'customers_count', 'messages_count', 'type', 'region_id', 'themeid', 'business_category', 'business_parent']].drop_duplicates(subset='community_id')
df_item.columns = ['item_id:token', 'description:token_seq', 'customers_count:float', 'messages_count:float', 'type:token', 'region_id:token', 'themeid:token', 'business_category:token', 'business_parent:token']

# Save ITEM data to CSV
item_csv = '/kaggle/working/vk_data/full_train/full_train.item'
if not os.path.exists(item_csv):
    df_item.to_csv(item_csv, sep='\t', index=False)

del df_item, df_inter, df_train
gc.collect()

17
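
The files written above are tab-separated text whose header cells have the form `name:type`. The names listed under `load_col` in the configuration are the part before the colon. The sketch below writes a tiny interaction file to an in-memory buffer and reads it back, using only the standard library. The rows are made up and the header matches the `.inter` columns above.

```python
import csv
import io

# Made-up rows; the header matches the columns written by the cell above.
header = ["user_id:token", "item_id:token", "rating:float", "timestamp:float"]
rows = [["u1", "g1", 4, 20.0], ["u1", "g2", 3, 0.0]]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)

# Read the file back and split each header cell into (field name, field type).
buffer.seek(0)
reader = csv.reader(buffer, delimiter="\t")
fields = [tuple(cell.split(":")) for cell in next(reader)]
records = list(reader)

print(fields)
print(records)
```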

## Step 3: Model Training
Using RecBole's `GRU4Rec` model, we configure the hyperparameters and start training. They include the embedding size, GRU hidden size, number of layers, batch size, and dropout probability. We also specify the evaluation metrics and the early-stopping criterion.

In [6]:
config_dict = {
    "USER_ID_FIELD": "user_id",
    "ITEM_ID_FIELD": "item_id",
    "TIME_FIELD": "timestamp",
    'load_col': {'inter': ["user_id", "item_id", "rating", "timestamp"], 'item': ['item_id', 'description', 'customers_count', 'messages_count', 'type', 'region_id', 'themeid', 'business_category', 'business_parent']},
    "ITEM_LIST_LENGTH_FIELD": "item_length",
    "LIST_SUFFIX": "_list",
    "MAX_ITEM_LIST_LENGTH": 7,  # maximum sequence length
    "embedding_size": 256,  # item embedding size
    "hidden_size": 512,  # GRU hidden state size
    "num_layers": 2,  # number of GRU layers
    "dropout_prob": 0.3,  # dropout probability
    "loss_type": "CE",  # cross-entropy loss
    "epochs": 4,
    "train_batch_size": 2048,
    "eval_batch_size": 2048,
    "train_neg_sample_args": None,  # negative sampling disabled
    # Validation params
    "eval_args": {
        "group_by": "user",  # group interactions by user
        "order": "TO",  # order interactions by timestamp
        "split": {"LS": "valid_only"},  # leave-one-out, validation set only
        "mode": "full",  # rank against all items
    },
    "metrics": ["Recall", "MRR", "NDCG", "Hit", "Precision", "MAP"],
    "topk": 14,
    "valid_metric": "MAP@14", # Validation metric
    # Competition
    "data_path": "/kaggle/working/vk_data/",
    "stopping_step": 3, # Early Stopping
    "device": DEVICE,
}
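
The `eval_args` above request a time-ordered, per-user leave-one-out split. Below is a minimal pure-Python sketch of that idea, not RecBole's implementation. The interactions are made up, and keeping single-interaction users entirely in the training part is an assumption of this sketch.

```python
from collections import defaultdict

# Made-up (user, item, timestamp) interactions, deliberately out of order.
interactions = [
    ("u1", "g2", 20.0), ("u1", "g1", 10.0), ("u1", "g3", 30.0),
    ("u2", "g1", 5.0),
]

# "group_by": "user"
by_user = defaultdict(list)
for user, item, ts in interactions:
    by_user[user].append((ts, item))

train, valid = {}, {}
for user, events in by_user.items():
    items = [item for _, item in sorted(events)]  # "order": "TO"
    if len(items) < 2:
        train[user] = items  # assumption: nothing to hold out
        continue
    # Leave-one-out: the most recent item is held out for validation.
    train[user], valid[user] = items[:-1], items[-1]

print(train)
print(valid)
```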

In [7]:
config = Config(model='GRU4Rec', dataset='full_train', config_dict=config_dict)
logger = getLogger()
init_logger(config)

dataset = create_dataset(config)
train_data, valid_data, _ = data_preparation(config, dataset)
model = GRU4Rec(config, train_data.dataset).to(config['device'])
trainer = Trainer(config, model)

best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[field].fillna(value="", inplace=True)
  split_point = np.cumsum(feat[field].agg(len))[:-1]
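
`trainer.fit` returns the best validation score, which under this configuration is measured by `MAP@14`. The sketch below shows one common definition of average precision at k for a single user. It is an illustration only: libraries differ in how they normalise the sum, so RecBole's value may be computed differently. With leave-one-out validation each user has exactly one held-out item, so AP@k reduces to the reciprocal rank of that item, or 0 if it falls outside the top k.

```python
def average_precision_at_k(ranked_items, relevant_items, k):
    """AP@k normalised by min(k, number of relevant items)."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / min(k, len(relevant))


# Made-up ranking for one user whose held-out group is "g4".
ranking = ["g9", "g4", "g7"]
ap_hit = average_precision_at_k(ranking, ["g4"], k=14)
ap_miss = average_precision_at_k(ranking, ["g4"], k=1)
print(ap_hit, ap_miss)
```

MAP@k is the mean of these per-user values.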

## Step 4: Making Predictions
Finally, we generate the top 14 group recommendations for each test user that appears in the training data and save the results to a CSV file.

In [8]:
test_customers = pd.read_csv("/kaggle/input/vk-recsys-test/test_customer_ids.csv")

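ext_users = []          # external customer IDs found in the training data
test_user_indices = []  # matching internal RecBole user indices

for ext_user_id in test_customers['customer_id'].values:
    try:
        # Tokens are stored as strings, so cast before the lookup
        user_index = dataset.token2id(dataset.uid_field, str(ext_user_id))
        ext_users.append(ext_user_id)
        test_user_indices.append(user_index)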
    except ValueError:
        continue
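
The loop above silently drops test customers that are absent from the training data. The sketch below reproduces the same pattern on a hypothetical vocabulary and also records the skipped customers, so they could be counted or given a fallback later. The names `token_to_index`, `lookup`, and `skipped` are illustrative and are not part of RecBole or the original notebook.

```python
# Hypothetical vocabulary: external token -> internal index.
token_to_index = {"[PAD]": 0, "u1": 1, "u2": 2}


def lookup(token):
    """Return the internal index, raising ValueError for unseen tokens."""
    try:
        return token_to_index[token]
    except KeyError:
        raise ValueError(f"token {token!r} does not exist") from None


test_ids = ["u2", "u7", "u1"]
known_users, known_indices, skipped = [], [], []
for token in test_ids:
    try:
        index = lookup(token)
    except ValueError:
        skipped.append(token)  # unseen customer: no history to score
        continue
    # Append to both lists only after a successful lookup, so they stay aligned.
    known_users.append(token)
    known_indices.append(index)

print(known_users, known_indices, skipped)
```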

In [9]:
batch_size = 5000  # number of users scored per call
batch_frames = []

for i in range(0, len(test_user_indices), batch_size):
    batch_indices = test_user_indices[i:i + batch_size]
    batch_ext_users = ext_users[i:i + batch_size]

    # Top-14 internal item indices for each user in the batch
    topk_iid_list_batch = full_sort_topk(batch_indices, model, valid_data, k=14, device=DEVICE)
    last_topk_iid_list = topk_iid_list_batch.indices
    # Convert internal item indices back to the original community IDs
    external_item_list = dataset.id2token(dataset.iid_field, last_topk_iid_list.cpu()).tolist()

    batch_frames.append(pd.DataFrame({'User': batch_ext_users, 'Groups': external_item_list}))

# Concatenate once instead of growing a DataFrame inside the loop
if batch_frames:
    result_df = pd.concat(batch_frames, ignore_index=True)
else:
    result_df = pd.DataFrame(columns=['User', 'Groups'])
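
The batching above depends on the external IDs and internal indices being sliced with the same offsets. The sketch below isolates that pattern in a small generator. The helper name `aligned_batches` and the sample values are hypothetical.

```python
def aligned_batches(ids, indices, batch_size):
    """Yield (ids_chunk, indices_chunk) pairs sliced with the same offsets."""
    if len(ids) != len(indices):
        raise ValueError("ids and indices must have the same length")
    for start in range(0, len(indices), batch_size):
        stop = start + batch_size
        yield ids[start:stop], indices[start:stop]


# Made-up users and internal indices.
users = ["u1", "u2", "u3", "u4", "u5"]
indices = [11, 12, 13, 14, 15]
batches = list(aligned_batches(users, indices, batch_size=2))
print(batches)
```

The last batch may be shorter than `batch_size`, which the slicing handles without special cases.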

In [10]:
csv_file_path = 'result.csv'
result_df.to_csv(csv_file_path, index=False)
print(f'Results saved to {csv_file_path}')

Results saved to result.csv
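
Because each `Groups` cell holds a Python list, `to_csv` writes it as the list's string representation. The sketch below reads such a file back into lists with the standard library. The file content is made up, and it assumes the cells are plain lists of strings.

```python
import ast
import csv
import io

# Made-up content in the shape described above.
csv_text = """User,Groups
u1,"['g4', 'g9']"
u2,"['g1', 'g4']"
"""

reader = csv.DictReader(io.StringIO(csv_text))
# literal_eval parses the list literal without executing arbitrary code.
recommendations = {row["User"]: ast.literal_eval(row["Groups"]) for row in reader}
print(recommendations)
```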
