### Task


**Introduction**
https://www.kaggle.com/competitions/hse-recommender-systems-course-challenge-2023/overview

Greetings and welcome to the Recommender System challenge that provides a quick dive into the realm of predictive models using movie ratings data.

Your mission: craft a recommender system capable of accurately predicting which movies users are most likely to watch next, leveraging their historical viewing patterns.

**Guidelines**

To ensure fair evaluation, the test data has specific characteristics:

* Time Split: Users have rated movies both before and after a global split time point.
* Data Integrity: For each test user, the holdout contains only the first item rated after the split time point; any movies rated later are discarded.
* Training and Evaluation Protocol: The evaluation is designed to test against strong generalization, i.e., all test users are warm-start. Strictly utilize only the provided training dataset for model training. Reserve the test data solely for generating recommendations.

**Evaluation Metric**

Your models will be assessed on NDCG@20, a measure of ranking quality. Refer to the Evaluation tab for details on this metric.
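Because the holdout holds a single item per user, the ideal DCG is 1 and NDCG@20 reduces to `1 / log2(rank + 1)` when the item appears at position `rank` within the top 20, and to 0 otherwise. The sketch below assumes binary relevance; `ndcg_single_item` is a hypothetical helper, and the competition's Evaluation tab remains the authoritative definition.

```python
import numpy as np

def ndcg_single_item(recommended, true_item, topn=20):
    """NDCG@topn for one user with exactly one relevant item (binary relevance assumed)."""
    top = list(recommended[:topn])
    if true_item not in top:
        return 0.0
    rank = top.index(true_item) + 1   # 1-based position of the hit
    return 1.0 / np.log2(rank + 1)    # IDCG is 1 when there is a single relevant item

first = ndcg_single_item([10, 11, 12], true_item=10)   # hit at rank 1
third = ndcg_single_item([10, 11, 12], true_item=12)   # hit at rank 3
miss = ndcg_single_item([10, 11, 12], true_item=99)    # no hit
```

A hit at rank 1 scores 1.0, a hit at rank 3 scores 0.5, and a miss scores 0.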

**Submission Guidelines**

Prepare your submission in CSV format with two columns: userid and movieid:

* userid: Represents all users within the set.
* movieid: Indicates the recommended movies for each user.
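A minimal sketch of the file layout, assuming a long format with one row per (user, recommended movie) pair in ranked order, which is what the `save_submission` helper later in this notebook writes. The ids in `recs` are made up for illustration; a real submission would list 20 movies per user.

```python
import pandas as pd

# Hypothetical recommendations: original user id -> ranked list of original movie ids.
recs = {55: [1631, 493, 945], 133: [2072, 2073, 285]}

# One row per (user, movie) pair, keeping the ranked order within each user.
rows = [(user, movie) for user, movies in recs.items() for movie in movies]
submission = pd.DataFrame(rows, columns=['userid', 'movieid'])
csv_text = submission.to_csv(index=False)
```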

### Set up kaggle and download data

#### Mount GoogleDrive

In [1]:
import os

def find_path(name, path='/content'):
    '''
    Search for a file or folder by name; return [full_path, parent_dir] pairs.
    '''
    result = []
    for root, dirs, files in os.walk(path):
        if name in files+dirs:
          result.append([os.path.join(root, name), root])
    return result

# mount Google Drive only if it is not mounted yet
if len(find_path('MyDrive')) == 0:
  from google.colab import drive
  drive.mount('/content/drive')

Mounted at /content/drive


#### Kaggle

In [2]:
!pip -q install kaggle

In [3]:
competition_name = 'hse-recommender-systems-course-challenge-2023'

In [4]:
import os
import json

kaggle_cred_file = find_path('kaggle.json')[0][0]
# Opening JSON file
with open(kaggle_cred_file) as json_file:
    data = json.load(json_file)

os.environ['KAGGLE_USERNAME'] = data['username']
os.environ['KAGGLE_KEY'] = data['key']

In [5]:
import kaggle
kaggle.api.authenticate()

In [6]:
!kaggle competitions download -c {competition_name}
!unzip {competition_name}.zip

Downloading hse-recommender-systems-course-challenge-2023.zip to /content
 92% 81.0M/87.6M [00:02<00:00, 46.9MB/s]
100% 87.6M/87.6M [00:02<00:00, 42.4MB/s]
Archive:  hse-recommender-systems-course-challenge-2023.zip
  inflating: sample_submission.csv   
  inflating: testset                 
  inflating: training                


In [7]:
def save_submission(recomendations, configuration_test, item_index_origin, name='kaggle_submission.csv'):
    # Map internal (contiguous) user and item indices back to the original ids.
    result = []
    for new_userid, user_recs in enumerate(recomendations):
        # use the config passed as an argument rather than the global conf_data_test
        old_userid = configuration_test['user_index'][new_userid]
        for new_itemid in user_recs:
            old_itemid = item_index_origin[new_itemid]
            result.append([old_userid, old_itemid])
    pd.DataFrame(result, columns=['userid', 'movieid']).to_csv(name, index=False)
    print('Save file: ', name)

In [8]:
def send_submission(submission_title, kaggle_submission_file_name='kaggle_submission.csv', competition_name=competition_name):
  print('Submission_title: ', submission_title)
  print('Send file to kaggle: ', kaggle_submission_file_name)
  # quote the arguments so titles or file names with spaces are passed as single values
  !kaggle competitions submit -c {competition_name} -f "{kaggle_submission_file_name}" -m "{submission_title}"

#### Useful .py files

In [9]:
# polara and gitfiles
!pip -q install --upgrade git+https://github.com/evfro/polara.git@develop#egg=polara
! wget -q https://raw.githubusercontent.com/Personalization-Technologies-Lab/RecSys-Course-HSE-Fall23/main/Seminar5/dataprep.py -O dataprep.py
! wget -q https://raw.githubusercontent.com/Personalization-Technologies-Lab/RecSys-Course-HSE-Fall23/main/Seminar5/evaluation.py -O evaluation.py
!pip -q install lightfm
!cp "{find_path('lfm.py')[0][0]}" "{os.getcwd()}"

  Preparing metadata (setup.py) ... done
  Building wheel for polara (setup.py) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.4/316.4 kB 3.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  Building wheel for lightfm (setup.py) ... done


### Imports

In [10]:
# lightfm hw3
import numpy as np
import pandas as pd

from scipy.sparse import csr_matrix
from scipy.sparse import hstack as sp_hstack
from scipy.sparse import diags as spdiags
from scipy.sparse import eye as speye

from tqdm import tqdm
from polara.preprocessing.dataframes import matrix_from_observations

from lfm import build_lfm_model, encode_virus_features
from evaluation import topn_recommendations

# pureSVD seminar
from scipy.sparse.linalg import svds, LinearOperator
import seaborn as sns
sns.set_theme(style='white', context='paper')
%config InlineBackend.figure_format = "svg"

from polara import get_movielens_data
from dataprep import transform_indices, leave_last_out, verify_time_split, reindex_data, generate_interactions_matrix
from evaluation import topn_recommendations, model_evaluate, downvote_seen_items

### Preprocessing data

#### info

In [11]:
data_ = pd.read_csv('training')
data_.head()

Unnamed: 0,userid,movieid,rating,timestamp
0,158862,1600,4.0,978307288
1,49693,9102,4.0,985466347
2,65237,2408,3.0,978308050
3,82765,2590,2.0,985139361
4,120946,285,2.0,978308465


In [12]:
data_test_ = pd.read_csv('testset')
data_test_.head()

Unnamed: 0,userid,movieid,rating,timestamp
0,55,1631,3.5,1274778835
1,55,493,1.0,1274781574
2,55,945,4.0,1301556480
3,55,2072,1.5,1274781305
4,55,2073,4.5,1274781645


In [13]:
def take_data_information(data=data_, print_data_info=False):
    if print_data_info: print(data.info())
    print('Size of table: ', data.shape)
    print('Amount of users: ', data.userid.unique().shape)
    print('Amount of films: ', data.movieid.unique().shape)

In [14]:
take_data_information(data_)

Size of table:  (11782768, 4)
Amount of users:  (127282,)
Amount of films:  (18264,)


In [15]:
take_data_information(data_test_)

Size of table:  (1368134, 4)
Amount of users:  (2963,)
Amount of films:  (17102,)


In [16]:
print('Amount of new users in testset: ', len(set(data_test_.userid.unique()) - set(data_.userid.unique())))
print('Amount of new movies in testset: ', len(set(data_test_.movieid.unique()) - set(data_.movieid.unique())))

Amount of new users in testset:  2963
Amount of new movies in testset:  0


In [17]:
# reindex movies from 0 to  n
#movieid_idx, movieid_index_origin = pd.factorize(data_['movieid'], sort=True)

#data = data_.copy()
#data['movieid'] = movieid_idx
#data_test = data_test_.copy()
#data_test['movieid'] = data_test_.movieid.apply(lambda mi: np.where(movieid_index_origin==mi)[0][0])

In [18]:
concat_movie_idx, movie_index_origin = pd.factorize(pd.concat((data_, data_test_))['movieid'], sort=True)

data = data_.copy()
data['movieid'] = concat_movie_idx[:data_.shape[0]]
data_test = data_test_.copy()
data_test['movieid'] = concat_movie_idx[data_.shape[0]:]
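The joint reindexing above can be illustrated on toy data: factorizing the concatenated train and test movie ids gives both parts one shared contiguous index, and the returned index maps codes back to the original ids. The series below are made up for illustration.

```python
import numpy as np
import pandas as pd

train_items = pd.Series([1600, 9102, 285])
test_items = pd.Series([285, 1631])

# Factorize both parts together so that equal movie ids get equal codes.
codes, index_origin = pd.factorize(pd.concat((train_items, test_items)), sort=True)
train_codes = codes[:len(train_items)]
test_codes = codes[len(train_items):]

# Mapping back to the original ids is a plain lookup in the returned index.
restored = index_origin[test_codes]
```

Here the sorted unique ids are `[285, 1600, 1631, 9102]`, so movie 285 receives code 0 in both parts.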

#### Make train, validation, holdout datasets

In [19]:
def create_base_config(data, origin_data=data_):
  config = {
      'users': 'userid',
      'items': 'movieid',
      'feedback': 'rating',
      'n_users': len(set(data.userid.unique())),
      'n_items': len(set(origin_data.movieid.unique())),
      'users_arr': np.sort(data.userid.unique()),
      'items_arr': np.sort(data.movieid.unique()),
      'average_rating': origin_data['rating'].mean()
  }
  return config

In [20]:
conf_data = create_base_config(data)
conf_data_test = create_base_config(data_test)
conf_data_test['n_items'] = conf_data['n_items']
conf_data

{'users': 'userid',
 'items': 'movieid',
 'feedback': 'rating',
 'n_users': 127282,
 'n_items': 18264,
 'users_arr': array([     0,      1,      3, ..., 283225, 283226, 283227]),
 'items_arr': array([    0,     1,     2, ..., 18261, 18262, 18263]),
 'average_rating': 3.502377624680381}

Divide the data into train and validation sets

In [21]:
#val_fraction = conf_data_test['n_users'] / conf_data['n_users'] #0.023
val_fraction = 0.04  # share of training users held out for validation
np.random.seed(16)
val_users_ = np.random.choice(data.userid.unique(), int(val_fraction*conf_data['n_users']), replace=False)

val_dataset_ = data[data.userid.isin(val_users_)]
train_dataset_ = data[~data.userid.isin(val_users_)]

conf_train = create_base_config(train_dataset_)
conf_val = create_base_config(val_dataset_)
#conf_train['n_items'] = conf_data['n_items']
#conf_val['n_items'] = conf_data['n_items']

In [22]:
conf_train

{'users': 'userid',
 'items': 'movieid',
 'feedback': 'rating',
 'n_users': 122191,
 'n_items': 18264,
 'users_arr': array([     0,      1,      3, ..., 283222, 283225, 283226]),
 'items_arr': array([    0,     1,     2, ..., 18261, 18262, 18263]),
 'average_rating': 3.502377624680381}

In [23]:
conf_val

{'users': 'userid',
 'items': 'movieid',
 'feedback': 'rating',
 'n_users': 5091,
 'n_items': 18264,
 'users_arr': array([   153,    182,    192, ..., 283199, 283224, 283227]),
 'items_arr': array([    0,     1,     2, ..., 18175, 18176, 18255]),
 'average_rating': 3.502377624680381}

Split the validation users' data into observed history and a last-item holdout

In [24]:
test_dataset_, holdout_ = leave_last_out(val_dataset_, 'userid', 'timestamp')

holdout_ = holdout_[holdout_.userid.isin(test_dataset_.userid.unique())]

# check correct time splitting
verify_time_split(test_dataset_, holdout_)
assert holdout_.userid.unique().shape == test_dataset_.userid.unique().shape
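A minimal sketch of a leave-last-out split on toy data. It is an assumption about what the `leave_last_out` helper from `dataprep.py` does, and the course implementation may differ in details such as tie handling.

```python
import pandas as pd

ratings = pd.DataFrame({
    'userid':    [1, 1, 1, 2, 2, 3],
    'movieid':   [10, 11, 12, 10, 13, 14],
    'timestamp': [5, 9, 7, 3, 4, 8],
})

# Sort by time and take each user's last interaction as the holdout.
ordered = ratings.sort_values(['userid', 'timestamp'])
holdout = ordered.groupby('userid').tail(1)
observed = ordered.drop(holdout.index)

# Users with a single rating have no history left, so they are dropped from
# the holdout (the same effect as the `isin` filter in the cell above).
holdout = holdout[holdout['userid'].isin(observed['userid'].unique())]
```

User 3 has only one rating, so it ends up in neither part; users 1 and 2 keep their earlier ratings as history.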

In [25]:
conf_test = create_base_config(test_dataset_)
#conf_test['n_items'] = conf_data['n_items']
conf_test

{'users': 'userid',
 'items': 'movieid',
 'feedback': 'rating',
 'n_users': 4897,
 'n_items': 18264,
 'users_arr': array([   153,    182,    195, ..., 283199, 283224, 283227]),
 'items_arr': array([    0,     1,     2, ..., 18175, 18176, 18255]),
 'average_rating': 3.502377624680381}

In [26]:
conf_hold = create_base_config(holdout_)
#conf_hold['n_items'] = conf_data['n_items']
conf_hold

{'users': 'userid',
 'items': 'movieid',
 'feedback': 'rating',
 'n_users': 4897,
 'n_items': 18264,
 'users_arr': array([   153,    182,    195, ..., 283199, 283224, 283227]),
 'items_arr': array([    0,     1,     2, ..., 17674, 17726, 18061]),
 'average_rating': 3.502377624680381}

#### reindex users

In [27]:
#train_userid_idx, train_userid_index_origin = pd.factorize(train_dataset_['userid'], sort=True)

#train_dataset = train_dataset_.copy()
#train_dataset['userid'] = train_userid_idx

## Models

### PureSVD

#### model

Let's start with a simple PureSVD model. It is cheap to compute, and new warm-start users can be scored with the folding-in technique, without retraining.
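As a rough illustration of folding-in, the sketch below scores a user from their rating vector `p` alone as `p @ V @ V.T`, where `V` holds the item factors. The toy matrix and the use of `np.linalg.svd` are assumptions made for brevity; the notebook itself relies on `scipy.sparse.linalg.svds`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 6, size=(6, 5)).astype(float)  # toy training matrix (users x items)

# Rank-k truncated SVD of the training matrix.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
V = Vt[:k].T                      # item factors, shape (n_items, k)

# Folding-in: a user unseen at training time is scored from their ratings alone.
p = np.array([5.0, 0.0, 3.0, 0.0, 0.0])
scores = p @ V @ V.T

# For training users the same formula reproduces the rank-k reconstruction.
recon = (U[:, :k] * s[:k]) @ Vt[:k]
```

Only the item factors are needed at scoring time, which is why the model-building function in this section returns `V` and the singular values but no user factors.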

In [31]:
from scipy.sparse import diags
from scipy.sparse.linalg import norm as spnorm

def sparse_normalize(matrix, scaling, axis):
    '''Function to scale either rows or columns of the sparse rating matrix'''
    if scaling == 1: # no scaling (standard SVD case)
        return matrix

    norm = spnorm(matrix, axis=axis, ord=2) # compute Euclidean norm of rows or columns
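    # out= keeps entries for zero-norm rows/columns at 0 instead of leaving them uninitialized
    scaling_matrix = diags(np.power(norm, scaling-1, where=norm!=0, out=np.zeros_like(norm)))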

    if axis == 0: # scale columns
        return matrix.dot(scaling_matrix)
    if axis == 1: # scale rows
        return scaling_matrix.dot(matrix)
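A small worked example of the row scaling, under the assumption that each row is multiplied by its Euclidean norm raised to the power `scaling - 1`. `scale_rows` is a hypothetical, rows-only variant written for this illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import norm as spnorm

def scale_rows(matrix, scaling):
    """Multiply every row by (row norm) ** (scaling - 1); all-zero rows stay zero."""
    norms = spnorm(matrix, axis=1, ord=2)
    factors = np.zeros_like(norms)
    np.power(norms, scaling - 1, out=factors, where=norms != 0)
    return diags(factors).dot(matrix)

ratings = csr_matrix(np.array([[3.0, 4.0, 0.0],
                               [0.0, 0.0, 0.0],
                               [0.0, 5.0, 12.0]]))
unit_rows = scale_rows(ratings, scaling=0)       # exponent -1: rows get unit length
damped_rows = scale_rows(ratings, scaling=0.5)   # exponent -0.5: partial damping
row_norms = spnorm(unit_rows, axis=1)
```

With `scaling=1` the exponent is 0 and the matrix is unchanged; smaller values progressively reduce the weight of users (or items, for `axis=0`) with many or large ratings.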

In [32]:
def generate_interactions_matrix(data, data_description, rebase_users=False, verbose=0):
    n_users = data_description['n_users']
    n_items = data_description['n_items']
    # get indices of observed data
    user_idx = data[data_description['users']].values
    if rebase_users: # handle non-contiguous index of test users
        # This ensures that all user ids are contiguous and start from 0,
        # which helps ensure data consistency at the scoring stage.
        user_idx, user_index = pd.factorize(user_idx, sort=True)
        n_users = len(user_index)
        data_description['user_index'] = user_index
    item_idx = data[data_description['items']].values
    feedback = data[data_description['feedback']].values
    # construct rating matrix
    if verbose:
        print(f'n_users={n_users}, n_items={n_items}')
        print(f'user_idx.shape={user_idx.shape}, item_idx.shape={item_idx.shape}')
        print(f'max userid ', max(user_idx))
        print(f'max itemid ', max(item_idx))
    return csr_matrix((feedback, (user_idx, item_idx)), shape=(n_users, n_items))

In [33]:
def build_svd_model(config, data, data_description, normalize=False, scaling=1, scaling_axis=1):
    # input: data with movie ids already reindexed to 0..n_items-1
    # build the interaction matrix; train users are reindexed to contiguous ids on the fly
    source_matrix = generate_interactions_matrix(data, data_description, rebase_users=True)
    if normalize: source_matrix = sparse_normalize(source_matrix, scaling=scaling, axis=scaling_axis)
    _, s, vt = svds(
        source_matrix.astype('f8'),
        k=config['rank'],
        return_singular_vectors='vh'
    )
    sidx = np.argsort(-s)
    singular_values = s[sidx]
    item_factors = np.ascontiguousarray(vt[sidx, :].T)
    return item_factors, singular_values

#### Example

##### example train

In [None]:
%%time
svd_config = {'rank': 200}
V, sigma = svd_params = build_svd_model(svd_config, train_dataset_, conf_train)

# verify orthogonality
#np.testing.assert_almost_equal(
#    V.T @ V,
#    np.eye(svd_config['rank']), decimal=14
#)

CPU times: user 1min 14s, sys: 31 s, total: 1min 45s
Wall time: 1min 23s


In [34]:
def svd_model_scoring(params, data, data_description): #, normalize=False, scaling=1, scaling_axis=1):
    item_factors, sigma = params
    test_matrix = generate_interactions_matrix(data, data_description, rebase_users=True, verbose=False)
    #if normalize: test_matrix = sparse_normalize(test_matrix, scaling=scaling, axis=scaling_axis); print('norm')
    scores = test_matrix.dot(item_factors) @ item_factors.T
    return scores

In [None]:
%%time
scores = svd_model_scoring(svd_params, test_dataset_, conf_test)

CPU times: user 2.26 s, sys: 509 ms, total: 2.77 s
Wall time: 1.7 s


In [35]:
def downvote_seen_items(scores, data, data_description):
    assert isinstance(scores, np.ndarray), 'Scores must be a dense numpy array!'
    itemid = data_description['items']
    userid = data_description['users']
    # get indices of observed data, corresponding to scores array
    # we need to provide correct mapping of rows in scores array into
    # the corresponding user index (which is assumed to be sorted)
    row_idx, test_users = pd.factorize(data[userid], sort=True)
    assert len(test_users) == scores.shape[0]
    assert (test_users == data_description['user_index']).all()
    col_idx = data[itemid].values
    # downvote scores at the corresponding positions
    scores[row_idx, col_idx] = scores.min() - 1

In [None]:
downvote_seen_items(scores, test_dataset_, conf_test)
svd_recs = topn_recommendations(scores, topn=20)
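To make the two post-processing steps concrete, here is a toy sketch: scores of already seen items are pushed below the minimum, then the indices of the top-n scores are extracted per user. `topn_sketch` is a hypothetical stand-in; the `topn_recommendations` helper from `evaluation.py` may be implemented differently.

```python
import numpy as np

scores = np.array([[0.9, 0.1, 0.5, 0.7],
                   [0.2, 0.8, 0.6, 0.4]])
seen_rows = np.array([0, 1])   # user positions of already rated (user, item) pairs
seen_cols = np.array([0, 1])   # matching item positions

# Push seen items below every other score so they cannot enter the top-n.
scores[seen_rows, seen_cols] = scores.min() - 1

def topn_sketch(scores, topn):
    """Indices of the topn highest scores per row, best first."""
    part = np.argpartition(-scores, topn - 1, axis=1)[:, :topn]   # unordered top-n
    order = np.argsort(-np.take_along_axis(scores, part, axis=1), axis=1)
    return np.take_along_axis(part, order, axis=1)

recs = topn_sketch(scores, topn=2)
```

User 0 had item 0 as the best score, but it was seen, so the recommendations become items 3 and 2.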

In [36]:
def model_evaluate(recommended_items, holdout, holdout_description, topn=20, rebase_users=True, verbose=0):
    # reindex users and sort, working on a copy so the caller's holdout is not modified
    userid = holdout_description['users']
    user_idx, user_index = pd.factorize(holdout[userid], sort=True)
    holdout_description['user_index'] = user_index
    holdout = holdout.assign(**{userid: user_idx}).sort_values(userid)

    itemid = holdout_description['items']
    holdout_items = holdout[itemid].values
    assert recommended_items.shape[0] == len(holdout_items)
    hits_mask = recommended_items[:, :topn] == holdout_items.reshape(-1, 1)

    # HR calculation
    hr = np.mean(hits_mask.any(axis=1))

    # MRR calculation
    n_test_users = recommended_items.shape[0]
    hit_rank = np.where(hits_mask)[1] + 1.0
    mrr = np.sum(1 / hit_rank) / n_test_users

    # coverage calculation
    n_items = holdout_description['n_items']
    cov = np.unique(recommended_items).size / n_items

    # NDCG calculation
    # Calculate DCG for each user
    dcg = np.sum(hits_mask / np.log2(np.arange(2, hits_mask.shape[1] + 2)), axis=1)
    # Create ideal ranking for each user
    ideal_ranking = np.flip(np.sort(hits_mask, axis=1), axis=1)
    # Calculate IDCG for each user
    idcg = np.sum(ideal_ranking / np.log2(np.arange(2, hits_mask.shape[1] + 2)), axis=1)
    # Avoid division by zero
    idcg[idcg == 0] = 1
    # Calculate NDCG for each user
    ndcg_values = dcg / idcg
    # macro average: compute NDCG per user, then take the mean
    macro_ndcg = np.mean(ndcg_values)
    # micro average: sum DCG and IDCG over users, then divide
    #micro_ndcg = np.sum(dcg) / np.sum(idcg) # no difference here because each user has exactly one holdout item

    if verbose:
        print(f'NDCG@{topn}: {macro_ndcg}, \nHR@{topn}: {hr}, \nMRR@{topn}: {mrr}, \nCOV@{topn}: {cov}')
    return macro_ndcg, hr, mrr, cov
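A tiny worked example of the hit-based metrics, using made-up recommendations for three users and one holdout item each. It follows the same hit-mask logic as the function above but is a standalone sketch.

```python
import numpy as np

recommended = np.array([[7, 3, 5],    # user 0: holdout item 7 at rank 1
                        [1, 2, 4],    # user 1: holdout item 4 at rank 3
                        [6, 8, 9]])   # user 2: holdout item 0 is missed
holdout_items = np.array([7, 4, 0])

hits = recommended == holdout_items.reshape(-1, 1)
hr = hits.any(axis=1).mean()
hit_rank = np.where(hits)[1] + 1.0
mrr = np.sum(1.0 / hit_rank) / len(holdout_items)
# One relevant item per user, so IDCG is 1 and NDCG equals the discounted hit.
ndcg = np.sum(1.0 / np.log2(hit_rank + 1)) / len(holdout_items)
```

Two of three users are hit, so HR is 2/3; MRR is (1 + 1/3) / 3 = 4/9; NDCG is (1 + 0.5) / 3 = 0.5.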

In [None]:
metrics = model_evaluate(svd_recs, holdout_, conf_hold, topn=20, verbose=1)

NDCG@20: 0.047632817841282905, 
HR@20: 0.11190524811108842, 
MRR@20: 0.030065278797127004, 
COV@20: 0.08831581252737626


##### Example: submitting a solution

Train the model on the full training data and generate recommendations

In [None]:
%%time
svd_config = {'rank': 200}
V, sigma = svd_params = build_svd_model(svd_config, data, conf_data)

CPU times: user 1min 7s, sys: 28.9 s, total: 1min 35s
Wall time: 1min 18s


In [None]:
conf_data_test

{'users': 'userid',
 'items': 'movieid',
 'feedback': 'rating',
 'n_users': 2963,
 'n_items': 17102,
 'users_arr': array([    55,    133,    136, ..., 282999, 283047, 283183]),
 'items_arr': array([    0,     1,     2, ..., 18244, 18254, 18260])}

In [None]:
conf_data_test

{'users': 'userid',
 'items': 'movieid',
 'feedback': 'rating',
 'n_users': 2963,
 'n_items': 17102,
 'users_arr': array([    55,    133,    136, ..., 282999, 283047, 283183]),
 'items_arr': array([    0,     1,     2, ..., 18244, 18254, 18260]),
 'user_index': array([    55,    133,    136, ..., 282999, 283047, 283183])}

In [None]:
%%time
scores = svd_model_scoring(svd_params, data_test, conf_data_test)

n_users=2963, n_items=18264
user_idx.shape=(1368134,), item_idx.shape=(1368134,)
max userid  2962
max itemid  18260
CPU times: user 1.79 s, sys: 197 ms, total: 1.99 s
Wall time: 1.27 s


In [None]:
downvote_seen_items(scores, data_test, conf_data_test)
svd_recs = topn_recommendations(scores, topn=20)

Map users and movies back to their original ids, write the submission file, and send it

In [None]:
save_submission(svd_recs, conf_data_test, movie_index_origin)

Save file:  kaggle_submission.csv


In [None]:
send_submission('pureSVD_200')

Submission_title:  pureSVD_200
Send file to kaggle:  kaggle_submission.csv
100% 636k/636k [00:00<00:00, 2.05MB/s]
Successfully submitted to HSE Recommender Systems Course Challenge 2023

#### GridSearch

In [37]:
def grid_search_pureSVD(rank_list,
                        buildSVDmodel_function=build_svd_model,
                        modelScoring_function=svd_model_scoring,
                        trainset=train_dataset_,
                        conf_train=conf_train,
                        testset=test_dataset_,
                        conf_test=conf_test,
                        holdout=holdout_,
                        conf_hold=conf_hold,
                        topn=20,
                        calc_metrics=False,
                        metric_verbose=1,
                        normalize=False,
                        scaling=1, scaling_axis=1,
                        verbose=0):
    print(buildSVDmodel_function)
    metrics_list = []
    svd_recs_list = []
    svd_config = {'rank': max(rank_list)}
    V, sigma = svd_params = buildSVDmodel_function(svd_config, trainset, conf_train, normalize=normalize, scaling=scaling, scaling_axis=scaling_axis)

    for r in rank_list:
        if verbose:
          print('##########')
          print('Rank: ', r)
        svd_config = {'rank': r}
        svd_params_grid = V[:,:r], sigma[:r]
        scores = modelScoring_function(svd_params_grid, testset, conf_test) #normalize=normalize, scaling=scaling, scaling_axis=scaling_axis)
        downvote_seen_items(scores, testset, conf_test)
        svd_recs = topn_recommendations(scores, topn=topn)
        svd_recs_list.append(svd_recs)
        if calc_metrics:
            metrics = model_evaluate(svd_recs, holdout, conf_hold, topn=topn, verbose=metric_verbose)
            metrics_list.append(metrics)
    return svd_recs_list, svd_params, metrics_list
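The grid search computes a single SVD at the maximum rank and then slices the leading columns for each smaller rank. The sketch below checks, on a made-up matrix with well-separated singular values, that slicing gives the same subspace as a direct lower-rank decomposition. `item_factors` is a hypothetical helper mirroring the sorting done in `build_svd_model`.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
# Toy matrix with well-separated singular values 10, 9, ..., 1.
Q1, _ = np.linalg.qr(rng.standard_normal((30, 10)))
Q2, _ = np.linalg.qr(rng.standard_normal((20, 10)))
A = Q1 @ np.diag(np.arange(10.0, 0.0, -1.0)) @ Q2.T

def item_factors(matrix, rank):
    """Right singular vectors sorted by decreasing singular value."""
    _, s, vt = svds(matrix, k=rank)
    order = np.argsort(-s)
    return vt[order].T, s[order]

V_big, s_big = item_factors(A, rank=6)
V_small, s_small = item_factors(A, rank=3)

# Compare projectors V V^T, which are insensitive to the sign of each vector.
P_truncated = V_big[:, :3] @ V_big[:, :3].T
P_direct = V_small @ V_small.T
```

Since PureSVD scores depend on the factors only through `V @ V.T`, equal projectors mean equal recommendations, so one decomposition can serve the whole rank grid.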

In [None]:
%%time
# just simple pureSVD
rank_list = list(range(24,800,24))
_ = grid_search_pureSVD(rank_list, calc_metrics=True)

<function build_svd_model at 0x790ff1d9e290>
##########
Rank:  24
NDCG@20: 0.04721731069857013, 
HR@20: 0.11313048805391056, 
MRR@20: 0.029310202870324664, 
COV@20: 0.05294568550153307
##########
Rank:  48
NDCG@20: 0.052678029725075404, 
HR@20: 0.12456606085358382, 
MRR@20: 0.033090589251940326, 
COV@20: 0.06482698204117389
##########
Rank:  72
NDCG@20: 0.05146761265121468, 
HR@20: 0.12109454768225444, 
MRR@20: 0.03238322527491324, 
COV@20: 0.0705759964958388
##########
Rank:  96
NDCG@20: 0.05069438014738378, 
HR@20: 0.11843986113947315, 
MRR@20: 0.03209760893303481, 
COV@20: 0.07429916776171704
##########
Rank:  120
NDCG@20: 0.05092331175378975, 
HR@20: 0.11925668776802124, 
MRR@20: 0.032220599937779724, 
COV@20: 0.07846035917652212
##########
Rank:  144
NDCG@20: 0.050919483124861935, 
HR@20: 0.11701041453951398, 
MRR@20: 0.03274944186556212, 
COV@20: 0.08190976784932107
##########
Rank:  168
NDCG@20: 0.05063358404279323, 
HR@20: 0.11394731468245865, 
MRR@20: 0.03326924266605223, 
COV

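PureSVD with all ratings set to 1 (implicit feedback). Note that `assign(feedback=1)` adds a new column named `feedback`, whereas the configs read the feedback values from the `rating` column, so the runs below still use the original ratings and reproduce the metrics above exactly.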

In [None]:
train_dataset_.assign(feedback=1).head()

Unnamed: 0,userid,movieid,rating,timestamp,feedback
0,158862,1556,4.0,978307288,1
1,49693,7674,4.0,985466347,1
2,65237,2303,3.0,978308050,1
5,136531,2201,2.0,978312286,1
6,6843,160,5.0,978309715,1
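One possible way to binarize the ratings is to overwrite the column that the config actually points to. The toy frame below is made up; `config` mimics the `'feedback': 'rating'` entry produced by `create_base_config`.

```python
import pandas as pd

config = {'users': 'userid', 'items': 'movieid', 'feedback': 'rating'}
ratings = pd.DataFrame({'userid': [0, 0, 1],
                        'movieid': [5, 7, 5],
                        'rating': [4.0, 2.5, 5.0]})

# Adds an unrelated column; the column named by config['feedback'] is unchanged.
extra_column = ratings.assign(feedback=1)
# Overwrites the column that the matrix builder actually reads.
binarized = ratings.assign(**{config['feedback']: 1.0})
```

Passing a frame like `binarized` as `trainset` and `testset` would make the interaction matrix binary; results from such a run are not shown in this notebook.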


In [None]:
%%time
# just simple pureSVD
train_dataset_.assign(feedback=1)
rank_list = list(range(24,600,24))
_ = grid_search_pureSVD(rank_list,
                        trainset=train_dataset_.assign(feedback=1),
                        testset=test_dataset_.assign(feedback=1),
                        calc_metrics=True)

<function build_svd_model at 0x790ff1d9e290>
##########
Rank:  24
NDCG@20: 0.04721731069857013, 
HR@20: 0.11313048805391056, 
MRR@20: 0.029310202870324664, 
COV@20: 0.05294568550153307
##########
Rank:  48
NDCG@20: 0.052678029725075404, 
HR@20: 0.12456606085358382, 
MRR@20: 0.033090589251940326, 
COV@20: 0.06482698204117389
##########
Rank:  72
NDCG@20: 0.05146761265121468, 
HR@20: 0.12109454768225444, 
MRR@20: 0.03238322527491324, 
COV@20: 0.0705759964958388
##########
Rank:  96
NDCG@20: 0.05069438014738378, 
HR@20: 0.11843986113947315, 
MRR@20: 0.03209760893303481, 
COV@20: 0.07429916776171704
##########
Rank:  120
NDCG@20: 0.05092331175378975, 
HR@20: 0.11925668776802124, 
MRR@20: 0.032220599937779724, 
COV@20: 0.07846035917652212
##########
Rank:  144
NDCG@20: 0.050919483124861935, 
HR@20: 0.11701041453951398, 
MRR@20: 0.03274944186556212, 
COV@20: 0.08190976784932107
##########
Rank:  168
NDCG@20: 0.05063358404279323, 
HR@20: 0.11394731468245865, 
MRR@20: 0.03326924266605223, 
COV

In [None]:
rank_list = [35, 38, 45, 48, 50, 56, 64, 300, 400, 450]
svd_recs_list, _, _ = grid_search_pureSVD(rank_list,
                        trainset=data.assign(feedback=1),
                        conf_train=conf_data,
                        testset=data_test.assign(feedback=1),
                        conf_test=conf_data_test,
                        calc_metrics=False)
for i, r in enumerate(rank_list):
    fname=f'pureSVD_{r}_feeadback1'
    save_submission(svd_recs_list[i], conf_data_test, movie_index_origin, name=fname)
    send_submission(f'pureSVD_{r}', kaggle_submission_file_name=fname)

<function build_svd_model at 0x790ff1d9e290>
##########
Rank:  35
##########
Rank:  38
##########
Rank:  45
##########
Rank:  48
##########
Rank:  50
##########
Rank:  56
##########
Rank:  64
##########
Rank:  300
##########
Rank:  400
##########
Rank:  450
Save file:  pureSVD_35_feeadback1
Submission_title:  pureSVD_35
Send file to kaggle:  pureSVD_35_feeadback1
100% 624k/624k [00:01<00:00, 463kB/s]
Successfully submitted to HSE Recommender Systems Course Challenge 2023
Save file:  pureSVD_38_feeadback1
Submission_title:  pureSVD_38
Send file to kaggle:  pureSVD_38_feeadback1
100% 624k/624k [00:01<00:00, 543kB/s]
Successfully submitted to HSE Recommender Systems Course Challenge 2023
Save file:  pureSVD_45_feeadback1
Submission_title:  pureSVD_45
Send file to kaggle:  pureSVD_45_feeadback1
100% 625k/625k [00:01<00:00, 616kB/s]
Successfully submitted to HSE Recommender Systems Course Challenge 2023
Save file:  pureSVD_48_feeadback1
Submission_title:  pureSVD_48
Send file to kaggle:  pure

### PureSVD with shifted scoring

#### model

In [42]:
from scipy import sparse
def build_shifted_model(config, data, data_description, normalize=False, scaling=1, scaling_axis=1):
    source_matrix = generate_interactions_matrix(data, data_description, rebase_users=True)
    #data_description['average_rating'] = data['rating'].mean()
    average_rating = data_description['average_rating']
    centered_matrix = source_matrix._with_data(source_matrix.data - average_rating)

    # define matvecs for the LinearOperator of the shifted matrix
    def shifted_mv(v):
        return centered_matrix.dot(v) + average_rating*v.sum()

    def shifted_rmv(v):
        return centered_matrix.T.dot(v) + average_rating*v.sum()

    shifted_matrix = LinearOperator(
        source_matrix.shape,
        shifted_mv,
        shifted_rmv
    )
    #if normalize: shifted_matrix = sparse_normalize(sparse.csr_matrix(shifted_matrix), scaling=scaling, axis=scaling_axis)
    _, s, vt = svds(shifted_matrix, k=config['rank'])
    sidx = np.argsort(-s)
    singular_values = s[sidx]
    item_factors = np.ascontiguousarray(vt[sidx, :].T)
    return item_factors, singular_values

In [39]:
def shifted_model_scoring(params, testset, data_description, normalize=False, scaling=1, scaling_axis=1):
    item_factors, sigma = params
    average_rating = data_description['average_rating']
    test_matrix = generate_interactions_matrix(testset, data_description, rebase_users=True)
    centered_matrix = test_matrix._with_data(test_matrix.data - average_rating)
    user_factors = centered_matrix.dot(item_factors) + average_rating*item_factors.sum(axis=0)
    scores = user_factors @ item_factors.T
    return scores
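The implicit shift can be checked on a toy example: the operator built from the centered sparse matrix should act like a dense matrix in which observed ratings are kept and every unobserved cell is filled with the average rating. The values below are made up, and `copy()` with in-place subtraction is used instead of the private `_with_data` method.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator

mu = 3.5
ratings = csr_matrix(np.array([[5.0, 0.0, 3.0],
                               [0.0, 4.0, 0.0]]))
centered = ratings.copy()
centered.data -= mu                      # subtract the mean from stored entries only

# Dense equivalent: observed ratings kept, every unobserved cell filled with mu.
dense_shifted = centered.toarray() + mu

shifted = LinearOperator(
    ratings.shape,
    matvec=lambda v: centered.dot(v) + mu * v.sum(),
    rmatvec=lambda v: centered.T.dot(v) + mu * v.sum(),
    dtype=np.float64,
)

v = np.array([1.0, -2.0, 0.5])
u = np.array([2.0, 1.0])
left = shifted.matvec(v)
right = shifted.rmatvec(u)
```

The dense matrix is never needed in practice; the operator form keeps memory use at the level of the sparse data while `svds` sees the shifted matrix.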

#### example

In [None]:
svd_config = {'rank': 48}
V_shift, sigma_shift = shifted_params = build_shifted_model(
    svd_config,
    train_dataset_,
    conf_train
)

In [None]:
shifted_scores = shifted_model_scoring(
    shifted_params,
    test_dataset_,
    conf_test
)
downvote_seen_items(shifted_scores, test_dataset_, conf_test)
shifted_recs = topn_recommendations(shifted_scores, topn=20)
model_evaluate(shifted_recs, holdout_, conf_hold, topn=20, verbose=True)

NDCG@20: 0.037120312549409895, 
HR@20: 0.08803928446159243, 
MRR@20: 0.02331470047341291, 
COV@20: 0.045280332895313184


(0.037120312549409895,
 0.08803928446159243,
 0.02331470047341291,
 0.045280332895313184)

#### gridSearch

In [None]:
%%time
# pureSVD_shifted
rank_list = list(range(24,600,24))
_ = grid_search_pureSVD(rank_list, build_shifted_model, shifted_model_scoring, calc_metrics=True)

<function build_shifted_model at 0x790ff1d9eb90>
##########
Rank:  24
NDCG@20: 0.03829624903420122, 
HR@20: 0.08985092914028997, 
MRR@20: 0.024251946796156276, 
COV@20: 0.046156373193166886
##########
Rank:  48
NDCG@20: 0.038275016993397096, 
HR@20: 0.08862568919746784, 
MRR@20: 0.024538878119180332, 
COV@20: 0.050536574682435394
##########
Rank:  72
NDCG@20: 0.035445911502377865, 
HR@20: 0.08025321625484991, 
MRR@20: 0.02318591896008979, 
COV@20: 0.053274200613228205
##########
Rank:  96
NDCG@20: 0.03389204669129915, 
HR@20: 0.07780273636920564, 
MRR@20: 0.02194514845114884, 
COV@20: 0.05590232150678931
##########
Rank:  120
NDCG@20: 0.03314134073951498, 
HR@20: 0.07535225648356136, 
MRR@20: 0.02166732433697866, 
COV@20: 0.05951598773543583
##########
Rank:  144
NDCG@20: 0.031712846474930854, 
HR@20: 0.07371860322646519, 
MRR@20: 0.020260079433802716, 
COV@20: 0.06225361366622865
##########
Rank:  168
NDCG@20: 0.0307611640274713, 
HR@20: 0.07004288339799877, 
MRR@20: 0.020056184680724

In [None]:
r = 49
svd_recs_list, _, _ = grid_search_pureSVD([r], build_shifted_model, shifted_model_scoring,
                        trainset=data.assign(feedback=1),
                        conf_train=conf_data,
                        testset=data_test.assign(feedback=1),
                        conf_test=conf_data_test,
                        calc_metrics=False)

fname='pureSVD_49_feeadback1'
# grid_search_pureSVD returns one recommendation array per rank; take the only one
save_submission(svd_recs_list[0], conf_data_test, movie_index_origin, name=fname)
send_submission(f'pureSVD_{r}', kaggle_submission_file_name=fname)

In [None]:
rank_list = [18, 24, 32, 35, 38, 45, 48, 50, 56, 64, 300, 400, 450, 650]
svd_recs_list, _, _ = grid_search_pureSVD(rank_list, build_shifted_model, shifted_model_scoring,
                        trainset=data,
                        conf_train=conf_data,
                        testset=data_test,
                        conf_test=conf_data_test,
                        calc_metrics=False)

for i, r in enumerate(rank_list):
    fname=f'pureSVD_shift_{r}'
    save_submission(svd_recs_list[i], conf_data_test, movie_index_origin, name=fname)
    send_submission(f'pureSVD_shift_{r}', kaggle_submission_file_name=fname)

<function build_shifted_model at 0x790ff7a3a3b0>
##########
Rank:  18
##########
Rank:  24
##########
Rank:  32
##########
Rank:  35
##########
Rank:  38
##########
Rank:  45
##########
Rank:  48
##########
Rank:  50
##########
Rank:  56
##########
Rank:  64
##########
Rank:  300
##########
Rank:  400
##########
Rank:  450
##########
Rank:  650
Save file:  pureSVD_shift_18
Submission_title:  pureSVD_shift_18
Send file to kaggle:  pureSVD_shift_18
100% 620k/620k [00:01<00:00, 569kB/s]
Successfully submitted to HSE Recommender Systems Course Challenge 2023
Save file:  pureSVD_shift_24
Submission_title:  pureSVD_shift_24
Send file to kaggle:  pureSVD_shift_24
100% 620k/620k [00:01<00:00, 489kB/s]
Successfully submitted to HSE Recommender Systems Course Challenge 2023
Save file:  pureSVD_shift_32
Submission_title:  pureSVD_shift_32
Send file to kaggle:  pureSVD_shift_32
100% 621k/621k [00:01<00:00, 549kB/s]
Successfully submitted to HSE Recommender Systems Course Challenge 2023
Save file:

### PureSVD with scaling

##### model

In [None]:
from scipy.sparse import diags
from scipy.sparse.linalg import norm as spnorm

def sparse_normalize(matrix, scaling, axis):
    '''Function to scale either rows or columns of the sparse rating matrix'''
    if scaling == 1: # no scaling (standard SVD case)
        return matrix

    norm = spnorm(matrix, axis=axis, ord=2) # compute Euclidean norm of rows or columns
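    # out= keeps entries for zero-norm rows/columns at 0 instead of leaving them uninitialized
    scaling_matrix = diags(np.power(norm, scaling-1, where=norm!=0, out=np.zeros_like(norm)))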

    if axis == 0: # scale columns
        return matrix.dot(scaling_matrix)
    if axis == 1: # scale rows
        return scaling_matrix.dot(matrix)

##### Grid search: comparing normalization settings

In [None]:
%%time
# baseline: plain pureSVD (normalize=False, so the scaling value is ignored)
rank_list = list(range(16,60,2))
_ = grid_search_pureSVD(rank_list, calc_metrics=True, normalize=False, scaling=0.2)

<function build_svd_model at 0x790fe64032e0>
##########
Rank:  16
NDCG@20: 0.0438958127546873, 
HR@20: 0.1035327751684705, 
MRR@20: 0.027681618593221137, 
COV@20: 0.045444590451160755
##########
Rank:  18
NDCG@20: 0.04411861161445788, 
HR@20: 0.10475801511129262, 
MRR@20: 0.02749336486120743, 
COV@20: 0.048127463863337716
##########
Rank:  20
NDCG@20: 0.04428734130116839, 
HR@20: 0.10843373493975904, 
MRR@20: 0.026826768199732377, 
COV@20: 0.04982479194042926
##########
Rank:  22
NDCG@20: 0.045386465147652676, 
HR@20: 0.10945476822544414, 
MRR@20: 0.028102158819676527, 
COV@20: 0.05119360490582567
##########
Rank:  24
NDCG@20: 0.04721731069857013, 
HR@20: 0.11313048805391056, 
MRR@20: 0.029310202870324664, 
COV@20: 0.05294568550153307
##########
Rank:  26
NDCG@20: 0.048975760413475504, 
HR@20: 0.11558096793955483, 
MRR@20: 0.03096187753735402, 
COV@20: 0.05464301357862462
##########
Rank:  28
NDCG@20: 0.0481460596323896, 
HR@20: 0.11701041453951398, 
MRR@20: 0.029594338121878516, 
COV@
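
For reference, the four numbers printed per rank can be computed as below when every test user has exactly one holdout item. This is a minimal sketch under that assumption; it is not the notebook's `model_evaluate`, and the function name and toy arrays are made up for illustration. With a single relevant item, NDCG reduces to `1 / log2(rank + 1)` for a hit and 0 for a miss.

```python
import numpy as np

def leave_one_out_metrics(recs, holdout_items, n_items):
    """recs: (n_users, topn) recommended item indices, best first, no repeats per row.
    holdout_items: (n_users,) array with the single held-out item of each user."""
    recs = np.asarray(recs)
    holdout_items = np.asarray(holdout_items)
    hits = recs == holdout_items[:, np.newaxis]   # True where the holdout item was recommended
    hit_users, hit_pos = np.nonzero(hits)         # 0-based position of each hit
    ranks = hit_pos + 1.0                         # 1-based rank of each hit
    n_users = recs.shape[0]
    return {
        'NDCG': np.sum(1.0 / np.log2(ranks + 1.0)) / n_users,
        'HR': len(hit_users) / n_users,
        'MRR': np.sum(1.0 / ranks) / n_users,
        'COV': len(np.unique(recs)) / n_items,    # share of the catalogue ever recommended
    }

toy_recs = np.array([[5, 2, 7],
                     [1, 3, 4],
                     [6, 0, 2]])
toy_holdout = np.array([2, 9, 6])   # user 0 hits at rank 2, user 1 misses, user 2 hits at rank 1
metrics = leave_one_out_metrics(toy_recs, toy_holdout, n_items=10)
print(metrics)
```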

In [None]:
%%time
# pureSVD with normalized train data
rank_list = list(range(16,60,2))
_ = grid_search_pureSVD(rank_list, calc_metrics=True, normalize=True, scaling=0.2)

<function build_svd_model at 0x790fe64032e0>
##########
Rank:  16
NDCG@20: 0.04922387512205143, 
HR@20: 0.12089034102511742, 
MRR@20: 0.029552296731853493, 
COV@20: 0.03520586946999562
##########
Rank:  18
NDCG@20: 0.04959500011364643, 
HR@20: 0.12109454768225444, 
MRR@20: 0.029989399179956865, 
COV@20: 0.03608190976784932
##########
Rank:  20
NDCG@20: 0.049683743432514124, 
HR@20: 0.1225239942822136, 
MRR@20: 0.029616540697620538, 
COV@20: 0.036903197547087166
##########
Rank:  22
NDCG@20: 0.05132849866210642, 
HR@20: 0.12619971411068, 
MRR@20: 0.03085145934182054, 
COV@20: 0.03772448532632501
##########
Rank:  24
NDCG@20: 0.05104081852904824, 
HR@20: 0.12415764753930979, 
MRR@20: 0.030991115323059246, 
COV@20: 0.03903854577310556
##########
Rank:  26
NDCG@20: 0.05155870037006109, 
HR@20: 0.1225239942822136, 
MRR@20: 0.03209379908784706, 
COV@20: 0.039640823477879984
##########
Rank:  28
NDCG@20: 0.051457860126874824, 
HR@20: 0.12231978762507657, 
MRR@20: 0.032100871245659454, 
COV@20

In [None]:
%%time
# pureSVD with normalized train data and test data when scoring
rank_list = list(range(16,60,2))
_ = grid_search_pureSVD(rank_list, calc_metrics=True, normalize=True, scaling=0.2)

<function build_svd_model at 0x790fe64032e0>
##########
Rank:  16
norm
NDCG@20: 0.04922387512205143, 
HR@20: 0.12089034102511742, 
MRR@20: 0.029552296731853493, 
COV@20: 0.03520586946999562
##########
Rank:  18
norm
NDCG@20: 0.04959500011364643, 
HR@20: 0.12109454768225444, 
MRR@20: 0.029989399179956865, 
COV@20: 0.03608190976784932
##########
Rank:  20
norm
NDCG@20: 0.049683743432514124, 
HR@20: 0.1225239942822136, 
MRR@20: 0.029616540697620538, 
COV@20: 0.036903197547087166
##########
Rank:  22
norm
NDCG@20: 0.05132849866210642, 
HR@20: 0.12619971411068, 
MRR@20: 0.03085145934182054, 
COV@20: 0.03772448532632501
##########
Rank:  24
norm
NDCG@20: 0.05104081852904824, 
HR@20: 0.12415764753930979, 
MRR@20: 0.030991115323059246, 
COV@20: 0.03903854577310556
##########
Rank:  26
norm
NDCG@20: 0.05155870037006109, 
HR@20: 0.1225239942822136, 
MRR@20: 0.03209379908784706, 
COV@20: 0.039640823477879984
##########
Rank:  28
norm
NDCG@20: 0.051457860126874824, 
HR@20: 0.12231978762507657, 
MR

In [None]:
%%time
# pureSVD with normalized train data and test data when scoring, scaling along axis 0
rank_list = list(range(16,60,2))
_ = grid_search_pureSVD(rank_list, calc_metrics=True, normalize=True, scaling=0.2, scaling_axis=0, verbose=1)

<function build_svd_model at 0x788d92f6a290>
##########
Rank:  16
NDCG@20: 0.02891690721553982, 
HR@20: 0.07167653665509495, 
MRR@20: 0.017202276179877303, 
COV@20: 0.06208935611038108
##########
Rank:  18
NDCG@20: 0.029819224955178407, 
HR@20: 0.07310598325505412, 
MRR@20: 0.01797212871737396, 
COV@20: 0.06674332019272887
##########
Rank:  20
NDCG@20: 0.030775629433257026, 
HR@20: 0.07759852971206861, 
MRR@20: 0.018054126693103547, 
COV@20: 0.06849540078843627
##########
Rank:  22
NDCG@20: 0.031111232359349705, 
HR@20: 0.07780273636920564, 
MRR@20: 0.01836322691511564, 
COV@20: 0.07178055190538765
##########
Rank:  24
NDCG@20: 0.03182615553450152, 
HR@20: 0.07964059628343884, 
MRR@20: 0.018821829856141316, 
COV@20: 0.07561322820849758
##########
Rank:  26
NDCG@20: 0.032400849088795904, 
HR@20: 0.08209107616908311, 
MRR@20: 0.018947083309623466, 
COV@20: 0.08026719229084538
##########
Rank:  28
NDCG@20: 0.033410610174565425, 
HR@20: 0.08352052276904227, 
MRR@20: 0.019793944180731503, 
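
The comparison above varies whether the train matrix (and the test matrix at scoring time) is scaled before PureSVD. As a hedged illustration of what that means, here is a dense toy sketch with `np.linalg.svd`. It is not the notebook's `build_svd_model` / `svd_model_scoring`; the function names, the toy data and the choice of column scaling are assumptions.

```python
import numpy as np

def pure_svd_item_factors(train, rank, scaling=1.0):
    """Top-`rank` right singular vectors of the column-scaled train matrix."""
    col_norm = np.linalg.norm(train, axis=0)
    weights = np.power(col_norm, scaling - 1, where=col_norm != 0, out=np.zeros_like(col_norm))
    _, _, vt = np.linalg.svd(train * weights, full_matrices=False)
    return vt[:rank].T, weights           # item factors V with shape (n_items, rank)

def pure_svd_scores(test, item_factors, weights=None):
    """Folding-in: project test users onto the item factors, scores = A V V^T."""
    if weights is not None:               # optionally apply the train scaling to the test rows too
        test = test * weights
    return test @ item_factors @ item_factors.T

rng = np.random.default_rng(0)
train = (rng.random((30, 8)) < 0.3).astype(float)   # toy binary interactions
test = (rng.random((5, 8)) < 0.3).astype(float)

V, w = pure_svd_item_factors(train, rank=3, scaling=0.4)
scores = pure_svd_scores(test, V, weights=w)
print(scores.shape)
```

Because scoring only needs the item factors `V`, test users who were not in the train matrix can still be scored from their own history.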


##### GS optimal scale factor and rank

Find optimal rank

In [None]:
rank_list = list(range(10,200, 20))
scaling_list = np.array((range(1, 11, 1)))/10
scaling_metrics = []
for i, s in enumerate(scaling_list):
    print(i)
    _, _, m = grid_search_pureSVD(rank_list, calc_metrics=True, normalize=True, scaling=s, metric_verbose=0)
    scaling_metrics.append(m)

In [None]:
pd.DataFrame(np.array(scaling_metrics)[:, :, 0], columns=rank_list, index=scaling_list)

scaling \ rank,10,30,50,70,90,110,130,150,170,190
0.1,0.046123,0.051762,0.052315,0.049191,0.046773,0.045804,0.047469,0.048342,0.04826,0.047685
0.2,0.046039,0.052869,0.051197,0.050738,0.048017,0.046215,0.047364,0.048973,0.04904,0.047409
0.3,0.045396,0.053535,0.052276,0.051619,0.048445,0.046446,0.048233,0.048937,0.048191,0.047046
0.4,0.045291,0.052102,0.053116,0.050187,0.04884,0.047261,0.048703,0.048467,0.048396,0.048017
0.5,0.044582,0.053574,0.053864,0.051295,0.048838,0.047305,0.048538,0.049256,0.048511,0.046392
0.6,0.042635,0.05346,0.053206,0.051812,0.049275,0.049254,0.048818,0.04847,0.048371,0.047471
0.7,0.041078,0.052839,0.052806,0.051878,0.050317,0.050082,0.04922,0.048684,0.04837,0.047583
0.8,0.04079,0.05124,0.053302,0.052391,0.051341,0.050596,0.050217,0.049491,0.048569,0.047589
0.9,0.039541,0.050205,0.053088,0.053027,0.050913,0.050544,0.049644,0.05044,0.048344,0.046951
1.0,0.038773,0.049231,0.052868,0.051853,0.050595,0.05063,0.051146,0.050349,0.049765,0.047928


In [None]:
np.array((range(10, 80, 5)))/100

array([0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55, 0.6 ,
       0.65, 0.7 , 0.75])

In [None]:
rank_list = list(range(10,51, 2))
scaling_list = np.array((range(10, 80, 5)))/100
scaling_metrics = []
for i, s in enumerate(scaling_list):
    print(i)
    _, _, m = grid_search_pureSVD(rank_list, calc_metrics=True, normalize=True, scaling=s, metric_verbose=0)
    scaling_metrics.append(m)

In [None]:
pd.DataFrame(np.array(scaling_metrics)[:, :, 0], columns=rank_list, index=scaling_list)

scaling \ rank,10,12,14,16,18,20,22,24,26,28,...,32,34,36,38,40,42,44,46,48,50
0.1,0.046123,0.048908,0.049146,0.049729,0.049697,0.049751,0.04957,0.051072,0.051132,0.05158,...,0.051444,0.05164,0.050979,0.051904,0.050784,0.05107,0.049919,0.051256,0.052014,0.052315
0.15,0.046114,0.048096,0.048447,0.049549,0.049665,0.049472,0.050665,0.050708,0.051096,0.052103,...,0.052168,0.052566,0.052204,0.052543,0.051878,0.052013,0.051046,0.051547,0.052409,0.051701
0.2,0.046039,0.046936,0.048539,0.049224,0.049595,0.049684,0.051328,0.051041,0.051559,0.051458,...,0.051777,0.051602,0.051635,0.052401,0.052393,0.05288,0.052214,0.052135,0.051347,0.051197
0.25,0.045579,0.0471,0.049701,0.046969,0.049457,0.049968,0.051113,0.050666,0.051532,0.050616,...,0.051641,0.052029,0.051593,0.052442,0.052825,0.053029,0.053136,0.052445,0.052068,0.051732
0.3,0.045396,0.046796,0.049144,0.047346,0.049185,0.050203,0.050596,0.051083,0.051177,0.051286,...,0.05215,0.052749,0.05188,0.053627,0.05264,0.053345,0.052903,0.05267,0.052888,0.052276
0.35,0.045652,0.047083,0.048892,0.04704,0.047865,0.048835,0.050548,0.051157,0.052161,0.052526,...,0.052181,0.051229,0.05245,0.052835,0.053591,0.05258,0.053932,0.053013,0.053281,0.053189
0.4,0.045291,0.047178,0.048435,0.046746,0.047537,0.048957,0.050527,0.050885,0.051945,0.05259,...,0.051564,0.052134,0.052875,0.053042,0.05397,0.052778,0.053905,0.052969,0.052876,0.053116
0.45,0.045224,0.04669,0.048133,0.04648,0.047509,0.048355,0.049537,0.05067,0.051436,0.053191,...,0.051855,0.052125,0.052007,0.053237,0.052766,0.053082,0.053082,0.052811,0.052138,0.053122
0.5,0.044582,0.046207,0.047745,0.047142,0.047536,0.048829,0.049926,0.050677,0.052499,0.052899,...,0.052246,0.051838,0.052071,0.052417,0.052981,0.053386,0.052806,0.053259,0.052532,0.053864
0.55,0.043256,0.045463,0.047734,0.046616,0.047161,0.048444,0.048877,0.049702,0.050287,0.051576,...,0.053132,0.051686,0.051574,0.052566,0.051307,0.052922,0.053549,0.052956,0.052226,0.053664


##### GS scaling axis=0

In [None]:
#scaling axis=0
rank_list = list(range(1,600, 20))[1:]
scaling_list = np.array((range(0, 11, 1)))/10
scaling_metrics = []
for i, s in enumerate(scaling_list):
    print(i)
    _, _, m = grid_search_pureSVD(rank_list, calc_metrics=True, normalize=True, scaling=s, scaling_axis=0, metric_verbose=0, verbose=0)
    scaling_metrics.append(m)

In [None]:
np.save('axis0_scaling0_1_0.1_rank1_600_20', np.array(scaling_metrics)[:,:,0])

In [None]:
pd.DataFrame(np.array(scaling_metrics)[:, :, 0], columns=rank_list, index=scaling_list)

scaling \ rank,21,41,61,81,101,121,141,161,181,201,...,401,421,441,461,481,501,521,541,561,581
0.0,0.022687,0.027273,0.029599,0.032066,0.033131,0.033742,0.035322,0.036008,0.03618,0.03697,...,0.042082,0.0432,0.043014,0.043987,0.044082,0.044266,0.044976,0.04526,0.045641,0.046082
0.1,0.02659,0.032068,0.03543,0.036654,0.038371,0.039148,0.040971,0.042318,0.043842,0.044099,...,0.049729,0.049845,0.050458,0.05041,0.050539,0.050626,0.051207,0.051709,0.050966,0.051691
0.2,0.031041,0.036971,0.03923,0.04124,0.043723,0.046525,0.046864,0.048125,0.049735,0.049918,...,0.05215,0.051774,0.052633,0.052865,0.053777,0.054482,0.054116,0.055037,0.055277,0.055499
0.3,0.034922,0.041194,0.043445,0.047437,0.049595,0.049929,0.050507,0.051525,0.051375,0.051391,...,0.058603,0.058829,0.060199,0.060057,0.059798,0.059838,0.061038,0.061407,0.06059,0.061356
0.4,0.037864,0.044037,0.04873,0.050804,0.051678,0.052334,0.053928,0.055921,0.055812,0.056922,...,0.06372,0.06431,0.064579,0.065374,0.065238,0.065518,0.065055,0.06522,0.065394,0.065375
0.5,0.041046,0.047973,0.051332,0.053371,0.055323,0.056111,0.057389,0.058537,0.058667,0.061049,...,0.065048,0.065196,0.065685,0.065134,0.064027,0.06413,0.062557,0.06196,0.061878,0.061724
0.6,0.042368,0.049709,0.053245,0.055546,0.056356,0.057926,0.059589,0.059208,0.061179,0.062547,...,0.059319,0.058609,0.0573,0.057307,0.057048,0.056474,0.056245,0.055833,0.054119,0.053353
0.7,0.04373,0.052079,0.054835,0.056081,0.057633,0.058964,0.059089,0.05812,0.058358,0.059408,...,0.052177,0.052107,0.05104,0.050153,0.049612,0.048972,0.047956,0.047153,0.047696,0.046166
0.8,0.043715,0.053492,0.055388,0.054469,0.05647,0.056432,0.055743,0.054706,0.055802,0.056123,...,0.046776,0.045735,0.045611,0.045388,0.044947,0.044218,0.042235,0.042332,0.042168,0.041492
0.9,0.044717,0.052754,0.053122,0.053964,0.053495,0.052966,0.052345,0.052881,0.052625,0.050702,...,0.04293,0.042182,0.042996,0.042058,0.040912,0.040809,0.04028,0.039705,0.039201,0.038068


In [None]:
# index of the best scaling (row of the grid) for each rank
np.argmax(np.array(scaling_metrics)[:, :, 0], axis=0)

array([9, 8, 8, 7, 7, 7, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       4, 4, 4, 4, 4, 4, 4])

In [None]:
# best rank for each scaling: scaling values, column index of the best rank, best rank
print(scaling_list)
print(np.argmax(np.array(scaling_metrics)[:, :, 0], axis=1))
print(np.array(rank_list)[np.argmax(np.array(scaling_metrics)[:, :, 0], axis=1)])

[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
[28 26 28 26 24 21  9  9  4  3  2]
[581 541 581 541 501 441 201 201 101  81  61]


In [None]:
grid_arr = np.array(scaling_metrics)[:, :, 0]

In [None]:
for i, col in enumerate(np.argmax(grid_arr, axis=1)):
    print(f'scaling {scaling_list[i]}, rank {rank_list[col]}, ndcg {grid_arr[i, col]}')

scaling 0.0, rank 581, ndcg 0.046081707671267715
scaling 0.1, rank 541, ndcg 0.051708825986896675
scaling 0.2, rank 581, ndcg 0.05549940155385166
scaling 0.3, rank 541, ndcg 0.06140725239673814
scaling 0.4, rank 501, ndcg 0.06551753330661234
scaling 0.5, rank 441, ndcg 0.06568540607912274
scaling 0.6, rank 201, ndcg 0.06254680769844276
scaling 0.7, rank 201, ndcg 0.0594075014899382
scaling 0.8, rank 101, ndcg 0.05646966189293169
scaling 0.9, rank 81, ndcg 0.053964308162230176
scaling 1.0, rank 61, ndcg 0.052516162733901015


In [None]:
#scaling axis=0
rank_list = list(range(500,900, 20))
scaling_list = [0.3, 0.35, 0.4]
scaling_metrics = []
for i, s in enumerate(scaling_list):
    print(i)
    _, _, m = grid_search_pureSVD(rank_list, calc_metrics=True, normalize=True, scaling=s, scaling_axis=0, metric_verbose=0, verbose=0)
    scaling_metrics.append(m)

0
<function build_svd_model at 0x788d92f6a290>
1
<function build_svd_model at 0x788d92f6a290>
2
<function build_svd_model at 0x788d92f6a290>


In [None]:
np.save('axis0_scaling03_035_04_rank500_900_20', np.array(scaling_metrics)[:,:,0])

In [None]:
pd.DataFrame(np.array(scaling_metrics)[:, :, 0], columns=rank_list, index=scaling_list)

scaling \ rank,500,520,540,560,580,600,620,640,660,680,700,720,740,760,780,800,820,840,860,880
0.3,0.059971,0.061007,0.061413,0.060831,0.061277,0.061241,0.061853,0.062328,0.062466,0.06261,0.063405,0.063323,0.063457,0.063729,0.064123,0.06391,0.063954,0.064875,0.064727,0.065337
0.35,0.062711,0.062921,0.063331,0.063823,0.064098,0.064138,0.064864,0.065127,0.065769,0.066089,0.065686,0.066553,0.066817,0.066369,0.065847,0.066028,0.0669,0.066682,0.066994,0.066746
0.4,0.065351,0.064781,0.065455,0.065384,0.065611,0.065811,0.065927,0.065789,0.065675,0.065955,0.065794,0.066153,0.066094,0.06646,0.065763,0.066312,0.066161,0.066178,0.065624,0.065804


In [None]:
grid_arr = np.array(scaling_metrics)[:, :, 0]
for i, col in enumerate(np.argmax(grid_arr, axis=1)):
    # label each row with its actual scaling value (i/10 only matched the earlier 0.0-1.0 grid)
    print(f'scaling {scaling_list[i]}, rank {rank_list[col]}, ndcg {grid_arr[i, col]}')

scaling 0.3, rank 880, ndcg 0.06533731860316094
scaling 0.35, rank 860, ndcg 0.06699409001300762
scaling 0.4, rank 760, ndcg 0.06646025396569916
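
The per-row maxima can be complemented by the single best cell of the whole grid. A small sketch with `np.unravel_index`, using the NDCG@20 values of the three highest ranks (840, 860, 880) copied from the table above; the variable names are illustrative.

```python
import numpy as np

scalings = [0.3, 0.35, 0.4]
ranks = [840, 860, 880]
# NDCG@20 values copied from the last three columns of the table above
ndcg = np.array([
    [0.064875, 0.064727, 0.065337],
    [0.066682, 0.066994, 0.066746],
    [0.066178, 0.065624, 0.065804],
])

# argmax over the flattened grid, converted back to (row, column) indices
row, col = np.unravel_index(np.argmax(ndcg), ndcg.shape)
best = (scalings[row], ranks[col], ndcg[row, col])
print(f'best scaling {best[0]}, rank {best[1]}, ndcg {best[2]}')
```

On the full grid the same two lines can be applied to `grid_arr`, indexing `scaling_list` and `rank_list` with the resulting row and column.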


In [None]:
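import shutil
# copy the saved grid to the folder that contains kaggle.json
# (a Python expression inside a quoted shell argument is not evaluated by `!cp`)
shutil.copy('axis0_scaling03_035_04_rank500_900_20.npy', find_path('kaggle.json')[0][1])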

#### send to kaggle

In [40]:
#rank_list = [35, 38, 45, 48, 50, 56, 64, 300, 400, 450]
#rank_list = list(range(16, 35, 2))
#rank_list=list(range(16, 24, 2))
#rank_list = list(range(496,501,1)) + list(range(502,506,1))
rank_list = [860]
s=0.15
scaling_axis=0
svd_recs_list, _, _ = grid_search_pureSVD(rank_list,
                        trainset=data,
                        conf_train=conf_data,
                        testset=data_test,
                        conf_test=conf_data_test,
                        calc_metrics=False,
                        normalize=True,
                        scaling=s,
                        scaling_axis=scaling_axis)

for i, r in enumerate(rank_list):
    fname=f'pureSVD_{r}_scale{s}'
    save_submission(svd_recs_list[i], conf_data_test, movie_index_origin, name=fname)
    send_submission(f'pureSVD_{r}_scale{s}_axis{scaling_axis}', kaggle_submission_file_name=fname)

<function build_svd_model at 0x7e77d0dd1750>
Save file:  pureSVD_860_scale0.15
Submission_title:  pureSVD_860_scale0.15_axis0
Send file to kaggle:  pureSVD_860_scale0.15
100% 628k/628k [00:02<00:00, 285kB/s] 
Successfully submitted to HSE Recommender Systems Course Challenge 2023

##### GS scaling axis=1

In [None]:
#scaling axis=1
rank_list = list(range(500,900, 20))
scaling_list = [0.3, 0.35, 0.4]
scaling_metrics = []
for i, s in enumerate(scaling_list):
    print(i)
    _, _, m = grid_search_pureSVD(rank_list, calc_metrics=True, normalize=True, scaling=s, scaling_axis=1, metric_verbose=0, verbose=0)
    scaling_metrics.append(m)

0
<function build_svd_model at 0x788d92f6a290>
1
<function build_svd_model at 0x788d92f6a290>
2
<function build_svd_model at 0x788d92f6a290>


In [None]:
pd.DataFrame(np.array(scaling_metrics)[:, :, 0], columns=rank_list, index=scaling_list)

### PureSVD scale plus shifted


In [43]:
#scaling axis=0
rank_list = list(range(401,600, 20))[1:]
scaling_list = [0.3, 0.4, 0.5]
scaling_metrics = []
for i, s in enumerate(scaling_list):
    print(i)
    _, _, m = grid_search_pureSVD(rank_list, build_shifted_model, shifted_model_scoring, calc_metrics=True, normalize=True, scaling=s, scaling_axis=0, metric_verbose=0, verbose=0)
    scaling_metrics.append(m)

0
<function build_shifted_model at 0x7e7797f91ea0>


TypeError: ignored

In [None]:
np.save('axis0_shiftscaling03_04_05_rank401_600_20', np.array(scaling_metrics)[:,:,0])

### LightFM

#### model

In [None]:
lfm_config = dict(
    no_components = 60,
    loss = 'warp',
    max_sampled = 1,
    max_epochs = 60,
    learning_schedule = 'adagrad',
    user_alpha = 1e-3,
    item_alpha = 1e-3,
    random_state = 7032023
)
topn = 20

In [None]:
# Toy example: build LightFM interaction matrices from randomly generated data
import numpy as np
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.evaluation import precision_at_k

# Expected data structure: (user_id, item_id, feedback) triples
# Randomly generated data is used here for demonstration purposes only
np.random.seed(42)
num_users = 100
num_items = 50
num_feedback = 500

user_ids = np.random.randint(1, num_users + 1, size=num_feedback)
item_ids = np.random.randint(1, num_items + 1, size=num_feedback)
feedback = np.random.choice([-1, 1], size=num_feedback)

# named toy_data so the `data` training set used for submissions is not overwritten
toy_data = list(zip(user_ids, item_ids, feedback))

# Split the toy data into train and test sets (first 80% / last 20%)
train_data = toy_data[:int(0.8 * len(toy_data))]
test_data = toy_data[int(0.8 * len(toy_data)):]

# Create a LightFM dataset
dataset = Dataset()
dataset.fit(users=user_ids, items=item_ids)

# Build the interaction matrix for train and test data
train_interactions, _ = dataset.build_interactions(train_data)
test_interactions, _ = dataset.build_interactions(test_data)

In [None]:
# Build the model
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.evaluation import precision_at_k
train_interactions = generate_interactions_matrix(train_dataset_, conf_train, rebase_users=True)

model = LightFM(loss='warp')
model.fit(train_interactions, epochs=30, num_threads=2, verbose=True)

<lightfm.lightfm.LightFM at 0x790ff201ea10>

In [None]:
train_interactions.shape

(122191, 18264)

In [None]:
test_interactions = generate_interactions_matrix(test_dataset_, conf_test, rebase_users=True)
test_interactions.shape

(4897, 18264)

In [None]:
predicted_scores = model.predict(np.arange(test_interactions.shape[0]), np.arange(test_interactions.shape[1]), user_features=test_interactions)

# Reshape the predicted scores to match the shape of user_item_matrix_new_users
#predicted_scores = predicted_scores.reshape(user_item_matrix_new_users.shape)

# Extract the predicted scores for the items with which the new users have interactions
#predicted_scores_for_interactions = predicted_scores[user_item_matrix_new_users == 1]

# Print or use the predicted scores as needed
#print(predicted_scores_for_interactions)

ValueError: ignored
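
One likely cause of this `ValueError` is that `LightFM.predict` scores (user, item) pairs, so `user_ids` and `item_ids` must have the same length; 4897 user ids and 18264 item ids do not line up. A hedged NumPy sketch of building the full grid of pairs follows, with toy sizes and the `model.predict` call left as a comment because it needs a fitted model. Note also that `user_features` is documented as a user-by-feature matrix, so passing the test interaction matrix there is probably a second problem, and the ids must refer to users the model was fitted on.

```python
import numpy as np

n_users, n_items = 3, 4   # toy sizes; the notebook has 4897 test users and 18264 items

# one entry per (user, item) pair, in user-major order
user_ids = np.repeat(np.arange(n_users, dtype=np.int32), n_items)
item_ids = np.tile(np.arange(n_items, dtype=np.int32), n_users)

# with a fitted model this would be (not executed here):
# flat_scores = model.predict(user_ids, item_ids)
flat_scores = np.zeros(len(user_ids))            # placeholder standing in for the model output
scores = flat_scores.reshape(n_users, n_items)   # back to a user-by-item score matrix
print(user_ids, item_ids, scores.shape)
```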

In [None]:
predicted_scores = model.predict(train_interactions)

TypeError: ignored

### Idea: generate 30/40 predictions with the top models and then aggregate the best 20 from them
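
The note does not say how the lists would be aggregated. One common option is reciprocal rank fusion, sketched below in plain Python; the function name, the constant `k=60` and the toy lists are assumptions chosen for illustration, not something the notebook uses.

```python
from collections import defaultdict

def fuse_rankings(rec_lists, topn=20, k=60):
    """Reciprocal rank fusion: each list adds 1 / (k + rank) to an item's score."""
    fused = defaultdict(float)
    for recs in rec_lists:
        for rank, item in enumerate(recs, start=1):
            fused[item] += 1.0 / (k + rank)
    # sort by fused score (descending); ties are broken by item id for determinism
    ordered = sorted(fused, key=lambda item: (-fused[item], item))
    return ordered[:topn]

# toy candidate lists for one user from two hypothetical models
model_a = [10, 11, 12, 13]
model_b = [12, 10, 14, 15]
fused_top3 = fuse_rankings([model_a, model_b], topn=3)
print(fused_top3)
```

Items recommended by several models near the top of their lists accumulate the largest fused score, so longer candidate lists (30 to 40 items) give the fusion more overlap to work with.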

### Build a two-level model: generate user-item features (scores from different models, feature vectors from a model) and then simply train gradient boosting on them

train1 - train our best previous models

train2 + holdout2 - extract user features and then train the boosting model on the holdout

test3 + holdout3 - get features from the models and evaluate the boosting quality
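
A rough sketch of the feature-building step for such a second-level model is given below. Everything in it is hypothetical: the function name, the candidate selection by summed score, and the toy score matrices standing in for first-level model outputs. The resulting `X` and `y` would then be passed to a gradient boosting classifier or ranker.

```python
import numpy as np

def build_second_level_dataset(score_matrices, holdout_items, n_candidates=3):
    """Turn first-level scores into (user, item) rows for a second-level model.

    score_matrices: list of (n_users, n_items) arrays, one per first-level model.
    holdout_items: (n_users,) array with each user's held-out item (source of labels).
    """
    stacked = np.stack(score_matrices, axis=-1)            # (n_users, n_items, n_models)
    # candidates: top items by summed score (a simplification for this sketch)
    candidates = np.argsort(-stacked.sum(axis=-1), axis=1)[:, :n_candidates]
    users = np.repeat(np.arange(stacked.shape[0]), n_candidates)
    items = candidates.ravel()
    X = stacked[users, items]                               # one feature column per model
    y = (items == np.asarray(holdout_items)[users]).astype(int)
    return users, items, X, y

rng = np.random.default_rng(0)
scores_a = rng.random((4, 6))    # stand-ins for the scores of two first-level models
scores_b = rng.random((4, 6))
holdout = np.array([0, 1, 2, 3])
users, items, X, y = build_second_level_dataset([scores_a, scores_b], holdout)
print(X.shape, y.shape)
```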

In [None]:
###

# Trash

In [None]:
rank_list = [50, 100, 150, 200]
metrics_list = []
for r in rank_list:
    print('##########')
    print('Rank: ', r)
    svd_config = {'rank': r}
    V, sigma = svd_params = build_svd_model(svd_config, train_dataset_, conf_train)
    scores = svd_model_scoring(svd_params, test_dataset_, conf_test)
    downvote_seen_items(scores, test_dataset_, conf_test)
    svd_recs = topn_recommendations(scores, topn=20)
    metrics = model_evaluate(svd_recs, holdout_, conf_hold, topn=20, verbose=1)
    metrics_list.append(metrics)
    print()

In [None]:
rank_list = [24, 32, 48, 64, 96, 110, 128, 164, 192, 220, 250]
metrics_list = []
for r in rank_list:
    print('##########')
    print('Rank: ', r)
    svd_config = {'rank': r}
    V_shift, sigma_shift = shifted_params = build_shifted_model(svd_config, train_dataset_, conf_train)
    shifted_scores = shifted_model_scoring(shifted_params, test_dataset_, conf_test)
    downvote_seen_items(shifted_scores, test_dataset_, conf_test)
    shifted_recs = topn_recommendations(shifted_scores, topn=20)
    metrics = model_evaluate(shifted_recs, holdout_, conf_hold, topn=20, verbose=1)
    metrics_list.append(metrics)
    print()