### Installing


In [1]:
# polara and gitfiles
!pip -q install --upgrade git+https://github.com/evfro/polara.git@develop#egg=polara
! wget -q https://raw.githubusercontent.com/Personalization-Technologies-Lab/RecSys-Course-HSE-Fall23/main/Seminar5/dataprep.py -O dataprep.py
! wget -q https://raw.githubusercontent.com/Personalization-Technologies-Lab/RecSys-Course-HSE-Fall23/main/Seminar5/evaluation.py -O evaluation.py

  Preparing metadata (setup.py) ... done
  Building wheel for polara (setup.py) ... done


### Imports

In [2]:
import numpy as np
import pandas as pd
import tqdm
from scipy.sparse import csr_matrix, diags
from sklearn.metrics.pairwise import cosine_similarity

from polara import get_movielens_data

from dataprep import leave_last_out, transform_indices, reindex_data, verify_time_split, generate_interactions_matrix
from evaluation import topn_recommendations, model_evaluate, downvote_seen_items

# Problem 1 (10 pts)

You’re given the matrix of interactions between 3 users and 6 items:
```
1 0 0 1 0 0
0 0 1 0 0 1
0 1 0 0 1 0
```
Is it possible to build a personalized recommendation model with this data?
Explain your answer.


I think not: there is not enough data here to build a personalized model.
To build a personalized recommendation model from data like this, we would need more than the interaction matrix alone, for example user demographics, item features, or explicit user feedback (e.g., ratings).

We have only the interaction matrix, without user or item features, so the only options are ordinary collaborative approaches such as user-based collaborative filtering, SVD and other matrix factorization methods, or K-nearest-neighbors (KNN) recommenders.

But this matrix does not contain the information needed to fit a good model: no two users share an item (each item was consumed by exactly one user), so the user rows are mutually orthogonal and there are no overlaps between user preferences that a matrix-based model could learn from.
In summary, the absence of overlap between users' interactions is a fundamental obstacle for recommendation models. These models are designed to capture patterns and relationships shared across the interaction matrix, and when users have nothing in common, they lack the information needed to generate personalized recommendations.
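
The argument above can be checked numerically. The following is a minimal sketch using plain NumPy (not the seminar helpers): it computes the user-user cosine similarity for the given matrix and the resulting unweighted KNN scores.

```python
import numpy as np

# Interaction matrix from the problem statement: 3 users x 6 items
A = np.array([
    [1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
], dtype=float)

# cosine similarity between user rows
norms = np.linalg.norm(A, axis=1, keepdims=True)
user_sim = (A @ A.T) / (norms * norms.T)
np.fill_diagonal(user_sim, 0)  # a user is not their own neighbor

# unweighted user-based KNN scores, R = K A
scores = user_sim @ A

print(user_sim)
print(scores)
```

Every off-diagonal similarity is zero, so the score matrix is zero as well: no user has a neighbor to borrow preferences from.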

# Problem 2 (20 pts)


Implement two variants of user-based KNN for the top-$n$ recommendations task when similarity matrix is calculated:
1. with neighborhood subsampling,
2. with additional weighting.

Recall, there's no reason for implementing row-wise weighting scheme in user-based KNN. So choose the weighting scheme wisely.

In your experiments:

- Use Movielens-1M data.
- Test your solution against both weak and strong generalization.
  - In total you’ll have 4 different experiments.
- Follow the "most-recent-item" sampling strategy for constructing holdout.
  - Explain potential issues of this scheme in relation to both weak and strong generalization.
- Report evaluation metrics, compare the models, and analyse the results.

**Note**: you can reuse the code from seminars if necessary.

In [3]:
data_ = get_movielens_data(include_time=True)
data, data_index = transform_indices(data_, 'userid', 'movieid')


data_description = dict(
    users = 'userid',
    items = 'movieid',
    feedback = 'rating',
    n_users = len(data.userid.unique()),
    n_items = len(data.movieid.unique())
)
data_description

{'users': 'userid',
 'items': 'movieid',
 'feedback': 'rating',
 'n_users': 6040,
 'n_items': 3706}

## Weak generalization test

### Preparing data (1 pts)

Your task is
- split data into training and holdout parts
- build a new internal contiguous representation of user and item index based on the training data
- make sure same index is used in the holdout data

In [None]:
# split most recent holdout item from each user
training_, holdout_ = leave_last_out(data, 'userid', 'timestamp')

# check correct time splitting
verify_time_split(training_, holdout_)

In [None]:
assert holdout_.userid.unique().shape == training_.userid.unique().shape

In [None]:
# reindex data to make contiguous index starting from 0 for user and item IDs
training, data_index = transform_indices(training_, 'userid', 'movieid')

# apply new index to the holdout data
holdout = reindex_data(holdout_, data_index, filter_invalid=True)
holdout = holdout.sort_values('userid')
#assert set(holdout.userid) == set(training.userid)

- Let's also populate the data description dictionary for convenience.
- It allows using uniform names for the user and item fields.
  - This way the code doesn't depend on the actual names in your dataset.
  - So later you can easily switch to another dataset without changing the code of the pipeline.


In [None]:
#data_description['test_users'] = holdout[data_index['users'].name].values
#data_description
data_description = dict(
    users = data_index['users'].name,
    items = data_index['items'].name,
    feedback = 'rating',
    n_users = len(data_index['users']),
    n_items = len(data_index['items']),
    test_users = holdout[data_index['users'].name].values
)
data_description

{'users': 'userid',
 'items': 'movieid',
 'feedback': 'rating',
 'n_users': 6040,
 'n_items': 3704,
 'test_users': array([   0,    1,    2, ..., 6037, 6038, 6039])}

As previously, let's also explicitly store our testset (i.e., ratings of test users excluding holdout items).

In [None]:
userid = data_description['users']
seen_idx_mask = training[userid].isin(data_description['test_users'])
testset = training[seen_idx_mask]

### Models implementation

#### Unweighted case (5 pts) and Weighted case (5 pts)

Unweighted case
- You can consult the code from seminars or implement your own solution as long as it is fast enough.  
- **Make sure to implement some kind of neighborhood subsampling.**
  - Recall that subsampling of the neighborhood not only makes the algorithm run faster, but can also improve the results.

Weighted case
- Your task here is to implement user-based KNN with asymmetric similarity.

In [None]:
def generate_interactions_matrix(data, data_description, rebase_users=False):
    '''
    Convert pandas dataframe with interactions into a sparse CSR user-item matrix,
    where each stored value is the rating given by the user.
    Allows reindexing user ids, which helps ensure data consistency
    at the scoring stage (assumes user ids are sorted in scoring array).
    '''
    n_users = data_description['n_users']
    n_items = data_description['n_items']
    # get indices of observed data
    user_idx = data[data_description['users']].values
    if rebase_users:
        user_idx, user_index = pd.factorize(user_idx, sort=True)
        n_users = len(user_index)
    item_idx = data[data_description['items']].values
    feedback = data[data_description['feedback']].values
    # construct rating matrix
    return csr_matrix((feedback, (user_idx, item_idx)), shape=(n_users, n_items))

def cosine_similarity_zd(matrix):
    '''Build cosine similarity matrix with zero diagonal.'''
    similarity = cosine_similarity(matrix, dense_output=False)
    similarity.setdiag(0)
    similarity.eliminate_zeros()
    return similarity

def truncate_similarity(similarity, k):
    '''
    For every row in similarity matrix, pick at most k entities
    with the highest similarity scores. Disregard everything else.
    '''
    similarity = similarity.tocsr()
    inds = similarity.indices
    ptrs = similarity.indptr
    data = similarity.data
    new_ptrs = [0]
    new_inds = []
    new_data = []
    for i in range(len(ptrs)-1):
        start, stop = ptrs[i], ptrs[i+1]
        if start < stop:
            data_ = data[start:stop]
            topk = min(len(data_), k)
            idx = np.argpartition(data_, -topk)[-topk:]
            new_data.append(data_[idx])
            new_inds.append(inds[idx+start])
            new_ptrs.append(new_ptrs[-1]+len(idx))
        else:
            new_ptrs.append(new_ptrs[-1])
    new_data = np.concatenate(new_data)
    new_inds = np.concatenate(new_inds)
    truncated = csr_matrix(
        (new_data, new_inds, new_ptrs),
        shape=similarity.shape
    )
    return truncated
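
To see what the truncation does, here is a small dense analogue of `truncate_similarity`. This is only an illustrative sketch: the helper `keep_topk_per_row` and the similarity values are hypothetical and not part of the seminar code.

```python
import numpy as np

def keep_topk_per_row(dense_sim, k):
    """Hypothetical dense analogue: keep the k largest entries in every row."""
    out = np.zeros_like(dense_sim)
    # indices of the k largest values per row (order inside the top-k is arbitrary)
    topk_idx = np.argpartition(dense_sim, -k, axis=1)[:, -k:]
    rows = np.arange(dense_sim.shape[0])[:, None]
    out[rows, topk_idx] = dense_sim[rows, topk_idx]
    return out

# symmetric toy similarity matrix with zero diagonal
sim = np.array([
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.5, 0.3],
    [0.8, 0.5, 0.0, 0.2],
    [0.1, 0.3, 0.2, 0.0],
])
truncated = keep_topk_per_row(sim, k=2)
print(truncated)
print('still symmetric:', np.array_equal(truncated, truncated.T))
```

Note that the truncated matrix is generally no longer symmetric: user 3 keeps user 1 as a neighbor, but user 1 does not keep user 3. As a consequence, row sums and column sums of the truncated matrix differ, which matters for the choice of weighting scheme.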

If $A$ is a matrix of ratings and $K$ is a user-similarity matrix ($k_{ij}\in[0, 1]$), then the KNN-scores matrix $R$ is computed as:

- for elementwise weighting:
$$
R=K A  \oslash\left(K B \right),\quad
b_{u i}=\left\{\begin{array}{lr}
1, & \text { if } a_{u i} \text { is known } \\
0 & \text { otherwise }
\end{array}\right.
$$

- for row-wise weighting:
$$
R=D_K^{-1} K A ,\quad
D_K=\operatorname{diag}(K\mathbf{e})
$$

- for unweighted case:
$$
R=K A
$$
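
The formulas above can be tried on a toy example. The sketch below uses dense NumPy arrays with made-up ratings and similarities (both are assumptions for illustration only). It also includes a column-wise variant, $R = K D^{-1} A$ with $D$ built from column sums, which mirrors the `user_similarity @ diags(Dk1) @ user_item_mtx` pattern used in the code of this notebook.

```python
import numpy as np

# Toy ratings: 3 users x 4 items, zeros mean "rating unknown"
A = np.array([
    [5., 0., 3., 0.],
    [4., 2., 0., 0.],
    [0., 1., 0., 4.],
])
# Hypothetical user-user similarities with a zero diagonal
K = np.array([
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
])
B = (A != 0).astype(float)  # indicator of known ratings

# unweighted: R = K A
R_unweighted = K @ A

# elementwise: R = (K A) / (K B), leaving 0 where no neighbor rated the item
KB = K @ B
R_elementwise = np.divide(K @ A, KB, out=np.zeros_like(KB), where=KB != 0)

# row-wise: R = D^{-1} K A with D = diag(K e)
d_row = K.sum(axis=1)
R_rowwise = (K @ A) / d_row[:, None]

# column-wise: R = K D^{-1} A with D built from column sums of K
d_col = K.sum(axis=0)
R_colwise = (K / d_col[None, :]) @ A

# row-wise weighting rescales each user's scores by a positive constant,
# so the item ranking of every user is the same as in the unweighted case
same_ranking = np.array_equal(
    np.argsort(-R_unweighted, axis=1, kind='stable'),
    np.argsort(-R_rowwise, axis=1, kind='stable'),
)
print('row-wise ranking equals unweighted ranking:', same_ranking)
```

This illustrates why row-wise weighting is pointless for top-$n$ recommendations in user-based KNN: it cannot change the order of items within a row, whereas elementwise and column-wise weighting can.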

In [None]:
def build_uknn_model(config, data, data_description):
    user_item_mtx = generate_interactions_matrix(data, data_description, rebase_users=False)

    # compute similarity matrix
    user_similarity = cosine_similarity_zd(user_item_mtx)
    # drop irrelevant users: keep only the most similar neighbors in each row
    if config['n_neighbors']:
        user_similarity = truncate_similarity(user_similarity, config['n_neighbors'])
    return user_item_mtx, user_similarity, config['weighting']


def uknn_model_scoring(params, testset, testset_description):
    """
    params = (user_item_mtx_interactions, user_similarity_matrix, weighting_type)
    """
    # assign scores to all items for the test users
    user_item_mtx, user_similarity, weighting = params

    # rebuild the interactions matrix from the test users' histories
    user_item_mtx = generate_interactions_matrix(
        testset, testset_description, rebase_users=False
    )

    # R_unw = K*A
    unweighted_scores = user_similarity.dot(user_item_mtx)

    if weighting is None:
        return unweighted_scores.toarray()

    if weighting == 'el_wise' and False:  # branch switched off via `and False`
        # R_elw = (K*A) / (K*B), elementwise division
        weights = np.abs(user_similarity).dot(user_item_mtx.astype('bool')).toarray() # why do we take the absolute value here?
        return np.divide(
            unweighted_scores.toarray(),
            weights,
            out=np.zeros_like(weights, dtype=float),
            where=weights!=0
        )

    if weighting == 'row_wise' and False:  # branch switched off via `and False`
        # R_rw = Dk^-1 * K * A
        weights = np.asarray(np.abs(user_similarity).sum(axis=1)).ravel()
        # `out` gives zero-weight entries a defined value instead of uninitialized memory
        Dk1 = np.divide(1., weights, out=np.zeros_like(weights), where=weights!=0)
        scores = diags(Dk1) @ user_similarity @ user_item_mtx
        return scores.toarray()

    if weighting == 'col_wise':
        # R_cw = K * Dk^-1 * A
        # NOTE: row sums are used as column weights here; they equal the column sums
        # only while K is symmetric, i.e. before neighborhood truncation
        weights = np.asarray(np.abs(user_similarity).sum(axis=1)).ravel()
        Dk1 = np.divide(1., weights, out=np.zeros_like(weights), where=weights!=0)
        scores = user_similarity @ diags(Dk1) @ user_item_mtx
        return scores.toarray()

    raise ValueError('Unrecognized weighting scheme')

In [None]:
%%time
import timeit

n_neighbors = 10
models_names_list = ['unweihgted', 'colwise_weihgted']
models_names_list += [i+f'_trunc{n_neighbors}' for i in models_names_list]

conf_list = [{'weighting': None, 'n_neighbors': None},
             {'weighting': 'col_wise', 'n_neighbors': None},
             {'weighting': None, 'n_neighbors': n_neighbors},
             {'weighting': 'col_wise', 'n_neighbors': n_neighbors}
             ]
models_param_list = []
models_scores_list = []

for name, conf in zip(models_names_list, conf_list):
    print(f'##########_____{name}_____##########')
    start = timeit.default_timer()

    uknn_params = build_uknn_model(conf, training, data_description)
    uknn_scores = uknn_model_scoring(uknn_params, testset, data_description)
    models_param_list.append(uknn_params)
    models_scores_list.append(uknn_scores)

    stop = timeit.default_timer()
    print(f'Time: {stop - start:.2f}')

##########_____unweihgted_____##########
Time: 36.26
##########_____colwise_weihgted_____##########
Time: 51.89
##########_____unweihgted_trunc10_____##########
Time: 5.81
##########_____colwise_weihgted_trunc10_____##########
Time: 7.08
CPU times: user 1min 26s, sys: 1.88 s, total: 1min 28s
Wall time: 1min 41s


### Evaluation (1 pts)

Generate top-$n$ recommendations for both models and calculate metrics.

A function that downvotes the items each user has already rated:

In [None]:
def downvote_seen_items(scores, data, data_description):
    assert isinstance(scores, np.ndarray), 'Scores must be a dense numpy array!'
    itemid = data_description['items']
    userid = data_description['users']
    # get indices of observed data, corresponding to scores array
    # we need to provide correct mapping of rows in scores array into
    # the corresponding user index (which is assumed to be sorted)
    #row_idx, test_users = pd.factorize(data[userid], sort=True)
    #print(len(test_users) , scores.shape[0])
    #assert len(test_users) == scores.shape[0]
    row_idx = data[userid].values
    col_idx = data[itemid].values
    # downvote scores at the corresponding positions
    scores[row_idx, col_idx] = scores.min() - 1

A function to calculate metrics from our predictions and the holdout:

In [None]:
def model_evaluate(recommended_items, holdout, holdout_description, topn=10):
    itemid = holdout_description['items']
    holdout_items = holdout[itemid].values
    assert recommended_items.shape[0] == len(holdout_items)
    hits_mask = recommended_items[:, :topn] == holdout_items.reshape(-1, 1)
    # HR calculation
    hr = np.mean(hits_mask.any(axis=1))
    # MRR calculation
    n_test_users = recommended_items.shape[0]
    hit_rank = np.where(hits_mask)[1] + 1.0
    mrr = np.sum(1 / hit_rank) / n_test_users
    # coverage calculation
    n_items = holdout_description['n_items']
    cov = np.unique(recommended_items).size / n_items
    return hr, mrr, cov
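
As a sanity check of the metric definitions, the same computations can be traced on a tiny hand-made example. The recommendation lists and holdout items below are invented for illustration.

```python
import numpy as np

# top-3 recommendations for three test users (toy values)
recommended_items = np.array([
    [3, 1, 4],   # user 0: holdout item 1 is at rank 2
    [0, 2, 5],   # user 1: holdout item 7 is not recommended
    [6, 3, 2],   # user 2: holdout item 6 is at rank 1
])
holdout_items = np.array([1, 7, 6])
n_items = 8  # assumed catalog size

hits_mask = recommended_items == holdout_items.reshape(-1, 1)
# hit rate: share of users whose holdout item is in the list
hr = hits_mask.any(axis=1).mean()
# mean reciprocal rank: misses contribute zero
hit_rank = np.where(hits_mask)[1] + 1.0
mrr = np.sum(1.0 / hit_rank) / len(holdout_items)
# coverage: share of the catalog that appears in any list
cov = np.unique(recommended_items).size / n_items

print(f'HR={hr:.3f}, MRR={mrr:.3f}, COV={cov:.3f}')
```

Two of three users get a hit (HR = 2/3), the reciprocal ranks are 1/2 and 1 (MRR = 0.5), and 7 of the 8 catalog items are recommended at least once (COV = 0.875).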

Get metrics from our models

In [None]:
metrics_list = []

for name, conf, scores in zip(models_names_list, conf_list, models_scores_list):
    print(f'##########_____{name}_____##########')
    print(conf)

    uknn_scores = scores.copy()
    # downvote items the user has already rated
    downvote_seen_items(uknn_scores, testset, data_description)
    # take top-N recommendations
    uknn_recs = topn_recommendations(uknn_scores)

    metrics = model_evaluate(uknn_recs[holdout.userid.tolist()], holdout, data_description, topn=10)
    metrics_list.append(metrics)
    print('HR={:.3}, MRR={:.3}, COV={:.3}\n'.format(*metrics))

##########_____unweihgted_____##########
{'weighting': None, 'n_neighbors': None}
HR=0.0474, MRR=0.0169, COV=0.0526

##########_____colwise_weihgted_____##########
{'weighting': 'col_wise', 'n_neighbors': None}
HR=0.0487, MRR=0.0174, COV=0.0508

##########_____unweihgted_trunc10_____##########
{'weighting': None, 'n_neighbors': 10}
HR=0.0831, MRR=0.0285, COV=0.365

##########_____colwise_weihgted_trunc10_____##########
{'weighting': 'col_wise', 'n_neighbors': 10}
HR=0.0826, MRR=0.0283, COV=0.378



In [None]:
n_neighbors_list =  list(range(1, 15)) + [20,25,30,40,50,60]

for n_neighbors in n_neighbors_list:
    print(f'##########_____N_neighbors:{n_neighbors}_____##########')
    start = timeit.default_timer()
    for weight in [None, 'col_wise']:
        conf = {'weighting': weight, 'n_neighbors': n_neighbors}
        uknn_params = build_uknn_model(conf, training, data_description)
        uknn_scores = uknn_model_scoring(uknn_params, testset, data_description)

        # downvote items the user has already rated
        downvote_seen_items(uknn_scores, testset, data_description)
        # take top-N recommendations
        uknn_recs = topn_recommendations(uknn_scores)
        metrics = model_evaluate(uknn_recs[holdout.userid.tolist()], holdout, data_description, topn=10)
        print(f'weighting:{weight}')
        print('HR={:.3}, MRR={:.3}, COV={:.3}\n'.format(*metrics))

    stop = timeit.default_timer()
    print(f'Time: {stop - start:.2f}')

##########_____N_neighbors:1_____##########
weighting:None
HR=0.0489, MRR=0.014, COV=0.701

weighting:col_wise
HR=0.0489, MRR=0.014, COV=0.701

Time: 9.40
##########_____N_neighbors:2_____##########
weighting:None
HR=0.0628, MRR=0.0225, COV=0.591

weighting:col_wise
HR=0.0611, MRR=0.0216, COV=0.599

Time: 8.74
##########_____N_neighbors:3_____##########
weighting:None
HR=0.0681, MRR=0.0251, COV=0.525

weighting:col_wise
HR=0.0696, MRR=0.0251, COV=0.536

Time: 10.02
##########_____N_neighbors:4_____##########
weighting:None
HR=0.0757, MRR=0.025, COV=0.479

weighting:col_wise
HR=0.075, MRR=0.0245, COV=0.49

Time: 10.08
##########_____N_neighbors:5_____##########
weighting:None
HR=0.0798, MRR=0.0269, COV=0.446

weighting:col_wise
HR=0.0787, MRR=0.0267, COV=0.454

Time: 9.08
##########_____N_neighbors:6_____##########
weighting:None
HR=0.0808, MRR=0.0263, COV=0.426

weighting:col_wise
HR=0.0821, MRR=0.026, COV=0.433

Time: 12.37
##########_____N_neighbors:7_____##########
weighting:None
HR

For n above 6 we get very good results. The main reason is that this is the weak generalization test: past a certain neighborhood size, the model already has enough information to reach this quality.

## Strong generalization test

- Recall that in the strong generalization test you work with the warm-start scenario.
- It means that the set of test users is disjoint from the set of users in the training.
- You're provided with the basic functions to help you perform correct splitting, but there're still a few places where your input is required. Make sure you understand the logic of data splitting in this scenario.

### Preparing data (2 pts)

- Your task is to select **a subset of users who have the most recent interactions in their history** across entire dataset. These are going to be the **test users**.
- You will apply **holdout splitting to only this subset**.
  - Think, why simply taking all users (as in weak generalization test) makes no sense in this scenario.

In [None]:
def split_by_time(data, time_q=0.95, timeid='timestamp'):
    '''
    Split the input `data` DataFrame into two parts based on the timestamp, with the split point
    being determined by the quantile value `time_q`. The function returns a tuple `(before, after)`
    containing the two DataFrames. The `after` DataFrame contains the rows with timestamps greater
    than or equal to the split point, while the `before` DataFrame contains the remaining rows.

    Details:
    The `quantile` method of the pandas Series is used to calculate the time point (i.e., timestamp)
    that divides the data into two parts based on the given quantile value `time_q`. Specifically,
    the time point `split_timepoint` is calculated as the `time_q`-quantile of the values in the `timeid`
    column of the `data` DataFrame, using the `nearest` interpolation method. This means that
    `split_timepoint` is an actual timestamp from the data, and approximately a fraction `time_q`
    of the data points occur before it.
    '''
    split_timepoint = data[timeid].quantile(q=time_q, interpolation='nearest')
    after = data.query(f'{timeid} >= @split_timepoint')
    before = data.drop(after.index)
    return before, after
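
The splitting logic can be illustrated on a toy dataframe. This is a minimal sketch with invented user ids and timestamps; it repeats the body of `split_by_time` inline and then shows which users would become test users.

```python
import pandas as pd

# toy interaction log: 5 users, 2 interactions each (made-up values)
toy = pd.DataFrame({
    'userid':    [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    'timestamp': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# same logic as in split_by_time, with time_q=0.8
split_timepoint = toy['timestamp'].quantile(q=0.8, interpolation='nearest')
after = toy[toy['timestamp'] >= split_timepoint]
before = toy.drop(after.index)

# users with at least one interaction after the timepoint become test users
test_users = after['userid'].unique()
# training data must not contain any history of the test users
trainset = before[~before['userid'].isin(test_users)]

print('split timepoint:', split_timepoint)
print('test users:', test_users)
print('training users:', trainset['userid'].unique())
```

User 3 has one interaction before the timepoint and one after it. That earlier interaction must go to the test history rather than to the training set, otherwise training and test users would overlap.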

First, you need to select a candidate subset of observations, from which you'll construct the training, testset, and holdout datasets. Check the `split_by_time` function and its description in the cell above.

In [None]:
data_ = get_movielens_data(include_time=True)
data, data_index = transform_indices(data_, 'userid', 'movieid')

data_description = dict(
    users = 'userid',
    items = 'movieid',
    feedback = 'rating',
    n_users = len(data.userid.unique()),
    n_items = len(data.movieid.unique())
)
data_description

{'users': 'userid',
 'items': 'movieid',
 'feedback': 'rating',
 'n_users': 6040,
 'n_items': 3706}

In [None]:
before, after = split_by_time(data, time_q=0.95)

- Now it's time to perform holdout sampling based on the obtained timepoint splitting.
- Remember, you only sample from the test users.
  - Test users's last ratings must be the most recent across the entire dataset. Use the global timepoint splitting obtained above.

Apply holdout sampling to `after` to create the holdout data and `testset_part_1`:

In [None]:
testset_part_1, holdout_ = leave_last_out(after, userid='userid', timeid='timestamp')

# verify correctness of time-based splitting,
# i.e., for each test user, the holdout contains only future interactions w.r.t to testset
verify_time_split(testset_part_1, holdout_)
assert len(set(testset_part_1.userid.unique()) - set(holdout_.userid.unique()))==0

- Prepare the data for training.
  - Take the corresponding part of the timepoint split.
  - Recall that **training and testset must be disjoint by users**.

For strong generalization we must then remove every user who appears in `after` (the holdout users) from the `before` history; what remains is our training set:

In [None]:
trainset_ = before[~before.userid.isin(after.userid.unique())]

All the removed rows form our `testset_part_2`:

In [None]:
testset_part_2 = before[before.userid.isin(after.userid.unique())]

- Note that `testset_part_` only contains interactions of the test users **after the timepoint**.
- You need to combine it with the remaining histories of these users.
  - i.e., everything that's filtered out from the training data

In [None]:
# combine all test users data into a single `testset_` Dataframe.
testset_ = pd.concat([testset_part_1, testset_part_2], axis=0, ignore_index=False)
assert len(set(testset_.userid)) == len((set(holdout_.userid)))
assert len(set(testset_.userid).difference((set(holdout_.userid)))) == 0
assert len(set(holdout_.userid).difference((set(testset_.userid)))) == 0

#### Building internal representation of user and item index

Use the `transform_indices` function for building a contiguous index starting from 0.

In [None]:
# reindex users to make a contiguous index starting from 0
# trainset, train_data_index = transform_indices(trainset_, 'userid', 'movieid')  # we don't need to reindex movies

train_user_idx, train_user_index = pd.factorize(trainset_['userid'], sort=True)

trainset = trainset_.copy()
trainset['userid'] = train_user_idx
assert len(set(trainset.userid)) == len(set(trainset_.userid))

In [None]:
train_data_description = dict(
    users = 'userid',
    items = 'movieid',
    feedback = 'rating',
    n_users = len(train_user_index),
    n_items = data_description['n_items']
)

- Before applying new index to the test data note that:
  - the users in the `testset` must be the same as the users in the `holdout`;
  - the users in both `testset` and `holdout` must be ordered the same way.
- Below is the corresponding function `align_test_by_users` that ensures these two datasets' alignment.

In [None]:
def align_test_by_users(testset, holdout):
    test_users = np.intersect1d(holdout['userid'].values, testset['userid'].values)
    # only allow the same users to be present in both datasets
    testset = testset.query('userid in @test_users').sort_values('userid')
    holdout = holdout.query('userid in @test_users').sort_values('userid')
    return testset, holdout

Let's apply new item index to test data and finalize the test split:

In [None]:
test_data_description = dict(
    users = 'userid',
    items = 'movieid',
    feedback = 'rating',
    n_users = len(testset_.userid.unique()),
    n_items = data_description['n_items']
)
print(test_data_description)

{'users': 'userid', 'items': 'movieid', 'feedback': 'rating', 'n_users': 813, 'n_items': 3706}


In [None]:
test_user_idx, test_user_index = pd.factorize(testset_['userid'], sort=True)

testset = testset_.copy()
testset['userid'] = test_user_idx

holdout_reind = holdout_.copy()
hold_user_indx, hold_user_index = pd.factorize(holdout_.userid, sort=True)
assert list(hold_user_index) == list(test_user_index)
holdout_reind['userid'] = hold_user_indx

- Think why we do not apply new index to users here.

### Models implementation

- In this section you'll need to implement user-based KNN models for the warm-start scenario.
- Think carefully which data must be generated at the build time and which data must be generated in the scoring function.
  - Recall that test users are not part of the training data.
- The notes on neighborhood subsampling remain the same as before.

#### Unweighted case (5 pts) and Weighted case (5 pts)

In [None]:
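def cosine_similarity_warm_start(matrix_test, matrix_train):
    '''Build cosine similarity matrix between test users (rows) and training users (columns).'''
    similarity = cosine_similarity(matrix_test, matrix_train, dense_output=False)
    # NOTE: test and training users are disjoint, so the diagonal of this rectangular
    # matrix has no special meaning; zeroing it discards one arbitrary pair per test user
    similarity.setdiag(0)
    similarity.eliminate_zeros()
    return similarity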

In [None]:
def build_uknn_model(config, data, data_description):
    user_item_mtx = generate_interactions_matrix(data, data_description, rebase_users=False)
    # compute similarity matrix
    user_similarity = cosine_similarity_zd(user_item_mtx)
    # drop irrelevant users: keep only the most similar neighbors in each row
    if config['n_neighbors']:
        user_similarity = truncate_similarity(user_similarity, config['n_neighbors'])
    return user_item_mtx, user_similarity, config

def uknn_model_scoring(params, testset, testset_description):
    """
    params = (train_user_item_mtx, train_user_similarity, config)
    """
    # implement the scoring function to assign scores
    # to all items for test users
    user_item_mtx, user_similarity_train, config = params

    # write your code for scoring, don't forget to return a dense array
    user_item_mtx_test = generate_interactions_matrix(
        testset, testset_description, rebase_users=True
    )
    user_similarity = cosine_similarity_warm_start(user_item_mtx_test, user_item_mtx) # test_users*train_users
    if config['n_neighbors']:
        user_similarity = truncate_similarity(user_similarity, config['n_neighbors'])

    # R_unw = K*A
    unweighted_scores = user_similarity.dot(user_item_mtx)

    if config['weighting'] is None:
        return unweighted_scores.toarray()

    if config['weighting'] == 'col_wise':
        # R_cw = K * Dk^-1 * A, where Dk holds the column sums of |K|
        weights = np.asarray(np.abs(user_similarity).sum(axis=0)).ravel()
        # `out` gives zero-weight columns a defined value instead of uninitialized memory
        Dk1 = np.divide(1., weights, out=np.zeros_like(weights), where=weights!=0)
        scores = user_similarity @ diags(Dk1) @ user_item_mtx
        return scores.toarray()


    raise ValueError('Unrecognized weighting scheme')

In [None]:
%%time
import timeit

n_neighbors = 20
models_names_list = ['unweihgted', 'colwise_weihgted']
models_names_list += [i+f'_trunc{n_neighbors}' for i in models_names_list]

conf_list = [{'weighting': None, 'n_neighbors': None},
             {'weighting': 'col_wise', 'n_neighbors': None},
             {'weighting': None, 'n_neighbors': n_neighbors},
             {'weighting': 'col_wise', 'n_neighbors': n_neighbors}
             ]
models_param_list = []
models_scores_list = []

for name, conf in zip(models_names_list, conf_list):
    print(f'##########_____{name}_____##########')
    start = timeit.default_timer()

    uknn_params = build_uknn_model(conf, trainset, data_description)
    uknn_scores = uknn_model_scoring(uknn_params, testset, data_description)
    models_param_list.append(uknn_params)
    models_scores_list.append(uknn_scores)

    stop = timeit.default_timer()
    print(f'Time: {stop - start:.2f}')

##########_____unweihgted_____##########
Time: 6.02
##########_____colwise_weihgted_____##########
Time: 11.74
##########_____unweihgted_trunc20_____##########
Time: 3.72
##########_____colwise_weihgted_trunc20_____##########
Time: 3.66
CPU times: user 20.9 s, sys: 775 ms, total: 21.7 s
Wall time: 25.1 s


### Evaluation (1 pts)

Generate recommendations for both models and calculate metrics.

In [None]:
metrics_list = []

for name, conf, scores in zip(models_names_list, conf_list, models_scores_list):
    print(f'##########_____{name}_____##########')
    print(conf)

    uknn_scores = scores.copy()
    # downvote items the user has already rated
    downvote_seen_items(uknn_scores, testset, data_description)
    # take top-N recommendations
    uknn_recs = topn_recommendations(uknn_scores)

    metrics = model_evaluate(uknn_recs[holdout_reind.userid.tolist()], holdout_reind, data_description, topn=10)
    metrics_list.append(metrics)
    print('HR={:.3}, MRR={:.3}, COV={:.3}\n'.format(*metrics))

##########_____unweihgted_____##########
{'weighting': None, 'n_neighbors': None}
HR=0.0443, MRR=0.0175, COV=0.0515

##########_____colwise_weihgted_____##########
{'weighting': 'col_wise', 'n_neighbors': None}
HR=0.0492, MRR=0.0186, COV=0.0502

##########_____unweihgted_trunc20_____##########
{'weighting': None, 'n_neighbors': 20}
HR=0.0541, MRR=0.0167, COV=0.166

##########_____colwise_weihgted_trunc20_____##########
{'weighting': 'col_wise', 'n_neighbors': 20}
HR=0.0603, MRR=0.016, COV=0.237



### Tuning (2 pts)
- Try to find a neighborhood size that gives you better results.
- Perform a simple grid-search experiment and report your findings.
- Optional: try improving results with a different similarity measure.

We will grid-search the neighborhood size using the best model (column-wise weighting).

In [None]:
n_neighbors_list =  list(range(1, 60))

for n_neighbors in n_neighbors_list:
    print(f'##########_____N_neighbors:{n_neighbors}_____##########')
    start = timeit.default_timer()

    conf = {'weighting': 'col_wise', 'n_neighbors': n_neighbors}
    uknn_params = build_uknn_model(conf, trainset, data_description)
    uknn_scores = uknn_model_scoring(uknn_params, testset, data_description)

    # downvote items the user has already rated
    downvote_seen_items(uknn_scores, testset, data_description)
    # take top-N recommendations
    uknn_recs = topn_recommendations(uknn_scores)

    metrics = model_evaluate(uknn_recs[holdout_reind.userid.tolist()], holdout_reind, data_description, topn=10)
    print('HR={:.3}, MRR={:.3}, COV={:.3}\n'.format(*metrics))

    stop = timeit.default_timer()
    print(f'Time: {stop - start:.2f}')

##########_____N_neighbors:1_____##########


HR=0.0234, MRR=0.00838, COV=0.407

Time: 3.58
##########_____N_neighbors:2_____##########
HR=0.048, MRR=0.0188, COV=0.34

Time: 3.22
##########_____N_neighbors:3_____##########
HR=0.0443, MRR=0.0138, COV=0.32

Time: 4.26
##########_____N_neighbors:4_____##########
HR=0.0529, MRR=0.0158, COV=0.311

Time: 3.42
##########_____N_neighbors:5_____##########
HR=0.0504, MRR=0.0146, COV=0.298

Time: 3.25
##########_____N_neighbors:6_____##########
HR=0.0517, MRR=0.0143, COV=0.289

Time: 3.33
##########_____N_neighbors:7_____##########
HR=0.0443, MRR=0.0156, COV=0.279

Time: 4.87
##########_____N_neighbors:8_____##########
HR=0.048, MRR=0.0172, COV=0.275

Time: 3.96
##########_____N_neighbors:9_____##########
HR=0.0517, MRR=0.0201, COV=0.274

Time: 4.32
##########_____N_neighbors:10_____##########
HR=0.0529, MRR=0.0195, COV=0.267

Time: 6.59
##########_____N_neighbors:11_____##########
HR=0.0517, MRR=0.0186, COV=0.264

Time: 4.07
##########_____N_neighbors:12_____##########
HR=0.0541, MRR=0.0173

We get fairly good results already with n = 4, 5, 6. The best results are at n = 19, 20 and at n = 51-56.

In the first range, the small neighborhood cuts off unnecessary neighbors. In the second, a medium-sized neighborhood apparently brings back useful information from users that were discarded before, noticeably improving the metric. In the third, many neighbors are used and HR stabilizes on a plateau around 0.057-0.06.

## Final analysis (3 pts)

1. Provide an analysis on which model performs the best and explain why.
2. Explain the difference in computational complexity of your models. Consider how the training and the recommendation generation differ for different models in terms of
    - the amount of RAM,
    - the amount of disk storage,
    - the load on CPU.
3. How else would you modify the model to improve either the quality of recommendations or computational performance? Describe at least one modification and its envisioned effect.

1

The best performance was shown by the models in the weak generalization test. This is expected: there the model is trained on data from the very users it is later evaluated on, whereas in the strong generalization test we work in the warm-start scenario. That is not a cold start, of course, but the model was never fitted on the test users' data.

2

The models are fairly similar and do not differ much.

During training, we build the interaction matrix from a dataframe and compute the similarity matrix $K$, which costs $O(m^2 n)$ for $m$ users and $n$ items. This happens in both strong and weak generalization; the only difference is that in strong generalization $m$ is slightly smaller, because the test users are removed from the training data.

When generating recommendations (weak generalization), the unweighted version is cheaper, since we only multiply two matrices, $KA$. In the weighted case we take a product of three matrices, $K D^{-1} A$, where $D^{-1}$ must also be computed, so this variant is heavier in time, memory, and CPU.

Generating recommendations in the strong generalization setting is lighter, because we compute only with the test similarity matrix $K_{test}$, which is significantly smaller than $K$ (less time, CPU, and memory; compare the run times).

P.S. It is worth noting that we use neighbor sampling, which simplifies the computations both when building the model and when predicting (compare the run times).
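The size difference between $K$ and $K_{test}$ can be illustrated with a small sketch. It assumes that in the strong generalization setting the similarity is computed between test users and train users; the helper `cosine_sim` and the random toy matrices are hypothetical and are not the notebook's actual code or data.

```python
import numpy as np
from scipy.sparse import diags, random as sparse_random

def cosine_sim(A, B):
    """Cosine similarity between rows of sparse A and rows of sparse B (hypothetical helper)."""
    def row_normalize(M):
        norms = np.sqrt(np.asarray(M.multiply(M).sum(axis=1)).ravel())
        norms[norms == 0] = 1.0  # keep empty rows as all zeros
        return diags(1.0 / norms) @ M
    return row_normalize(A) @ row_normalize(B).T

# toy data: 200 train users, 20 test users, 50 items
A_train = sparse_random(200, 50, density=0.1, format='csr', random_state=0)
A_test = sparse_random(20, 50, density=0.1, format='csr', random_state=1)

K = cosine_sim(A_train, A_train)      # weak generalization: train x train
K_test = cosine_sim(A_test, A_train)  # strong generalization: test x train

scores_weak = K @ A_train             # scores for every train user
scores_strong = K_test @ A_train      # scores only for the test users

print(K.shape, K_test.shape, scores_strong.shape)
```

With these toy sizes the test similarity matrix has ten times fewer rows than the full one, which is the source of the lighter scoring step described above.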



3

Of course, the first thing that comes to mind is to add contextual information to the model, so that it takes into account not only views (ratings) but also additional information about users and films (genre, director, city, age, etc.). In this way it is possible to improve the values of the computed metrics.

We can also use a different similarity function, not only cosine.
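As one possible alternative to cosine, here is a minimal sketch of Jaccard similarity between users, computed on the binarized interaction matrix. The function name `jaccard_similarity` and the toy matrix are illustrative assumptions; this variant was not evaluated in the homework.

```python
import numpy as np
from scipy.sparse import csr_matrix

def jaccard_similarity(A):
    """Dense user-user Jaccard similarity from a sparse interaction matrix."""
    B = csr_matrix(A).astype(bool).astype(np.float64)  # 1.0 where an interaction exists
    B.eliminate_zeros()
    intersection = (B @ B.T).toarray()                 # |I_u & I_v|
    sizes = np.asarray(B.sum(axis=1)).ravel()          # |I_u|
    union = sizes[:, None] + sizes[None, :] - intersection
    with np.errstate(divide='ignore', invalid='ignore'):
        sim = np.where(union > 0, intersection / union, 0.0)
    return sim

# toy example: 3 users, 4 items
A = csr_matrix(np.array([[5, 0, 3, 0],
                         [4, 0, 0, 0],
                         [0, 2, 0, 1]]))
sim = jaccard_similarity(A)
print(sim)
```

Unlike cosine on raw ratings, this measure ignores rating values and only looks at the overlap of the sets of consumed items.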

# Problem 3 (20 pts)

* Using the code from seminars, implement an efficient version of weighted matrix factorization based on SGD for the strong generalization test. Reuse the code from the KNN part of the homework. Recall that you need to implement folding-in for this scenario.

* To make the model learn better, implement negative sampling as well. You can use a 1:1 ratio of negative samples, i.e., 1 negative example for each positive example.
* Try slightly tuning the hyper-parameters of the model (i.e., rank, learning rate, regularization, negative samples ratio) to obtain better recommendations quality.
* Report the results.

In [None]:
def generate_interactions_matrix(data, data_description, rebase_users=False):
    '''
    Convert pandas dataframe with interactions into a sparse CSR user-item
    matrix, where the value of each cell is the rating given by the user.
    Allows reindexing user ids, which helps ensure data consistency
    at the scoring stage (assumes user ids are sorted in scoring array).
    '''
    n_users = data_description['n_users']
    n_items = data_description['n_items']
    # get indices of observed data
    user_idx = data[data_description['users_col']].values
    user_index = []
    if rebase_users:
        user_idx, user_index = pd.factorize(user_idx, sort=True)
        n_users = len(user_index)
    item_idx = data[data_description['items_col']].values
    feedback = data[data_description['feedback_col']].values
    # construct rating matrix
    return csr_matrix((feedback, (user_idx, item_idx)), shape=(n_users, n_items)), user_index

Download data, and reindex users and items

In [None]:
data_ = get_movielens_data(include_time=True)
data, data_index = transform_indices(data_, 'userid', 'movieid')

In [None]:
data_description = dict(
    users_col = 'userid',
    items_col = 'movieid',
    feedback_col = 'rating',
    n_users = len(data.userid.unique()),
    n_items = len(data.movieid.unique())
)
data_description

{'users_col': 'userid',
 'items_col': 'movieid',
 'feedback_col': 'rating',
 'n_users': 6040,
 'n_items': 3706}

In [None]:
def split_by_time(data, time_q=0.95, timeid='timestamp'):
    '''
    Split the input `data` DataFrame into two parts based on the timestamp, with the split point
    being determined by the quantile value `time_q`. The function returns a tuple `(before, after)`
    containing the two DataFrames. The `after` DataFrame contains the rows with timestamps greater
    than or equal to the split point, while the `before` DataFrame contains the remaining rows.

    Details:
    The `quantile` method of the pandas DataFrame is used to calculate the time point (i.e., timestamp)
    that divides the data into two parts based on the given quantile value `time_q`. Specifically,
    the time point `split_timepoint` is calculated as the `time_q`th quantile of the values in the `timeid`
    column of the `data` DataFrame, using the interpolation method of `nearest`. This means that
    `split_timepoint` is the timestamp at or immediately after which `time_q` percent of the data points occur.
    '''
    split_timepoint = data[timeid].quantile(q=time_q, interpolation='nearest')
    after = data.query(f'{timeid} >= @split_timepoint')
    before = data.drop(after.index)
    return before, after
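To see how the split point is chosen, here is a small sketch on a toy dataframe. The data is invented for illustration; boolean indexing is used instead of `query`, which should be equivalent here.

```python
import pandas as pd

toy = pd.DataFrame({
    'userid':    [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    'timestamp': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# 'nearest' picks an existing timestamp instead of interpolating between two
split_timepoint = toy['timestamp'].quantile(q=0.8, interpolation='nearest')

after = toy[toy['timestamp'] >= split_timepoint]  # latest interactions
before = toy.drop(after.index)                    # everything earlier

print(split_timepoint, len(before), len(after))
```

Note that a user can appear in both parts (user 3 in this toy data), which is why the later steps filter `before` by the users found in `after`.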

Split the data at the split timepoint into `before` and `after`

In [None]:
before, after = split_by_time(data, time_q=0.95)

Apply holdout sampling to `after` to create the holdout data and `testset_part_1`

In [None]:
testset_part_1, holdout_ = leave_last_out(after, userid='userid', timeid='timestamp')

# verify correctness of time-based splitting,
# i.e., for each test user, the holdout contains only future interactions w.r.t to testset
verify_time_split(testset_part_1, holdout_)
assert len(set(testset_part_1.userid.unique()) - set(holdout_.userid.unique()))==0

testset_part_2 = before[before.userid.isin(after.userid.unique())]

Then, for strong generalization, we must delete all `after` (holdout) users from the `before` history; this gives our trainset

In [None]:
trainset_ = before[~before.userid.isin(after.userid.unique())]

All the deleted rows form `testset_part_2`

In [None]:
testset_part_2 = before[before.userid.isin(after.userid.unique())]

Combining `testset_part_1` and `testset_part_2` gives the testset, which is used to make predictions for the test users

In [None]:
# combine all test users data into a single `testset_` Dataframe.
testset_ = pd.concat([testset_part_1, testset_part_2], axis=0, ignore_index=False)
assert len(set(testset_.userid)) == len((set(holdout_.userid)))
assert len(set(testset_.userid).difference((set(holdout_.userid)))) == 0
assert len(set(holdout_.userid).difference((set(testset_.userid)))) == 0

Build the internal representation of the user and item index. We must build a contiguous index for the train users starting from 0, as required by the matrix representation. Do not change `movieid`: it was already reindexed when the data was downloaded.

In [None]:
# reindex data to make contiguous index starting from 0 for user and item IDs
#trainset, train_data_index = transform_indices(trainset_, 'userid', 'movieid')  # we don't need to reindex the movies

train_user_idx, train_user_index = pd.factorize(trainset_['userid'], sort=True)

trainset = trainset_.copy()
trainset['userid'] = train_user_idx
assert len(set(trainset.userid)) == len(set(trainset_.userid))

Write a function that creates negative samples

In [None]:
import pandas as pd
import numpy as np

def create_negative_samples(data, data_description, rating=0, gamma=1):
    '''
    For each user, sample `gamma * (number of rated items)` items the user
    has not interacted with and assign them the given `rating`.
    '''
    itemsid_set = set(range(data_description['n_items']))

    # per user: list of randomly chosen unseen items (sampled without replacement)
    df_users_no_rating_items_ = data.groupby(['userid']).agg({'movieid': lambda x: np.random.choice(
        list(itemsid_set-set(x)),
        size=int(gamma*len(set(x))),
        replace=False).tolist()})

    df_users_no_rating_items = df_users_no_rating_items_.reset_index()
    negative_samples = []
    #assert df_users_no_rating_items.index.tolist() == data.userid.unique().tolist()

    # flatten the per-user lists into (userid, movieid, rating) rows
    for user_id, noitems_list in df_users_no_rating_items.values:
        for item_id in noitems_list:
            negative_samples.append([user_id, item_id, rating])

    data_negative = pd.DataFrame(negative_samples, columns=['userid', 'movieid', 'rating'])
    return data_negative
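The idea of the sampler can be checked on a toy dataframe. The sketch below is a simplified, seeded variant written for illustration: the name `sample_negatives`, the `seed` argument, and the cap on the sample size are assumptions and differ from the function above.

```python
import numpy as np
import pandas as pd

def sample_negatives(data, n_items, gamma=1.0, rating=0, seed=0):
    """For each user, draw unseen items without replacement (illustrative variant)."""
    rng = np.random.default_rng(seed)
    all_items = np.arange(n_items)
    rows = []
    for user, seen in data.groupby('userid')['movieid'].unique().items():
        candidates = np.setdiff1d(all_items, seen)  # items the user never rated
        size = min(int(gamma * len(seen)), len(candidates))
        for item in rng.choice(candidates, size=size, replace=False):
            rows.append((user, int(item), rating))
    return pd.DataFrame(rows, columns=['userid', 'movieid', 'rating'])

toy = pd.DataFrame({
    'userid':  [0, 0, 0, 1, 1],
    'movieid': [0, 1, 2, 3, 4],
    'rating':  [5, 4, 3, 2, 5],
})
negatives = sample_negatives(toy, n_items=10, gamma=1.0)
print(negatives)
```

With `gamma=1.0` each user gets as many negatives as positives (3 and 2 here), and none of the sampled pairs coincides with an observed interaction.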

Create trainset_with_negative

In [None]:
trainset_negative = create_negative_samples(trainset, data_description, rating=0, gamma=1)
trainset_with_negative = pd.concat([trainset,trainset_negative])

In [None]:
train_data_description = dict(
    users_col = 'userid',
    items_col = 'movieid',
    feedback_col = 'rating',
    n_users = len(train_user_index),
    n_items = data_description['n_items']
)

sgd_config = dict(
    learning_rate = 0.002,
    regularization = 1,
    n_epochs = 25,
    rank = 35,
    seed = 16
)

Our main implementation of weighted matrix factorization based on SGD (weak weighting scheme)

In [None]:
from numba import njit, objmode, prange


def mf_sgd_build(config, data, data_description):
    useridx = data[data_description['users_col']].values
    itemidx = data[data_description['items_col']].values
    ratings = data[data_description['feedback_col']].values
    learning_rate = config['learning_rate']
    regularization = config['regularization']
    n_epochs = config['n_epochs']
    rank = config['rank']

    n_users = data_description['n_users']
    n_items = data_description['n_items']
    rng = np.random.default_rng(config.get('seed', None))

    P, Q, mse_history, bestP, bestQ  = sgd_epochs(
        useridx, itemidx, ratings,
        learning_rate, regularization, n_epochs,
        rank, n_users, n_items,
        rng
    )
    return P, Q, mse_history, bestP, bestQ

@njit
def sgd_epochs(
    useridx, itemidx, ratings,
    learning_rate, regularization, n_epochs,
    rank, n_users, n_items,
    rng
):
    P = rng.normal(0, 0.01, (n_users, rank))
    Q = rng.normal(0, 0.01, (n_items, rank))
    print('SGD start')
    best_mse = 1e3
    history = []
    for epoch in range(n_epochs):
        mse = sgd_step(P, Q, useridx, itemidx, ratings, learning_rate, regularization, rng)
        history.append(mse)
        if mse < best_mse:
            best_mse = mse
            # copy, otherwise bestP/bestQ alias P/Q and follow later in-place updates
            bestP, bestQ = P.copy(), Q.copy()
    return P, Q, history, bestP, bestQ

def check_metric_growth(testset, holdout, data_description):
    def update_target_metric(metrics):
        hr, mrr, cov = metrics
        eval_callback.target_metrics.append(hr)

    def eval_callback(epoch, P, Q):
        mf_params = P, Q, None
        sgd_scores = mf_sgd_scoring(mf_params, None, data_description)
        downvote_seen_items(sgd_scores, testset, data_description)
        sgd_recs = topn_recommendations(sgd_scores, topn=10)
        metrics = model_evaluate(sgd_recs, holdout, data_description)
        update_target_metric(metrics)
        stopping_criteria = 1
        if len(eval_callback.target_metrics) >= 2:
            stopping_criteria = eval_callback.target_metrics[-1] > eval_callback.target_metrics[-2]
        return int(stopping_criteria)
    eval_callback.target_metrics = []
    return eval_callback

@njit
def sgd_step(P, Q, useridx, itemidx, ratings, learning_rate, regularization, rng):
    n_interactions = len(ratings)
    squared_err = 0.

    # we iterate only over observed interactions, so this is the weak weighting scheme
    for idx in rng.permutation(n_interactions):
        userid = useridx[idx]
        itemid = itemidx[idx]
        rating = ratings[idx]

        pi = P[userid]
        qj = Q[itemid]
        error = rating - pi @ qj

        pi += learning_rate * (error*qj - regularization*pi)
        qj += learning_rate * (error*pi - regularization*qj)

        squared_err += error*error

    mse = squared_err / n_interactions
    return mse

def mf_sgd_scoring(params, data, data_description):
    P, Q, _ = params
    test_users = data_description['test_users']
    scores = P[test_users] @ Q.T
    return scores
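For reference, the per-interaction update in `sgd_step` appears to correspond to a gradient step on a regularized squared error. Here $\eta$ stands for `learning_rate`, $\lambda$ for `regularization`, and the symbol $\ell_{ij}$ is introduced only for this note; constant factors are assumed to be absorbed into $\eta$.

$$
\begin{aligned}
\ell_{ij} &= \tfrac{1}{2}\left(r_{ij} - p_i^\top q_j\right)^2 + \tfrac{\lambda}{2}\left(\lVert p_i \rVert^2 + \lVert q_j \rVert^2\right), \\
e_{ij} &= r_{ij} - p_i^\top q_j, \\
p_i &\leftarrow p_i + \eta \left(e_{ij}\, q_j - \lambda\, p_i\right), \\
q_j &\leftarrow q_j + \eta \left(e_{ij}\, p_i - \lambda\, q_j\right).
\end{aligned}
$$

In the code, `pi` is a view into `P` and is updated in place, so the update of `qj` uses the already updated `pi` rather than its value from the start of the step.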

Train the model to obtain the P and Q matrices

In [None]:
%%time
sgd_params = mf_sgd_build(sgd_config, trainset, train_data_description)

CPU times: user 14.3 s, sys: 84.6 ms, total: 14.4 s
Wall time: 16.9 s


In [None]:
_, _, _, P, Q = sgd_params
print('MSE_history:', sgd_params[2])
assert Q.shape[0] == data_description['n_items']

MSE_history: [14.151049458218015, 14.14813069205689, 14.036691091129663, 11.657425864984777, 5.674298276504863, 3.368613271996213, 2.649693742459554, 2.335175087785384, 2.1699710284773532, 2.07051803255666, 2.007889219652979, 1.9674925556202039, 1.9387958003683479, 1.9168311362855928, 1.901736670786542, 1.8897402345114196, 1.882402770906767, 1.8742419307232039, 1.8684641680793952, 1.8664384916481511]


We re-index the test users so that we can then build an interaction matrix. Don't forget to also re-index holdout_ accordingly later, when we evaluate the predictions

In [None]:
test_data_description = dict(
    users_col = 'userid',
    items_col = 'movieid',
    feedback_col = 'rating',
    n_users = len(testset_.userid.unique()),
    n_items = data_description['n_items']
)
print(test_data_description)

test_int_matrix, test_user_index = generate_interactions_matrix(testset_, test_data_description, rebase_users=True)

{'users_col': 'userid', 'items_col': 'movieid', 'feedback_col': 'rating', 'n_users': 813, 'n_items': 3706}


Write a function that computes the p vector and makes predictions for a new user

In [None]:
def create_newuser_p_vector(new_user_intr_vector, Q, config):
    '''
    Folding-in: learn the latent vector of a new user with SGD
    while keeping the item factors Q fixed.
    '''
    regularization = config['regularization']
    learning_rate = config['learning_rate']
    n_epohs = config['n_epohs']

    assert len(new_user_intr_vector.shape) == 1
    item_indexes = np.nonzero(new_user_intr_vector)[0]  # items rated by the user

    rng = np.random.default_rng(25)
    pi  = rng.normal(0, 0.01, (Q.shape[1])) # shape (d,)
    rating = new_user_intr_vector.copy()

    for i in range(n_epohs):
        np.random.shuffle(item_indexes)
        for idx in item_indexes:
            qj = Q[idx] # shape (d,)
            error = rating[idx] - pi @ qj # number
            pi += learning_rate * (error*qj - regularization*pi)

    ratingsi = pi @ Q.T  # predicted scores for all items
    return pi, ratingsi

Make prediction for all test users

In [None]:
%%time
compute_new_user_p_config = dict(
    regularization = 1,
    learning_rate = 0.003,
    n_epohs = 25
)

test_p_vectors = np.zeros((test_int_matrix.A.shape[0], sgd_config['rank']))
test_rating_pred = np.zeros((test_int_matrix.A.shape[0], data_description['n_items']))
for i in tqdm.tqdm(range(test_int_matrix.A.shape[0])):
    p_i, ratings_i = create_newuser_p_vector(test_int_matrix.A[i], Q=Q, config=compute_new_user_p_config)
    test_p_vectors[i] = p_i
    test_rating_pred[i] = ratings_i

100%|██████████| 813/813 [02:03<00:00,  6.61it/s]

CPU times: user 1min 42s, sys: 46.1 s, total: 2min 28s
Wall time: 2min 3s
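The loop above calls `test_int_matrix.A` on every iteration, and each such call builds a new dense copy of the whole matrix, which likely accounts for much of the run time. A possible way to avoid this is to read each user's row directly from the CSR arrays. The sketch below assumes the same update rule as `create_newuser_p_vector`; the names `fold_in_user` and `fold_in_all` and the toy data are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

def fold_in_user(item_idx, ratings, Q, learning_rate=0.002,
                 regularization=1.0, n_epochs=15, seed=25):
    """SGD folding-in for one user with the item factors Q kept fixed."""
    rng = np.random.default_rng(seed)
    p = rng.normal(0, 0.01, Q.shape[1])
    order = np.arange(len(item_idx))
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit the user's items in random order
        for k in order:
            q = Q[item_idx[k]]
            err = ratings[k] - p @ q
            p += learning_rate * (err * q - regularization * p)
    return p

def fold_in_all(matrix, Q, **kwargs):
    """Fold in every row of a sparse matrix without densifying it."""
    matrix = matrix.tocsr()
    P_new = np.zeros((matrix.shape[0], Q.shape[1]))
    for u in range(matrix.shape[0]):
        start, end = matrix.indptr[u], matrix.indptr[u + 1]
        P_new[u] = fold_in_user(matrix.indices[start:end],
                                matrix.data[start:end], Q, **kwargs)
    return P_new

# toy example: 3 test users, 6 items, rank 4
Q_toy = np.random.default_rng(0).normal(0, 0.1, (6, 4))
R_toy = csr_matrix(np.array([[5, 0, 3, 0, 0, 0],
                             [0, 0, 0, 0, 0, 0],
                             [0, 4, 0, 0, 1, 2]], dtype=float))
P_new = fold_in_all(R_toy, Q_toy, n_epochs=5)
scores = P_new @ Q_toy.T  # one dense product for all test users
print(scores.shape)
```

One difference to keep in mind: this version iterates over all stored entries of the CSR matrix, so explicitly stored zeros would be treated as observations, whereas `np.nonzero` on a dense row skips them.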





Remove the items that the user has already rated from the recommendations

In [None]:
def downvote_seen_items(scores, data, data_description, test_user_index=test_user_index):
    assert isinstance(scores, np.ndarray), 'Scores must be a dense numpy array!'
    itemid = data_description['items_col']
    userid = data_description['users_col']
    # get indices of observed data, corresponding to scores array
    # we need to provide correct mapping of rows in scores array into
    # the corresponding user index (which is assumed to be sorted)
    row_idx, test_users = pd.factorize(data[userid], sort=True)
    assert len(test_users) == scores.shape[0] == data_description['n_users']
    assert list(test_user_index) == list(test_users)
    col_idx = data[itemid].values
    # downvote scores at the corresponding positions
    scores[row_idx, col_idx] = scores.min() - 1

In [None]:
sgd_scores = test_rating_pred.copy()
downvote_seen_items(sgd_scores, testset_, test_data_description)

Take the top-N recommendations

In [None]:
sgd_recs = topn_recommendations(sgd_scores, topn=10)

Before evaluating the model, we need to re-index the holdout data by test_user_index and sort it to keep it consistent

In [None]:
holdout_reind = holdout_.copy()
hold_user_indx, hold_user_index = pd.factorize(holdout_.userid, sort=True)
assert list(hold_user_index) == list(test_user_index)
holdout_reind['userid'] = hold_user_indx

Function that calculates the metrics

In [None]:
def model_evaluate(recommended_items, holdout, holdout_description, topn=10):
    itemid = holdout_description['items_col']
    holdout_items = holdout[itemid].values
    assert recommended_items.shape[0] == len(holdout_items)
    hits_mask = recommended_items[:, :topn] == holdout_items.reshape(-1, 1)
    # HR calculation
    hr = np.mean(hits_mask.any(axis=1))
    # MRR calculation
    n_test_users = recommended_items.shape[0]
    hit_rank = np.where(hits_mask)[1] + 1.0
    mrr = np.sum(1 / hit_rank) / n_test_users
    # coverage calculation
    n_items = holdout_description['n_items']
    cov = np.unique(recommended_items).size / n_items
    return hr, mrr, cov
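A small worked example may help to check the metric definitions by hand. The arrays below are invented for illustration; the computation follows the same steps as `model_evaluate`.

```python
import numpy as np

# top-3 recommendations for 3 test users, and one holdout item per user
recommended_items = np.array([[1, 2, 3],
                              [4, 5, 6],
                              [7, 8, 9]])
holdout_items = np.array([2, 9, 0])
n_items = 10

hits_mask = recommended_items == holdout_items.reshape(-1, 1)
hr = hits_mask.any(axis=1).mean()                   # share of users with a hit
hit_rank = np.where(hits_mask)[1] + 1.0             # 1-based position of each hit
mrr = np.sum(1 / hit_rank) / len(holdout_items)     # users without a hit contribute 0
cov = np.unique(recommended_items).size / n_items   # share of the catalog recommended

print(hr, mrr, cov)
```

Only the first user has a hit, at position 2, so HR = 1/3 and MRR = (1/2)/3 = 1/6; nine distinct items out of ten are recommended, so COV = 0.9.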

In [None]:
metrics = model_evaluate(sgd_recs, holdout_reind.sort_values(by=['userid']), test_data_description, topn=10)
print(f'HR: {metrics[0]}, \nMRR: {metrics[1]}, \nCOV: {metrics[2]}')

HR: 0.01968019680196802, 
MRR: 0.004327983756028037, 
COV: 0.01645979492714517


### Test several variants of models

In [None]:
trainset_with_negative05 = pd.concat([trainset, create_negative_samples(trainset, data_description, rating=0, gamma=0.5)])

In [None]:
models_names = ["sgd_n20_r25_pn15",
                "sgd_n20_r25_pn15_negative1",
                "sgd_n20_r25_pn15_negative0.5",
                "sgd_n40_r25_pn15",
                "sgd_n20_r50_pn15",
                "sgd_n20_r25_pn30",
                "sgd_n40_r50_pn15_negative1",
                "sgd_n40_r50_pn15_negative0.5"
                ]

sgd_config_list = [{'learning_rate': 0.002, 'regularization': 1, 'n_epochs': 20, 'rank': 25, 'seed': 16},
                   {'learning_rate': 0.002, 'regularization': 1, 'n_epochs': 20, 'rank': 25, 'seed': 16},
                   {'learning_rate': 0.002, 'regularization': 1, 'n_epochs': 20, 'rank': 25, 'seed': 16},
                   {'learning_rate': 0.002, 'regularization': 1, 'n_epochs': 40, 'rank': 32, 'seed': 16},
                   {'learning_rate': 0.002, 'regularization': 1, 'n_epochs': 20, 'rank': 50, 'seed': 16},
                   {'learning_rate': 0.002, 'regularization': 1, 'n_epochs': 20, 'rank': 25, 'seed': 16},
                   {'learning_rate': 0.002, 'regularization': 1, 'n_epochs': 40, 'rank': 50, 'seed': 16},
                   {'learning_rate': 0.002, 'regularization': 1, 'n_epochs': 40, 'rank': 50, 'seed': 16}
                    ]

compute_new_user_p_config_list = [{'regularization': 1, 'learning_rate': 0.002, 'n_epohs': 15},
                                  {'regularization': 1, 'learning_rate': 0.002, 'n_epohs': 15},
                                  {'regularization': 1, 'learning_rate': 0.002, 'n_epohs': 15},
                                  {'regularization': 1, 'learning_rate': 0.002, 'n_epohs': 15},
                                  {'regularization': 1, 'learning_rate': 0.002, 'n_epohs': 15},
                                  {'regularization': 1, 'learning_rate': 0.002, 'n_epohs': 30},
                                  {'regularization': 1, 'learning_rate': 0.002, 'n_epohs': 15},
                                  {'regularization': 1, 'learning_rate': 0.002, 'n_epohs': 15}]
trainset_list = [trainset,
                 trainset_with_negative,
                 trainset_with_negative05,
                 trainset,
                 trainset,
                 trainset,
                 trainset_with_negative,
                 trainset_with_negative05]

In [None]:
len(list(zip(models_names, sgd_config_list, compute_new_user_p_config_list, trainset_list)))

6

In [None]:
metrics_list = []
for model_name, grid_sgd_config, grid_p_config_list,grid_trainset in list(zip(models_names, sgd_config_list, compute_new_user_p_config_list, trainset_list)):
      print(f'#####__{model_name}__#####')
      # compute model
      _, _, _, P, Q = mf_sgd_build(grid_sgd_config, grid_trainset, train_data_description)

      # make predictions on test
      test_p_vectors = np.zeros((test_int_matrix.A.shape[0], grid_sgd_config['rank']))
      test_rating_pred = np.zeros((test_int_matrix.A.shape[0], data_description['n_items']))
      for i in tqdm.tqdm(range(test_int_matrix.A.shape[0])):
          p_i, ratings_i = create_newuser_p_vector(test_int_matrix.A[i], Q=Q, config=grid_p_config_list)
          test_p_vectors[i] = p_i
          test_rating_pred[i] = ratings_i

      # clear prediction
      sgd_scores = test_rating_pred.copy()
      downvote_seen_items(sgd_scores, testset_, test_data_description)
      sgd_recs = topn_recommendations(sgd_scores, topn=10)

      # calculate metrics
      metrics = list(model_evaluate(sgd_recs, holdout_reind.sort_values(by=['userid']), test_data_description, topn=10))
      metrics_list.append(metrics)

#####__sgd_n40_r50_pn15_negative1__#####
SGD start


100%|██████████| 813/813 [01:17<00:00, 10.54it/s]


#####__sgd_n40_r50_pn15_negative0.5__#####
SGD start


100%|██████████| 813/813 [01:15<00:00, 10.81it/s]


In [None]:
pd.DataFrame(metrics_list, columns=['hr', 'mrr', 'cov'], index=models_names)

| model | hr | mrr | cov |
|---|---|---|---|
| sgd_n20_r25_pn15 | 0.00861 | 0.003189 | 0.01646 |
| sgd_n20_r25_pn15_negative1 | 0.04428 | 0.016332 | 0.045872 |
| sgd_n20_r25_pn15_negative0.5 | 0.04182 | 0.014722 | 0.040745 |
| sgd_n40_r25_pn15 | 0.01476 | 0.003507 | 0.01592 |
| sgd_n20_r50_pn15 | 0.01599 | 0.003705 | 0.01646 |
| sgd_n20_r25_pn30 | 0.00861 | 0.003189 | 0.01646 |
| sgd_n40_r50_pn15_negative1 | 0.04305 | 0.014754 | 0.045062 |
| sgd_n40_r50_pn15_negative0.5 | 0.04305 | 0.013736 | 0.041015 |


Adding negative sampling significantly improves the model: HR rises from about 0.009 to about 0.044. In the runs without negative sampling, increasing the number of training epochs and increasing the rank also improve the metrics. Doubling the number of epochs used to obtain the vector p_i of a new user (from 15 to 30) does not change the metrics.

Advantages of WMF over SVD:
* Efficiency. SGD is inherently more scalable and computationally efficient than methods that require full matrix factorization like SVD. Suitable for large datasets and online learning scenarios.
* Adaptability. Handles sparse and dynamic datasets well. Can be updated incrementally with new data.
* Generalization. Negative sampling enhances the model's ability to generalize by learning to differentiate between positive and negative instances.