# User-user neighborhood model

This notebook presents the training and testing of the user-user neighborhood model.


<a id=content><a>
## Table of contents
1. ### [Data preprocessing](#preprocessing)
    * [Load cleaned datasets](#load_datasets)
    * [Divide users in groups](#split_users_in_groups)  
2. ### [Model (Find neighbors)](#find_neighbors)
3. ### [Predictions](#compute_predictions)
4. ### [Evaluation](#model_evaluation)
 

In [1]:
import sys
from tqdm.auto import tqdm

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

sys.path.append('src')
from train_test import train_test_split
from metrics import compute_metrics, rmse
import neighborhood_helpers as uunm

tqdm.pandas()
%load_ext autoreload
%autoreload 2

<a id=preprocessing><a>
## Data preprocessing 

<a id=load_datasets><a>
### Load datasets
[Back to content](#content)

The dataset has already been preprocessed in "./preprocessing.ipynb" and split into training and test sets.

In [2]:
DATA_PATH = './data.nosync/lastfm-dataset-360K/'
MY_DIR = './data.nosync/user_neighborhood/'

In [3]:
# Load data files
train = pd.read_csv(DATA_PATH + 'train.csv')
test = pd.read_csv(DATA_PATH + 'test.csv')
lastfm_360_behav = pd.read_csv(DATA_PATH + 'behav-360k-processed.csv')
lastfm_360_demo = pd.read_csv(DATA_PATH + 'demo-360k-processed.csv')
lastfm_360_demo = lastfm_360_demo.set_index('user_email')
test_users = np.load(DATA_PATH + 'test_users.npy')

In [4]:
train.shape, test.shape

((5644266, 3), (30022346, 3))

<a id="split_users_in_groups"><a>
### Split users in groups
[Back to content](#content)

Our train dataset contains 67k users. Comparing every user with every other user is expensive in time and resources and leads to memory issues.

To work around this, we split the users into groups based on their demographic features before computing user similarities. This speeds up model training at the cost of some precision, since neighbors can only be found within a user's own group.

We keep each group smaller than 20k users.

We initially split on the 'country' and then the 'age' parameters. (The country split was removed after we restricted the processed dataset to users from the USA.)
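
As an illustration of this kind of split, here is a minimal sketch that orders users by age and cuts them into contiguous groups below a size limit. It is not the implementation of `uunm.compute_groups`: the function name `split_users_by_age`, the `max_group_size` parameter and the toy data are assumptions made for the example, and the real helper may choose the age boundaries differently.

```python
import numpy as np
import pandas as pd

def split_users_by_age(demo, max_group_size):
    """Split users into contiguous age groups of at most max_group_size users.

    `demo` is assumed to be a DataFrame indexed by user id with an 'age' column.
    """
    # Order users by age so that each group covers a contiguous age range
    ordered = demo.sort_values('age', kind='stable').index.to_numpy()
    # Smallest number of groups that keeps every group under the limit
    n_groups = int(np.ceil(len(ordered) / max_group_size))
    return [chunk.tolist() for chunk in np.array_split(ordered, n_groups)]

# Toy demographic table (hypothetical values)
demo = pd.DataFrame({'age': [34, 19, 52, 27, 41, 23, 30]},
                    index=pd.Index([1, 2, 3, 4, 5, 6, 7], name='user_email'))
groups = split_users_by_age(demo, max_group_size=3)
print([len(g) for g in groups])
```

Note that this sketch can place users of the same age in two different groups, which a split on fixed age boundaries would avoid.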


In [5]:
users = train['user_email'].unique()
len(users)

66928

In [6]:
train_groups = uunm.compute_groups(train, lastfm_360_demo)


In [7]:
[len(train_groups[i]) for i in range(len(train_groups))]

[9079, 17226, 10001, 11977, 15131]

<a id='find_neighbors'><a>
## Find user neighbors
[Back to content](#content)

Because the dataset is large (67k users), the pairwise correlation cannot be computed on all user pairs at once. We therefore chunk the users using the 'age' demographic parameter, splitting them into chunks of 5 years.


Because of the very high number of artists (84k) and the sparsity of the data, we speed up this process by removing artists that have fewer than 100 user interactions in the train dataset.
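
The two steps described above (artist filtering, then pairwise correlation inside one group) can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of `uunm.compute_neighborhood_model`: the function names `filter_popular_artists` and `top_neighbors` are hypothetical, missing ratings are assumed to count as 0, and the correlation is computed with `np.corrcoef` on a dense user-artist matrix.

```python
import numpy as np
import pandas as pd

def filter_popular_artists(ratings, min_users):
    """Return the ids of artists rated by at least `min_users` distinct users."""
    counts = ratings.groupby('artist_id')['user_email'].nunique()
    return counts[counts >= min_users].index.tolist()

def top_neighbors(ratings, users, artists, k):
    """Map each user to its k most correlated users as (neighbor, correlation) pairs."""
    users = list(users)
    subset = ratings[ratings['user_email'].isin(users)
                     & ratings['artist_id'].isin(artists)]
    # Dense user x artist matrix; missing ratings are assumed to be 0
    matrix = (subset.pivot_table(index='user_email', columns='artist_id',
                                 values='rating', fill_value=0)
                    .reindex(index=users, fill_value=0))
    # Constant rows have zero variance and produce NaN correlations
    with np.errstate(divide='ignore', invalid='ignore'):
        corr = np.corrcoef(matrix.to_numpy(dtype=float))
    corr[np.isnan(corr)] = -np.inf
    np.fill_diagonal(corr, -np.inf)  # a user is not its own neighbor
    neighbors = {}
    for row, user in enumerate(users):
        best = np.argsort(corr[row])[::-1][:k]
        neighbors[user] = [(users[j], float(corr[row, j]))
                           for j in best if np.isfinite(corr[row, j])]
    return neighbors

# Toy ratings (hypothetical values)
ratings = pd.DataFrame({
    'user_email': [1, 1, 2, 2, 2, 3, 3],
    'artist_id':  [10, 20, 10, 20, 30, 30, 40],
    'rating':     [1.0, 0.5, 0.9, 0.4, 0.1, 1.0, 0.8],
})
popular = filter_popular_artists(ratings, min_users=2)
neighbors = top_neighbors(ratings, users=[1, 2, 3], artists=popular, k=1)
print(popular, neighbors)
```

The correlation matrix has one row and one column per user in the group, so its memory cost grows quadratically with the group size, which is why the groups are kept small.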

In [8]:
len(train['artist_id'].unique())

84497

In [9]:
model = uunm.compute_neighborhood_model(train, train_groups, verbose=True)
model

Number of selected artists: 5354
User groups size: [9079, 17226, 10001, 11977, 15131]


Correlation matrix computation: 7.093887090682983 seconds.
Correlation matrix computation: 32.12835121154785 seconds.
Correlation matrix computation: 9.899366855621338 seconds.
Correlation matrix computation: 14.087893962860107 seconds.
Correlation matrix computation: 25.097601652145386 seconds.

user_email,neighbors
10,"[(65694, 0.24142768171938941), (29650, 0.23797..."
13,"[(53572, 0.29683674643611463), (31573, 0.29285..."
18,"[(12659, 0.24548255672375344), (18236, 0.23209..."
19,"[(58102, 0.2252037000274356), (19164, 0.214010..."
20,"[(23933, 0.2783164590124584), (34775, 0.230828..."
...,...
67019,"[(37733, 0.2532350708273024), (27881, 0.220762..."
67028,"[(20309, 0.21340624226104818), (41630, 0.18850..."
67029,"[(60139, 0.30738219877520817), (65518, 0.29884..."
67033,"[(28359, 0.3082681591830806), (32815, 0.294213..."


In [10]:
# Save the model
model.to_csv(MY_DIR + "user_neighborhood_model.csv")

<a id="compute_predictions"><a>
## Compute user predictions
    
[Back to content](#content)


After building our model, we now compute the predictions on the test dataset. (Note: negative samples, i.e. rows with a rating of 0, have been added to this data.)

Because the computation takes a long time, we save the predictions in snapshots so that the work can be spread over multiple runs.
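
For intuition, a common way to predict a rating in a user-user neighborhood model is a similarity-weighted average of the neighbors' ratings. The sketch below illustrates that idea only. It is not the implementation of `uunm.compute_user_predictions`: the function name `predict_rating`, the dict-based `ratings` lookup and the choice to skip neighbors with non-positive similarity are all assumptions made for the example.

```python
def predict_rating(ratings, neighbors, artist):
    """Similarity-weighted average of the neighbors' ratings for one artist.

    `ratings` is assumed to map (user, artist) pairs to a rating, and
    `neighbors` to be a list of (neighbor_id, similarity) pairs, as in the
    'neighbors' column of the model. Neighbors without a rating for the
    artist, or with a non-positive similarity, are skipped.
    """
    weighted_sum, weight_total = 0.0, 0.0
    for neighbor, similarity in neighbors:
        if similarity > 0 and (neighbor, artist) in ratings:
            weighted_sum += similarity * ratings[(neighbor, artist)]
            weight_total += similarity
    # No usable neighbor: fall back to 0 (assumption)
    return weighted_sum / weight_total if weight_total > 0 else 0.0

# Toy data (hypothetical values)
ratings = {(2, 10): 1.0, (3, 10): 0.5, (3, 20): 0.8}
neighbors = [(2, 0.3), (3, 0.1)]
print(predict_rating(ratings, neighbors, artist=10))
```

With these toy values, the neighbor with similarity 0.3 pulls the prediction for artist 10 closer to its own rating of 1.0 than to the other neighbor's 0.5.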

In [11]:
# Load the model (ignore if computed above)
import ast

model = pd.read_csv(MY_DIR + "user_neighborhood_model.csv", index_col='user_email')
# Parse the stringified neighbor lists without executing arbitrary code
model['neighbors'] = model['neighbors'].apply(ast.literal_eval)

In [None]:
# Split the test set per user. 
test_split = [(user, user_df) for user, user_df in tqdm(test.groupby('user_email'))]

# Keep only frequently rated artists to reduce the prediction time
selected_artists = uunm.filter_artists(train, artist_threshold=300)

In [None]:
pred_ratings_dict = {}
true_dict = {}

In [None]:
# Compute all the predictions and store them in the dicts above.
# Note: this cell takes 20 hours to run, so its output has been left blank here.
failed = []
for i, (user, user_df) in enumerate(tqdm(test_split)):
    if i % 100 == 0:  # Save a snapshot every 100 users in case of failure
        uunm.save_dict(pred_ratings_dict, MY_DIR, 'user_n_model_snapshot')
        uunm.save_dict(true_dict, MY_DIR, 'user_n_model_true_snapshot')
    artists = user_df['artist_id'].values
    # True ratings are stored whether or not the prediction succeeds
    true_dict[user] = np.stack([artists, user_df['rating'].values])
    try:
        # Compute predictions
        predictions = np.array(uunm.compute_user_predictions(train, user, artists, model))
        pred_ratings_dict[user] = np.stack([artists, predictions])
    except ValueError:
        # Fall back to all-zero predictions for users whose prediction failed
        failed.append((user, user_df))
        pred_ratings_dict[user] = np.stack([artists, np.zeros(len(artists))])
        print(f"Failed for user: {user}")
    

In [None]:
# Save full predictions dict
uunm.save_dict(pred_ratings_dict, MY_DIR, 'user_n_model_pred')
uunm.save_dict(true_dict, MY_DIR, 'user_n_model_true')

<a id="model_evaluation"><a>
## Model evaluation

[Back to content](#content)
    
    
After computing the predictions, we evaluate our model.
In this part, we compute the following metrics:

1. Root Mean Squared Error
2. Precision @ 10
3. Recall @ 10
4. Normalized Discounted Cumulative Gain @ 10
5. Hit rate @ 10
6. Average Reciprocal Hit Rank @ 10
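
To make some of these metrics concrete, here is a minimal sketch of RMSE, precision and recall at k, and NDCG at k for a single user, using their textbook definitions with binary relevance. The function names are hypothetical, and the definitions used by `rmse` and `compute_metrics` in `src/metrics.py` may differ, so the numbers are not meant to reproduce the results reported in this notebook.

```python
import numpy as np

def rmse_score(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision and recall of the top-k ranked items against a set of relevant items."""
    hits = sum(item in relevant_items for item in ranked_items[:k])
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

def ndcg_at_k(ranked_items, relevant_items, k):
    """NDCG with binary gains: 1 for a relevant item, 0 otherwise."""
    gains = [1.0 if item in relevant_items else 0.0 for item in ranked_items[:k]]
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float(np.dot(gains, discounts))
    # Ideal ranking: all relevant items (up to k) placed first
    ideal_hits = min(len(relevant_items), k)
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example (hypothetical values): artists ranked by predicted rating
ranked = [10, 20, 30, 40]
relevant = {10, 30, 50}
precision, recall = precision_recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
error = rmse_score([1.0, 0.0], [0.5, 0.5])
print(precision, recall, ndcg, error)
```

In practice these per-user values would be averaged over all test users.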

In [12]:
# Load predictions and true data
pred_ratings_dict = uunm.load_dict(MY_DIR, 'user_n_model_pred')
true_dict = uunm.load_dict(MY_DIR, 'user_n_model_true')

In [13]:
rmse_arr = []
for user in tqdm(pred_ratings_dict):
    u_true = true_dict[user][1]
    u_pred = pred_ratings_dict[user][1]
    rmse_arr.append(rmse(u_true, u_pred))
    
print(f"Average RMSE: {np.mean(rmse_arr)}")



Average RMSE: 0.20006307812344187


In [14]:
k = 10
_, _, _, _, _  = compute_metrics(test.drop(test[test.rating == 0].index),
                                 users, pred_ratings_dict, k)

Computing precision & recall...
Computing normalized discounted cumulative gain...
Computing hit rate...
Computing average reciprocal hit ranking...

Metrics: 

Precision @ 10: 0.3492366811704721
Recall    @ 10: 0.7106346192679347
Ndcg @ 10: 0.6071280220485821
Hit rate: 3.472749293500202
Arhr: 1.1396479356107105
