## Baseline: Collaborative filtering model

In my solution I used User-Based Top-N Recommendation Algorithm. I identified the k most similar users (nearest neighbors) to the active user using Cosine similarity. Each user is treated as a vector in the m-dimensional item space and the similarities between the active user and other users are computed between the vectors.

In [17]:
import pandas as pd
import os, sys, csv, json
import numpy as np
import pandas as pd
from random import sample 

from collections import Counter
from itertools import chain
from pandas.io.json import json_normalize
from random import randrange, sample

from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity

## Build matrix

Building user-artist space. 

In [18]:
data_dir = 'ThirtyMusic/entities'
relations_dir = 'ThirtyMusic/relations'

artists_path_csv = os.path.join(data_dir, 'persons.csv')
sessions_path_csv = os.path.join(relations_dir, 'sessions_short.csv')

artists = pd.read_csv(artists_path_csv, delimiter=';', header=0)
artists = artists.set_index('ID')

sessions = pd.read_csv(sessions_path_csv, sep=';')

Create a matrix where: *rows = users, columns = artists, values[i][j] = number of times user_i listened to artist_j in all the sessions*

In [24]:
s = sessions.sort_values(by='UserId')
s.drop(columns=['Type', 'Timestamp', 'ID'], inplace = True)
s.set_index('UserId', inplace = True)
g = s.groupby(['UserId'])['ArtistsID'].apply(lambda x: [a.split(',') for a in x.tolist() if \
                                                     not pd.isna(a)]).reset_index()
g['ArtistsID'] = g['ArtistsID'].apply(lambda x: Counter(list(chain(*x))))

# Saving this dictionary for the embeddings model
dictionary_data = g.to_dict()['ArtistsID']

g.head()

Unnamed: 0,UserId,ArtistsID
0,4,"{'3': 2, '546': 2, '544': 2, '259773': 1, '315..."
1,14,"{'158286': 1, '13': 1}"
2,17,"{'40': 1, '20': 1, '15': 1}"
3,60,"{'40': 1, '29': 3, '11': 1, '65': 1, '74': 1, ..."
4,81,"{'38701': 4, '34577': 3, '74637': 3, '148559':..."


I used only a piece of the dataset because the matrix didn't fit in the memory.


A possible solution of this issue described in the end of this notebook

In [25]:
g = g.head(10000) 
matrix = json_normalize(g['ArtistsID'])
matrix.set_index(g['UserId'], inplace=True)
matrix.fillna(0, inplace=True)
print(matrix.shape)

(10000, 23741)


In [26]:
# Building indexes
idx_to_artist = dict(enumerate(list(map(int, matrix.columns))))
artist_to_idx = dict(zip(idx_to_artist.values(),idx_to_artist.keys()))
idx_to_user = dict(enumerate(list(map(int,matrix.index))))

## Train test split

Testing of recommender systems is tricky - we need to recommend relevant artists but it's difficult to check if the recomendations are indeed relevant.

I was deleting some sells of the user-artist matrix from top_6 most listened artists. It was done only for users who had listened to more than 10 different artists in order to not encounter with the cold start problem.

After creating the model I will check if the sells that were deleted (relevant artists) are in the top 10 recommended.

In [27]:
test_user_idxs = []
test_artist_idxs = []
test_val = []

def select_test(user, test_size, topk = 6, n_listened = 10):
    rated_artists = user[user>0]
    if rated_artists.shape[0] > n_listened and len(test_user_idxs) < test_size:
        artist_ind_num = randrange(1,topk)
        rated_artists.sort_values(ascending=False, inplace=True)
        artist_idx = rated_artists.index[artist_ind_num]
        test_user_idxs.append(int(user.name))
        test_artist_idxs.append(int(artist_idx))
        test_val.append(rated_artists.loc[artist_idx])
        user[artist_idx] = 0
    return user

test_size = int(matrix.shape[0]*0.3)
matrix = matrix.apply(lambda x: select_test(x, test_size = test_size), axis = 1)

### Normalizing data & calculating cosine similarity btw users


In [28]:
scaler = MinMaxScaler()
scaled_matrix = scaler.fit_transform(matrix)
cosine_sim = cosine_similarity(scaled_matrix)
np.fill_diagonal(cosine_sim, 0)
cosine_sim = pd.DataFrame(cosine_sim, index=matrix.index, columns = matrix.index)
cosine_sim.head()

UserId,4,14,17,60,81,84,88,92,98,101,...,39979,39987,39992,39998,39999,40002,40003,40005,40011,40013
UserId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,0.0,0.0,0.0,0.471405,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
60,0.0,0.0,0.471405,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
81,0.0,0.0,0.0,0.0,0.0,0.015394,0.050307,0.009455,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
# Finding top 30 similar users for each user

def find_n_neighbours(df, n):
    order = np.argsort(df.values, axis=1)[:, :n]
    df = df.apply(lambda x: pd.Series(x.sort_values(ascending=False)
           .iloc[:n].index, 
          index=['top{}'.format(i) for i in range(1, n+1)]), axis=1)
    return df

sim_user_top_30 = find_n_neighbours(cosine_sim, 30)
sim_user_top_30.head()

Unnamed: 0_level_0,top1,top2,top3,top4,top5,top6,top7,top8,top9,top10,...,top21,top22,top23,top24,top25,top26,top27,top28,top29,top30
UserId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4,35892,29866,18754,38902,36711,30011,28861,31195,667,38907,...,15029,17049,35680,24469,13855,36954,23558,17341,12516,8856
14,40013,13194,13221,13218,13213,13209,13205,13202,13196,13185,...,13563,13268,13285,13284,13282,13277,13273,13272,13269,13267
17,60,40013,13196,13227,13221,13218,13213,13209,13205,13202,...,13236,13163,13269,13290,13285,13284,13282,13277,13273,13272
60,17,40013,13196,13227,13221,13218,13213,13209,13205,13202,...,13236,13163,13269,13290,13285,13284,13282,13277,13273,13272
81,10727,13399,18834,26000,20035,24109,15422,8801,19141,14930,...,6514,25035,30799,14771,22556,84,8121,2383,92,38159


After the k most similar users have been discovered, their corresponding rows in the user-artist matrix R are aggregated to identify a set of artists, listened by the group together with their frequency. With the set of artists we recommend the top-N most frequent items in C that the active user has not listened.

The aggregation of the similar users is done according to the next formula (weighted sum):

Let k similar users have cos_distances = [d1, d2, ..., dk]. Normalize theese distances with MinMax normalization and inverse them as 1-d. The bigger the cosin distance the less should be the coefficient. That's why after normalization we need to inverse. 

Normalised and inversed coef = [nd1, nd2, ..., ndk]. The closer a user to the target user the bigger will be the coeficient ndi.

The aggregation:

| user_id/artist_id | artist_id_1 | artist_id_2  |...  |artist_id_k  |
| ------------- |-------------|-------------|-------------|-------------|
| **id_1**          | val1_1      | val2_1      |...       |val_k_1       |
| **id_2**          | val1_2      |   val2_2    |...        |val_k_2        |
| **id_3**          | val1_3      |    val2_3   | ...        |val_k_3        |

> *value_for_artist_1 = val1_1 * nd1 + val1_2 * nd2 + ... + val_k_1 * ndk*


In [30]:
def recommend_to_user(user_id, matrix, sim_users, artists, k = 10):
    listened_artitsts = matrix.loc[user_id]
    listened_artitsts = listened_artitsts[listened_artitsts > 0]
    
    knn = sim_users.loc[user_id].tolist()
    
    cos_distances = cosine_sim[knn].loc[user_id]
    
    # normalizing distances to make coefficients
    norm_cos_distances = (cos_distances-cos_distances.min())/(cos_distances.max()-cos_distances.min())
    coef = norm_cos_distances.apply(lambda x: 1-x).sort_index()
    
    knn_matrix = matrix.loc[knn]
    knn_matrix.sort_index().mul(coef, axis=0)
    top_artists = knn_matrix.sum()
    top_artists = top_artists[top_artists > 0]
    
    # removing artists that user knows
    consider_artists = list(set(top_artists.index) - set(listened_artitsts.index))
    
    top_artists = top_artists.loc[consider_artists].sort_values(ascending=False)[:k]
    top_artists_ids = list(map(int, top_artists.index))
    top_artists_names = artists.loc[top_artists_ids]['Name']
    return top_artists, top_artists_names
 
top_artists, top_artists_names = recommend_to_user(user_id = 4,
                                                   matrix = matrix,
                                                   sim_users = sim_user_top_30,
                                                   artists = artists)

top_artists_names

ID
3747                                     30STM+&+Nearq
4807                    50+Cent+feat.+Snoop+Dogg+&+Pre
65988     Calle+13+&+Tuna+Bardos+UPR+Rio+Piedras-Choir
129                                              Track
4966                                          5+A+Seco
3984                        %3CArtista+Desconhecido%3E
108316                     Dead+Silence+Hides+My+Cries
3165                                 2+Chainz+&+Future
271162        Pendulum+&+Fresh+feat.+Spyda+&+Tenor+Fly
3582                                               2PM
Name: Name, dtype: object

### Test

Recommending for all the test samples and then ckecking if the relevant artist (the one we removed) is in the top_10 recommended. Calculetiong the avg between all test samples 

In [31]:
def test_CF_model(test_user_idxs, matrix, sim_user_top_30, artists, k = 10):
    n_relevant = 0
    for i, user_id in enumerate(test_user_idxs):
        res, _ = recommend_to_user(user_id, matrix, sim_user_top_30, artists)
        top_artists_ids = list(map(int, res.index))[:k]
        if test_artist_idxs[i] in top_artists_ids:
            n_relevant += 1
    n_test_samples = len(test_user_idxs)
    return n_relevant/n_test_samples

test_CF_model(test_user_idxs, matrix, sim_user_top_30, artists)

0.06524725274725275

### Comments

Another solution would be to use sparse representation on the matrix because the majority of it's values are None. [Scipy.sparse](https://docs.scipy.org/doc/scipy/reference/sparse.html)

There was not enough time to implement such an approach. I expect it would increase the quality a lot because now only a piece of the dataset is being used.

In [32]:
# Samples for t-test
test_samples = [sample(test_user_idxs,len(test_user_idxs)//10) for i in range(30)]
CF_model_sample = [test_CF_model(sample, matrix, sim_user_top_30, artists) for sample in test_samples]
print(len(CF_model_sample))

30


# Embeddings

I was building embeddings of users and artists such as:

* user and artist are close in the embeddings space if the user likes the artist

Due to transitivity from the first:
* users are close together in the embeddings space if they are similar
* artists are close together in the embeddings space if they are similar

In [19]:
import random
import pickle

import torch
import torch.nn as nn
import torch.optim as optim

from torch.utils import data
from torch.utils.data import Dataset

from torchvision import transforms
from collections import defaultdict

from torch.utils.data.sampler import SubsetRandomSampler, Sampler

## Build sample

Sample cosisits of rows: user_id, artist_id, target

If the user have listened to the artists target = 1, otherwise target = -1.

* Negative samples were selected randomly from the artists that the user did not listened.
* Positive samples were taken from the most listened artists by this user.

In [15]:
def create_sample(session_dict, all_artist_ids, negative = 1e-5, positive = 0.1):
    sample = []
    start_time = 0
    for i, (user, session) in enumerate(session_dict.items()):
        factor=1.0/sum(session.values())
        for artist in session:
            if session[artist]*factor >= positive:
                sample.append([user, artist, 1])
        not_listened_artists = list(set(all_artist_ids) - set(session.keys()))
        perc = int(len(not_listened_artists)*negative)
        indices = random.sample(range(len(not_listened_artists)), perc)
        for i in indices:
            sample.append([user, not_listened_artists[i], -1])
    return sample

artists.reset_index(inplace = True)
artist_ids = artists.ID.tolist() 
sample = create_sample(dictionary_data, artist_ids)

In [60]:
# change f_name
with open('metadata/sample_for_embed.pkl', 'wb') as f:
    pickle.dump(sample, f)

I don't have GPU and I again used a piece of the sample

In [20]:
with open('metadata/sample_for_embed.pkl','rb') as f:
    sample = pickle.load(f)

sample = sample[:len(sample)//100]  
print(len(sample)) 

10000


#### Decoding ids

In [21]:
sample = np.asarray(sample)
idx_2_user = dict(enumerate(set(sample[:, 0])))
user_2_idx = {value:key for key,value in idx_2_user.items()}
idx_2_artist = dict(enumerate(set(sample[:, 1])))
artist_2_idx = {value:key for key,value in idx_2_artist.items()}

a = np.array([user_2_idx[user] for user in sample[:, 0]])
b = np.array([artist_2_idx[artist] for artist in sample[:, 1]])
sample[:, 0] = a
sample[:, 1] = b

print('unique users :', len(idx_2_user), 'unique artists:', len(idx_2_artist))

unique users : 1427 unique artists: 8889


In [22]:
X = torch.Tensor(sample[:, :2]).to(torch.int64)
y = torch.Tensor(sample[:, 2:])

In [23]:
class Dataset(data.Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y
        self.positive_samples_idx = []

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index):
        # Select sample
        X = self.X[index]
        y = self.y[index]
        return X, y
    
    def get_posisve_samples_idx(self):
        for i, y in enumerate(self.y):
            if y:
                self.positive_samples_idx.append(i)
        return self.positive_samples_idx

### Train validation test split

In [24]:
def split_data(dataset, batch_size, val_split_size = .15, test_split_size = .15):
    
    data_size = dataset.__len__()
    positive_idx = dataset.get_posisve_samples_idx()
    indices = list(range(data_size))
    positive_data_size = len(positive_idx)
    
    val_split = int(np.floor(val_split_size * positive_data_size))
    test_split = int(np.floor((test_split_size + val_split_size) * positive_data_size))

    np.random.shuffle(positive_idx)
    
    val_indices = positive_idx[:val_split]
    test_indices = positive_idx[val_split:test_split]
    train_indices = positive_idx[test_split:] + list(set(indices) - set(positive_idx))

    train_sampler = SubsetRandomSampler(train_indices)
    
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, 
                                               sampler=train_sampler)

    return train_loader, train_indices, val_indices, test_indices


### Building model

There was no time to add extra data to the dataset, but this code easily allows to do it.

The models UserEmbedder and ArtistEmbedder are identical and have 2 torch.nn.Linear layers.

*The main idea is to train and then calculate loss as nn.CosineEmbeddingLoss() between artist and user embeddings. The target value is {-1, 1}, as the range of cosine similarity values. *

In [25]:
class UserEmbedder(torch.nn.Module):

    def __init__(self, n_hidden, users_size, embedding_dim, user_extra=0):
        super(UserEmbedder, self).__init__()
        self.user_embeddings = torch.nn.Embedding(users_size, embedding_dim)

        self.user_extra = user_extra
        last_hidden_input = embedding_dim

        if user_extra:
            last_hidden_input += self.user_extra

        self.user_model = torch.nn.Sequential(
            torch.nn.Linear(last_hidden_input, n_hidden),
            torch.nn.Sigmoid(),
            torch.nn.Linear(n_hidden, n_hidden),
            torch.nn.Sigmoid(),
        )

    def forward(self, X):
        user = self.user_embeddings(X)

        # If user content features are available
        if self.user_extra:
            user = torch.cat([user, X['user_extra']], dim=1)

        user = self.user_model(user)
        return user


class ArtistEmbedder(torch.nn.Module):
    
    def __init__(self, n_hidden, artist_size, embedding_dim, artist_extra=0):
        super(ArtistEmbedder, self).__init__()
        self.artist_embeddings = torch.nn.Embedding(artist_size, embedding_dim)

        last_hidden_input = embedding_dim
        self.artist_extra = artist_extra

        if artist_extra:
            last_hidden_input += self.artist_extra

        self.artist_model = torch.nn.Sequential(
            torch.nn.Linear(last_hidden_input, n_hidden),
            torch.nn.Sigmoid(),
            torch.nn.Linear(n_hidden, n_hidden),
            torch.nn.Sigmoid(),
        )

    def forward(self, X):
        artist = self.artist_embeddings(X)

        # If artist content features are available
        if self.artist_extra:
            artist = torch.cat([artist, X['artist_extra']], dim=1) 

        artist = self.artist_model(artist)
        return artist

In [26]:
# Converting numpy array to dictionary

def array_to_dict(array):
    d = defaultdict(list)
    for i in array:
        d[int(i[0])].append(int(i[1]))
    return d

In [31]:
dataset = Dataset(X, y)
train_loader, train_ind, val_ind, test_ind = split_data(dataset, batch_size = 256)

# moving to dictionaries to accelerate access to the data

X_val = array_to_dict(X[val_ind])
X_test = array_to_dict(X[test_ind])
X_train = array_to_dict(X[train_ind])

val_unique_users = len(X_val)
test_unique_users = len(X_test)

n_hidden = 20
users_size = len(idx_2_user)

# to capture more information it's better to use bigger dim (for ex. 50)
# I'd icrease the embedding_dim if had a better machine

embedding_dim = 10 
artist_size = len(idx_2_artist)

userEmbedder = UserEmbedder(n_hidden, users_size, embedding_dim)
userEmbedder.type(torch.FloatTensor)

artistEmbedder = ArtistEmbedder(n_hidden, artist_size, embedding_dim)
artistEmbedder.type(torch.FloatTensor)

criterion = nn.CosineEmbeddingLoss()
optimizer = optim.Adam(list(userEmbedder.parameters()) + list(artistEmbedder.parameters()), lr=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

### Validation function

For each user in the validation set we recommend topk **new** artists. Then we check how many of the recomendations are relevant to the user. The goal is to recommend artists_ids from the validation dataset.

The result for each user is the **n_relevant_recomendations/min(topk, n_relevant)**. min(topk, n_relevant) is needed in case if there will be more than k relevant artists in the validation dataset.

The result of the function is the average of the results for the users. 


In [28]:
def validate(all_user_embed, all_artist_embed, X_val, X, n_unique_users, k = 10, batch_size = 1000):
    """
        Validate the model
        
        Arguments:
        
        all_user_embed, all_artist_embed - embeddings for all users and all artists
        X_val - dictionary with validation samples. X_val = {user_id: [arist_id_1, ..., arist_id_k]}
        X - dictionary with train samples. X = {user_id: [arist_id_1, ..., arist_id_k]}.
        X is used to get the artists, that the user has already listened - we want to recommend only new artists. 
        n_unique_users - number of unique users in the validation set
        k - looking for relevant artists only in top k recomendations
        batch_size - batch_size for artist embedding
    """
    cosine = torch.nn.CosineSimilarity()
    artists_embed_sampler = SubsetRandomSampler(list(range(all_artist_embed.shape[0])))
    embed_loader = torch.utils.data.DataLoader(all_artist_embed, batch_size=batch_size,
                                         sampler=artists_embed_sampler)
    total_relevant = 0
    for user_id, artists in X_val.items():
        
        relevant_artists = np.array(artists)
        listened_artists = np.array(X[user_id])
        
        topk = torch.Tensor()
        topk_indices = torch.Tensor()
        current_user_tensor = all_user_embed[user_id].view(1, all_user_embed.shape[1])
        
        # calculate cosine similarity in batches
        for batch_artist in embed_loader:
            cos_dist = cosine(current_user_tensor, batch_artist)
            topk_b, topk_indices_b = torch.sort(cos_dist, descending = True)
            topk = torch.cat([topk, topk_b[:k]], dim=0)
            topk_indices = torch.cat([topk_indices, topk_indices_b[:k].type(torch.FloatTensor)], dim=0)
        
        topk_all = np.transpose(torch.stack([topk, topk_indices], dim=0).detach().numpy())
        topk_all = topk_all[topk_all[:,0].argsort()]
        topk_indices = topk_all[:, 1]
        
        # deleting listened_artists
        topk_artists_recommendation = np.setdiff1d(topk_indices, listened_artists, assume_unique = True)[:k]
        n_relevant = len(relevant_artists)
        
        # looking for intersection in topk recomendations and relevant artists
        n_relevant_recommendations = np.intersect1d(relevant_artists, topk_artists_recommendation).shape[0]/min(k, n_relevant)
        total_relevant += n_relevant_recommendations
        
    return total_relevant/n_unique_users 


### Traning

The model was trained for 2K epochs and the model with the best validation score was chosen.

The score would increase dramatically if it would be possible to use GPU, increase sample size, and embeddings dimensions.

In [None]:
import scipy.ndimage
import copy

loss_history = []
train_history = []
val_history = []
num_epochs = 2000
best_val_metric = 0
cosine = torch.nn.CosineSimilarity()

for epoch in range(num_epochs):
    userEmbedder.train()
    artistEmbedder.train()
    loss_accum = 0
    user_emb_full = torch.Tensor()
    artist_emb_full = torch.Tensor()
    for i_step, (x, target) in enumerate(train_loader):
        
        user_emb = userEmbedder.forward(x[:, 0])
        artist_emb = artistEmbedder.forward(x[:, -1])
        loss = criterion(user_emb, artist_emb, target)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        loss_accum += loss
        
        scheduler.step()
        
    ave_loss = loss_accum / i_step
     
        
    # Validating every 20 epochs and saving the best model
    if (epoch % 20) == 0:
        all_user_embed = userEmbedder.forward(torch.Tensor([x for x in range(len(idx_2_user))]).to(torch.int64))
        all_artist_embed = artistEmbedder.forward(torch.Tensor([x for x in range(len(idx_2_artist))]).to(torch.int64))
        metric = validate(all_user_embed, all_artist_embed, X_val, X_train, val_unique_users)
        if metric > best_val_metric:
            best_val_metric = metric
            best_user_emb = copy.deepcopy(userEmbedder)
            best_artist_emb = copy.deepcopy(artistEmbedder)
        print('metric: %f' % metric)

    print("epoch %d, loss: %f" % (epoch, ave_loss))
    

In [None]:
# Save model
with open(r"metadata/best_user_emb.pkl", "wb") as output_file:
    pickle.dump(best_user_emb, output_file)
    
with open(r"metadata/best_artist_emb.pkl", "wb") as output_file:
    pickle.dump(best_artist_emb, output_file)


In [32]:
with open(r"metadata/best_user_emb.pkl",'rb') as f:
    best_user_emb = pickle.load(f)

with open(r"metadata/best_artist_emb.pkl",'rb') as f:
    best_artist_emb = pickle.load(f)

### Test



In [33]:
best_user_embedings = best_user_emb.forward(torch.Tensor([x for x in range(len(idx_2_user))]).to(torch.int64))
best_artist_embedings = best_artist_emb.forward(torch.Tensor([x for x in range(len(idx_2_artist))]).to(torch.int64))
res = validate(best_user_embedings, best_artist_embedings, X_test, X_train, test_unique_users)
res

0.003218884120171674

## Using the embeddings model

In [16]:
def recommend_to_user(user, user_emb, artist_emb, X, artists, k = 10):
    recommendation_names = []
    recommendation_ids = []
    user_idx = user_2_idx[user]
    cosine = torch.nn.CosineSimilarity()
    all_user_embed = user_emb.forward(torch.Tensor([x for x in range(len(idx_2_user))]).to(torch.int64))
    all_artist_embed = artist_emb.forward(torch.Tensor([x for x in range(len(idx_2_artist))]).to(torch.int64))
    
    listened_artists = np.array(X[user_idx])
        
    cos_dist = cosine(all_user_embed[user_idx].view(1,all_user_embed.shape[1]), all_artist_embed)
    topk, topk_indices = torch.sort(cos_dist, descending = True)
    topk_artists_recommendation = np.setdiff1d(topk_indices, listened_artists, assume_unique = True)[:k]
    
    for idx in topk_artists_recommendation:
        artist_id = idx_2_artist[idx]
        recommendation_names.append(artists[artists.ID == artist_id].reset_index().at[0, 'Name'])
        recommendation_ids.append(artist_id)
    return recommendation_ids, recommendation_names
 
user_id = 4    
idx, names = recommend_to_user(user_id, best_user_emb, best_artist_emb, X, artists)
names

['Soul+of+Chill',
 'Johnny+Depp,+Alan+Rickman',
 'Original+Evita+Cast',
 'Jazz+4',
 'Cervidae',
 'Cieplarnia',
 'Guillaume+Roussel',
 'Mob+Serenade',
 'James+Brown+&+Luciano+Pavarotti',
 'Heib']

In [17]:
def find_topk_similar_artists(artist, artist_emb, artists, k = 10):
    topk_names = []
    topk_ids = []
    artist_idx = artist_2_idx[artist]
    cosine = torch.nn.CosineSimilarity()
        
    cos_dist = cosine(artist_emb[artist_idx].view(1, artist_emb.shape[1]), artist_emb)
    topk, topk_indices = torch.sort(cos_dist, descending = True)
    topk_artists = np.setdiff1d(topk_indices, np.array(artist_idx), assume_unique = True)[:k]
    
    for idx in topk_artists:
        artist_id = idx_2_artist[idx]
        topk_names.append(artists[artists.ID == artist_id].reset_index().at[0, 'Name'])
        topk_ids.append(artist_id)
    return topk_ids, topk_names

# arist_id = 1756 artist_name = Sixteen+Horsepower
# arist_id = 4101 artist_name = Pink+Floyd+&+Floyd 

_, names = find_topk_similar_artists(1756, best_artist_embedings, artists)
print('Sixteen+Horsepower: ', names)

_, names = find_topk_similar_artists(276125, best_artist_embedings, artists)
print('Pink+Floyd+&+Floyd: ', names)

Sixteen+Horsepower:  ['Loft+Apartment', 'Topol+and+Original+Broadway+Cast', 'Nas+feat.+Anthony+Hamilton', 'Shell+Shocked', 'Boi+Akih', 'SunSuch', '%D0%9D%D0%B8%D0%BA%D0%BE%D0%BB%D0%B0%D0%B9+%D0%91%D0%B0%D1%81%D0%BA%D0%BE%D0%B2', 'Lycia%2FMike+Van+Portfleet', 'AVT', 'Redneck+surfers']
Pink+Floyd+&+Floyd:  ['Flexi+Cowboys', 'Damon+Dash', 'Music+City+Singers', 'Eu+gosto+tanto+de+voc%C3%AA', 'The+Pietro+Carapellucci+Choir', 'KC+and+The+Sunshine+Band', 'Matricians', 'Hill+Briggs', 'Seven+Nails', 'The+Royal+Concept']


### Comments

As you can see the test doesn't show great results, but this model has a lot of potentials. To improve it we should add extra information about users and artists, increase the sample size and embedding dimension. Also, increase the number of epochs.

This project also shows that often a simple solution is preferable to a complicated one, especialy if there are not enough computational resources and data. (Of course, it is not the case for industry tasks)

## Statistical t-test

In [None]:
# Embedding model samples for t-test
X_test_keys = [i for i in X_test.keys()]
n_users = test_unique_users//10
test_samples = [sample(X_test_keys, n_users) for i in range(30)]

Embed_model_sample = []
for X_test_key_sample in test_samples:
    X_test_sample = {key: X_test[key] for key in X_test_key_sample}
    print(X_test_sample)
    res = validate(best_user_embedings, best_artist_embedings, X_test_sample, X_train, n_users)
    Embed_model_sample.append(res)


#### Statistical two-sided t-test
Hypothesis:

* H0: Samples have identical average
* H1: Samples have different averages

In [37]:
from scipy.stats import ttest_ind

def compare_samples_t_test(s1, s2):
    m1_mean = np.mean(s1)
    m2_mean = np.mean(s2)
    print("model1 mean value:",m1_mean)
    print("model2 mean value:",m2_mean, '\n')
    m1_std = np.std(s1)
    m2_std = np.std(s2)
    print("model1 std value:", m1_std)
    print("model2 std value:", m2_std)
    print('\n')
    ttest,pval = ttest_ind(s1, s2)
    print("p-value",pval)
    if pval < 0.05:
        print("we reject null hypothesis")
    else:
        print("we accept null hypothesis")
        
compare_samples_t_test(Embed_model_sample, CF_model_sample)

model1 mean value: 0.0013593380614657208
model2 mean value: 0.005977011494252874 

model1 std value: 0.002990917477276594
model2 std value: 0.005833828754229665


p-value 0.0003571489452935831
we reject null hypothesis
