# Collaborative Filtering for Implicit Feedback Datasets

### Cost function for squared error with regularization

Find the user vectors $x$ and item vectors $y$ that minimize the summation below, which runs over all user-item pairs $(u, i)$:

$\underset{x,y}{\min} \sum_{u,i} c_{ui} (p_{ui} - x_u^Ty_i)^2 + \lambda \left( \sum_u \parallel x_u \parallel ^2 + \sum_i \parallel y_i \parallel ^2 \right)$

##### Where:

$x_u$ is the user vector,
$y_i$ is the item vector.

$p_{ui} = 1$ if interaction, 
$p_{ui} = 0$ if no interaction.

$c_{ui} = 1 + \alpha * r_{ui}$, where
$r_{ui}$ is the number of interactions for a user-item pair, and $\alpha$ determines our confidence levels.

$\lambda$ is the regularization parameter.

#### Explanation of cost function

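We take the squared error of each prediction, multiply it by the confidence, and add a penalty on the squared norms of the $x$ and $y$ vectors, scaled by $\lambda$, to discourage overfitting. Larger values of $\lambda$ penalize large factor values more strongly; smaller values regularize less.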

$\alpha$ allows us to influence our confidence levels. Clearly, our confidence increases when a producer samples the same artist multiple times, but by how much? $\alpha$ determines how important multiple samples are.

We add 1 so that non-interactions ($r_{ui} = 0$) still get a confidence of $c_{ui} = 1$ and are not lost during the cost calculation.
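To make the terms concrete, here is a minimal NumPy sketch that evaluates the cost on a tiny dense matrix. The function name `implicit_cost`, the toy counts, and the values `alpha=20` and `lam=0.1` are illustrative assumptions; a real dataset would need sparse matrices rather than the dense ones used here.

```python
import numpy as np

def implicit_cost(X, Y, R, alpha=20.0, lam=0.1):
    """Confidence-weighted squared error plus L2 regularization.

    X: (m, f) user factors, Y: (n, f) item factors,
    R: (m, n) dense matrix of raw interaction counts r_ui.
    """
    P = (R > 0).astype(float)      # p_ui: 1 if any interaction, else 0
    C = 1.0 + alpha * R            # c_ui: non-interactions keep confidence 1
    err = P - X @ Y.T              # p_ui - x_u^T y_i for every (u, i) pair
    reg = lam * (np.sum(X ** 2) + np.sum(Y ** 2))
    return np.sum(C * err ** 2) + reg

# Toy data: 3 users, 4 items, 2 latent factors (all values illustrative).
rng = np.random.default_rng(0)
R = np.array([[3, 0, 0, 1],
              [0, 2, 0, 0],
              [0, 0, 5, 0]], dtype=float)
X = rng.normal(scale=0.1, size=(3, 2))
Y = rng.normal(scale=0.1, size=(4, 2))

cost = implicit_cost(X, Y, R)
zero_cost = implicit_cost(np.zeros((3, 2)), np.zeros((4, 2)), R)
```

With all factors at zero every prediction is 0, so `zero_cost` reduces to the sum of $c_{ui}$ over the observed pairs, which is 224 for these counts.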

### ALS Algorithm

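However, we can't minimize the cost function above directly with techniques such as stochastic gradient descent because of the size of the dataset: the sum has $m * n$ terms, one for every user-item pair, not only the observed ones.

Therefore we minimize it with Alternating Least Squares (ALS). ALS holds either the user vectors or the item vectors constant, which makes the cost quadratic in the other set so that its global minimum can be computed in closed form, then alternates to the other set of vectors.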

#### Compute user factors

$x_u = (Y^T C^u Y + \lambda I)^{-1}  Y^T C^u p(u)$

##### Where:

$Y$ is the $n * f$ matrix of item factors, for $n$ items and $f$ factors.

$C^u$ is an $n*n$ diagonal matrix for user $u$ where $C^u_{ii} = c_{ui}$. This is our confidence matrix for the $n$ items.

$p(u)$ is the vector of preferences $p_{ui}$ for user $u$.
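As a sketch of this update for a single user, the code below solves the linear system with NumPy instead of forming the inverse. It also checks the speed-up described in the Hu, Koren, and Volinsky paper: since $Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y$ and $C^u - I$ is zero for items the user never touched, $Y^T Y$ can be computed once and shared across users. The function names and toy values are assumptions for illustration.

```python
import numpy as np

def user_factor(Y, r_u, alpha=20.0, lam=0.1):
    """Solve (Y^T C^u Y + lam*I) x_u = Y^T C^u p(u) for one user."""
    f = Y.shape[1]
    c_u = 1.0 + alpha * r_u               # diagonal of C^u
    p_u = (r_u > 0).astype(float)         # preference vector p(u)
    A = Y.T @ (c_u[:, None] * Y) + lam * np.eye(f)
    b = Y.T @ (c_u * p_u)
    return np.linalg.solve(A, b)          # avoids forming the explicit inverse

def user_factor_fast(Y, YtY, r_u, alpha=20.0, lam=0.1):
    """Same solve, touching only the items the user interacted with."""
    f = Y.shape[1]
    seen = np.flatnonzero(r_u)            # indices where r_ui > 0
    Y_seen = Y[seen]
    c_minus_1 = alpha * r_u[seen]         # (c_ui - 1) is zero for unseen items
    A = YtY + Y_seen.T @ (c_minus_1[:, None] * Y_seen) + lam * np.eye(f)
    b = Y_seen.T @ (1.0 + c_minus_1)      # p_ui = 1 on seen items, 0 elsewhere
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
Y = rng.normal(size=(6, 2))               # 6 items, 2 factors (toy values)
r_u = np.array([2.0, 0.0, 0.0, 1.0, 0.0, 0.0])

x_slow = user_factor(Y, r_u)
x_fast = user_factor_fast(Y, Y.T @ Y, r_u)
```

Both functions should return the same vector up to floating-point error; the second only does work proportional to the number of items the user has interacted with.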


#### Recompute item factors

$y_i = (X^TC^iX + \lambda I)^{-1} X^TC^ip(i)$

##### Where:
$X$ is the $m * f$ matrix of user factors, for $m$ users and $f$ factors.

$C^i$ is an $m * m$ diagonal matrix for each item $i$ where $C_{uu}^i = c_{ui}$.

$p(i)$ is the vector of preferences $p_{ui}$ for item $i$.
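Putting the two updates together, a toy version of the alternating loop might look like the following. This is a dense NumPy sketch under assumed toy sizes and hyperparameters, with hypothetical helper names `solve_side` and `cost`; it is not how the `implicit` library implements ALS. Because each half-step exactly minimizes the cost over one set of vectors, the recorded cost should never increase.

```python
import numpy as np

def solve_side(fixed, R, alpha, lam):
    """Recompute every row factor while `fixed` (the other side) is held constant.

    R has one row per vector being solved and one column per row of `fixed`.
    """
    f = fixed.shape[1]
    out = np.empty((R.shape[0], f))
    for k, r in enumerate(R):
        c = 1.0 + alpha * r                   # confidences for this row
        p = (r > 0).astype(float)             # binary preferences for this row
        A = fixed.T @ (c[:, None] * fixed) + lam * np.eye(f)
        out[k] = np.linalg.solve(A, fixed.T @ (c * p))
    return out

def cost(X, Y, R, alpha, lam):
    C, P = 1.0 + alpha * R, (R > 0).astype(float)
    return np.sum(C * (P - X @ Y.T) ** 2) + lam * (np.sum(X ** 2) + np.sum(Y ** 2))

rng = np.random.default_rng(2)
R = rng.poisson(0.3, size=(8, 6)).astype(float)   # toy counts: m=8 users, n=6 items
alpha, lam, f = 20.0, 0.1, 3
X = rng.normal(scale=0.1, size=(8, f))
Y = rng.normal(scale=0.1, size=(6, f))

history = [cost(X, Y, R, alpha, lam)]
for _ in range(5):
    X = solve_side(Y, R, alpha, lam)      # user step: item factors fixed
    Y = solve_side(X, R.T, alpha, lam)    # item step: user factors fixed
    history.append(cost(X, Y, R, alpha, lam))
```

The same helper serves both steps because the item update is the user update applied to the transposed count matrix.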

### Explaining recommendations

If $\hat{p}_{ui}$, the predicted preference of user $u$ for item $i$, is equal to $y_i^Tx_u$, we can substitute our user-factor equation for $x_u$. This gives us:

$\hat{p}_{ui} =  y_i^T(Y^T C^u Y + \lambda I)^{-1}  Y^T C^u p(u)$

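Denote the $f*f$ matrix $(Y^T C^u Y + \lambda I)^{-1}$ as $W^u$.

$W^u$ can be considered a weighting matrix for user $u$. Writing $s^u_{ij} = y_i^T W^u y_j$ for the weighted similarity of items $i$ and $j$ from user $u$'s viewpoint, the prediction becomes $\hat{p}_{ui} = \sum_{j: r_{uj} > 0} s^u_{ij} c_{uj}$, a sum of contributions from the items the user has already interacted with.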
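The decomposition can be checked numerically. Expanding the prediction gives $\hat{p}_{ui} = \sum_j (y_i^T W^u y_j) c_{uj} p_{uj}$, so each past item $j$ contributes $y_i^T W^u y_j \cdot c_{uj}$ to the score of item $i$. The sketch below uses made-up item factors and counts (all names and values are assumptions) and picks out the past item that contributes most to one recommendation.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(size=(5, 2))                    # 5 items, 2 factors (toy values)
r_u = np.array([4.0, 0.0, 1.0, 0.0, 0.0])      # user u interacted with items 0 and 2
alpha, lam = 20.0, 0.1

c_u = 1.0 + alpha * r_u
p_u = (r_u > 0).astype(float)

# W^u = (Y^T C^u Y + lam*I)^{-1}, an f x f matrix specific to user u
W_u = np.linalg.inv(Y.T @ (c_u[:, None] * Y) + lam * np.eye(Y.shape[1]))

x_u = W_u @ Y.T @ (c_u * p_u)                  # user factor from the ALS update
p_hat = Y @ x_u                                # predicted preference for every item

S_u = Y @ W_u @ Y.T                            # S_u[i, j] = y_i^T W^u y_j
contrib = S_u * (c_u * p_u)                    # contrib[i, j]: part of p_hat[i] due to item j

target = 1                                     # explain the score of an unseen item
seen = np.flatnonzero(r_u)
top_reason = int(seen[np.argmax(contrib[target, seen])])  # past item contributing most
```

Summing `contrib` across each row recovers `p_hat`, which is what lets a recommendation be explained in terms of the user's past interactions.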


### Ranking Algorithm

$\overline{rank} = \frac{\sum_{u,i} r^t_{ui} * rank_{ui}}{\sum_{u,i} r^t_{ui}}$

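##### Where:
$r^t_{ui}$ is the number of interactions for user $u$ and item $i$ in the test set, and 

$rank_{ui}$ is the percentile rank of item $i$ in the model's ordered list of recommendations for user $u$, where 0% is the top of the list. Lower values of $\overline{rank}$ are better; random predictions give an expected value of about 50%.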
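A small worked example of this metric, as a hedged sketch: the function name `expected_percentile_rank`, the scores, and the test counts are all made up for illustration. Dividing by `n_items - 1` so the last item lands at exactly 100% is one possible convention, not necessarily the one used elsewhere in this notebook.

```python
import numpy as np

def expected_percentile_rank(scores, R_test):
    """Weighted mean percentile rank; lower is better.

    scores: (m, n) predicted preferences, R_test: (m, n) held-out counts r^t_ui.
    """
    n_items = scores.shape[1]
    order = np.argsort(-scores, axis=1)            # item indices, best first
    ranks = np.argsort(order, axis=1)              # position of each item, 0 = best
    pct = 100.0 * ranks / (n_items - 1)            # 0% = top of the list, 100% = bottom
    return np.sum(R_test * pct) / np.sum(R_test)

scores = np.array([[0.9, 0.1, 0.5],
                   [0.2, 0.8, 0.4]])
R_test = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])

rank_bar = expected_percentile_rank(scores, R_test)
```

Here the first user's two test interactions fall on their top-ranked item (0%) and the second user's single interaction falls on their middle item (50%), so the weighted average is $(2 \cdot 0 + 1 \cdot 50) / 3 \approx 16.7$.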


In [6]:
model.recommend(0)

TypeError: recommend() missing 1 required positional argument: 'user_items'

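They log-scaled their data, which makes sense: raw interaction counts are heavy-tailed, so a few very large counts would otherwise dominate the confidence weights.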
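For comparison, the Hu, Koren, and Volinsky paper gives a log-scaled confidence, $c_{ui} = 1 + \alpha \log(1 + r_{ui} / \epsilon)$, as an alternative to the linear form. Whether this is the exact scaling the note above refers to is an assumption. The sketch below contrasts the two on a few counts, with `alpha=20` and `eps=1` chosen purely for illustration.

```python
import numpy as np

def linear_confidence(r, alpha=20.0):
    # c_ui = 1 + alpha * r_ui
    return 1.0 + alpha * r

def log_confidence(r, alpha=20.0, eps=1.0):
    # c_ui = 1 + alpha * log(1 + r_ui / eps); eps is an assumed scale parameter
    return 1.0 + alpha * np.log1p(r / eps)

r = np.array([0.0, 1.0, 10.0, 100.0])
lin = linear_confidence(r)
logc = log_confidence(r)
```

Both forms give a confidence of 1 for non-interactions, but the log form grows far more slowly, so a pair with 100 interactions does not outweigh everything else in the cost.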

# Implement Ranking Algorithm

First, train the model.

In [3]:
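import os
# Limit OpenBLAS to one thread; set this before NumPy is imported,
# since OpenBLAS reads the variable when the library loads
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import random
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.sparse as sparse
from scipy.sparse import csr_matrix
import implicit
from pymongo import MongoClient

from src.turn_db_main_into_utility_matrix import from_mongo_collection_to_utility_matrix

plt.style.use('ggplot')

# Connect to the local MongoDB instance and select the whosampled database
client = MongoClient()
db = client.whosampled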

# Read in the data from the Mongo collection

song_prod, artist_prod, df = from_mongo_collection_to_utility_matrix(db.main_redo)

In [15]:
artist_prod.head()
artist_prod.shape

(8203, 11284)

In [14]:
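# Best parameters found by the Grid Search
alpha = 20
factors = 5
iterations = 10

# NOTE: the values above are not passed to the model; this run fits
# factors=100 for 15 iterations, and alpha is never applied to the matrix
model = implicit.als.AlternatingLeastSquares(factors=100,iterations=15)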

# train the model on a sparse matrix of item/user/confidence weights
sparse_artist_prod = csr_matrix(artist_prod)
model.fit(sparse_artist_prod)

# recommend items for a user
sparse_prod_artist = sparse_artist_prod.T.tocsr()
recommendations = model.recommend(0, sparse_prod_artist, 20, False)

# find related items
# related = model.similar_items(0, 20)

# sim_users = model.similar_users(0, 10)

100%|██████████| 15.0/15 [00:07<00:00,  2.87it/s]


In [81]:
item_vecs = model.item_factors

user_vecs = model.user_factors

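# predictions has shape (items, users), i.e. (artists, producers)
predictions = item_vecs.dot(user_vecs.T)

# We need it as (u, i): one row per user
preds_prods_artists = predictions.T
# Get the rank of each item for each user: 0 = highest predicted preference
order = np.flip(preds_prods_artists.argsort(axis = 1), axis = 1)
ranks = order.argsort(axis = 1)

# Turn ranks into percentiles by dividing by the number of items (axis 1);
# shape[0] is the number of users and would mis-scale the result
percentages = (ranks / ranks.shape[1]) * 100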

# We need actual values, r_ui
prod_artists = artist_prod.T

numer = prod_artists * percentages

numer_all = np.sum(np.sum(numer))

denom = np.sum(np.sum(prod_artists))

KeyboardInterrupt: 

In [87]:
percentages[:10,:10]

array([[ 2.88904644, 19.15987239, 17.80397022, 59.83693726, 60.12938674,
        53.75753279, 16.359447  , 57.33782347, 32.96703297,  4.43991492],
       [58.87982985, 16.84686281, 40.77454803, 20.6043956 ,  8.84438143,
        62.15880893, 18.77880184, 10.49273307, 30.8135413 , 52.32187168],
       [ 0.61148529, 55.22864233, 57.06309819, 71.55264091, 51.04572847,
        45.18787664, 64.33002481, 45.30308401, 62.14994683, 63.82488479],
       [67.95462602,  5.54767813, 51.04572847,  8.41013825, 15.28713222,
        21.21588089, 39.85288905, 16.83800071, 67.71534917, 61.99929103],
       [29.60829493, 48.03261255, 49.98227579, 42.74193548,  7.49734137,
        41.52782701, 36.25487416,  2.16235378, 59.11024459, 68.48635236],
       [49.10492733, 56.41616448, 32.10740872, 13.77171216, 55.6540234 ,
        59.17227933, 65.61503013, 10.66997519,  6.62885502,  9.66855725],
       [ 9.21658986, 66.62531017, 58.87982985, 17.69762496, 55.56540234,
        24.14923786, 39.82630273, 48.68840837

In [72]:
yo
order = np.flip(yo.argsort(axis = 1), axis =1)
ranks = order.argsort(axis = 1)

In [73]:
yo

array([[ 0.00831857,  0.00194022,  0.00211498, -0.00188891],
       [-0.00109681,  0.00074547, -0.0002938 ,  0.00029709]],
      dtype=float32)

In [77]:
ranks

array([[0, 2, 1, 3],
       [3, 0, 2, 1]])
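The double `argsort` is easy to misread, so here is a small check using the `yo` values printed above. The first `argsort` (flipped) lists items from highest to lowest score; the second inverts that permutation, giving each item's position in the list. The explicit inversion at the end is an added cross-check, not part of the original cell.

```python
import numpy as np

# Values copied from the `yo` output above.
yo = np.array([[ 0.00831857,  0.00194022,  0.00211498, -0.00188891],
               [-0.00109681,  0.00074547, -0.0002938 ,  0.00029709]],
              dtype=np.float32)

order = np.flip(yo.argsort(axis=1), axis=1)   # order[u, k] = item with the k-th highest score
ranks = order.argsort(axis=1)                 # ranks[u, i] = position of item i, 0 = highest

# Inverting the permutation explicitly gives the same answer.
ranks_explicit = np.empty_like(order)
rows = np.arange(order.shape[0])[:, None]
ranks_explicit[rows, order] = np.arange(order.shape[1])
```

For the first row, item 0 has the highest score (rank 0), item 2 is second (rank 1), item 1 is third (rank 2), and item 3 is last (rank 3), which matches the `ranks` output shown.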

In [47]:
highest_ranked[:, 0]

array([6302, 3763, 5022, 1828,  676,  895, 3867, 5606,  820,  900, 6822,
       3945, 3504, 2657, 1823, 4525, 4706, 6974, 2968, 8006, 2046, 5202,
       3893, 4417, 4090, 2207, 7649, 2907, 7410, 5630, 5258, 7241, 1562,
       5503, 5172, 2032, 2326, 1097, 4243, 2817, 7083, 6273,  724, 6731,
       4337, 6335, 5262, 4086, 6948, 3526, 7938, 5830, 2584, 7520, 7363,
       6781, 7694, 6881, 6631, 3520, 1428, 4943, 3979, 7206, 2752, 6281,
       7759, 4598, 2497, 7342, 7368, 5225, 1740, 7464, 2959, 6011, 6494,
       7344, 1942, 2602, 6398, 7261, 7039, 7244, 5733,   11, 1699, 4725,
       7201, 8036, 3142, 1124,  500,  391, 1513, 2078, 5305,  184, 7379,
       4507, 8073, 5325,   26, 1205, 6967, 7032, 6567,  720,  511,  815,
       3252, 7304, 1566, 7109, 3381, 6814, 2072, 4143,  216, 5212, 5464,
       3829, 4789, 8085, 7581, 5204, 4401, 2857, 7903, 6204, 3890, 2620,
       7308, 5477, 5561, 5982, 2343, 6482, 6936, 4510, 7373, 4335,  751,
        588, 3160, 5234, 5820, 4813, 4935, 5530,  3

In [45]:
ranks

array([1, 2, 0, 3])

In [46]:
order

array([2, 0, 1, 3])

In [4]:
# What percentage of the data is 2018, 2017, etc?

np.cumsum(df.groupby('new_song_year').count()['URL'].apply(
    lambda x: (x / len(df)) * 100 ).sort_index(
    ascending = False))

# If we take songs from 2016 through 2019 as our test set, we get about 5.6% of the data.

new_song_year
2019      0.127584
2018      1.752243
2017      3.614425
2016      5.619121
2015      8.069004
2014     10.856848
2013     14.537780
2012     18.091127
2011     21.231863
2010     24.333238
2009     26.670467
2008     29.158353
2007     31.654383
2006     34.037759
2005     36.783528
2004     39.245626
2003     41.907244
2002     44.100601
2001     46.638707
2000     49.208030
1999     52.375911
1998     55.539721
1997     58.646525
1996     62.246020
1995     65.739647
1994     70.241731
1993     74.872755
1992     80.081979
1991     84.847374
1990     88.833693
           ...    
1982     98.813741
1981     98.892463
1980     98.946754
1979     99.013261
1978     99.082482
1977     99.134058
1976     99.201922
1975     99.280644
1974     99.363438
1973     99.455732
1972     99.523596
1971     99.611819
1970     99.686469
1969     99.771978
1968     99.857486
1967     99.892775
1966     99.929422
1965     99.940280
1964     99.944352
1963     99.951138
1962     99.96335