## Collaborative filtering for implicit feedback data

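[Collaborative Filtering for Implicit Feedback Datasets](https://www.researchgate.net/publication/220765111_Collaborative_Filtering_for_Implicit_Feedback_Datasets) (Hu, Koren and Volinsky, 2008) describes the matrix factorization approach for implicit feedback that the ALS section of this notebook relies on.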

Most of the code below is taken from [Ben Frederickson's post on matrix factorization](http://www.benfrederickson.com/matrix-factorization/).

In [1]:
import requests, json, os
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
from tqdm import tqdm
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import svds
from sklearn.neighbors import NearestNeighbors
import implicit
from implicit.nearest_neighbours import bm25_weight

%matplotlib inline

In [2]:
# path = 'https://raw.githubusercontent.com/James-Leslie/deep-collaborative-filtering/master/data/lastfm/'
path = 'data/lastfm/'
df = pd.read_csv(path+'plays.csv')
df.head()

Unnamed: 0,userId,artistId,playcount
0,2,51,13883
1,2,52,11690
2,2,53,11351
3,2,54,10300
4,2,55,8983


In [3]:
df.shape

(92834, 3)

In [4]:
artists = pd.read_csv(path+'artists.csv', usecols=['artistId', 'name'])
artists.head()

Unnamed: 0,artistId,name
0,1,MALICE MIZER
1,2,Diary of Dreams
2,3,Carpathian Forest
3,4,Moi dix Mois
4,5,Bella Morte


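## (To do) convert play counts to confidence $c_{ui}$
$c_{ui} = 1 + \alpha r_{ui}$

Here $r_{ui}$ is the number of times user $u$ played artist $i$. The paper linked above reports that setting $\alpha=40$ produced good results in its experiments.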
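The conversion above can be sketched as follows. This is a minimal illustration rather than part of the pipeline below: the helper name `confidence` and the toy play counts are made up, and $\alpha=40$ is just the value quoted from the paper.

```python
import numpy as np

ALPHA = 40.0  # value quoted from the paper; treat it as a tunable assumption

def confidence(r, alpha=ALPHA):
    """Hypothetical helper: c_ui = 1 + alpha * r_ui, applied element-wise."""
    return 1.0 + alpha * np.asarray(r, dtype=float)

toy_counts = np.array([0, 1, 3, 10])  # made-up play counts
toy_conf = confidence(toy_counts)
print(toy_conf)  # [  1.  41. 121. 401.]
```

For a sparse matrix, the same function could be applied to the stored entries only (for example to a copy's `.data` array), leaving unobserved pairs at the baseline confidence of 1.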

## Encode ID variables and save as sparse matrix
Use pandas' built-in `category` dtype to encode each ID as an integer code from 0 to N-1, where N is the number of distinct IDs

In [5]:
# map each user and artist to a unique numeric value
df['userId'] = df['userId'].astype("category")
df['artistId'] = df['artistId'].astype("category")

# keys are the new category codes (0 to N-1), values are the original ids
artist_decoder = dict(enumerate(df['artistId'].cat.categories))
user_decoder = dict(enumerate(df['userId'].cat.categories))
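A small sketch of how the encoding and decoding fit together, using made-up ids in a toy dataframe (the names `toy`, `toy_decoder` and `toy_encoder` are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({'artistId': [51, 7, 51, 300]})  # made-up ids
toy['artistId'] = toy['artistId'].astype('category')

# codes are 0-based and follow the sorted order of the distinct ids
codes = toy['artistId'].cat.codes
print(codes.tolist())  # [1, 0, 1, 2]

# code -> original id, built the same way as artist_decoder above
toy_decoder = dict(enumerate(toy['artistId'].cat.categories))

# original id -> code, the reverse lookup
toy_encoder = {orig_id: code for code, orig_id in toy_decoder.items()}
```

The reverse mapping is handy when starting from a known artist ID and looking for its row in the factor matrix.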

#### Decoding category codes
For example, `user_decoder[0]` gives the original ID of the user in the first column of the user factor matrix:

In [6]:
# find the first user's top 10 artists by play count
top10 = df[df['userId']==user_decoder[0]].sort_values('playcount', ascending=False).head(10)
pd.merge(top10, artists, on='artistId')

Unnamed: 0,userId,artistId,playcount,name
0,2,51,13883,Duran Duran
1,2,52,11690,Morcheeba
2,2,53,11351,Air
3,2,54,10300,Hooverphonic
4,2,55,8983,Kylie Minogue
5,2,56,6152,Daft Punk
6,2,57,5955,Thievery Corporation
7,2,58,4616,Goldfrapp
8,2,59,4337,New Order
9,2,60,4147,Matt Bianco


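The `implicit` library expects an item-user matrix, with items as rows and users as columns. (This is the convention of the version used here; newer releases may expect a user-item matrix instead, so check the documentation for your installed version.)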

In [7]:
# create a sparse matrix of all the artist/user/play triples
plays = coo_matrix((df['playcount'].astype(float),
                   (df['artistId'].cat.codes,
                    df['userId'].cat.codes)))

In [8]:
# shape is (n_artists, n_users); there is no need to densify the matrix to read it
plays.shape

(17632, 1892)
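To make the layout concrete, here is a toy version of the same construction with made-up triples. It only illustrates how `coo_matrix` places values and infers the shape; the variable names are not used elsewhere in the notebook.

```python
import numpy as np
from scipy.sparse import coo_matrix

# made-up (artist code, user code, playcount) triples
artist_codes = np.array([0, 0, 2])
user_codes = np.array([0, 1, 1])
counts = np.array([5.0, 2.0, 7.0])

# rows are artists, columns are users; the shape is inferred from the largest codes
toy_item_user = coo_matrix((counts, (artist_codes, user_codes)))
print(toy_item_user.shape)  # (3, 2)
print(toy_item_user.toarray())

# transposing gives the user-item orientation, here converted to CSR
toy_user_item = toy_item_user.T.tocsr()
```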

## Latent Semantic Analysis (LSA)
Decompose item-user matrix into latent factor submatrices

In [9]:
class LSA:
    def __init__(self, k=50, N=11):
        self.k = k  # number of latent factors
        self.N = N  # number of neighbours returned per item (the item itself included)

    def fit(self, interactions):
        '''
        1) Decompose the BM25-weighted interactions into latent factor submatrices
        2) Find the nearest neighbours of every item by cosine distance'''

        # get submatrices: item_factors is (n_items, k), user_factors is (k, n_users)
        self.item_factors, _, self.user_factors = svds(bm25_weight(interactions), k=self.k)

        # nearest neighbours of every item; an item's closest match is normally itself
        nbrs = NearestNeighbors(n_neighbors=self.N+1, metric='cosine', n_jobs=-1).fit(self.item_factors)
        self.distances, self.indices = nbrs.kneighbors(self.item_factors)

        ## TODO add user distances

    def get_similar_items(self, item_no):
        '''Get the N nearest items to a given item as (index, cosine distance) pairs.
        The item itself normally comes first, at distance 0.'''

        return [(self.indices[item_no, i], self.distances[item_no, i])
                for i in range(self.N)]
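The `svds` call in `fit` returns three arrays, and their orientation is easy to get wrong. A minimal sketch on a small random matrix (made-up data, with the hypothetical names `toy`, `item_vecs` and `user_vecs`) shows the shapes involved:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# toy item-user matrix: 8 items x 6 users, random made-up values
rng = np.random.default_rng(0)
toy = csr_matrix(rng.random((8, 6)))

# truncated SVD with k factors; k must be smaller than both dimensions
u, s, vt = svds(toy, k=3)
print(u.shape, s.shape, vt.shape)  # (8, 3) (3,) (3, 6)

item_vecs = u     # one row per item
user_vecs = vt.T  # transpose so that there is one row per user
```

So `item_factors` in the class holds one row per item, while `user_factors` holds one column per user and would need transposing before being passed to a neighbour search.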

In [10]:
lsa = LSA()

In [11]:
lsa.fit(plays)

In [12]:
lsa.get_similar_items(4)

[(4, 0.0),
 (14549, 0.02978928461596042),
 (16, 0.03086833496011321),
 (7863, 0.062135712434408075),
 (7861, 0.06458656352403469),
 (7859, 0.0645865635240348),
 (7864, 0.0645865635240348),
 (7860, 0.06458656352403513),
 (1291, 0.06574311957413659),
 (3983, 0.07137023004182863),
 (1294, 0.0896739830869061)]
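The first entry is the query item itself at distance 0, which is why the class asks for one neighbour more than `N`. A small sketch with made-up 2-D vectors shows the same behaviour of `NearestNeighbors` with the cosine metric:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# made-up vectors chosen so that no two distances to row 0 are tied
vecs = np.array([[ 1.0, 0.0],
                 [ 1.0, 1.0],   # 45 degrees from row 0
                 [ 0.0, 1.0],   # orthogonal to row 0
                 [-1.0, 0.0]])  # opposite to row 0

nn = NearestNeighbors(n_neighbors=3, metric='cosine').fit(vecs)
dist, idx = nn.kneighbors(vecs[[0]])

print(idx[0])   # the query row comes first, then rows 1 and 2
print(dist[0])  # cosine distance = 1 - cosine similarity

# drop the first column to exclude the query item itself
others = idx[0, 1:]
```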

In [13]:
def decode_results(results):
    '''Turn (artist code, score) pairs into a dataframe of artist names and scores.
    The score column is always labelled 'cosineDist', whatever kind of score is passed in.'''
    
    # convert results to dataframe
    results = pd.DataFrame(np.vstack(results), columns=['artistCode', 'cosineDist'])
    results['artistCode'] = results['artistCode'].astype('int')  # change code column dtype to int
    
    # add artist names
    results['artistId'] = results['artistCode'].apply(lambda x: artist_decoder[x])
    results = pd.merge(results, artists, on='artistId')
    
    return results[['name', 'cosineDist']]

decode_results(lsa.get_similar_items(4))

Unnamed: 0,name,cosineDist
0,Bella Morte,0.0
1,Dismantled,0.029789
2,The Crüxshadows,0.030868
3,Screaming Lights,0.062136
4,Great Northern,0.064587
5,The Longcut,0.064587
6,Beyond the Void,0.064587
7,Tapping the Vein,0.064587
8,Blutengel,0.065743
9,The Black Angels,0.07137


## Alternating Least Squares (ALS)
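For reference, a sketch of the objective from the paper linked at the top of this notebook, written with the confidence $c_{ui} = 1 + \alpha r_{ui}$ from the earlier section. The symbols $p_{ui}$ (binary preference), $x_u$ and $y_i$ (user and item factor vectors) and $\lambda$ (regularisation weight) follow the paper's notation and are not defined anywhere else in this notebook:

$$p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases}$$

$$\min_{x_*, y_*} \sum_{u,i} c_{ui} \left(p_{ui} - x_u^T y_i\right)^2 + \lambda \left(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\right)$$

With the item factors held fixed, each user vector has a closed-form solution, and the same holds for item vectors with the roles swapped. Alternating between the two gives the method its name:

$$x_u = \left(Y^T C^u Y + \lambda I\right)^{-1} Y^T C^u p(u)$$

Here $Y$ stacks the item vectors $y_i^T$ as rows, $C^u$ is the diagonal matrix with $C^u_{ii} = c_{ui}$, and $p(u)$ is the vector of preferences $p_{ui}$ for user $u$. How the `implicit` library implements these updates internally is not covered here.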

In [14]:
# initialize a model
model = implicit.als.AlternatingLeastSquares(factors=50)



In [15]:
# train the model on a sparse matrix of item/user/confidence weights
model.fit(plays)


In [16]:
# recommend items for user 0
user_items = plays.T.tocsr()  # need to transpose item_user matrix
recommendations = model.recommend(0, user_items)

In [17]:
recommendations

[(976, 1.3869199),
 (148, 1.2651378),
 (1089, 1.2323935),
 (1081, 1.1916033),
 (3464, 1.1596777),
 (383, 1.1580961),
 (3181, 1.1508766),
 (595, 1.1503391),
 (520, 1.1247096),
 (950, 1.1186937)]
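The list of `(code, score)` pairs above is what the version of `implicit` used here returns. Newer releases (0.5 onwards, as far as I know) return a pair of arrays `(ids, scores)` instead, which `decode_results` would not handle as written. A minimal sketch of a hypothetical helper, `as_pairs` (not part of `implicit`), that accepts either shape:

```python
import numpy as np

def as_pairs(result):
    """Hypothetical helper: coerce a result into a list of (code, score) pairs."""
    # assumed newer style: a tuple of two equal-length arrays (ids, scores)
    if isinstance(result, tuple) and len(result) == 2:
        ids, scores = result
        return [(int(i), float(s)) for i, s in zip(ids, scores)]
    # older style, as in this notebook: already a list of (id, score) pairs
    return [(int(i), float(s)) for i, s in result]

old_style = [(976, 1.5), (148, 1.25)]                      # made-up scores
new_style = (np.array([976, 148]), np.array([1.5, 1.25]))  # same data as two arrays
print(as_pairs(old_style) == as_pairs(new_style))  # True
```

With such a helper, `decode_results(as_pairs(recommendations))` should work for either return format, assuming the ids still refer to the same category codes.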

In [18]:
decode_results(recommendations)

Unnamed: 0,name,cosineDist
0,Blondie,1.38692
1,Radiohead,1.265138
2,Björk,1.232394
3,Franz Ferdinand,1.191603
4,Mylène Farmer,1.159678
5,Enya,1.158096
6,Kate Bush,1.150877
7,Prince,1.150339
8,La Roux,1.12471
9,Queen,1.118694


In [19]:
# find the 11 highest-scoring related items for item 4 (the item itself is one of them)
related = model.similar_items(4, N=11)

In [20]:
related

[(4, 0.021726523),
 (14549, 0.020093562),
 (13690, 0.01963245),
 (1312, 0.0196114),
 (4764, 0.018954225),
 (14550, 0.018550953),
 (7863, 0.018513454),
 (7859, 0.018510925),
 (7861, 0.018476276),
 (6405, 0.01843738),
 (7864, 0.018391205)]

In [21]:
decode_results(related)

Unnamed: 0,name,cosineDist
0,Bella Morte,0.021727
1,Dismantled,0.020094
2,Razed in Black,0.019632
3,Lights of Euphoria,0.019611
4,Imperative Reaction,0.018954
5,Hora,0.018551
6,Screaming Lights,0.018513
7,The Longcut,0.018511
8,Great Northern,0.018476
9,Blaqk Audio,0.018437
