<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="https://raw.githubusercontent.com/DataForScience/Networks/master/data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0" width=160px> </div>
    <div style="float: left; margin-left: 10px;"> <h1>Graphs and Networks</h1>
<h1>Recommender Systems</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
            @bgoncalves, @data4sci</p></div>
</div>

In [1]:
from collections import Counter
from pprint import pprint

import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt 

import watermark

%load_ext watermark
%matplotlib inline

Watermark the notebook with current versions of all loaded libraries

In [2]:
%watermark -n -v -m -g -iv

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 8.12.2

Compiler    : Clang 10.0.0 
OS          : Darwin
Release     : 22.4.0
Machine     : x86_64
Processor   : i386
CPU cores   : 16
Architecture: 64bit

Git hash: c651710cbd33e41181173979762bc5df0ee7e746

json      : 2.0.9
numpy     : 1.24.2
matplotlib: 3.7.1
watermark : 2.1.0
pandas    : 1.5.3



Load default figure style

In [3]:
plt.style.use('./d4sci.mplstyle')

Let's use this simple example, where everything is easy to visualize

<img src="data/bipartite.png" width='400px'>

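We start by defining the adjacency matrix of our bipartite user-movie network: rows are users, columns are movies, and an entry is 1 if the user rated the movie and 0 otherwise. A dense matrix is not the most efficient graph representation, but it is the most convenient in our case: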

In [4]:
ratings = pd.read_csv('data/ml-latest-small/ratings.csv')
ratings = pd.pivot_table(ratings, index='userId', columns='movieId', values='rating')

In [5]:
A = ratings.copy()
A[A>0]=1

In [6]:
A.fillna(0, inplace=True)

The adjacency matrix is then:

In [7]:
A.astype('int')

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,1,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
607,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
608,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
609,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In total, we have 610 users and 9724 movies

In [8]:
A.shape

(610, 9724)
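
The construction of `A` can be checked on a small example. The sketch below uses a hypothetical five-rating table (`toy`, `A_toy` and `A_alt` are made-up names and data, not part of the dataset) and suggests that pivoting, binarizing and filling the gaps amounts to asking which entries of the pivot table are present:

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 20, 10, 30],
    'rating':  [4.0, 0.5, 3.0, 5.0, 2.5],
})

# Same steps as above: pivot, binarize, fill missing entries with 0
R = pd.pivot_table(toy, index='userId', columns='movieId', values='rating')
A_toy = R.copy()
A_toy[A_toy > 0] = 1
A_toy = A_toy.fillna(0)

# Equivalent formulation: an edge exists wherever a rating is present
A_alt = R.notna().astype(float)

print(A_toy.astype(int))
```

The shortcut relies on every stored rating being strictly positive, which is an assumption about the data rather than something enforced by the code.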

We can also easily calculate the degree of each kind of node by summing along each row or each column of the adjacency matrix. These sums give the number of movies each user rated and the number of users that rated each movie, respectively:

In [9]:
k_user = A.sum(axis=1)
k_movie = A.sum(axis=0)

In [10]:
k_user

userId
1       232.0
2        29.0
3        39.0
4       216.0
5        44.0
        ...  
606    1115.0
607     187.0
608     831.0
609      37.0
610    1302.0
Length: 610, dtype: float64

In [11]:
k_movie

movieId
1         215.0
2         110.0
3          52.0
4           7.0
5          49.0
          ...  
193581      1.0
193583      1.0
193585      1.0
193587      1.0
193609      1.0
Length: 9724, dtype: float64
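
Since every edge connects exactly one user to one movie, both degree sequences must add up to the total number of edges. A minimal sketch of this check, on a hypothetical 3x3 adjacency matrix whose values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy adjacency matrix (users as rows, movies as columns)
A_toy = pd.DataFrame(
    [[1, 1, 0],
     [0, 1, 0],
     [1, 0, 1]],
    index=pd.Index([1, 2, 3], name='userId'),
    columns=pd.Index([10, 20, 30], name='movieId'),
)

k_user_toy = A_toy.sum(axis=1)   # number of movies rated by each user
k_movie_toy = A_toy.sum(axis=0)  # number of users that rated each movie
n_edges = int(A_toy.values.sum())

# Both degree sequences sum to the number of edges
print(k_user_toy.sum(), k_movie_toy.sum(), n_edges)
```

The same identity, `k_user.sum() == k_movie.sum()`, should hold for the full matrix `A`.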

The User-User and Movie-Movie one-mode projections are:

In [12]:
UU = A.dot(A.T)
MM = A.T.dot(A)

In [13]:
pprint(UU.shape)

(610, 610)


In [14]:
pprint(MM.shape)

(9724, 9724)
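
To see what these products contain, note that entry (u, v) of `UU` counts the movies rated by both u and v, so the diagonal is just the user degree (and likewise for `MM` and movies). A small sketch on a hypothetical toy matrix, with made-up values and names:

```python
import numpy as np
import pandas as pd

# Hypothetical toy adjacency matrix (users as rows, movies as columns)
A_toy = pd.DataFrame(
    [[1, 1, 0],
     [0, 1, 0],
     [1, 0, 1]],
    index=pd.Index([1, 2, 3], name='userId'),
    columns=pd.Index([10, 20, 30], name='movieId'),
)

UU_toy = A_toy.dot(A_toy.T)  # shared movies between pairs of users
MM_toy = A_toy.T.dot(A_toy)  # shared users between pairs of movies

# The diagonals recover the degrees of each node type
assert (np.diag(UU_toy) == A_toy.sum(axis=1).values).all()
assert (np.diag(MM_toy) == A_toy.sum(axis=0).values).all()

print(UU_toy)
```

Here users 1 and 3 share one movie (movie 10), while users 2 and 3 share none.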


## Similarity

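Let us define the similarity between two users (UU) or two movies (MM) to simply be the number of neighbors they share divided by the smaller of their two degrees, $S_{uv} = (UU)_{uv} / \min(k_u, k_v)$. In other words, it is the fraction of the edges of the lower-degree node that are shared with the other node. For convenience, we supply the one-mode UU projection directly: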

In [15]:
S = UU.copy().astype('float')

In [16]:
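def similarity(X, kx):
    # X: one-mode projection (shared-neighbor counts), kx: degree of each node
    S = X.copy().astype('float')
    
    for i in X.index:
        for j in X.columns:
            # Normalize the overlap by the smaller of the two degrees
            S.loc[i, j] /= np.min([kx[i], kx[j]])

    return S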
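
The double loop touches every pair of nodes one at a time, which can be slow for larger projections such as `MM`. A possible vectorized sketch is shown below; `similarity_vectorized` is a hypothetical helper name, and it assumes that `X` is square with identical row and column labels and that no node has degree zero:

```python
import numpy as np
import pandas as pd

def similarity_vectorized(X, kx):
    # Degrees in the same order as the rows/columns of X
    k = kx.loc[X.index].to_numpy(dtype=float)
    # (n, n) array holding min(k_i, k_j) for every pair of nodes
    min_k = np.minimum.outer(k, k)
    # Element-wise division of the overlap counts by the pairwise minimum
    return X.astype(float) / min_k

# Hypothetical toy projection and degrees, made up for illustration
idx = pd.Index([1, 2, 3], name='userId')
UU_toy = pd.DataFrame([[2, 1, 1], [1, 1, 0], [1, 0, 2]], index=idx, columns=idx)
k_toy = pd.Series([2, 1, 2], index=idx)

S_toy = similarity_vectorized(UU_toy, k_toy)
print(S_toy)
```

On this toy input, user 2 has a single movie and shares it with user 1, so their similarity is 1 even though user 1 rated more movies.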

Our similarity is then:

In [17]:
S = similarity(UU, k_user)
print(S)

userId       1         2         3         4         5         6         7    \
userId                                                                         
1       1.000000  0.068966  0.179487  0.208333  0.295455  0.142241  0.171053   
2       0.068966  1.000000  0.000000  0.034483  0.034483  0.068966  0.103448   
3       0.179487  0.000000  1.000000  0.025641  0.025641  0.076923  0.000000   
4       0.208333  0.034483  0.025641  1.000000  0.272727  0.125000  0.144737   
5       0.295455  0.034483  0.025641  0.272727  1.000000  0.818182  0.204545   
...          ...       ...       ...       ...       ...       ...       ...   
606     0.362069  0.172414  0.205128  0.444444  0.522727  0.203822  0.532895   
607     0.320856  0.034483  0.102564  0.155080  0.340909  0.192513  0.184211   
608     0.586207  0.206897  0.230769  0.342593  0.681818  0.353503  0.690789   
609     0.243243  0.034483  0.000000  0.081081  0.270270  0.594595  0.162162   
610     0.297414  0.620690  0.179487  0.

Naturally, this similarity metric is symmetric, so the largest absolute difference between S and its transpose is zero:

In [18]:
(S-S.T).abs().max().max()

0.0

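Now we can predict scores for all user/item pairs. The score for each user-item pair is the total similarity between that user and all the users that have rated that item, normalized by the user's total similarity to every other user, $v_{um} = \sum_w S_{uw} A_{wm} / \sum_{w \neq u} S_{uw}$: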

In [19]:
def predict_score(A, S):
    # sum the scores of each rater for each movie, weighted by similarity
    v = S.dot(A)
    
    # Get the norm across all similarities
    norms = S.sum(axis=0)-np.diag(S)
    
    # Divide by the norm to normalize the scores
    v = v.div(norms).fillna(0)

    return v
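
One detail worth checking in `predict_score`: `norms` is indexed by `userId`, while `DataFrame.div` with its default `axis='columns'` matches a Series against the column labels of `v`, which are `movieId`s. A hedged sketch of a row-wise variant, in which each user's row is divided by that user's own norm, is given below on hypothetical toy data. The name `predict_score_rowwise` and all values are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

def predict_score_rowwise(A, S):
    # Sum of similarities between each user and the raters of each movie
    v = S.dot(A)
    # Total similarity of each user to all *other* users
    norms = S.sum(axis=1) - np.diag(S)
    # axis=0 matches norms (indexed by user) against the rows of v
    return v.div(norms, axis=0).fillna(0)

# Hypothetical toy data, made up for illustration
users = pd.Index([1, 2, 3], name='userId')
movies = pd.Index([10, 20, 30], name='movieId')
A_toy = pd.DataFrame([[1, 1, 0], [0, 1, 0], [1, 0, 1]], index=users, columns=movies)
S_toy = pd.DataFrame([[1.0, 1.0, 0.5], [1.0, 1.0, 0.0], [0.5, 0.0, 1.0]],
                     index=users, columns=users)

scores = predict_score_rowwise(A_toy, S_toy)

# With the default axis, user labels are matched against movie labels instead
norms = S_toy.sum(axis=1) - np.diag(S_toy)
misaligned = S_toy.dot(A_toy).div(norms)

print(scores.round(2))
print(sorted(misaligned.columns))
```

As in the original function, the self-similarity term stays in the numerator but not in the norm, so the score of a movie that a user has already rated can exceed 1.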

The predicted scores are then:

In [20]:
v = predict_score(A, S)
v.round(2)

userId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,0.41,0.48,0.61,0.02,0.11,0.21,0.13,0.02,0.10,0.56,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.23,0.31,0.22,0.00,0.05,0.10,0.04,0.01,0.03,0.23,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.12,0.17,0.21,0.01,0.04,0.07,0.04,0.01,0.03,0.16,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.29,0.32,0.38,0.01,0.08,0.14,0.10,0.01,0.06,0.36,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.47,0.64,0.67,0.04,0.14,0.24,0.16,0.03,0.08,0.74,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.57,0.58,0.56,0.03,0.12,0.24,0.16,0.02,0.10,0.67,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,0.41,0.49,0.55,0.02,0.11,0.20,0.12,0.02,0.10,0.58,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,0.76,0.86,0.94,0.03,0.20,0.34,0.23,0.03,0.21,1.05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,0.41,0.55,0.59,0.03,0.12,0.21,0.13,0.02,0.09,0.69,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

