<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="https://raw.githubusercontent.com/DataForScience/Networks/master/data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0" width=160px> </div>
    <div style="float: left; margin-left: 10px;"> <h1>Graphs and Networks</h1>
<h1>Lesson V - Recommender Systems</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
            @bgoncalves, @data4sci</p></div>
</div>

In [1]:
from collections import Counter
from pprint import pprint

import numpy as np

import matplotlib
import matplotlib.pyplot as plt 

import watermark

%load_ext watermark
%matplotlib inline

Watermark the notebook with current versions of all loaded libraries

In [2]:
%watermark -n -v -m -g -iv

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

Compiler    : Clang 10.0.0 
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 16
Architecture: 64bit

Git hash: e2a121886d5fb3c11344fccd2cf35c76180f7246

matplotlib: 3.3.2
numpy     : 1.19.2
watermark : 2.1.0
json      : 2.0.9



Load default figure style

In [3]:
plt.style.use('./d4sci.mplstyle')

Let's use this simple example, where everything is easy to visualize

<img src="data/bipartite.png" width='400px'>

We start by defining the adjacency matrix of our bipartite network. This is not the most efficient graph representation, but it is the most convenient in our case

In [4]:
A = np.zeros((8, 6), dtype='int')

Rows correspond to 'x' nodes and columns to 'y' nodes

In [5]:
A[0, 0]=1
A[0, 1]=1
A[1, 0]=1
A[1, 1]=1
A[1, 3]=1
A[2, 2]=1
A[2, 4]=1
A[3, 0]=1
A[3, 3]=1
A[4, 2]=1
A[4, 4]=1
A[5, 2]=1
A[5, 5]=1
A[6, 4]=1 
A[6, 5]=1
A[7, 4]=1

The adjacency matrix is then:

In [6]:
pprint(A)

array([[1, 1, 0, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 0],
       [1, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 0],
       [0, 0, 1, 0, 0, 1],
       [0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 1, 0]])


In [7]:
A.shape

(8, 6)
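
The same matrix can also be filled in a single step from a list of edges. This is only a sketch: the names `edges` and `A_from_edges` are ours, and the (x, y) pairs are copied from the assignments above with 0-based indices.

```python
import numpy as np

# (x, y) pairs copied from the element-wise assignments, 0-based
edges = np.array([(0, 0), (0, 1), (1, 0), (1, 1), (1, 3), (2, 2),
                  (2, 4), (3, 0), (3, 3), (4, 2), (4, 4), (5, 2),
                  (5, 5), (6, 4), (6, 5), (7, 4)])

# 8 'x' nodes (rows) and 6 'y' nodes (columns)
A_from_edges = np.zeros((8, 6), dtype='int')

# Fancy indexing sets every listed (row, column) entry to 1 at once
A_from_edges[edges[:, 0], edges[:, 1]] = 1

print(A_from_edges)
```

Keeping the edges in a list makes it easier to load a larger network from a file, at the cost of hiding the individual assignments.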

We can also calculate the degree of each type of node by summing over the rows (x nodes) or the columns (y nodes):

In [8]:
kx = A.sum(axis=1)
ky = A.sum(axis=0)

In [9]:
kx

array([2, 3, 2, 2, 2, 2, 2, 1])

In [10]:
ky

array([3, 2, 3, 2, 4, 2])
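
As a quick sanity check, every edge has exactly one 'x' end and one 'y' end, so both degree vectors should add up to the number of edges. The sketch below rebuilds `A` from the printed matrix so that it runs on its own; `n_edges` is a name introduced here.

```python
import numpy as np

# Adjacency matrix copied from the printed output above
A = np.array([[1, 1, 0, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [1, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 0, 1, 0]])

kx = A.sum(axis=1)  # degree of each 'x' node (row sums)
ky = A.sum(axis=0)  # degree of each 'y' node (column sums)

# Each edge contributes one unit to kx and one unit to ky
n_edges = int(A.sum())
print(n_edges, int(kx.sum()), int(ky.sum()))
```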

The X and Y one-mode projections are:

In [11]:
X = np.dot(A, A.T)
Y = np.dot(A.T, A)

In [12]:
pprint(X)

array([[2, 2, 0, 1, 0, 0, 0, 0],
       [2, 3, 0, 2, 0, 0, 0, 0],
       [0, 0, 2, 0, 2, 1, 1, 1],
       [1, 2, 0, 2, 0, 0, 0, 0],
       [0, 0, 2, 0, 2, 1, 1, 1],
       [0, 0, 1, 0, 1, 2, 1, 0],
       [0, 0, 1, 0, 1, 1, 2, 1],
       [0, 0, 1, 0, 1, 0, 1, 1]])


In [13]:
pprint(Y)

array([[3, 2, 0, 2, 0, 0],
       [2, 2, 0, 1, 0, 0],
       [0, 0, 3, 0, 2, 1],
       [2, 1, 0, 2, 0, 0],
       [0, 0, 2, 0, 4, 1],
       [0, 0, 1, 0, 1, 2]])


And we can see that the y-projection neatly splits into two disconnected graphs, as expected

In [14]:
order = [0, 1, 3, 2, 4, 5]
Y[order, :][:, order]

array([[3, 2, 2, 0, 0, 0],
       [2, 2, 1, 0, 0, 0],
       [2, 1, 2, 0, 0, 0],
       [0, 0, 0, 3, 2, 1],
       [0, 0, 0, 2, 4, 1],
       [0, 0, 0, 1, 1, 2]])
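
The reordering above was chosen by eye. One possible way to find the two groups automatically is to label the connected components of the projection with a depth-first search. This is a minimal sketch, not part of the original notebook: the helper `components` is hypothetical, and `A` is rebuilt from the printed matrix so the block runs on its own.

```python
import numpy as np

# Adjacency matrix copied from the printed output above
A = np.array([[1, 1, 0, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [1, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 0, 1, 0]])
Y = np.dot(A.T, A)

def components(M):
    """Label the connected components of the graph whose edges
    are the non-zero entries of the square matrix M."""
    n = M.shape[0]
    labels = -np.ones(n, dtype=int)  # -1 marks unvisited nodes
    current = 0
    for start in range(n):
        if labels[start] >= 0:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in np.nonzero(M[node])[0]:
                if labels[neighbor] < 0:
                    labels[neighbor] = current
                    stack.append(neighbor)
        current += 1
    return labels

labels = components(Y)
print(labels)

# Sorting nodes by label groups each component into a contiguous block
order = np.argsort(labels, kind='stable')
print(Y[order, :][:, order])
```

A stable sort keeps the original node order within each component, so the resulting matrix is block diagonal.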

## Similarity

Let us define the similarity between two users (X) or items (Y) to simply be the fraction of edges they share, measured relative to the smaller of their two degrees. For convenience, we supply the one-mode X projection directly

In [15]:
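def similarity(X, kx):
    # X[i, j] counts the neighbors shared by nodes i and j
    S = X.copy().astype('float')
    
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            # Normalize by the smaller degree, so the maximum similarity is 1
            S[i, j] /= np.min([kx[i], kx[j]])

    return S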

Our similarity is then:

In [16]:
S = similarity(X, kx)
print(S)

[[1.  1.  0.  0.5 0.  0.  0.  0. ]
 [1.  1.  0.  1.  0.  0.  0.  0. ]
 [0.  0.  1.  0.  1.  0.5 0.5 1. ]
 [0.5 1.  0.  1.  0.  0.  0.  0. ]
 [0.  0.  1.  0.  1.  0.5 0.5 1. ]
 [0.  0.  0.5 0.  0.5 1.  0.5 0. ]
 [0.  0.  0.5 0.  0.5 0.5 1.  1. ]
 [0.  0.  1.  0.  1.  0.  1.  1. ]]
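
For larger networks the double loop becomes slow. A vectorized version can be sketched with `np.minimum.outer`, which builds the full matrix of pairwise minimum degrees in one call. The name `similarity_vectorized` is ours, and the sketch assumes every node has at least one edge, so that no denominator is zero.

```python
import numpy as np

# Adjacency matrix copied from the printed output above
A = np.array([[1, 1, 0, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [1, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 0, 1, 0]])
X = np.dot(A, A.T)
kx = A.sum(axis=1)

def similarity_vectorized(X, kx):
    # min_degree[i, j] = min(kx[i], kx[j]) for every pair of nodes
    min_degree = np.minimum.outer(kx, kx)
    # Assumes all degrees are positive; a zero degree would divide by zero
    return X / min_degree

S_fast = similarity_vectorized(X, kx)
print(S_fast)
```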


Naturally, this similarity metric is symmetric

In [17]:
np.abs(S-S.T).max() # largest asymmetry; the signed mean of S-S.T is zero for any square matrix

0.0

Now we can predict scores for all user/item pairs. The score for each user-item pair is the summed similarity between that user and all users connected to the item, normalized by the user's total similarity to every other user

In [18]:
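def predict_score(A, S):
    # v[i, a]: total similarity between user i and the users linked to item a
    v = np.dot(S, A)
    # Total similarity of each user to all other users (self-similarity removed)
    norms = S.sum(axis=0)-np.diag(S)
    
    # Divide each user's row by that user's total
    v = v/norms.reshape(-1,1)

    return v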

The predicted scores are then:

In [19]:
v = predict_score(A, S)
print(v.round(2))

[[1.67 1.33 0.   1.   0.   0.  ]
 [1.5  1.   0.   1.   0.   0.  ]
 [0.   0.   0.83 0.   1.17 0.33]
 [1.67 1.   0.   1.33 0.   0.  ]
 [0.   0.   0.83 0.   1.17 0.33]
 [0.   0.   1.33 0.   1.   1.  ]
 [0.   0.   0.6  0.   1.2  0.6 ]
 [0.   0.   0.67 0.   1.33 0.33]]


As we can see, we have scores not only for the items that each user is already connected to, but also for items they have not seen yet:

In [20]:
np.transpose(np.nonzero(np.sign(v)-A)) # Get the coordinates of the non-zero elements

array([[0, 3],
       [2, 5],
       [3, 1],
       [4, 5],
       [5, 4],
       [6, 2],
       [7, 2],
       [7, 5]])

From this matrix we would know to recommend **y4** to **x1**, **y6** to **x3**, etc. The printed indices are 0-based, so the pair `[0, 3]` corresponds to x1 and y4.
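
When a user has more than one candidate item, a natural next step is to recommend the one with the highest predicted score. The sketch below is one way to do that and is not part of the original notebook: the names `new_scores` and `best` are ours, the similarity is computed with `np.minimum.outer` instead of the explicit loop, and indices are 0-based.

```python
import numpy as np

# Adjacency matrix copied from the printed output above
A = np.array([[1, 1, 0, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [1, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 0, 1, 0]])

# Recompute the similarity and the predicted scores
X = np.dot(A, A.T)
kx = A.sum(axis=1)
S = X / np.minimum.outer(kx, kx)
norms = S.sum(axis=0) - np.diag(S)
v = np.dot(S, A) / norms.reshape(-1, 1)

# Keep only scores for items the user is not yet connected to
new_scores = np.where(A == 0, v, 0.0)

# For each user, pick the unseen item with the highest positive score
best = {}
for user in range(A.shape[0]):
    item = int(np.argmax(new_scores[user]))
    if new_scores[user, item] > 0:
        best[user] = item

print(best)
```

Users without any positively scored unseen item are simply left out of `best`. Ties are broken in favor of the lowest item index, which is how `np.argmax` behaves.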

<div style="width: 100%; overflow: hidden;">
     <img src="data/D4Sci_logo_full.png" alt="Data For Science, Inc" align="center" border="0" width=300px> 
</div>