# Collaborative Filtering

### Notations
Suppose we want to predict the rating that each user would give to each movie.

| Notation | Description |
| :- | :- |
| $n_u$ | Number of users |
| $n_m$ | Number of movies |
| $r(i, j)$ | Equals 1 if user "j" has rated movie "i", 0 otherwise |
| $y^{(i, j)}$ | Rating of user "j" for the movie "i" |
| $n$ | Number of features for each movie |
| $w^{(j)}$ | Weights of features for user "j" |
| $x^{(i)}$ | Features for movie "i" |
| $b^{(j)}$ | Bias (intercept) for user "j" |

Given a dataset like the one below, collaborative filtering uses the ratings of all users to predict the missing ratings (marked "?").

| Movie | User1 | User2 | User3 | User4 | User5 | Feature1 ($x_1$) | Feature2 ($x_2$) |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Movie1 | 5 | 5 | 0 | 0 | ? | 0.9 | 0 |
| Movie2 | 5 | ? | ? | 0 | ? | 1.0 | 0.01 |
| Movie3 | ? | 4 | 0 | ? | ? | 0.99 | 0 |
| Movie4 | 0 | 0 | 5 | 4 | ? | 0.1 | 1.0 |
| Movie5 | 0 | 0 | 5 | ? | ? | 0 | 0.9 |

In [1]:
import numpy as np
import tensorflow as tf
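
The rating table above can be turned into the matrices $y^{(i, j)}$ and $r(i, j)$ from the notation table. A minimal sketch, assuming each "?" entry is encoded as `np.nan`; the names `ratings`, `Y_table` and `R_table` are illustrative and not part of the original notebook:

```python
import numpy as np

# Rows are movies, columns are users (User1..User5); np.nan marks a "?" entry.
ratings = np.array([
    [5,      5,      0,      0,      np.nan],
    [5,      np.nan, np.nan, 0,      np.nan],
    [np.nan, 4,      0,      np.nan, np.nan],
    [0,      0,      5,      4,      np.nan],
    [0,      0,      5,      np.nan, np.nan],
])

# r(i, j) = 1 where user j has rated movie i, 0 otherwise.
R_table = (~np.isnan(ratings)).astype(int)

# y(i, j): the rating itself; unrated entries are filled with 0 and masked out by R_table.
Y_table = np.nan_to_num(ratings, nan=0.0)

n_m, n_u = Y_table.shape
print(n_m, n_u)              # 5 5
print(R_table.sum(axis=0))   # number of ratings per user: [4 4 4 3 0]
```

Keeping the mask separate from the ratings matters here because a rating of 0 is a legitimate value and must not be confused with "not rated".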

### Function
The prediction is a linear function, with the difference that the movie features $X^{(i)}$ are learned along with the user parameters $W^{(j)}$ and $b^{(j)}$, so there are three sets of variables<br>
$
f_{W, b, X}(W^{(j)}, b^{(j)}, X^{(i)}) = W^{(j)} \cdot X^{(i)} + b^{(j)}
$
<br>
If the ratings are mean normalized, the mean is added back to the prediction: <br>
$
f_{W, b, X}(W^{(j)}, b^{(j)}, X^{(i)}) = W^{(j)} \cdot X^{(i)} + b^{(j)} + \mu^{(i)}
$
<br>
where $\mu$ is a vector with $n_m$ rows holding the mean rating of each movie. (Mean normalization could instead be computed per user, taking the mean of each column rather than each row, in which case $\mu$ has $n_u$ entries.)
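
As a quick worked example of the prediction function, here is a sketch that uses the features of Movie3 from the table above. The user parameters `w_j` and `b_j` and the movie mean passed as `mu` are hypothetical values chosen only for illustration; nothing is learned here, and the helper name `predict` is not part of the notebook.

```python
import numpy as np

def predict(w, x, b, mu=0.0):
    """Prediction w . x + b, plus the movie mean when ratings were mean normalized."""
    return np.dot(w, x) + b + mu

x_i = np.array([0.99, 0.0])   # features of Movie3 from the table
w_j = np.array([5.0, 0.0])    # hypothetical weights for one user
b_j = 0.0                     # hypothetical bias for that user

print(predict(w_j, x_i, b_j))           # 5 * 0.99 + 0 * 0 + 0, i.e. about 4.95
print(predict(w_j, x_i, b_j, mu=0.5))   # same prediction shifted by a hypothetical mean of 0.5
```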

### Cost function
Now, in addition to learning the parameters $W$ and $b$, we also have to learn $X$. <br>
$
\begin{equation}
\displaystyle
    J(W, b, X) = \frac{1}{2} \sum_{(i, j):r(i, j)=1}(W^{(j)} \cdot X^{(i)} + b^{(j)} - y^{(i, j)})^2  + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(W_{k}^{(j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(X_{k}^{(i)})^2
\end{equation}
$
<br>
Equivalently, we can sum over all pairs and use $r(i, j)$ as a mask: <br>
$
\begin{equation}
\displaystyle
    J(W, b, X) = \frac{1}{2} \sum_{i=1}^{n_m}\sum_{j=1}^{n_u}r(i, j)\times(W^{(j)} \cdot X^{(i)} + b^{(j)} - y^{(i, j)})^2  + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(W_{k}^{(j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}(X_{k}^{(i)})^2
\end{equation}
$
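
The two ways of writing the cost can be checked against each other numerically. The sketch below uses small random matrices (not the movie data), and the function names `cost_rated_pairs` and `cost_masked` are illustrative. Here `b` is stored as a 1-D vector with one bias per user.

```python
import numpy as np

def cost_rated_pairs(X, W, b, Y, R, lambda_):
    """First form: sum only over pairs (i, j) with r(i, j) = 1."""
    J = 0.0
    for i, j in zip(*np.nonzero(R)):
        J += 0.5 * (W[j] @ X[i] + b[j] - Y[i, j]) ** 2
    return J + (lambda_ / 2) * (np.sum(W ** 2) + np.sum(X ** 2))

def cost_masked(X, W, b, Y, R, lambda_):
    """Second form: sum over all pairs, with r(i, j) zeroing out the unrated ones."""
    err = (X @ W.T + b - Y) * R
    return 0.5 * np.sum(err ** 2) + (lambda_ / 2) * (np.sum(W ** 2) + np.sum(X ** 2))

rng = np.random.default_rng(0)
n_m, n_u, n = 4, 3, 2
X = rng.normal(size=(n_m, n))                            # movie features
W = rng.normal(size=(n_u, n))                            # user weights
b = rng.normal(size=n_u)                                 # user biases
Y = rng.integers(0, 6, size=(n_m, n_u)).astype(float)    # ratings 0..5
R = rng.integers(0, 2, size=(n_m, n_u))                  # random 0/1 mask

print(np.isclose(cost_rated_pairs(X, W, b, Y, R, 1.5),
                 cost_masked(X, W, b, Y, R, 1.5)))       # True
```

The masked form is the one that vectorizes well, since it replaces the loop over rated pairs with a single element-wise product.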

### Gradient descent
Repeat {<br>
    $W_{k}^{(j)} = W_{k}^{(j)} - \alpha\displaystyle\frac{\partial}{\partial W_{k}^{(j)}}J(W, b, X)$ <br>
    $b^{(j)} = b^{(j)} - \alpha\displaystyle\frac{\partial}{\partial b^{(j)}}J(W, b, X)$ <br>
    $X_{k}^{(i)} = X_{k}^{(i)} - \alpha\displaystyle\frac{\partial}{\partial X_{k}^{(i)}}J(W, b, X)$ <br>
}
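
For illustration, the update rule can be sketched in NumPy with the partial derivatives written out by hand. This is an assumption-laden sketch rather than the notebook's training code: the helper names (`cost`, `gradients`, `gradient_step`) and the random test data are made up, and `b` is stored as a 1-D vector with one bias per user.

```python
import numpy as np

def cost(X, W, b, Y, R, lambda_):
    err = (X @ W.T + b - Y) * R
    return 0.5 * np.sum(err ** 2) + (lambda_ / 2) * (np.sum(W ** 2) + np.sum(X ** 2))

def gradients(X, W, b, Y, R, lambda_):
    """Partial derivatives of J, obtained by differentiating the cost by hand."""
    err = (X @ W.T + b - Y) * R          # (n_m, n_u), zero where r(i, j) = 0
    grad_W = err.T @ X + lambda_ * W     # sum over movies i of err * X_k^(i), plus lambda * W
    grad_b = err.sum(axis=0)             # sum over movies i of err
    grad_X = err @ W + lambda_ * X       # sum over users j of err * W_k^(j), plus lambda * X
    return grad_X, grad_W, grad_b

def gradient_step(X, W, b, Y, R, alpha, lambda_):
    """One simultaneous update of X, W and b."""
    grad_X, grad_W, grad_b = gradients(X, W, b, Y, R, lambda_)
    return X - alpha * grad_X, W - alpha * grad_W, b - alpha * grad_b

rng = np.random.default_rng(0)
n_m, n_u, n = 5, 4, 2
X = rng.normal(size=(n_m, n))
W = rng.normal(size=(n_u, n))
b = np.zeros(n_u)
Y = rng.integers(0, 6, size=(n_m, n_u)).astype(float)
R = rng.integers(0, 2, size=(n_m, n_u))

before = cost(X, W, b, Y, R, lambda_=1.0)
X_new, W_new, b_new = gradient_step(X, W, b, Y, R, alpha=0.001, lambda_=1.0)
after = cost(X_new, W_new, b_new, Y, R, lambda_=1.0)
print(before, after)   # with a small enough alpha the cost should go down
```

All three sets of parameters are updated from gradients computed at the same point, which is what "simultaneous update" means in the loop above.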

In [122]:
def cf_func(x, w, b):
    # Predicted rating: dot product of user weights w and movie features x, plus the user's bias b
    return np.dot(w, x) + b

In [3]:
def cf_cost(X, W, b, Y, R, lambda_):
    nm, n = X.shape   # number of movies, number of features
    nu, _ = W.shape   # number of users
    J = 0

    # Squared error over rated (movie, user) pairs only
    for j in range(nu):
        for i in range(nm):
            J += R[i, j] * (1/2) * (cf_func(X[i], W[j], b[0, j]) - Y[i, j]) ** 2

    # Regularization of the user weights
    for j in range(nu):
        for k in range(n):
            J += (lambda_ / 2) * W[j, k] ** 2

    # Regularization of the movie features
    for i in range(nm):
        for k in range(n):
            J += (lambda_ / 2) * X[i, k] ** 2

    return J

The function below is a vectorized implementation of the same cost (reference: "Coursera, unsupervised learning, recommenders, reinforcement-learning").

In [4]:
def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering.
    Vectorized for speed. Uses tensorflow operations to be compatible with custom training loop.
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y)*R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J

Since deriving the gradients of the cost function by hand is tedious, we use the automatic differentiation ("auto diff") tools from TensorFlow.
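
As a minimal illustration of how automatic differentiation works in TensorFlow, the sketch below differentiates a toy cost with `tf.GradientTape`. The variable `w` and the cost `J` are made up for this example and are unrelated to the recommender cost.

```python
import tensorflow as tf

w = tf.Variable(3.0)

# Operations on tf.Variable objects inside the tape are recorded for differentiation.
with tf.GradientTape() as tape:
    J = w ** 2 + 2.0 * w      # toy cost, chosen only for illustration

grad = tape.gradient(J, w)    # dJ/dw = 2w + 2
print(grad.numpy())           # 8.0
```

The same pattern applies to the collaborative filtering cost: compute the cost inside the tape, then ask the tape for the gradients with respect to the variables being learned.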

In [5]:
def mean_normalization(X):
    # Row-wise mean over all columns (unrated entries stored as 0 are included)
    mu = X.mean(axis=1).reshape(-1, 1)
    X = X - mu
    return X, mu
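
The `mean_normalization` function above averages over every entry of a row, so unrated entries stored as 0 pull the mean down. A possible variant, shown here as an assumption rather than the notebook's method, computes each movie's mean over rated entries only by using the mask `R`. The name `mean_normalization_rated` and the demo matrices are illustrative.

```python
import numpy as np

def mean_normalization_rated(Y, R):
    """Subtract each movie's mean rating, computed over rated entries only."""
    counts = R.sum(axis=1, keepdims=True)        # number of ratings per movie
    sums = (Y * R).sum(axis=1, keepdims=True)    # sum of the actual ratings per movie
    mu = sums / np.maximum(counts, 1)            # avoid dividing by zero for unrated movies
    Y_norm = np.where(R == 1, Y - mu, 0.0)       # leave unrated entries at 0
    return Y_norm, mu

Y_demo = np.array([[5.0, 0.0, 3.0],
                   [0.0, 4.0, 0.0]])
R_demo = np.array([[1, 0, 1],
                   [0, 1, 0]])

Y_norm, mu = mean_normalization_rated(Y_demo, R_demo)
print(mu.ravel())   # [4. 4.]
print(Y_norm)       # rated entries are centered, unrated entries stay 0
```

With this variant, a user who has rated nothing is predicted to give each movie roughly its average observed rating once `mu` is added back.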

In [121]:
def gradient_descent(n, Y, R, alpha=0.1, iterations=200, lambda_=0):
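    rows, columns = Y.shape  # number of movies, number of users
    # Randomly initialize movie features X, user weights W and user biases b
    X = tf.Variable(tf.random.normal((rows, n), dtype=tf.dtypes.float64), name='X')
    W = tf.Variable(tf.random.normal((columns, n), dtype=tf.dtypes.float64), name='W')
    b = tf.Variable(tf.random.normal((1, columns), dtype=tf.dtypes.float64), name='b')
    # Adam optimizer with learning rate alpha
    optimizer = tf.keras.optimizers.Adam(alpha)
    for i in range(iterations):
        # Record the cost computation so TensorFlow can differentiate it
        with tf.GradientTape() as tape:
            cost_val = cofi_cost_func_v(X, W, b, Y, R, lambda_)

        # Gradients of the cost with respect to X, W and b, followed by one update step
        grads = tape.gradient(cost_val, [X, W, b])
        optimizer.apply_gradients(zip(grads, [X, W, b]))

        print(f'Iteration {i + 1} cost:{cost_val}')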
    
    return X, W, b

## Example

In [145]:
nu = 6
nm = 5
n = 1

# X = np.array([[0.9, 0],
#               [0.1, 0.01],
#               [0.99, 0],
#               [0.1, 1.0],
#               [0, 0.9]])

X = np.zeros(nm * n).reshape(nm, n)

Y = np.array([[5, 5, 0, 0, 4, 0],
              [5, 0, 0, 0, 3, 0],
              [0, 4, 0, 0, 2, 0],
              [0, 0, 5, 4, 4, 0],
              [0, 0, 5, 0, 1, 0],])

R = np.array([[1, 1, 1, 1, 1, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 1, 1, 1, 1, 0],
              [1, 1, 1, 0, 1, 0],])

b = np.zeros(nu).reshape(1, nu)

W = np.zeros(nu * n).reshape(nu, n)

In [146]:
Y, mu = mean_normalization(Y)
X, W, b = gradient_descent(1, Y, R)

Iteration 1 cost:49.09495031034808
Iteration 2 cost:45.15162116305772
Iteration 3 cost:41.39408413041775
Iteration 4 cost:37.779166640840515
Iteration 5 cost:34.23885108328891
Iteration 6 cost:30.71965530968314
Iteration 7 cost:27.217707913016852
Iteration 8 cost:23.78215830470254
Iteration 9 cost:20.462718245745833
Iteration 10 cost:17.30776128110569
Iteration 11 cost:14.422628350411465
Iteration 12 cost:11.954614179742267
Iteration 13 cost:10.043465233590112
Iteration 14 cost:8.76099566183483
Iteration 15 cost:8.043750505402611
Iteration 16 cost:7.684702583964462
Iteration 17 cost:7.438247094018042
Iteration 18 cost:7.133378113317575
Iteration 19 cost:6.703201306114898
Iteration 20 cost:6.156599224980801
Iteration 21 cost:5.538161684477556
Iteration 22 cost:4.9044808445010375
Iteration 23 cost:4.319345780730934
Iteration 24 cost:3.8472537511207756
Iteration 25 cost:3.536586016933657
Iteration 26 cost:3.403036602252267
Iteration 27 cost:3.4244391925484576
Iteration 28 cost:3.550323115

In [151]:
user = 4
movie = 2
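# Assumption: X1, W1, b1 are NumPy copies of the learned X, W, b (they are not defined in the cells shown)
X1, W1, b1 = X.numpy(), W.numpy(), b.numpy()
cf_func(X1[movie], W1[user], b1[0, user]) + mu[movie]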

array([1.0683186])