## Recommender Systems

Notation:
Eg: Predicting movie ratings
- $n_u = \text{no. of users}$
- $n_m = \text{no. of movies}$
- $r(i,j) = \text{1 if user $j$ has rated movie $i$ (0 otherwise)}$
- $y^{(i,j)} = \text{rating given by user $j$ to movie $i$ (defined only if $r(i,j) = 1$)}$
- $n = \text{no. of features}$
- $w^{(j)}, b^{(j)} = \text{parameters for user $j$}$
- $x^{(i)} = \text{feature vector for movie $i$}$


## 1. Collaborative Filtering

- If each movie comes with known features $x^{(i)}$ (with $n$ features per movie), each user's ratings can be predicted from them:

<img  src="./images/Wk9_1.png"  style=" width:70%; padding: 10px 20px ; ">    

### 1A. Cost function

For user $j$ and movie $i$, predict rating: $w^{(j)} \centerdot x^{(i)} + b^{(j)}$  
$m^{(j)} = \text{no. of movies rated by user $j$}$  


To learn $w^{(j)}, b^{(j)}$, minimise the regularised squared error over the movies that user $j$ has rated:

<img  src="./images/Wk9_2.png"  style=" width:70%; padding: 10px 20px ; ">    
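
A minimal NumPy sketch of the per-user cost, assuming the usual regularised squared-error form (sum over rated movies, divided by $2m^{(j)}$, with the bias left unregularised). The function name `user_cost` and the toy data are made up for illustration.

```python
import numpy as np

def user_cost(w, b, X, y, r, lambda_):
    """Regularised squared-error cost for a single user j.

    w: (n,) weights, b: scalar bias, X: (n_m, n) movie features,
    y: (n_m,) ratings, r: (n_m,) with 1 where the user rated the movie.
    """
    rated = r == 1
    m_j = rated.sum()                    # number of movies this user rated
    err = X[rated] @ w + b - y[rated]    # prediction error on rated movies only
    return (err @ err + lambda_ * (w @ w)) / (2 * m_j)

# Toy data (assumed): 3 movies, 2 features, the second movie is unrated
X = np.array([[0.9, 0.0], [1.0, 0.1], [0.1, 1.0]])
y = np.array([5.0, 0.0, 0.0])            # y[1] is only a placeholder
r = np.array([1, 0, 1])
w = np.array([5.0, 0.0])
b = 0.0

cost = user_cost(w, b, X, y, r, lambda_=0.0)
print(cost)
```

Only the rows with $r(i,j) = 1$ enter the sum, so the placeholder value stored for an unrated movie has no effect on the cost.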


### 1B. Collaborative Filtering Algorithm

- What if the features $x_1, x_2$ are unknown, but the user parameters $w^{(j)}, b^{(j)}$ are given?
<img  src="./images/Wk9_3.png"  style=" width:70%; padding: 10px 20px ; ">    

Thus, we learn the feature vector $x^{(i)}$ of each movie by minimising the cost function as a function of $x^{(i)}$.

$$\text{Given }w^{(1)},b^{(1)},w^{(2)},b^{(2)},\dots,w^{(n_u)},b^{(n_u)}$$
<img  src="./images/Wk9_4.png"  style=" width:70%; padding: 10px 20px ; ">    

Combining both cost functions, we get
<img  src="./images/Wk9_5.png"  style=" width:70%; padding: 10px 20px ; ">    

We then run gradient descent:
<img  src="./images/Wk9_6.png"  style=" width:70%; padding: 10px 20px ; ">    
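
The update can be sketched in plain NumPy with hand-written gradients. This assumes the combined cost is half the squared error over rated pairs plus $\frac{\lambda}{2}$ times the squared entries of $W$ and $X$ (the same form as `cofiCostFuncV` later in these notes); `cofi_cost`, `cofi_step` and the random toy ratings are illustrative names and data, not part of the course material.

```python
import numpy as np

def cofi_cost(X, W, b, Y, R, lambda_):
    """Combined cost over all rated (movie, user) pairs."""
    E = (X @ W.T + b - Y) * R                  # errors, zero where unrated
    return 0.5 * np.sum(E ** 2) + 0.5 * lambda_ * (np.sum(X ** 2) + np.sum(W ** 2))

def cofi_step(X, W, b, Y, R, lambda_, alpha):
    """One simultaneous gradient descent update of X, W and b."""
    E = (X @ W.T + b - Y) * R
    grad_X = E @ W + lambda_ * X               # dJ/dX, shape (n_m, n)
    grad_W = E.T @ X + lambda_ * W             # dJ/dW, shape (n_u, n)
    grad_b = E.sum(axis=0, keepdims=True)      # dJ/db, shape (1, n_u)
    return X - alpha * grad_X, W - alpha * grad_W, b - alpha * grad_b

# Toy problem with made-up ratings: 5 movies, 4 users, 3 features
rng = np.random.default_rng(0)
n_m, n_u, n = 5, 4, 3
Y = rng.integers(0, 6, size=(n_m, n_u)).astype(float)
R = (rng.random((n_m, n_u)) < 0.7).astype(float)
X = rng.normal(size=(n_m, n))
W = rng.normal(size=(n_u, n))
b = np.zeros((1, n_u))

cost_before = cofi_cost(X, W, b, Y, R, lambda_=0.1)
for _ in range(200):
    X, W, b = cofi_step(X, W, b, Y, R, lambda_=0.1, alpha=0.01)
cost_after = cofi_cost(X, W, b, Y, R, lambda_=0.1)
print(cost_before, cost_after)
```

All three gradients are computed from the same error matrix before any parameter changes, which is what makes the update simultaneous.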


**Summary**: The method is called collaborative filtering because the ratings of many users collaborate: together they determine each user's parameters $w^{(j)}, b^{(j)}$ as well as the feature vector $x^{(i)}$ that characterises each movie.

### 1C. Collaborative filtering for binary labels: favs/likes/clicks

Binary labels used (liked/disliked)
- 1 - Positive/ Engaged
- 0 - Negative/ Did not engage
- ? - Not interacted / Not yet shown

Predict probability of $y^{(i,j)} = 1$ as $g(w^{(j)} \centerdot x^{(i)} + b^{(j)})$  
where $g(z) = \frac{1}{1 + e^{-z}}$

**Note: this is the logistic regression model, applied to each user-item pair.**

#### Cost function for binary application

<img  src="./images/Wk9_7.png"  style=" width:70%; padding: 10px 20px ; ">    
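
A rough NumPy sketch of such a cost, assuming the standard binary cross-entropy loss summed over the pairs with $r(i,j) = 1$. Regularisation is left out to keep it short, and `binary_cofi_cost` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cofi_cost(X, W, b, Y, R, eps=1e-12):
    """Binary cross-entropy over the (item, user) pairs with R = 1."""
    P = sigmoid(X @ W.T + b)                   # predicted P(y = 1), shape (n_m, n_u)
    P = np.clip(P, eps, 1.0 - eps)             # avoid log(0)
    loss = -(Y * np.log(P) + (1.0 - Y) * np.log(1.0 - P))
    return np.sum(loss * R)                    # pairs with R = 0 contribute nothing

# Toy labels (assumed): 2 items, 2 users, one pair not yet shown
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
R = np.array([[1.0, 1.0], [0.0, 1.0]])
X = np.zeros((2, 3))
W = np.zeros((2, 3))
b = np.zeros((1, 2))

cost = binary_cofi_cost(X, W, b, Y, R)         # every prediction is 0.5 here
print(cost)
```

With all parameters at zero each prediction is 0.5, so each of the three observed pairs contributes $\ln 2$ to the cost.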

### 1D. Implementation Tips: Mean Normalisation
- Without mean normalisation, the predicted ratings for a new user (who has not rated any movie) would all be 0, which is not a reasonable default

<img  src="./images/Wk9_8.png"  style=" width:70%; padding: 10px 20px ; ">    

> **Normalisation sets a new user's default prediction to the mean rating of each movie.**
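
A small sketch of mean normalisation, assuming the mean of each movie is taken over the users who actually rated it. The helper name `normalize_ratings` and the toy matrices are made up here and are not the course's `normalizeRatings` helper.

```python
import numpy as np

def normalize_ratings(Y, R):
    """Subtract each movie's mean rating, computed over rated entries only."""
    counts = R.sum(axis=1, keepdims=True)
    Ymean = (Y * R).sum(axis=1, keepdims=True) / np.maximum(counts, 1)
    Ynorm = (Y - Ymean) * R                    # unrated entries stay at 0
    return Ynorm, Ymean

# Toy ratings (assumed): 3 movies, 4 users, the last user has rated nothing
Y = np.array([[5.0, 4.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 0.0],
              [1.0, 0.0, 5.0, 0.0]])
R = np.array([[1, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 0, 1, 0]])

Ynorm, Ymean = normalize_ratings(Y, R)

# For the new user, take w = 0 and b = 0; adding the mean back
# gives each movie's mean rating as the default prediction
new_user_prediction = 0.0 + Ymean.ravel()
print(new_user_prediction)
```

The model is trained on `Ynorm`, and `Ymean` is added back to every prediction afterwards.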

## 2. TensorFlow Implementation
- TensorFlow can compute gradients automatically (auto-diff), so we only need to write the cost function; gradient descent (or another optimiser) then uses those gradients to find the best parameters

In [1]:
import tensorflow as tf
from tensorflow import keras






### 2A. Custom Training Loop 


In [2]:
w = tf.Variable(3.0)
x = 1.0
y = 1.0
alpha = 0.01

iterations = 100

for iter in range(iterations):
    # Use TF's GradientTape to record the steps used to compute
    # the cost J, which enables automatic differentiation
    with tf.GradientTape() as tape:
        fwb = w*x
        costJ = (fwb - y) ** 2

    # Use the gradient tape to calculate the gradient
    # of the cost with respect to the parameter w
    [dJdw] = tape.gradient(costJ, [w])

    # Run one step of gradient descent by updating w to reduce the cost
    w.assign_add(-alpha * dJdw)
    
    
print(w)

<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.2652391>
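
As a sanity check on the value printed above, the same loop can be run without TensorFlow using the hand-derived gradient $\frac{dJ}{dw} = 2(wx - y)x$. With $x = y = 1$ each step multiplies $(w - 1)$ by $1 - 2\alpha = 0.98$, so after 100 steps $w = 1 + 2 \cdot 0.98^{100}$. This is only a sketch to confirm the arithmetic.

```python
# Same toy problem as the GradientTape loop, with the gradient written by hand
w, x, y, alpha = 3.0, 1.0, 1.0, 0.01

for _ in range(100):
    dJdw = 2 * (w * x - y) * x     # derivative of (w*x - y)**2 with respect to w
    w -= alpha * dJdw

closed_form = 1.0 + 2.0 * 0.98 ** 100
print(w, closed_form)
```

Both numbers agree with the TensorFlow result up to float32 rounding.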


#### Collaborative Filtering using AutoDiff
- The Adam optimisation algorithm can also be used instead of plain gradient descent

In [6]:
def cofiCostFuncV(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering.
    Vectorized for speed. Uses TensorFlow operations to be compatible with the custom training loop.
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    # Prediction error for every (movie, user) pair, zeroed where R = 0
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y)*R
    # reduce_sum sums up all the terms in the matrix
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J

'''
# Set Initial Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(1234) # for consistent results
W = tf.Variable(tf.random.normal((num_users,  num_features),dtype=tf.float64),  name='W')
X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float64),  name='X')
b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float64),  name='b')

# Reload ratings and add new ratings
Y, R = load_ratings_small()
# Prepend my_ratings (and its 0/1 "rated" indicator) as the first column
Y    = np.c_[my_ratings, Y]
R    = np.c_[(my_ratings != 0).astype(int), R]

# Normalize the dataset (subtract each movie's mean rating)
Ynorm, Ymean = normalizeRatings(Y, R)
my_ratings = np.zeros(num_movies)   



# Instantiate an optimizer
optimizer = keras.optimizers.Adam(learning_rate = 1e-1)
iterations = 200
lambda_ = 1

for iter in range(iterations):
    #Gradient Tape
    with tf.GradientTape() as tape:
        
        #Compute cost (forward pass included in cost)
        cost_value = cofiCostFuncV(X, W, b, Ynorm, R, lambda_)
        
    grads = tape.gradient(cost_value, [X,W,b])
    
    optimizer.apply_gradients(zip(grads, [X, W, b]))
    
'''



#### Why can't we use a regular neural network? (Sequential, Compile, Fit)
A: The collaborative filtering algorithm and its cost function don't fit neatly into the standard neural network layer types (e.g. Dense), so a custom training loop is used instead.

### 2B. Finding related items
Eg: How to find similar movies using the collaborative filtering?

The learned features $x^{(i)}$ of item $i$ are quite hard to interpret.  
To find other items related to it, find the items $k$ whose $x^{(k)}$ is similar to $x^{(i)}$, i.e. has the smallest distance to it:
<img  src="./images/Wk9_9.png"  style=" width:70%; padding: 10px 20px ; ">    
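
A minimal sketch of this lookup, assuming similarity is measured by the squared Euclidean distance between feature vectors. `most_similar_items` and the toy feature matrix are illustrative only.

```python
import numpy as np

def most_similar_items(X, i, k=3):
    """Indices of the k items whose features are closest to item i."""
    dist = np.sum((X - X[i]) ** 2, axis=1)     # squared distance to every item
    dist[i] = np.inf                           # exclude the item itself
    return np.argsort(dist)[:k]

# Toy learned features (assumed): 4 items, 2 features
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.1, 0.8]])

nearest = most_similar_items(X, 0, k=2)
print(nearest)
```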

### 2C. Limitations of Collaborative Filtering
* Cold start problem. How to:
    * Rank new items that few users rated?
    * Show reasonable items to users who have rated very few items?
    
* Use side info about items/ users:
    * Item: Genre, movie stars, studio
    * User: Demographics (age, gender, location), expressed preferences, ...

## 3. Content-based Filtering

### 3A. Collaborative filtering vs Content-based filtering

* Collaborative Filtering:
    * Recommend items to you based on **ratings** of users who gave similar ratings as you

* Content-based Filtering:
    * Recommend items to you based on **features** of the user and the item to find a good match

- User features: $x_u^{(j)}$ for user $j$
    * Age, Gender, Movies Watched, Country, Average rating per genre
- Movie features: $x_m^{(i)}$ for movie $i$
    * Year, Genre, Reviews, ...
    
**Use one-hot encoding for features when necessary**
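
As a small illustration of one-hot encoding, the sketch below builds a movie feature vector from a release year and a genre. The genre list, the `one_hot` helper and the example values are made up for this example.

```python
import numpy as np

GENRES = ["action", "comedy", "drama"]         # assumed category list

def one_hot(value, categories):
    """Vector of zeros with a single 1 at the position of `value`."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

# Movie feature vector: [year, one-hot genre...]
x_m = np.concatenate(([2004.0], one_hot("comedy", GENRES)))
print(x_m)
```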

<img  src="./images/Wk9_10.png"  style=" width:70%; padding: 10px 20px ; ">    

**Note: $x_u^{(j)}$ and $x_m^{(i)}$ may differ in size, but $v_u^{(j)}$ and $v_m^{(i)}$ must be the same size so that their dot product is defined**

### 3B. Deep-learning for content-based filtering using Neural Network
#### Neural Network Architecture


Convert $x_u$ to vector $v_u$ and $x_m$ to vector $v_m$

**User Network**: $x_u$ to vector $v_u$
- Sequential (Dense layer with 128 units, ...) > $v_u$ (32 units)

**Movie Network**: $x_m$ to vector $v_m$
- Sequential (Dense layer with 256 units, ...) > $v_m$ (32 units, **same as above**)

**Prediction: $g(v_u \centerdot v_m)$ predicts the probability that $y^{(i,j)} = 1$**

<img  src="./images/Wk9_11.png"  style=" width:70%; padding: 10px 20px ; "> 

With user and item vectors ($v_u^{(j)}, v_m^{(i)}$),

- To find movies similar to movie $i$:
    * Find movies $k$ for which $\lVert v_m^{(k)} - v_m^{(i)} \rVert^2$ is small
    * **Note: This can be pre-computed ahead of time (while user is inactive)**

This architecture is an example of combining two neural networks to form a more complex architecture.

### 3C. Recommending from a large catalogue

- Running the model on every item in a large catalogue for every user might be computationally expensive or infeasible
- We need to find recommendations efficiently from a large set of items
- This involves two main steps: **Retrieval & Ranking**

1. Retrieval
    * Generate large list of plausible item candidates
    * Eg: Similar movies watched, top movies for most viewed genre, country's top movies
    * Combine items in list and filter duplicates/watched
    * Trade-off: retrieving more items gives **better performance**, but **slower recommendations**
    * Optimise the trade-off by running offline experiments to see whether retrieving more items results in more relevant recommendations (i.e. $p(y^{(i,j)} = 1)$ of the displayed items is higher)

2. Ranking
    * Take list retrieved and rank using learned model
    * Display ranked items to user (based on predicted user rating)
    * $v_m$ can be computed beforehand, so only $v_u$ needs to be computed when recommending
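
The two steps can be sketched as follows, assuming the item vectors $v_m$ are already pre-computed and that ranking by $v_u \centerdot v_m$ is acceptable (the sigmoid $g$ is monotonic, so the order is the same). The `recommend` function, the candidate lists and the vectors are all hypothetical.

```python
import numpy as np

def recommend(v_u, V_m, candidate_lists, watched, top_k=3):
    """Retrieval then ranking over pre-computed item vectors V_m."""
    # Retrieval: merge the candidate lists, drop duplicates and watched items
    candidates = sorted(set().union(*candidate_lists) - set(watched))
    # Ranking: score each remaining candidate with the learned model
    scores = V_m[candidates] @ v_u
    order = np.argsort(-scores)                # highest score first
    return [candidates[t] for t in order[:top_k]]

# Toy vectors (assumed): 6 items, 2-dimensional embeddings
V_m = np.array([[0.90, 0.10],
                [0.20, 0.80],
                [0.70, 0.30],
                [0.10, 0.90],
                [0.50, 0.50],
                [0.95, 0.05]])
v_u = np.array([1.0, 0.0])

candidate_lists = [[0, 2, 4],                  # e.g. similar to movies watched
                   [2, 5, 1]]                  # e.g. top movies in favourite genre
recs = recommend(v_u, V_m, candidate_lists, watched=[0], top_k=3)
print(recs)
```

Only the retrieved candidates are scored, which is what keeps the ranking step cheap even when the catalogue is large.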

### 3D. TensorFlow Implementation

In [None]:
from tensorflow.keras import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input, Dot
from tensorflow.keras.losses import MeanSquaredError

user_NN = Sequential([
    Dense(256, activation = 'relu'),
    Dense(128, activation = 'relu'),
    Dense(32),
])

item_NN = Sequential([
    Dense(256, activation = 'relu'),
    Dense(128, activation = 'relu'),
    #outputs 32 numbers
    Dense(32),
])
num_user_features, num_item_features = 50,10

#create user input and point to base network
#(shape must be a tuple, hence the trailing comma)
input_user = Input(shape = (num_user_features,))
vu = user_NN(input_user)
#normalise vector for algorithm to work better
vu = tf.linalg.l2_normalize(vu, axis = 1)

#create item input and point to base network
input_item = Input(shape = (num_item_features,))
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis = 1)

#measure similarity of the two vector outputs
output = Dot(axes = 1)([vu, vm])

#specify inputs, outputs of the model
model = Model([input_user, input_item], output)

#Specify cost function
cost_fn = MeanSquaredError()

