# <img align="left" src="./images/movie_camera.png"     style=" width:40px;  " > Practice lab: Collaborative Filtering Recommender Systems

In this exercise, you will implement collaborative filtering to build a recommender system for movies. 

# <img align="left" src="./images/film_reel.png"     style=" width:40px;  " > Outline
- [ 1 - Notation](#1)
- [ 2 - Recommender Systems](#2)
- [ 3 - Movie ratings dataset](#3)
- [ 4 - Collaborative filtering learning algorithm](#4)
  - [ 4.1 Collaborative filtering cost function](#4.1)
    - [ Exercise 1](#ex01)
- [ 5 - Learning movie recommendations](#5)
- [ 6 - Recommendations](#6)
- [ 7 - Congratulations!](#7)




_**NOTE:** To prevent errors from the autograder, you are not allowed to edit or delete non-graded cells in this lab. Please also refrain from adding any new cells. 
**Once you have passed this assignment** and want to experiment with any of the non-graded code, you may follow the instructions at the bottom of this notebook._

##  Packages <img align="left" src="./images/film_strip_vertical.png"     style=" width:40px;   " >
We will use the now-familiar NumPy and TensorFlow packages.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from recsys_utils import *

<a name="1"></a>
## 1 - Notation


|General <br />  Notation  | Description                                                                | Python (if any) |
|:-------------------------|:---------------------------------------------------------------------------|-----------------|
| $r(i,j)$                 | scalar; = 1  if user j rated movie i  = 0  otherwise                       |                 |
| $y(i,j)$                 | scalar; = rating given by user j on movie  i    (if r(i,j) = 1 is defined) |                 |
| $\mathbf{w}^{(j)}$       | vector; parameters for user j                                              |                 |
| $b^{(j)}$                | scalar; parameter for user j                                               |                 |
| $\mathbf{x}^{(i)}$       | vector; feature ratings for movie i                                        |                 |     
| $n_u$                    | number of users                                                            | num_users       |
| $n_m$                    | number of movies                                                           | num_movies      |
| $n$                      | number of features                                                         | num_features    |
| $\mathbf{X}$             | matrix of vectors $\mathbf{x}^{(i)}$                                       | X               |
| $\mathbf{W}$             | matrix of vectors $\mathbf{w}^{(j)}$                                       | W               |
| $\mathbf{b}$             | vector of bias parameters $b^{(j)}$                                        | b               |
| $\mathbf{R}$             | matrix of elements $r(i,j)$                                                | R               |

<a name="2"></a>
## 2 - Recommender Systems <img align="left" src="./images/film_rating.png" style=" width:40px;  " >
In this lab, you will implement the collaborative filtering learning algorithm and apply it to a dataset of movie ratings.
The goal of a collaborative filtering recommender system is to generate two vectors: For each user, a 'parameter vector' that embodies the movie tastes of a user. For each movie, a 'feature vector' of the same size which embodies some description of the movie. The dot product of the two vectors plus the bias term should produce an estimate ($\hat{y}$) of the rating the user might give to that movie.

The diagram below details how these vectors are learned.

<figure>
   <img src="./images/ColabFilterLearn.PNG"  style="width:740px;height:250px;" >
</figure>

Existing ratings are provided in matrix form as shown. $Y$ contains ratings; 0.5 to 5 inclusive in 0.5 steps. 0 if the movie has not been rated. $R$ has a 1 where movies have been rated. Movies are in rows, users in columns. Each user has a parameter vector $w^{user}$ and bias $b^{user}$. Each movie has a feature vector $x^{movie}$. These vectors are simultaneously learned by using the existing user/movie ratings as training data. One training example is shown above: $\mathbf{w}^{(1)} \cdot \mathbf{x}^{(1)} + b^{(1)} = 4$. It is worth noting that the feature vector $x^{movie}$ must satisfy all the users while the user vector $w^{user}$ must satisfy all the movies. This is the source of the name of this approach - all the users collaborate to generate the rating set. 

<figure>
   <img src="./images/ColabFilterUse.PNG"  style="width:640px;height:250px;" >
</figure>

Once the feature vectors and parameters are learned, they can be used to predict how a user might rate an unrated movie. This is shown in the diagram above. The equation is an example of predicting a rating for user one on movie zero.
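As a minimal sketch of that prediction, assume made-up values for one user's parameters and bias and one movie's features. None of these numbers come from the dataset, and the names `w_user`, `b_user`, and `x_movie` are illustrative only.

```python
import numpy as np

# Hypothetical learned values for one user and one movie (illustrative only).
w_user = np.array([0.5, -1.0, 2.0])   # user parameter vector
b_user = 0.3                          # user bias
x_movie = np.array([1.0, 0.5, 1.5])   # movie feature vector

# Predicted rating: dot product of the two vectors plus the bias.
y_hat = np.dot(w_user, x_movie) + b_user
print(f"Predicted rating: {y_hat:0.2f}")   # Predicted rating: 3.30
```

The user and movie vectors must have the same length, because the dot product pairs each user parameter with the corresponding movie feature.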


In this exercise, you will implement the function `cofi_cost_func` that computes the collaborative filtering
objective function. After implementing the objective function, you will use a TensorFlow custom training loop to learn the parameters for collaborative filtering. The first step is to detail the data set and data structures that will be used in the lab.

<a name="3"></a>
## 3 - Movie ratings dataset <img align="left" src="./images/film_rating.png"     style=" width:40px;  " >
The data set is derived from the [MovieLens "ml-latest-small"](https://grouplens.org/datasets/movielens/latest/) dataset.   
[F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>]

The original dataset has  9000 movies rated by 600 users. The dataset has been reduced in size to focus on movies from the years since 2000. This dataset consists of ratings on a scale of 0.5 to 5 in 0.5 step increments. The reduced dataset has $n_u = 443$ users, and $n_m= 4778$ movies. 

Below, you will load the movie dataset into the variables $Y$ and $R$.

The matrix $Y$ (an $n_m \times n_u$ matrix) stores the ratings $y^{(i,j)}$. The matrix $R$ is a binary-valued indicator matrix, where $R(i,j) = 1$ if user $j$ gave a rating to movie $i$, and $R(i,j)=0$ otherwise. 
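The relationship between $Y$ and $R$ can be sketched on a toy example. The values below are invented for illustration. Because valid ratings start at 0.5, a 0 in $Y$ can safely be read as "not rated".

```python
import numpy as np

# Toy ratings: 3 movies (rows) x 4 users (columns); 0 means "not rated".
Y_toy = np.array([[5.0, 0.0, 4.0, 0.0],
                  [0.0, 3.5, 0.0, 0.0],
                  [1.0, 0.0, 0.0, 2.5]])

# Indicator matrix: 1 where a rating exists, 0 otherwise.
R_toy = (Y_toy != 0).astype(int)

print(R_toy)
print("ratings per movie:", R_toy.sum(axis=1))   # [2 1 2]
print("ratings per user: ", R_toy.sum(axis=0))   # [2 1 1 1]
```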

Throughout this part of the exercise, you will also be working with the
matrices, $\mathbf{X}$, $\mathbf{W}$ and $\mathbf{b}$: 

$$\mathbf{X} = 
\begin{bmatrix}
--- (\mathbf{x}^{(0)})^T --- \\
--- (\mathbf{x}^{(1)})^T --- \\
\vdots \\
--- (\mathbf{x}^{(n_m-1)})^T --- \\
\end{bmatrix} , \quad
\mathbf{W} = 
\begin{bmatrix}
--- (\mathbf{w}^{(0)})^T --- \\
--- (\mathbf{w}^{(1)})^T --- \\
\vdots \\
--- (\mathbf{w}^{(n_u-1)})^T --- \\
\end{bmatrix},\quad
\mathbf{ b} = 
\begin{bmatrix}
 b^{(0)}  \\
 b^{(1)} \\
\vdots \\
b^{(n_u-1)} \\
\end{bmatrix}\quad
$$ 

The $i$-th row of $\mathbf{X}$ corresponds to the
feature vector $x^{(i)}$ for the $i$-th movie, and the $j$-th row of
$\mathbf{W}$ corresponds to one parameter vector $\mathbf{w}^{(j)}$, for the
$j$-th user. Both $x^{(i)}$ and $\mathbf{w}^{(j)}$ are $n$-dimensional
vectors. For the purposes of this exercise, you will use $n=10$, and
therefore, $\mathbf{x}^{(i)}$ and $\mathbf{w}^{(j)}$ have 10 elements.
Correspondingly, $\mathbf{X}$ is a
$n_m \times 10$ matrix and $\mathbf{W}$ is a $n_u \times 10$ matrix.
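To see how these shapes fit together, here is a small sketch with toy sizes and random values (assumptions for illustration, not the lab's data). It forms every prediction at once as $\mathbf{X}\mathbf{W}^T + \mathbf{b}$ and checks one entry against the per-pair formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n_m, n_u, n = 6, 4, 3               # toy sizes, not the lab's 4778 / 443 / 10

X_toy = rng.normal(size=(n_m, n))   # one row per movie
W_toy = rng.normal(size=(n_u, n))   # one row per user
b_toy = rng.normal(size=(1, n_u))   # one bias per user

# (n_m, n) @ (n, n_u) -> (n_m, n_u); b broadcasts across the movie rows.
Y_hat = X_toy @ W_toy.T + b_toy
print(Y_hat.shape)                  # (6, 4)

# Entry (i, j) matches the per-pair formula w^(j) . x^(i) + b^(j).
i, j = 2, 1
assert np.isclose(Y_hat[i, j], W_toy[j] @ X_toy[i] + b_toy[0, j])
```

Storing $\mathbf{b}$ with shape `(1, num_users)` is what lets NumPy broadcast one bias per user down every movie row.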

We will start by loading the movie ratings dataset to understand the structure of the data.
We will load $Y$ and $R$ with the movie dataset.  
We'll also load $\mathbf{X}$, $\mathbf{W}$, and $\mathbf{b}$ with pre-computed values. These values will be learned later in the lab, but we'll use pre-computed values to develop the cost model.

In [2]:
#Load data
X, W, b, num_movies, num_features, num_users = load_precalc_params_small()
Y, R = load_ratings_small()

print("Y", Y.shape, "R", R.shape)
print("X", X.shape)
print("W", W.shape)
print("b", b.shape)
print("num_features", num_features)
print("num_movies",   num_movies)
print("num_users",    num_users)

Y (4778, 443) R (4778, 443)
X (4778, 10)
W (443, 10)
b (1, 443)
num_features 10
num_movies 4778
num_users 443



In [None]:
#  From the matrix, we can compute statistics like average rating.
tsmean =  np.mean(Y[0, R[0, :].astype(bool)])
'''
1. Context (Collaborative filtering / Movie ratings example)
    Y is usually a ratings matrix with shape (num_movies, num_users).
    Y[i, j] = rating of movie i by user j (if available).
    R is a binary indicator matrix (same shape as Y).
    R[i, j] = 1 if user j rated movie i,
    R[i, j] = 0 if the rating is missing.
    So together:
    Y holds the actual scores (0.5 to 5 in 0.5 steps).
    R tells us which entries are valid.

2. The indexing explained
    R[0, :] → gives the row of the first movie (all users’ indicators for movie 1).
    .astype(bool) → converts 0/1 into False/True, so we can use it as a mask.
    Example:
    R[0, :] = [1, 0, 1, 0]  →  R[0, :].astype(bool) = [True, False, True, False]
    Y[0, R[0, :].astype(bool)] → selects only the ratings of movie 1 from users who actually rated it.
    For example, if Y[0, :] = [5, 0, 4, 0], this would pick [5, 4].

3. The mean
    tsmean = np.mean(Y[0, R[0, :].astype(bool)])
    This computes the average rating for movie 1, using only the entries where R[0, :] = 1.

4. Print statement
    print(f"Average rating for movie 1 : {tsmean:0.3f} / 5")
    :0.3f formats the number with 3 decimal places.
    So if the average is 4.1666..., it prints 4.167.
'''
print(f"Average rating for movie 1 : {tsmean:0.3f} / 5" )

Average rating for movie 1 : 3.400 / 5


In [10]:
tsmean_matrix =  np.mean(Y[R.astype(bool)])
tsmean_matrix

np.float64(3.471314294448832)

In [None]:
# Sum of ratings per movie (only where rated)
ratings_sum = np.sum(Y * R, axis=1)
'''
Y * R
This uses elementwise multiplication (Hadamard product).
It keeps ratings where R=1, and zeroes them out where R=0.

✅ Example:
Y row (movie 1): [5, 0, 4, 0]
R row (movie 1): [1, 0, 1, 0]
Y * R          : [5, 0, 4, 0]
'''

# Number of ratings per movie
ratings_count = np.sum(R, axis=1)

# Average rating per movie (avoid division by zero)
average_ratings = ratings_sum / np.maximum(ratings_count, 1)
for i, avg in enumerate(average_ratings, start=1):
    print(f"Movie {i}: {avg:.2f} / 5")

Movie 1: 3.40 / 5
Movie 2: 3.25 / 5
Movie 3: 2.00 / 5
Movie 4: 2.00 / 5
Movie 5: 2.67 / 5
Movie 6: 4.22 / 5
Movie 7: 1.00 / 5
Movie 8: 3.06 / 5
Movie 9: 2.30 / 5
Movie 10: 3.17 / 5
Movie 11: 3.71 / 5
Movie 12: 2.67 / 5
Movie 13: 3.56 / 5
Movie 14: 3.52 / 5
Movie 15: 5.00 / 5
Movie 16: 2.46 / 5
Movie 17: 3.66 / 5
Movie 18: 4.00 / 5
Movie 19: 1.33 / 5
Movie 20: 2.17 / 5
Movie 21: 3.17 / 5
Movie 22: 4.50 / 5
Movie 23: 2.69 / 5
Movie 24: 3.53 / 5
Movie 25: 2.93 / 5
Movie 26: 3.08 / 5
Movie 27: 3.00 / 5
Movie 28: 2.25 / 5
Movie 29: 3.67 / 5
Movie 30: 3.12 / 5
Movie 31: 2.75 / 5
Movie 32: 3.42 / 5
Movie 33: 2.25 / 5
Movie 34: 3.56 / 5
Movie 35: 3.07 / 5
Movie 36: 3.00 / 5
Movie 37: 3.61 / 5
Movie 38: 3.30 / 5
Movie 39: 3.79 / 5
Movie 40: 3.45 / 5
Movie 41: 3.33 / 5
Movie 42: 3.25 / 5
Movie 43: 2.40 / 5
Movie 44: 3.00 / 5
Movie 45: 3.42 / 5
Movie 46: 2.50 / 5
Movie 47: 1.62 / 5
Movie 48: 3.50 / 5
Movie 49: 2.88 / 5
Movie 50: 5.00 / 5
Movie 51: 3.00 / 5
Movie 52: 3.94 / 5
Movie 53: 3.50 / 5
...

<a name="4"></a>
## 4 - Collaborative filtering learning algorithm <img align="left" src="./images/film_filter.png"     style=" width:40px;  " >

Now, you will begin implementing the collaborative filtering learning
algorithm. You will start by implementing the objective function. 

The collaborative filtering algorithm in the setting of movie
recommendations considers a set of $n$-dimensional parameter vectors
$\mathbf{x}^{(0)},...,\mathbf{x}^{(n_m-1)}$, $\mathbf{w}^{(0)},...,\mathbf{w}^{(n_u-1)}$ and $b^{(0)},...,b^{(n_u-1)}$, where the
model predicts the rating for movie $i$ by user $j$ as
$y^{(i,j)} = \mathbf{w}^{(j)}\cdot \mathbf{x}^{(i)} + b^{(j)}$ . Given a dataset that consists of
a set of ratings produced by some users on some movies, you wish to
learn the parameter vectors $\mathbf{x}^{(0)},...,\mathbf{x}^{(n_m-1)},
\mathbf{w}^{(0)},...,\mathbf{w}^{(n_u-1)}$  and $b^{(0)},...,b^{(n_u-1)}$ that produce the best fit (that is, minimize
the squared error).

You will complete the code in `cofi_cost_func` to compute the cost
function for collaborative filtering. 


<a name="4.1"></a>
### 4.1 Collaborative filtering cost function

The collaborative filtering cost function is given by
$$J({\mathbf{x}^{(0)},...,\mathbf{x}^{(n_m-1)},\mathbf{w}^{(0)},b^{(0)},...,\mathbf{w}^{(n_u-1)},b^{(n_u-1)}})= \left[ \frac{1}{2}\sum_{(i,j):r(i,j)=1}(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 \right]
+ \underbrace{\left[
\frac{\lambda}{2}
\sum_{j=0}^{n_u-1}\sum_{k=0}^{n-1}(\mathbf{w}^{(j)}_k)^2
+ \frac{\lambda}{2}\sum_{i=0}^{n_m-1}\sum_{k=0}^{n-1}(\mathbf{x}_k^{(i)})^2
\right]}_{regularization}
\tag{1}$$
The first summation in (1) is "for all $i$, $j$ where $r(i,j)$ equals $1$" and could be written:

$$
= \left[ \frac{1}{2}\sum_{j=0}^{n_u-1} \sum_{i=0}^{n_m-1}r(i,j)*(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)})^2 \right]
+\text{regularization}
$$
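The equivalence of the two summation forms can be checked on a tiny example. This is a sketch with invented values chosen for easy arithmetic and with regularization left out. Each of the two rated pairs has an error of $-2$, so the cost is $\frac{1}{2}(4 + 4) = 4$.

```python
import numpy as np

# Toy problem: 2 movies x 2 users, 1 feature (illustrative values only).
X_toy = np.array([[1.0], [2.0]])          # movie features
W_toy = np.array([[1.0], [0.5]])          # user parameters
b_toy = np.array([[0.0, 1.0]])            # user biases
Y_toy = np.array([[3.0, 0.0],
                  [0.0, 4.0]])
R_toy = np.array([[1, 0],
                  [0, 1]])

# Form 1: sum only over the (i, j) pairs where r(i,j) = 1.
rated_pairs = np.argwhere(R_toy == 1)
J_pairs = 0.5 * sum(
    (W_toy[j] @ X_toy[i] + b_toy[0, j] - Y_toy[i, j]) ** 2
    for i, j in rated_pairs
)

# Form 2: sum over every (i, j) and let r(i,j) zero out the unrated pairs.
J_masked = 0.0
for j in range(Y_toy.shape[1]):
    for i in range(Y_toy.shape[0]):
        err = W_toy[j] @ X_toy[i] + b_toy[0, j] - Y_toy[i, j]
        J_masked += R_toy[i, j] * err ** 2
J_masked *= 0.5

print(f"{J_pairs:0.2f} {J_masked:0.2f}")   # 4.00 4.00
```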

You should now write `cofi_cost_func` (the collaborative filtering cost function) to return this cost.

<a name="ex01"></a>
### Exercise 1

**For loop Implementation:**   
Start by implementing the cost function using for loops.
Consider developing the cost function in two steps. First, develop the cost function without regularization. A test case that does not include regularization is provided below to test your implementation. Once that is working, add regularization and run the tests that include regularization.  Note that you should be accumulating the cost for user $j$ and movie $i$ only if $R(i,j) = 1$.

In [35]:
# GRADED FUNCTION: cofi_cost_func
# UNQ_C1

def cofi_cost_func(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering (no regularization yet)
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter (unused in this first version)
    Returns:
      J (float) : Cost

    Shapes for the full dataset in this lab:
      Y (4778, 443), R (4778, 443), X (4778, 10), W (443, 10), b (1, 443)
    """
    nm, nu = Y.shape
    J = 0
    ### START CODE HERE ###  
    for i in range(nm):
        for j in range(nu):
            # Prediction for movie i by user j: w^(j) . x^(i) + b^(j)
            A = np.dot(W[j, :], X[i, :])
            # R[i, j] is 0 for unrated pairs, so they add nothing to the cost
            J += R[i, j] * (A + b[0, j] - Y[i, j]) ** 2
    J = J / 2
    ### END CODE HERE ### 

    return J

<details>
  <summary><font size="3" color="darkgreen"><b>Click for hints</b></font></summary>
    You can structure the code in two for loops similar to the summation in (1).   
    Implement the code without regularization first.   
    Note that some of the elements in (1) are vectors. Use np.dot(). You can also use np.square().
    Pay close attention to which elements are indexed by i and which are indexed by j. Don't forget to divide by two.
    
```python     
    ### START CODE HERE ###  
    for j in range(nu):
        
        
        for i in range(nm):
            
            
    ### END CODE HERE ### 
```    
<details>
    <summary><font size="2" color="darkblue"><b> Click for more hints</b></font></summary>
        
    Here are some more details. The code below pulls out each element from the matrix before using it. 
    One could also reference the matrix directly.  
    This code does not contain regularization.
    
```python 
    nm,nu = Y.shape
    J = 0
    ### START CODE HERE ###  
    for j in range(nu):
        w = W[j,:]
        b_j = b[0,j]
        for i in range(nm):
            x = 
            y = 
            r =
            J += 
    J = J/2
    ### END CODE HERE ### 

```
    
<details>
    <summary><font size="2" color="darkblue"><b>Last Resort (full non-regularized implementation)</b></font></summary>
    
```python 
    nm,nu = Y.shape
    J = 0
    ### START CODE HERE ###  
    for j in range(nu):
        w = W[j,:]
        b_j = b[0,j]
        for i in range(nm):
            x = X[i,:]
            y = Y[i,j]
            r = R[i,j]
            J += np.square(r * (np.dot(w,x) + b_j - y ) )
    J = J/2
    ### END CODE HERE ### 
```
    
<details>
    <summary><font size="2" color="darkblue"><b>regularization</b></font></summary>
     Regularization just squares each element of the W array and X array and then sums all the squared elements.
     You can utilize np.square() and np.sum().

<details>
    <summary><font size="2" color="darkblue"><b>regularization details</b></font></summary>
    
```python 
    J += (lambda_/2) * (np.sum(np.square(W)) + np.sum(np.square(X)))
```
    
</details>
</details>
</details>
</details>
</details>

    


In [36]:
# Reduce the data set size so that this runs faster
num_users_r = 4
num_movies_r = 5 
num_features_r = 3

X_r = X[:num_movies_r, :num_features_r]
W_r = W[:num_users_r,  :num_features_r]
b_r = b[0, :num_users_r].reshape(1,-1)
Y_r = Y[:num_movies_r, :num_users_r]
R_r = R[:num_movies_r, :num_users_r]

# Evaluate cost function
J = cofi_cost_func(X_r, W_r, b_r, Y_r, R_r, 0);
print(f"Cost: {J:0.2f}")

Cost: 13.67


**Expected Output (lambda = 0)**:  
$13.67$.

In [37]:
# GRADED FUNCTION: cofi_cost_func
# UNQ_C1

def cofi_cost_func(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering, including regularization
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    nm, nu = Y.shape
    J = 0
    ### START CODE HERE ###  
    # Squared-error term: only pairs with R[i, j] = 1 contribute
    for i in range(nm):
        for j in range(nu):
            A = np.dot(W[j, :], X[i, :])
            J += R[i, j] * (A + b[0, j] - Y[i, j]) ** 2
    J = J / 2
    # Regularization term: sum of squares of every entry of W and X (b is not regularized)
    n = X.shape[1]
    for k in range(n):
        for j in range(nu):
            J += (lambda_ / 2) * W[j, k] ** 2
        for i in range(nm):
            J += (lambda_ / 2) * X[i, k] ** 2
    ### END CODE HERE ### 

    return J

In [38]:
# Evaluate cost function with regularization 
J = cofi_cost_func(X_r, W_r, b_r, Y_r, R_r, 1.5);
print(f"Cost (with regularization): {J:0.2f}")

Cost (with regularization): 28.09


**Expected Output**:

28.09

In [39]:
# Public tests
from public_tests import *
test_cofi_cost_func(cofi_cost_func)

All tests passed!


**Vectorized Implementation**

It is important to create a vectorized implementation to compute $J$, since it will later be called many times during optimization. The linear algebra utilized is not the focus of this series, so the implementation is provided. If you are an expert in linear algebra, feel free to create your version without referencing the code below. 

Run the code below and verify that it produces the same results as the non-vectorized version.

In [None]:
def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering
    Vectorized for speed. Uses TensorFlow operations to be compatible with a custom training loop.
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost

    Shapes for the full dataset in this lab:
      Y (4778, 443), R (4778, 443), X (4778, 10), W (443, 10), b (1, 443)
    """
    '''
    X: Matrix of item (movie) features (num_movies, num_features).
    W: Matrix of user parameters (num_users, num_features).
    b: Bias term for each user (1, num_users).
    Y: Matrix of ratings (num_movies, num_users) — the actual ratings given by users to movies.
    R: Matrix indicating whether a user has rated a movie (num_movies, num_users). If a movie was rated, R(i, j) = 1; otherwise, 0.
    lambda_: Regularization parameter to avoid overfitting.
    '''
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y)*R
    '''
    tf.linalg.matmul(X, tf.transpose(W)):
      This is the matrix multiplication of X (movie features) with W.T (user parameters).
      It gives us the predicted ratings for each movie-user pair:
      X[i, :] (movie features) is multiplied with W[j, :] (user features).
      This produces the predicted rating for each movie by each user.
    + b:
      The bias term b is added to the predictions. This helps shift the predictions to better fit the actual ratings.
    - Y:
      This subtracts the actual ratings Y from the predictions, yielding the prediction error (difference between predicted and actual ratings).
    * R:
      This multiplies the error by the matrix R, which ensures that we only compute errors for rated movies.
      If a movie was not rated (i.e., R(i, j) = 0), the error is set to 0 for that movie-user pair, so it does not contribute to the cost.
      
      The operation Y * R: Elementwise multiplication of Y and R (Hadamard product).
      Y = [[5, 0, 4],
          [0, 3, 2]]

      R = [[1, 0, 1],
          [0, 1, 1]]
      Performing the elementwise multiplication (Y * R):
      Y * R = [[5 * 1, 0 * 0, 4 * 1],
              [0 * 0, 3 * 1, 2 * 1]]
            = [[5, 0, 4],
                [0, 3, 2]]
    '''
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    '''
    0.5 * tf.reduce_sum(j**2):
      This computes the squared error (j**2) for all movie-user pairs where ratings are available.
      tf.reduce_sum() sums all the squared errors, effectively calculating the total error across all ratings.
      The factor 0.5 is a conventional scaling that cancels the 2 produced when the squared term is differentiated.

    Regularization term:
      (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2)):
      This adds the regularization term to avoid overfitting.
      tf.reduce_sum(X**2) and tf.reduce_sum(W**2) compute the sum of squared entries (the squared Frobenius norm) of the feature matrices X and W.
      The regularization term helps keep the model weights small and prevents overfitting by penalizing large weights.
      # Create a 2D tensor (3x3 matrix)
      tensor = tf.constant([[1, 2, 3],
                            [4, 5, 6],
                            [7, 8, 9]])
      tf.reduce_sum(tensor) #output: 45
      The sum of all the elements in the tensor:
      1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 = 45
    '''
    return J

In [41]:
# Evaluate cost function
J = cofi_cost_func_v(X_r, W_r, b_r, Y_r, R_r, 0);
print(f"Cost: {J:0.2f}")

# Evaluate cost function with regularization 
J = cofi_cost_func_v(X_r, W_r, b_r, Y_r, R_r, 1.5);
print(f"Cost (with regularization): {J:0.2f}")

Cost: 13.67
Cost (with regularization): 28.09


**Expected Output**:  
Cost: 13.67  
Cost (with regularization): 28.09
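One detail worth checking is that the loop version multiplies by $R$ after squaring the error, while the vectorized version multiplies by $R$ before squaring. Because $R$ holds only 0s and 1s, the two agree. The sketch below restates both forms in plain NumPy on random toy data; the function names `cost_loops_sketch` and `cost_numpy_sketch` are hypothetical and are not part of the lab's code.

```python
import numpy as np

def cost_loops_sketch(X, W, b, Y, R, lambda_):
    """Loop form: mask applied after squaring, as in the for-loop exercise."""
    J = 0.0
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            J += R[i, j] * (np.dot(W[j], X[i]) + b[0, j] - Y[i, j]) ** 2
    return J / 2 + (lambda_ / 2) * (np.sum(W ** 2) + np.sum(X ** 2))

def cost_numpy_sketch(X, W, b, Y, R, lambda_):
    """Matrix form: mask applied before squaring, as in the vectorized cell."""
    err = (X @ W.T + b - Y) * R
    return 0.5 * np.sum(err ** 2) + (lambda_ / 2) * (np.sum(X ** 2) + np.sum(W ** 2))

rng = np.random.default_rng(1)
n_m, n_u, n = 5, 4, 3                                  # toy sizes
X_toy = rng.normal(size=(n_m, n))
W_toy = rng.normal(size=(n_u, n))
b_toy = rng.normal(size=(1, n_u))
R_toy = (rng.random((n_m, n_u)) < 0.5).astype(int)     # random 0/1 mask
Y_toy = rng.integers(1, 6, size=(n_m, n_u)) * R_toy    # ratings only where R = 1

J_loops = cost_loops_sketch(X_toy, W_toy, b_toy, Y_toy, R_toy, 1.5)
J_numpy = cost_numpy_sketch(X_toy, W_toy, b_toy, Y_toy, R_toy, 1.5)
print(np.isclose(J_loops, J_numpy))                    # True
```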

<a name="5"></a>
## 5 - Learning movie recommendations <img align="left" src="./images/film_man_action.png" style=" width:40px;  " >
------------------------------

After you have finished implementing the collaborative filtering cost
function, you can start training your algorithm to make
movie recommendations for yourself. 

In the cell below, you can enter your own movie choices. The algorithm will then make recommendations for you! We have filled out some values according to our preferences, but after you have things working with our choices, you should change this to match your tastes.
A list of all movies in the dataset is in the file [movie list](data/small_movie_list.csv).

In [None]:
movieList, movieList_df = load_Movie_List_pd()

my_ratings = np.zeros(num_movies)          #  Initialize my ratings

# Check the file small_movie_list.csv for id of each movie in our dataset
# For example, Toy Story 3 (2010) has ID 2700, so to rate it "5", you can set
my_ratings[2700] = 5 

#Or suppose you did not enjoy Persuasion (2007), you can set
my_ratings[2609] = 2;

# We have selected a few movies we liked / did not like and the ratings we gave are as follows:
my_ratings[929]  = 5   # Lord of the Rings: The Return of the King, The
my_ratings[246]  = 5   # Shrek (2001)
my_ratings[2716] = 3   # Inception
my_ratings[1150] = 5   # Incredibles, The (2004)
my_ratings[382]  = 2   # Amelie (Fabuleux destin d'Amélie Poulain, Le)
my_ratings[366]  = 5   # Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
my_ratings[622]  = 5   # Harry Potter and the Chamber of Secrets (2002)
my_ratings[988]  = 3   # Eternal Sunshine of the Spotless Mind (2004)
my_ratings[2925] = 1   # Louis Theroux: Law & Disorder (2008)
my_ratings[2937] = 1   # Nothing to Declare (Rien à déclarer)
my_ratings[793]  = 5   # Pirates of the Caribbean: The Curse of the Black Pearl (2003)
my_rated = [i for i in range(len(my_ratings)) if my_ratings[i] > 0]

print('\nNew user ratings:\n')
for i in range(len(my_ratings)):
    if my_ratings[i] > 0 :
        print(f'Rated {my_ratings[i]} for  {movieList_df.loc[i,"title"]}');


New user ratings:

Rated 5.0 for  Shrek (2001)
Rated 5.0 for  Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
Rated 2.0 for  Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
Rated 5.0 for  Harry Potter and the Chamber of Secrets (2002)
Rated 5.0 for  Pirates of the Caribbean: The Curse of the Black Pearl (2003)
Rated 5.0 for  Lord of the Rings: The Return of the King, The (2003)
Rated 3.0 for  Eternal Sunshine of the Spotless Mind (2004)
Rated 5.0 for  Incredibles, The (2004)
Rated 2.0 for  Persuasion (2007)
Rated 5.0 for  Toy Story 3 (2010)
Rated 3.0 for  Inception (2010)
Rated 1.0 for  Louis Theroux: Law & Disorder (2008)
Rated 1.0 for  Nothing to Declare (Rien à déclarer) (2010)


In [70]:
my_ratings.shape

(4778,)

In [73]:
len(my_rated)

13

In [49]:
# type(movieList) #list
movieList_df.shape  #(4778, 3)

(4778, 3)

In [52]:
len(movieList)

4778

In [51]:
movieList[:3]

['Yards, The (2000)', 'Next Friday (2000)', 'Supernova (2000)']

In [50]:
movieList_df[:3]

|   | mean rating | number of ratings | title              |
|:--|:------------|:------------------|:-------------------|
| 0 | 3.4         | 5                 | Yards, The (2000)  |
| 1 | 3.25        | 6                 | Next Friday (2000) |
| 2 | 2.0         | 4                 | Supernova (2000)   |


Now, let's add these reviews to $Y$ and $R$ and normalize the ratings.

In [None]:
# Reload ratings
Y, R = load_ratings_small()

# Add new user ratings to Y 
Y = np.c_[my_ratings, Y]
'''
my_ratings: This represents the ratings provided by a new user. It could be a 1D array, for example, of size (num_movies,), where each element corresponds to the rating the new user gave to each movie.
np.c_[]: This function is used to concatenate arrays along the columns. In this case, my_ratings is added as a new column to the left of the existing Y matrix (so the new user’s ratings will be added as the first column in Y).
Before: Y is (num_movies, num_users)
After: Y becomes (num_movies, num_users + 1) (with the new user’s ratings as the first column).
'''

# Add new user indicator matrix to R
R = np.c_[(my_ratings != 0).astype(int), R]

# Normalize the Dataset
Ynorm, Ymean = normalizeRatings(Y, R)
# The normalizeRatings function subtracts each movie's mean rating (computed only over the users who rated it) from that movie's ratings, so the rated entries of every movie average to 0. Mean normalization is commonly used in collaborative filtering so that a user with few or no ratings gets predictions near each movie's mean rather than near 0. It can also help the optimization converge.
# Explanation of the expression Y - np.multiply(Ymean, R):
# np.multiply(Ymean, R): This elementwise multiplies the mean rating Ymean for each movie with R. This ensures that the mean is subtracted from the ratings only for the rated movies.
# For unrated movies (R[i, j] == 0), this multiplication gives 0, so no change is made to the rating.
# For rated movies (R[i, j] == 1), it subtracts the mean rating from the rating of that movie.
# Y - np.multiply(Ymean, R): This subtracts each movie's mean rating from its existing ratings, so that they are centered around 0. This step normalizes the ratings.

# Example (using the previous movie ratings):
# Movie 1 original ratings: [5, 0, 4]  
# Mean rating for Movie 1 = 4.5  
# Normalized ratings for Movie 1: [5 - 4.5, 0 - 0, 4 - 4.5] = [0.5, 0, -0.5]

In [56]:
(my_ratings != 0).astype(int)

array([0, 0, 0, ..., 0, 0, 0], shape=(4778,))

In [55]:
my_ratings.shape

(4778,)

In [54]:
Y.shape


(4778, 444)
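The normalization step used above can be sketched as follows. This is a minimal stand-in written from the description in the comments, assuming per-movie means over rated entries. The function name `normalize_ratings_sketch` is hypothetical, and the helper in `recsys_utils` may differ in detail.

```python
import numpy as np

def normalize_ratings_sketch(Y, R):
    """Hypothetical stand-in for normalizeRatings: per-movie mean normalization."""
    # Mean of each movie's existing ratings; the small constant avoids 0/0
    # for movies that nobody has rated.
    Ymean = (np.sum(Y * R, axis=1) / (np.sum(R, axis=1) + 1e-12)).reshape(-1, 1)
    # Subtract the mean only where a rating exists, so unrated entries stay 0.
    Ynorm = Y - Ymean * R
    return Ynorm, Ymean

Y_toy = np.array([[5.0, 0.0, 4.0],
                  [0.0, 3.0, 2.0]])
R_toy = (Y_toy != 0).astype(int)

Ynorm, Ymean = normalize_ratings_sketch(Y_toy, R_toy)
print(np.round(Ymean.ravel(), 3))   # [4.5 2.5]
print(np.round(Ynorm, 3))           # rated entries become +0.5 or -0.5; unrated stay 0
```

Because the model is then trained on the normalized ratings, a prediction on the original scale is recovered by adding the movie's mean back to the model output.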

Let's prepare to train the model. Initialize the parameters and select the Adam optimizer.

In [None]:
#  Useful Values
num_movies, num_users = Y.shape
num_features = 100
'''
num_features: This represents the number of features used for latent factorization. 
In matrix factorization, we attempt to represent each movie and user in a latent feature space (of dimension num_features). Each feature represents some hidden aspect of movies and users that contributes to their ratings.
'''

# Set Initial Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(1234) # for consistent results
W = tf.Variable(tf.random.normal((num_users,  num_features),dtype=tf.float64),  name='W')
'''
W: This is the matrix of user parameters (latent factors), which has dimensions (num_users, num_features). Each row in W corresponds to the latent features for a specific user. These parameters will be learned during training to predict ratings.
tf.random.normal(...): Generates a matrix of random values from a normal distribution with mean 0 and standard deviation 1, with shape (num_users, num_features).
tf.Variable(...): W is a trainable variable. The tf.Variable wrapper ensures that TensorFlow tracks this matrix so it can be updated during training.
'''
X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float64),  name='X')
'''
X: This is the matrix of item (movie) parameters (latent factors), which has dimensions (num_movies, num_features). Each row in X corresponds to the latent features for a specific movie.
Similar to W, it's initialized with random values and is a trainable variable.
'''
b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float64),  name='b')
'''
b: This represents the bias term for each user. It has shape (1, num_users), meaning each user gets one bias value that is added to their predicted rating. Bias terms help account for differences in average ratings between users.
Like W and X, it's initialized randomly and will be updated during training.
'''
# Instantiate an optimizer.
optimizer = keras.optimizers.Adam(learning_rate=1e-1)
# optimizer: This creates an optimizer (specifically the Adam optimizer). Optimizers are responsible for updating the model’s parameters (W, X, b) based on the computed gradients to minimize the cost function (like mean squared error).
# learning_rate=1e-1: The learning rate controls how much the parameters (W, X, b) are updated during each step of training. A higher learning rate makes larger updates, while a smaller rate leads to more gradual changes. In this case, the learning rate is set to 0.1, which is much larger than the Keras Adam default of 0.001.

Let's now train the collaborative filtering model. This will learn the parameters $\mathbf{X}$, $\mathbf{W}$, and $\mathbf{b}$. 

The operations involved in learning $w$, $b$, and $x$ simultaneously do not fall into the typical 'layers' offered in the TensorFlow neural network package. Consequently, the flow used in Course 2 (Model, Compile(), Fit(), Predict()) is not directly applicable. Instead, we can use a custom training loop.

Recall from earlier labs the steps of gradient descent.
- repeat until convergence:
    - compute forward pass
    - compute the derivatives of the loss relative to parameters
    - update the parameters using the learning rate and the computed derivatives 
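
The steps above can be sketched without any framework. The following is a minimal NumPy illustration on made-up toy data (the values of `Y`, `R`, `alpha`, and the iteration count are assumptions chosen for the example, not the lab's settings). It uses the same form of regularized squared-error cost as this lab, with the derivatives written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 3 movies, 2 users; R marks which ratings exist.
Y = np.array([[5.0, 1.0], [4.0, 0.0], [1.0, 5.0]])
R = np.array([[1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

num_movies, num_users = Y.shape
num_features = 2
X = rng.normal(size=(num_movies, num_features))   # movie features
W = rng.normal(size=(num_users, num_features))    # user parameters
b = np.zeros((1, num_users))                      # user biases

alpha, lambda_ = 0.01, 0.1   # assumed learning rate and regularization
costs = []
for _ in range(300):
    # Forward pass: prediction error on rated entries only.
    E = (X @ W.T + b - Y) * R
    costs.append(0.5 * np.sum(E**2)
                 + 0.5 * lambda_ * (np.sum(X**2) + np.sum(W**2)))
    # Derivatives of the cost with respect to each parameter.
    dX = E @ W + lambda_ * X
    dW = E.T @ X + lambda_ * W
    db = E.sum(axis=0, keepdims=True)
    # Update all parameters simultaneously.
    X, W, b = X - alpha * dX, W - alpha * dW, b - alpha * db

print(f"cost: {costs[0]:.2f} -> {costs[-1]:.2f}")
```

Writing `dX`, `dW`, and `db` by hand is manageable here, but it is exactly the part that automatic differentiation removes.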
    
TensorFlow has the marvelous capability of calculating the derivatives for you. This is shown below. Within the `tf.GradientTape()` section, operations on TensorFlow Variables are tracked. When `tape.gradient()` is later called, it will return the gradient of the loss with respect to the tracked variables. The gradients can then be applied to the parameters using an optimizer. 
This is a very brief introduction to a useful feature of TensorFlow and other machine learning frameworks. Further information can be found by investigating "custom training loops" within the framework of interest.
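
To see the tape, gradient, and optimizer pattern in isolation, here is a minimal sketch on a toy problem. The single variable `w` and the loss $(w - 3)^2$ are assumptions made up for illustration and are not part of this lab.

```python
import tensorflow as tf

# Toy problem (hypothetical): find w that minimizes (w - 3)^2.
w = tf.Variable(0.0, name="w")
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

for step in range(200):
    # Record the operations that produce the loss.
    with tf.GradientTape() as tape:
        loss = (w - 3.0) ** 2
    # Gradient of the loss with respect to the tracked variable.
    grads = tape.gradient(loss, [w])
    # One optimizer step: pair each gradient with its variable.
    optimizer.apply_gradients(zip(grads, [w]))

final_w = float(w.numpy())
print(f"w after training: {final_w:.3f}")
```

The collaborative filtering loop follows the same three calls; only the loss and the list of variables change.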
    


In [None]:
# This is part of the training loop for a collaborative filtering model using matrix factorization in TensorFlow. The model updates its parameters using gradient descent to minimize the cost function (e.g., Mean Squared Error, or another cost).
iterations = 200
lambda_ = 1
'''
iterations = 200: This defines the number of iterations (or epochs) for training. The model will update its parameters 200 times.
lambda_ = 1: This is the regularization parameter, used to prevent overfitting by penalizing large values of the parameters X and W. In this example, b is not regularized.
'''
for iter in range(iterations): #The loop runs for 200 iterations, and in each iteration, the model will perform the following steps: forward pass, calculate gradients, and update parameters.
    # Use TensorFlow’s GradientTape
    # to record the operations used to compute the cost 
    with tf.GradientTape() as tape:
        '''
        tf.GradientTape: This is a TensorFlow utility that records the operations that are performed on trainable variables (i.e., X, W, b). The gradients of the cost function with respect to these variables will be computed automatically later.
        It watches all operations inside the with block to compute gradients for the variables involved.
        '''
        # Compute the cost (forward pass included in cost)
        cost_value = cofi_cost_func_v(X, W, b, Ynorm, R, lambda_)
        '''
        This line computes the cost (cost_value) using the function cofi_cost_func_v.
        X, W, b: These are the trainable parameters (the movie feature matrix, the user parameter matrix, and the user bias term) that we want to optimize.
        Ynorm: The normalized ratings matrix, which the model is trying to approximate.
        R: The binary matrix indicating where the ratings are present (1 if rated, 0 if missing).
        lambda_: The regularization parameter to control overfitting.
        The cost function calculates the error between the predicted ratings and the actual ratings, plus a regularization term to avoid overfitting.
        '''

    # Use the gradient tape to automatically retrieve
    # the gradients of the loss with respect to the trainable variables
    grads = tape.gradient( cost_value, [X,W,b] )
    '''
    tape.gradient(): This function computes the gradients of the cost_value with respect to the trainable variables [X, W, b].
    The gradients indicate how much each variable (i.e., X, W, b) should be adjusted to minimize the cost function. Gradients are computed by backpropagating the error through the model.
    '''
    # Run one step of gradient descent by updating
    # the value of the variables to minimize the loss.
    optimizer.apply_gradients( zip(grads, [X,W,b]) )
    '''
    optimizer.apply_gradients(): This method applies the gradients to the variables to update them using gradient descent.
    The gradients (grads) are zip-ped together with the variables [X, W, b]. This pairs each gradient with the corresponding variable.
    The Adam optimizer (used earlier) adjusts the parameters X, W, and b based on the gradients, with a specified learning rate, to minimize the cost function.
    What is zip? The zip() function is a built-in Python function that pairs elements from multiple iterables (like lists or arrays) together into tuples. It effectively "zips" the iterables together, combining corresponding elements from each iterable.
    a = [1, 2, 3]
    b = ['a', 'b', 'c']
    result = zip(a, b)
    print(list(result))  # Output: [(1, 'a'), (2, 'b'), (3, 'c')]

    '''

    # Log periodically.
    if iter % 20 == 0:
        print(f"Training loss at iteration {iter}: {cost_value:0.1f}")

Training loss at iteration 0: 1904.5
Training loss at iteration 20: 1798.2
Training loss at iteration 40: 1789.3
Training loss at iteration 60: 1786.1
Training loss at iteration 80: 1783.8
Training loss at iteration 100: 1781.9
Training loss at iteration 120: 1780.2
Training loss at iteration 140: 1778.7
Training loss at iteration 160: 1777.4
Training loss at iteration 180: 1776.2


In [65]:
type(X)
X.shape

TensorShape([4778, 100])

In [66]:
Ymean

array([[3.4 ],
       [3.25],
       [2.  ],
       ...,
       [3.5 ],
       [3.5 ],
       [3.5 ]], shape=(4778, 1))

In [67]:
p = np.matmul(X.numpy(), np.transpose(W.numpy())) + b.numpy()
p.shape

(4778, 444)
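
The two shapes printed above, `(4778, 1)` for `Ymean` and `(4778, 444)` for `p`, are compatible under NumPy broadcasting, so adding them applies each movie's mean to every user column. A small sketch with made-up numbers (the toy values below are assumptions for illustration only):

```python
import numpy as np

# Toy shapes (hypothetical): 3 movies, 2 users.
p = np.array([[0.5, -0.5],
              [0.0,  1.0],
              [-1.0, 0.25]])               # (3, 2) mean-normalized predictions
Ymean = np.array([[3.5], [2.0], [4.0]])    # (3, 1) one mean per movie

# Ymean is broadcast across the user columns.
pm = p + Ymean
print(pm.shape)
print(pm)
```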

<a name="6"></a>
## 6 - Recommendations
Below, we compute the ratings for all the movies and users and display the movies that are recommended. These are based on the movies and ratings entered as `my_ratings[]` above. To predict the rating of movie $i$ for user $j$, you compute $\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)}$. This can be computed for all ratings using matrix multiplication.
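
As a quick check that the matrix form agrees with the per-pair formula, the following sketch uses small random matrices (the sizes and random values are assumptions for illustration, not the trained parameters).

```python
import numpy as np

rng = np.random.default_rng(1)
num_movies, num_users, num_features = 4, 3, 2    # toy sizes (hypothetical)
X = rng.normal(size=(num_movies, num_features))  # one row per movie
W = rng.normal(size=(num_users, num_features))   # one row per user
b = rng.normal(size=(1, num_users))              # one bias per user

# All predictions at once: entry (i, j) is w^(j) . x^(i) + b^(j).
p = X @ W.T + b

# Spot-check one entry against the per-pair formula.
i, j = 2, 1
single = np.dot(W[j], X[i]) + b[0, j]
print(p.shape, np.isclose(p[i, j], single))
```

The result has one row per movie and one column per user, which is why a single user's predictions are read from a column.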

In [68]:
# Make a prediction using trained weights and biases
p = np.matmul(X.numpy(), np.transpose(W.numpy())) + b.numpy()
'''
np.matmul(X.numpy(), np.transpose(W.numpy())):
X is the matrix of movie features (shape: (num_movies, num_features)).
W is the matrix of user features (shape: (num_users, num_features)).
np.matmul(X, W.T) performs a matrix multiplication between X and the transpose of W. This results in a matrix of predicted ratings based on the features for each movie and user. Each element represents the predicted rating for a specific user and movie.
+ b.numpy():
b is the bias for each user, and adding it adjusts the predicted ratings for each user. This helps shift the predictions according to the user’s general rating behavior.
Thus, p represents the predicted ratings matrix, where each element is the predicted rating for a movie-user pair.
'''
#restore the mean
pm = p + Ymean
'''
Ymean: This is the mean rating for each movie (the average of the ratings that movie received), with shape (num_movies, 1).
By adding Ymean to p, you restore the per-movie mean that was subtracted during the normalization step, so the predictions are back on the original rating scale.
pm now holds the final predicted ratings, after adjusting for the mean.
'''
my_predictions = pm[:,0]
'''
pm[:,0]: This selects the predicted ratings for the first user (i.e., the first column of the matrix pm).
In this case, my_predictions contains the predicted ratings for the new user, who was added to the dataset as the first column.
'''
# sort predictions
ix = tf.argsort(my_predictions, direction='DESCENDING')
'''
tf.argsort(my_predictions, direction='DESCENDING'): This sorts the predicted ratings in descending order (from highest to lowest).
ix: This is the sorted index array, which gives the indices of my_predictions sorted from highest to lowest. These indices represent the order of movies, starting from the movie with the highest predicted rating.
'''
for i in range(17):
    j = ix[i]
    if j not in my_rated:
        print(f'Predicting rating {my_predictions[j]:0.2f} for movie {movieList[j]}')

print('\n\nOriginal vs Predicted ratings:\n')
for i in range(len(my_ratings)):
    if my_ratings[i] > 0:
        print(f'Original {my_ratings[i]}, Predicted {my_predictions[i]:0.2f} for {movieList[i]}')

Predicting rating 4.75 for movie Colourful (Karafuru) (2010)
Predicting rating 4.58 for movie One I Love, The (2014)
Predicting rating 4.58 for movie Laggies (2014)
Predicting rating 4.58 for movie Delirium (2014)
Predicting rating 4.56 for movie Battle Royale 2: Requiem (Batoru rowaiaru II: Chinkonka) (2003)
Predicting rating 4.56 for movie Into the Abyss (2011)
Predicting rating 4.56 for movie Eichmann (2007)
Predicting rating 4.55 for movie Particle Fever (2013)
Predicting rating 4.54 for movie 'Salem's Lot (2004)
Predicting rating 4.53 for movie Deathgasm (2015)


Original vs Predicted ratings:

Original 5.0, Predicted 4.88 for Shrek (2001)
Original 5.0, Predicted 4.85 for Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
Original 2.0, Predicted 2.19 for Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
Original 5.0, Predicted 4.86 for Harry Potter and the Chamber of Secrets (2002)
Original 5.0, Predicted 4.86 for Pirates of the Carib

In [69]:
ix

<tf.Tensor: shape=(4778,), dtype=int32, numpy=
array([1150,  246,  929, ..., 2644, 1938, 3680],
      shape=(4778,), dtype=int32)>
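
The index tensor above lists movie indices from highest to lowest predicted rating. The same idea can be sketched in NumPy, here using `np.argsort` on negated values as a stand-in for `tf.argsort(..., direction='DESCENDING')`. The toy predictions and the rated list are assumptions made up for the example.

```python
import numpy as np

my_predictions = np.array([3.2, 4.9, 4.1, 2.0, 4.6])  # toy values (hypothetical)
my_rated = [1]                                         # indices already rated

# Indices ordered from highest to lowest prediction.
ix = np.argsort(-my_predictions)

# Keep the best predictions among movies not yet rated.
top_unrated = [int(j) for j in ix if j not in my_rated][:3]
print(ix.tolist(), top_unrated)
```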

In practice, additional information can be utilized to enhance our predictions. Above, the predicted ratings for the first few hundred movies lie in a small range. We can refine the list by keeping, from those top movies, only the ones with more than 20 ratings and then sorting them by average rating. This section uses a [Pandas](https://pandas.pydata.org/) data frame which has many handy sorting features.
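
A minimal sketch of this kind of selection on a made-up table is shown here. The column names mirror the ones used in this lab, but the rows, the cutoff of four top predictions, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy table (hypothetical values) that already holds a "pred" column.
df = pd.DataFrame({
    "pred": [4.8, 4.7, 4.5, 4.4, 3.0],
    "mean rating": [3.9, 4.6, 4.2, 4.4, 4.9],
    "number of ratings": [150, 5, 60, 25, 300],
    "title": ["A", "B", "C", "D", "E"],
})

# 1. Take the movies with the highest predicted ratings.
top_n = df.sort_values("pred", ascending=False).head(4)
# 2. Keep only movies with enough ratings to trust the average.
popular = top_n[top_n["number of ratings"] > 20]
# 3. Order what remains by average rating.
result = popular.sort_values("mean rating", ascending=False)
print(result)
```

Movie "B" is dropped for having too few ratings, and movie "E" never enters because its predicted rating is outside the top four.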

In [None]:
filter=(movieList_df["number of ratings"] > 20)
'''
movieList_df["number of ratings"]: This accesses the column in the movieList_df DataFrame that contains the number of ratings each movie has received.
filter: This creates a boolean mask (filter) that is True for movies that have more than 20 ratings, and False for movies that have 20 or fewer ratings.
'''
movieList_df["pred"] = my_predictions
'''
movieList_df["pred"]: This adds a new column pred to the movieList_df DataFrame and assigns it the values from my_predictions, the predicted ratings for the first user.
Now, the DataFrame includes both the mean rating and the predicted rating for each movie.
'''
movieList_df = movieList_df.reindex(columns=["pred", "mean rating", "number of ratings", "title"])
'''
reindex(columns=[...]): This reorders the columns of the movieList_df DataFrame to the specified list: ["pred", "mean rating", "number of ratings", "title"].
"pred": The predicted rating for each movie (from the previous step).
"mean rating": The mean rating for each movie.
"number of ratings": The total number of ratings each movie has received.
"title": The title of the movie.
The columns are now rearranged for easier access and display.
'''
movieList_df.loc[ix[:300]].loc[filter].sort_values("mean rating", ascending=False)
'''
ix[:300]: This selects the first 300 elements of ix, the index tensor produced by tf.argsort above, which orders movies from highest to lowest predicted rating. In other words, it picks the 300 movies with the highest predictions.
loc[filter]: This applies the filter (i.e., movies with more than 20 ratings), so only the movies that meet the condition number of ratings > 20 will be included in the final selection.
sort_values("mean rating", ascending=False): After filtering the movies, this sorts the remaining movies by their mean rating in descending order (ascending=False), so that movies with the highest mean ratings appear first.
'''

|      | pred     | mean rating | number of ratings | title                                              |
|-----:|---------:|------------:|------------------:|:---------------------------------------------------|
| 2112 | 4.148116 | 4.238255    | 149               | Dark Knight, The (2008)                            |
| 155  | 4.235979 | 4.155914    | 93                | Snatch (2000)                                      |
| 929  | 4.865141 | 4.118919    | 185               | Lord of the Rings: The Return of the King, The...  |
| 2700 | 4.774415 | 4.109091    | 55                | Toy Story 3 (2010)                                 |
| 393  | 4.426001 | 4.106061    | 198               | Lord of the Rings: The Fellowship of the Ring,...  |
| 653  | 4.365652 | 4.021277    | 188               | Lord of the Rings: The Two Towers, The (2002)      |
| 2804 | 4.093862 | 3.989362    | 47                | Harry Potter and the Deathly Hallows: Part 1 (...  |
| 773  | 4.376168 | 3.960993    | 141               | Finding Nemo (2003)                                |
| 2649 | 4.051183 | 3.943396    | 53                | How to Train Your Dragon (2010)                    |
| 1051 | 4.093187 | 3.913978    | 93                | Harry Potter and the Prisoner of Azkaban (2004)    |


<a name="7"></a>
## 7 - Congratulations! <img align="left" src="./images/film_award.png"     style=" width:40px;  " >
You have implemented a useful recommender system!

<details>
  <summary><font size="2" color="darkgreen"><b>Please click here if you want to experiment with any of the non-graded code.</b></font></summary>
    <p><i><b>Important Note: Please only do this when you've already passed the assignment to avoid problems with the autograder.</b></i>
    <ol>
        <li> On the notebook’s menu, click “View” > “Cell Toolbar” > “Edit Metadata”</li>
        <li> Hit the “Edit Metadata” button next to the code cell which you want to lock/unlock</li>
        <li> Set the attribute value for “editable” to:
            <ul>
                <li> “true” if you want to unlock it </li>
                <li> “false” if you want to lock it </li>
            </ul>
        </li>
        <li> On the notebook’s menu, click “View” > “Cell Toolbar” > “None” </li>
    </ol>
    <p> Here's a short demo of how to do the steps above: 
        <br>
        <img src="https://drive.google.com/uc?export=view&id=14Xy_Mb17CZVgzVAgq7NCjMVBvSae3xO1" align="center" alt="unlock_cells.gif">
</details>