<div style="background-image: linear-gradient(145deg, rgba(35, 47, 62, 1) 0%, rgba(0, 49, 129, 1) 40%, rgba(32, 116, 213, 1) 60%, rgba(244, 110, 197, 1) 85%, rgba(255, 173, 151, 1) 100%); padding: 1rem 2rem; width: 95%"><img style="width: 60%;" src="../../images/MLU_logo.png"></div>

# <a name="0">MLU Mathematical Fundamentals for Machine Learning</a>
# <a name="0">Final Project</a>

## <a name="0">Design and build a recommender system</a>

At Amazon, we understand the power of **personalized recommendations** in driving customer engagement and revenue growth. While our recommender systems are highly sophisticated and leverage cutting-edge machine learning techniques, the core principles can be distilled into fundamental mathematical and statistical concepts. In this final project, we will embark on a journey to build a minimum viable **book recommender system** from the ground up. Through this hands-on experience, you'll gain insights into the iterative process of prototyping and refining recommendation algorithms, laying the foundation for more advanced exploration in this domain.


# <a name="0">MLU MATH - Final Project</a>


Although recommender systems are the secret sauce behind many multi-billion-dollar businesses, prototyping a minimum viable recommender system requires only basic mathematical and statistical fundamentals of machine learning. For the final project, we will start from scratch and walk through how to prototype a book recommender system.

* <a href="#99">Business Problem: Recommender System. ML Problem and Data Loading</a>
* <a href="#1">1. Basic Data Visualization</a>
* <a href="#2">2. Data Vectorization</a> 
* <a href="#3">3. Model Definition</a>
* <a href="#4">4. Baseline Model</a>
* <a href="#5">5. Model Likelihood</a>
* <a href="#6">6. Loss Function</a>
* <a href="#7">7. Gradient Descent</a>
* <a href="#8">8. Overfitting</a>
* <a href="#9">9. Recommendations and Improvements</a> 
* <a href="#10">10. Submit Final Model to Leaderboard</a> 

In [None]:
# Upgrade dependencies
!pip install -q --upgrade pip
!pip install -q --upgrade scikit-learn

# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import torch

import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from IPython.display import Markdown, display, HTML

%matplotlib inline

## <a name="99">Business Problem: Recommender System</a> 
(<a href="#0">Go to top</a>)

For the final project of this course, you will leverage your growing mathematical knowledge to build a __recommender system__ similar to those used by most digital content providers, which attempts to predict users' preferences based on past ratings.  The technique we use is called __model-based collaborative filtering__: we use only a user's past behavior (items previously purchased and ratings given to those items) to build a model that predicts the missing data (ratings for items that the user may be interested in).


### Data Loading

You will be using a subset of Amazon reviews for the task.  It is based on a public dataset collected by scraping all reviews prior to 2014 and extracting the userID, ASIN, and star rating. The dataset has evolved ever since and can now be found as the [Amazon Reviews dataset](https://amazon-reviews-2023.github.io/). The full dataset would be very large and hard to train on, because there are many ASINs with only one or two reviews, and many users who have reviewed only one or two books.  The subset we use is a very restricted subset known as the $40$-core: the collection of ASINs and users such that every ASIN has at least 40 reviews and every user has reviewed at least 40 books. The ratings are on a scale from 1 to 5.

Let's load our datasets, ```training``` and ```test_features```. We build the recommender system using the ```training``` data, and evaluate how well the model performs on the ```test_features```, by submitting model predictions to the leaderboard, accessible via [MyMLU](https://portal.mlu.aws.dev/registration/mymlu/).

Notice that the ```test_features``` do not contain a column `Rating`, which is what you're expected to produce with your trained model. The evaluation of your predicted ratings against the actual ones will be performed when you submit to the leaderboard.


In [None]:
# Import the datasets
training = pd.read_csv("../../data/MATH_Final_Project_Data_training.csv")
test_features = pd.read_csv("../../data/MATH_Final_Project_Data_test_features.csv")

# Display the head of the files
display(Markdown("### Sample training data"))
display(HTML(training.head(5).to_html()))

display(Markdown("### Sample test data"))
display(HTML(test_features.head(5).to_html()))

The code below stores a list of the unique userIDs and ASINs, counts the number of such userIDs and ASINs, and then splits the ```training``` dataset into two sets: ```train``` (train set) and ```val``` (validation set). You can use the validation set to evaluate model performance before submitting to the leaderboard.

In [None]:
# Count number of unique users and number of unique ASINs in our dataset
# Unique user who reviewed
unique_users = training["User"].unique().tolist()
# Unique books that were reviewed
unique_asins = training["ASIN"].unique().tolist()

# Sort data to avoid index matching errors later on
unique_users.sort()
unique_asins.sort()

n_users = len(unique_users)
n_asins = len(unique_asins)

display(Markdown("### Amount of data"))
display(Markdown(f"Number of Users: {n_users}"))
display(Markdown(f"Number of ASINs: {n_asins}"))

# Split into train and validation
train, val = train_test_split(training, random_state=42, stratify=training["ASIN"])
n_train = train.shape[0]
n_val = val.shape[0]

display(Markdown("### Size of data splits"))
display(Markdown(f"Shape of original Training Data: {training.shape}"))
display(Markdown(f"- of which shape of Train split: {train.shape}"))
display(Markdown(f"- and shape of Validation split: {val.shape}"))
display(Markdown(f"Shape of Test Features Data: {test_features.shape}"))

##  <a name="1">1. Basic Data Visualization</a> 
(<a href="#0">Go to top</a>)

Let's dig in and start to do some data visualization.  When working with real-world data, it is always important to try to understand what bizarre little issues it has.  We've removed many of these for you, but all the same we should do at least a tiny exercise in data visualization so we can see some features of our data.

In a real-world setting, data collected from feedback such as user ratings can be very sparse, and most data points come from very popular items (books) and highly engaged users. A large number of lesser-known items (books) have no ratings at all. Let's look at some plots of the distribution of book rating frequency to get a __better idea of the data we have__, and ponder whether it is enough (and good enough) for a recommender model.

### Project Question 1

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Data visualization.</b></p>
        <p>Use <a href="https://matplotlib.org/">matplotlib</a> to produce the following two plots: 
            <ul><li>the <a href="https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html">histogram</a> of the number of reviews given by the users</li> 
                <li>a <a href="https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html">scatter plot</a> of the average rating of an ASIN versus the number of ratings received by the ASIN.</li>
        </ul>
        It is good practice to not touch the testing data during model development, so make these plots only on the train data.
</p>
    </span>
</div>



In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

<div style="align: left; border: 4px solid green; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_question.png" alt="MLU question" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Here's a tip!</b></p>
        <p>It is helpful to use calls like <code>.groupby(feature1)[feature2].count()</code> or <code>.groupby(feature1)[feature2].mean()</code> to prepare the data for the desired plots. </p>
    </span>
</div>

##  <a name="2">2. Data Vectorization</a> 
(<a href="#0">Go to top</a>)

Now that we agree that we have enough data for a recommender system, we can start designing a machine learning model to solve the problem. But if you go up and look again at the samples from the datasets, you'll notice that the data is not a vector or matrix. Various columns are strings rather than numbers, and even if we solved that (say by creating a unique ID number for every ASIN and user) it would still be in a very strange format.  __Deciding how to represent your data is a major first step in any ML problem.__  

First, you need to transform the dataframe of ratings into a proper format that can be consumed by a machine learning model, with a recommender system in mind. In fact, __you want the data to be in an $n\times m$ matrix format__, where $n$ is the number of users and $m$ is the number of ASINs. 

We can think of the data as a score matrix $S = \left(s_{i,j}\right)$ which is a $n\times m$ matrix where
$$
s_{i,j} = \begin{cases}
s & \text{if the rating by user $i$ to ASIN $j$ was $s$ stars,}\\
0 & \text{otherwise,}
\end{cases}
$$
where $n$ is the number of users and $m$ is the number of ASINs.

As you might notice, the data is extremely sparse, meaning that the $S$ matrix contains many zero elements. That makes sense, after examining the plots from Question 1: each user only rated a small number of ASINs. It will be useful to define another matrix, the matrix of which ASINs were rated. 

Let $R = \left(r_{i,j}\right)$ be the matrix where
$$
r_{i,j} = \begin{cases}
1 & \text{if user $i$ rated ASIN $j$,}\\
0 & \text{otherwise.}
\end{cases}
$$
This matrix will allow you to mask off which entries were observed or not.

These two matrices, $S$ and $R$, will form the core of the work for this project, so it is a good idea to define them and create them now.
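
As a minimal sketch of these two definitions, the snippet below builds $S$ and $R$ for a tiny, made-up ratings table. The toy data and the names `toy`, `S_toy`, and `R_toy` are assumptions for illustration only; the column names `User`, `ASIN`, and `Rating` mirror the project data. The mask is derived from the scores, which only works because valid ratings are never 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: 3 users, 3 ASINs, 4 observed ratings
toy = pd.DataFrame({
    "User": ["u1", "u1", "u2", "u3"],
    "ASIN": ["b1", "b3", "b2", "b3"],
    "Rating": [5, 3, 4, 1],
})

# Fixed, sorted row and column orders so every matrix shares the same layout
users = sorted(toy["User"].unique())
asins = sorted(toy["ASIN"].unique())

# Score matrix: users as rows, ASINs as columns, 0 where no rating exists
S_toy = (
    toy.pivot_table(index="User", columns="ASIN", values="Rating", fill_value=0)
    .reindex(index=users, columns=asins, fill_value=0)
    .to_numpy(dtype=float)
)

# Mask matrix: 1 where a rating was observed, 0 otherwise
R_toy = (S_toy > 0).astype(float)

print(S_toy)
print(R_toy)
```

Reindexing against one fixed list of users and ASINs is one way to keep matrices built from different splits the same shape, even when a split happens to miss a user or an ASIN.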


### Project Question 2

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Data vectorization.</b></p>
        <p>Translate the Pandas DataFrames that contain the train and validation data into two numpy matrices each: the score matrix <code>S</code> and the mask matrix <code>R</code>. This will leave you with four matrices:
            <ul>
                <li><code>R_train</code>,</li> 
                <li><code>S_train</code>,</li> 
                <li><code>R_val</code>, and</li> 
                <li><code>S_val</code>.</li> 
            </ul>
        Print all four matrices and examine their elements. These are sparse matrices, so many of them should be zero.
</p>
    </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

<div style="align: left; border: 4px solid green; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_question.png" alt="MLU question" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Here's a tip!</b></p>
        <p>It is helpful to know that, other than defining the matrices with for loops or list comprehensions, you can also use <code>pd.pivot_table</code> to pivot the dataframes to the desired format with users as rows and ASINs as columns. This second approach is much more performant than the first one.</p>
    </span>
</div>

##  <a name="3">3. Model Definition</a> 
(<a href="#0">Go to top</a>)

To better understand how to build a recommender system by [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering#), it helps to print and examine the ```S_train``` matrix. You'll notice that the majority of its entries are zero. In order to define the model, you need to have some way of making predictions for the unfilled entries --  that's ultimately the recommender system.

Your goal with this project is to construct a prediction matrix $P = (p_{i,j})$ that fills in all the missing scores where $r_{i,j} = 0$. If you assumed nothing about how the matrix $P$ is constructed, there would be no basis on which to build a prediction. You will need to assume that $P$ is generated from some process that has more structure than just an arbitrary matrix.

As already mentioned, the $S$ matrices have many zeros. Moreover, they are very large, 1490 by 1186 (and they could be even bigger in other practical scenarios!). Remember all the computational impediments related to storing, working, and manipulating large matrices, even on current high-dimensionality friendly ML frameworks. __Reducing dimensions__ can improve the performance of the algorithm in terms of both storage and time. You learned about matrix factorization as a simple way to build larger matrices from smaller ones in Lecture 2.

You also need to find out how some users' preferences for some ASINs might inform the ratings of other, not yet observed user-ASIN pairs. You want to connect users, ASINs, and ratings in a meaningful way to extract __underlying relationships__ that might exist between users and ASINs to explain their corresponding ratings, and leverage these __latent relationships__ to predict other user-ASIN interactions. This can be modeled linearly by factorizing the initial matrix as the product of two matrices. 

Remember from Lecture 1 that two matrices of sizes $n\times k$ and $q \times m$ can be multiplied only if the number of columns in the first matrix is equal to the number of rows in the second matrix, $k = q$. Moreover, if this number $k$ is small, the two matrices are much smaller than the matrix resulting from multiplying them. This is going to be the approach to solve the problem: reduce the dimensionality of our modeling approach by factorizing the matrix $P$ and, in doing so, obtain a meaningful model of the underlying user-ASIN relationship.

### Matrix Factorization

The prediction matrix will be the __product of two other matrices__. You will decompose the users-ASINs interactions matrix into the product of two lower-dimensionality rectangular matrices. These matrices represent the users and ASINs individually. 
* The first matrix $A$ can be seen as the _users matrix_, where rows represent the users, and columns tell us about some of users' affinities for ASINs -- let's call it the *affinities* matrix $A$. 
* The second matrix $F$ can be seen as the _ASINs matrix_, where rows tell us about the features ASINs have to offer to users, and columns represent the ASINs -- let's call it the *features* matrix $F$. 

The columns in the users *affinities* matrix $A$ and the rows in the ASINs *features* matrix $F$ are called **latent factors** and are an indication of hidden characteristics about the users or the ASINs, patterns in the data that could lead to more personalized recommendations. The number of such latent factors can be anything from one to hundreds. This number is one of the things that can be optimized during the training of the model. 

Mathematically, the idea is the following: 
* To each ASIN $j$ we will associate $k$ features $(f_{\ell,j})_{\ell = 1}^k$. These features can be thought of as simple descriptors of the book, such as the amount of action or romance, although we don't need to know what they are when we propose the model (i.e. this is not a _content-based_ recommender system).
* To each user $i$ we will associate $k$ affinities $(a_{i,\ell})_{\ell=1}^k$.  The affinities can be thought of as providing how much a person likes or dislikes a feature; so perhaps $+1$ for action, since they like action, and $-0.5$ for romance. Again, we don't need to define what these affinities are when we propose the model. These latent factors will be learned from the training data.

__The rating that a user will assign to an ASIN will be taken as a sum over all the features of their affinities for that feature times that ASIN's features.__
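
To make the sum concrete, here is a small NumPy sketch with randomly generated affinities and features. The sizes `n`, `m`, `k` and the random values are hypothetical; the point is only that the explicit sum over latent factors for one user-ASIN pair agrees with the corresponding entry of the matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 4, 3, 2  # hypothetical numbers of users, ASINs, and latent factors

A = rng.normal(size=(n, k))  # affinities: one row per user
F = rng.normal(size=(k, m))  # features: one column per ASIN

# All predictions at once as a matrix product
P = A @ F

# The same prediction for a single user i and ASIN j, written as an explicit sum
i, j = 1, 2
p_ij = sum(A[i, l] * F[l, j] for l in range(k))

print(P.shape)
print(np.isclose(P[i, j], p_ij))
```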

Regarding terminology, in the world of ML, a *parameter* is a number that needs to be fixed in order to specify a model completely.  In this case, every affinity and every feature must be specified for our model to be able to make predictions.  For instance, if every user has two affinities and every ASIN has two features, the total number of parameters is:
$$
\#\text{Parameters} = 2\cdot\#\text{User} + 2 \cdot\#\text{ASIN}.
$$

As a general rule of thumb, the more parameters you have, the more data you need to fit your model. An old statistics rule states that you need about 10 data points for each parameter. There are exceptions to this rule (particularly in deep learning), but it is always a good idea to keep the number of parameters in mind.
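
The counting formula above can be sketched as a tiny helper. The function name `count_parameters` and the sizes used here are hypothetical and are not the sizes of the project dataset.

```python
def count_parameters(n_users, n_asins, k):
    """Number of entries in a users-by-k matrix plus a k-by-ASINs matrix."""
    return k * n_users + k * n_asins

# Hypothetical sizes, chosen only to illustrate the orders of magnitude
n_users, n_asins, k = 1000, 500, 2

n_params = count_parameters(n_users, n_asins, k)
n_entries = n_users * n_asins

print(n_params)   # 3000
print(n_entries)  # 500000
```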


### Project Question 3

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Model definition.</b></p>
        <p>Express the model described above mathematically in matrix notation as a product of the matrices of features and affinities.
            <ul>
                <li>What are the dimensions of the matrices involved?</li> 
                <li>How many parameters are there in this model?</li> 
                <li>How does the number of parameters compare to the number of entries of $P$?</li> 
            </ul>
        Complete this question in abstract matrices shapes (in terms of $m,n,$ and $k$), and then say what this means for our dataset, in particular with $k=2$, in comparison to the total number of entries in $P$.
</p>
    </span>
</div>

---
###### YOUR ANSWER HERE ######






###### END OF ANSWER ######
---

<div style="align: left; border: 4px solid green; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_question.png" alt="MLU question" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Here's a tip!</b></p>
        <p>It is helpful to remember the formula for the matrix multiplication.</p>
    </span>
</div>

##  <a name="4">4. Baseline Model</a> 
(<a href="#0">Go to top</a>)

If you answered the question above, you realized that the total number of parameters of the proposed model is vastly smaller than the total number of entries in the matrix $P$. This is a key aspect: since we need to find the values for a comparatively small number of parameters to specify all of $P$, the model will be forced to find structure in the observations rather than to memorize scores. This helps generalize to unseen pairs of user-ASINs.

Let's consider the simplest example of this type of model. __Set $\mathbf{k=1}$__. Pick the single feature as the average star rating an ASIN received, and assume every user has an affinity of $+1$ for that feature. In the mathematical terms explained in Lecture 2, this is a __rank-1 approximation__.  When you compute the product $P=AF$, note that it predicts that every single user gives the same rating to the ASIN, and that rating is the average value that other users gave to that ASIN. Throughout this notebook, we will refer to this as the __baseline model__.  This gives a simple benchmark against which to measure progress.
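
The rank-1 structure can be illustrated on a toy score matrix. This is only a sketch under assumed data: `S_toy` is made up, and the per-ASIN average is taken over observed ratings only (column sum divided by the number of observed entries), which is one reasonable reading of "average star rating".

```python
import numpy as np

# Hypothetical toy score matrix: 3 users (rows) by 3 ASINs (columns), 0 = unrated
S_toy = np.array([
    [5.0, 0.0, 3.0],
    [0.0, 4.0, 0.0],
    [1.0, 0.0, 1.0],
])
R_toy = (S_toy > 0).astype(float)

# Average rating per ASIN over observed entries; guard against unrated ASINs
counts = R_toy.sum(axis=0)
avg_rating = S_toy.sum(axis=0) / np.maximum(counts, 1)

A = np.ones((S_toy.shape[0], 1))  # every user has affinity +1
F = avg_rating.reshape(1, -1)     # the single feature of each ASIN

P = A @ F  # every row is the same vector of ASIN averages
print(P)
```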

In terms of evaluation metrics, it is reasonable to measure the quality of the predictions by computing the average squared difference between the predictions and the true values in the *test* set, where the test set can be any set we would like to test our model against (the train or the validation set, for example):

$$
{\frac{1}{\#\text{Test data points}}\sum_{\text{Test data}}(\text{prediction} - \text{true test observation})^2}
$$

This is called the mean squared error (or MSE for short). The MSE is expressed in squared units (stars squared), which are difficult to interpret, so the root mean square (RMS) error can be used instead. The RMS error can be thought of as how far off the predictions are on average, and is obtained by taking the square root of the MSE:

$$
\sqrt{{\frac{1}{\#\text{Test data points}}\sum_{\text{Test data}}(\text{prediction} - \text{true test observation})^2}}.
$$
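
As a hedged illustration of the two formulas, the sketch below evaluates them on a tiny made-up example. The helper name `masked_mse` and the toy matrices are assumptions; the mask restricts the sum to the observed entries, so predictions for unobserved pairs do not contribute.

```python
import numpy as np

def masked_mse(S, P, R):
    """Mean squared error over the entries where the mask R equals 1."""
    n_observed = R.sum()
    return np.sum((R * (S - P)) ** 2) / n_observed

# Hypothetical toy data: two observed ratings out of four entries
S = np.array([[5.0, 0.0], [0.0, 3.0]])  # observed scores
R = np.array([[1.0, 0.0], [0.0, 1.0]])  # mask of observed entries
P = np.array([[4.0, 2.0], [1.0, 5.0]])  # predictions for every pair

mse = masked_mse(S, P, R)  # ((5 - 4)^2 + (3 - 5)^2) / 2
rms = np.sqrt(mse)

print(mse)  # 2.5
print(rms)
```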


### Project Question 4

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Baseline model.</b></p>
        <p> Write code to compute the vector of average ASIN scores from the training set as our single feature.  Construct a vector of all ones for your vector of affinities, and then use these two vectors to create a (baseline) matrix of predictions $P$.
            <ul>
                <li>Print the matrix $P$ and examine its elements. How does it compare to the validation matrix $S$?</li> 
                <li>Compute the MSE and the RMS on the validation set. How far off are we on average if we use this simple method for prediction?</li> 
            </ul>
</p>
    <p>Apply the baseline model to the test data set, produce a csv with the predictions, download it, and then upload it to the <a href="https://portal.mlu.aws.dev/contests/redirect/14">course's leaderboard</a> to achieve your first submission.</p>
    <p><b>The csv file for submission needs to contain two columns: the ID of the User-ASIN interaction, found in the first column of the <code>test_features</code> data frame, and a second column <code>Rating</code> containing the predicted rating for each User-ASIN pair.</b></p>
    </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

<div style="align: left; border: 4px solid green; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_question.png" alt="MLU question" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Here's a tip!</b></p>
        <p>To create a vector of ones, use <a href="https://numpy.org/doc/stable/reference/generated/numpy.ones.html"><code>np.ones</code></a>.</p>
        <p>For ease of implementation, note that the RMS is the same as
$$
RMS = \frac{1}{\sqrt{\#\text{Test data points}}}\|R\circ(S-P)\|_2
$$

where we use $R$ to mask off only the observed test entries, and the fact that the $L_2$ norm is the square root of the sum of the squares of entries. Using matrix products (implemented with <code>np.dot(A, B)</code>), the Hadamard product (implemented with <code>np.multiply(A, B)</code> or <code>A*B</code>), and <code>np.linalg.norm</code> for the $L_2$ norm, computing the RMS can be done in one line of code.
</p>
    </span>
</div>

##  <a name="5">5. Model Likelihood</a> 
(<a href="#0">Go to top</a>)


### Propose a model to capture the underlying relationship in the dataset  

This section is entirely theoretical and will contain no code. The proposed model for the recommender system is such that predicted ratings are obtained as $P = AF$. Unless the universe follows our model exactly, meaning that there are truly just $k$ numbers that determine how much a person likes a book and the person assigns that score every single time without fail, it won't be true that $S = AF$. Instead, $S$ will look like $P=AF$ plus some __independent random noise__, where the noise is meant to capture all the inaccuracies in the model. For instance, the rating given can fluctuate randomly depending on things entirely unrelated to the book.

The way to model this is to say that 
$$
S\sim P + \mathcal{N}(0,\sigma^2),
$$
where the noise is modeled as independent additive Gaussian noise. The way to read this is that every observed rating $s_{i,j}$ is distributed as a  prediction $p_{i,j}$ plus independent, zero-mean Gaussian noise of unknown variance for every observation.  Since adding a constant to a Gaussian just shifts the mean, we could also say that 
$$
s_{i,j}\sim\mathcal{N}(p_{i,j},\sigma^2).
$$ 
The scale of the noise, $\sigma^2$, is now an additional parameter of the model.


### Compute the likelihood that this model generates the data

For many __independent observations__, as assumed for the ratings, each with its own Gaussian probability density as proposed above, the probability of observing all datapoints (ratings) at the same time is given by the product of the individual probabilities. The probability or probability density of obtaining a particular set of data is often called the __likelihood__.
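
Although this section needs no code, a short numerical sketch can make the product concrete. The three ratings, the predictions, and the value of `sigma` below are made up, and `gaussian_pdf` is a hypothetical helper. The snippet multiplies the individual Gaussian densities and also sums their logarithms, which gives the same information in a numerically safer form when there are many observations.

```python
import numpy as np

def gaussian_pdf(x, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    return np.exp(-((x - mean) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Hypothetical observed ratings and model predictions for three user-ASIN pairs
s = np.array([5.0, 3.0, 4.0])
p = np.array([4.5, 3.5, 4.0])
sigma = 1.0  # assumed noise standard deviation

densities = gaussian_pdf(s, p, sigma)

# Independence: the joint density is the product of the individual densities
likelihood = np.prod(densities)

# The logarithm turns the product into a sum
log_likelihood = np.sum(np.log(densities))

print(likelihood)
print(np.isclose(np.log(likelihood), log_likelihood))
```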


### Project Question 5 

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Model likelihood.</b></p>
        <p> Let $N$ be the total number of observed ratings (of the training dataset). For the proposed model, explain why the likelihood is:
$$ 
\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^N e^{\displaystyle{-\frac{1}{2\sigma^2}\sum_{N}(s_{i,j} - p_{i,j})^2}}. 
$$
Write your derivation below.
</p>
    </span>
</div>

---
###### YOUR ANSWER HERE ######






###### END OF ANSWER ######
---

<div style="align: left; border: 4px solid green; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_question.png" alt="MLU question" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Here's a tip!</b></p>
        <p>For the derivation you need to use a fundamental property of the exponential function: $$\displaystyle{e^a e^b = e^{a+b}}.$$</p>
    </span>
</div>

##  <a name="6">6. Loss Function</a> 
(<a href="#0">Go to top</a>)


### Compute the negative log-likelihood (loss function)

The maximum likelihood estimation process is based on finding the values of the model parameters that make the data most likely, in the sense of occurring with the highest probability.  Notice that many of the terms in the likelihood we computed above do not depend on the predictions, such as the factor in front of the exponential. To find the choice of predictions $p_{i,j}$ that maximizes the probability, or rather the likelihood, one can ignore most things in the likelihood and focus on the piece that contains the predictions $p_{i,j}$.


### Project Question 6

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Loss function.</b></p>
        <p> Explain why, after discarding all terms that do not involve the data, the negative $\log$-likelihood (the <b>loss function</b>) is essentially:
$$
\sum_{N}(s_{i,j} - p_{i,j})^2.
$$
Explain why minimizing the loss function is the same as minimizing the MSE. This indicates that minimizing the intuitively reasonable MSE is the same thing as maximizing the probability of the data.</p>
    </span>
</div>


---
###### YOUR ANSWER HERE ######






###### END OF ANSWER ######
---

<div style="align: left; border: 4px solid green; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_question.png" alt="MLU question" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Here's a tip!</b></p>
        <p>It helps to remember some of the properties of the logarithm: $$\log(ab) = \log(a) + \log(b),$$ $$\log(a^N) = N \log(a),$$ $$\log(e) = 1.$$ However, the question can be answered even without using negative log, just by noticing that $e^{-x}$ is a decreasing function.</p>
    </span>
</div>

##  <a name="7">7. Gradient Descent</a> 
(<a href="#0">Go to top</a>)


### Find the model that minimizes the loss function

You will use __gradient descent__ to minimize the loss function iteratively, and consequently __learn the model from the train data__.

While the section above mentions the RMS as a more interpretable metric, one can instead optimize the MSE with the same results, which has the advantage of avoiding the square root. Since the square root is an increasing function, both are minimized by the same parameters. This freedom is useful: you can think in terms of the RMS when interpreting the results, and use the MSE when computationally optimizing the model.

Let's recap: it is reasonable to model the matrix of predictions as the product $AF$, since it corresponds to the idea that users have various preferences for various features of each ASIN. If this were not an ML problem, one could assign to every ASIN a list of features by hand (perhaps using Mechanical Turk), and then hand-code rules like, "if a person has assigned 5 stars to at least 10 sci-fi books, then they have affinity $+1$ for sci-fi books." This would likely work, but it involves a lot of trial and error and human labor.

The ML solution is to essentially work this problem backwards. We suppose that such features exist, we define a way to evaluate how well a collection of features and affinities works to describe the scores we observe, and then we try to automatically find the best possible set of features and affinities to explain the data we've seen. Thus, our goal is to find the best possible values for $A$ and $F$ in the sense that they minimize the MSE.

Let's now dive in to the implementation of gradient descent for this problem. The mean squared error loss function can be written as

$$
L(A,F) = \frac{1}{N}\|R\circ(S-AF)\|_2^2,
$$

where the variables have the following meanings:
* $N$ is the number of data points in the dataset used to define $R$ or $S$.  It could be either the training or testing data.
* $k$, while not contained in the above formula, is the number of features describing each ASIN and affinities describing each user. This is a hyperparameter that is set when one creates the matrices $A$ and $F$.
* $R$ is the matrix that masks off those entries for which a rating is available.
* $S$ is the matrix of observed ratings.
* $A$ are the affinities of the users for various features.
* $F$ are the values of those features for all the ASINs.
* $AF$ is the matrix product that contains the predictions of ratings for every pair of user and ASIN.

To perform gradient descent, you need to compute the gradient of the mean squared error loss function with respect to the parameters of the model: the entries of $A$ and $F$. While this loss function is still simple enough that one could find its derivatives by hand, here you will approach this with automatic differentiation and autograd.
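
The following is a minimal sketch, on made-up data, of how PyTorch's autograd produces these gradients. It is not a training loop. The toy matrices and sizes are assumptions, and the hand-derived expressions `grad_A` and `grad_F` are included only as a sanity check against what `backward()` computes.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: 4 users, 3 ASINs, 0 marks an unobserved rating
S = torch.tensor(
    [[5.0, 0.0, 3.0], [0.0, 4.0, 0.0], [1.0, 0.0, 2.0], [0.0, 2.0, 0.0]],
    dtype=torch.float64,
)
R = (S > 0).to(torch.float64)  # mask of observed entries
N = R.sum()                    # number of observed ratings
k = 2                          # number of latent factors

# Parameters tracked by autograd
A = torch.randn(S.shape[0], k, dtype=torch.float64, requires_grad=True)
F = torch.randn(k, S.shape[1], dtype=torch.float64, requires_grad=True)

# Masked mean squared error, L(A, F) = (1/N) * ||R o (S - AF)||^2
loss = torch.sum((R * (S - A @ F)) ** 2) / N
loss.backward()  # fills A.grad and F.grad

# Hand-derived gradients of the same loss, for comparison only
with torch.no_grad():
    E = R * (S - A @ F)              # masked residuals
    grad_A = (-2.0 / N) * (E @ F.T)
    grad_F = (-2.0 / N) * (A.T @ E)

print(torch.allclose(A.grad, grad_A), torch.allclose(F.grad, grad_F))
```

A single gradient descent step would then subtract the learning rate times `A.grad` and `F.grad` from `A` and `F` inside a `torch.no_grad()` block, and reset the gradients before the next iteration.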

### Implementing Gradient Descent with autograd

In order to implement gradient descent to solve the problem you need to define a few things:
* How to initialize $A$ and $F$?
* What to take for the number of features/affinities ($k$)?
* What to use as learning rate?
* When to stop the optimization?

To find the best values, you would want to do some *hyperparameter tuning*, meaning an optimization of the choices made before running the optimization algorithm. Here are some choices that work well, but feel free to experiment and test further:
* Initialize $A$ and $F$ with values from a Gaussian distribution of mean 0 and std 1.  
* Take the number of features $k=2$. 
* Take the learning rate to be $25.0$.
* Stop after $500$ steps.
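
The overall shape of such a training loop can be sketched as follows. This is only an illustration on small random matrices, not the project solution: the data are hypothetical, no validation loss is tracked, and the learning rate of $0.05$ is an assumption chosen for this tiny example, where $N$ is small and the gradients are therefore much larger than on the full dataset.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins for the real matrices (hypothetical data).
n_users, n_asins, k = 6, 8, 2
S = torch.randint(1, 6, (n_users, n_asins)).float()   # ratings in 1..5
R = (torch.rand(n_users, n_asins) < 0.5).float()      # 0/1 mask
R[0, 0] = 1.0                                         # at least one observed entry
N = R.sum()

# Gaussian initialization with mean 0 and std 1.
A = torch.randn(n_users, k, requires_grad=True)
F = torch.randn(k, n_asins, requires_grad=True)

learning_rate = 0.05   # assumption for this toy problem only
n_steps = 500
losses = []

for step in range(n_steps):
    loss = ((R * (S - A @ F)) ** 2).sum() / N
    losses.append(loss.item())
    loss.backward()
    with torch.no_grad():          # update parameters without tracking gradients
        A -= learning_rate * A.grad
        F -= learning_rate * F.grad
    A.grad.zero_()                 # gradients accumulate, so reset them each step
    F.grad.zero_()

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

In the project you would compute a second loss with the validation mask and ratings at every step and store it in its own list, so that both curves can be plotted together.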

### Project Question 7

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Gradient descent implementation.</b></p>
        <p>Implement the gradient descent algorithm and train the recommender system. To monitor learning of your model, plot the MSE error on both the train and validation data as a function of the number of iterations.</p>
        <p>How does the final validation loss compare with the value that you obtained using the baseline model from Question 4?</p>
        <p>Apply the trained model to the test data set, produce a csv with the predictions, download it, and then upload it to the <a href="https://portal.mlu.aws.dev/contests/redirect/14">course's leaderboard</a> to achieve an improved submission with respect to the baseline model.</p>
    <p><b>The csv file for submission needs to contain two columns: the ID of the User-ASIN interaction, found in the first column of the <code>test_features</code> data frame, and a second column <code>Rating</code> containing the predicted rating for each User-ASIN pair.</b></p>
    </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

<div style="align: left; border: 4px solid green; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_question.png" alt="MLU question" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Here's a tip!</b></p>
        <p>Remember that to use <code>torch.autograd</code> you need to transform your data into <code>torch.tensor</code> type, and the parameters you differentiate with respect to (here $A$ and $F$) must be created with <code>requires_grad=True</code>.</p>
    </span>
</div>

##  <a name="8">8. Overfitting</a> 
(<a href="#0">Go to top</a>)

If all goes well over the training period, the training and validation losses both decrease over the entire run and end up lower than the baseline values, so recommendation quality improves over the baseline model. However, on many occasions training reaches a point where further iterations keep decreasing the training loss while the validation loss starts increasing.

This is one of the most pervasive issues in machine learning: __the phenomenon of overfitting__. It refers to a model memorizing random fluctuations rather than true patterns. The model's ability to predict on unseen data becomes *worse* with continued training, even though its ability to explain the training data keeps improving. In the beginning, the model learns real, generalizable patterns. As it keeps training, however, it starts picking up on things that it *thinks* are real patterns but are in fact just random fluctuations. The model is *memorizing* the training data instead of learning to generalize.

Let's examine a little more closely the training and validation losses of the trained model, to check whether they are indeed both decreasing over the entire run.

### Project Question 8

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Overfitting.</b></p>
        <p>Plot the training and validation losses from the previous question, but now only for steps $100$ through $500$. The large gains in the early steps of gradient descent dominate the scale of the full plot and make overfitting hard to see; restricting the range should make it clearer.</p>
        <p>A useful technique to prevent overfitting is called <em>early stopping</em>, where you simply stop training your model at the point at which the validation loss was best. What was the minimum value of the validation loss you obtained? </p>
        <p>You can now run a training loop and stop at the point before the validation loss starts increasing. You can submit that early stopping model to the leaderboard and check whether the performance of the model improves.</p> 
    </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

<div style="align: left; border: 4px solid green; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_question.png" alt="MLU question" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Here's a tip!</b></p>
        <p>To implement early stopping, you can create a variable to store the best (lowest) validation loss that is seen during the training. When a new validation loss happens to be larger than the best, indicating the start of the overfitting, you can decide to stop the training.</p>
        <p>Often, the first sign of no further improvement may not be the best time to stop the training though. The model might be on a plateau or even get slightly worse before getting much better. You can account for this by adding a delay to the early stopping. You can set a "patience" argument and stop the training only when the validation loss fails to improve for a certain number of consecutive epochs given by the "patience".</p>
    </span>
</div>
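
The bookkeeping described in this tip can be sketched in plain Python. The function `early_stopping_step` and the list of validation losses below are hypothetical; in a real training loop you would perform the same checks inside the loop and keep a copy of $A$ and $F$ from the best step.

```python
def early_stopping_step(val_losses, patience):
    """Return (stop_step, best_step) for a sequence of validation losses.

    Training stops at the first step where the validation loss has failed
    to improve on its best value for `patience` consecutive steps.
    """
    best_loss = float("inf")
    best_step = -1
    steps_without_improvement = 0
    for step, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss = val_loss
            best_step = step
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                return step, best_step
    # Patience never ran out: training ran to the last step.
    return len(val_losses) - 1, best_step

# Made-up validation losses with a small bump at step 3.
val_losses = [1.00, 0.80, 0.70, 0.72, 0.69, 0.71, 0.73, 0.75, 0.90]

print(early_stopping_step(val_losses, patience=1))  # (3, 2)
print(early_stopping_step(val_losses, patience=3))  # (7, 4)
```

With a patience of 1, training stops at the bump and misses the better loss at step 4. With a patience of 3, it continues past the bump and finds it.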

##  <a name="9">9. Recommendations and Improvements</a> 
(<a href="#0">Go to top</a>)

By now you have gained expertise in building a baseline model for the recommender system, plus a trained model with gradient descent and optional early stopping. 

Let's take a look at the model that you have built. Your last $P$ matrix from the gradient descent algorithm now contains entries for all pairs of users and ASINs. The entries that were missing from the $S$ matrix are now filled in with predicted ratings, which are in effect recommendations for the users who have not yet rated those ASINs.

In [None]:
###############################################    
# This is just a sample prediction matrix of all ones.
# REPLACE IT WITH YOUR OWN P MATRIX
P = torch.ones((n_users, n_asins))
###############################################

# P contains entries for all users and ASINs
dfP = pd.DataFrame(P.detach().cpu().numpy(), columns=unique_asins)
dfP.index = unique_users

dfP.head()
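
One way to turn such a prediction matrix into concrete recommendations is to hide the ASINs each user has already rated and keep the highest remaining predictions. The sketch below uses made-up predictions, a made-up mask, and placeholder ASIN labels; it is one possible approach rather than part of the project requirements.

```python
import torch

# Hypothetical predictions for 2 users and 4 ASINs.
P = torch.tensor([[4.5, 2.0, 3.8, 4.9],
                  [1.2, 4.1, 3.3, 2.7]])
# 0/1 mask of ratings that already exist (1 = the user rated this ASIN).
R = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0, 0.0]])
asins = ["B001", "B002", "B003", "B004"]   # placeholder labels

# Exclude already-rated ASINs by giving them the lowest possible score.
scores = P.masked_fill(R.bool(), float("-inf"))

# Indices of the 2 highest predicted ratings per user.
top_scores, top_idx = torch.topk(scores, k=2, dim=1)
recommendations = [[asins[j] for j in row] for row in top_idx.tolist()]

print(recommendations)  # [['B003', 'B002'], ['B003', 'B004']]
```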

### Improvement Ideas

There are many possible improvements to this model. Here is a list of ideas to further improve model performance.

You can now try:

- different values of latent features $k$
- different learning rates
- different numbers of gradient descent iterations (epochs)
- different initializations of the matrices $A$ and $F$
- different settings for early stopping when overfitting is identified
- scalings of the dataset, such as standardization or min-max scaling 
- more train data: different train-test ratios
- more train data: upsample the train data, while also balancing the 'ratings' ratio (for instance, upsample more of the less represented rating of 1, 2, and 3)
- add a regularizer term to the loss function
- clamp the predictions to only values between 1 and 5
- force the predictions to only be integer values 1 to 5
- use a third diagonal matrix in the matrix factorization, similar to $\Sigma$ in SVD
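
Two of these ideas, the regularizer and the clamping, can be sketched in a few lines. The function names, the squared-norm (L2) penalty, and the weight `lam` are assumptions chosen for illustration; other regularizers are possible, and suitable values for `lam` would have to be found on the validation set.

```python
import torch

def regularized_loss(R, S, A, F, lam):
    """Masked MSE plus an L2 penalty on the entries of A and F."""
    N = R.sum()
    mse = ((R * (S - A @ F)) ** 2).sum() / N
    penalty = lam * ((A ** 2).sum() + (F ** 2).sum())
    return mse + penalty

def clamped_predictions(A, F, low=1.0, high=5.0):
    """Predictions restricted to the valid rating range."""
    return torch.clamp(A @ F, min=low, max=high)

# Toy values (hypothetical): 2 users, 2 ASINs, k = 2, all ratings observed.
A = torch.tensor([[1.0, 2.0], [0.5, -1.0]])
F = torch.tensor([[3.0, 0.0], [2.0, 1.0]])
S = torch.tensor([[5.0, 2.0], [1.0, 1.0]])
R = torch.ones(2, 2)

loss = regularized_loss(R, S, A, F, lam=0.1)
P = clamped_predictions(A, F)

print(loss.item())  # 2.5625 (MSE) + 2.025 (penalty) = 4.5875
print(P)            # raw predictions [[7, 2], [-0.5, -1]] become [[5, 2], [1, 1]]
```

The penalty discourages large entries in $A$ and $F$, which tends to reduce overfitting. Clamping is applied here only when producing predictions, not inside the training loss.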


### Project Question 9

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Improve your model.</b></p>
        <p>Machine learning problems are often solved by carefully tuning and/or adding in more and more ingredients which slowly improve the model performance little by little. Improve the model performance by implementing some of the ideas above. Always use the validation set to assess whether a particular ingredient has improved your model or not.</p> 
    </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

##  <a name="10">10. Submit Final Model to Leaderboard</a> 
(<a href="#0">Go to top</a>)

Although you can continue iterating on the train and validation datasets, you cannot know how well your recommender system works on the ```test_features``` dataset unless you submit to the [Math Course Leaderboard](https://portal.mlu.aws.dev/contests/redirect/14).


### Project Question 10

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Submit your best model to the Leaderboard.</b></p>
        <p>Make improved submissions to the leaderboard. Use the following code to write the test predictions to a CSV file in the format expected. Download the CSV file locally from the SageMaker instance folder, and upload it to the <a href="https://portal.mlu.aws.dev/contests/redirect/14">leaderboard for this course</a>.</p>
        <p>Good luck and enjoy the challenge!</p> 
    </span>
</div>


In [None]:
test_predictions = []
if P.requires_grad:
    P = P.detach()
for index, row in test_features.iterrows():
    user_idx = unique_users.index(row["User"])
    asin_idx = unique_asins.index(row["ASIN"])
    # .item() converts the 0-dim tensor to a plain Python float for the CSV
    test_predictions.append(P[user_idx, asin_idx].item())

# Get test predictions in the format expected by the Leaderboard
MATH_Final_Project_LB_submission = pd.DataFrame(columns=["ID", "Rating"])
MATH_Final_Project_LB_submission["ID"] = test_features["ID"]
MATH_Final_Project_LB_submission["Rating"] = test_predictions

MATH_Final_Project_LB_submission.to_csv(
    "MATH_Final_Project_LB_submission.csv", index=False
)

print("CSV for Leaderboard submission created.")

## Final Comments

Once you get the results of evaluating your predictions against the actual test labels and see your final MSE metric in the Math Course Leaderboard, you might be disappointed that it is not much better than the first baseline model or the basic gradient descent results. This is how most ML projects work. You will often be able to implement a trivial solution in minutes that does an OK job, work for weeks to improve on it by $10\%$, and then work for years to get another $10\%$.

Something very near this algorithm is what gave Netflix its early lead in the movie recommendation market in the early 2000s. They were only able to decrease the error by another $0.1$ stars after offering a [million dollar prize](https://en.wikipedia.org/wiki/Netflix_Prize) which took $3$ years of joint academic and industry effort to claim!  

<div style="display: flex; align-items: center; justify-content: left; background-color:#330066; width:99%;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_robot.png" alt="MLU robot" width="100" height="100"/>
    <span style="color: white; padding-left: 10px; align: left; margin: 15px;">
        <h3>Congratulations!</h3>
        With the submission of the predictions of your recommender system to the leaderboard, you have successfully completed MLU Mathematical Fundamentals for Machine Learning.
        <br/>
    </span>
</div>