<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/104_viz-mse.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# Coding and Visualizing the Mean Squared Error (MSE)
___
We have seen (with the WDBC data) that a purely heuristic ad-hoc specification of a learning procedure does not always work well. Hence, we need to think about *"first principles"* of a good learning procedure. These imply setting up a learning problem as a minimization problem, where the objective function relates to prediction errors. We also need to make sure that this function has nice properties, i.e., that it is continuous and differentiable.
A very common choice for such an objective function is the mean squared error function (MSE). (Others are the likelihood function, Gini, and cross-entropy. More on them later.)

The aim of this notebook is to give you an intuition for the MSE. What does it depend on, and what does it look like? On paper, when you see it for the first time, it may feel pretty abstract. Hopefully, this notebook helps you understand the MSE better and makes it more colorful (literally).

### 🧑‍💻 <font color=green>**Your Task**</font>

Go through the explanations and code pieces of this notebook and solve the questions outlined below. <font color=red>**Feel free to work in groups!**</font>

___
## Data pre-processing

In [None]:
# Import necessary packages
import numpy as np # Numerical computation package
import pandas as pd # Dataframe package
import matplotlib.pyplot as plt # Plotting package
import matplotlib as mpl # To use colormaps later on
np.random.seed(1) # Set the random seed for reproducibility

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

In [None]:
# Read in the WDBC dataset
wdbc = pd.read_csv(f"{DATA_PATH}/wdbc.csv")
# Keep only necessary columns: the diagnosis, the perimeter, and the severity of concave portions
# of the cell nucleus
wdbc = wdbc[["perimeterM", "concaveM", "diagnosis"]]
# Shuffle the dataset
wdbc = wdbc.sample(frac=1)
wdbc.head(5) # Display the first rows of the dataset

In [None]:
# Define a function for vector standardization
standardize = lambda x: (x - np.mean(x)) / np.std(x)

In [None]:
# Standardize the features
wdbc[["perimeterM", "concaveM"]] = wdbc[["perimeterM", "concaveM"]].apply(standardize)
wdbc.head(5) # Display the first rows with standardized features
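
As a quick sanity check of the standardization rule, here is a minimal sketch on made-up numbers (the array `toy` is purely illustrative and not taken from the WDBC data). After standardizing, the mean should be (numerically) zero and the standard deviation one.

```python
import numpy as np

# Same standardization rule as above: subtract the mean, divide by the standard deviation
standardize = lambda x: (x - np.mean(x)) / np.std(x)

# Made-up toy values (an assumption for illustration only)
toy = np.array([2.0, 4.0, 6.0, 8.0])
toy_std = standardize(toy)

print(toy_std)
# Mean and standard deviation of the standardized values
print(np.mean(toy_std), np.std(toy_std))
```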

In [None]:
# Create the matrix of features
X = np.array(wdbc[["perimeterM", "concaveM"]])
# Create the vector of labels / targets
y = np.where(wdbc["diagnosis"] == "M", 1, -1)

___
## The loss function landscape

___
#### ➡️ ✏️<font color=green>**Question 1**</font>

1. Make a drawing of how you envisage the MSE as a function of two weights $w_1$ and $w_2$ (as *"independent/free variables"*). 
2. Draw a coordinate system with the weight dimensions as "$x$" and "$y$" variables (*"in the plane"*), and with the MSE as the third ("$z$") dimension (the *"spatial dimension"*). To be clear, with "$x$" and "$y$" we do not mean features and targets but generic variables in the mathematical sense (think about high school math, even if this may feel unpleasant ;-).
3. What does a high or low $z$ dimension mean? 
4. If you could dream up a learning machine with really "good engineering", how would you try to construct an *"ideal"* MSE for it? What would be a sensible meaning of "good engineering"?


___
### Visualizing the MSE

Let us now plot the MSE drawing discussed in Question 1. Say that our learning problem only contains two elements, namely the two weights $w_1$ and $w_2$ (there is no bias $b$ or constant $w_0$ in this case; why?). For every $(w_1, w_2)$ point, we can compute the corresponding predictions $\hat{\mathbf{y}}$ for the target. And hence we can compute and plot the MSE $\frac{1}{N}\sum_{i=1}^N \left(\hat{y}^{(i)} - y^{(i)}\right)^2$. This is a function of the weights! (Why?) Hence, we can write $MSE = MSE(w_1, w_2)$ &ndash; make sure this makes intuitive sense to you!
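
To make the idea of the MSE as a function of the weights concrete, here is a minimal sketch. The helper `mse_of_weights` and the toy arrays `X_toy` and `y_toy` are assumptions made up for illustration. The function follows the formula above exactly; the code later in this notebook additionally divides by 2, which only rescales the values.

```python
import numpy as np

def mse_of_weights(w1, w2, X, y):
    """MSE of the linear predictions w1 * x1 + w2 * x2 (no bias term)."""
    pred = w1 * X[:, 0] + w2 * X[:, 1]
    return np.mean((pred - y) ** 2)

# Hypothetical toy data: 3 observations, 2 features, labels in {-1, +1}
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y_toy = np.array([1.0, -1.0, -1.0])

# The data stay fixed; only the weights change between the two calls
print(mse_of_weights(0.0, 0.0, X_toy, y_toy))   # all predictions are 0
print(mse_of_weights(1.0, -1.0, X_toy, y_toy))  # two of three targets hit exactly
```

With the data held fixed, different weight pairs give different MSE values, which is exactly what the surface plotted below shows.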

A good way to represent a three-dimensional relationship is either a [contour plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contour.html) or a [3D surface](https://matplotlib.org/stable/gallery/mplot3d/surface3d.html).

In [None]:
# Define the bounds of the weights grid: each weight goes from -10 to +10
wbounds = [-10, 10]
# Define the number of points we use to plot the weights, i.e., the 'granularity'
npoints = 41
# Create the w1 ("x") coordinates and the w2 ("y") coordinates as linearly spaced points
w1_vec = np.linspace(*wbounds, num=npoints)
w2_vec = np.linspace(*wbounds, num=npoints)

___

Now that we have defined our weight vectors `w1_vec` and `w2_vec`, let's create an `mse` matrix with the MSE value for each combination of `w1` and `w2`.

In [None]:
# Create a npoints x npoints empty matrix
mse = np.empty([npoints, npoints])
# Iterate over every weight combination
for i in range(npoints):
    for j in range(npoints):
        # Obtain the weights for this specific combination
        w1, w2 = w1_vec[i], w2_vec[j]
        
        # The prediction on this point in the grid
        pred = w1 * X[:, 0] + w2 * X[:, 1]

        # Compute the MSE and store it to the corresponding index
        # The MSE is the mean squared error over all the dataset, the division by 2
        # is only a rescaling (see the slides).
        mse[j, i] = np.mean((y - pred) ** 2) / 2
        # 🙀 🤯 Notice that the indexing is [j, i] and not [i, j]. This is confusing.
        # In fact, when we use a 3D plot in Matplotlib, the x-axis values are in the
        # columns and the y-axis values in the rows of our matrix.
        
# Lastly, for 3D plots, we need to create a so-called meshgrid for the X- and Y-axes
x_axis, y_axis = np.meshgrid(w1_vec, w2_vec)
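
The `[j, i]` indexing can be checked against `np.meshgrid` directly. The sketch below uses made-up toy data and two grids of different lengths (all names ending in `_toy` or `_grid` are assumptions for illustration), so that rows and columns cannot be confused. It also shows one possible vectorized alternative to the double loop, based on broadcasting.

```python
import numpy as np

# Hypothetical toy data, only to keep this sketch self-contained
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, 1, -1)

w1_grid = np.linspace(-2, 2, num=5)  # "x" coordinates
w2_grid = np.linspace(-2, 2, num=3)  # "y" coordinates (different length on purpose)

# Loop version with the same convention as above: rows <-> w2, columns <-> w1
mse_loop = np.empty([len(w2_grid), len(w1_grid)])
for i, w1 in enumerate(w1_grid):
    for j, w2 in enumerate(w2_grid):
        pred = w1 * X_toy[:, 0] + w2 * X_toy[:, 1]
        mse_loop[j, i] = np.mean((y_toy - pred) ** 2) / 2

# meshgrid returns arrays of shape (len(w2_grid), len(w1_grid))
W1, W2 = np.meshgrid(w1_grid, w2_grid)
# Broadcasting: (rows, cols, 1) * (N,) -> (rows, cols, N)
preds = W1[..., None] * X_toy[:, 0] + W2[..., None] * X_toy[:, 1]
mse_vec = np.mean((y_toy - preds) ** 2, axis=-1) / 2

print(W1.shape, mse_vec.shape)
print(np.allclose(mse_loop, mse_vec))
```

The shape of `W1` has `w2` along the rows and `w1` along the columns, which is why the loop stores its result at `[j, i]`.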

... and now let's visualize our results!

In [None]:
# Create the canvas. Notice the subplot_kw, this is needed for a 3d plot
# (No need to understand the details of this visualization; get back to it when you need something like this)
fig, ax = plt.subplots(figsize=(12, 8), subplot_kw={"projection": "3d"})
# Draw a surface plot
ax.plot_surface(x_axis, y_axis, mse, cmap=mpl.colormaps["viridis"], alpha=0.9)

# Notice the TeX notation in the axis labels!
ax.set_xlabel("$w_1$")
ax.set_ylabel("$w_2$")
ax.set_zlabel("MSE")
# Change perspective
ax.view_init(azim=-35, elev=25)
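
The contour plot mentioned above is the other common way to show the same relationship. Here is a minimal sketch using `contourf` on made-up toy data (the names `X_toy`, `y_toy`, `w_grid`, and `mse_toy` are assumptions for illustration, not the WDBC results). It shows the MSE "from above", with color encoding the height.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical toy data, only to keep this sketch self-contained
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = np.where(X_toy[:, 0] - 0.5 * X_toy[:, 1] > 0, 1, -1)

# Same kind of weight grid as above
w_grid = np.linspace(-10, 10, num=41)
W1, W2 = np.meshgrid(w_grid, w_grid)

# MSE (divided by 2, as above) for every weight combination via broadcasting
preds = W1[..., None] * X_toy[:, 0] + W2[..., None] * X_toy[:, 1]
mse_toy = np.mean((y_toy - preds) ** 2, axis=-1) / 2

# Filled contour plot: each color band is a range of MSE values
fig_c, ax_c = plt.subplots(figsize=(7, 6))
contours = ax_c.contourf(W1, W2, mse_toy, levels=20, cmap="viridis")
fig_c.colorbar(contours, ax=ax_c, label="MSE")
ax_c.set_xlabel("$w_1$")
ax_c.set_ylabel("$w_2$")
```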

___
#### ➡️ ✏️<font color=green>**Question 2**</font>

How does this MSE plot relate to the construction of a learning machine? Why are we interested in the MSE? What do we want to find? How would you relate it to a learning task for the iris or WDBC data?

___
#### 🙀 🤯 Finding the minimum in a multidimensional array (e.g., matrix) in `numpy`

A plot is nice to understand the general relationship between weights and MSE, but it's difficult to find the optimal weights just by looking at the graph. Let's find the minimum in our MSE matrix numerically.

While finding the minimum value of a matrix (or other multidimensional array) is relatively straightforward, extracting the index of this element is slightly more cryptic.

For instance, with our `mse` matrix, the minimum value can be obtained using `mse.min()`. Easy enough! To find the corresponding index, we can use `mse.argmin()`. However, if you try it out, you will notice that it does not yield a tuple with the index along each dimension. Instead, we get a single number: the index into the *flattened* matrix. The `numpy` syntax to obtain the index along each dimension from the *flattened* index is the following:

```python
multidimensional_index = np.unravel_index(flattened_index, shape_of_the_multidimensional_array)
```
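
As a small illustration, here is a sketch on a made-up 2 x 3 matrix (the matrix `m` is an assumption for illustration only):

```python
import numpy as np

# Made-up 2 x 3 matrix, only for illustration
m = np.array([[5.0, 3.0, 8.0],
              [7.0, 1.0, 4.0]])

flat_index = m.argmin()  # index into the flattened array [5, 3, 8, 7, 1, 4]
row, col = np.unravel_index(flat_index, m.shape)

print(flat_index)   # 4
print(row, col)     # 1 1
print(m[row, col])  # 1.0
```

Note that the first element of the result is the row index and the second is the column index.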

There is no need to understand this in detail. You have seen it now and you know this issue exists... Come back here if you need something similar!
___

In [None]:
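# Obtain the index of the best weights. Since we filled the matrix as mse[j, i],
# the row index (first entry) refers to w2 and the column index (second entry) to w1
w_index = np.array(np.unravel_index(mse.argmin(), mse.shape))
# Map index to corresponding weights and store in a vector
w_best = np.array([w1_vec[w_index[1]], w2_vec[w_index[0]]])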
w_best # Display best weights

In [None]:
# Display the smallest value of the mse
mse.min()

In [None]:
# You can ignore this code, the important thing is the resulting 3D plot!
import plotly.graph_objects as go

# 3D plotting helper
def plot_mse_3D(w1_vec, w2_vec, mse):
    # Obtain the index of the best weights (rows of mse <-> w2, columns <-> w1)
    w_index = np.array(np.unravel_index(mse.argmin(), mse.shape))

    fig = go.Figure(
        data=[
            go.Surface(z=mse, x=w1_vec, y=w2_vec, opacity=0.9),
            go.Scatter3d(x=[w1_vec[w_index[1]]], y=[w2_vec[w_index[0]]],
                         z=[mse.min()])
        ]
    )
    fig.update_layout(title='Mean Squared Error Surface', height=600, width=800, 
                      autosize=False,
                      scene=dict(xaxis_title="w1", yaxis_title="w2", 
                                 zaxis_title="MSE")
    )
    fig.show()
    
# Use the function to plot our MSE landscape
plot_mse_3D(w1_vec, w2_vec, mse)

___
#### ➡️ ✏️<font color=green>**Question 3**</font>
1. Do the lowest MSE value and the corresponding weight vector change if we choose a more granular grid (i.e., if we raise the value of `npoints`)?
2. What is a good grid size? Is there a problem with making the grid as granular as possible (i.e., `npoints` as large as possible)?

___
#### ➡️ ✏️<font color=green>**Question 4**</font>
1. Could we still calculate the MSE the way we did if we had 3 weights $w_1$, $w_2$, and $w_3$ ? What about any positive number of weights $p > 0$?
2. Could you still visualize the MSE with 3 weights?
3. Could you still visualize the MSE with any positive number of weights $p> 0$?

___
#### ➡️ ✏️ <font color=green>**Question 5**</font>
1. Concerning the difficulty you may have encountered in *Question 3*, do you expect this difficulty to become less or more severe if we had more weights? 
2. If this problem becomes really severe, can you think of an alternative way to find the weights that lead to the minimum of the MSE?


___
#### ➡️ ✏️<font color=green>**Question 6**</font>
Suppose we used some other dataset. What kind of changes would you expect in the visualization of the MSE? Do you think the shape of the MSE will change or would you expect it to keep this kind of U-shape?

___
#### ➡️ ✏️ <font color=green>**Question 7**</font>
In light of your answer to *Question 6*, is the MSE a function of the data or the weights?