# 02 - Loss and Cost Functions

Let's start with the usual library imports.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_csv("../data/ciqual_small.csv")

In addition, you can reuse the hypothesis function from Exercise 01:

In [3]:
def h(x, theta0, theta1):
    # TODO: Take h() from Exercise 01
    pass


If you recall the workflow presented in Exercise 01, the first step is parameter initialization. Easy: we'll start with random values for $\theta_0$ and $\theta_1$. To make sure that we all get the same results, here is the random initialization that you should use:

In [5]:
np.random.seed(123)
theta0 = np.random.rand()
theta1 = np.random.rand()
print(theta0, theta1)
# you should get: 0.6964691855978616 0.28613933495037946

0.6964691855978616 0.28613933495037946


We used `np.random.seed()` to be sure that our results are reproducible.
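
As a quick illustration of what reproducible means here, the following sketch resets the seed and draws the two values a second time. The names `first_run` and `second_run` are introduced only for this example.

```python
import numpy as np

# Draw two values after seeding the global NumPy generator.
np.random.seed(123)
first_run = (np.random.rand(), np.random.rand())

# Resetting the same seed restarts the sequence, so the draws repeat.
np.random.seed(123)
second_run = (np.random.rand(), np.random.rand())

print(first_run == second_run)
```

Without the second call to `np.random.seed()`, the next draws would continue the sequence and give different values.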

Step 2 was the hypothesis function that we just coded. So, we can now start step 3: calculate the error of our model using the MSE.

A loss function is a function that we use to compare an estimated value with the ground truth.

<img src="images/errors.png" width="400">

In linear regression, we generally use the *Mean Squared Error* (MSE) as a loss function. This corresponds to the average of the squared differences between the estimated values and the ground truth. For each data sample, you calculate the estimated value, take its difference from the real value, and square this difference. Then, you take the mean of these results.
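
To make these steps concrete, here is a minimal sketch on toy numbers. The arrays `y_true` and `y_pred` are invented for illustration and are not taken from the Ciqual dataset.

```python
import numpy as np

# Toy ground truth and toy estimates (illustrative values only).
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Step 1: difference between estimate and ground truth, squared.
squared_diffs = (y_pred - y_true) ** 2  # [1., 0., 4.]

# Step 2: mean of the squared differences.
mse = squared_diffs.mean()  # (1 + 0 + 4) / 3

print(squared_diffs, mse)
```

Note that squaring makes every term non-negative, so errors in opposite directions cannot cancel each other out.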


<details>
  <summary>Mathematical Notation - Only if you are not allergic </summary>
    
Let $\hat{y}$ be the estimated values and $y$ the true values. The MSE loss function $L$ is defined with:

$$
L_{\theta_0, \theta_1} = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2
$$

with $m$ the number of data samples and $i$ the index of the current data sample.

</details>


For this first part, you will calculate the error for a single sample. This is simpler: you just need to calculate the difference between the estimated value and the ground truth for this sample and square this difference.


<details>
  <summary>Mathematical Notation - Only if you are not allergic </summary>
Here is the equation corresponding to the loss for the single sample $i$:

$$
L_{\theta_0, \theta_1} = (\hat{y}^{(i)} - y^{(i)})^2
$$

</details>


Your exercise is to implement the MSE function and calculate the error for the first 3 samples:

In [6]:
data[:3]

| Unnamed: 0 | alim_ssgrp_nom_eng | alim_nom_eng | Phosphorus (mg/100g) | Protein (g/100g) | Zinc (mg/100g) |
|---|---|---|---|---|---|
| 0 | pasta. rice and grains | Durum wheat pre-cooked. whole grain. cooked. u... | 116.0 | 5.25 | 0.48 |
| 1 | pasta. rice and grains | Asian noodles. plain. cooked. unsalted | 43.0 | 3.5 | 0.19 |
| 2 | pasta. rice and grains | Rice. brown. cooked. unsalted | 120.0 | 3.21 | 0.62 |


Remember that we want to estimate the amount of `phosphorus` from the amount of `zinc`.

<details>
  <summary>hint</summary>
  You need to use your hypothesis function $h$ from above to calculate $\hat{y}$.
</details>


In [7]:
# Your code here
def L(x, y, theta0, theta1):
    # TODO: Implement the MSE loss function. It should take one data sample
    # as input (x and y are floats) and the parameters theta0 and theta1.
    # It should return the MSE for this data sample.
    pass


To check your function, here are a few tests that you can run. They correspond to the first 3 samples of the Ciqual dataset:

In [9]:
L(x=0.48, y=116, theta0=theta0, theta1=theta1)
# should return 13263.249921833765

13263.249921833765

In [10]:
L(x=0.19, y=43, theta0=theta0, theta1=theta1)
# should return 1784.9918874926789

1784.9918874926789

In [11]:
L(x=0.62, y=120, theta0=theta0, theta1=theta1)
# should return 14191.03352093345

14191.03352093345

These values are the error of our model for each sample. Their magnitude depends on the scale of the variables, but the model is clearly not very good. This is expected, since its parameters are just random for now 😀.
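
The dependence on scale can be seen with a small sketch. The prediction `y_pred_mg` below is a made-up value, chosen only to keep the arithmetic simple: the same miss gives a very different squared error once the quantities are expressed in g/100g instead of mg/100g.

```python
# Hypothetical target and prediction for one sample, in mg/100g.
y_true_mg = 116.0
y_pred_mg = 100.0
loss_mg = (y_pred_mg - y_true_mg) ** 2

# The same two quantities expressed in g/100g.
y_true_g = y_true_mg / 1000
y_pred_g = y_pred_mg / 1000
loss_g = (y_pred_g - y_true_g) ** 2

# The squared error shrinks by a factor of about 1000 ** 2.
print(loss_mg, loss_g, loss_mg / loss_g)
```

This is why a raw loss value is hard to judge in isolation: it is mostly useful for comparing models on the same data and the same units.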

Now, you can try to create a function that plots the regression line along with the data, to get a visual idea of the loss. This function will take `zinc` values as `x` and `phosphorus` values as `y`, and the parameters for the regression line (`theta0` and `theta1`).

In [12]:
# Your code here
def plot_reg_line(x, y, theta0, theta1):
    # TODO: Plot y as a function of x (scatter plot) and the regression line defined by
    # the parameters theta0 (intercept) and theta1 (slope).
    pass
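
If you are unsure where to start, here is one possible sketch, not the reference solution. It assumes the line is `theta0 + theta1 * x`; the name `sketch_reg_line` and the toy data are made up for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

def sketch_reg_line(x, y, theta0, theta1):
    """Hypothetical helper: scatter the data and overlay theta0 + theta1 * x."""
    fig, ax = plt.subplots()
    ax.scatter(x, y, alpha=0.6, label="data")
    # Two end points are enough to draw a straight line.
    x_line = np.array([np.min(x), np.max(x)])
    ax.plot(x_line, theta0 + theta1 * x_line, color="red", label="regression line")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    return ax

# Toy data, invented for illustration (not the Ciqual dataset).
x_toy = np.array([0.2, 0.5, 0.6, 1.0])
y_toy = np.array([40.0, 110.0, 120.0, 210.0])
ax = sketch_reg_line(x_toy, y_toy, theta0=0.0, theta1=200.0)
```

The vertical distance between each point and the line is the difference that the loss squares, so a line far from the point cloud means a large loss.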


## Cost Function

When applied to multiple data samples, we usually call the function a *cost function*.

You will adapt the loss function `L()` from the previous part to create the MSE cost function. This function will calculate the total error. Remember that the MSE cost function is the mean of the squared differences between estimated values and true values. However, instead of using the plain mean (dividing the sum by the number of data samples), we generally use half of the mean (dividing the sum by two times the number of data samples). The reason is that it simplifies the derivative, as you will see in the next exercises.
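
The difference between the two conventions can be checked on toy numbers. The arrays below are invented for illustration, and the predictions are given directly instead of being computed with `h()`.

```python
import numpy as np

# Toy ground truth and toy predictions (illustrative values only).
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])
m = len(y_true)

sum_sq = np.sum((y_pred - y_true) ** 2)  # 1 + 0 + 4

mse = sum_sq / m             # plain mean of the squared differences
half_mse = sum_sq / (2 * m)  # the convention used for the cost function

print(mse, half_mse)
```

Since the two values differ only by a constant factor, the parameters that minimize one also minimize the other.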




<details>
  <summary>Mathematical Notation - Again, only if you are not allergic </summary>
    
Let $\hat{y}$ be the predictions and $y$ the ground truth. The MSE cost function $L$ is defined with:

$$
L_{\theta_0, \theta_1} = \frac{1}{2m}\sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2
$$

with $m$ the number of data samples and $i$ the index of the current data sample.

</details>
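
For the curious, here is a short sketch of why the factor $\frac{1}{2}$ is convenient. It assumes that the hypothesis from Exercise 01 is the linear model $\hat{y}^{(i)} = \theta_0 + \theta_1 x^{(i)}$, with $x^{(i)}$ the input of sample $i$ and $m$ the number of data samples. When differentiating with respect to $\theta_1$, the 2 that comes from the square cancels the $\frac{1}{2}$:

$$
\frac{\partial L_{\theta_0, \theta_1}}{\partial \theta_1} = \frac{1}{2m}\sum_{i=1}^{m} 2\,(\hat{y}^{(i)} - y^{(i)})\,x^{(i)} = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})\,x^{(i)}
$$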


In [18]:
# Your code here
def L(x, y, theta0, theta1):
    # TODO: Create the cost function that returns the Mean Squared Error (MSE).
    pass
    

Before going further, you can test your function by running the following cell:

In [20]:
L(x=data['Zinc (mg/100g)'], y=data['Phosphorus (mg/100g)'], theta0=theta0, theta1=theta1)
# Should return 26312.13577921613

26312.13577921613