# Linear Regression

<hr style="clear:both">

This notebook is part of a series of exercises for the CIVIL-226 Introduction to Machine Learning for Engineers course at EPFL, and adapted for the ME-390. Copyright (c) 2021 [VITA](https://www.epfl.ch/labs/vita/) lab at EPFL.  Use of this source code is governed by an MIT-style license that can be found in the LICENSE file or at https://www.opensource.org/licenses/MIT

**Author(s):** Tom Winandy and David Mizrahi

**Adapted by:** Sabri El Amrani
<hr style="clear:both">

In [None]:
# Function to align all tables to the left (useful for later on)

In [None]:
%%html
<style>
table {float:left}
</style>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from typing import Any, Callable

# Helper file with functions for pre-processing and visualization
import helpers

##  0. Intro 

In the first part of the exercise, you're tasked with implementing linear regression with only one variable to predict the CO2 emissions of a car as a function of its weight. This is known as **[simple linear regression](https://en.wikipedia.org/wiki/Simple_linear_regression)**, as opposed to **multiple linear regression** (where multiple variables are taken into account for the prediction). You'll see later on that the code implemented here will work just as well for multiple linear regression.

**Question:** How does a regression problem differ from a classification problem?

*__Background:__ You're a data scientist working for a leading automotive research firm, specializing in emissions analysis. Your team has compiled a dataset that includes measurements of carbon dioxide (CO2) emissions from various car models, alongside their respective weights. Your task is to develop a predictive model that can estimate CO2 emissions based on the weight of a car. This model will assist car manufacturers in designing more fuel-efficient and environmentally friendly vehicles, as well as policymakers in formulating regulations to curb emissions.*

## 1. Data loading & pre-processing

Here, we'll use a dataset containing 36 car models, with their weight (in tonnes), the volume of their engine (in liters) and their CO2 emissions (in kg/km).

Take a look at the file `car_data.csv` ([__Source__](https://www.kaggle.com/datasets/midhundasl/co2-emission-of-cars-dataset)) and see how it's loaded using pandas by running the cell below.
Pandas is a Python library that simplifies data manipulation and analysis, making it easier to work with structured data like tables and time series. If you want to get started with pandas (a very useful tool for any data analyst), check out this [Kaggle tutorial](https://www.kaggle.com/learn/pandas).

In [None]:
car_df = pd.read_csv('data/car_data.csv', usecols=['car', 'model', 'volume', 'weight', 'CO2'])

print(f"There are {car_df.shape[0]} rows and {car_df.shape[1]} columns.")
# Show the first 5 rows of the data
car_df.head(5)

As stated above, we'll start by implementing a simple linear regression. As we only want to predict CO2 emissions as a function of car weight, we can restrict our dataset to these two columns. 

In [None]:
# only subset of columns
car_df = car_df[['weight', 'CO2']]
car_df.head()

Run the cell below to get a plot of the data. 

In [None]:
car_df.plot(kind='scatter', x='weight', y='CO2')

Before training our linear regression model, we should split our dataset into training, validation and test sets. Watch this [short video](https://www.youtube.com/watch?v=NPWlj9G1Si8&t=59s) to understand why this subdivision is needed, and how it is usually performed.

To simplify things this time, we'll omit the validation set. Without a validation set, we won't be able to perform any hyperparameter search (see next lecture for more explanations about this).

Here, the target label is `CO2`, and the (only) feature is the `weight`.
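The internals of `helpers.preprocess_data` are not shown in this notebook, so the following is only a minimal sketch of what a seeded shuffle-and-split could look like. The function name `train_test_split_sketch` and the toy arrays are hypothetical and exist only for illustration.

```python
import numpy as np

def train_test_split_sketch(X, y, train_size=0.8, seed=0):
    """Hypothetical helper: shuffle the rows, then split them into train and test parts."""
    rng = np.random.default_rng(seed)      # seeded generator -> reproducible shuffle
    indices = rng.permutation(len(X))      # random ordering of the row indices
    n_train = int(train_size * len(X))     # number of rows kept for training
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy data (made up for illustration): 10 examples with a single feature
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy[:, 0]

X_tr, y_tr, X_te, y_te = train_test_split_sketch(X_toy, y_toy, train_size=0.8, seed=99)
print(X_tr.shape, X_te.shape)
```

Because the generator is seeded, calling the function twice with the same seed yields the same split, which is the property the comment in the next cell relies on.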

In [None]:
# We'll use 80% of our data as training data and the remaining 20% as test data
# Here, we use a random seed to ensure that the data shuffling and splitting can be reproduced
X_train, y_train, X_test, y_test, feature_names = helpers.preprocess_data(car_df, label="CO2", train_size=0.8, seed=99)

### Adding the intercept

The goal of linear regression is to fit a line of slope $w_1$ and of intercept $b$ such that for any data $x^{i}$, the prediction $\hat{y}^{i}$ is:
$$\hat{y}^{i} = w_{1}x^{i} +  b$$

Note that this can also be written as:
$$\hat{y}^{i} = \begin{bmatrix} b & w_1 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ x^{i} \end{bmatrix}$$

Therefore, in order to take into account the offset term ($b$) directly in our matrix, we add an additional first column to `X` and set it to all ones. Then, we treat the intercept as another feature (`b` will be treated as `w_0`), which will make our matrix computation easier.

__Note__: The same principle applies if the data has multiple features:
$$\hat{y}^{i} = w_{D}x^{i}_{D} + \ ... \ + w_{2}x^{i}_{2} + w_{1}x^{i}_{1} +  b$$
is equivalent to
$$\hat{y}^{i} = \begin{bmatrix} b & w_1 & w_2 & ... & w_D \end{bmatrix} \cdot \begin{bmatrix} 1 \\ x^{i}_{1} \\ x^{i}_{2} \\ ... \\ x^{i}_{D} \end{bmatrix}$$
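As a quick numerical illustration of this equivalence, the sketch below compares the explicit form $w_1 x + b$ with the dot-product form on made-up values. The numbers and variable names are assumptions chosen for the example, not part of the exercise data.

```python
import numpy as np

# Made-up values for illustration
b, w1 = 0.5, 2.0
x = np.array([1.0, 2.0, 3.0])                  # three examples, one feature

y_hat_explicit = w1 * x + b                    # slope * x + intercept

X_aug = np.column_stack([np.ones_like(x), x])  # prepend a column of ones -> shape (3, 2)
w = np.array([b, w1])                          # intercept stored as w_0
y_hat_matrix = X_aug @ w                       # one dot product per row

print(np.allclose(y_hat_explicit, y_hat_matrix))
```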

Let's add a column of ones (known as the offset term / constant term) to the feature matrix `X`.

<div class="alert alert-info">
As a rule, for linear regression, the constant is always included in the feature matrix X, and the intercept / bias term will be part of the weight vector w.

However, this **will not be the case** in future exercises, where the bias term will be separate.
    
</div>

In [None]:
def add_constant(X: np.ndarray) -> np.ndarray:
    """Adds a constant term to the dataset (as the first column)

    Args:
        X (np.ndarray): Dataset of shape (N, D-1)

    Returns: 
        Dataset with offset term added, of shape (N, D)

    """
    X_with_offset = np.insert(X, 0, 1, axis=1)

    return X_with_offset

X_train = add_constant(X_train)
X_test = add_constant(X_test)

**Question:** In simple linear regression, what happens if no intercept is added?

### Data preview

In [None]:
print(f"Features: {feature_names}")

In [None]:
# Visualization of X_train and y_train (separation of the features and the labels)
print('Training set features:')
print(f'X_train: \n {X_train[:10]}')

print('\nTraining set labels:')
print(f'y_train: \n {y_train[:10]}')

In [None]:
# Show shapes
print('Training set shape:')
print(f'X: {X_train.shape}, y: {y_train.shape}')

print('\nTest set shape:')
print(f'X: {X_test.shape}, y: {y_test.shape}')

### Notation

Now that we have added the constant term, here's what our data looks like:

- features: $\boldsymbol{X} \in \mathbb{R}^{N \times (d+1)}$, $\forall \ \boldsymbol{x}^{i} \in \boldsymbol{X}: \boldsymbol{x}^{i} \in \mathbb{R}^{d+1}$.
- labels: $\boldsymbol{y} \in \mathbb{R}^{N}$, $\forall \ y^{i} \in \boldsymbol{y}: y^{i} \in \mathbb{R}$ 
  
 where $N$ is the number of examples in our dataset, and $d$ is the number of features per example. In other words, $d$ is the dimension of the independent variables.  
 

For the weights, we have:
 
 
 - weights: $\mathbf{w} \in \mathbb{R}^{d+1}$, where $w_0$ (or $b$) is known as the intercept.

 **Note:**
 $\boldsymbol{X}$ is sometimes called the design matrix, where $\boldsymbol{X}_{i, :}$ denotes $\boldsymbol{x}^{i}$.  
 Note that a single example $\boldsymbol{x}^{i}$ is a column vector of dimension (shape in python language) $(d+1) \times 1$, while the design matrix $\boldsymbol{X}$ is of dimension (shape) $N \times (d+1)$, where each row represents an example and each column represents a feature. 

## 2. Loss function

One of the first steps when working on a machine learning problem is to pick a loss / cost function. Here, we will use the Mean Squared Error (MSE), defined as:

$$
\begin{align}
J(\mathbf{w}) &= \frac{1}{N} \sum_{i=1}^{N} (\hat{y}^{i} - y^{i})^{2} \\
&= \frac{1}{N} \sum_{i=1}^{N} (\mathbf{w}^T{\boldsymbol{x}}^{i} - y^{i})^{2} \\
&= \frac{1}{N} (\mathbf{X} \mathbf{w}-\mathbf{y})^{T} (\mathbf{X} \mathbf{w}-\mathbf{y})
\end{align}$$

where $N$ is the number of examples, $\hat{y}^{i}$ is the prediction for the $i^{th}$ example, and ${y}^{i}$ is the ground-truth for the $i^{th}$ example.

Implement the function `mse_loss()`

**Note about loss / cost:** The function we want to minimize or maximize is called the cost function, loss function, or error function. In this exercise, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms.

**Hint**: Use the matrix form shown above and make use of NumPy operations.
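One way to sanity-check a vectorized implementation is to compare it against a deliberately naive version that follows the summation form term by term. The sketch below does this on made-up numbers; `mse_loss_loop` is a hypothetical name, and this loop version is not the intended solution of the exercise.

```python
import numpy as np

def mse_loss_loop(X, y, w):
    """Naive reference: average of squared residuals, one example at a time."""
    N = X.shape[0]
    total = 0.0
    for i in range(N):
        y_hat_i = np.dot(w, X[i])          # prediction w^T x^i for example i
        total += (y_hat_i - y[i]) ** 2     # squared residual
    return total / N

# Made-up data: the first column is the constant term
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_toy = np.array([1.0, 2.0, 4.0])
w_toy = np.array([0.0, 1.0])

print(mse_loss_loop(X_toy, y_toy, w_toy))
```

Here the predictions are 1, 2 and 3, so only the last example contributes a residual, and the loss is one third. Your `mse_loss` should return the same value on these inputs.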

In [None]:
def mse_loss(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> float:
    """Computes the Mean Square Error (MSE)
    
    Args:
        X (np.ndarray): Dataset of shape (N, D)
        y (np.ndarray): Labels of shape (N, )
        w (np.ndarray): Weights of shape (D, )

    Returns:
        float: the MSE loss
    """
    ### START CODE HERE ### (≈ 3 lines of code)
    
    loss = ...
    ### END CODE HERE ###
    return loss

Let's initialize the weights to 0 and look at the current loss.

In [None]:
zero_weights = np.zeros(X_train.shape[1])

In [None]:
train_loss = mse_loss(X_train, y_train, zero_weights)
test_loss = mse_loss(X_test, y_test, zero_weights)
print(f"Train loss: {train_loss:.5f}")
print(f"Test loss: {test_loss:.5f}")

**Expected output:** 

|   |                                                  |
|---|--------------------------------------------------|
| **Train loss** | 0.01043 |
| **Test loss** | 0.01059 |

Note that the loss before model training is similar in the training and test sets. This is a good sign: it suggests that the test set is representative of our data.

Let's visualize our regression line before training.

In [None]:
helpers.plot_linear_regression_2d(X=X_train, y=y_train, w=zero_weights, feature_name="Weight", label_name="CO2")

Not great, right? We'll see in the next sections how to fit our model in order to get a much better predictor.

**Question:** Before proceeding, based on the lecture, what are some ways you could think of to determine a better set of weights?

## 3. Gradient Descent

Now we need to define a function to perform gradient descent on the weights $\mathbf{w}$ using the update rules. First, write a function that computes the gradient of the loss function (`mse_gradient()`) and then use it in the `gradient_descent()` function to update the weights at every iteration.

As seen in the previous section, our loss is:
$$
J(\mathbf{w}) = \frac{1}{N} (\mathbf{X} \mathbf{w}-\mathbf{y})^{T} (\mathbf{X} \mathbf{w}-\mathbf{y})
$$

Therefore, the gradient with respect to ${\mathbf{w}}$ is:

$$ \nabla_{\mathbf{w}} J(\mathbf{w}) = \frac{2}{N} \mathbf{X}^{T} (\mathbf{X} \mathbf{w} - \mathbf{y}) 
$$

**Note:** You can use http://www.matrixcalculus.org/ to compute the gradient.


The gradient descent formula is:
$$\mathbf{w} := \mathbf{w} - \alpha \nabla_{\mathbf{w}} J(\mathbf{w})$$

where $\nabla_{\mathbf{w}} J(\mathbf{w})$ is the gradient of the loss function at the current iteration, $\mathbf{w}$ is the weights vector, and $\alpha$ is the learning rate.

**Hint**: Use the matrix form of the gradient and make use of NumPy operations.
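A common way to verify an analytical gradient is to compare it with a finite-difference approximation. The sketch below shows the idea on a simple stand-in function, $f(\mathbf{w}) = \mathbf{w}^T\mathbf{w}$, whose gradient $2\mathbf{w}$ is known. The helper name `numerical_gradient` and the step size `eps` are assumptions made for this example.

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps                                        # perturb one coordinate at a time
        grad[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Stand-in function with a known gradient: f(w) = w . w, so grad f = 2 w
f_toy = lambda w: float(np.dot(w, w))
w_toy = np.array([0.5, -0.3])

numeric = numerical_gradient(f_toy, w_toy)
print(np.allclose(numeric, 2 * w_toy, atol=1e-6))
```

Once `mse_loss` and `mse_gradient` are implemented, the same check could be applied to them, for instance by comparing `numerical_gradient(lambda w: mse_loss(X_train, y_train, w), w)` with `mse_gradient(X_train, y_train, w)`.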

In [None]:
def mse_gradient(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Compute the gradient of the MSE
    
    Args:
        X (np.ndarray): Dataset of shape (N, D)
        y (np.ndarray): Labels of shape (N, )
        w (np.ndarray): Weights of shape (D, )

    Returns:
        Gradient of shape (D, )
    """
    ### START CODE HERE ### (≈ 2 lines of code)
    
    grad = ...
    ### END CODE HERE ###
    return grad

In [None]:
def gradient_descent(X: np.ndarray, y: np.ndarray, w: np.ndarray, alpha: float, max_iters: int) -> (np.ndarray, np.ndarray):
    """Gradient descent for linear regression.
    
    Args:
        X (np.ndarray): Dataset of shape (N, D)
        y (np.ndarray): Labels of shape (N, )
        w (np.ndarray): Weights of shape (D, )
        alpha (float): Learning rate
        max_iters (int): Maximum number of gradient descent iterations

    Returns:
        w (np.ndarray): Optimum weights of shape (D, )
        losses (np.ndarray): Loss at every iteration of gradient descent. Shape is (max_iters, )
    """
    # Define an array to store the evolution of the loss
    losses = np.zeros(max_iters)
    
    for n_iter in range(max_iters):
        ### START CODE HERE ### (≈ 2 lines of code)
        # Update w using the gradient descent formula
        w = ...  
        # Compute the loss with the updated w
        loss = ...
        ### END CODE HERE ###
        
        # Track losses
        losses[n_iter] = loss
        
        # Print loss at some iterations
        if n_iter % max(1, max_iters // 20) == 0:
            if w.shape[0] == 2: 
                print(f"Iteration {n_iter}: loss={loss:.7f}, w0={w[0]:.4f}, w1={w[1]:.4f}")
            else:
                print(f"Iteration {n_iter}: loss={loss:.7f}")

    return w, losses

Let's initialize some additional variables - the learning rate alpha, and the number of iterations to perform.

In [None]:
alpha = 0.04
iters = 5000
w = np.zeros((X_train.shape[1], ))

Now let's run the gradient descent algorithm to fit our parameters `w` to the training set.

In [None]:
w, loss = gradient_descent(X_train, y_train, w, alpha, iters)

Note that `gradient_descent` prints the loss and the values of the weight vector `w`. The reason is that `w` is at the core of our algorithm. Make sure you understand that the whole point of the learning algorithm is to update `w` so that the linear regression model (described by `w`) fits the data as well as possible. As `X` and `y` are fixed, the only parameter that can be changed is `w`. This is why we use the gradient of the loss w.r.t. `w` in gradient descent: provided the learning rate is small enough, it lets us move closer to the best value of `w` at every iteration.

Now, play with the learning rate, `alpha`, and the number of iterations, `iters`, to see how the convergence changes. Document your findings.

**Hint**: 
- Try `alpha = 0.4`. What's happening? Try to guess why.
- Try `alpha = 0.004`. Why is the final loss bigger than when `alpha = 0.04`?
- Try `alpha = 0.004` with `iters = 50000`. Is the problem of the loss solved?
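To build intuition for the role of the learning rate, it can help to look at a one-dimensional toy problem. The sketch below runs gradient descent on $J(w) = w^2$, which is not the MSE of this exercise. The function name `run_gd_1d` and the chosen values of `alpha` are assumptions made for illustration, and the thresholds for this toy loss differ from those of the car dataset.

```python
def run_gd_1d(alpha, n_iters=20, w0=1.0):
    """Gradient descent on the toy loss J(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(n_iters):
        w = w - alpha * 2 * w      # update rule: w := w - alpha * dJ/dw
    return w

# Small, moderate, and too-large learning rates on the toy loss
for alpha in (0.01, 0.4, 1.1):
    print(f"alpha={alpha}: w after 20 iterations = {run_gd_1d(alpha):.4g}")
```

Each update multiplies `w` by `(1 - 2 * alpha)`. A small `alpha` shrinks `w` slowly, a moderate one converges quickly, and a value that makes this factor larger than 1 in magnitude causes the iterates to grow instead of shrink.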

In [None]:
alpha = 0.04
iters = 5000
w = np.zeros((X_train.shape[1], ))
w, loss = gradient_descent(X_train, y_train, w, alpha, iters)

Finally we can compute the loss (error) of the trained model using our fitted parameters.

In [None]:
train_loss = mse_loss(X_train, y_train, w)
test_loss = mse_loss(X_test, y_test, w)
print(f"Train loss: {train_loss:.7f}")
print(f"Test loss: {test_loss:.7f}")

**Expected output:** with `alpha = 0.04` and `iters = 5000`.

|   |                                                  |
|---|--------------------------------------------------|
| **Train loss** | 0.0000248 |
| **Test loss** | 0.0000878 |

Note that the test loss, although somewhat higher than the train loss, remains of the same order of magnitude after training. This is once more a good sign: it suggests that our model generalizes well to unseen data (it doesn't __overfit__ the training data, see next lecture).

Let's check what the regression line looks like.

In [None]:
helpers.plot_linear_regression_2d(X=X_train, y=y_train, w=w, feature_name="Weight", label_name="CO2")

Looks pretty good! The red line is our trained model: it represents the estimated CO2 emissions of a new car for every possible weight. Remember that the model is entirely described by our parameters $\mathbf{w}$ (in this case $\mathbf{w} = [b, w_1]$). If we had chosen a $\mathbf{w}$ that doesn't fit the data well, the red line would lie far from the data points.

Since the gradient descent function also outputs a vector with the loss at each training iteration, we can plot that as well. The goal of gradient descent is to get a model that fits the data well, so we hope that the loss decreases throughout the iterations of gradient descent. Minimizing the MSE in linear regression is a convex optimization problem, so if everything goes well, gradient descent should reach the global minimum.

In [None]:
helpers.plot_loss(loss)

If the plot had shown a non-decreasing function, it would have raised questions about the validity of our implementation of gradient descent. In any case, it's always good practice to plot this graph to see if our algorithm works as expected. 

## 4. Least squares method

It turns out that linear regression with MSE is one of these rare cases where we can compute the optimum of the loss function analytically. Let's see how:


Let's start from the loss function: 
$$
\begin{align}
J(\mathbf{w}) &= \frac{1}{N} \sum_{i=1}^{N} (\hat{y}^{i} -y^{i})^{2} \\
&= \frac{1}{N} \sum_{i=1}^{N} (\mathbf{w}^T \boldsymbol{x}^{i} -y^{i})^{2} \\
&= \frac{1}{N} (\mathbf{X}\mathbf{w}-\mathbf{y})^{T}(\mathbf{X}\mathbf{w}-\mathbf{y})
\end{align}
$$

This function is convex in $\mathbf{w}$, so let's try to find its minimum.

Take the derivative with respect to $\mathbf{w}$: (Use http://www.matrixcalculus.org/ if necessary)
$$
\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}=\frac{2}{N} \mathbf{X}^{\top}(\mathbf{X} \mathbf{w} - \mathbf{y})
$$
Set to 0 and solve:
$$
\begin{align}
\frac{2}{N} \mathbf{X}^{\top}(\mathbf{X} \mathbf{w} - \mathbf{y}) = 0 \\
\Leftrightarrow \mathbf{X}^{T} \mathbf{X} \mathbf{w} = \mathbf{X}^{T} \mathbf{y}
\end{align}
$$


Therefore, the linear regression model has an analytical solution:
$$\hat{\mathbf{w}} = (\mathbf{X}^{T}\mathbf{X})^{-1} \ \mathbf{X}^{T} \ \mathbf{y}$$
This is known as the **least squares** method. The advantage of this method is that you can directly get the optimal weights $\mathbf{w}$ from this short matrix expression.

Please use this solution to complete the function `least_squares` and to obtain the weight parameters $\mathbf{w}$. 

**Note:** Use `np.linalg.solve` to solve a linear matrix equation, as it is more stable and more accurate than computing the inverse. You can find the documentation for this method [here](https://numpy.org/doc/stable/reference/generated/numpy.linalg.solve.html).
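To illustrate how `np.linalg.solve` is called, here is a minimal sketch on a generic 2x2 system $\mathbf{A}\mathbf{x} = \mathbf{b}$ with made-up values. The matrix `A` and vector `b` are assumptions for the example and are unrelated to the car dataset.

```python
import numpy as np

# Made-up, well-conditioned system A x = b
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x_solve = np.linalg.solve(A, b)     # solves A x = b without forming the inverse
x_inv = np.linalg.inv(A) @ b        # explicit inverse, shown only for comparison

print(x_solve)
print(np.allclose(A @ x_solve, b), np.allclose(x_solve, x_inv))
```

On a small, well-conditioned system like this one, both approaches agree. The difference in accuracy tends to show up when the matrix is close to singular.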

In [None]:
def least_squares(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Solves linear regression using least squares

    Args:
        X: Data of shape (N, D)
        y: Labels of shape (N, )

    Returns:
        Weight parameters of shape (D, )
    """

    ### START CODE HERE ### (≈ 1 line of code)
    w = ...
    ### END CODE HERE ###
    return w

In [None]:
ls_w = least_squares(X_train, y_train)
print(ls_w)

In [None]:
train_loss = mse_loss(X_train, y_train, ls_w)
test_loss = mse_loss(X_test, y_test, ls_w)
print(f"Train loss: {train_loss:.7f}")
print(f"Test loss: {test_loss:.7f}")

**Expected output:** 

|   |                                                  |
|---|--------------------------------------------------|
| **Train loss** | 0.0000248 |
| **Test loss** | 0.0000878 |

In [None]:
helpers.plot_linear_regression_2d(X=X_train, y=y_train, w=ls_w, feature_name="weight", label_name="CO2")

**Question:** Compare the loss and plot obtained with least squares to the loss and plot obtained with gradient descent. What can you say about these two methods? Is the end result similar?

## 5. Prediction
Based on the weights ($\mathbf{w}$) we just computed and the linear model, let's define a function `predict`, which we'll use to give a prediction of the expected CO2 emissions of a car ($\hat{y}$) based on its weight ($x$).

In [None]:
def predict(X, w):
    """Predicts value using linear regression weights

        Args:
            X: Dataset (without the offset) of shape (M, D-1)
            w: Weights (with bias term) of shape (D,)

        Returns:
            Predictions of shape (M, )
    """
    ### START CODE HERE (≈ 1 line of code)
    y_hat = ...
    ### END CODE HERE
    return y_hat

In [None]:
# What are the predicted CO2 emissions of a car weighing 2 tonnes?
expected_co2 = predict([[2]], w)
print(f"A new big car model weighing 2 tonnes is expected to emit {expected_co2[0]:.3f} kg of CO2 per km.")

## 6. Multiple Linear Regression


Now, we're tasked with implementing linear regression with multiple features to predict the power generation of a solar power plant. We'll  see that the code implemented in the previous parts works just as well for multiple features.

*__Background:__ As an engineer tasked with optimizing the performance of a solar power plant, you're analyzing data on solar power generation alongside corresponding measurements of irradiation (sunlight intensity) and ambient temperature. Your goal is to predict the expected solar power output based on these environmental factors. This analysis enables you to fine-tune the operation of the plant, ensuring its efficiency and reliability under varying weather conditions.*

### 6.1. Solar Plant Dataset


This dataset represents observations collected from a solar power plant in India spanning a 34-day period. The data includes power generation measurements obtained at the inverter level, where each inverter manages multiple lines of solar panels. Sensor data, on the other hand, is gathered at the plant level, with a single array of sensors strategically positioned across the plant.

<img src="images/solar-power-plant.png" alt="Solar power plant" style="width: 450px"/>

[__Source__](https://www.researchgate.net/publication/367057927/figure/fig3/AS:11431281112734704@1673527500078/Main-components-of-a-solar-power-plant.png)

Take a look at the file `solar_plant_data.csv` ([__Source__](https://www.kaggle.com/datasets/anikannal/solar-power-generation-data)) and see how it's loaded by running the cell below.

In [None]:
solar_df = pd.read_csv('data/solar_plant_data.csv')

print(f"There are {solar_df.shape[0]} rows and {solar_df.shape[1]} columns.")
# Show the first 5 rows of the data
solar_df.head(5)

We want to predict the DC power generation (in MW) of the plant as a function of the ambient temperature (in °C) and irradiation levels (in kW/m²).

In [None]:
# We'll use 80% of our data as training data and the remaining 20% as test data
# Here, we use a random seed to ensure that the data shuffling and splitting can be reproduced
X_train_mult, y_train_mult, X_test_mult, y_test_mult, feature_names_mult = helpers.preprocess_data(solar_df, label="dc_power", train_size=0.8, seed=11)

As in the simple linear regression case, we'll first add a constant term to our training data for the intercept.

In [None]:
X_train_mult = add_constant(X_train_mult)
X_test_mult = add_constant(X_test_mult)

In [None]:
print(f"Features: {feature_names_mult}")

This time, there are several features. To be exact, we have 2 features and we can plot according to one feature at a time, to see how each feature correlates to the target variable `y`.

Run the following cell with `feature_num = 1` and then with `feature_num = 2` (`0` is the constant term).

In [None]:
feature_num = 1
plt.scatter(X_train_mult[:,feature_num], y_train_mult)

plt.ylabel("dc_power")
plt.xlabel(f"{feature_names_mult[feature_num - 1]}")
plt.title(f"dc_power vs {feature_names_mult[feature_num - 1]}")

__Question:__ How well do ambient_temperature and irradiation respectively correlate with dc_power?
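A scatter plot gives a visual impression of correlation; the Pearson correlation coefficient puts a number on it. The sketch below computes it with `np.corrcoef` on synthetic data generated for the example (not the solar plant data). The variable names and the data-generating process are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: feature_a drives the target, feature_b is unrelated to it
feature_a = rng.uniform(0, 1, size=200)
target = 3.0 * feature_a + rng.normal(0, 0.1, size=200)
feature_b = rng.uniform(0, 1, size=200)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is Pearson's r
r_a = np.corrcoef(feature_a, target)[0, 1]
r_b = np.corrcoef(feature_b, target)[0, 1]
print(f"r(feature_a, target) = {r_a:.3f}")
print(f"r(feature_b, target) = {r_b:.3f}")
```

With the arrays of this notebook, the same call would read `np.corrcoef(X_train_mult[:, 1], y_train_mult)[0, 1]` for the first feature and `np.corrcoef(X_train_mult[:, 2], y_train_mult)[0, 1]` for the second.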

It is also possible to plot the target variable according to both features. 

Using `plot_data_3d` from `helpers.py`, we can generate a 3D plot that shows the training and test sets according to both features.
- You can toggle each dataset on or off by clicking on the legend (upper left). 
- You can also interact with the plot, zoom in and out, and see it through different angles. Try to carefully choose the view angle in order to get the equivalent of the 2 plots above (cancel one dimension).

In [None]:
helpers.plot_data_3d(X_train=X_train_mult, y_train=y_train_mult, X_test=X_test_mult, y_test=y_test_mult, feature_names=feature_names_mult, label_name="dc_power")

### 6.2 Training

#### 6.2.1. Gradient Descent

If you have implemented the function `gradient_descent` correctly in section 3, you should be able to use it for multiple features.

Try to call `gradient_descent(X_train_mult, y_train_mult, np.zeros((X_train_mult.shape[1], )), 0.001, 300000)`. 
If it doesn't work, go back to your `gradient_descent` function in section 3, write it in matrix form, rerun the function cell and try to call it again with the above parameters.

In [None]:
w_mult, loss = gradient_descent(X_train_mult, y_train_mult, np.zeros((X_train_mult.shape[1], )), 0.001, 300000)

After making sure that the loss has stopped going down (training has converged - else consider increasing the number of iterations), we can now print the learned parameter values.

In [None]:
w_mult

Then, as usual, we can compute the loss of our newly trained model.

In [None]:
train_loss = mse_loss(X_train_mult, y_train_mult, w_mult)
test_loss = mse_loss(X_test_mult, y_test_mult, w_mult)
print(f"Train loss: {train_loss:.3f}")
print(f"Test loss: {test_loss:.3f}")

**Expected output:** 

|   |                                                  |
|---|--------------------------------------------------|
| **Train loss** | 0.847 |
| **Test loss** | 1.033 |

Reassuringly, the test loss is not much higher than the training loss. Our model therefore seems to generalize well to unseen data.

#### 6.2.2. Least squares

If `least_squares` is implemented correctly, it should also be able to work without any modification for multiple features.

In [None]:
ls_w_mult = least_squares(X_train_mult, y_train_mult)
print(ls_w_mult)

In [None]:
train_loss = mse_loss(X_train_mult, y_train_mult, ls_w_mult)
test_loss = mse_loss(X_test_mult, y_test_mult, ls_w_mult)
print(f"Train loss: {train_loss:.3f}")
print(f"Test loss: {test_loss:.3f}")

**Expected output:** 

|   |                                                  |
|---|--------------------------------------------------|
| **Train loss** | 0.847 |
| **Test loss** | 1.033|

This output is a realistic one. Taking the square root of the test loss we get an approximation of the average difference between our model prediction and the reality (the Root Mean Square Error, or RMSE).

In [None]:
average_difference = np.sqrt(test_loss)
print(f"The average difference between the predicted power generation of the plant and the actual generation (on the test set) is {average_difference:.3f} MW.")

When using MSE, it is good practice to express the resulting loss in tangible terms in order to evaluate whether the model performs well. A large loss doesn't necessarily translate to a poor model. For example, here, our model has a ~33.8% relative error (because the average power generation is about 3 MW). If we had obtained the same loss for a model predicting a variable with an average value of 1 MW, it would have translated to a ~101.6% relative error, which is much worse.
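The arithmetic behind these percentages can be sketched as follows. The snippet uses the rounded figures quoted in the text (a test loss of 1.033 and average outputs of 3 MW and 1 MW), so the last digit may differ slightly from values computed on the actual data.

```python
import numpy as np

test_mse = 1.033                      # expected test loss quoted above, in MW^2
rmse = np.sqrt(test_mse)              # back to the units of the target (MW)

for mean_power in (3.0, 1.0):         # rounded average outputs used in the text, in MW
    relative_error = rmse / mean_power
    print(f"mean = {mean_power} MW -> relative error = {relative_error:.1%}")
```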

Still, a relative error of 33.8% is not particularly good. A next step could therefore be to consider more predictors or to change the model used. See the upcoming lectures for more expressive models than linear regression.

### 6.3. Plotting the regression surface

Now that our model is trained, we can plot the regression surface using `plot_surface_3d` from `helpers`. 

In [None]:
helpers.plot_surface_3d(w=w_mult,
                        X_train=X_train_mult, 
                        y_train=y_train_mult, 
                        X_test=X_test_mult, 
                        y_test=y_test_mult,
                        feature_names=feature_names_mult, 
                        label_name="dc power")

**Question:**
What are your thoughts on this regression fit?

Rotating the plot, you might notice that the ambient temperature seems to impact the power generation of the plant much less than irradiation.

This suggests that using only the irradiation as a feature might be enough. Let's try!

In [None]:
w_1, loss = gradient_descent(X_train_mult[:,[0,2]], y_train_mult, np.zeros((X_train_mult.shape[1]-1, )), 0.0015, 100000)

In [None]:
train_loss = mse_loss(X_train_mult[:,[0,2]], y_train_mult, w_1)
test_loss = mse_loss(X_test_mult[:,[0,2]], y_test_mult, w_1)
print(f"Train loss: {train_loss:.3f}")
print(f"Test loss: {test_loss:.3f}\n")

average_difference = np.sqrt(test_loss)
print(f"The average difference between the predicted power generation and the actual generation of the plant (on the test set) is {average_difference:.3f} MW.")

In [None]:
helpers.plot_linear_regression_2d(X=X_train_mult[:,[0,2]], y=y_train_mult, w=w_1, feature_name="irradiation", label_name="dc power")

**Question:** You can now compare the losses obtained here with the ones from the multiple linear regression. What do you observe?

**Question:** Why is that?

Congratulations on finishing this exercise! In the next exercise, we'll take a look at linear regression for system identification applied to a ground robot. In other words, we will try to identify the equations that best describe the behavior of the robot.