<div style="background-image: linear-gradient(145deg, rgba(35, 47, 62, 1) 0%, rgba(0, 49, 129, 1) 40%, rgba(32, 116, 213, 1) 60%, rgba(244, 110, 197, 1) 85%, rgba(255, 173, 151, 1) 100%); padding: 1rem 2rem; width: 95%"><img style="width: 60%;" src="../../images/MLU_logo.png"></div>

# <a name="0">MLU Mathematical Fundamentals for Machine Learning</a>
# <a name="0">Lecture 2: Advanced linear algebra</a>
## <a name="0">Lab 2.1: Ordinary least squares</a>

 1. <a href="#1">Business Problem: Checkout Amount?</a> 
 2. <a href="#2">Linear models</a> 
 3. <a href="#3">Ordinary least squares</a> 
 4. <a href="#4">Scikit-learn implementation</a> 
 5. <a href="#5">Multivariate linear regression</a> 

Linear models play a crucial role in machine learning due to their simplicity, interpretability, and effectiveness in capturing linear relationships between input features and target variables. 
 - Many real-world problems exhibit non-linear patterns; however, linear models serve as a starting point and provide a **baseline** for understanding and evaluating more complex models. 
 - By providing a clear relationship between input features and output predictions, linear models offer insights into **feature importance** that serve to gain a deeper understanding of the underlying problem. 
 - Furthermore, the concepts underlying linear models form the foundation for many advanced machine learning techniques, most importantly **neural networks**. 

In this lab we will frame linear models in terms of **matrix and vector operations**. Posing a linear model in terms of matrix multiplication is useful because it lets us take advantage of linear algebra's efficient, well-studied computational methods. Representing a linear model as a matrix equation allows us to express complex systems of equations compactly and solve them quickly, even for large datasets. This matrix form also makes it straightforward to perform operations like calculating derivatives for optimization, which is crucial for training models. Ultimately, framing linear models in terms of matrices enhances both the computational efficiency and the conceptual clarity of model building, and makes scaling up to high-dimensional data feasible.

In [None]:
# Upgrade libraries
!pip install -q --upgrade pip
!pip install -q --upgrade scikit-learn

In [None]:
%%capture
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

from IPython.display import Markdown, display

%matplotlib inline

## <a name="1">1. Business Problem: Checkout Amount?</a>
(<a href="#0">Go to top</a>)

A product manager in an Amazon department examines customer purchase data, like: 

* __Visitor Tag__: Customer ID
* __Product Category__: Gardening
* __Spend in Category__: Visitor’s spend in the gardening product category for the 6 months prior to visiting
* __Prime Member__:  Months as a Prime member
* __Checkout Amount__:  Checkout amount within the gardening category at the end of that visit

The manager would like to predict the ```Checkout Amount``` at the end of the customer's visit on the gardening product page, based on the amount the customer has spent in the category over the last 6 months (```Spend in Category```) or the amount of time in the Prime program (```Prime Member```).

This is a __regression__ problem that can be handled with ML techniques such as __Linear Regression__. In this notebook you will apply __Ordinary Least Squares (OLS)__ to find the best linear model. OLS is a very commonly used method for linear regression because it has desirable properties:

 - It provides an unbiased estimator for the coefficients (weights) of the model, provided the noise has zero mean and is uncorrelated with the features
 - Among linear unbiased estimators, it has the lowest variance when the noise is uncorrelated and has constant variance (the Gauss-Markov theorem)
 - It has a closed-form solution, the normal equation, whenever the features are not collinear
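
The unbiasedness property can be checked empirically. The following is a minimal sketch on synthetic data, not on the lab dataset: the "true" weights, the noise level, and the sample sizes are all made-up values chosen for illustration. It refits a least-squares line on many noisy copies of the same inputs and averages the estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" model: intercept 2.0 and slope 0.5 (illustrative values)
true_w = np.array([2.0, 0.5])
n_samples, n_repeats = 50, 2000

# Fixed inputs: a column of ones plus one synthetic feature
x = rng.uniform(0, 10, size=n_samples)
X = np.column_stack([np.ones(n_samples), x])

estimates = np.empty((n_repeats, 2))
for i in range(n_repeats):
    # Draw fresh zero-mean Gaussian noise for the target on every repetition
    y = X @ true_w + rng.normal(0.0, 1.0, size=n_samples)
    # Least-squares fit of [intercept, slope]
    estimates[i] = np.linalg.lstsq(X, y, rcond=None)[0]

mean_estimate = estimates.mean(axis=0)
print("True weights:     ", true_w)
print("Mean of estimates:", mean_estimate)
```

Each individual estimate differs from the true weights, but their average should land very close to them, which is what "unbiased" means here.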
 
#### Data Loading and Data Splitting

First, let's load and inspect the dataset.

In [None]:
# Read csv containing dataset
data = pd.read_csv("../../data/MATH_LAB_2_1_Data.csv")

display(Markdown(f"Shape of the dataset: {data.shape}"))
data.head()

Before we start building a Machine Learning model, it is crucial to split the data into training and testing sets. The model will be trained on the training split. The test split allows us to evaluate how well the model generalizes to unseen data. 

Let's use standard tools in `sklearn` to perform the data split.

In [None]:
# With default settings, train_test_split assigns 25% of the data to the test set
data_train, data_test = train_test_split(data, random_state=222)

display(Markdown(f"Shape of the training split: {data_train.shape}"))
display(Markdown(f"Shape of the test split: {data_test.shape}"))

## <a name="2">2. Linear models</a>
(<a href="#0">Go to top</a>)

Linear regression may be both the simplest and most popular among the standard tools for regression, flowing from a few simple assumptions. We assume that the relationship between the model features $x_1, \dots, x_k$ (```Spend in Category``` and ```Prime Member``` in our case) and the target $y$ (```Checkout Amount```) is linear, i.e. that the target can be expressed as a **weighted sum** of the features plus a constant term. 

Let's simplify the problem by first considering only one independent feature, ```Spend in Category```. The problem is then a univariate linear regression, and the linearity assumption can be written simply as 

$$\text{Checkout Amount} = b + \text{Spend in Category} \cdot w_{\text{Spend in Category}}$$
 
where $w_{\text{Spend in Category}}$ is called the weight of the respective feature, and $b$ is called the bias (or offset, or intercept). The weight determines the influence of the feature on our prediction, while the bias is independent of the features: $b$ is simply the value of the prediction when all features are 0. The bias term improves the expressivity of the model. Its usefulness becomes apparent for quantities whose units are linearly related but not proportional, such as temperatures in Celsius and Fahrenheit. 
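
To make the role of the bias concrete, here is a minimal sketch using the Celsius to Fahrenheit conversion, $F = 32 + 1.8\,C$. The temperature values are arbitrary illustrative inputs. It fits one line with a bias and one line forced through the origin.

```python
import numpy as np

# Arbitrary illustrative temperatures in Celsius
celsius = np.array([-10.0, 0.0, 20.0, 37.0, 100.0])
fahrenheit = 32.0 + 1.8 * celsius  # exact, noise-free conversion

# Fit with a bias: a degree-1 polynomial fit returns [slope, intercept]
slope, intercept = np.polyfit(celsius, fahrenheit, deg=1)

# Fit without a bias: least-squares slope of a line forced through the origin
slope_no_bias = (celsius @ fahrenheit) / (celsius @ celsius)

print(f"With bias:    F = {intercept:.2f} + {slope:.2f} * C")
print(f"Without bias: F = {slope_no_bias:.2f} * C")
```

Without the bias the model must predict 0 °F at 0 °C, so the fitted slope is distorted in an attempt to compensate for the missing offset of 32.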

Linear models are __easily interpretable__, in the sense that one unit increase in the feature, ```Spend in Category``` here, leads to an increase (or decrease) by $w_{\text{Spend in Category}}$ units of the target ```Checkout Amount```.

Also note that this is in fact the equation of a line! For univariate regression, solving the linear regression problem reduces to finding the line that best fits the data. In the more general situation with several regressor features, or multivariate regression, the solution takes the shape of a hyperplane that best fits the multidimensional dataset. 

There is a little more to it though. According to the model equation, the training datapoints should lie very close to the model line. They would lie exactly on the line if the target variable ```Checkout Amount``` was completely free of experimental error. However, because of the errors in the target variable (assumed as well-behaved Gaussian noise), an exact model cannot be determined, but it can be approximated by a fitted line. This fitted line is what we traditionally call a __linear regression model__.
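
The effect of noise on the target can be sketched with synthetic data. Everything in this example is assumed for illustration: the "exact" intercept and slope, the noise level, and the input range are not taken from the lab dataset. The fit uses `np.polyfit` as a generic least-squares line fit.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "exact" model (illustrative values)
b_true, w_true = 5.0, 0.3

x = np.linspace(0, 100, 200)
y_exact = b_true + w_true * x
# Observed target = exact line + zero-mean Gaussian noise
y_noisy = y_exact + rng.normal(0.0, 2.0, size=x.shape)

# A degree-1 polynomial fit returns [slope, intercept]
w_fit, b_fit = np.polyfit(x, y_noisy, deg=1)

print(f"Largest deviation from the exact line: {np.abs(y_noisy - y_exact).max():.2f}")
print(f"Fitted line: b = {b_fit:.2f}, w = {w_fit:.3f}")
```

The observed points no longer lie on the exact line, yet the fitted intercept and slope should come out close to the values used to generate the data.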

Let's employ some linear algebra notation to simplify the linear regression equation. This will also make it easier to extend beyond univariate regression, to multivariate regression. To start, we can collect the bias and the weight in one column vector $\mathbf{w}$,

$$\mathbf{w} = [b, w_{\text{Spend in Category}}]^T.$$

To account for the bias, we prepend a column of ones to the original features matrix, here containing only one ```Spend in Category``` feature $x$, thus creating the extended matrix $\mathbf{X}$, also called the **design matrix**:

$$
\mathbf{X} = \begin{pmatrix}
1 & x_{11} \\
1 & x_{21} \\
\vdots & \vdots \\
1 & x_{n1} \\
\end{pmatrix},
$$

where $n$ is the number of data samples. These are the inputs to our model. If we also denote our target outputs, the true values of the ```Checkout Amount```, as $\mathbf{y}$, we can re-write our linear regression equation as

$$\mathbf{y} = \mathbf{X}\mathbf{w}.$$

Thus, the goal of linear regression is to find the weights $\mathbf{w}$ such that, on average, the predictions made according to our model best fit the true observed data $\mathbf{y}$. In other words, __the $\mathbf{w}$ vector is the linear regression model__. Note that for univariate regression, the $\mathbf{w}$ components $b$ and $w_{\text{Spend in Category}}$ are, respectively, the intercept and the slope of the best fit line. 
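
As a small illustration of the matrix form, the sketch below builds a design matrix for a toy feature column and checks that $\mathbf{X}\mathbf{w}$ reproduces the scalar formula $b + x \cdot w$. The feature values and weights are hypothetical, and `np.column_stack` is just one of several ways to assemble the matrix.

```python
import numpy as np

# Hypothetical toy feature matrix: 4 samples, 1 feature
features = np.array([[10.0], [20.0], [30.0], [40.0]])

# Design matrix: a column of ones followed by the feature column
X = np.column_stack([np.ones(len(features)), features])

# Hypothetical weights as a column vector [b, w_1]^T
w = np.array([[2.0], [0.5]])

# Predictions in matrix form and with the scalar formula b + x * w_1
y_matrix = X @ w
y_scalar = 2.0 + 0.5 * features

print(X)
print(y_matrix.ravel())
```

Both routes give the same predictions. The matrix form keeps working unchanged when more feature columns are added.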

The question is now, how do we find $\mathbf{w}$, given $\mathbf{X}$ and $\mathbf{y}$?

## <a name="3">3. Ordinary least squares</a>
(<a href="#0">Go to top</a>)

Under some mathematical assumptions, and requiring that the sum of squared errors between predictions and targets be minimal, the problem leads to the normal equation:

$$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T \mathbf{y}.$$
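
For reference, a short sketch of where this expression comes from, assuming the loss $L(\mathbf{w})$ (a symbol introduced here for illustration) is the sum of squared errors and that $\mathbf{X}^T \mathbf{X}$ is invertible:

$$
\begin{aligned}
L(\mathbf{w}) &= \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w}), \\
\nabla_{\mathbf{w}} L(\mathbf{w}) &= -2\,\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) = \mathbf{0}, \\
\mathbf{X}^T\mathbf{X}\,\mathbf{w} &= \mathbf{X}^T\mathbf{y}.
\end{aligned}
$$

Setting the gradient to zero gives a linear system in $\mathbf{w}$. Multiplying both sides by $(\mathbf{X}^T \mathbf{X})^{-1}$ recovers the normal equation stated above.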

This closed-form solution of linear regression is known as the **Ordinary Least Squares (OLS) solution**. The matrix $\mathbf{X}^T \mathbf{X}$, known as the *Gram matrix* of $\mathbf{X}$ (it matches the *covariance matrix* of the features, up to a scaling factor, only when the columns of $\mathbf{X}$ are centered), is a $(k+1)\times (k+1)$ square matrix, where $k$ is the number of features in the dataset. As long as no column of $\mathbf{X}$ is a linear combination of the others (no collinearity), $\mathbf{X}^T \mathbf{X}$ is invertible and the normal equation has a unique solution. 
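
The role of collinearity can be sketched numerically. The example below uses synthetic features (not the lab data) and compares a design matrix with two independent features against one whose second feature is an exact multiple of the first. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Synthetic features: x2_collinear is an exact multiple of x1
x1 = rng.uniform(0, 10, size=n)
x2_independent = rng.uniform(0, 10, size=n)
x2_collinear = 2.0 * x1

ones = np.ones(n)
X_ok = np.column_stack([ones, x1, x2_independent])
X_bad = np.column_stack([ones, x1, x2_collinear])

for name, X in [("independent", X_ok), ("collinear", X_bad)]:
    gram = X.T @ X  # (k+1) x (k+1) matrix, here 3 x 3
    # The rank of X^T X equals the rank of X; it is computed on X for numerical robustness
    rank = np.linalg.matrix_rank(X)
    print(f"{name}: X^T X shape={gram.shape}, rank={rank}, cond(X^T X)={np.linalg.cond(gram):.2e}")
```

With collinear features the rank drops below the number of columns and the condition number of $\mathbf{X}^T \mathbf{X}$ blows up, so the inverse in the normal equation is no longer well defined.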

Let's apply the normal equation to build a univariate linear regression model, using for the moment only the first feature `Spend in Category`. 

In [None]:
# Define features and target columns
selected_features = ["Spend in Category"]
target = ["Checkout Amount"]

# Train data as numpy arrays
X_train = data_train[selected_features].values
y_train = data_train[target].values

# Test data as numpy arrays
X_test = data_test[selected_features].values
y_test = data_test[target].values

### Exercise 1

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>It is your turn!</b></p>
        <p><b>Exercise 1. Ordinary least squares.</b></p>
        <p>Implement a function to compute the OLS solution by coding the normal equation. The function will take as inputs a matrix of features for the model and a target vector <b>y</b>. It will compute the design matrix <b>X</b> by prepending a column of 1's to the matrix of features and will return the vector of weights <b>w</b> as output. Apply your OLS function to the <code>X_train</code> and <code>y_train</code> defined above and store the result in a variable named <code>weights</code>.</p>
        <p>Implement another function to predict with your linear model. Name it <code>linear_model</code>. The function should take as inputs the vector of weights <b>w</b> and a matrix of features. The output will be the weighted sum of weights and features (with the column of 1's prepended), i.e. the product of the design matrix and the weights, which gives the predictions of the linear model for the given input dataset.</p>
        </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

<div style="align: left; border: 4px solid lightcoral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_question.png" alt="MLU solution" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Challenge Help</b></p>
        <p>You can use <code>np.append()</code> with <code>axis=1</code> to assemble the design matrix, joining the column vector of 1s to the rest of the features. The shape of the <code>weights</code> vector output by your OLS function needs to be (2, 1).</p>
        <p>If you're stuck, remove the <code>#</code> before the <code>load</code> instruction in the next code cell to display a sample solution.</p>
    </span>
</div>

In [None]:
# %load solutions/lab21_ex1_solutions.txt

### Plotting and evaluating the solution

Once you have computed the `weights` of the model and defined a function `linear_model` to apply it to any input data, we can plot the best fit line on top of the training data.

In [None]:
# Raise errors if variable and function from Exercise 1 are not defined
if "weights" not in dir():
    raise NameError("Please define a `weights` variable containing the solution to the linear regression model.")
if "linear_model" not in dir():
    raise NameError("Please define a `linear_model` function to predict with the linear regression model.")

# Plot data and solution
# Data
plt.scatter(X_train, y_train, label="Training data")
# Model predictions over a line
xfit = np.linspace(X_train.min(), X_train.max(), 10)
yfit = linear_model(weights, xfit[:, np.newaxis])
plt.plot(xfit, yfit, 'orange', label="Best fit")
plt.xlabel("Spend in Category")
plt.ylabel("Checkout Amount")
plt.legend();

#### $R^2$ coefficient
To evaluate the goodness of fit, we can compute the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination). A “good” value for $R^2$ in a linear regression model depends heavily on the context and the field. In fields like physics or engineering, where relationships between variables are often more deterministic, $R^2$ values above 0.9 are often expected. In social sciences, medicine, or economics, where data tends to be more variable and affected by multiple unknown factors, an $R^2$ around 0.5–0.7 can be considered reasonable. In business applications, especially those with inherently high variability (e.g., consumer behavior), an $R^2$ of 0.3–0.5 might still be useful.
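
For intuition, $R^2$ can be computed by hand as one minus the ratio between the residual sum of squares and the total sum of squares around the mean of the target. The sketch below uses small made-up arrays, not the lab data.

```python
import numpy as np

# Made-up targets and predictions for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: error left after applying the model
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean of the target
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(f"R2 computed by hand: {r2_manual:.4f}")
```

This is the same quantity that `r2_score` reports for a single target. A value of 1 means perfect predictions, and a value of 0 means the model does no better than predicting the mean.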

In [None]:
# Apply the model to the train and test datasets
y_hat_train = linear_model(weights, X_train)
y_hat_test = linear_model(weights, X_test)

print(f"Weights:\n{weights}")

In [None]:
print(f"R2 on training data: {r2_score(y_train, y_hat_train):.3f}")
print(f"R2 on test data: {r2_score(y_test, y_hat_test):.3f}")

#### Mean squared error

Next we can compute the [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error) to get an indication of how well the model performs: lower MSE values indicate better model accuracy, as they imply that predictions are closer to the actual data points. 
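
As a quick sketch of what the metric measures, the MSE is the average of the squared differences between targets and predictions. The arrays below are made up for illustration. The square root (RMSE) is included because it is expressed in the same units as the target.

```python
import numpy as np

# Made-up targets and predictions for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Mean of the squared errors
mse_manual = np.mean((y_true - y_pred) ** 2)
# Root mean squared error, in the units of the target
rmse_manual = np.sqrt(mse_manual)

print(f"MSE computed by hand:  {mse_manual:.4f}")
print(f"RMSE computed by hand: {rmse_manual:.4f}")
```

Because the errors are squared, the MSE is in squared units of the target and penalizes large errors more heavily than small ones.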

In [None]:
print(f"MSE on training data: {mean_squared_error(y_train, y_hat_train):.2f}")
print(f"MSE on test data: {mean_squared_error(y_test, y_hat_test):.2f}")

## <a name="4">4. Scikit-learn implementation</a>
(<a href="#0">Go to top</a>)

Ordinary Least Squares (OLS) linear regression is implemented in scikit-learn through the `LinearRegression` class in the `linear_model` module. Rather than inverting $\mathbf{X}^T \mathbf{X}$ explicitly, the sklearn implementation relies on a dedicated least-squares solver, which improves numerical stability and efficiency.
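
To illustrate that different algorithms can reach the same least-squares solution, the sketch below compares two routes on synthetic data: solving the normal equation as a linear system, and calling NumPy's SVD-based `np.linalg.lstsq`. The data and the generating weights are assumptions made for this example, and NumPy stands in here for whatever solver a library uses internally.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data from a hypothetical line with intercept 4.0 and slope 0.8
x = rng.uniform(0, 50, size=(30, 1))
y = 4.0 + 0.8 * x + rng.normal(0.0, 1.0, size=(30, 1))
X = np.column_stack([np.ones(len(x)), x])

# Route 1: normal equation, solved as a linear system instead of forming the inverse
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: SVD-based least-squares solver applied directly to X
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("Normal equation:", w_normal.ravel())
print("lstsq:          ", w_lstsq.ravel())
```

On a well-conditioned problem like this one the two results agree to numerical precision. The difference between the approaches matters mostly when the features are nearly collinear.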

Let's check whether sklearn arrives at the same solution.

In [None]:
# Use sklearn implementation
lr = LinearRegression()
lr.fit(X_train, y_train)

print(f"sklearn LinearRegression model parameters: {lr.intercept_}, {lr.coef_}")
print(f"Number of sklearn LinearRegression model parameters: {len(lr.intercept_) + len(lr.coef_[0])}")
print()
print(f"sklearn LinearRegression training R2: {r2_score(y_train, lr.predict(X_train)):.2f}")
print(f"sklearn LinearRegression test R2: {r2_score(y_test, lr.predict(X_test)):.2f}")
print()
print(f"sklearn LinearRegression training MSE: {mean_squared_error(y_train, lr.predict(X_train)):.2f}")
print(f"sklearn LinearRegression test MSE: {mean_squared_error(y_test, lr.predict(X_test)):.2f}")

## <a name="5">5. Multivariate linear regression</a>
(<a href="#0">Go to top</a>)

The formulation of the linear problem in terms of matrices makes the extension to multiple input features straightforward. Thus we can apply all code above to multivariate linear regression. Given that the input data contains a second feature `Prime Member`, you can investigate whether this extra information yields a better linear regression model than the previous univariate model.

### Exercise 2

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>It is your turn!</b></p>
        <p><b>Exercise 2. Multivariate regression.</b></p>
        <p>Train a linear regression model using multiple features. You should be able to reuse most of the code written above. You will now need to regenerate the matrix of features including the 2 features present in the original data.</p>
        <p>Which of the linear regression models performs better as a predictor on unseen data?</p>
        </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

<div style="align: left; border: 4px solid lightcoral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_question.png" alt="MLU solution" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Challenge Help</b></p>
        <p>If you're stuck, remove the <code>#</code> before the <code>load</code> instruction in the next code cell to display a sample solution.</p>
    </span>
</div>

In [None]:
# %load solutions/lab21_ex2_solutions.txt

<div style="display: flex; align-items: center; justify-content: left; background-color:#330066; width:99%;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_robot.png" alt="MLU robot" width="100" height="100"/>
    <span style="color: white; padding-left: 10px; align: left; margin: 15px;">
        <h3>Congratulations!</h3>
        You have completed Lab 2.1: Ordinary least squares of Lecture 2: Advanced linear algebra of MLU Mathematical Fundamentals for Machine Learning.
        <br/>
    </span>
</div>