[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/ipml/blob/master/tutorial_notebooks/5_SML_for_regression_solutions.ipynb) 

# Supervised Machine Learning for Regression

<span style="color:red; font-weight: bold;"> This notebook includes solutions to all programming tasks.</span>

<hr>
<br>

This notebook revisits the lecture on **Supervised Machine Learning (SML) for Regression**. There, we studied the famous linear regression model in a resale price forecasting context and discussed the difference between explanatory and predictive modeling. Here, we revisit these concepts and exemplify the relevant steps.

Key topics:
- Preliminaries
    - Libraries
    - Resale price forecasting data set
    - Preparing a simplified data set with only a few numerical features
- Linear Regression in various flavors
    - Computing the OLS estimator *by hand*
    - The `statsmodels` library
    - The `sklearn` library

# Preliminaries
Before starting with the main content, we load some standard libraries and our data. For the latter task, we reuse the content from the previous tutorial. Recall that our synthetic resale price modeling data set is readily available in our [GitHub repository](https://github.com/Humboldt-WI/IPML). 

In [33]:
# Load standard libraries
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd
import seaborn as sns

In [34]:
# Load resale price forecasting data set from GitHub    

url = 'https://raw.githubusercontent.com/Humboldt-WI/IPML/main/data/resale_price_dataset.csv'
resale_data = pd.read_csv(url)

# Display a preview of the data
resale_data

Unnamed: 0,Brand,Model,Release year,Screen size (inches),Hard drive size (GB),RAM size (GB),Weight (grams),Retail price,Industry,Contract Lease Duration (months),Actual Lease Duration (months),Battery capacity (%),Observed resale price
0,Crest,Elevation Elite,2016,15,512,16,1150,2699,Automotive and Transportation,60,63,87.90,751
1,Crest,Elevation Elite,2016,15,512,16,1150,2719,Healthcare,6,5,95.59,2599
2,Crest,Elevation Elite,2016,15,512,16,1150,2759,Automotive and Transportation,24,27,95.05,1358
3,Crest,Elevation Elite,2016,15,512,16,1150,2639,Automotive and Transportation,36,31,94.66,1166
4,Crest,Elevation Elite,2016,15,512,16,1150,2659,Agriculture and Farming,12,10,89.12,1915
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,Zephyr,WindRider W2,2017,15,1028,16,958,3989,Event Management,48,48,94.81,1334
4996,Zephyr,WindRider W2,2017,13,1028,8,830,2859,Aerospace and Defense,36,34,85.46,988
4997,Zephyr,WindRider W2,2017,13,1028,8,830,2859,Construction and Engineering,12,10,95.02,1967
4998,Zephyr,WindRider W2,2017,13,1028,8,830,2829,Automotive and Transportation,60,65,92.36,879


The data comprises categorical and numerical features. The former need special treatment before they can be used in regression analysis. In the interest of simplicity, we begin by creating a subset of the data containing only the numerical features.

In [35]:
# Create new data set including only the numerical features
df = resale_data.select_dtypes(include=[np.number])

# Verify the new data includes only numerical features
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Release year                      5000 non-null   int64  
 1   Screen size (inches)              5000 non-null   int64  
 2   Hard drive size (GB)              5000 non-null   int64  
 3   RAM size (GB)                     5000 non-null   int64  
 4   Weight (grams)                    5000 non-null   int64  
 5   Retail price                      5000 non-null   int64  
 6   Contract Lease Duration (months)  5000 non-null   int64  
 7   Actual Lease Duration (months)    5000 non-null   int64  
 8   Battery capacity (%)              5000 non-null   float64
 9   Observed resale price             5000 non-null   int64  
dtypes: float64(1), int64(9)
memory usage: 390.8 KB
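
To make the effect of `select_dtypes` easier to see, here is a minimal sketch on a tiny, made-up data frame. The frame `toy` and its values are assumptions for illustration only and are not part of the resale data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking the mix of column types in the resale data
toy = pd.DataFrame({
    "Brand": ["Crest", "Zephyr", "Crest"],         # categorical (text)
    "RAM size (GB)": [16, 8, 16],                  # integer
    "Battery capacity (%)": [87.9, 85.46, 95.02],  # float
})

# Keep only columns with a numeric dtype (integers and floats)
toy_num = toy.select_dtypes(include=[np.number])

print(list(toy_num.columns))
```

The text column is dropped, while the integer and float columns are retained.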


# Linear Regression
The lecture introduced the famous linear regression model. Recall the linear equation for the multivariate regression model: 

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_mX_m + \epsilon$$

where:  
- $Y$ is the target variable
- $\beta_0$ is the intercept
- $\beta_1, \beta_2, \ldots, \beta_m$ are the coefficients of the $m$ features $X_1, X_2, \ldots, X_m$
- $\epsilon$ is the error term

In this part, we start by computing the ordinary least squares (OLS) estimator using `Numpy`. This is to further advance our programming skills. Later, we introduce the library `statsmodels` to perform regression analysis. 

To remain consistent with our standard notation, we first create separate variables to store the feature matrix $\mathbf{X}$ and the target variable $Y$.

In [36]:
# Variables for regression
y = df['Observed resale price']
X = df.drop('Observed resale price', axis=1)


The lecture introduced the normal equation for linear regression given by:

$$
\mathbf{\hat{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}
$$

where:
- $\mathbf{\hat{\beta}}$ is the vector of estimated coefficients,
- $\mathbf{X}$ is the feature matrix,
- $\mathbf{Y}$ is the target variable vector.
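
As a brief reminder of where this formula comes from, the following sketch minimizes the residual sum of squares. It assumes that $\mathbf{X}$ already contains a column of ones for the intercept and that $\mathbf{X}^T \mathbf{X}$ is invertible; the symbol $\text{RSS}$ is introduced here only for illustration.

$$
\begin{aligned}
\text{RSS}(\mathbf{\beta}) &= (\mathbf{Y} - \mathbf{X}\mathbf{\beta})^T (\mathbf{Y} - \mathbf{X}\mathbf{\beta}) \\
\frac{\partial\, \text{RSS}}{\partial \mathbf{\beta}} &= -2\, \mathbf{X}^T (\mathbf{Y} - \mathbf{X}\mathbf{\beta}) = \mathbf{0} \\
\mathbf{X}^T \mathbf{X}\, \mathbf{\hat{\beta}} &= \mathbf{X}^T \mathbf{Y} \\
\mathbf{\hat{\beta}} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}
\end{aligned}
$$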

We can compute this estimator using `Numpy` functions.

### Exercise 1: Linear regression from scratch
1. Create a function `ols_estimator` that computes the OLS estimator given the feature matrix `X` and the target variable vector `y`.
2. Create a function `ols_predictor` that computes the predicted values given the feature matrix `X` and the estimated coefficients `beta_hat`. 
3. Create a function `r_squared` that computes the coefficient of determination as follows:
$$ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$
with $\bar{y}$ being the mean of the target variable vector `y`, and $\hat{y}_i$ being the predicted value.

4. Add a column of ones to the feature matrix $\mathbf{X}$. This is mathematically equivalent to adding an intercept $\beta_0$ to the regression function. `Numpy` provides a function `ones()` that you can use to create a tensor with all elements being 1. You only need to specify the tensor's shape. Another useful `Numpy` tool is `c_`. It allows you to concatenate two tensors along the second axis, that is, column-wise. For example, if you have two tensors `a` and `b`, you can concatenate them along the second axis by writing `np.c_[a, b]`. 
5. Putting everything together<br>
  a. Using the augmented feature matrix from subtask 4, compute the OLS estimator using the function `ols_estimator` and the target variable vector `y`.<br>
  b. Using the estimated coefficients, compute regression predictions using your function `ols_predictor`. <br>
  c. Finally, compute the $R^2$ of your regression function using your function `r_squared`. <br>

> Hints:
> Use the `Numpy` function `transpose` to transpose a matrix. Alternatively, if variable `X` is a matrix, you can write `X.T` to obtain its transpose.
> Use the `Numpy` function `dot` for matrix multiplication.  
> Use the `Numpy` function `inv` to compute the inverse of a matrix. Note that this function is in the `linalg` module.
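
The behavior of `np.ones` and `np.c_` described in subtask 4 can be sketched on a small example. The matrix `A` below is a hypothetical stand-in for a feature matrix and is not taken from the data set.

```python
import numpy as np

# Hypothetical 3x2 feature matrix used only for illustration
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# One entry of 1 per row of A
ones = np.ones(A.shape[0])

# np.c_ stacks its arguments column-wise, so the ones become the first column
A_aug = np.c_[ones, A]

print(A_aug.shape)
print(A_aug)
```

The augmented matrix has one more column than `A`, and the first column consists of ones.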


In [37]:
# Task 1
def ols_estimator(X, y):
    """
    Compute the OLS estimator given the feature matrix X and the target variable vector y.
    
    Parameters:
    X (numpy.ndarray): Feature matrix
    y (numpy.ndarray): Target variable vector
    
    Returns:
    numpy.ndarray: Estimated coefficients
    """
    tmp = np.linalg.inv(np.dot(X.T, X))  # compute the estimator step by step
    return  np.dot( np.dot(tmp, X.T) , y)
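
Explicitly inverting $\mathbf{X}^T \mathbf{X}$ mirrors the normal equation, but it can be numerically fragile when features are strongly correlated or live on very different scales. A common alternative is `np.linalg.lstsq`, which solves the least squares problem without forming the inverse. The sketch below uses synthetic data with assumed coefficients; the function name `ols_estimator_lstsq` is hypothetical and not part of the exercise.

```python
import numpy as np

def ols_estimator_lstsq(X, y):
    """Hypothetical alternative to ols_estimator that avoids an explicit inverse."""
    # lstsq returns (solution, residuals, rank, singular values); keep the solution
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Synthetic data with known coefficients (assumed values, for illustration only)
rng = np.random.default_rng(0)
X_demo = np.c_[np.ones(200), rng.normal(size=(200, 3))]
beta_true = np.array([1.0, 2.0, -3.0, 0.5])
y_demo = X_demo @ beta_true + rng.normal(scale=0.1, size=200)

# Normal equation with an explicit inverse, as in Task 1
beta_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Least squares solver without an explicit inverse
beta_lstsq = ols_estimator_lstsq(X_demo, y_demo)

print(np.allclose(beta_inv, beta_lstsq))
```

On a well-conditioned problem such as this one, both routes agree up to floating-point precision.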

In [38]:
# Task 2
def ols_predictor(X, beta_hat):
    """
    Compute the OLS prediction given the feature matrix X and the estimated coefficients beta_hat.
    
    Parameters:
    X (numpy.ndarray): Feature matrix
    beta_hat (numpy.ndarray): Estimated coefficients
    
    Returns:
    numpy.ndarray: Predicted values
    """
    return np.dot(X, beta_hat)

In [39]:
# Task 3
def r_squared(y, y_pred):
    """
    Compute the R-squared of the OLS prediction.
    
    Parameters:
    y (numpy.ndarray): True target variable vector
    y_pred (numpy.ndarray): Predicted target variable vector
    
    Returns:
    float: R-squared
    """
    return 1 - np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2)
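
Two boundary cases offer a quick sanity check of the $R^2$ formula: perfect predictions should give 1, and always predicting the mean should give 0. The sketch below repeats the function so that it runs on its own; the vector `y_toy` is a made-up example.

```python
import numpy as np

def r_squared(y, y_pred):
    # Same formula as in Task 3, repeated so the snippet runs on its own
    return 1 - np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2)

y_toy = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical target values

# Perfect predictions leave no residual error
print(r_squared(y_toy, y_toy))  # 1.0

# Always predicting the mean explains none of the variance
y_mean = np.full_like(y_toy, y_toy.mean())
print(r_squared(y_toy, y_mean))  # 0.0
```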

In [40]:
# Task 4: Augment feature matrix with a column of ones
X_aug = np.c_[np.ones(X.shape[0]), X]

In [41]:
# Task 5: Putting everything together
beta_hat = ols_estimator(X_aug, y)
y_pred = ols_predictor(X_aug, beta_hat)
r2 = r_squared(y, y_pred)

# Print the coefficient estimates
print(f'Coefficient estimates: {beta_hat}')

# Print the R-squared
print(f'R-squared: {r2}')



Coefficient estimates: [-1.32195565e+03  6.31705234e-01 -2.55055713e+01 -7.21029320e-01
 -2.16317994e+01 -1.51988790e-01  7.94818235e-01 -9.80390226e+00
 -1.02573751e+01  1.20479194e+01]
R-squared: 0.86835661100624



## The Library Statsmodels
Implementing linear regression from scratch is a useful exercise to develop coding skills, but in practice we would not do so. Instead, we would use a library that provides a wide range of functionality for regression analysis. One such library is `statsmodels`. 

The following demo shows how to use `statsmodels` to perform linear regression. 

In [42]:
import statsmodels.api as sm

# Add a constant term to the feature matrix X
X_const = sm.add_constant(X)

# Fit the linear regression model
model = sm.OLS(y, X_const).fit()

# Print the summary of the regression results
print(model.summary())

                              OLS Regression Results                             
Dep. Variable:     Observed resale price   R-squared:                       0.868
Model:                               OLS   Adj. R-squared:                  0.868
Method:                    Least Squares   F-statistic:                     3657.
Date:                   Tue, 10 Dec 2024   Prob (F-statistic):               0.00
Time:                           17:10:34   Log-Likelihood:                -34403.
No. Observations:                   5000   AIC:                         6.883e+04
Df Residuals:                       4990   BIC:                         6.889e+04
Df Model:                              9                                         
Covariance Type:               nonrobust                                         
                                       coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------

The regression output provides a lot of information. For instance, it includes the estimated coefficients, their standard errors, t-values, p-values, and confidence intervals. Of course, you also find the $R^2$ and adjusted $R^2$ values, as well as the F-statistic and its p-value. Recall that these statistics assess the model as a whole. It is worth examining the output in detail and making sure you understand the meaning of the key statistics. Interpreting the results of a regression analysis is a crucial skill for researchers and data scientists.

### Exercise 2: Coefficient estimates
Given that we implemented the normal equation in exercise 1, we can expect the estimated coefficients to be the same as those coming from `statsmodels`. To check this, compute the difference between the two estimates: the `beta_hat` that you calculated in task 5 of exercise 1 and the estimated coefficients from `statsmodels`. To achieve this, you need to search the library's documentation to find a way to access the estimated coefficients. 

In [43]:
# Estimated coefficients from statsmodels
beta_hat_sm = model.params.values

# Difference between the estimates
beta_hat - beta_hat_sm

array([ 6.42551458e-06, -2.86517121e-09, -7.68451969e-11,  1.51989532e-13,
        2.67590394e-11,  1.05693232e-12, -7.86037901e-14, -1.82609483e-12,
        2.65210076e-12,  1.24344979e-13])
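
The differences are tiny but not exactly zero, which is typical when two numerical routines compute the same quantity. Rather than inspecting the differences by eye, one option is `np.allclose`, which compares arrays up to a tolerance. The sketch below uses made-up vectors `beta_a` and `beta_b` that differ only by small perturbations; they are assumptions for illustration, not the actual estimates.

```python
import numpy as np

# Hypothetical estimates that differ only by floating-point noise
beta_a = np.array([-1321.955650, 0.631705234, -25.5055713])
beta_b = beta_a + np.array([6.4e-06, -2.9e-09, -7.7e-11])

# Element-wise differences are tiny but not exactly zero
print(beta_a - beta_b)

# allclose checks |a - b| <= atol + rtol * |b| for every element
print(np.allclose(beta_a, beta_b))     # True

# An exact comparison fails because of the small perturbations
print(np.array_equal(beta_a, beta_b))  # False
```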

### Exercise 3: Compute forecasts
To be fair, our `statsmodels` demo leaves out one bit of functionality. We obtain the coefficients and the $R^2$ value right from estimating the model but have never computed predictions. Let's finish this demo by studying how `statsmodels` would allow us to do so. Specifically, compute predictions `yhat_sm` for your `statsmodels` regression model. Then (re-)calculate the $R^2$ of the model using your custom function `r_squared`.



In [44]:
# Predictions from statsmodels
yhat_sm = model.predict(X_const)

r2_sm = r_squared(y, yhat_sm)

print(f'R-squared: {r2_sm}')

R-squared: 0.8683566110062401


## The Library `sklearn`
The library `sklearn` is the *go-to* library for machine learning in Python. It provides a wide range of machine learning functionality, including regression. For regression analysis in the explanatory sense, `sklearn` is a poor choice because it offers far less functionality than `statsmodels`; for example, it does not report standard errors or p-values. We still demonstrate its use because the way in which we apply `sklearn` to regression is identical to the way in which we would use it for other, more advanced learning algorithms. Thus, it is useful to familiarize yourself with the key `sklearn` methods `fit()` and `predict()`.

In [45]:
# Load the library
from sklearn.linear_model import LinearRegression

# Create an instance of the linear regression model
model = LinearRegression()

# Fit the model
# No need to add a constant term: the intercept is fitted automatically.
# This is controlled by the constructor argument fit_intercept of LinearRegression (default: True).
model.fit(X, y)

# Compute predictions
y_pred_sklearn = model.predict(X)

# Print the coefficient estimates
print(f'Coefficient estimates: {model.coef_}')

# Print the intercept   
print(f'Intercept: {model.intercept_}')

# Print the R-squared
r2_sklearn = model.score(X, y)  # Compute R-squared. You could also use your custom function r_squared() here.
print(f'R-squared: {r2_sklearn}')


Coefficient estimates: [  0.63170524 -25.50557135  -0.72102932 -21.63179936  -0.15198879
   0.79481824  -9.80390226 -10.25737509  12.04791938]
Intercept: -1321.9556608133712
R-squared: 0.8683566110062401
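
To illustrate the role of `fit_intercept`, the sketch below fits the same model twice on synthetic data: once with the default intercept handling and once with `fit_intercept=False` on a feature matrix that has been augmented with a column of ones. The data and the assumed coefficients are for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with assumed coefficients, for illustration only
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 2))
noise = rng.normal(scale=0.3, size=150)
y_demo = 5.0 + 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + noise

# fit_intercept is set when the estimator is created (True is the default)
with_intercept = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)

# Without an intercept, the constant must be supplied as a column of ones
X_demo_aug = np.c_[np.ones(X_demo.shape[0]), X_demo]
no_intercept = LinearRegression(fit_intercept=False).fit(X_demo_aug, y_demo)

# Both routes yield the same intercept and slopes
print(with_intercept.intercept_, with_intercept.coef_)
print(no_intercept.coef_)  # first entry plays the role of the intercept

# score() reports the same R-squared as r2_score on the predictions
r2_score_method = with_intercept.score(X_demo, y_demo)
r2_metric = r2_score(y_demo, with_intercept.predict(X_demo))
print(np.isclose(r2_score_method, r2_metric))
```

The second variant corresponds to the manual approach from exercise 1, where the column of ones was added by hand.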
