In [1]:
# !pip install nb_black
%load_ext nb_black


# Introduction

The purpose of the following exercises is to gain familiarity with some core concepts underlying linear regression (as well as more sophisticated machine learning algorithms). You'll practice implementing gradient descent, coding models with the sklearn library, engineering features, and regularising. The dataset you'll use is [from Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques) and describes houses that were sold in Ames, Iowa. The modelling objective is to use available features to predict the sale price of a house.

In [None]:
# Packages and functions you may find useful
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import norm

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Load the housing dataset
data = pd.read_csv("housing.csv")

# Explore and preprocess the data

**Exercise:** Spend a little time exploring the dataset. What features do you think could be predictive of sale price? Are there potential issues to flag, e.g. missing data?

**Exercise:** Make a scatterplot of `SalePrice` vs. `LotArea`. Does there appear to be a linear relationship? Is there a transformation we can apply to make the data appear more linear?

**Exercise:** Divide the data into training and test sets. A train-test split of at least 90-10 is recommended. You may want to further divide the data into 80-10-10 training-validation-test sets if you'd like to tune hyperparameters or select the best model in the end.
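
One way to obtain an 80-10-10 split is to call `train_test_split` twice. The sketch below uses a small synthetic DataFrame (the values are made up for illustration and are not the housing data), and the names `demo`, `train`, `val` and `test` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (made-up values, illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "LotArea": rng.uniform(2_000, 20_000, size=1_000),
        "SalePrice": rng.uniform(50_000, 500_000, size=1_000),
    }
)

# First hold out 10% of the rows as the test set
train_val, test = train_test_split(demo, test_size=0.10, random_state=0)

# Then take 1/9 of the remaining 90% as validation, i.e. about 10% of the original data
train, val = train_test_split(train_val, test_size=1 / 9, random_state=0)

print(len(train), len(val), len(test))
```

Fixing `random_state` makes the split reproducible, which helps when comparing models later on.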

# Regression on one variable

For ease of demonstration, we'll start by just using log `LotArea` to predict log `SalePrice`. As you saw above, the dataset offers many more features that could be useful, so a regression in just one variable may not be the best performing. Towards the end of the exercises, you'll have an opportunity to develop your own regression model. 

## Gradient Descent

For a first pass, let's fit a regression model without a bias term, i.e.
$$\log(\text{SalePrice}) = \theta \cdot \log(\text{LotArea})$$

**Exercise:** In the cell below, define the MSE loss function, the gradient of the loss function with respect to the model parameter $\theta$, and a function that performs gradient descent. The type hints and docstrings suggest a way to implement these functions, but feel free to write them your own way. 

In [9]:
import pandas as pd 

def loss_fcn(y: pd.Series, x: pd.Series, theta: float) -> float:
    """MSE for univariate linear regression without bias term.
    Parameters
    ----------
    y: pd.Series
        The ground truth.
    x: pd.Series
        Feature observations.
    theta: float
        The regression parameter.
    Returns
    -------
    float
        The MSE for the regression on the input data.
    """
    pass


def grad_loss_fcn(y: pd.Series, x: pd.Series, theta: float) -> float:
    """Gradient of the MSE loss function with respect to theta."""
    pass


def gradient_descent(
    y: pd.Series,
    x: pd.Series,
    init: float,
    alpha: float,
    steps: int,
) -> list:
    """Gradient descent algorithm for a univariate linear regression without bias.
    Parameters
    ----------
    y: pd.Series
        The ground truth.
    x: pd.Series
        Feature observations.
    init: float
        Initial value of the regression parameter.
    alpha: float
        Learning rate.
    steps: int
        Number of iterations of the algorithm.
    Returns
    -------
    list
        List of updated values for theta from gradient descent.
    """
    pass


**Exercise:** Using the gradient descent function you defined above, find the optimal value for $\theta$. You may have to experiment with the learning rate and number of steps in order to achieve convergence. Try initialising $\theta$ at different values. Plot the loss as a function of $\theta$, and plot the loss at each step of the gradient descent.
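
If you'd like something to check your own implementation against, here is a minimal sketch of one possible approach, run on synthetic data rather than the housing data. The function names `mse_loss`, `mse_grad` and `run_gradient_descent`, the learning rate and the generated data are all assumptions made for illustration. For the no-bias model, the least-squares solution has a closed form, $\sum_i x_i y_i / \sum_i x_i^2$, which gives a target for gradient descent to converge to.

```python
import numpy as np
import pandas as pd


def mse_loss(y: pd.Series, x: pd.Series, theta: float) -> float:
    """Mean squared error of the no-bias model y ~ theta * x."""
    return float(np.mean((y - theta * x) ** 2))


def mse_grad(y: pd.Series, x: pd.Series, theta: float) -> float:
    """Derivative of the MSE with respect to theta."""
    return float(-2.0 * np.mean(x * (y - theta * x)))


def run_gradient_descent(y, x, init, alpha, steps):
    """Return the list of theta values visited, starting from init."""
    thetas = [init]
    for _ in range(steps):
        thetas.append(thetas[-1] - alpha * mse_grad(y, x, thetas[-1]))
    return thetas


# Synthetic data standing in for log(LotArea) and log(SalePrice)
rng = np.random.default_rng(0)
x_demo = pd.Series(rng.uniform(7.0, 11.0, size=200))
y_demo = 1.3 * x_demo + rng.normal(0.0, 0.3, size=200)

thetas = run_gradient_descent(y_demo, x_demo, init=0.0, alpha=0.005, steps=200)
losses = [mse_loss(y_demo, x_demo, t) for t in thetas]

# Closed-form least-squares solution for the no-bias model, used as a check
theta_closed_form = float((x_demo * y_demo).sum() / (x_demo**2).sum())
print(thetas[-1], theta_closed_form)
```

With features of this magnitude, a learning rate much above 0.01 makes the iterates diverge, which is why some experimentation with `alpha` is usually needed.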

**Exercise:** Plot the regression you optimised by gradient descent against the data. Evaluate the performance of your model using a metric of your choice.

## Sklearn implementation

It's great that you can implement linear regression from scratch, but in practice, you'll probably use a library like `sklearn`. The process of fitting and making predictions is essentially the same with `sklearn` no matter what the modelling task is.
1. Instantiate the model class with whatever hyperparameters you choose.
2. Call the fit method on the training data.
3. Call the predict method on the test data.

When in doubt about how to use anything in `sklearn`, refer to the documentation or google your question!
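
The three steps above can be sketched as follows. The data here is synthetic and the variable names are assumptions; the same pattern applies to other `sklearn` estimators.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (illustration only); sklearn expects a 2D feature array
rng = np.random.default_rng(0)
X_demo = rng.uniform(7.0, 11.0, size=(300, 1))
y_demo = 8.0 + 0.4 * X_demo[:, 0] + rng.normal(0.0, 0.2, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

model = LinearRegression()      # 1. instantiate with chosen hyperparameters
model.fit(X_train, y_train)     # 2. fit on the training data
y_pred = model.predict(X_test)  # 3. predict on the test data

print(model.intercept_, model.coef_)
```

Note that a single pandas column needs to be passed as a one-column DataFrame (e.g. `df[["col"]]`) rather than a Series, since `fit` expects two-dimensional features.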

**Exercise:** Use the `LinearRegression` class from `sklearn` to regress log `SalePrice` onto log `LotArea`, and compare the model to the one you developed above. Note that the default in `sklearn` is to fit a bias term (intercept).

**Exercise:** Minimising the MSE loss is equivalent to maximum likelihood estimation when the residuals are assumed to be normally distributed with constant variance. Assess whether this is a reasonable assumption for our data.
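
One possible way to check the residuals numerically is a Q-Q comparison against a normal distribution, alongside the sample skewness and excess kurtosis. The sketch below uses synthetic data whose noise is normal by construction, so it only illustrates the mechanics; the variable names are assumptions.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data with normally distributed noise (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(7.0, 11.0, size=(500, 1))
y_demo = 8.0 + 0.4 * X_demo[:, 0] + rng.normal(0.0, 0.2, size=500)

model = LinearRegression().fit(X_demo, y_demo)
residuals = y_demo - model.predict(X_demo)

# probplot pairs the ordered residuals with theoretical normal quantiles;
# r is the correlation of the resulting Q-Q line (close to 1 suggests normality)
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)

print(f"Q-Q correlation: {r:.3f}")
print(f"skewness: {stats.skew(residuals):.3f}")
print(f"excess kurtosis: {stats.kurtosis(residuals):.3f}")
```

A histogram of the residuals with a fitted normal density overlaid, or a plot of `ordered_resid` against `theoretical_q`, gives a visual version of the same check.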

## Polynomial regression

Let's see if we can get a better fit by introducing higher powers of log `LotArea` as features. The form of the regression will then be
$$\log(\text{SalePrice}) = \theta_0 + \theta_1 \cdot \log(\text{LotArea}) + \dots + \theta_n \cdot \log(\text{LotArea})^n$$
for some $n \geq 1$. With higher order features, there's a risk of overfitting the model to the training data, so we should consider using a regularisation method. 

**Exercise:** Engineer polynomial features in log `LotArea` and fit a linear regression. When working with multiple features, it's good practice to scale them before fitting a model. You may want to use the `Ridge` or `Lasso` class from `sklearn` to regularise the fit and then experiment with the regularisation parameter. Compare the contribution of each feature to the model prediction. Evaluate your model's performance, comparing to the performance of the model you developed above.
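
As a starting point, polynomial feature generation, scaling and a regularised fit can be chained in a single pipeline. This is a sketch on synthetic data; the degree, the `alpha` value and the variable names are arbitrary assumptions, and `PolynomialFeatures` is one of several ways to build the powers.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in for log(LotArea) and log(SalePrice) (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(7.0, 11.0, size=(300, 1))
y_demo = 8.0 + 0.4 * X_demo[:, 0] + rng.normal(0.0, 0.2, size=300)

poly_ridge = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),  # x, x^2, x^3
    MinMaxScaler(),                                    # rescale each power to [0, 1]
    Ridge(alpha=1.0),                                  # L2-regularised linear fit
)
poly_ridge.fit(X_demo, y_demo)

# One coefficient per scaled polynomial feature
coefs = poly_ridge.named_steps["ridge"].coef_
print(coefs)
```

Because the features are scaled to a common range before fitting, the coefficient magnitudes can be compared with each other more meaningfully than on the raw powers.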

# Extension: develop your own regression model

**Exercise:** Now that we've pretty much exhausted the modelling possibilities with just one feature, develop your own regression model with all the features in the dataset at your disposal. Compare the performance of your model to the performance of the models we produced above. Keep in mind the following tips as you train and evaluate your model.
* Consider one-hot encoding categorical features.
* Features on different scales should be transformed so that they're comparable.
* If tuning a hyperparameter like the regularisation parameter, it's best practice to further divide your training set into training and validation sets or to cross-validate.
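
A minimal sketch of two of these tips, one-hot encoding and cross-validated tuning of the regularisation parameter, is given below. The data is synthetic, and the column names `log_lot_area` and `zone` are hypothetical rather than columns of the housing dataset. Feature scaling is left out here to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data with one numeric and one categorical feature (illustration only)
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame(
    {
        "log_lot_area": rng.uniform(7.0, 11.0, size=n),
        "zone": rng.choice(["A", "B", "C"], size=n),  # hypothetical categorical feature
    }
)
zone_effect = demo["zone"].map({"A": 0.0, "B": 0.2, "C": -0.1})
log_price = 8.0 + 0.4 * demo["log_lot_area"] + zone_effect + rng.normal(0.0, 0.2, size=n)

# One-hot encode the categorical column into zone_A, zone_B, zone_C
X_encoded = pd.get_dummies(demo, columns=["zone"], dtype=float)

# 5-fold cross-validated MSE for a few candidate regularisation strengths
cv_mse = {}
for alpha in (0.1, 1.0, 10.0):
    scores = cross_val_score(
        Ridge(alpha=alpha), X_encoded, log_price, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[alpha] = -scores.mean()

print(cv_mse)
```

The value of `alpha` with the lowest cross-validated MSE would then be refit on the full training set and evaluated once on the held-out test set.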