# Regression on the California Housing Dataset

Michael Mommert, Stuttgart University of Applied Sciences, 2024

This notebook introduces regression as a task. We will use the [California Housing dataset](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html), which contains numerical features such as the average number of rooms, the population and median income of each district, as well as its geographical location. We will implement and test several traditional machine learning models to predict the *median house value* per district from these features.

In [None]:
%pip install numpy \
    scipy \
    pandas \
    matplotlib \
    scikit-learn \
    seaborn

In [None]:
import numpy as np
import pandas as pd

## Data Download

The dataset is easily accessible through scikit-learn, which provides functionality to download the dataset and read it in.

In [None]:
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()

## Data Exploration

Before we use the dataset, let's explore it to get to know it better. Let's see what `data` contains (`data` is a dictionary-like object, so we can check its keys to learn more):

In [None]:
data.keys()

Conveniently, `data` contains a full description of the dataset. Let's consult that:

In [None]:
print(data['DESCR'])

This is very interesting. Before we move on, let's convert the data into a more useful form:

First, we store the actual data in an array, which we call `X`. `X` will serve as the input data array with a shape of (20640, 8): 20640 samples with 8 features each. Correspondingly, we create an array `y` that contains the target (the median house value per district).

Second, we turn `X` into a Pandas dataframe for better human readability.

In [None]:
X = data['data']
y = data['target']
df = pd.DataFrame(X, columns=data['feature_names'])

Now, let's have a look at some numbers:

In [None]:
df.describe()

Several columns (for example `AveRooms`, `AveBedrms`, `Population` and `AveOccup`) have maximum values far above their 75th percentile, which points to significant outliers. This is something to keep in mind.

Let's visualize some data. We will plot the distribution of data points as a function of their geographical longitude and latitude. To add more value to the plot, we will color the data points based on their *median house value*:

In [None]:
import matplotlib.pyplot as plt

p = plt.scatter(X[:, 7], X[:, 6], c=y, alpha=0.3, edgecolor='none')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.colorbar(p, label='median house value')

This looks familiar... What do you see?

**Exercise**: Repurpose the map above to show the *population* per district as colored dots. Can you identify and name the most densely populated district?

In [None]:
# use this cell for the exercise

To explore correlations between the individual features and the target, we can use a pairplot:

In [None]:
from seaborn import pairplot

df = pd.DataFrame(X, columns=data['feature_names'])
df = pd.concat([df, pd.DataFrame(y, columns=data['target_names'])], axis=1)

pairplot(df)

There seem to be correlations between some of the features and the target quantity (*Median House Value*). Let's quantify these correlations using the **Pearson Correlation Coefficient**:

In [None]:
from scipy.stats import pearsonr

pearsonr_results = []
for i in range(X.shape[1]):
    pearsonr_results.append(pearsonr(X[:, i], y))

pearsonr_df = pd.DataFrame({
    'feature': data['feature_names'],
    'r': [r.statistic for r in pearsonr_results],
    'pvalue': [r.pvalue for r in pearsonr_results]})
pearsonr_df

The [Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) provides a means to estimate the linear correlation between two variables. (The following introduction is provided by the [SciPy docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html).)

    The Pearson correlation coefficient measures the linear relationship between two datasets. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

    This function also performs a test of the null hypothesis that the distributions underlying the samples are uncorrelated and normally distributed. The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets.
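
To make the definition above more tangible, here is a minimal sketch that computes the coefficient directly from its formula, $r = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}$. The helper `pearson_r` and the synthetic arrays are assumptions made for illustration only; they are not part of the housing data. The second example shows that a strong but non-linear (here: symmetric quadratic) dependence can still yield a coefficient close to zero.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient of two 1-D arrays (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()  # deviations from the mean
    db = b - b.mean()
    # covariance term divided by the product of the spreads
    return float(np.sum(da * db) / np.sqrt(np.sum(da**2) * np.sum(db**2)))

# synthetic data, not the housing data
rng = np.random.default_rng(0)
x_demo = np.linspace(-1.0, 1.0, 201)
y_linear = 2.0 * x_demo + rng.normal(scale=0.5, size=x_demo.size)  # linear + noise
y_square = x_demo**2                                               # non-linear

r_linear = pearson_r(x_demo, y_linear)
r_square = pearson_r(x_demo, y_square)
print('linear: {:.3f}, quadratic: {:.3f}'.format(r_linear, r_square))
```

A low coefficient therefore does not rule out a relationship; it only indicates that the relationship, if any, is not well described by a straight line.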

What does this analysis tell us? A few variables (like the median income, the house age, the average number of rooms and the geographic latitude) seem to be correlated with the median house value. These might be good candidates for use in linear models that predict the median house value.


**Exercise**: Plot the *median house value* as a function of the *median income* per district. What do you observe? Is the correlation linear?

In [None]:
# use this cell for the exercise

## Data Preparation

Before we start the modeling process, we have to prepare the dataset. This includes **splitting** the dataset into *train*, *validation* and *test* splits and subsequently **scaling** (or normalizing) each split.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_remain, y_train, y_remain = train_test_split(X, y, shuffle=True, train_size=0.6)
X_val, X_test, y_val, y_test = train_test_split(X_remain, y_remain, shuffle=True, train_size=0.5)
X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape
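
The two calls to `train_test_split` first set aside 60% of the samples for training and then divide the remaining 40% in half, which gives roughly a 60/20/20 split. The following sketch reproduces the same idea with a plain index permutation; the array names and the dataset size of 1000 are hypothetical and chosen only to keep the numbers easy to check.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1000                      # hypothetical dataset size
X_demo = rng.normal(size=(n_samples, 3))
y_demo = rng.normal(size=n_samples)

# shuffle the sample indices once, then cut the shuffled list into three parts
idx = rng.permutation(n_samples)
n_train = round(0.6 * n_samples)
n_val = round(0.2 * n_samples)

train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

X_train_demo, y_train_demo = X_demo[train_idx], y_demo[train_idx]
X_val_demo, y_val_demo = X_demo[val_idx], y_demo[val_idx]
X_test_demo, y_test_demo = X_demo[test_idx], y_demo[test_idx]
print(X_train_demo.shape, X_val_demo.shape, X_test_demo.shape)
```

Because the shuffling is random, the splits differ from run to run; `train_test_split` accepts a `random_state` argument if reproducible splits are needed.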

In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

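# fit the scaler on the training split only ...
X_train = scaler.fit_transform(X_train)
# ... and apply the same (training) statistics to the other splits
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)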
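
Scaling statistics are best estimated on the training split only and then reused for the other splits, so that all splits share the same feature scale and no information from the validation or test data leaks into the preprocessing. With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range. The sketch below spells this out with NumPy on synthetic, skewed data; the array names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic, skewed values standing in for a single feature
train_demo = rng.lognormal(mean=0.0, sigma=1.0, size=(600, 1))
val_demo = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 1))

# "fit": the statistics come from the training split only
median = np.median(train_demo, axis=0)
q25, q75 = np.percentile(train_demo, [25, 75], axis=0)
iqr = q75 - q25

# "transform": the same statistics are applied to every split
train_scaled = (train_demo - median) / iqr
val_scaled = (val_demo - median) / iqr

print(np.median(train_scaled), np.median(val_scaled))
```

The median of the scaled training data is zero by construction, whereas the median of the scaled validation data is only close to zero; that small difference is expected and harmless.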

Now that the dataset is prepared, we can start the modeling using different regression approaches.

## Linear Regression

We will start with a simple linear regression model. Based on our earlier findings, we will only use the input variables median income (index 0), the house age (1), the average number of rooms (2) and the geographic latitude (6).

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(X_train[:, [0, 1, 2, 6]], y_train)

We evaluate the model on all three data splits:

In [None]:
# train split
pred = model.predict(X_train[:, [0, 1, 2, 6]])
print('train', np.sqrt(mean_squared_error(y_train, pred)))

# validation split
pred = model.predict(X_val[:, [0, 1, 2, 6]])
print('val', np.sqrt(mean_squared_error(y_val, pred)))

# test split
pred = model.predict(X_test[:, [0, 1, 2, 6]])
print('test', np.sqrt(mean_squared_error(y_test, pred)))

The RMSE is very similar across all three splits, so there is no sign of overfitting in this case.
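
The metric used here is the root mean squared error (RMSE), obtained as the square root of `mean_squared_error`. It has the same unit as the target, which the dataset description gives in hundreds of thousands of dollars, so an RMSE of 0.5 would correspond to a typical error of about $50,000. As a small sketch, the same quantity can be written as a helper function; the name `rmse` and the toy numbers are assumptions for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D sequences (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# toy example: three perfect predictions and one that is off by 2
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.0, 2.0, 3.0, 6.0]
print(rmse(y_true_demo, y_pred_demo))  # → 1.0
```

Because the errors are squared before averaging, a single large error (here 2) dominates the result, which is worth remembering for a dataset with pronounced outliers.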

By studying the trained model's coefficients, we can get a feeling for which of the input variables are most important:

In [None]:
list(zip(['MedInc', 'HouseAge', 'AveRooms', 'Latitude'], model.coef_))

**Exercise**: Implement a linear model that uses all available input variables. How do the model's coefficients change?

In [None]:
# use this cell for the exercise

## LASSO

[LASSO](https://en.wikipedia.org/wiki/Lasso_(statistics)) (least absolute shrinkage and selection operator) consists of a linear model in combination with L1-regularization. The degree of the regularization is controlled via the $\alpha$ parameter. Strong regularization leads to the elimination of some of the input variables by setting their weight coefficients to zero. Therefore, LASSO can be used as a method to identify the most important input variables. Let's apply LASSO to our full input dataset:

In [None]:
from sklearn.linear_model import Lasso

for alpha in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]:
    model = Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    
    pred_train = model.predict(X_train)
    rmse_train = np.sqrt(mean_squared_error(y_train, pred_train))
    pred_val = model.predict(X_val)
    rmse_val = np.sqrt(mean_squared_error(y_val, pred_val))
    print('alpha: {:.1e}, rmse_train: {:.3f}, rmse_val: {:.3f}, \ncoeffs: {:s}'.format(alpha, rmse_train, rmse_val, str(model.coef_)))

Two things are noteworthy here:

* the higher alpha, the smaller the gap between the training and validation RMSE: this indicates that higher values of alpha lead to stronger regularization
* the higher alpha, the larger the number of model weight coefficients that are set to zero: regularization leads to the elimination of input variables
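
The variable-elimination effect is easiest to see on data where the answer is known in advance. The sketch below uses synthetic data (not the housing data) in which only the first two of six features influence the target, and counts the non-zero coefficients for a few values of `alpha`. scikit-learn's `Lasso` minimizes $\frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha \lVert w \rVert_1$; all variable names and the chosen values of `alpha` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_features = 200, 6
X_demo = rng.normal(size=(n_samples, n_features))
# only features 0 and 1 carry signal; the other four are pure noise
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
          + rng.normal(scale=0.1, size=n_samples))

n_nonzero = {}
for alpha in [1e-3, 1e-1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
    # count the coefficients that survive the L1 penalty
    n_nonzero[alpha] = int(np.count_nonzero(lasso.coef_))
    print('alpha: {:.0e}, non-zero: {:d}, coeffs: {}'.format(
        alpha, n_nonzero[alpha], np.round(lasso.coef_, 2)))
```

Note that the surviving coefficients are also shrunk towards zero as `alpha` grows, so a strongly regularized model trades some accuracy for sparsity.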

**Exercise**: Compare the list of remaining input variables for the highest value of alpha with the list of correlation coefficients that we derived above. Are the remaining variables those with the highest correlation coefficients?

In [None]:
# use this cell for the exercise

## $k$-Nearest Neighbor Regression

We will now use a $k$-nearest neighbor model for our regression task. Since $k$ is the method's hyperparameter, we will train the model for different values of $k$ and evaluate on the training and validation datasets to identify overfitting:

In [None]:
from sklearn.neighbors import KNeighborsRegressor

for n in [1, 5, 10, 20, 50, 100, 1000]:
    model = KNeighborsRegressor(n_neighbors=n)
    model.fit(X_train, y_train)

    pred = model.predict(X_train)
    rmse_train = np.sqrt(mean_squared_error(y_train, pred))
        
    pred = model.predict(X_val)
    rmse_val = np.sqrt(mean_squared_error(y_val, pred))

    print('n: {:d}, rmse_train: {:.3f}, rmse_val: {:.3f}'.format(n, rmse_train, rmse_val))

There are a few observations here:

* for $k=1$, the RMSE on the training data is zero by construction: the nearest neighbor of each training sample is the sample itself, so the prediction reproduces its target exactly; for larger $k$, the training RMSE grows because the prediction averages over more neighbors
* the RMSE on the validation dataset starts high for small values of $k$ and then decreases before it increases again; this is a result of overfitting for small $k$ and underfitting for large $k$
* the discrepancy between the training RMSE and the validation RMSE is a result of overfitting; this discrepancy is highest for small $k$, a clear indicator of overfitting
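
To illustrate why the training error behaves this way, here is a minimal NumPy sketch of nearest-neighbor regression with uniform weights and Euclidean distance, which corresponds to the default settings of `KNeighborsRegressor`. The helper `knn_predict` and the synthetic sine data are assumptions for illustration and not part of the notebook's pipeline.

```python
import numpy as np

def knn_predict(X_fit, y_fit, X_query, k):
    """Average the targets of the k nearest fitted samples (illustrative helper)."""
    # pairwise Euclidean distances, shape (n_query, n_fit)
    dists = np.linalg.norm(X_query[:, None, :] - X_fit[None, :, :], axis=2)
    # indices of the k closest fitted samples for every query sample
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_fit[nearest].mean(axis=1)

# synthetic 1-D data: a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)

for k in [1, 10, 200]:
    pred_demo = knn_predict(X_demo, y_demo, X_demo, k)
    rmse_demo = np.sqrt(np.mean((y_demo - pred_demo) ** 2))
    print('k: {:d}, rmse_train: {:.3f}'.format(k, rmse_demo))
```

For `k=1` every training sample is its own nearest neighbor, so the training RMSE vanishes; for `k` equal to the number of samples, every prediction collapses to the mean of all targets, which is the extreme case of underfitting.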

**Exercise**: Which value of $k$ would you adopt and why? Retrain the model for this value of $k$ and evaluate the trained model on the test dataset.

In [None]:
# use this cell for the exercise

## Random Forest Regression

Finally, we will train a random forest model for regression. Random forests are rather powerful and have a number of hyperparameters. For now, we will only consider the number of estimators:

In [None]:
from sklearn.ensemble import RandomForestRegressor

for n in [1, 50, 100]:
    model = RandomForestRegressor(n_estimators=n)
    model.fit(X_train, y_train)

    pred_train = model.predict(X_train)
    rmse_train = np.sqrt(mean_squared_error(y_train, pred_train))
    
    pred_val = model.predict(X_val)
    rmse_val = np.sqrt(mean_squared_error(y_val, pred_val))
    print('n: {:d}, rmse_train: {:.3f}, rmse_val: {:.3f}'.format(n, rmse_train, rmse_val))

The training RMSE is far lower than the validation RMSE: the evaluation results are strongly affected by overfitting.

**Exercise**: Explore ways to regularize this model based on the [Scikit-Learn Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). Evaluate your model on the test dataset.

In [None]:
# use this cell for the exercise

One big advantage of random forests (or decision trees in general) is their ability to produce *feature importances*, indicating how useful or important each input feature is for the model:

In [None]:
list(zip(data['feature_names'], model.feature_importances_))
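
The values in `feature_importances_` are impurity-based and normalized so that they sum to one. As a sanity check of how to read them, the following sketch fits a forest to synthetic data in which the target depends on the first feature only; the data, the variable names and the hyperparameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
# the target depends on feature 0 only; features 1-3 are irrelevant
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

importances = forest.feature_importances_
print(np.round(importances, 3), importances.sum())
```

Almost all of the importance should end up on feature 0. On real data such as the housing dataset, correlated features can share importance between them, so the numbers are better read as a ranking than as exact contributions.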

**Exercise**: Compare these feature importances with the results from our Pearson correlation analysis and LASSO.

In [None]:
# use this cell for the exercise