# Linear Regression - Interpretation of Coefficients

This notebook aims to provide some clarification and more insights into interpreting Linear Regression models and the effect of data normalization on this task. We talked about the main take-away messages in the lecture, but they arguably become much clearer and more convincing with a hands-on example.

Let's get started...


## Setting up the Notebook

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

## Loading the Data

To make the use case interesting, I aligned it with the Kaggle InClass competition for predicting the resale prices of HDB flats. I used the original data but made a series of changes to keep it simple for the context of this notebook:

* Removed a whole bunch of attributes/features from the dataset

* Considered only 1 month of transactions

* Considered only transactions with `flat_type` being "x rooms" (e.g., I excluded "executive"); this allows us to generate a new column `num_rooms`, which is now numerical

* Converted `storey_range` to `floor` by converting, e.g., "07 TO 09" to 8, making it a numerical feature as well

* Converted the area from square meters to square feet simply to increase the magnitude of the values
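The conversions listed above could look roughly like the following sketch. The raw column names (`flat_type`, `storey_range`, `floor_area_sqm`), the exact string format "3 ROOM", and the helper functions are assumptions made for illustration; this is not the actual preprocessing code used to create the demo file.

```python
import pandas as pd

SQM_TO_SQFT = 10.7639  # 1 square meter is roughly 10.7639 square feet

def storey_range_to_floor(storey_range):
    # "07 TO 09" -> 8 (midpoint of the range); hypothetical helper
    low, high = storey_range.split(' TO ')
    return (int(low) + int(high)) // 2

def flat_type_to_num_rooms(flat_type):
    # "3 ROOM" -> 3; assumes the number of rooms is the first token
    return int(flat_type.split()[0])

# Tiny made-up excerpt in the assumed raw format
raw = pd.DataFrame({
    'flat_type': ['3 ROOM', '4 ROOM'],
    'storey_range': ['07 TO 09', '01 TO 03'],
    'floor_area_sqm': [73.0, 90.0],
})

demo = pd.DataFrame({
    'num_rooms': raw['flat_type'].map(flat_type_to_num_rooms),
    'floor': raw['storey_range'].map(storey_range_to_floor),
    'area_sqft': (raw['floor_area_sqm'] * SQM_TO_SQFT).round(1),
})

print(demo)
```

With this conversion factor, 73 square meters become about 785.8 square feet, which matches the magnitude of the `area_sqft` values in the dataset.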

In [2]:
df = pd.read_csv('data/hdb-resale-prizes-demo.csv')

df.head()

   num_rooms  floor  area_sqft     price
0          3      5      785.8  238500.0
1          3     20      753.5  423000.0
2          3      2      785.8  207000.0
3          3      5      785.8  252000.0
4          3      8      731.9  198000.0


Convert the dataframe into NumPy arrays to be fed into the Linear Regression implementation:

In [3]:
X = df[['num_rooms', 'floor', 'area_sqft']].to_numpy()
y = df['price'].to_numpy()

num_samples, num_features = X.shape

print('The dataset contains {} samples with {} features'.format(num_samples, num_features))

The dataset contains 1780 samples with 3 features


## Fitting Linear Regression Models


### Without Data Normalization

Let's first train a Linear Regression model without normalizing the data. As our data is already in proper shape -- there are only basic numerical features -- we can immediately use scikit-learn's [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) implementation. Note that we do not split the data into training and test data, since we're not interested in the predictive power of the model but only in the coefficients.

In [4]:
lin_reg = LinearRegression(fit_intercept=True).fit(X, y)

#### Coefficients

The coefficients can be found in `lin_reg.coef_` and are in the same order as the features in your dataset (both `X` and `df`), which allows us to match the coefficients to the respective feature names. The code below does just that:

In [5]:
for idx in range(num_features):
    
    # Get feature name as the column name from the dataframe
    feature = df.columns[idx]
    
    # Get the corresponding coefficient
    coefficient = np.around(lin_reg.coef_[idx], 2)
    
    # Print the information
    print('Increasing [{}] by 1 unit will increase the resale price by {} SGD'.format(feature, coefficient))

Increasing [num_rooms] by 1 unit will increase the resale price by 49836.83 SGD
Increasing [floor] by 1 unit will increase the resale price by 7765.77 SGD
Increasing [area_sqft] by 1 unit will increase the resale price by 152.7 SGD


Since we did not normalize the data, we have preserved the original units of each feature (number of rooms, floor, area in square feet). This allows us to make the statement that, for example, increasing the number of rooms by one (with the other features held fixed) will increase our estimate of the resale price by about S\$50k.

However, we cannot say that `num_rooms` is the most important feature, since the coefficients are not comparable with each other because the features have values of different magnitudes. While `num_rooms` and `floor` have single-digit or low double-digit values, `area_sqft` has values in the range of several hundreds.
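To see why the raw coefficients depend on the unit of each feature, here is a small sketch on synthetic data (not the HDB dataset; all numbers are made up). It fits the same model once with the area in square meters and once in square feet: the area coefficient changes by the conversion factor, although both models describe exactly the same relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

SQM_TO_SQFT = 10.7639

# Synthetic data: price depends on floor and area (values are made up)
rng = np.random.default_rng(0)
floor = rng.integers(1, 30, size=200).astype(float)
area_sqm = rng.uniform(60, 120, size=200)
price = 5000 * floor + 4000 * area_sqm + rng.normal(0, 10000, size=200)

# Same information, two different units for the area feature
X_sqm = np.column_stack([floor, area_sqm])
X_sqft = np.column_stack([floor, area_sqm * SQM_TO_SQFT])

model_sqm = LinearRegression().fit(X_sqm, price)
model_sqft = LinearRegression().fit(X_sqft, price)

print('Coefficients (area in sqm): ', model_sqm.coef_)
print('Coefficients (area in sqft):', model_sqft.coef_)
```

The size of a coefficient therefore says as much about the unit of the feature as about its importance.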

#### Intercept (Offset, Bias)

Apart from the coefficients, we can also look at the intercept (i.e., $\theta_0$). This reflects the estimated price if all features are 0. The intercept can be found in `lin_reg.intercept_`:

In [6]:
print('Intercept value: {}'.format(np.around(lin_reg.intercept_)))

Intercept value: -43845.0


A negative estimate for the resale price is of course completely counter-intuitive. This doesn't really pose a problem, as there is no flat with 0 rooms and an area of 0.0 square feet (a flat on floor 0 is realistic, though).

This quick exercise was mainly to show that the model is basically only defined over the same range as the training data. While it might extrapolate well to some extent, there are obvious limits when it comes to completely unrealistic values for the different input features.
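The role of the intercept can be checked directly: a prediction is the intercept plus the weighted sum of the feature values, so the prediction for an all-zero sample is the intercept itself. The following sketch uses synthetic data (made-up values, hypothetical variable names) and is only meant to illustrate this relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with three features of different magnitudes (made up)
rng = np.random.default_rng(1)
X_toy = np.column_stack([
    rng.integers(2, 6, size=100),       # number of rooms: 2 to 5
    rng.integers(1, 26, size=100),      # floor: 1 to 25
    rng.uniform(600, 1400, size=100),   # area in square feet
]).astype(float)
y_toy = (50_000 * X_toy[:, 0] + 8_000 * X_toy[:, 1] + 150 * X_toy[:, 2]
         - 40_000 + rng.normal(0, 5_000, size=100))

toy_model = LinearRegression().fit(X_toy, y_toy)

# Prediction by hand: theta_0 + sum_i theta_i * x_i
manual = toy_model.intercept_ + X_toy @ toy_model.coef_

# Prediction for a "flat" where all features are 0
pred_at_zero = toy_model.predict(np.zeros((1, 3)))[0]

print('Intercept:         ', toy_model.intercept_)
print('Prediction at zero:', pred_at_zero)
```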

#### Making Predictions

We can also experiment with inputs that include feature values way outside the range of the training data. The following example assumes a flat on the 887th floor, with the other feature values being realistic. Floor 887 would be the top floor of the [Tower of Babel](https://en.wikipedia.org/wiki/Tower_of_Babel) (2,484 m), assuming an [average floor-to-floor height of 2.8 m](https://www.hdb.gov.sg/-/media/doc/HMG/11-hdb-requirements-for-aa-work-on-hdb-premises.pdf) -- just don't take this too seriously :).

In [7]:
X_babel = [[3, 887, 750]] # Original values [[3, 887, 750]]

y_babel = np.around(lin_reg.predict(X_babel), 2)

print('The estimated resale price is {} SGD'.format(y_babel.squeeze()))

The estimated resale price is 7108433.76 SGD


In case you only have 2 million SGD to spare but still want to live on the top floor, you only need to find a flat with `-100` rooms (just replace `3` with `-100` as the first value). Again, this is simply to showcase that the model is only (well) defined over the range of feature values covered by the training data.
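One simple safeguard is to check whether a new sample lies within the per-feature minimum and maximum of the training data before trusting the prediction. The helper below is a hypothetical sketch, not part of scikit-learn; the toy training matrix reuses the five rows shown in `df.head()` above.

```python
import numpy as np

def features_outside_training_range(X_train, sample, feature_names):
    """Return the names of all features whose value lies outside
    the [min, max] range observed in the training data."""
    X_train = np.asarray(X_train, dtype=float)
    sample = np.asarray(sample, dtype=float)
    lower, upper = X_train.min(axis=0), X_train.max(axis=0)
    outside = (sample < lower) | (sample > upper)
    return [name for name, flag in zip(feature_names, outside) if flag]

# First five rows of the dataset (num_rooms, floor, area_sqft)
X_train_toy = np.array([
    [3, 5, 785.8],
    [3, 20, 753.5],
    [3, 2, 785.8],
    [3, 5, 785.8],
    [3, 8, 731.9],
])
names = ['num_rooms', 'floor', 'area_sqft']

print(features_outside_training_range(X_train_toy, [3, 887, 750], names))
# ['floor']
print(features_outside_training_range(X_train_toy, [-100, 887, 750], names))
# ['num_rooms', 'floor']
```

A per-feature range check is a crude criterion (a sample can be unusual even if every single value is within range), but it catches cases like the 887th floor.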

### With Data Normalization

Now we train a Linear Regression model after normalizing the data. In this case, we use standardization to normalize all features. While this brings all features onto the same scale (zero mean and unit variance), it naturally destroys the original units.
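As a reminder of what standardization does, the sketch below reproduces scikit-learn's `StandardScaler` by hand on a tiny made-up matrix: each column is shifted by its mean and divided by its standard deviation. The variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrix (num_rooms, floor, area_sqft)
X_toy = np.array([
    [3, 5, 785.8],
    [3, 20, 753.5],
    [4, 2, 990.3],
    [5, 8, 1184.0],
])

toy_scaler = StandardScaler().fit(X_toy)
X_sklearn = toy_scaler.transform(X_toy)

# Manual standardization: z = (x - mean) / std, computed per column.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print('Column means after scaling:', X_sklearn.mean(axis=0))
print('Column stds after scaling: ', X_sklearn.std(axis=0))
```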

In [8]:
scaler = StandardScaler().fit(X)

X_scaled = scaler.transform(X)

Let's have a look at a single sample:

In [9]:
print(X_scaled[0])

[-1.26009391 -0.61894455 -0.97408545]


Not only are all values negative in this case, but non-integer values for `num_rooms` and `floor` are not meaningful either. However, the Linear Regression doesn't care, and we can simply train a model the same way as above (just using the scaled data):

In [10]:
lin_reg_scaled = LinearRegression(fit_intercept=True).fit(X_scaled, y)

#### Coefficients

Again, we first look at the learned coefficients, which can now be found in `lin_reg_scaled.coef_`:

In [11]:
for idx in range(num_features):
    
    # Get feature name as the column name from the dataframe
    feature = df.columns[idx]
    
    # Get the corresponding coefficient
    coefficient = np.around(lin_reg_scaled.coef_[idx], 2)
    
    # Print the information
    print('Coefficient for [{}]: {}'.format(feature, coefficient))

Coefficient for [num_rooms]: 38394.7
Coefficient for [floor]: 45781.72
Coefficient for [area_sqft]: 33992.9


Now we can no longer say that increasing the number of rooms by 1 will increase the estimated resale price by around S\$38k. After standardization, each coefficient describes the change in the estimate when the feature increases by one standard deviation, not by one original unit. The only reason why the value is even similar to S\$50k (see above) is that the range of values for `num_rooms` before and after the normalization is comparable. Note that the difference is much more pronounced for `area_sqft` and even `floor`.
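For ordinary least squares, the two sets of coefficients are directly related: the coefficient on the standardized feature equals the coefficient on the raw feature multiplied by that feature's standard deviation. The following sketch checks this on synthetic data (made-up values, hypothetical variable names); `scale_` is the attribute in which `StandardScaler` stores the per-feature standard deviations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with three features of different magnitudes (made up)
rng = np.random.default_rng(2)
n = 200
X_toy = np.column_stack([
    rng.integers(2, 6, size=n),       # number of rooms: 2 to 5
    rng.integers(1, 26, size=n),      # floor: 1 to 25
    rng.uniform(600, 1400, size=n),   # area in square feet
]).astype(float)
y_toy = (50_000 * X_toy[:, 0] + 8_000 * X_toy[:, 1] + 150 * X_toy[:, 2]
         + rng.normal(0, 20_000, size=n))

toy_scaler = StandardScaler().fit(X_toy)
raw_model = LinearRegression().fit(X_toy, y_toy)
scaled_model = LinearRegression().fit(toy_scaler.transform(X_toy), y_toy)

# Raw coefficient (price per unit) times the standard deviation (units)
# gives the price change per standard deviation of the feature
print('raw coef * std:', raw_model.coef_ * toy_scaler.scale_)
print('scaled coef:   ', scaled_model.coef_)
```

Applied to the numbers above, 38394.7 / 49836.83 is roughly 0.77, which would be the standard deviation of `num_rooms` in the dataset.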

On the upside, we can now directly compare the different coefficients. Since `floor` has the largest coefficient, we can argue that `floor` is in fact the most important feature/predictor for estimating the resale price and `area_sqft` the least important one.

#### Intercept (Offset, Bias)

Finally, let's check the intercept $\theta_0$ for the model fitted on the normalized data:

In [12]:
print('Intercept value: {}'.format(np.around(lin_reg_scaled.intercept_)))

Intercept value: 374318.0


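This value now looks much more reasonable, and this should not be a surprise. Recall that standardization centers the data by subtracting the mean from each feature value. This means that `num_rooms=0.0`, `floor=0.0` and `area_sqft=0.0` are meaningful values, in fact representing the "most average flat". The intercept is therefore the estimated price of this average flat, which for a least-squares fit on centered features equals the mean resale price in the training data.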
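That the intercept of a least-squares model on centered features equals the mean of the target can be verified with a short sketch. The data is synthetic (made-up values, hypothetical variable names), so the numbers will not match the HDB results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up)
rng = np.random.default_rng(3)
n = 200
X_toy = np.column_stack([
    rng.integers(2, 6, size=n),       # number of rooms: 2 to 5
    rng.integers(1, 26, size=n),      # floor: 1 to 25
    rng.uniform(600, 1400, size=n),   # area in square feet
]).astype(float)
y_toy = (50_000 * X_toy[:, 0] + 8_000 * X_toy[:, 1] + 150 * X_toy[:, 2]
         + rng.normal(0, 20_000, size=n))

X_toy_scaled = StandardScaler().fit_transform(X_toy)
scaled_model = LinearRegression().fit(X_toy_scaled, y_toy)

# With centered features, the intercept is the mean of the target
print('Intercept:     ', scaled_model.intercept_)
print('Mean of target:', y_toy.mean())
```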

#### Making Predictions

Of course, we can now use this model to predict the resale price for new data samples. The only important step we have to perform is to normalize the samples the same way as we did the training data. The code below uses the same sample `X_babel` from above:

In [13]:
# Normalize data sample
X_babel_scaled = scaler.transform(X_babel)

y_babel = np.around(lin_reg_scaled.predict(X_babel_scaled), 2)

print('The estimated resale price is {} SGD'.format(y_babel.squeeze()))

The estimated resale price is 7108433.76 SGD


The fundamental takeaway message is that the predicted resale price is the same for the normalized and unnormalized data. So regarding model performance, normalization does not affect the results here.
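Forgetting to scale a new sample (or passing it to the wrong model) is an easy mistake to make. One way to avoid it is to bundle the scaler and the model with scikit-learn's `make_pipeline`, so that every call to `predict` applies the scaling automatically. The sketch below uses synthetic data (made-up values, hypothetical variable names) and also confirms that the pipeline and the unscaled model give the same prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up)
rng = np.random.default_rng(4)
n = 200
X_toy = np.column_stack([
    rng.integers(2, 6, size=n),       # number of rooms: 2 to 5
    rng.integers(1, 26, size=n),      # floor: 1 to 25
    rng.uniform(600, 1400, size=n),   # area in square feet
]).astype(float)
y_toy = (50_000 * X_toy[:, 0] + 8_000 * X_toy[:, 1] + 150 * X_toy[:, 2]
         + rng.normal(0, 20_000, size=n))

# Model on raw features vs. pipeline that standardizes internally
plain_model = LinearRegression().fit(X_toy, y_toy)
scaled_pipeline = make_pipeline(StandardScaler(), LinearRegression()).fit(X_toy, y_toy)

# New sample in ORIGINAL units; the pipeline takes care of the scaling
X_new = np.array([[3, 887, 750]], dtype=float)

print('Without normalization:', plain_model.predict(X_new))
print('With normalization:   ', scaled_pipeline.predict(X_new))
```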

## Summary

For basic Linear Regression -- that is, without regularization -- normalization does not affect model performance. Here, whether to normalize the data or not is a question regarding the interpretation of the learned coefficients $\theta_i$. Without normalization, each $\theta_i$ directly indicates the effect of a one-unit change in feature $i$ on the estimate. On the other hand, normalization allows us to compare the different $\theta$ values of the same model. For example, $|\theta_i| > |\theta_j|$ indicates that feature $i$ is a more important feature/predictor than feature $j$.

Once we include regularization, normalizing the data is usually recommended. Recall that the regularization term we introduced in the lecture (there are many alternatives) is calculated as

$$
\sum_{i=1}^d \theta_{i}^2
$$

Without normalizing the data, the magnitudes of the $\theta_i$ can be very different (see the example above), which means that the regularization term would "punish" the $\theta_i$ differently. Normalizing the data will bring the magnitudes of the $\theta_i$ into the same range, enabling a fair regularization.
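To illustrate the difference, the sketch below compares plain Linear Regression with scikit-learn's `Ridge`, which penalizes the sum of squared coefficients. The data is synthetic and the regularization strength `alpha=100.0` is an arbitrary choice for demonstration purposes, not a recommended value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data with features of very different magnitudes (made up)
rng = np.random.default_rng(5)
n = 200
X_toy = np.column_stack([
    rng.integers(2, 6, size=n),       # number of rooms: 2 to 5
    rng.integers(1, 26, size=n),      # floor: 1 to 25
    rng.uniform(600, 1400, size=n),   # area in square feet
]).astype(float)
y_toy = (50_000 * X_toy[:, 0] + 8_000 * X_toy[:, 1] + 150 * X_toy[:, 2]
         + rng.normal(0, 20_000, size=n))

X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Without regularization: predictions do not depend on the scaling
ols_raw = LinearRegression().fit(X_toy, y_toy)
ols_scaled = LinearRegression().fit(X_toy_scaled, y_toy)

# With regularization: the penalty hits large coefficients (here the
# one for the small-valued room count) much harder on the raw data
ridge_raw = Ridge(alpha=100.0).fit(X_toy, y_toy)
ridge_scaled = Ridge(alpha=100.0).fit(X_toy_scaled, y_toy)

ols_gap = np.abs(ols_raw.predict(X_toy) - ols_scaled.predict(X_toy_scaled)).max()
ridge_gap = np.abs(ridge_raw.predict(X_toy) - ridge_scaled.predict(X_toy_scaled)).max()

print('Max. prediction difference without regularization:', ols_gap)
print('Max. prediction difference with regularization:   ', ridge_gap)
```

Without regularization, the two fits agree up to numerical precision; with regularization, they generally do not, which is why the scaling decision starts to matter for model performance.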