First, let's import the modules we need and define a random seed (we'll use it where needed).

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split

from linear_regression import SciPyLinearRegressionOLS, LinearRegressionOLS, \
    SciPyLinearRegressionNNOLS
from utils.scaler import Scaler

SEED = 42

Loading the California housing dataset

In [2]:
X, y = fetch_california_housing(return_X_y=True, as_frame=True)

Splitting the data into training and testing sets (10% for testing)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, shuffle=True)
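
The split above is not seeded, so the rows that end up in each set can differ from run to run. A minimal sketch of a reproducible split, assuming the `SEED` constant defined earlier is passed as `random_state`; the small synthetic arrays (`X_demo`, `y_demo`) are hypothetical stand-ins for the housing data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42

# Synthetic stand-in for the housing features and target (illustration only).
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.arange(20, dtype=float)

# Passing random_state fixes the shuffle, so repeated calls give the same split.
split_a = train_test_split(X_demo, y_demo, train_size=0.9, shuffle=True, random_state=SEED)
split_b = train_test_split(X_demo, y_demo, train_size=0.9, shuffle=True, random_state=SEED)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(f"Identical splits: {same_split}")
```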

Standardizing the features by removing the mean and scaling to unit variance

In [4]:
scaler = Scaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
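
The `Scaler` class comes from this project's `utils.scaler` module and its code is not shown here. The sketch below only illustrates the standardization described above in plain NumPy; the helper names `standardize_fit` and `standardize_apply` are hypothetical. The key point is that the mean and standard deviation are estimated on the training data only and then reused for the test data:

```python
import numpy as np

def standardize_fit(X):
    """Estimate per-feature mean and standard deviation from the training data."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled to avoid division by zero
    return mean, std

def standardize_apply(X, mean, std):
    """Remove the training mean and scale by the training standard deviation."""
    return (X - mean) / std

# Tiny illustrative arrays, not the housing data.
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[2.0, 40.0]])

mean, std = standardize_fit(X_train_demo)
X_train_std = standardize_apply(X_train_demo, mean, std)
X_test_std = standardize_apply(X_test_demo, mean, std)  # reuses training statistics

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```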

The first solution uses the analytical (closed-form) OLS solution to get the weights/coefficients: $\hat{\beta} = (X^TX)^{-1}(X^TY)$

In [5]:
ols_reg = LinearRegressionOLS()
ols_reg.fit(X_train_scaled, y_train)
print(f"The coefficients are {ols_reg.coef_}")
print(f"The intercept is {ols_reg.intercept_}")

The coefficients are [ 0.82588866  0.12255387 -0.27011224  0.31475287 -0.00336918 -0.03998842
 -0.89114681 -0.86305879]
The intercept is 2.0689959388458363
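
The closed-form formula above can be sketched in a few lines of NumPy. This is an illustration on synthetic data, not the code of `LinearRegressionOLS` (which is not shown here). It assumes the intercept is handled by prepending a column of ones to $X$, and it solves the linear system $(X^TX)\hat{\beta} = X^TY$ instead of forming the inverse explicitly, which is usually more stable numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with known coefficients and intercept (illustration only).
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([1.5, -2.0, 0.5])
true_intercept = 4.0
y_demo = X_demo @ true_coef + true_intercept + 0.01 * rng.normal(size=200)

# Prepend a column of ones so the intercept is estimated with the coefficients.
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Solve (X^T X) beta = X^T y rather than computing the inverse.
beta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_demo)
intercept, coef = beta[0], beta[1:]

print(f"The coefficients are {coef}")
print(f"The intercept is {intercept}")
```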


The second solution is sklearn's <code>LinearRegression</code>

In [6]:
sklearn_ols_reg = LinearRegression(positive=False)
sklearn_ols_reg.fit(X_train_scaled, y_train)
print(f"The coefficients are {sklearn_ols_reg.coef_}")
print(f"The intercept is {sklearn_ols_reg.intercept_}")

The coefficients are [ 0.82588866  0.12255387 -0.27011224  0.31475287 -0.00336918 -0.03998842
 -0.89114681 -0.86305879]
The intercept is 2.068995938845836


As we can see, the results are the same, because sklearn uses the <code>scipy.linalg.lstsq</code> method under the hood. I have also wrapped that method below:

In [7]:
scipy_ols_reg = SciPyLinearRegressionOLS()
scipy_ols_reg.fit(X_train_scaled, y_train)
print(f"The coefficients are {scipy_ols_reg.coef_}")
print(f"The intercept is {scipy_ols_reg.intercept_}")

The coefficients are [ 0.82588866  0.12255387 -0.27011224  0.31475287 -0.00336918 -0.03998842
 -0.89114681 -0.86305879]
The intercept is 2.0689959388458323
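
For reference, a direct call to <code>scipy.linalg.lstsq</code> might look like the sketch below. It is not the code of `SciPyLinearRegressionOLS`; it assumes the intercept is recovered by centering the features and the target before solving, and it runs on synthetic data:

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.default_rng(0)

# Synthetic data with known coefficients and intercept (illustration only).
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ true_coef + 4.0 + 0.01 * rng.normal(size=200)

# Center features and target so the least-squares problem has no intercept term.
X_mean = X_demo.mean(axis=0)
y_mean = y_demo.mean()

# lstsq returns the solution, the residues, the effective rank and the singular values.
coef, residues, rank, singular_values = lstsq(X_demo - X_mean, y_demo - y_mean)

# Recover the intercept from the means.
intercept = y_mean - X_mean @ coef

print(f"The coefficients are {coef}")
print(f"The intercept is {intercept}")
```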


As expected, the results are the same.


If we specify <code>positive=True</code> when initializing a <code>sklearn.linear_model.LinearRegression</code> object, then sklearn will use the <code>scipy.optimize.nnls</code> method

In [8]:
sklearn_ols_reg_pos = LinearRegression(positive=True)
sklearn_ols_reg_pos.fit(X_train_scaled, y_train)
print(f"The coefficients are {sklearn_ols_reg_pos.coef_}")
print(f"The intercept is {sklearn_ols_reg_pos.intercept_}")

The coefficients are [0.81969412 0.23567187 0.         0.02021197 0.03733492 0.
 0.         0.        ]
The intercept is 2.0689959388458226


This naturally matches the results of the wrapped method

In [9]:
scipy_nn_ols_reg = SciPyLinearRegressionNNOLS()
scipy_nn_ols_reg.fit(X_train_scaled, y_train)
print(f"The coefficients are {scipy_nn_ols_reg.coef_}")
print(f"The intercept is {scipy_nn_ols_reg.intercept_}")

The coefficients are [0.81969412 0.23567187 0.         0.02021197 0.03733492 0.
 0.         0.        ]
The intercept is 2.068995938845836


When <code>positive</code> is set to <code>True</code>, it forces the coefficients to be non-negative.

We can see that the solution looks completely different: all the coefficients are either zero or positive.

We would want to constrain the coefficients to non-negative values whenever a negative value
makes no physical sense, say because it represents the intensity of a pixel, or the price
of an object, or a frequency count, or a chemical concentration, etc.
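
The effect of the non-negativity constraint can be seen on a small synthetic example with <code>scipy.optimize.nnls</code>. This is a sketch, not the code of `SciPyLinearRegressionNNOLS`; it assumes the data is centered first so that no intercept column is needed. One of the true coefficients is negative, and the constrained fit sets it to zero instead:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Synthetic data where the second true coefficient is negative (illustration only).
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=200)

# Center features and target so the problem has no intercept term.
X_centered = X_demo - X_demo.mean(axis=0)
y_centered = y_demo - y_demo.mean()

# nnls minimizes ||A x - b||_2 subject to x >= 0 and returns (solution, residual norm).
coef_nn, residual_norm = nnls(X_centered, y_centered)

print(f"The non-negative coefficients are {coef_nn}")
```

The coefficient that would be negative in an unconstrained fit ends up at zero, which mirrors the zeros in the output of the housing model above.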

TODO:
 - the idea of regularization and why it is needed
 - using gradient descent instead of using formula (analytical solution)
 - comparing results of SGDRegressor with results of ridge regression
 - in which situations using gradient descent is more advisable
 - R2 and adjusted R2
 - Linear regression metrics

In [18]:
sklearn_sgd_reg = SGDRegressor()
sklearn_sgd_reg.fit(X_train_scaled, y_train)
print(f"The coefficients are {sklearn_sgd_reg.coef_}")
print(f"The intercept is {sklearn_sgd_reg.intercept_}")

The coefficients are [  0.57504629   0.14611444   0.26600053  -0.18758554   0.12954952
 -15.09777784  -1.76790904  -1.60107809]
The intercept is [1.9034144]
