In [1]:
from sklearn.datasets import load_boston

In [2]:
(train_input, train_output) = load_boston(return_X_y = True)

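Loaded the Boston housing data, consisting of 506 training samples. Each input is a vector of thirteen home features (see https://scikit-learn.org/stable/datasets/index.html#boston-dataset for details), and the output is the median home price (in thousands of dollars). Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so these cells require an older release.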

In [3]:
train_input.shape

(506, 13)

In [4]:
train_output.shape

(506,)

In [5]:
train_input[0]

array([6.320e-03, 1.800e+01, 2.310e+00, 0.000e+00, 5.380e-01, 6.575e+00,
       6.520e+01, 4.090e+00, 1.000e+00, 2.960e+02, 1.530e+01, 3.969e+02,
       4.980e+00])

In [6]:
mean = train_input.mean(axis = 0)
std = train_input.std(axis = 0)

In [7]:
mean

array([3.61352356e+00, 1.13636364e+01, 1.11367787e+01, 6.91699605e-02,
       5.54695059e-01, 6.28463439e+00, 6.85749012e+01, 3.79504269e+00,
       9.54940711e+00, 4.08237154e+02, 1.84555336e+01, 3.56674032e+02,
       1.26530632e+01])

In [8]:
std

array([8.59304135e+00, 2.32993957e+01, 6.85357058e+00, 2.53742935e-01,
       1.15763115e-01, 7.01922514e-01, 2.81210326e+01, 2.10362836e+00,
       8.69865112e+00, 1.68370495e+02, 2.16280519e+00, 9.12046075e+01,
       7.13400164e+00])

In [9]:
train_input -= mean

In [10]:
train_input.mean(axis = 0)

array([-3.05245508e-15,  2.07616094e-14, -2.80039496e-14, -1.18975975e-16,
        2.64171645e-16, -8.14105042e-15, -3.55271368e-14,  1.72018745e-15,
        4.68312258e-15,  3.37016317e-13, -2.34858246e-14,  7.63229620e-13,
       -3.56500073e-15])

In [11]:
train_input /= std

In [12]:
train_input.std(axis = 0)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

Centered and scaled each of the input features via subtraction of the mean and (elementwise) division by the standard deviation.  Notice the use of broadcasting: the length-13 mean and std arrays are applied across all 506 rows of train_input.

We could equally well use the sklearn.preprocessing.StandardScaler helper object to perform this feature normalization; see e.g. logistic_regression.ipynb.
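
As a quick check of that claim, here is a minimal sketch comparing the manual normalization with StandardScaler. The data is a synthetic stand-in (an assumption made so the snippet runs on its own), not the housing data used above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (506, 13) feature matrix; NOT the real housing data.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(506, 13))

# Manual normalization, as in the cells above. Broadcasting applies the
# length-13 mean and std vectors to every one of the 506 rows.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler learns the same per-feature mean and (population) std.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(np.allclose(X_manual, X_scaled))
```

One practical advantage of the helper object is that the fitted scaler stores the training mean and scale (scaler.mean_, scaler.scale_), so the same transformation can later be applied to unseen data via scaler.transform().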

In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor

In [14]:
model1 = LinearRegression()

In [15]:
model1.get_params()

{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}

In [16]:
model1.fit(train_input, train_output)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [17]:
model1.score(train_input, train_output)

0.7406426641094095

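LinearRegression learns the feature weights by solving the ordinary least squares problem directly (a closed-form solution, equivalent to the normal equations) rather than by iterative optimization.  As such, there are relatively few parameters governing the object.  Note the 'fit_intercept' parameter, True when we wish to include a bias term (roughly, Y = \Theta . X + B), False otherwise (Y = \Theta . X).  Note also the 'normalize' parameter: when True, the features are rescaled before fitting by subtracting the mean and dividing by the L^2 norm, which is similar in spirit to (but not identical with) the standardization done earlier in this notebook.  This parameter was deprecated in scikit-learn 1.0 and removed in 1.2.  Defaults are given above: fit_intercept = True, normalize = False.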
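
To make the role of the bias term concrete, the following sketch solves the same least squares problem directly with NumPy and compares the result with LinearRegression. The data and the names true_w, X_aug, theta and bias are illustrative assumptions, and np.linalg.lstsq is used here as a stand-in for whatever solver the library calls internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 4.0 + 0.1 * rng.normal(size=200)

# Append a column of ones so the bias B is learned as one extra weight:
# Y = Theta . X + B  becomes  Y = [Theta, B] . [X, 1].
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
theta, bias = solution[:-1], solution[-1]

# The fitted sklearn model should agree up to floating-point error.
model = LinearRegression(fit_intercept=True).fit(X, y)
print(np.allclose(theta, model.coef_), np.isclose(bias, model.intercept_))
```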

Performance of the model is indicated via the .score() method, which returns the R^2 value (coefficient of determination) of the model's predictions on an (input, output) collection of data.
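
For reference, R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on synthetic data (an assumption, chosen so the snippet is self-contained) and compares it with .score().

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(size=100)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - pred) ** 2)       # squared error of the model
ss_tot = np.sum((y - y.mean()) ** 2)   # squared error of always predicting the mean
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(r2_manual, model.score(X, y)))
```

A value of 1 means a perfect fit, while 0 means the model does no better than always predicting the mean of the outputs.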

In [18]:
model2 = SGDRegressor()

In [19]:
model2.get_params()

{'alpha': 0.0001,
 'average': False,
 'early_stopping': False,
 'epsilon': 0.1,
 'eta0': 0.01,
 'fit_intercept': True,
 'l1_ratio': 0.15,
 'learning_rate': 'invscaling',
 'loss': 'squared_loss',
 'max_iter': 1000,
 'n_iter_no_change': 5,
 'penalty': 'l2',
 'power_t': 0.25,
 'random_state': None,
 'shuffle': True,
 'tol': 0.001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

In [20]:
model2.fit(train_input, train_output)

SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=1000,
             n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,
             shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
             warm_start=False)

In [21]:
model2.score(train_input, train_output)

0.7396987369567145

SGDRegressor applies stochastic gradient descent to a given loss function.  Note its large number of parameters relative to the earlier LinearRegression (among them the same 'fit_intercept' parameter).

The optimization objective is specified via the 'loss' parameter (the loss function), the 'penalty' parameter (the regularization term) and the 'alpha' parameter (the regularization coefficient).  By default, we have a squared-error loss (loss = 'squared_loss') with L^2 regularization (penalty = 'l2') and regularization coefficient alpha = .0001 (i.e., ridge regression).  One could also consider L^1 regularization (penalty = 'l1', lasso regression) or a combination of L^2 and L^1 regularization (penalty = 'elasticnet', with the mix set by the 'l1_ratio' parameter).
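
A minimal sketch of the three regularization choices follows. The data is synthetic with two weights that are truly zero (an assumption for illustration), and the 'loss' argument is left at its default because its name differs between scikit-learn versions ('squared_loss' in this notebook, 'squared_error' in later releases).

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic data with two irrelevant features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + 0.1 * rng.normal(size=300)

# Same squared-error loss throughout; only the penalty term changes.
models = {
    "l2 (ridge-like)": SGDRegressor(penalty="l2", alpha=1e-4, random_state=0),
    "l1 (lasso-like)": SGDRegressor(penalty="l1", alpha=1e-4, random_state=0),
    "elasticnet": SGDRegressor(penalty="elasticnet", l1_ratio=0.5,
                               alpha=1e-4, random_state=0),
}

scores = {}
for name, m in models.items():
    m.fit(X, y)
    scores[name] = m.score(X, y)
    print(name, np.round(m.coef_, 2), round(scores[name], 4))
```

With alpha this small the three fits should be nearly indistinguishable; raising alpha makes the differences between the penalties more visible.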

Next, we have parameters governing the stochastic gradient descent algorithm itself.  In particular, we specify a learning rate schedule (constant, decaying or adaptive) using the 'learning_rate', 'eta0' and 'power_t' parameters, and cap the number of passes over the data set (epochs) via the 'max_iter' parameter.  Early stopping criteria can also be specified (via e.g. the 'early_stopping', 'n_iter_no_change', 'validation_fraction' and 'tol' parameters).
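
To give a feel for the default schedule, the scikit-learn documentation describes 'invscaling' as eta = eta0 / pow(t, power_t). The sketch below evaluates that formula with the default values shown above, treating t as a simple step counter starting at 1, which is an assumption about how the library counts updates.

```python
import numpy as np

eta0 = 0.01      # initial learning rate (default shown above)
power_t = 0.25   # decay exponent (default shown above)

# Step counter; assumed here to start at 1 and grow with each update.
t = np.array([1, 10, 100, 1000, 10000])

# Inverse-scaling schedule: the step size shrinks slowly as t grows.
eta = eta0 / np.power(t, power_t)

for step, rate in zip(t, eta):
    print(step, rate)
```

With power_t = 0.25 the decay is gentle: the step size only drops by a factor of ten after ten thousand steps.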

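Notice that there is no specification of batch size in the parameter list.  This is because SGDRegressor updates the weights one sample at a time: .fit() makes up to 'max_iter' passes over the data, with a weight update after each sample, so it is genuinely stochastic rather than ordinary (full-batch) gradient descent.  The .partial_fit() method performs a single pass over whatever data it is given, which makes it the tool for feeding the model chunks of data incrementally (e.g., when the data set does not fit in memory).  Related parameters include 'shuffle' and 'random_state', which control the order in which samples are visited in each pass, and 'warm_start', which makes a repeated call to .fit() start from the previous solution.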
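
A minimal sketch of chunked training with .partial_fit() follows. The data is synthetic, and batch_size and n_epochs are hypothetical values chosen for illustration; neither is a parameter of SGDRegressor.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=500)

model = SGDRegressor(random_state=0)
batch_size = 50   # hypothetical chunk size, managed by our own loop
n_epochs = 20     # hypothetical number of passes over the data

for epoch in range(n_epochs):
    order = rng.permutation(len(X))          # reshuffle the rows each pass
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        # Each call makes one pass over the rows it is given and
        # continues from the current weights.
        model.partial_fit(X[idx], y[idx])

score = model.score(X, y)
print(score)
```

Since .partial_fit() ignores 'max_iter' and the stopping criteria, the outer loop is responsible for deciding how many passes to make.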

As with the LinearRegression object, performance of the model is indicated via the .score() method, which returns the R^2 value of the model's predictions on an (input, output) collection of data.