pysgd package performs stochastic gradient descent in accordance with several leading algorithms. The package is used as a function that accepts data, an objective function, a gradient descent adaptation and algorithm hyperparameters as arguments and returns an array of model parameters and cost history.
The intention of this package is to present reasonably efficient, working algorithms that are easy to understand.
The code is structured with one function (
sgd) in the main module, one class (
Objective), and modules for adaptations and objective functions.
sgd creates an
Objective instance using the adaptation and objective function modules and any data specified. The objective modules include both gradient and cost functions. The gradient functions accept
batch as arguments and the cost functions accept
data. The adaptations are set up as decorated generator functions that accept
grad_fun as arguments.
params is a dict of hyperparameters and
grad_fun is a gradient function.
Objective creates a gradient generator and a cost function (injecting a batch generator or data if required) that each accept
theta as a single argument. These functions are then used in the main gradient descent algorithm in the
The package is structured to facilitate the inclusion of additional algorithms with minimal additional boilerplate. Additional objective functions and gradient adaptations can be included by placing them in their respective folders, following the forms of the existing ones.
pysgd | `-- __init__.py |-- adaptations | |-- __init__.py | |-- adagrad.py | |-- adam.py | `-- constant.py |-- objectives | |-- __init__.py | |-- linear.py | |-- logistic.py | `-- stab_tang.py `-- tests
Gradient descent is a method for minimizing an objective function. In machine learning applications the objective function to be minimized is the error (or cost),
J, of a predictive model. A predictive model consists of a parameters,
theta, that are applied to inputs,
X, (also called training samples, features, observations or independent variables) in order to estimate an output,
y_hat (also called a label or dependent variable). Gradient descent attempts to determine the parameters that when applied to a set of inputs result in the lowest total error (the difference between the actual outcome and the one predicted by the model). Below is the basic predictive formula.
H(X,theta) = y_hat
And here is an illustrative formula for determining the total error of a model.
J(theta) = sum(|h_i(theta,x_i) - y_i| for each training observation, i)
Different formulas for computing cost are used depending on the application, but the formula above expresses the essence of predicting actual outcomes as closely as possible.
In order to minimze
J with respect to
theta, the algorithm starts with an abitrary value of
theta, determines the "direction" that would result in the fastest decrease in cost (called the
theta in that direction by a small amount (called the learning rate or
alpha) and then repeats until cost
J has been minimized. Below is the basic formula for calculating
theta at each iteration of the gradient descent algorithm.
theta_(j+1) = theta_j - alpha * gradient_j
The package has one main function,
sgd, that returns a
j x (n+2) array, where
j is the number of iterations and
n is the number of features in the data.
theta_j is in the first
n+1 columns and the cost
J_j in the last column.
||Starting value of
||Objective function to be minimized in the form of a string with a value of
||Gradient descent adaptation in the form of a string with a value of
||Data in the form of an
||Batch size in the form of an integer between
||Hyperparameter used by
||Hyperparamter used by
||Hyperparamter used by
||Maximum change in all elements of
||Maximum number of batches to evaluate if convergence is not achieved in fewer iterations.|
Tests are are performed with pytest and cover 100% of the code.
In addition to sample data sets, the Stablinsky-Tang function is used for testing. This function is non-convex with straightforward gradient computations that makes it possible to compare the values produced by the algorithms with known values. By using two dimensional inputs all of the possible values of
J within a range of
theta values can be plotted as a surface.
J(theta) for each iteration of the algorithm can then be plotted on the surface in order to visualize the gradient descent.
The color scale of the surfaces in the plots in the included notebook corresponds to the value of
J. The color scale of the points on the surface, which represent
J(theta) at each iteration, corresponds to the iteration of the algorithm.