In [None]:
# default_exp gradient

In [None]:
#hide
%load_ext autoreload
%autoreload 2

# gradient

Spall's simultaneous perturbation stochastic approximation (SPSA) provides an efficient means to approximate the gradient of high-dimensional models, even when only noisy evaluations of the objective function are available. This is in contrast to more typical applications of stochastic gradient descent, where the noisiness of the gradient comes not from the objective function itself, but rather from evaluating the gradient on subsets of the data.

### Approximating the gradient with SPSA 

The general idea of SPSA is reasonably straightforward. Given a step size $c_k$ and a vector of perturbations $\delta$, we first generate forward and backward perturbations of all model parameters simultaneously:

$$\theta^+ = \theta + c_k \delta$$
$$\theta^- = \theta - c_k \delta$$

The perturbation $\delta$ is often sampled from a shifted and rescaled Bernoulli distribution as follows:

$$b_1, b_2, \ldots, b_m \sim \mathrm{Bernoulli}(p=0.5)$$
$$\delta_i = 2b_i -1$$

where $\delta_i$ is the direction in which the $i$-th model parameter will be moved in the forward perturbation.
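
As a minimal sketch of this sampling step, the snippet below draws the Bernoulli variables with NumPy and rescales them to $\pm 1$. The parameter count `m` and the fixed seed are assumptions made only for illustration; the class defined later in this notebook uses `scipy.stats.bernoulli` instead.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

m = 6  # hypothetical number of model parameters
b = rng.integers(0, 2, size=m)  # b_i ~ Bernoulli(p=0.5), values in {0, 1}
delta = 2 * b - 1               # shift and rescale to {-1, +1}
```

Each entry of `delta` is either $-1$ or $+1$ with equal probability, so every parameter is moved by exactly $c_k$ in one direction or the other.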

We then evaluate the cost function at the two perturbed parameter vectors:

$$y^+ = F(\theta^+, X)$$
$$y^- = F(\theta^-, X)$$

Note that the cost function itself may be noisy.

We approximate the gradient as the slope of the line between the points $(\theta^+, y^+)$ and $(\theta^-, y^-)$, with the division taken elementwise over the parameters:

$$\hat{g}= \frac{y^+-y^-}{\theta^+ - \theta^-}= \frac{y^+-y^-}{2 c_k \delta}$$
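
The steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the implementation exported below: `spsa_gradient` is a hypothetical helper, and the noise-free quadratic `F` stands in for a real cost function.

```python
import numpy as np

def spsa_gradient(F, theta, c_k, rng):
    """One SPSA gradient estimate; F maps a parameter vector to a scalar."""
    # Draw a +/-1 perturbation direction for every parameter
    delta = 2 * rng.integers(0, 2, size=theta.shape[0]) - 1
    # Evaluate the cost at the forward and backward perturbations
    y_plus = F(theta + c_k * delta)
    y_minus = F(theta - c_k * delta)
    # Elementwise slope across the perturbation
    return (y_plus - y_minus) / (2 * c_k * delta)

def F(theta):
    # Hypothetical noise-free cost; its true gradient is 2 * theta
    return np.sum(theta ** 2)

rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])
g_hat = spsa_gradient(F, theta, c_k=0.1, rng=rng)
```

Because the numerator $y^+ - y^-$ is a single scalar, every entry of one estimate has the same magnitude, $|y^+ - y^-| / (2 c_k)$, and only the signs differ. A single estimate is therefore crude; it is only on average over many draws of $\delta$ that it approximates the true gradient.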


[place holder text]

### Perturbing subsets of parameters

In some models, it might be desirable to evaluate the gradient separately for different subsets of parameters. For example, in variational inference, the means of the posterior approximation have a much stronger impact on the loss function than the standard deviations do. In that case, a gradient estimate obtained by perturbing all parameters at once is likely to be dominated by the effect of the means, and may carry little information about the standard deviations.

The ```param_subsets``` option permits the gradient approximation to be evaluated separately for subsets of parameters. If, for example, ```param_subsets=[0,0,0,1,1,1]```, then the gradient will be approximated in two steps. The gradient will be estimated first for the first three parameters, perturbing them while holding the other parameters constant. Then the parameters labelled ```1``` will be perturbed, while all others are held constant. The cost of doing this is that the number of cost function evaluations increases from $2$ to $2n$, where $n$ is the number of parameter subsets to be evaluated separately.
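
The sketch below illustrates the subset-wise estimate and the resulting evaluation count on a toy problem. It is a standalone illustration, not the class exported below; the quadratic cost, the `counter` dictionary, and the parameter values are all assumptions.

```python
import numpy as np

param_subsets = np.array([0, 0, 0, 1, 1, 1])  # two subsets of three parameters
theta = np.array([1.0, -2.0, 0.5, 0.1, 0.2, 0.3])
c_k = 0.1
rng = np.random.default_rng(0)

counter = {"evals": 0}  # tracks how often the cost function is called

def F(theta):
    # Hypothetical noise-free cost function
    counter["evals"] += 1
    return np.sum(theta ** 2)

g_hat = np.zeros_like(theta)
for label in np.unique(param_subsets):
    mask = param_subsets == label
    # Perturb only the parameters in this subset; the rest get delta = 0
    delta = np.zeros_like(theta)
    delta[mask] = 2 * rng.integers(0, 2, size=mask.sum()) - 1
    y_plus = F(theta + c_k * delta)
    y_minus = F(theta - c_k * delta)
    # Fill in the slope for the perturbed parameters only
    g_hat[mask] = (y_plus - y_minus) / (2 * c_k * delta[mask])
```

With $n = 2$ subsets, `counter["evals"]` ends at $2n = 4$, compared with $2$ when all parameters are perturbed together.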

### Averaging multiple gradient approximations

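Rather than approximating the gradient from a single perturbation, the argument ```gradient_reps``` can be employed to instead return the average of multiple gradient evaluations. Averaging independent estimates reduces the variance of the result, which may lead to more efficient parameter updates, at the price of proportionally more cost function evaluations.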
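
To illustrate the effect of averaging, the sketch below compares the mean error of single estimates with that of averages over 20 repetitions. The helper functions, the quadratic cost, and the numbers of repetitions and trials are assumptions chosen for illustration only.

```python
import numpy as np

def spsa_gradient(F, theta, c_k, rng):
    """One SPSA gradient estimate; F maps a parameter vector to a scalar."""
    delta = 2 * rng.integers(0, 2, size=theta.shape[0]) - 1
    return (F(theta + c_k * delta) - F(theta - c_k * delta)) / (2 * c_k * delta)

def averaged_spsa_gradient(F, theta, c_k, reps, rng):
    """Average of `reps` independent SPSA estimates."""
    estimates = [spsa_gradient(F, theta, c_k, rng) for _ in range(reps)]
    return np.mean(estimates, axis=0)

def F(theta):
    # Hypothetical noise-free cost; its true gradient is 2 * theta
    return np.sum(theta ** 2)

rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])
true_grad = 2 * theta

def mean_error(reps, trials=200):
    # Mean Euclidean distance between the estimate and the true gradient
    errors = [
        np.linalg.norm(averaged_spsa_gradient(F, theta, 0.1, reps, rng) - true_grad)
        for _ in range(trials)
    ]
    return float(np.mean(errors))

err_single = mean_error(reps=1)
err_averaged = mean_error(reps=20)
```

On this toy problem `err_averaged` comes out smaller than `err_single`, while each averaged estimate costs 20 times as many evaluations of `F`.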


In [None]:
#hide
from nbdev.showdoc import *


In [None]:
#export 
import numpy
import scipy.stats  # import the submodule explicitly; a bare `import scipy` may not load scipy.stats


In [None]:
#export
class SPSAGradient():
    def __init__(self, param_subsets=None):
        self.param_subsets=param_subsets
        if self.param_subsets is not None:
            self.param_subsets=numpy.array(self.param_subsets)
            self.subsets=set(list(param_subsets))
        
    
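    def evaluate(self, cost, theta, c_k, gradient_reps=1):
        """Return the SPSA gradient estimate at theta, averaged over gradient_reps repetitions."""
        #If no subsets were defined, treat all model parameters as a single subset
        if self.param_subsets is None:
            self.param_subsets=numpy.zeros(theta.shape[0])
            self.subsets=set(list(self.param_subsets))
        assert len(theta)==len(self.param_subsets), "theta and param_subsets must have the same length"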
        #evaluate the gradient separately for different groups of parameters
        grad_list=[]
        for rep in range(gradient_reps):
            
            ghat=numpy.zeros(theta.shape)
            for s in self.subsets:
                param_filter=self.param_subsets==s
                ghat+=self.SPSA(cost, theta, c_k, param_filter)
            grad_list.append(ghat)
        if gradient_reps==1:
            return grad_list[0]
        else: #We need to average
            return numpy.mean(grad_list,0)
        
    def SPSA(self, cost, theta, ck, param_ind):
        """ Inputs:
            cost - an object whose evaluate method takes a vector of model
                    parameters and returns a single float
            theta - a 1-d array of model parameters
            ck - the step size to be used during perturbation of the model parameters
            param_ind - a boolean array marking the parameters to perturb;
                    all other parameters are held constant

            Outputs:
            An estimate of the gradient, with zeros for the parameters
            that were not perturbed
        """
        #Draw the perturbation

        delta=2*scipy.stats.bernoulli.rvs(p=.5,size=theta.shape[0])-1
        #zero the perturbation for the parameters not under consideration, so they are held constant
        delta[~param_ind]=0.
        #Perturb the parameters forwards and backwards
        thetaplus=theta+ck*delta
        thetaminus=theta-ck*delta

        #Evaluate the objective after the perturbations
        yplus=cost.evaluate(thetaplus)
        yminus=cost.evaluate(thetaminus)

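        #Compute the slope across the perturbation, only for the perturbed
        #parameters; dividing by delta=0 elsewhere would produce inf/nan warnings
        ghat=numpy.zeros(theta.shape)
        ghat[param_ind]=(yplus-yminus)/(2*ck*delta[param_ind])
        return ghat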

### Testing the gradient

In [None]:
from gradless.costs import CustomCost