# Custom Op Gradients in TensorFlow
In my [forward prop implementation](https://github.com/IdRatherBeCoding/sparse_cnn/blob/master/sparse_cnn.ipynb) for sparse CNNs, I used [tf.py_func](https://www.tensorflow.org/api_docs/python/tf/py_func) to create a custom op that builds $H_\mathrm{out}$ and $Q$ from the sparse representation of the previous layer's activations, $a^{[l-1]}$. The output activations are then computed from $Q$ using TensorFlow's matmul and relu ($g$) ops:

\begin{equation*}
a^{[l]} = g(Q(a^{[l-1]})\cdot W + b).
\end{equation*}

Since we are using TensorFlow ops to compute the matrix product and relu, TensorFlow will handle the derivatives for $g$ and the $Q \cdot W$ product; we only have to implement the gradient of the custom py_func op itself. Specifically, given the gradient of the loss with respect to our function's output, $\frac{\partial L}{\partial Q}$, our gradient function needs to compute

\begin{equation*}
\frac{\partial L}{\partial a^{[l-1]}_{ij}} = \sum_{pq} \frac{\partial L}{\partial Q_{pq}} \frac{\partial Q_{pq}}{\partial a^{[l-1]}_{ij}}.
\end{equation*}
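
To make the sum over $p, q$ concrete, here is a minimal NumPy sketch. It assumes, purely for illustration, that $Q$ only copies entries of $a^{[l-1]}$, so every $\partial Q_{pq} / \partial a_{ij}$ is 0 or 1. The names `build_q`, `build_q_grad` and the index map `idx` are hypothetical and are not taken from the sparse CNN notebook.

```python
import numpy as np

def build_q(a, idx):
    """Toy forward pass: Q[p, q] = a.flat[idx[p, q]] (a pure gather from a)."""
    return a.ravel()[idx]

def build_q_grad(a_shape, idx, dL_dQ):
    """Toy backward pass: route each dL/dQ[p, q] back to the entry of a it was copied from."""
    dL_da = np.zeros(int(np.prod(a_shape)), dtype=dL_dQ.dtype)
    # np.add.at accumulates over repeated indices, which is the sum over p, q.
    np.add.at(dL_da, idx.ravel(), dL_dQ.ravel())
    return dL_da.reshape(a_shape)

a = np.array([[1.0, 2.0], [3.0, 4.0]])
idx = np.array([[0, 1, 1], [3, 0, 2]])  # hypothetical map of flat indices into a
dL_dQ = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])

Q = build_q(a, idx)
dL_da = build_q_grad(a.shape, idx, dL_dQ)
```

An entry of `a` that appears several times in `Q` receives the sum of the corresponding upstream gradients, which is why a scatter-add is used rather than a plain assignment.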

## Gradients of py_func ops
I came across several discussions concerning this ([issue#1095](https://github.com/tensorflow/tensorflow/issues/1095), [SO1](https://datascience.stackexchange.com/questions/12974/tensorflow-how-to-set-gradient-of-an-external-process-py-func), [issue#3710](https://github.com/tensorflow/tensorflow/issues/3710), [SO2](https://stackoverflow.com/questions/38833934/write-custom-python-based-gradient-function-for-an-operation-without-c-imple)), but there doesn't appear to be an official guide specifically for py_func ops.

The [adding an op](https://www.tensorflow.org/extend/adding_an_op#implement_the_gradient_in_python) guide describes how to register a gradient function using the [tf.RegisterGradient](https://www.tensorflow.org/api_docs/python/tf/RegisterGradient) decorator for an op registered in C++. Unfortunately, RegisterGradient only registers functions to ops by type name. Since we're using py_func, the type of our custom op is always PyFunc, so a gradient registered for that type would apply to every py_func op in the graph. From the links above, there are two possible approaches: *Defun* and *gradient_override_map*.

## The Defun approach
This approach is based on [this SO answer](https://stackoverflow.com/questions/38833934/write-custom-python-based-gradient-function-for-an-operation-without-c-imple). It is only [experimental](https://github.com/tensorflow/tensorflow/issues/14080) and [not ready for py_func](https://github.com/tensorflow/tensorflow/issues/10282), as I'll show below.

### Simple example: custom gradient for tf.square

In [1]:
import tensorflow as tf
import numpy as np
from tensorflow.python.framework import function

In [9]:
def squared_back_prop(op, grad):
    return tf.multiply(op.inputs[0] * 2.0, grad)

@function.Defun(tf.float32, python_grad_func=squared_back_prop)
def squared_forward_prop(a):
    return tf.square(a)

In [7]:
tf.reset_default_graph()

x = tf.Variable(tf.constant(np.array([1., 2., 3., 4.]), dtype=tf.float32))
x2 = squared_forward_prop(x)
L = tf.reduce_sum(x2)
dL = tf.gradients(L, [x])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(dL))
    print("error:", tf.test.compute_gradient_error(x, [4], L, [1]))

[array([ 2.,  4.,  6.,  8.], dtype=float32)]
error: 2.15768814087e-05
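
The error figure above is the largest absolute difference between the analytic Jacobian and a finite-difference estimate of it. The NumPy sketch below illustrates the same idea for $L = \sum_i x_i^2$. It is not TensorFlow's implementation: the central-difference scheme, the step size `delta`, and the helper name `numeric_grad` are my own assumptions.

```python
import numpy as np

def numeric_grad(f, x, delta=1e-3):
    """Central-difference estimate of the gradient of a scalar-valued f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = delta
        grad.flat[i] = (f(x + step) - f(x - step)) / (2.0 * delta)
    return grad

def loss(x):
    # Same loss as in the cell above: sum of squares.
    return np.sum(np.square(x))

x = np.array([1.0, 2.0, 3.0, 4.0])
analytic = 2.0 * x  # what squared_back_prop returns when the upstream gradient is all ones
gradient_error = np.max(np.abs(analytic - numeric_grad(loss, x)))
```

A small `gradient_error` indicates that the hand-written gradient agrees with the numerical estimate; the float32 arithmetic in the TensorFlow cell explains why the value reported there is larger than what this float64 sketch produces.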


### Defun example with py_func

In [10]:
def square_numpy(x):
    return np.square(x)

@function.Defun(tf.float32, python_grad_func=squared_back_prop)
def squared_forward_prop_py_func(a):
    return tf.py_func(square_numpy, [a], tf.float32)

In [21]:
x2 = squared_forward_prop_py_func(x)
L = tf.reduce_sum(x2)
dL = tf.gradients(L, [x])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    try:
        print(sess.run(dL))
    except Exception:
        # The run fails; swallow the exception so the notebook can continue.
        pass

KeyError: 'pyfunc_0'

## The gradient_override_map approach
I will use the approach suggested in [issue#1095](https://github.com/tensorflow/tensorflow/issues/1095), and demonstrated in [this gist](https://gist.github.com/harpone/3453185b41d8d985356cbe5e57d67342).

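A custom py_func wrapper is defined that takes the gradient function as an extra argument. The gradient function is registered with tf.RegisterGradient under a randomly generated name, so each wrapped op gets its own registry entry.

Finally, tf.py_func is called inside a *gradient_override_map* context that maps the PyFunc type to that name.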
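
The mechanism can be pictured with a toy model in plain Python. This is not TensorFlow code, and the names `gradient_registry`, `register_gradient`, `ToyOp` and `lookup_gradient` are invented for illustration. It only sketches the idea that gradients are looked up by a type name, and that an override map lets two ops of the same type resolve to different gradient functions.

```python
# Toy model of gradient lookup by op type name (an illustration, not TensorFlow internals).
gradient_registry = {}

def register_gradient(type_name):
    """Register a gradient function under a type name; duplicate names are rejected."""
    def decorator(fn):
        if type_name in gradient_registry:
            raise KeyError("gradient already registered for " + type_name)
        gradient_registry[type_name] = fn
        return fn
    return decorator

class ToyOp(object):
    def __init__(self, op_type, override_map=None):
        self.type = op_type
        # The override in force when the op is created is stored with the op.
        self.grad_type = (override_map or {}).get(op_type, op_type)

def lookup_gradient(op):
    return gradient_registry[op.grad_type]

@register_gradient("PyFuncGradSquare")
def square_grad(x, grad):
    return grad * 2.0 * x

@register_gradient("PyFuncGradCube")
def cube_grad(x, grad):
    return grad * 3.0 * x ** 2

# Both ops have type "PyFunc", yet each resolves to its own gradient function.
square_op = ToyOp("PyFunc", {"PyFunc": "PyFuncGradSquare"})
cube_op = ToyOp("PyFunc", {"PyFunc": "PyFuncGradCube"})
```

In this picture, the unique name matters because a second registration under an existing name would be rejected, which is the reason the wrapper below generates a fresh name for every call.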

In [28]:
from tensorflow.python.framework import ops

# directly taken from https://gist.github.com/harpone/3453185b41d8d985356cbe5e57d67342#gistcomment-2011084
#
# Define custom py_func which takes also a grad op as argument:
def py_func(func, inp, Tout, stateful=True, name=None, grad=None):
    
    # Need to generate a unique name to avoid duplicates:
    rnd_name = 'PyFuncGrad' + str(np.random.randint(0, 1E+8))
    
    tf.RegisterGradient(rnd_name)(grad)  # see _MySquareGrad for grad example
    g = tf.get_default_graph()
    with g.gradient_override_map({"PyFunc": rnd_name}):
        return tf.py_func(func, inp, Tout, stateful=stateful, name=name)

# Actual gradient:
def _MySquareGrad(op, grad):
    x = op.inputs[0]
    return grad * 2 * x  # chain rule: upstream gradient times d(x^2)/dx

# Def custom square function using np.square instead of tf.square:
def mysquare(x, name=None):
    
    with ops.name_scope(name, "Mysquare", [x]) as name:
        sqr_x = py_func(np.square,
                        [x],
                        [tf.float32],
                        name=name,
                        grad=_MySquareGrad)  # <-- here's the call to the gradient
        return sqr_x[0]

In [29]:
tf.reset_default_graph()

x = tf.Variable(tf.constant(np.array([1., 2., 3., 4.]), dtype=tf.float32))
x2 = mysquare(x)
L = tf.reduce_sum(x2)
dL = tf.gradients(L, [x])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(dL))
    print("error:", tf.test.compute_gradient_error(x, [4], L, [1]))

[array([ 2.,  4.,  6.,  8.], dtype=float32)]
error: 4.91291284561e-05


##### Great, that worked. Now let's try with a py_func op for the gradient too.

In [41]:
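def _MyCubeGrad(op, grad):
    name = "MyCubeGrad"
    x = op.inputs[0]
    # Local derivative d(x^3)/dx = 3x^2, computed in numpy via py_func.
    # Note: grad=_MyCubeGrad is only a placeholder for this op's own gradient;
    # it would give wrong results if second-order gradients were requested.
    cube_x_grad = py_func(lambda a: np.power(a, 2) * 3,
                    [x],
                    [tf.float32],
                    name=name,
                    grad=_MyCubeGrad)
    # Chain rule: scale the local derivative by the incoming gradient.
    return grad * cube_x_grad[0]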

def my_cube(x, name=None):
    
    with ops.name_scope(name, "MyCube", [x]) as name:
        cube_x = py_func(lambda a: np.power(a, 3),
                        [x],
                        [tf.float32],
                        name=name,
                        grad=_MyCubeGrad)
        return cube_x[0]

In [42]:
tf.reset_default_graph()

x = tf.Variable(tf.constant(np.array([1., 2., 3., 4.]), dtype=tf.float32))
x3 = my_cube(x)
L = tf.reduce_sum(x3)
dL = tf.gradients(L, [x])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(dL))
    print("error:", tf.test.compute_gradient_error(x, [4], L, [1]))

[array([  3.,  12.,  27.,  48.], dtype=float32)]
error: 4.61935997009e-06


##### Ok, that's all good, now we can implement the gradient of Q.