In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

# Writing a training loop

- If I want to customize the learning algorithm of a model while still leveraging the convenience of fit(), I can subclass Model and implement my own train_step() method, which fit() calls repeatedly.

- If I want very low-level control over training & evaluation, I should write my own training and evaluation loops from scratch.
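
A minimal sketch of the "from scratch" option, assuming a toy linear model with two plain tf.Variables instead of a Keras model. The data, learning rate, and step count are made up for illustration.

```python
import tensorflow as tf

# Toy data for y = 2x + 1 (hypothetical, only for this sketch).
xs = tf.constant([[0.0], [1.0], [2.0], [3.0]])
ys = 2.0 * xs + 1.0

w = tf.Variable(0.0)
b = tf.Variable(0.0)
learning_rate = 0.05

for step in range(500):
  # Forward pass is recorded on the tape.
  with tf.GradientTape() as tape:
    pred = w * xs + b
    loss = tf.reduce_mean((pred - ys) ** 2)
  # Backward pass, then a plain gradient-descent update.
  dw, db = tape.gradient(loss, [w, b])
  w.assign_sub(learning_rate * dw)
  b.assign_sub(learning_rate * db)

print(w.numpy(), b.numpy(), loss.numpy())
```

A real loop would normally iterate over batches and use a Keras optimizer; the manual assign_sub() update is used here only to keep every step visible.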

# Using [GradientTape](https://www.tensorflow.org/api_docs/python/tf/GradientTape)
- [Intro](https://www.tensorflow.org/guide/autodiff)
- [Advanced](https://www.tensorflow.org/guide/advanced_autodiff)


# Auto-differentiation -> backpropagation
# It is so tricky
## what is backpropagation
- https://brilliant.org/wiki/backpropagation/

    Backpropagation, short for "backward propagation of errors," is an algorithm for supervised learning of artificial neural networks using gradient descent. Given an artificial neural network and an error function, the method calculates the gradient of the error function with respect to the neural network's weights. It is a generalization of the delta rule for perceptrons to multilayer feedforward neural networks.


- https://en.wikipedia.org/wiki/Backpropagation

  In machine learning, backpropagation (backprop,[1] BP) is a widely used algorithm for training feedforward neural networks. Generalizations of backpropagation exist for other artificial neural networks (ANNs), and for functions generally. These classes of algorithms are all referred to generically as "backpropagation".[2] In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network for a single input–output example, and does so efficiently, unlike a naive direct computation of the gradient with respect to each weight individually. This efficiency makes it feasible to use gradient methods for training multilayer networks, updating weights to minimize loss; gradient descent, or variants such as stochastic gradient descent, are commonly used. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule; this is an example of dynamic programming.[3]
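
To make "one layer at a time, iterating backward" concrete, here is a hand-written backward pass in plain Python. The two-layer scalar network, its weights, and the squared-error loss are assumptions chosen only for this sketch; the result is checked against central finite differences.

```python
import math

def loss_and_grads(x, target, w1, w2):
  # Forward pass: h = tanh(w1 * x), y = w2 * h, loss = 0.5 * (y - target)^2
  h = math.tanh(w1 * x)
  y = w2 * h
  loss = 0.5 * (y - target) ** 2
  # Backward pass: start at the loss and walk back one layer at a time.
  dloss_dy = y - target
  dloss_dw2 = dloss_dy * h                   # last layer
  dloss_dh = dloss_dy * w2                   # reused by the earlier layer
  dloss_dw1 = dloss_dh * (1.0 - h ** 2) * x  # tanh'(a) = 1 - tanh(a)^2
  return loss, dloss_dw1, dloss_dw2

x, target, w1, w2 = 1.5, 0.3, 0.8, -0.5
loss, grad_w1, grad_w2 = loss_and_grads(x, target, w1, w2)

# Numerical check with central finite differences.
eps = 1e-6
numeric_w1 = (loss_and_grads(x, target, w1 + eps, w2)[0]
              - loss_and_grads(x, target, w1 - eps, w2)[0]) / (2 * eps)
numeric_w2 = (loss_and_grads(x, target, w1, w2 + eps)[0]
              - loss_and_grads(x, target, w1, w2 - eps)[0]) / (2 * eps)

print(grad_w1, numeric_w1)
print(grad_w2, numeric_w2)
```

The intermediate term dloss_dh is computed once and reused for the earlier layer, which is the "avoid redundant calculations" part of the quote.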

## what is automatic differentiation
- https://en.wikipedia.org/wiki/Automatic_differentiation

  In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation, computational differentiation,[1][2] auto-differentiation, or simply autodiff, is a set of techniques to evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program.
  Figure 1: How automatic differentiation relates to symbolic differentiation
  Automatic differentiation is distinct from symbolic differentiation and numerical differentiation. Symbolic differentiation faces the difficulty of converting a computer program into a single mathematical expression and can lead to inefficient code. Numerical differentiation (the method of finite differences) can introduce round-off errors in the discretization process and cancellation. Both of these classical methods have problems with calculating higher derivatives, where complexity and errors increase. Finally, both of these classical methods are slow at computing partial derivatives of a function with respect to many inputs, as is needed for gradient-based optimization algorithms. Automatic differentiation solves all of these problems.
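
A small sketch of the idea in plain Python, assuming forward-mode AD with dual numbers (this is not how TensorFlow does it, TensorFlow uses reverse mode). The Dual class and the test function f are made up for illustration. Each elementary operation carries its own derivative rule, so the result is exact to working precision, while a finite difference has a visible truncation error.

```python
import math

class Dual:
  """A value together with its derivative with respect to one input."""

  def __init__(self, value, deriv=0.0):
    self.value = value
    self.deriv = deriv

  def __add__(self, other):
    other = other if isinstance(other, Dual) else Dual(other)
    return Dual(self.value + other.value, self.deriv + other.deriv)

  __radd__ = __add__

  def __mul__(self, other):
    other = other if isinstance(other, Dual) else Dual(other)
    # Product rule.
    return Dual(self.value * other.value,
                self.deriv * other.value + self.value * other.deriv)

  __rmul__ = __mul__

def sin(d):
  # Chain rule for an elementary function.
  return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)

def f(x):
  return x * x + 3.0 * sin(x)

x0 = 2.0
autodiff = f(Dual(x0, 1.0)).deriv          # seed dx/dx = 1
exact = 2.0 * x0 + 3.0 * math.cos(x0)      # derivative worked out by hand
h = 1e-3
numeric = (f(Dual(x0 + h)).value - f(Dual(x0)).value) / h

print("autodiff error:", abs(autodiff - exact))
print("finite-difference error:", abs(numeric - exact))
```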


Autodiff is what makes backpropagation practical to implement.

To differentiate automatically, TensorFlow has to remember which operations happen in which order during the forward pass.
TensorFlow records the relevant operations executed inside the context of a tf.GradientTape onto a "tape", then uses that tape to compute gradients with reverse-mode differentiation.


TensorFlow -> GradientTape for autodiff, i.e. computing the gradient of a computation with respect to some inputs, usually tf.Variables.
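
A toy illustration of the "tape" idea in plain Python. This is only a sketch of the concept, not TensorFlow's implementation; ToyTape, Value, and gradient are hypothetical names. Every operation appends a backward step during the forward pass, and the gradient is obtained by replaying those steps in reverse order.

```python
class ToyTape:
  """Stores one backward step per operation, in forward order."""

  def __init__(self):
    self.backward_steps = []

class Value:
  def __init__(self, data, tape):
    self.data = data
    self.tape = tape
    self.grad = 0.0

  def __add__(self, other):
    out = Value(self.data + other.data, self.tape)
    def step():
      self.grad += out.grad
      other.grad += out.grad
    self.tape.backward_steps.append(step)
    return out

  def __mul__(self, other):
    out = Value(self.data * other.data, self.tape)
    def step():
      self.grad += other.data * out.grad
      other.grad += self.data * out.grad
    self.tape.backward_steps.append(step)
    return out

def gradient(tape, target):
  # Seed d(target)/d(target) = 1, then replay the tape backwards.
  target.grad = 1.0
  for step in reversed(tape.backward_steps):
    step()

tape = ToyTape()
x = Value(3.0, tape)
y = x * x
z = y * y + x        # z = x^4 + x
gradient(tape, z)
print(x.grad)        # 4 * 3^3 + 1 = 109.0
```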



Tensorflow -> 


# tf.Variable 

## trainable (default : True)

## Scalar case

In [11]:
x = tf.Variable(3.0)
x.trainable

True

### Records operation in context of GradientTape

In [16]:
with tf.GradientTape() as tape :
  y = x ** 2

### GradientTape.gradient(target, sources) -> calculate the gradient of some target (often a loss) relative to some source (often the model's variables)

In [17]:
dy_dx = tape.gradient(y,x)
dy_dx

<tf.Tensor: shape=(), dtype=float32, numpy=6.0>

## More than scalar

### tf.Variable(initial_value = tf.random.normal((3,2)))

  initial_value: A Tensor, or Python object convertible to a Tensor, which is the initial value for the Variable. The initial value must have a shape specified unless validate_shape is set to False. Can also be a callable with no argument that returns the initial value when called. In that case, dtype must be specified. (Note that initializer functions from init_ops.py must first be bound to a shape before being used here.)

### [tf.reduce_mean()](https://www.tensorflow.org/api_docs/python/tf/math/reduce_mean)
  - Computes the mean of elements across dimensions of a tensor.
  - If axis is None, all dimensions are reduced, and a tensor with a single element is returned.
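
A quick check of the axis behaviour on a small 2x2 tensor (the values are made up for this sketch):

```python
import tensorflow as tf

t = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])

mean_all = tf.reduce_mean(t)           # all dimensions reduced, shape ()
mean_cols = tf.reduce_mean(t, axis=0)  # reduce over rows, one mean per column
mean_rows = tf.reduce_mean(t, axis=1)  # reduce over columns, one mean per row

print(mean_all.numpy())   # 2.5
print(mean_cols.numpy())  # [2. 3.]
print(mean_rows.numpy())  # [1.5 3.5]
```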

# with tf.GradientTape(persistent=True) as tape:
  - What does "persistent" mean? -> By default, the resources held by a GradientTape are released as soon as gradient() is called. To compute multiple gradients over the same computation, create a persistent gradient tape.

```python 
x = tf.constant(3.0)
with tf.GradientTape(persistent=False) as g:
  g.watch(x)
  y = x * x
  z = y * y
dz_dx = g.gradient(z, x)  # (4*x^3 at x = 3)
print(dz_dx)

dy_dx = g.gradient(y, x)
print(dy_dx)
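# Output: tf.Tensor(108.0, shape=(), dtype=float32), printed by the first print().
# The second gradient() call then raises the RuntimeError shown below.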
```
```
---------------------------------------------------------------------------



  RuntimeError                              Traceback (most recent call last)
  <ipython-input-39-2429d3837f86> in <module>()
        7 print(dz_dx)
        8 
  ----> 9 dy_dx = g.gradient(y, x)
      10 print(dy_dx)

  /usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/backprop.py in gradient(self, target, sources, output_gradients, unconnected_gradients)
    1030     """
    1031     if self._tape is None:
  -> 1032       raise RuntimeError("A non-persistent GradientTape can only be used to "
    1033                          "compute one set of gradients (or jacobians)")
    1034     if self._recording:

  RuntimeError: A non-persistent GradientTape can only be used to compute one set of gradients (or jacobians)
```
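
For comparison, a sketch of the same computation with persistent=True, where both gradient() calls succeed. Deleting the tape afterwards drops the reference so its resources can be released.

```python
import tensorflow as tf

x = tf.constant(3.0)
with tf.GradientTape(persistent=True) as g:
  g.watch(x)
  y = x * x
  z = y * y

dz_dx = g.gradient(z, x)  # 4 * x^3 = 108.0 at x = 3
dy_dx = g.gradient(y, x)  # 2 * x = 6.0 at x = 3
del g                     # drop the reference to the persistent tape

print(dz_dx.numpy(), dy_dx.numpy())
```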

### [numpy @ operator](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html#numpy.matmul)

-> matrix multiplication

In [18]:
w = tf.Variable(tf.random.normal((3,2)),name = 'w')
b = tf.Variable(tf.zeros(2,dtype = tf.float32),name = 'b')
x = [[1.,2.,3.]]

In [34]:
with tf.GradientTape(persistent=True) as tape:
  y = x@ w + b
  loss = tf.reduce_mean(y**2)

In [35]:
[dl_dw,dl_db] = tape.gradient(loss,[w,b])

In [42]:
grad = tape.gradient(loss,[w,b])
grad

[<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
 array([[ 2.170581 ,  3.5292852],
        [ 4.341162 ,  7.0585704],
        [ 6.5117435, 10.587855 ]], dtype=float32)>,
 <tf.Tensor: shape=(2,), dtype=float32, numpy=array([2.170581 , 3.5292852], dtype=float32)>]

In [43]:
my_vars = { 'w' : w,
           'b' : b}
grad = tape.gradient(loss,my_vars)
grad

{'b': <tf.Tensor: shape=(2,), dtype=float32, numpy=array([2.170581 , 3.5292852], dtype=float32)>,
 'w': <tf.Tensor: shape=(3, 2), dtype=float32, numpy=
 array([[ 2.170581 ,  3.5292852],
        [ 4.341162 ,  7.0585704],
        [ 6.5117435, 10.587855 ]], dtype=float32)>}
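
Each gradient has the same shape as its source. In the output above, every row of the w gradient is the b gradient scaled by the matching entry of x. A sketch with hand-picked weights (an assumption, replacing the random w) makes that easy to verify by hand: with loss = mean(y**2) over two outputs, dl/dy = y, so dl/db = y and dl/dw is the outer product of x and y.

```python
import tensorflow as tf

# Fixed weights instead of tf.random.normal so the numbers can be checked by hand.
w = tf.Variable([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]], name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = tf.constant([[1.0, 2.0, 3.0]])

with tf.GradientTape() as tape:
  y = x @ w + b                  # [[4., 5.]]
  loss = tf.reduce_mean(y ** 2)  # (16 + 25) / 2 = 20.5

dl_dw, dl_db = tape.gradient(loss, [w, b])

print(dl_dw.shape, dl_db.shape)  # same shapes as w and b
print(dl_db.numpy())             # [4. 5.]
print(dl_dw.numpy())             # rows are 1x, 2x and 3x the b gradient
```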

# Gradients with respect to a model

- In most cases, I will want gradients with respect to a model's trainable variables.
- All subclasses of tf.Module aggregate their variables in Module.trainable_variables, so I can pass that property directly as the sources of tape.gradient().

In [74]:
layer = tf.keras.layers.Dense(2,activation = 'relu')
x = tf.constant([[1.,2.,3.]])

### forward pass

In [75]:
with tf.GradientTape() as tape:
  y = layer(x)
  loss = tf.reduce_mean(y**2)


### Calculate gradients


In [76]:
grad = tape.gradient(loss,layer.trainable_variables)
grad

[<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
 array([[0., 0.],
        [0., 0.],
        [0., 0.]], dtype=float32)>,
 <tf.Tensor: shape=(2,), dtype=float32, numpy=array([0., 0.], dtype=float32)>]
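
The all-zero gradients above mean that, for this random initialization, both units had a non-positive pre-activation: ReLU then outputs 0 and passes no gradient back. A sketch that reproduces both cases with a constant kernel (the Constant initializer and the helper name dense_relu_grads are assumptions for illustration, not part of the original cell):

```python
import tensorflow as tf

x = tf.constant([[1., 2., 3.]])

def dense_relu_grads(kernel_value):
  # Hypothetical helper: same layer as above, but with a fixed kernel.
  layer = tf.keras.layers.Dense(
      2, activation='relu',
      kernel_initializer=tf.keras.initializers.Constant(kernel_value))
  with tf.GradientTape() as tape:
    y = layer(x)
    loss = tf.reduce_mean(y ** 2)
  return tape.gradient(loss, layer.trainable_variables)

# Kernel of -1: pre-activation is -6 for both units, so ReLU is inactive.
dead_kernel_grad, dead_bias_grad = dense_relu_grads(-1.0)
# Kernel of +1: pre-activation is +6 for both units, so gradients flow.
live_kernel_grad, live_bias_grad = dense_relu_grads(1.0)

print(dead_kernel_grad.numpy())
print(live_kernel_grad.numpy())
```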

## Controlling what the tape watches

- The default behavior is to record all operations after accessing a trainable tf.Variable.

## A tf.Tensor is not watched by default. To record gradients with respect to a tf.Tensor, I need to call tape.watch()

A trainable variable

In [77]:
x0 = tf.Variable(3.0,name = 'x0')

Not trainable

In [78]:
x1 = tf.Variable(3.0,name = 'x1',trainable = False)

Not a Variable : A variable + tensor returns a tensor

In [79]:
x2 = tf.Variable(2.0,name = 'x2')+ 1.0

Not a variable

In [80]:
x3 = tf.constant(3.0,name = 'x3')

In [81]:
with tf.GradientTape() as tape:
  tape.watch(x2)
  y = (x0**2) + (x1**2) + (x2**2)

In [82]:
grad = tape.gradient(y, [x0,x1,x2,x3])

In [83]:
for g in grad :
  print(g)

tf.Tensor(6.0, shape=(), dtype=float32)
None
tf.Tensor(6.0, shape=(), dtype=float32)
None


## GradientTape.watched_variables

In [84]:
for i in tape.watched_variables():
  print(i.name)

x0:0


## Disable the default behavior of watching all tf.Variables.

- Set watch_accessed_variables=False

In [85]:
x0 = tf.Variable(0.0)
x1 = tf.Variable(10.0)

with tf.GradientTape(watch_accessed_variables=False) as tape:
  tape.watch(x1)
  y0 = tf.math.sin(x0)
  y1 = tf.nn.softplus(x1)
  y = y0 + y1
  ys = tf.reduce_sum(y)

# x0 is not watched, so its gradient is None; d(softplus)/dx1 = sigmoid(x1).
grad = tape.gradient(ys, {"x0" : x0 , "x1" : x1})
grad

{'x0': None, 'x1': <tf.Tensor: shape=(), dtype=float32, numpy=0.9999546>}

## Intermediate results

In [92]:
# Works: a single gradient() call on a non-persistent tape,
# even with several sources (here x and the intermediate y).
#
# x = tf.constant([1, 3.0])
# with tf.GradientTape() as tape:
#   tape.watch(x)
#   y = x * x
#   z = y * y
# grad1 = tape.gradient(z, [x, y])

# Does not work: a second gradient() call on a non-persistent tape raises
# "A non-persistent GradientTape can only be used to compute one set of
# gradients (or jacobians)".
#
# x = tf.constant([1, 3.0])
# with tf.GradientTape() as tape:
#   tape.watch(x)
#   y = x * x
#   z = y * y
# grad1 = tape.gradient(z, x)
# grad2 = tape.gradient(y, x)

# Works: a persistent tape allows several gradient() calls.
x = tf.constant([1, 3.0])

with tf.GradientTape(persistent=True) as tape:
  tape.watch(x)
  y = x * x
  z = y * y
grad1 = tape.gradient(z, x)  # dz/dx = 4 * x**3
grad2 = tape.gradient(y, x)  # dy/dx = 2 * x
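
The cell above shows no output, so here is a sketch that prints the values and also asks for the gradient with respect to the intermediate result y, which is what this section is about. The expected numbers follow from z = y**2 = x**4.

```python
import tensorflow as tf

x = tf.constant([1.0, 3.0])

with tf.GradientTape(persistent=True) as tape:
  tape.watch(x)
  y = x * x
  z = y * y

dz_dx = tape.gradient(z, x)  # 4 * x^3 -> [4., 108.]
dz_dy = tape.gradient(z, y)  # 2 * y   -> [2., 18.], gradient w.r.t. the intermediate y
dy_dx = tape.gradient(y, x)  # 2 * x   -> [2., 6.]
del tape

print(dz_dx.numpy(), dz_dy.numpy(), dy_dx.numpy())
```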


# Notes

- There is a tiny overhead associated with doing operations inside a gradient tape context.

- Gradient tapes use memory to store intermediate results, including inputs and outputs, for use during the backward pass.


## Gradients of non-scalar targets

In [94]:
x = tf.Variable(2.0)
with tf.GradientTape() as tape:
  y0 = x** 2
  y1 = 1/x

tape.gradient({"y0": y0, "y1":y1},x)

<tf.Tensor: shape=(), dtype=float32, numpy=3.75>

In [96]:
x = tf.Variable(2.0)
with tf.GradientTape() as tape:
  y = x * [3.,4.]
  print(y)
tape.gradient(y,x)

tf.Tensor([6. 8.], shape=(2,), dtype=float32)


<tf.Tensor: shape=(), dtype=float32, numpy=7.0>
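
Both results above are sums: 3.75 = 2 * 2 + (-1 / 2**2), and 7.0 = 3 + 4. When the target is not a scalar, gradient() returns the gradient of the sum of its elements. A sketch comparing that with an explicit tf.reduce_sum and with tape.jacobian(), which keeps one entry per target element:

```python
import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
  y = x * [3., 4.]
  y_sum = tf.reduce_sum(y)

grad_of_vector = tape.gradient(y, x)   # gradient of the summed target
grad_of_sum = tape.gradient(y_sum, x)  # same value, sum written out explicitly
per_element = tape.jacobian(y, x)      # one derivative per element of y
del tape

print(grad_of_vector.numpy(), grad_of_sum.numpy(), per_element.numpy())
```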

In [99]:
#x = tf.linspace(-10.0,10.0,200+1)
x = tf.Variable(2.0)

with tf.GradientTape() as tape :
  tape.watch(x)
  y = tf.nn.sigmoid(x)
  print("y : ",y )
dy_dx = tape.gradient(y,x)

dy_dx

y :  tf.Tensor(0.8807971, shape=(), dtype=float32)


<tf.Tensor: shape=(), dtype=float32, numpy=0.104993574>
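
The commented-out tf.linspace line points at the vector version of this cell. For an element-wise operation each output depends only on its own input, so the gradient of the summed target has the same shape as x and holds one derivative per element. A sketch, checked against the known identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)):

```python
import tensorflow as tf

x = tf.linspace(-10.0, 10.0, 200 + 1)

with tf.GradientTape() as tape:
  tape.watch(x)  # x is a plain tensor here, so it must be watched
  y = tf.nn.sigmoid(x)

dy_dx = tape.gradient(y, x)  # same shape as x
analytic = y * (1.0 - y)
max_abs_diff = float(tf.reduce_max(tf.abs(dy_dx - analytic)))

print(dy_dx.shape, max_abs_diff)
```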

## Control flow

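- Control flow statements such as if and while can be used inside a gradient tape context. The tape records operations as they execute, so only the branch that actually runs is recorded.
- The control statements themselves are not differentiable, so they are invisible to gradient-based optimizers.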

In [102]:
x = tf.constant(1.0)

v0 = tf.Variable(2.0)
v1 = tf.Variable(2.0)

with tf.GradientTape(persistent = True) as tape:
  tape.watch(x)
  if x > 0.0 :
    result = v0
  else:
    result = v1 ** 2

dv0, dv1 = tape.gradient(result,[v0,v1])

print(dv0)
print(dv1)

dx = tape.gradient(result,x)

print(dx)

tf.Tensor(1.0, shape=(), dtype=float32)
None
None
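
A sketch that runs the same cell for both signs of x, to show that the gradient follows whichever branch was executed and that x itself never gets a gradient (the helper name branch_gradients is made up for this example):

```python
import tensorflow as tf

def branch_gradients(x_value):
  x = tf.constant(x_value)
  v0 = tf.Variable(2.0)
  v1 = tf.Variable(2.0)
  with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    if x > 0.0:
      result = v0
    else:
      result = v1 ** 2
  dv0, dv1 = tape.gradient(result, [v0, v1])
  dx = tape.gradient(result, x)
  del tape
  return dv0, dv1, dx

pos_dv0, pos_dv1, pos_dx = branch_gradients(1.0)   # takes the v0 branch
neg_dv0, neg_dv1, neg_dx = branch_gradients(-1.0)  # takes the v1 ** 2 branch

print(pos_dv0, pos_dv1, pos_dx)
print(neg_dv0, neg_dv1, neg_dx)
```

In the negative case the gradient with respect to v1 is 2 * v1 = 4.0 and the one for v0 is None; dx is None in both cases because x only selects the branch and never enters the computed result.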
