
In [None]:
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
import zipfile
import random
import time

In [None]:
from sklearn import metrics

### tf gradient

1. `tf.GradientTape` records operations for automatic differentiation, so it can be used to write your own training loop or custom optimization step.
2. By default a tape can be used for only one call to `gradient()`. To compute multiple gradients over the same computation, create a persistent tape with `persistent=True`; this allows multiple calls to `gradient()`.
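
As a rough illustration of the first point, here is a minimal sketch of a hand-written gradient descent step built on `tf.GradientTape`. The toy loss `(w - 5)^2`, the variable `w`, and the learning rate are assumptions chosen only for this example.

```python
import tensorflow as tf

# Hypothetical toy problem: minimise (w - 5)^2 with plain gradient descent.
w = tf.Variable(0.0)
learning_rate = 0.1  # assumed value, for illustration only

for _ in range(100):
    with tf.GradientTape() as tape:
        loss = (w - 5.0) ** 2           # tf.Variable objects are watched automatically
    grad = tape.gradient(loss, w)       # d(loss)/dw = 2 * (w - 5)
    w.assign_sub(learning_rate * grad)  # the hand-written update step

final_w = float(w.numpy())
print(final_w)  # should be very close to 5.0
```

A Keras optimizer such as `keras.optimizers.Adam` replaces the `assign_sub` line with `apply_gradients`, which is what the training loop later in this notebook does.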

#### Automatic differentiation

There are four types of differentiation:
1. manual -> worked out by hand, the way we used to do in college
2. numeric -> approximates the derivative with finite differences, e.g. (f(x + h) - f(x - h)) / (2h) for a small h
3. symbolic -> automated version of manual differentiation
4. automatic -> abstractions that enable you to write a function and efficiently apply the chain rule to it
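
To make the difference between items 1 and 2 concrete, the sketch below compares a hand-derived derivative with a numeric (finite-difference) approximation in plain Python. The function `f` and the step size `h` are arbitrary choices for illustration.

```python
def f(x):
    return x ** 3

def manual_derivative(x):
    # Derived by hand: d/dx x^3 = 3x^2
    return 3 * x ** 2

def numeric_derivative(func, x, h=1e-5):
    # Central finite difference: evaluates func twice, result is only approximate
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 2.0
exact = manual_derivative(x0)
approx = numeric_derivative(f, x0)
print(exact, approx)
```

The numeric result depends on `h`: too large and the approximation is poor, too small and floating-point round-off dominates. Automatic differentiation avoids this trade-off because it applies exact derivative rules to each operation.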

Automatic differentiation (autodiff) refers to a general way of taking a program that computes a value and automatically constructing a procedure for computing derivatives of that value.

There are two versions of automatic differentiation:
1. forward mode
2. reverse mode
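
Forward mode can be sketched in a few lines of plain Python with "dual numbers", where every value carries its derivative along with it. The `Dual` class below is a hypothetical teaching aid, not a TensorFlow API; `tf.GradientTape` itself works in reverse mode, recording operations first and walking them backwards afterwards.

```python
class Dual:
    """A value paired with its derivative (tangent) w.r.t. the input."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: derivatives add
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

    __rmul__ = __mul__


def f(x):
    return 3 * x * x + 2 * x + 1  # f'(x) = 6x + 2


out = f(Dual(4.0, 1.0))  # tangent 1.0 seeds dx/dx = 1
print(out.value, out.tangent)  # 57.0 26.0
```

Forward mode computes the derivative in the same pass as the value, which suits functions with few inputs. Reverse mode is usually preferred in deep learning because a single backward pass yields the gradient of one scalar loss with respect to many parameters.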


In [None]:
tf.executing_eagerly()

True

### Eager execution

1. Eager execution is the default mode in TensorFlow 2.x. PyTorch calls the equivalent idea "dynamic computation graphs" (its default execution method), while `graph execution` corresponds to "static computation graphs".
2. Eager execution does not build a graph to run later: operations return actual values immediately, so TensorFlow calculates the values of tensors as they occur in your code.

``` Eager execution is usually slower than graph execution for large or repeated computations; for tiny functions the graph overhead can dominate. ```
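
A minimal sketch of what "operations return actual values" means in practice, assuming TensorFlow 2.x with eager execution left at its default:

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)   # executed immediately, no session or graph needed

result = b.numpy()    # the concrete value is already available
print(result)
# [[ 7. 10.]
#  [15. 22.]]
```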

### Graph execution

1. Graph execution extracts tensor computations from Python and builds an efficient graph before evaluation. Graphs, or `tf.Graph` objects, are special data structures with `tf.Operation` and `tf.Tensor` objects.
2. While `tf.Operation` objects represent computational units, `tf.Tensor` objects represent data units. 
3. With a graph, you can deploy your model in mobile, embedded, and backend environments where Python is unavailable.

```  graph execution is ideal for large model training ```

``` For small model training, eager execution is better suited. ```
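
To see the `tf.Graph` and `tf.Operation` objects mentioned above, one option is to trace a small function and inspect the result. The function `affine` and its input signature are made up for this sketch.

```python
import tensorflow as tf

@tf.function
def affine(x):
    return 2.0 * x + 1.0

# Tracing with an explicit signature yields a concrete function that owns a graph
concrete = affine.get_concrete_function(
    tf.TensorSpec(shape=[None], dtype=tf.float32))
graph = concrete.graph

# Each node in the graph is a tf.Operation; its `type` names the kernel to run
op_types = [op.type for op in graph.get_operations()]
print(isinstance(graph, tf.Graph), op_types)
```

The exact list of operation types can differ between TensorFlow versions, so treat the printed names as informational.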

<a href="https://towardsdatascience.com/eager-execution-vs-graph-execution-which-is-better-38162ea4dbf6#:~:text=Eager%20execution%20is,they%20occur%20in%20your%20code"> Graph executions vs Eager Execution </a>

##### How does tensorflow work?

1. Every computation in TensorFlow can be described as a directed graph composed of nodes and edges: nodes are operations/functions, and edges are the inputs/outputs flowing into or out of those functions.

2. Inputs and outputs in TensorFlow are called tensors.

3. Before TF 2.0, users had to create sessions manually; from TF 2.0 onward you don't need to create sessions explicitly.

4. In TF 2.x, operations run eagerly as soon as you write them; an execution graph is built when the code is wrapped in `tf.function`.

5. `tf.Variable` is a mutable tensor whose value persists across multiple executions, unlike `tf.constant`, which is immutable.

6. Weights and biases of a model are stored in variables.
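
The difference between a variable and a constant (point 5) can be checked directly. This is a small sketch with made-up values:

```python
import tensorflow as tf

v = tf.Variable([1.0, 2.0])   # mutable: this is how model weights are stored
c = tf.constant([1.0, 2.0])   # immutable tensor

v.assign_add([0.5, 0.5])      # in-place update, the same object keeps its identity
updated = v.numpy().tolist()

# Constants expose no assign methods; "changing" one means creating a new tensor
has_assign = hasattr(c, "assign")
print(updated, has_assign)
```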

##### tf autograph

1. AutoGraph takes in your eager-style Python code and converts it to graph-generating code.
2. AutoGraph converts Python code into pure TensorFlow graph code.

### ` tf.function `

1. You can build models just as you would with eager execution and then run them with graph execution. That's what `tf.function` does.
2. This is done by applying `tf.function` as a decorator (`@tf.function`) or by calling it on an existing function.
3. In TensorFlow 2.x, a Python function decorated with `@tf.function` runs as a graph. With this approach you can build models easily and still gain the benefits of graph execution.
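
One way to observe that a decorated function is traced into a graph once and then reused is to put a Python side effect inside it. The function `square` and the list `trace_log` are illustrative names, not part of TensorFlow.

```python
import tensorflow as tf

trace_log = []  # records how many times the Python body actually runs

@tf.function
def square(x):
    trace_log.append("traced")  # Python side effect: runs only while tracing
    return x * x

square(tf.constant(2.0))
square(tf.constant(3.0))        # same dtype and shape, so the graph is reused
out = square(tf.constant(4.0))

print(float(out.numpy()), len(trace_log))
```

Calling `square` with a different dtype or shape would trigger a new trace, which is one reason graph execution has a noticeable start-up cost on small functions.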

In [None]:
x = tf.ones((2,2))

with tf.GradientTape() as t:
  t.watch(x)
  y = tf.reduce_sum(x)
  z = tf.multiply(x, x)

dz_dx = t.gradient(z, x)

# for i in [0, 1]:
#   for j in [0, 1]:

In [None]:
x = tf.constant(3.0)
with tf.GradientTape(persistent=True) as g:
  g.watch(x)
  y = x**2
  z = y**2

dz_dx = g.gradient(z, x)
dy_dx = g.gradient(y, x)

dz_dx.numpy(), dy_dx.numpy()

(108.0, 6.0)

In [None]:
x = tf.constant(10.0)
with tf.GradientTape() as g:
  g.watch(x)

  with tf.GradientTape() as gg:
    gg.watch(x)
    y = x**2
    dy_dx = gg.gradient(y, x)

  d2y_dx2 = g.gradient(dy_dx, x)

dy_dx.numpy(), d2y_dx2.numpy()

(20.0, 2.0)

In [None]:
x = tf.constant(3.0)
with tf.GradientTape() as g:
  g.watch(x)  # trainable tf.Variable objects are watched automatically; other tensors (like this constant)
              # must be watched manually by calling watch() inside the context manager
  y = x**2
  dy_dx = g.gradient(y, x)
dy_dx.numpy()

6.0

In [None]:
%%time

def eager_function(x):

  result = x**2
  return result

l = tf.constant([1, 2, 3, 4, 5], dtype=tf.float32)
print(eager_function(l))

tf.Tensor([ 1.  4.  9. 16. 25.], shape=(5,), dtype=float32)
CPU times: user 6.59 ms, sys: 842 µs, total: 7.43 ms
Wall time: 9.62 ms


In [None]:
%%time

graph_func = tf.function(tf.autograph.experimental.do_not_convert(eager_function))
print(graph_func(l))

tf.Tensor([ 1.  4.  9. 16. 25.], shape=(5,), dtype=float32)
CPU times: user 13.4 ms, sys: 0 ns, total: 13.4 ms
Wall time: 13.3 ms


In [None]:
import timeit

In [None]:
%%time 

print(f"eager function {timeit.timeit(lambda: eager_function(x), number=100)}")

eager function 0.00692260800042277
CPU times: user 6.82 ms, sys: 68 µs, total: 6.89 ms
Wall time: 8.26 ms


In [None]:
%%time

print(f"graph execution {timeit.timeit(lambda: graph_func(x), number=100)}")

graph execution 0.041643855999609514
CPU times: user 40.9 ms, sys: 4.18 ms, total: 45.1 ms
Wall time: 42.8 ms


In [None]:
inputs = keras.layers.Input(shape=(28, 28))
x = keras.layers.Flatten()(inputs)
x = keras.layers.Dense(128, activation="relu")(x)
x = keras.layers.Dense(128, activation="relu")(x)
outputs = keras.layers.Dense(10, activation="softmax")(x)

sample_input = tf.random.uniform([100, 28, 28])  # avoids shadowing the built-in input()

In [None]:
eager_model = keras.Model(inputs=inputs, outputs=outputs)
print(f"eager time: {timeit.timeit(lambda: eager_model(sample_input), number=10000)}")

graph_model = tf.function(eager_model) # wrapping the model with tf.function
print(f"graph time: {timeit.timeit(lambda: graph_model(sample_input), number=10000)}")

eager time: 20.07116643399968
graph time: 9.569780077999894


In [None]:
x = tf.ones([2,2])

print(x.numpy())

with tf.GradientTape() as t:

  t.watch(x)
  y = tf.reduce_sum(x)
  print(y.numpy())
  z = tf.multiply(y, y)
  print(z.numpy())
  dz_dx = t.gradient(z, x)

dz_dx.numpy()

[[1. 1.]
 [1. 1.]]
4.0
16.0


array([[8., 8.],
       [8., 8.]], dtype=float32)

In [None]:
x = tf.constant(3.0)

with tf.GradientTape(persistent=True) as t:

  t.watch(x)
  y = x*x
  print(y.numpy())
  z = y*y
  print(z.numpy())

dz_dx = t.gradient(z, x)
print(dz_dx.numpy())
dy_dx = t.gradient(y, x)
print(dy_dx.numpy())

9.0
81.0
108.0
6.0


In [None]:
x = tf.Variable(1.0)

with tf.GradientTape() as t:
  with tf.GradientTape() as t2:
    y = x*x*x
  dy_dx = t2.gradient(y, x)
d2y_dx2 = t.gradient(dy_dx, x)

print(dy_dx.numpy())
print(d2y_dx2.numpy())

3.0
6.0


In [None]:
def circle_area(r):

  circle = 3.14 * (r)**2
  return circle

In [None]:
print(tf.autograph.to_code(circle_area))

def tf__circle_area(r):
    with ag__.FunctionScope('circle_area', 'fscope', ag__.ConversionOptions(recursive=True, user_requested=True, optional_features=(), internal_convert_user_code=True)) as fscope:
        do_return = False
        retval_ = ag__.UndefinedReturnValue()
        circle = (3.14 * (ag__.ld(r) ** 2))
        try:
            do_return = True
            retval_ = ag__.ld(circle)
        except:
            do_return = False
            raise
        return fscope.ret(retval_, do_return)



In [None]:
print(tf.autograph.to_graph(circle_area))

<function outer_factory.<locals>.inner_factory.<locals>.tf__circle_area at 0x7f211ad5c8c0>


In [None]:
@tf.function
def circle_area(r):

  area = 3.14 * (r)**2
  return area

In [None]:
print(tf.autograph.to_code(circle_area.python_function))

def tf__circle_area(r):
    with ag__.FunctionScope('circle_area', 'fscope', ag__.ConversionOptions(recursive=True, user_requested=True, optional_features=(), internal_convert_user_code=True)) as fscope:
        do_return = False
        retval_ = ag__.UndefinedReturnValue()
        area = (3.14 * (ag__.ld(r) ** 2))
        try:
            do_return = True
            retval_ = ag__.ld(area)
        except:
            do_return = False
            raise
        return fscope.ret(retval_, do_return)



### Creating a simple model with Tensorflow and tf.GradientTape()

In [None]:
from tensorflow.keras.datasets import mnist

In [None]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [None]:
X_train.shape, y_train.shape

((60000, 28, 28), (60000,))

In [None]:
X_test.shape, y_test.shape

((10000, 28, 28), (10000,))

In [None]:
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1).astype("float32")
X_train = X_train/255.0

In [None]:
X_train.shape

(60000, 28, 28, 1)

In [None]:
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1).astype("float32")
X_test = X_test/255.0

In [None]:
X_test.shape

(10000, 28, 28, 1)

In [None]:
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(10000).batch(64)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(64)

In [None]:
model = keras.Sequential([
    keras.layers.Conv2D(16, (3,3), padding="same", input_shape=(28, 28, 1), activation="relu"),
    keras.layers.BatchNormalization(axis=-1),
    keras.layers.MaxPooling2D(pool_size=(2,2)),

    keras.layers.Conv2D(32, (3,3), padding="same", activation="relu"),
    keras.layers.BatchNormalization(axis=-1),
    keras.layers.MaxPooling2D(pool_size=(2,2)),
    keras.layers.Conv2D(32, (3,3), padding="same", activation="relu"),
    keras.layers.BatchNormalization(axis=-1),
    keras.layers.MaxPooling2D(pool_size=(2,2)),

    keras.layers.Flatten(),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.BatchNormalization(axis=-1),
    keras.layers.Dropout(0.5),

    keras.layers.Dense(10, activation="softmax")])

In [None]:
cce_loss = keras.losses.CategoricalCrossentropy()
adam = keras.optimizers.Adam()

In [None]:
EPOCHS = 10

for epoch in range(EPOCHS):
  print(f"Epoch {epoch}")
  for step, (X_batch, y_batch) in enumerate(train_dataset):  # separate names so X_train / y_train are not overwritten

    with tf.GradientTape() as tape:
      preds = model(X_batch, training=True)
      loss_val = cce_loss(y_batch, preds)

    grads = tape.gradient(loss_val, model.trainable_weights)
    adam.apply_gradients(zip(grads, model.trainable_weights))

    if step%200==0:
      print(f"Training loss at step {step} = {np.round(loss_val, 3)}")
      print(f"Samples seen so far = {(step+1)*64}")

Epoch 0
Training loss at step 0 = 3.5339999198913574
Samples seen so far = 64
Training loss at step 200 = 0.11999999731779099
Samples seen so far = 12864
Training loss at step 400 = 0.15399999916553497
Samples seen so far = 25664
Training loss at step 600 = 0.05299999937415123
Samples seen so far = 38464
Training loss at step 800 = 0.10100000351667404
Samples seen so far = 51264
Epoch 1
Training loss at step 0 = 0.16599999368190765
Samples seen so far = 64
Training loss at step 200 = 0.03200000151991844
Samples seen so far = 12864
Training loss at step 400 = 0.010999999940395355
Samples seen so far = 25664
Training loss at step 600 = 0.06199999898672104
Samples seen so far = 38464
Training loss at step 800 = 0.12700000405311584
Samples seen so far = 51264
Epoch 2
Training loss at step 0 = 0.013000000268220901
Samples seen so far = 64
Training loss at step 200 = 0.009999999776482582
Samples seen so far = 12864
Training loss at step 400 = 0.010999999940395355
Samples seen so far = 25664
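
The loop above runs eagerly. A common variation is to move a single training step into a `tf.function` so that it executes as a graph. The sketch below shows the idea on a small synthetic dataset with a hypothetical two-layer model instead of the MNIST CNN, so that it runs in seconds; names such as `train_step`, `loss_fn`, and `optimizer` are illustrative.

```python
import tensorflow as tf
from tensorflow import keras

tf.random.set_seed(0)

# Synthetic stand-in data (assumption): label is 1 when the features sum to > 0
X = tf.random.normal([256, 8])
y = tf.cast(tf.reduce_sum(X, axis=1, keepdims=True) > 0, tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(32)

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
loss_fn = keras.losses.BinaryCrossentropy()
optimizer = keras.optimizers.Adam(0.01)

@tf.function  # the whole step is traced once and then run as a graph
def train_step(x_batch, y_batch):
    with tf.GradientTape() as tape:
        preds = model(x_batch, training=True)
        loss = loss_fn(y_batch, preds)
    grads = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    return loss

losses = []
for epoch in range(20):
    for x_batch, y_batch in dataset:
        loss = train_step(x_batch, y_batch)
    losses.append(float(loss))  # loss on the last batch of each epoch

print(losses[0], losses[-1])
```

Whether this is faster than the eager loop depends on model size and hardware, in line with the timing comparison earlier in the notebook.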


In [None]:
# def step(X, y):

#   # keep track of gradients
#   with tf.GradientTape() as tape:
#     # make a prediction using the model and calculate the loss
#     pred = model(X)
#     loss = cce(y, pred)
    
#   print(f"Loss = {loss}")

#   # calculate the gradients using tape and update the model weights
#   grads = tape.gradient(loss, model.trainable_variables)
#   opt.apply_gradients(zip(grads, model.trainable_variables))

In [None]:
# model = build_model(28, 28, 1, 10)

In [None]:
# num_updates = int(X_train.shape[0]/64)  # batch size = 64

# for epoch in range(1, 20+1):
#   # show the current epoch number
#   epoch_start = time.time()
#   print(f"The current epoch is {epoch}")

#   for x in range(0, num_updates):
#     start = x*64
#     end = start+64

#     step(X_train[start:end], y_train[start:end])

#   epoch_end = time.time()
#   print(f"took {(epoch_end - epoch_start)} seconds")

In [None]:
model.compile(loss=cce_loss, optimizer=adam, metrics=["accuracy"])

In [None]:
model.evaluate(X_test, y_test)



[0.0371207594871521, 0.9887999892234802]

#### tf distributed training

1. Distribution is not automatic: you have to opt in explicitly.

There are two main approaches to distributed training:
1. data parallelism
2. model parallelism

`model parallelism` works best for models with independent parts of computation that can run in parallel. In other words, different layers (or parts) of the model are placed on different devices or machines.

`data parallelism` works with any model architecture, which makes it much more widely adopted for distributed training. The main idea is that with more GPUs the model sees more data on each training step, so it takes less time to finish an epoch.

      model.fit(X, y, batch_size=(32*num_gpus)) 
      # in this case each GPU gets a slice of the batch and computes gradients on it; those gradients are then averaged before the weights are updated.
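
The gradient-averaging idea behind data parallelism can be simulated on a single machine with NumPy. This is only a sketch of the arithmetic, not of a real multi-GPU setup (in TensorFlow that role is played by `tf.distribute` strategies); the linear model, `num_gpus`, and the data are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
num_gpus = 4                                  # hypothetical device count
X = rng.normal(size=(32 * num_gpus, 3))       # global batch = 32 per "GPU"
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)                               # current weights, identical on every replica

def mse_gradient(X_part, y_part, w):
    # Gradient of mean((X w - y)^2) with respect to w
    residual = X_part @ w - y_part
    return 2.0 * X_part.T @ residual / len(y_part)

# Each "GPU" receives an equal slice of the global batch and computes its own gradient
shard_grads = [mse_gradient(X_s, y_s, w)
               for X_s, y_s in zip(np.split(X, num_gpus), np.split(y, num_gpus))]

averaged = np.mean(shard_grads, axis=0)       # the all-reduce (average) step
full_batch = mse_gradient(X, y, w)            # what one device would compute alone

print(np.allclose(averaged, full_batch))
```

With equal-sized slices, averaging the per-replica gradients gives the same update as processing the whole global batch on one device, which is why the batch size is scaled by the number of GPUs.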

