# Introduction to TensorFlow and Keras

We don't have to code up backpropagation for every possible function or neural network architecture that we want to fit. There are many machine learning libraries that make this task easy and computationally efficient. One of the most popular is [TensorFlow](https://www.tensorflow.org/). It was developed by Google Brain and is now open source under the Apache License 2.0.

(Other popular choices in 2022 are [PyTorch](https://pytorch.org/) and [JAX](https://jax.readthedocs.io/).)

The workflow consists of building a computational graph in which "operations" act on "tensors" and can be automatically differentiated. Since TensorFlow version 2, operations are executed "eagerly" by default, so one can work with tensors in much the same way as with NumPy arrays and typically does not have to worry about building the graph.
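
To make the difference between eager and graph execution concrete, here is a minimal sketch. The `tf.function` decorator is not used elsewhere in this notebook; it is shown here only as the standard way to opt back into graph building, and the names `x_demo` and `affine` are made up for illustration.

```python
import tensorflow as tf

x_demo = tf.constant([1.0, 2.0, 3.0])

# Eager execution: the operation runs immediately and the values are available right away
y_eager = 2.0 * x_demo + 1.0
print(y_eager.numpy())

# tf.function traces the Python function into a graph the first time it is called
@tf.function
def affine(v):
    return 2.0 * v + 1.0

y_graph = affine(x_demo)
print(y_graph.numpy())
```

Both variants should give the same values; the graph version mainly matters for performance in larger computations.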

The TensorFlow website contains a much more [detailed introduction](https://www.tensorflow.org/guide/low_level_intro) if you want to learn more.

## Numpy-like syntax

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

Tensors can be created via `tf.constant` from Python lists or NumPy arrays. Similar to NumPy arrays, they have a `shape` and a `dtype`.

In [None]:
tf.constant([1, 2, 3], dtype=tf.float32)

In [None]:
tf.constant(np.array([1, 2, 3]), dtype=tf.float32)

In [None]:
tf.constant([[1, 2], [3, 4], [5, 6]])

In [None]:
tf.constant([[[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]],
             [[13, 14, 15, 16],
              [17, 18, 19, 20],
              [21, 22, 23, 24]]])
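
The `shape` and `dtype` attributes mentioned above can be inspected directly. The following is a small sketch; the variable names `a` and `b` and the use of `tf.cast` for the type conversion are illustrative choices, not part of the original cells.

```python
import tensorflow as tf

a = tf.constant([[1, 2], [3, 4], [5, 6]])
print(a.shape)  # 3 rows, 2 columns
print(a.dtype)  # Python integers are stored as 32-bit integers by default

# Convert to floating point, e.g. before feeding the values to a neural network
b = tf.cast(a, tf.float32)
print(b.dtype)
```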

There are also convenience functions, e.g. to create equidistant or random values, as well as all sorts of mathematical functions that represent operations on tensors.

In [None]:
tf.random.uniform((10, 2))

In [None]:
t = tf.linspace(0., 2.*np.pi, 10)
t

In [None]:
2 * t

In [None]:
tf.sin(t)

Tensors can be plotted like NumPy arrays:

In [None]:
plt.plot(t, tf.sin(t))

Or explicitly converted via `.numpy()`:

In [None]:
t.numpy()

In [None]:
tf.sin(t).numpy()

## Auto differentiation
The real power comes from tracing operations, which allows gradients to be calculated automatically via backpropagation. This is done with `tf.GradientTape`. By default, gradients are only recorded w.r.t. `tf.Variable` objects, not w.r.t. plain tensors (constants). A `tf.Variable` represents mutable state. This makes sense, since in many cases we want to modify the values w.r.t. which we calculate gradients (e.g. the weights when training a neural network).

In [None]:
t = tf.Variable(tf.linspace(0., 2.*np.pi, 100))
t

We can now calculate the derivative of the `sin` function w.r.t. `t` using `tf.GradientTape` as a context manager:

In [None]:
with tf.GradientTape() as tape:
    f = tf.sin(t)
df = tape.gradient(f, t)

In [None]:
# Note: for plotting, a tf.Variable always has to be converted explicitly via .numpy()
# (not necessary for Tensors/tf.constant)
plt.plot(t.numpy(), f, label="sin(t)")
plt.plot(t.numpy(), df, label="sin'(t)")
plt.legend()

To calculate gradients w.r.t. Tensors (`tf.constant`) instead of `tf.Variable`, use `tape.watch`:

In [None]:
t_const = tf.linspace(0., 2.*np.pi, 100)
with tf.GradientTape() as tape:
    tape.watch(t_const)
    f = tf.sin(t_const)
plt.plot(t_const, f, label="sin(t)")
plt.plot(t_const, tape.gradient(f, t_const), label="sin'(t)")
plt.legend()
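
To see what `tape.watch` changes, the following sketch compares the gradient with and without it. The variable names are chosen for illustration only, and the comparison against `np.cos` is an added sanity check rather than part of the original notebook.

```python
import numpy as np
import tensorflow as tf

x_const = tf.linspace(0.0, 2.0 * np.pi, 5)

# Without tape.watch, operations on a plain tensor are not recorded
with tf.GradientTape() as tape:
    f_unwatched = tf.sin(x_const)
grad_unwatched = tape.gradient(f_unwatched, x_const)
print(grad_unwatched)  # no gradient is available

# With tape.watch, the tensor is tracked like a tf.Variable
with tf.GradientTape() as tape:
    tape.watch(x_const)
    f_watched = tf.sin(x_const)
grad_watched = tape.gradient(f_watched, x_const)

# The derivative of sin is cos, so both should agree up to floating point precision
print(np.allclose(grad_watched.numpy(), np.cos(x_const.numpy()), atol=1e-5))
```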

The computation of the gradient can also be recorded and we can calculate the gradient of the gradient to get the second derivative.

In [None]:
with tf.GradientTape() as tape0:
    with tf.GradientTape() as tape1:
        f = tf.sin(t)
    df = tape1.gradient(f, t)
ddf = tape0.gradient(df, t)

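The two gradient tapes are necessary because, by default, TensorFlow releases the resources held by a tape as soon as `gradient` has been called, so only one gradient can be calculated from it. If you want to record the gradient computation itself on the same tape and call `gradient` more than once, you have to pass `persistent=True`. The following therefore works as well (TensorFlow may warn that calling `gradient` inside the context of a persistent tape is less efficient):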

In [None]:
with tf.GradientTape(persistent=True) as tape:
    f = tf.sin(t)
    # this is inside the with block, so the gradient itself will also be recorded to the gradient tape
    df = tape.gradient(f, t)
# now we can calculate the gradient of the gradient
ddf_alternative = tape.gradient(df, t)

In [None]:
plt.plot(t.numpy(), f.numpy(), label="sin(t)")
plt.plot(t.numpy(), df.numpy(), label="sin'(t)")
plt.plot(t.numpy(), ddf.numpy(), label="sin''(t)")
plt.legend()
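
As a numerical cross-check of the plots, the first and second derivatives of `sin` can be compared against the analytic results `cos` and `-sin`. This is a sketch using the nested-tape approach; the names `x_check`, `outer` and `inner` are illustrative, and the tolerance is an assumption suited to 32-bit floats.

```python
import numpy as np
import tensorflow as tf

x_check = tf.Variable(tf.linspace(0.0, 2.0 * np.pi, 50))

with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        f_check = tf.sin(x_check)
    # computed inside the outer tape, so this computation is recorded as well
    df_check = inner.gradient(f_check, x_check)
ddf_check = outer.gradient(df_check, x_check)

x_np = x_check.numpy()
first_ok = np.allclose(df_check.numpy(), np.cos(x_np), atol=1e-5)
second_ok = np.allclose(ddf_check.numpy(), -np.sin(x_np), atol=1e-5)
print(first_ok, second_ok)
```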

# Keras

The most convenient way to use TensorFlow with neural networks is through [Keras](http://keras.io). It provides a high-level interface that is something of a compromise between very high-level abstractions like scikit-learn and the complete control over every detail that you get when directly using the low-level APIs of libraries like TensorFlow. There is a separate [Keras Documentation](https://keras.io), as well as [Guides](https://www.tensorflow.org/guide/keras), [Tutorials](https://www.tensorflow.org/tutorials/keras), and the [Keras section of the TensorFlow API Documentation](https://www.tensorflow.org/api_docs/python/tf/keras).

Keras is the recommended/default way to work with neural networks in TensorFlow.

## Build a model in Keras

As a quick example, let's again build a model to classify the "Moons" dataset:

In [None]:
from sklearn.datasets import make_moons

In [None]:
x, y = make_moons(n_samples=10000, noise=0.2)

In [None]:
plt.scatter(*x[y==0].T, label="y=0", alpha=0.1)
plt.scatter(*x[y==1].T, label="y=1", alpha=0.1)
plt.legend()

There are three ways to use Keras: via the Sequential API, via the Functional API, or by creating layers and models through subclassing. Let's start with [`Sequential`](https://keras.io/guides/sequential_model/). This is convenient for all models that have just one input and one output tensor with stacked layers in between. Here we use the `Dense` layer, which is precisely the fully connected NN layer that applies the $\sigma(W\mathbf{x} + \mathbf{b})$ operation.

In [None]:
from tensorflow.keras.layers import Dense

model = tf.keras.models.Sequential([
    # Hidden layer with 2 inputs, 16 outputs
    Dense(16, activation="relu", input_shape=(2,)),
    # Output layer with 16 inputs (determined automatically) and 1 output
    Dense(1, activation="sigmoid")
])

How many parameters will our model have? The answer:

In [None]:
model.summary()
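
The number reported by `model.summary()` can be reproduced by hand: a dense layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The sketch below rebuilds the same architecture under a different name (`check_model`) so that it is self-contained; the helper `dense_params` is hypothetical and only serves the illustration.

```python
import tensorflow as tf

def dense_params(n_in, n_out):
    """Number of trainable parameters of a fully connected layer (weights + biases)."""
    return n_in * n_out + n_out

# hidden layer: 2 -> 16, output layer: 16 -> 1
expected = dense_params(2, 16) + dense_params(16, 1)

check_model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

print(expected, check_model.count_params())
```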

We can also access the underlying Tensors if needed:

In [None]:
model.inputs

In [None]:
model.outputs

In [None]:
model.weights

In [None]:
model.layers

In [None]:
model.layers[0].input

In [None]:
model.layers[0].output

Both models and layers are callables, so you can feed them tensors to get transformed outputs. This can be very useful to experiment and understand what transformations are done:

In [None]:
inputs = tf.constant([[1, 2], [3, 4]], dtype=tf.float32)

In [None]:
model(inputs)

In [None]:
layer = Dense(10)

In [None]:
layer(inputs)

In [None]:
layer.weights

In [None]:
tf.matmul(inputs, layer.weights[0])
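
The matrix product above is the output of the layer up to the bias term. The following sketch checks that a `Dense` layer without activation computes exactly `inputs @ kernel + bias`. The names `x_in`, `dense`, `kernel` and `bias` are illustrative, and the code assumes that `layer.weights` lists the kernel before the bias.

```python
import numpy as np
import tensorflow as tf

x_in = tf.constant([[1, 2], [3, 4]], dtype=tf.float32)

dense = tf.keras.layers.Dense(10)
out = dense(x_in)  # the first call creates the weights

# assumption: weights are ordered as (kernel, bias)
kernel, bias = dense.weights
manual = tf.matmul(x_in, kernel) + bias

print(np.allclose(out.numpy(), manual.numpy(), atol=1e-5))
```

Since biases are initialized to zero by default, the bias term only makes a visible difference after training.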

## Train the model

Before we can run the training, we have to "compile" the model. This configures the loss function and the optimization algorithm. You can also pass any loss from [`keras.losses`](https://keras.io/losses) and any optimizer from [`keras.optimizers`](https://keras.io/optimizers) as a string with its name if you want to use it with default parameters. Here we want to use the "Adam" optimizer with an adjusted initial learning rate, so we pass the optimizer object directly.

We could also pass some metrics that we want to monitor during training (in addition to the loss value).

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1), loss="binary_crossentropy")
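
As a sketch of the string shortcuts and the optional metrics, the following compiles a separate toy model with default Adam settings and monitors the accuracy. The random toy data and the names `toy_model`, `x_toy` and `y_toy` are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import tensorflow as tf

# hypothetical toy data: the label is 1 if the first feature is positive
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(256, 2)).astype("float32")
y_toy = (x_toy[:, 0] > 0).astype("float32")

toy_model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# optimizer and loss given by name (default parameters), plus one extra metric
toy_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

toy_history = toy_model.fit(x_toy, y_toy, epochs=2, batch_size=64, verbose=0)
print(sorted(toy_history.history.keys()))
```

The monitored metric then shows up next to the loss in the history object returned by `fit`.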

The API for fitting looks similar to scikit-learn, but has additional options. There is also a [scikit-learn API wrapper](https://www.tensorflow.org/api_docs/python/tf/keras/wrappers/scikit_learn) for Keras if you need that in some context. Note that newer TensorFlow releases have deprecated this wrapper in favor of the separate SciKeras package.

In [None]:
history = model.fit(x, y, epochs=3, batch_size=128)

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.plot(history.epoch, history.history['loss'])

## Run the model

The model can be run using `model.predict` or simply calling it like a function on an input. The main difference is that `model.predict` supports several parameters (like `batch_size`) and returns a numpy array whereas calling the model like a function returns a Tensor.

In [None]:
import numpy as np

In [None]:
grid = np.meshgrid(
    np.arange(x[:,0].min(), x[:,0].max(), 0.1),
    np.arange(x[:,1].min(), x[:,1].max(), 0.1),
)

In [None]:
xy = np.stack([grid[0].ravel(), grid[1].ravel()], axis=1)
xy

In [None]:
model(xy)

In [None]:
model.predict(xy)
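
The difference in return types can be checked directly. This is a minimal sketch with a small untrained stand-in model (`small_model`) and random input points, both of which are made up for illustration.

```python
import numpy as np
import tensorflow as tf

small_model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

points = np.random.rand(5, 2).astype("float32")

as_tensor = small_model(points)                               # returns a Tensor
as_array = small_model.predict(points, batch_size=2, verbose=0)  # returns a NumPy array

print(type(as_tensor).__name__, type(as_array).__name__)
print(np.allclose(as_tensor.numpy(), as_array, atol=1e-5))
```

For small inputs, calling the model directly is usually sufficient; `predict` becomes useful when the data has to be processed in batches.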

In [None]:
scores = model(xy).numpy()

In [None]:
plt.contourf(grid[0], grid[1], scores.reshape(grid[0].shape), cmap="Spectral_r")
plt.colorbar(label="NN output")
opts = dict(alpha=0.1, marker=".", edgecolors="black")
plt.scatter(x[y==0][:,0], x[y==0][:,1], color="blue", **opts)
plt.scatter(x[y==1][:,0], x[y==1][:,1], color="red", **opts)
plt.xlim(grid[0].min(), grid[0].max())
plt.ylim(grid[1].min(), grid[1].max())

<div class="alert alert-block alert-success">
    <p>
        <b>Question 1</b>: Do we need a hidden layer? What would happen if we dropped it?
    </p>
    <p>
        <b>Question 2</b>: What would happen if we used a linear activation function but kept the hidden layer?
    </p>
    <p>
        <b>Question 3</b>: What would happen if we shrank the size of the hidden layer?
    </p>
</div>

## Functional API
https://keras.io/guides/functional_api

The [functional API](https://keras.io/guides/functional_api/) allows building an arbitrary computation graph composed of Keras layers in an abstract way (just specifying input/output shapes, but no data yet). Each layer can be called as a function on an input tensor and returns an output tensor. A model is then built by passing the input and output tensors to the `Model` constructor. This is especially useful when you want to organize the processing into different inputs and different outputs, or if you want to build computation graphs that have branches.

In [None]:
from tensorflow.keras import layers

All models start with one or more `Input` layers:

In [None]:
inp = layers.Input(shape=(2,))
inp

New nodes in the computation graph are then added by calling Layers with their inputs as an argument:

In [None]:
hidden = layers.Dense(16, activation="relu")(inp)
hidden

In [None]:
out = layers.Dense(1, activation="sigmoid")(hidden)
out

To create a model, specify the inputs and outputs in the `Model` constructor:

In [None]:
tf.keras.Model(inputs=[inp], outputs=[out])

Example for a model with 2 inputs and 2 outputs:

In [None]:
def build_model():
    inp1 = layers.Input(shape=(3,))
    inp2 = layers.Input(shape=(5,))
    hidden1 = layers.Dense(16, activation="relu")(inp1)
    hidden2 = layers.Dense(16, activation="relu")(inp2)
    hidden3 = layers.Concatenate()([hidden1, hidden2])
    out1 = layers.Dense(1, activation="sigmoid")(hidden3)
    out2 = layers.Dense(1, activation="linear")(hidden3)
    return tf.keras.Model(inputs=[inp1, inp2], outputs=[out1, out2])

In [None]:
multi_model = build_model()

In [None]:
multi_model([np.random.rand(10, 3), np.random.rand(10, 5)])

To train such models, one would then specify multiple loss functions (one for each output) in `.compile` - the total loss will then be the sum of all losses:

In [None]:
multi_model.compile(loss=["binary_crossentropy", "mean_squared_error"])
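
If the losses should not contribute equally, `compile` also accepts a `loss_weights` argument, in which case the total loss is the weighted sum. The following sketch rebuilds a two-input, two-output model under new names and trains it for one epoch on random data. The weights `[1.0, 0.5]`, the random data and all variable names are arbitrary choices for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

inp_a = layers.Input(shape=(3,))
inp_b = layers.Input(shape=(5,))
merged = layers.Concatenate()([
    layers.Dense(16, activation="relu")(inp_a),
    layers.Dense(16, activation="relu")(inp_b),
])
out_cls = layers.Dense(1, activation="sigmoid")(merged)
out_reg = layers.Dense(1, activation="linear")(merged)
weighted_model = tf.keras.Model(inputs=[inp_a, inp_b], outputs=[out_cls, out_reg])

# total loss = 1.0 * binary cross-entropy + 0.5 * mean squared error
weighted_model.compile(
    optimizer="adam",
    loss=["binary_crossentropy", "mean_squared_error"],
    loss_weights=[1.0, 0.5],
)

# random inputs and targets, only to show the expected data layout
n = 64
xa = np.random.rand(n, 3).astype("float32")
xb = np.random.rand(n, 5).astype("float32")
y_cls = np.random.randint(0, 2, size=(n, 1)).astype("float32")
y_reg = np.random.rand(n, 1).astype("float32")

weighted_history = weighted_model.fit([xa, xb], [y_cls, y_reg], epochs=1, verbose=0)
print(weighted_history.history["loss"])
```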

You can visualize the Graph to see if everything is connected correctly:

In [None]:
tf.keras.utils.plot_model(multi_model, show_shapes=True)

<div class="alert alert-block alert-success">
    <b>Exercise:</b> Create a model for the classification of the "Moons" dataset that outputs both the hidden layer state and the classification output.
</div>

## Subclass API
https://keras.io/guides/making_new_layers_and_models_via_subclassing

For maximum flexibility you can also inherit from `tf.keras.models.Model` or `tf.keras.layers.Layer` and implement your own forward pass. This is very similar to how [PyTorch models are commonly built](https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html).

For both models and layers, the minimum set of methods that you have to implement is `__init__`, where you typically define parameters and any state, and `call`, which contains the forward pass:

In [None]:
class MyDenseReluLayer(tf.keras.layers.Layer):
    
    def __init__(self, n_inputs, n_outputs):
        # call the base class constructor
        super().__init__()
        
        # initialize weights
        self.kernel = tf.Variable(tf.random.uniform((n_inputs, n_outputs)))
        self.biases = tf.Variable(tf.zeros(n_outputs))
        
    def call(self, inputs):
        return tf.nn.relu(tf.matmul(inputs, self.kernel) + self.biases)

Custom layers can be arbitrarily combined with existing layers, e.g.:

In [None]:
composed_model = tf.keras.models.Sequential([
        MyDenseReluLayer(2, 5),
        Dense(1, activation="sigmoid")
])

In [None]:
inputs = tf.constant([[1, 2], [3, 4]], dtype=tf.float32)

In [None]:
composed_model(inputs)

In [None]:
composed_model.summary()

Models can also be used as layers for new models and you can use existing layers as members of custom layers etc.
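
For example, a subclassed model can hold existing `Dense` layers as members and chain them in `call`. The class name `TwoLayerClassifier` and its `n_hidden` parameter are hypothetical; this is a minimal sketch of the pattern, not a recommended architecture.

```python
import tensorflow as tf

class TwoLayerClassifier(tf.keras.Model):

    def __init__(self, n_hidden=16):
        # call the base class constructor
        super().__init__()
        # existing layers used as members; their weights are created on the first call
        self.hidden = tf.keras.layers.Dense(n_hidden, activation="relu")
        self.out = tf.keras.layers.Dense(1, activation="sigmoid")

    def call(self, inputs):
        # forward pass: hidden layer followed by the output layer
        return self.out(self.hidden(inputs))

subclassed_model = TwoLayerClassifier()
example_inputs = tf.constant([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=tf.float32)
example_outputs = subclassed_model(example_inputs)
print(example_outputs.shape)
```

Such a model can be compiled and fitted in the same way as the `Sequential` model above.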

## Visualize hidden layers

For models created with the Sequential or functional API it is easy to create new models that evaluate only part of the computation graph.
Let's use this to visualize the hidden layers of our first neural network in this notebook.

In [None]:
model.summary()

In [None]:
model.layers[0].output

In [None]:
model.input

In [None]:
hidden_output = tf.keras.Model(inputs=[model.input], outputs=[model.layers[0].output])

Let's feed it with a regular grid again for visualization.

In [None]:
step = 0.1
grid = np.meshgrid(
    np.arange(x[:,0].min(), x[:,0].max()+step, step),
    np.arange(x[:,1].min(), x[:,1].max()+step, step)
)

In [None]:
xp = np.stack([grid[0].ravel(), grid[1].ravel()], axis=-1)

In [None]:
hl_out = hidden_output(xp).numpy()

In [None]:
fig, axs = plt.subplots(nrows=4, ncols=4, figsize=(10, 10))
for i in range(16):
    axs.ravel()[i].contourf(grid[0], grid[1], hl_out[:,i].reshape(grid[0].shape))

In [None]:
weights = model.layers[1].weights[0]
bias = model.layers[1].weights[1]
weights, bias

In [None]:
fig, axs = plt.subplots(nrows=16, ncols=2, figsize=(2 * 2, 2 * 16))
# convert the output-layer weights to NumPy once, so the running sum stays a NumPy array
w = weights.numpy()
total = np.zeros_like(hl_out[:, 0])
for i in range(16):
    # add the weighted contribution of hidden node i (bias and sigmoid are left out)
    total += w[i, 0] * hl_out[:, i]
    axs[i, 0].contourf(grid[0], grid[1], hl_out[:,i].reshape(grid[0].shape))
    axs[i, 0].set_title(f"+ {w[i, 0]:.3f} *")
    axs[i, 1].contourf(grid[0], grid[1], total.reshape(grid[0].shape))
    axs[i, 1].set_title("=")
    axs[i, 0].set_axis_off()
    axs[i, 1].set_axis_off()

In [None]:
hl_out.shape

This gives a nice idea of how a NN composes its output by combining the outputs of the previous layer. A nice visualization of this can be seen at https://playground.tensorflow.org/