# TensorFlow basics


## Tensors and compute graph

### Tensors

Tensors can be seen as the multidimensional generalization of matrices. They are the core data type in TensorFlow.
TensorFlow has many NumPy-like functions to create or operate on tensors.

Since TF 2.0, TensorFlow tensors work very similarly to NumPy arrays. The most basic way to create a tensor is `tf.constant`, which accepts a number or a NumPy array. You can also convert a tensor back to an array using `tensor.numpy()`.



Here are a few examples:

In [None]:
import tensorflow as tf
import numpy as np

a = tf.constant(1)
b = tf.constant(np.linspace(0,1,11)) # from numpy array
c = tf.ones([3, 3]) # a 3x3 matrix


print('a:', a)
print('b:', b)
print('c:', c)
print('c as numpy:', c.numpy())

### Graphs and gradients

The power of TensorFlow comes from graphs. Whenever you compute something 
using tensors, TensorFlow can record what you have done by constructing a 
graph of computations.

![static_graph](https://www.tensorflow.org/images/tensors_flowing.gif)

One purpose of doing this is that you can now compute the gradient of 
$y=f(x)$ with respect to $x$, using the chain rule of differentiation. 
So long as $y$ is computed from $x$, you can trace each operation back
from $y$ to $x$.

To do this, you use a `GradientTape` to tell TensorFlow which tensors you are interested
in as $x$. It works like this:

In [None]:
x = tf.constant(5.0)

# Inside the "with" call, we compute y from x
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x ** 2
    
# After we have y, we can have the gradient dy/dx
# which is "recorded" by the tape
dydx = tape.gradient(y, x)

print(dydx)

Does the number make sense? You can try a few more times to see what it does.
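
One way to sanity-check the value is a central finite difference, which needs no TensorFlow at all. The sketch below is only an illustration: the helper name `finite_difference` and the step size `h` are assumptions made for this example, not part of the notebook.

```python
def finite_difference(f, x, h=1e-4):
    """Approximate df/dx at x with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


def f(x):
    # Same function as in the GradientTape example above
    return x ** 2


approx = finite_difference(f, 5.0)
print(f"numerical dy/dx at x=5: {approx:.4f}")
```

The printed value should agree with `dydx` from the tape up to rounding, since the analytical derivative of $x^2$ is $2x$.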

**TASK:** Try to find out:
- Can a tape record many x variables?
  (then we want the partial derivatives)
- Can it compute second order gradients?


## Linear Regression

We now have the power to compute gradients of any function, 
so long as we can compute that function in TensorFlow. 
The natural usage of this is to implement the gradient descent algorithm.
*You have used it before, remember?*

We showcase this with a simple linear regression problem, fitting some data with the function:
$$y_\mathrm{pred} = kx + b$$

We try to minimize the loss function, the mean squared error over the $N$ data points:
$$\mathrm{loss} = \frac{1}{N}\sum_{i=1}^{N} (y_{\mathrm{pred},i} - y_{\mathrm{data},i})^2$$

Suppose the data we'd like to fit looks like this:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
x_data = np.linspace(0,10,20)
y_data = x_data *2 + 1
plt.plot(x_data, y_data, 'rx')

### Gradient descent

The idea is simple: if we know the gradient of the loss function (that is, 
the prediction error) with respect to the parameters,
we can reduce the error by moving the parameters in the direction 
in which the loss function decreases.

![](https://ml-cheatsheet.readthedocs.io/en/latest/_images/gradient_descent_demystified.png)

To do that, we need to know the gradient of the loss function with respect to the parameters.
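
The update rule can be sketched in plain Python on a toy problem. Everything here is an assumption chosen for illustration: the one-parameter loss $(w-3)^2$, its hand-written derivative, and the names `toy_loss`, `toy_grad` and `toy_step` do not appear elsewhere in this notebook.

```python
def toy_loss(w):
    # A simple bowl-shaped loss with its minimum at w = 3
    return (w - 3.0) ** 2


def toy_grad(w):
    # Derivative of toy_loss, written by hand
    return 2.0 * (w - 3.0)


w = 0.0          # starting guess
toy_step = 0.1   # step size (learning rate)
history = []
for i in range(50):
    # Move against the gradient, i.e. downhill
    w = w - toy_step * toy_grad(w)
    history.append(toy_loss(w))

print(f"w after 50 steps: {w:.4f}, final loss: {history[-1]:.2e}")
```

With this step size the loss shrinks at every iteration and `w` approaches 3. In the regression problem below, TensorFlow supplies the gradient, so no hand-written derivative is needed.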

In the following block, two parameters $k$ and $b$ and a loss function (the mean squared error) are defined.
The code calculates the gradient of the loss function with respect to $k$ and $b$, but it is incomplete.
You need to complete it:

**TASK:** 
- Record $k$ and $b$ on the tape so that their gradients can be computed.
- Compute the prediction `y_pred` from `x_data`, `k` and `b`.
- Inspect the gradients. Can you verify that they are correct?

In [None]:
k = tf.constant(4.0)
b = tf.constant(2.0)

# note that we set persistent=True here;
# it allows you to call tape.gradient multiple times
with tf.GradientTape(persistent=True) as tape:
    # You need to complete the code from here
    
    
    
    # to here 
    loss = tf.reduce_mean((y_pred - y_data)**2)

dlossdk = tape.gradient(loss, k)
dlossdb = tape.gradient(loss, b)

We then use the gradients to update the values of $k$ and $b$. In each step, 
we update the values of k and b according to $d\mathrm{loss}/dk$ and $d\mathrm{loss}/db$.

**TASK:**
- In the next code block, compute the derivatives of the loss function inside the `for` loop. (you can reuse your previous code)

In [None]:
k = tf.constant(2.0)
b = tf.constant(2.0)

max_steps = 1000
step = 0.01

for i in range(max_steps):
    # You need to complete the code from here







    # to here
    k = k - step * dlossdk
    b = b - step * dlossdb
    if i%100 == 0:
        print(f'Step={i}, k={k:.2f}, b={b:.2f}')

Use the code block below to see what the resulting model looks like.

Did it work? Remember that gradient descent can become unstable when the step is too large. 
Modify `max_steps` and `step` to see if it helps.

**BONUS**: store the loss value at each step and plot it.

In [None]:
plt.plot(x_data, y_pred)
plt.plot(x_data, y_data, 'rx')

## Neural network

Of course, we are not satisfied with linear regression,
or any regression with a known functional form. 
In real-world scenarios we often cannot even make a good guess of what the function looks like.
This is where neural networks can save the day.


First, let's define our "very complex" function.

In [None]:
x_data = np.linspace(0,10,20)
y_data = np.sin(x_data)
plt.plot(x_data, y_data, 'rx')

### The structure
The idea of a single-layer neural network is that we take the input
variable and transform it into several hidden units with a linear
function followed by an activation function. We then predict the output
as a linear combination of the hidden units.

<img width=800px src="https://miro.medium.com/max/1000/1*_7Om4rgZytZe10fXUZkNCA.png" 
     style="width: 800px;"/>


### The operations

Mathematically, this means you need `n_hidden` $k_1$ values called
"weights" and `n_hidden` $b_1$ values called "biases" to transform the 
input into hidden units, then `n_hidden` $k_2$ values and one $b_2$
to get the output.

Written in the vector form:
- $\vec{\mathrm{hidden}} = \mathrm{tanh}(\vec{k_1}\cdot x+\vec{b_1})$
- $y = \vec{k_2}^\top \cdot \vec{\mathrm{hidden}}+b_2$
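
As a worked illustration, assume `n_hidden = 2` (chosen only to keep the expressions short) and write $k_{1,j}$, $b_{1,j}$, $k_{2,j}$ for the $j$-th components of $\vec{k_1}$, $\vec{b_1}$, $\vec{k_2}$ (an index notation introduced just for this example). The vector form then expands to:

$$\mathrm{hidden}_1 = \tanh(k_{1,1}\,x + b_{1,1}), \qquad \mathrm{hidden}_2 = \tanh(k_{1,2}\,x + b_{1,2})$$

$$y = k_{2,1}\,\mathrm{hidden}_1 + k_{2,2}\,\mathrm{hidden}_2 + b_2$$

This small network has $3 \times 2 + 1 = 7$ parameters: two $k_1$ values, two $b_1$ values, two $k_2$ values and one $b_2$.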


### The activation function
The activation function is an essential part of a neural network: without it, this neural network is just a plain linear regression. It gets its name from the function used to approximate how an extracellular field affects (activates) neurons.

You can see the activation function as a "switch" that controls 
whether a "hidden unit" in the neural network is "ON" or "OFF", depending on its input value. 
The activation function we use here is the *tanh* function, 
which looks like this:

<img width=400px src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Hyperbolic_Tangent.svg/1280px-Hyperbolic_Tangent.svg.png" 
     style="width: 400px;"/>
          

An intuitive explanation of how this works is that tanh is almost linear for inputs roughly within [-0.5, 0.5], and saturates to a constant (close to -1 or +1) when the magnitude of the input is large. This means that a hidden unit only "activates" within a certain region of the input. 
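
The two regimes can be checked numerically with the standard library alone. This is a small illustrative sketch; the sample inputs are arbitrary choices, not values used elsewhere in the notebook.

```python
import math

# A few inputs: small ones (near-linear regime) and large ones (saturated regime)
inputs = [-5.0, -2.0, -0.5, -0.1, 0.0, 0.1, 0.5, 2.0, 5.0]
outputs = [math.tanh(x) for x in inputs]

for x, y in zip(inputs, outputs):
    # x - tanh(x) is tiny near zero and grows once tanh flattens out
    print(f"x = {x:5.1f}   tanh(x) = {y:7.4f}   x - tanh(x) = {x - y:7.4f}")
```

Near zero the output stays close to the input, while for large positive or negative inputs it flattens out near +1 or -1.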

When we have many hidden units, we can fit a complex function by having each hidden unit "activate" in a specific input range. By combining those hidden units, we can approximate a function of almost any shape.
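
To make "combining hidden units" concrete, the sketch below adds two tanh units by hand so that their sum is close to 1 inside an interval and close to 0 outside it. The function name `bump` and the values of `left`, `right` and `sharpness` are assumptions picked for illustration. In the notation above, each unit corresponds to $k_1 = \mathrm{sharpness}$ and $b_1 = -\mathrm{sharpness} \times \mathrm{position}$, with output weights $+0.5$ and $-0.5$.

```python
import math


def bump(x, left=2.0, right=4.0, sharpness=5.0):
    """Sum of two tanh hidden units with output weights +0.5 and -0.5."""
    unit_a = math.tanh(sharpness * (x - left))   # switches on around x = left
    unit_b = math.tanh(sharpness * (x - right))  # switches on around x = right
    return 0.5 * unit_a - 0.5 * unit_b


for x in [0.0, 1.0, 3.0, 5.0, 6.0]:
    print(f"x = {x:.1f}   bump(x) = {bump(x):.3f}")
```

Several such bumps with different positions and heights can be added together to trace out a more complicated curve, which is roughly what gradient descent arranges automatically during training.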

TensorFlow provides a [demo](https://playground.tensorflow.org/) where you can play with neural networks with different activation functions and layouts.

### Implementing a single-layer neural network

Now we'll implement such a neural network.

First we try to compute the hidden layer from $k_1$ and $b_1$.

You may see an error saying that the two shapes are incompatible.
This is because you want to compute the 10 multiplications 
for each of the 20 samples you have. 

You therefore need a 2-d matrix with the shape (20 by 10) to represent the hidden units.

In [None]:
k_1 = tf.random.normal([10])
b_1 = tf.random.normal([10])


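print(x_data.shape, k_1.shape)
try:
    print(x_data * k_1 + b_1)
except (tf.errors.InvalidArgumentError, ValueError) as err:
    # ^^^^^ This fails: shapes (20,) and (10,) are incompatible ^^^^
    print('Error:', err)

print(x_data[:, None].shape, k_1[None, :].shape)
hidden = x_data[:, None] * k_1[None, :] + b_1[None, :]
print(hidden.shape)
# ^^^^ This will work ^^^^^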

The latter lines work because TensorFlow (like NumPy) broadcasts arrays automatically.  
You can read more here: https://www.tensorflow.org/xla/broadcasting.
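
The same broadcasting rule can be seen in isolation with NumPy. This is a minimal sketch; the array names `samples` and `weights` are made up for the example and stand in for the data and the $k_1$ values.

```python
import numpy as np

samples = np.linspace(0, 10, 20)        # shape (20,)
weights = np.arange(10, dtype=float)    # shape (10,)

column = samples[:, None]               # shape (20, 1)
row = weights[None, :]                  # shape (1, 10)
product = column * row                  # broadcast to shape (20, 10)

print(column.shape, row.shape, product.shape)
# Entry [i, j] is sample i multiplied by weight j
print(np.isclose(product[3, 7], samples[3] * weights[7]))
```

Dimensions of size 1 are stretched to match the other operand, so every sample is paired with every weight without writing an explicit loop.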

Now try to complete the following block. Hint: you'll need an activation function like [`tf.tanh`](https://www.tensorflow.org/api_docs/python/tf/math/tanh) and also [`tf.reduce_sum`](https://www.tensorflow.org/api_docs/python/tf/math/reduce_sum) for adding up the hidden unit outputs. Click on the functions to see their documentation if you are not sure.

In [None]:
k1 = tf.random.normal([25])
b1 = tf.random.normal([25])
k2 = tf.random.normal([25])
b2 = tf.random.normal([1])

def neural_net(x, k1, b1, k2, b2):
    # You need to complete the code from here

    # to here
    return y_pred

y_pred = neural_net(x_data, k1, b1, k2, b2)

### Initialization of the weights

You might notice that we initialize the $k$ and $b$ variables with `tf.random.normal`. This means that we draw their values randomly from a normal distribution. 

**TASK:**
- Think about why this is necessary.
- Can we just set $k_1$ to be random? What about $k_2$?

With this you can use our previous gradient descent loop to train your neural network:

In [None]:
max_steps = 1000
step = 0.001

# You don't need to change the code below,
# but feel free to play with the step settings
# and number of hidden units
k1 = tf.random.normal([25])
b1 = tf.random.normal([25])
k2 = tf.random.normal([25])
b2 = tf.random.normal([1])

for i in range(max_steps):
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(k1)
        tape.watch(b1)
        tape.watch(k2)
        tape.watch(b2)
        
        y_pred = neural_net(x_data, k1, b1, k2, b2)
        loss = tf.reduce_mean((y_pred - y_data)**2)
    
    dk1 = tape.gradient(loss, k1)
    db1 = tape.gradient(loss, b1)
    dk2 = tape.gradient(loss, k2)
    db2 = tape.gradient(loss, b2)
    k1 = k1 - step * dk1
    b1 = b1 - step * db1
    k2 = k2 - step * dk2
    b2 = b2 - step * db2
    
    if i%100 == 0:
        print(f'Step={i}, loss={loss:.2f}.')

In [None]:
plt.plot(x_data, y_pred)
plt.plot(x_data, y_data, 'rx')

**TASK:** You might see a bad fit. You can now tune the number of hidden units, as well as  
the step size and number of steps of the gradient descent, to see if the fit improves.

**Note** that you are also allowed to add more points to the training set.

## Testing your neural network

**TASK:** Once you are happy with your neural network, you should test it 
on data that it has not seen. This is called a test (or validation) set.

- Does your neural network work well?
- What if you try to predict beyond the range you have trained it on?  
  (try changing `x_test` to include values below 0)

In [None]:
x_test = np.linspace(0,10,1000)
y_test = np.sin(x_test)
y_pred_test = neural_net(x_test, k1, b1, k2, b2)

plt.plot(x_test, y_test, 'r-')
plt.plot(x_test, y_pred_test, 'k--')

plt.legend(['test set label', 'test set prediction'])

## Deep neural network

Although it can be proven that a single-layer neural network can approximate **any** continuous function (on a bounded interval) given enough hidden units, the rise of neural networks is due to the advance of deep neural networks (so-called deep learning).

![](https://www.researchgate.net/profile/Martin_Musiol/publication/308414212/figure/fig1/AS:409040078295040@1474534162122/A-general-model-of-a-deep-neural-network-It-consists-of-an-input-layer-some-here-two.png)

The idea is rather simple: if transforming the input into a set of non-linear hidden units improves our model compared to a linear one, can we improve it further by transforming the hidden units with yet another hidden layer? Such models have long been conceptually possible, but for a long time we lacked feasible ways to train their parameters.

Since we now have $n_\mathrm{hidden1} \times n_\mathrm{hidden2}$ weights to transform hidden layer 1 into hidden layer 2, the number of parameters grows quickly with the number and width of the hidden layers. 
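
The growth can be made concrete by counting the weights and biases of fully connected networks. The helper `count_parameters` and the layer widths below are hypothetical, chosen only for illustration.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out   # weights connecting the two layers
        total += n_out          # one bias per unit in the next layer
    return total


print(count_parameters([1, 25, 1]))             # → 76, the single-layer network above
print(count_parameters([1, 25, 25, 1]))         # → 726
print(count_parameters([1, 100, 100, 100, 1]))  # → 20501
```

Adding one hidden layer of 25 units takes the count from 76 to 726, because of the $25 \times 25$ weight matrix between the two hidden layers.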

Thanks to the development of libraries like TensorFlow (to compute the gradients of very deep models), improved variants of gradient descent (to update the parameters while avoiding oscillations),
as well as increasing compute power, deep neural networks are now much easier to train (that is, to optimize their parameters). 


**BONUS:** You can now try to implement a multi-layer neural network using what we've learnt so far. It's OK if you don't: TensorFlow provides handy functions to help you create deep neural networks efficiently. After the break, we will build 
a deep neural network with the Keras API.

Here are some tips if you're going to try:
- Mind the dimensions: at this point you will have to do many matrix multiplications, 
  so it is important that you do them properly and that the dimensions match.
- Define a function that does one gradient descent step: `new_params = grad_descent(old_params, x_data, y_data)`.
- After you're done, wrap the above function with the [`@tf.function`
  decorator](https://www.tensorflow.org/api_docs/python/tf/function). TensorFlow will then compile it into a graph, which usually makes it run faster.
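
The last tip can be sketched as follows. The function name `descent_step` and the one-parameter loss $(w-3)^2$ are assumptions made up for this example. It is not the multi-layer network asked for above, only the decorator pattern applied to a single gradient descent step.

```python
import tensorflow as tf


@tf.function
def descent_step(w, step):
    """One gradient descent step on the toy loss (w - 3)^2."""
    with tf.GradientTape() as tape:
        tape.watch(w)              # w is a plain tensor, so it must be watched
        loss = (w - 3.0) ** 2
    grad = tape.gradient(loss, w)
    return w - step * grad, loss


w = tf.constant(0.0)
step = tf.constant(0.1)            # passed as a tensor to avoid retracing
for _ in range(100):
    w, loss = descent_step(w, step)

print(float(w), float(loss))
```

The function is traced into a graph on the first call and later calls reuse that graph. For a step this small the speed-up is negligible, but the same pattern applies to a function that updates all parameters of a deep network.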

In [None]:
# No hints here



