# Introduction #

In the last lesson, we saw how to build and train a linear regression model in Keras. We saw how to build the model with a single `Dense` layer, and how to train the model by choosing an optimizer and a loss function. To develop the full deep learning framework, we'll spend the next two lessons elaborating on these two ideas. In this lesson, we're going to learn how to build *deep* networks by stacking layers, and then in Lesson 3 we'll look more closely at training deep networks and at what your choices there can mean for model performance.

# Deep Neural Networks #

In Keras, we build neural networks using **layers**. Conceptually, a layer is something that transforms data to produce a new, hopefully more informative *representation* of the data. 

There are various ways to adapt a linear model so that it can learn non-linear relationships. One way is to transform features directly -- squaring a feature to get a quadratic relationship, for instance.

What these methods have in common is that they *transform* the data before applying the linear regression. By and large, these classical methods are limited to a single relatively simple transformation. Deep neural networks, however, pass the data through multiple transformations -- sometimes hundreds! This stack of layers is what makes deep learning *deep*.

Often, the stack of hidden layers is called the **body** of the network and the final layer is called the **head**. The job of the body is to transform the data into something that the head can use to solve its problem. The better the body of the network is able to do this, the better the network will perform on its task.

<!-- body and head -->

Neural networks transform data through stacks of neural layers. During training, a neural network will learn how to transform the data in a way that exposes its internal structure to the final layer. A neural network, in other words, learns how to perform its own feature engineering. (In the exercises, you'll get to watch how the layers of a neural network transform some input data step by step.)

# Parts of a Neural Network #

Let's start by describing the components of a neural network.

<!-- labelled network -->

Each circle we call a **neuron** and the lines between the neurons we call the **connections**. The neurons and the connections form what's called a **computational graph**.

This computational graph describes how the input becomes the output. Each neuron outputs a value, which we call its **activation**. We imagine the activations as starting at the input layer and flowing along the connections towards the outputs -- in this case, from left to right.

Each connection also has a value, called its **weight**. A weight is supposed to represent the strength of the association between neurons. When an activation flows through a connection, it gets multiplied by that connection's weight. To find the activation value of a neuron in the next layer, sum all the weighted values arriving along the connections coming into it, and then add the bias. The **bias** is simply an extra constant input we add to a neuron.
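As a rough illustration of that computation, here is a minimal sketch in plain Python. The function name `neuron_activation` and the input, weight, and bias values are made up for this example; they don't come from any network in the lesson.

```python
def neuron_activation(inputs, weights, bias):
    """Sum each incoming activation times its connection weight, then add the bias."""
    total = bias
    for activation, weight in zip(inputs, weights):
        total += activation * weight
    return total


# Three incoming activations and their connection weights (illustrative values only).
incoming = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 2.0]

# 0.5 (bias) + 1.0*0.5 + 2.0*(-1.0) + 3.0*2.0 = 5.0
print(neuron_activation(incoming, weights, bias=0.5))
```

A connection with a negative weight pulls the neuron's value down, while a large positive weight pushes it up.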

<!-- network computation diagram -->

## The Linear Model as a Neural Network ##

Let's adapt the linear model from Lesson 1 into a neural network to get a feel for how this works.

Say we just have a single input $x$ in a regression model like $y = 2x + 1$. The weight here is $2$ and the bias is $1$, so we could draw this as a neural network like:

<!-- one input linear network -->

If we happened to have an input $x=4$, this would flow through the network as an activation like:

<!-- activation flow -->

And the network outputs $9$, exactly the same as the equation: $y = 2(4) + 1 = 9$.
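The same arithmetic can be written as a tiny function. This is only a sketch of the single-input case; `linear_unit` is a name chosen for this example, not something provided by Keras.

```python
def linear_unit(x, weight=2, bias=1):
    """A single neuron with one input: multiply by the weight, then add the bias."""
    return weight * x + bias


print(linear_unit(4))  # 2 * 4 + 1 = 9
```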

<!-- three variable? -->

# Activation Functions #

If we perform a linear transformation on a line, what we get back is just another line. To get other kinds of relationships, we need **activation functions**.
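To see why stacking linear layers alone doesn't help, consider this small sketch. The weights and biases are arbitrary values picked for illustration, and the helper `make_linear` is hypothetical. Feeding one linear unit into another gives exactly the same outputs as a single linear unit with combined weight and bias.

```python
def make_linear(weight, bias):
    """Return a function computing weight * x + bias."""
    def unit(x):
        return weight * x + bias
    return unit


first = make_linear(2, 1)    # 2x + 1
second = make_linear(3, -4)  # 3x - 4


def stacked(x):
    """Pass x through the first unit, then through the second."""
    return second(first(x))


# 3 * (2x + 1) - 4 = 6x - 1, which is again just a line.
combined = make_linear(3 * 2, 3 * 1 - 4)

for x in [-2, 0, 1, 5]:
    print(x, stacked(x), combined(x))
```

Each row prints the same value twice: the two-unit stack is indistinguishable from one linear unit.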

An activation function is simply some function we apply to each of a layer's outputs. In the network diagram, it might look like this:

<!-- activation in network -->

One of the most effective activation functions is the **Rectified Linear Unit**, or **ReLU**.

<!-- graph of ReLU -->

The ReLU function is simply the identity function with the negative part "rectified" to 0: `max(0, x)`.
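Written out in plain Python, the function might look like the sketch below. This is for illustration only; Keras ships its own implementation.

```python
def relu(x):
    """Rectified linear unit: pass positive values through, clamp negatives to 0."""
    return max(0.0, x)


print([relu(x) for x in [-2.0, -0.5, 0.0, 1.5, 3.0]])
# [0.0, 0.0, 0.0, 1.5, 3.0]
```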

It's possible to add an activation function as its own layer:

```
layers.Dense(8),
layers.Activation('relu')
```

More often, though, you'll just include it as part of another layer. Here's how you could apply a `'relu'` activation to the outputs of a `Dense` layer:
```
layers.Dense(8, activation='relu')
```

This single layer does exactly the same thing as the two layers before.

In a regression context, you could think about ReLU as putting a "bend" in the data. A neural network could use the ReLU function to bend some data to fit a curve.
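One way to picture the "bend" is with a hand-built network of two ReLU units. The weights here are picked by hand purely for illustration rather than learned, and `tiny_network` is a hypothetical name. Neither unit can produce a V shape on its own, but their sum traces out the absolute value function, which has a single bend at 0.

```python
def relu(x):
    return max(0.0, x)


def tiny_network(x):
    # Hidden layer: two ReLU units with hand-picked weights 1 and -1, no bias.
    h1 = relu(1.0 * x)
    h2 = relu(-1.0 * x)
    # Output layer: a linear unit that adds the two hidden activations.
    return 1.0 * h1 + 1.0 * h2


for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(x, tiny_network(x))
```

With more hidden units, a network can place more bends and follow a curve more closely.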

<!-- bent data and x^2 -->

There is a whole family of variants of `relu`: `elu`, `gelu`, `selu`, `swish`, all of which you can use in Keras. On some datasets, models seem to perform better with one activation than with another. The ReLU function is a good one to start with, but you could experiment with the others as you develop your models.

<!-- note box -->
There is a second family of activation functions which we might call the "S-shaped" functions. While the ReLU functions are only bounded at the negative end, these functions are bounded at both ends. In the hidden layers, these haven't been as successful as the ReLU type. The **sigmoid** function, however, is often used in the head of a classifier network to convert real-valued numbers into probabilities. We'll see the sigmoid function again in Lesson 6 when we learn about classifiers.
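For reference, the sigmoid function is $1 / (1 + e^{-x})$. The sketch below is a numerically naive version, written only to show that its outputs stay between 0 and 1. It is not how Keras implements it, and `math.exp` would overflow for very large negative inputs.

```python
import math


def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


for x in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    print(x, round(sigmoid(x), 4))
```

An input of 0 maps to exactly 0.5, large negative inputs approach 0, and large positive inputs approach 1.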

# Dense Layers #

There are many kinds of layers you might use when building a neural network. The layers most involved in training are those that define connections between neurons. A `Dense` layer connects each of its neurons to all of the neurons in the layer before. When you create a network with only `Dense` layers, each layer is connected in every possible way to the layer before and after. These networks therefore are often called **fully connected**.

<!-- a fully connected network -->

In Keras, you would define the network above like this:

```
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(units=2, activation='relu'),  # hidden layer with 2 outputs
    layers.Dense(units=3, activation='relu'),  # hidden layer with 3 outputs
    layers.Dense(units=1),                     # linear output layer with 1 output
])
```

Recall from Lesson 1 that you can use `Sequential` to build a network as a stack of layers.
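To make "a stack of layers" concrete, here is a plain-Python sketch of what such a stack computes: each layer's outputs become the next layer's inputs. The helpers `dense` and `forward` are hypothetical names, and the weights are invented for illustration. A real Keras model initializes its weights randomly and learns them during training.

```python
def relu(x):
    return max(0.0, x)


def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: every unit sees every input.

    `weights` holds one list of connection weights per unit.
    """
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = bias + sum(w * a for w, a in zip(unit_weights, inputs))
        outputs.append(activation(total) if activation else total)
    return outputs


# (weights, biases, activation) for layers with 2, 3, and 1 units.
network = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0], relu),
    ([[1.0, 1.0], [-1.0, 0.0], [0.0, 2.0]], [0.0, 0.0, -1.0], relu),
    ([[1.0, 1.0, 0.5]], [0.5], None),  # linear head
]


def forward(x, stack):
    """Feed x through each layer in order."""
    for weights, biases, activation in stack:
        x = dense(x, weights, biases, activation)
    return x


print(forward([1.0, 2.0], network))  # [5.0]
```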

With the `units` parameter you specify how many outputs you want the layer to produce. The number of inputs is determined by the layer before.
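One consequence is that the number of weights in a `Dense` layer depends on both its `units` and the size of the layer before it: each unit has one weight per input plus one bias. The sketch below counts these for layers of 2, 3, and 1 units. The choice of 2 input features is an assumption made for this example, and `dense_param_counts` is a hypothetical helper.

```python
def dense_param_counts(n_features, layer_units):
    """Return the number of weights and biases in each fully connected layer."""
    counts = []
    n_inputs = n_features
    for units in layer_units:
        # One weight per (input, unit) pair, plus one bias per unit.
        counts.append(n_inputs * units + units)
        # This layer's outputs are the next layer's inputs.
        n_inputs = units
    return counts


counts = dense_param_counts(n_features=2, layer_units=[2, 3, 1])
print(counts, sum(counts))  # [6, 9, 4] 19
```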

# Example - A Deep NN in Keras #

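```
from tensorflow import keras
from tensorflow.keras import layers

# Two hidden layers of 4 ReLU units each, followed by a linear output layer.
model = keras.Sequential([
    layers.Dense(4, activation='relu'),
    layers.Dense(4, activation='relu'),
    layers.Dense(1),
])

# Stochastic gradient descent with mean squared error, as in Lesson 1.
model.compile(
    optimizer='SGD',
    loss='mse',
)

# `x` and `y` are assumed to already hold the training features and targets.
history = model.fit(x=x, y=y, epochs=20)
```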

# Conclusion #

# Your Turn #