# Custom Models, Layers, and Loss Functions with TensorFlow
## TensorFlow: Advanced Techniques Specialization
## By Karim Elshetihy
---

Welcome to this course on Advanced TensorFlow. You will explore scenarios where you can go beyond basic Keras layer definitions when creating ML models, by exploring custom loss functions, callbacks, layers, and models. But to kick it all off, you'll look at a new way of coding your layers. It's called the functional API.

## Functional APIs

### Sequential Models
To explain how the functional API works in TensorFlow and Keras, it's probably easiest to use an example. In this case, we'll take a simple model architecture that you might already be familiar with: the one for MNIST or Fashion MNIST. There you build a classifier for data that's 28-by-28 pixels in shape and has 10 classes.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

seq_model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

Here's a look at the code that defines the model architecture. You create a `Sequential`, and this contains a list of the layer definitions. The return value from `Sequential` is the model itself. You can say that your model equals `Sequential`, and within the parameters of `Sequential` you'll define each layer. For example, with a DNN, the densely connected layers are defined as `Dense`. The first layer is `Flatten`, and this takes the 28-by-28 image and flattens it to a one-dimensional array, which can then be fed to the dense network. Next is the first layer of dense neurons. We've specified that there'll be 128 of them, with each neuron having ReLU as an activation function. The final layer is 10 neurons, with each neuron representing one of the 10 classes. Its activation is a softmax, and the softmax helps us to identify which of the 10 classes is most likely for the given image.
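To make the role of the softmax concrete, here is a minimal plain-Python sketch of what the final layer's activation does to ten raw scores. The scores are made up for illustration, and `softmax` is a hand-written helper, not the Keras implementation.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # and does not change the resulting probabilities.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the 10 classes (illustrative values only).
scores = [0.1, 2.0, -1.0, 0.3, 0.0, 4.0, 1.2, -0.5, 0.7, 0.2]
probs = softmax(scores)

# The predicted class is the index with the highest probability.
predicted_class = probs.index(max(probs))
print(predicted_class, round(sum(probs), 6))
```

The class with the largest raw score also gets the largest probability, so the softmax does not change which class wins; it only rescales the scores into a distribution.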

Now that you've had a brief recap on a sequential architecture, let's take a look at what it would take to define the same model architecture using the functional API.

### Functional API Model

When creating a model architecture with the functional API, you follow three steps. 
1. First, explicitly define an input layer. This is probably the biggest deviation from the sequential one.
2. Once you have this, you can then define the layers as before, connecting each layer using Python functional syntax, and that's what gives the API its name. Python functional syntax is when you specify that the current layer is a function and the previous layer is a parameter to that function. That sounds a little confusing for now, but you'll see how this looks in just a moment.
3. You then define the model by calling the `Model` object and giving it the input and output layers. I know this has gone by pretty fast, so let's step through each of these steps in a little more detail.

In [None]:
from tensorflow.keras.layers import Input, Dense, Flatten

input_ = Input(shape=(28, 28))
x = Flatten()(input_)
x = Dense(128, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

We'll start with defining the input. The code to define the input looks like this. It's a new layer type that we haven't shown in any of the previous courses, and it's called `Input`. You import it from `tensorflow.keras.layers` and then declare an input object using it. This needs the shape of the input data, and in the case of MNIST that we saw earlier, it's a 28-by-28 image, hence we set the `shape` parameter to be a 28-by-28 tuple. Next comes Step 2, where you define the layers of the model. This should look very similar to the sequential API that we saw earlier.

The important thing to note is that there's no list of layers like with the sequential declaration. Instead, when you define a layer, you connect it to the previous one by specifying that previous layer in parentheses after the declaration of the new layer. For example, earlier you defined the input. The next layer will be a `Flatten`, and you specify that the `Flatten` follows the input layer like this. The output of the `Flatten` layer is stored in a variable named `x`. If you want to add a dense layer after the `Flatten` layer, you then specify a `Dense` layer and put the variable `x` in parentheses after it to specify that it's the next one, and you continue defining them like this. Our model started by flattening the 28-by-28 input and storing the result in the variable `x`. Then we create a 128-neuron dense layer that follows this flattened layer `x`. Notice that we're reusing the variable `x` to store this dense layer. Now, you can choose unique variables to store each layer, but reusing the same variable is fine too. Finally, we'll create a predictions layer that follows the previous dense layer, thus defining a model that starts with an input, is then flattened, then fed into a dense layer, and then another dense layer in order to get the predictions. Now you've seen an example of functional syntax. The return value of each function is passed into the next function using parentheses.

In [None]:
from tensorflow.keras.models import Model

func_model = Model(inputs=input_, outputs=predictions)

You've seen the first two steps in declaring a model using the functional syntax: you first define the inputs, then you define the model architecture. The final step is to put it all together into a model object, which you do by specifying the input and output layers. Let's take a look at the code. We first have to import the `Model` object, and we can then call this to get an instance of a model. The input layer we created earlier was called `input_`. We specify that the input layer is this one using the `inputs=` parameter. Our output layer is called `predictions`, so we can specify that the `outputs` parameter equals `predictions`, and that's it. It's very similar to the sequential API. But by having the named layers like this, we have some flexibility that isn't available when you use `Sequential`. You're going to learn more about this later.

In [None]:
input_ = Input(shape=(28, 28))
x = Flatten()(input_)
x = Dense(128, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

func_model = Model(inputs=input_, outputs=predictions)

Here's the complete code for the model architecture. Remember that you start by defining the inputs, then you define the layers in your neural network in a very similar way to the sequential model. But instead of having them in a list, you define the sequence by using the functional syntax shown here, where the previous layer is in parentheses after the current one. You then define your output layer, which in this case is stored in a variable called `predictions`, before finally creating a model by calling the `Model` object and telling it your input and output layers. That's it for your first look at the functional API.

## Using the Functional APIs
### Declaring and Stacking Layers
Previously, you saw how you can take an existing sequential model and turn it into a functional model with a little modification. One key difference with the functional model is that you define an explicit input layer and output layer that you pass to the model constructor. Your code looked a little bit like this:

In [None]:
import tensorflow as tf

def build_model_with_functional():
    input_layer = tf.keras.Input(shape=(28, 28))
    flatten_layer = tf.keras.layers.Flatten()(input_layer)
    first_dense = tf.keras.layers.Dense(128, activation=tf.nn.relu)(flatten_layer)
    output_layer = tf.keras.layers.Dense(10, activation=tf.nn.softmax)(first_dense)

    func_model = tf.keras.Model(inputs=input_layer, outputs=output_layer)
    return func_model

Here you defined your input layer on the first line and then indicated that the next layer, the flatten layer, would follow it by placing the input layer after that flatten layer in parentheses. The flatten layer would similarly be indicated as preceding the next layer, a dense layer, by placing it in parentheses after the definition of that dense layer. In turn, that dense layer would be defined and would precede the output layer by having it specified as such on the following line. The output layer is the final one we declare, and we'll use it in the model constructor to say that this is the output. Additionally, the input that we created at the top will be defined as the input layer.

Architecting like this might lead you to a few questions. The first is that it seems a little unusual to use that style of coding. When I define a layer, I place the preceding layer in parentheses after the definition? That seems a little counter-intuitive, and this double parentheses syntax can look a little mysterious. It actually comes from standard Python syntax: it's a shortcut for calling a function-like object without first storing it in a named variable. Let me show you an example. Here you can see, if you consider this line,
```
first_dense = tf.keras.layers.Dense(128, activation=tf.nn.relu)(flatten_layer)
```
we're defining the first dense layer of 128 neurons. We use the double bracket syntax like this to say that this layer should follow the layer that's called `flatten_layer`. The double bracket syntax is merely a shortcut for this type of code.
```
first_dense = tf.keras.layers.Dense(128, activation=tf.nn.relu)  # the layer object
first_dense(flatten_layer)  # calling the layer on a tensor returns its output tensor
```
Here, `tf.keras.layers.Dense` returns an object that we store in the variable named `first_dense`. Then we call the `first_dense` object as a function and pass it the flatten layer. We could have written the code in this way, but the one-line version is just done for convenience. Now that you've seen the syntax for declaring how layers in your network are put into sequence, it's time to think about more complex model architectures.
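The same double parentheses pattern works for any Python object that defines `__call__`, which is what makes a layer callable like a function. The sketch below uses a made-up `AddBias` class as a stand-in for a layer; it is not part of Keras and exists only to show the two equivalent spellings.

```python
class AddBias:
    """Hypothetical stand-in for a layer: configured once, then called on an input."""

    def __init__(self, bias):
        # First pair of parentheses: configure the object.
        self.bias = bias

    def __call__(self, values):
        # Second pair of parentheses: apply the object to an input.
        return [v + self.bias for v in values]

# Two-step version: store the object, then call it.
add_one = AddBias(1)
two_step = add_one([1, 2, 3])

# One-line version: construct and call in a single expression.
one_line = AddBias(1)([1, 2, 3])

print(two_step, one_line)
```

Both versions produce the same result; the only difference is whether the configured object is kept in a named variable for later reuse.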

### Branching Models
Now that you've seen how to use the functional API to declare layers and stack them, the next question you might have is: do I have to code up all of these layers in a specific order, like with the sequential API? In other words, do I have to code up the input layer, then immediately after that the flatten layer, then on the next line the next layer, and so on? Thankfully, the answer is no: you can actually code the layers out of sequence if you want. And this opens a really interesting possibility: branched models, where instead of each layer following another layer in a linear stack, you can define a model architecture where you split the model into different paths and merge them later. For example, consider the famous Inception model. It's not a direct linear model from layer to layer.

<img src='Images/inceptionv3.png'>

It starts out that way, with the early layers being sequential, before branching into four separate parallel paths here. And they then get merged together here, before branching out again. 

In pseudocode that might look a little bit like this. We create a layer, for example a dense layer containing 32 neurons. We then define four layers that all use the double bracket syntax to declare that all four of these layers follow the same layer, which is layer 1. And then we can use the `Concatenate` layer type, which we'll learn more about later, to merge them together by specifying them as a list that gets passed to `Concatenate`. You may have also noticed something interesting when looking at the parameters of the `Model` object. The parameters are `inputs`, which is the plural of input, and `outputs`, which is the plural of output.

<img src='Images/branching model.png'>

In [None]:
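from tensorflow.keras.layers import Input, Dense, Concatenate

input_layer = Input(shape=(32,))  # assumed input shape, for illustration only
layer1 = Dense(32)(input_layer)
layer2_1 = Dense(32)(layer1)
layer2_2 = Dense(32)(layer1)
layer2_3 = Dense(32)(layer1)
layer2_4 = Dense(32)(layer1)

merge = Concatenate()([layer2_1, layer2_2, layer2_3, layer2_4])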

And you can see an example of that from earlier here; note that the parameter names are plural. Now you might be wondering, does that mean that we could have multiple inputs and multiple outputs?

Well, the answer to that is yes, and you can do that by specifying them as a list, a bit like this. Note the square bracket syntax, which indicates a Python list. You could have defined multiple inputs, so in order to tell the model that you'll be using them, you simply list them out as shown. Hopefully, these questions and their answers will help you understand the flexibility that the functional API gives you compared to the sequential API. Next, we're going to explore some options that are now available to you when you're using the functional API.
```
func_model = Model(inputs=[input_layer1, input_layer2], outputs=[output_layer1, output_layer2])
```

Read more about the Inception Model Architecture [here](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202)

#### Summary of the Article


## Creating Multioutput Architecture
### Creating a Multioutput Model
Now that you've seen how the functional API opens up new scenarios that weren't possible with the sequential API, including parallel layers with splitting and concatenating as well as multiple inputs and outputs, let's explore some scenarios where these could be used in models. Let's start with an example around energy efficiency. The UCI datasets repository has an excellent dataset to get started with. It's called the Energy Efficiency data set. It's small enough that you can prototype and learn with it quickly. You can find it at this URL.

Energy efficiency Data Set [Link](https://archive.ics.uci.edu/ml/datasets/Energy+efficiency)

The data has 8 features, listed X1 to X8, and two labels, Y1 and Y2. This makes it perfect for us to create a multi-output model. So instead of two separate models to predict Y1 and Y2, we can do it all in the same model, and maybe choose one sequence of layers to predict the Y1 output and a different sequence of layers to predict the Y2 output. For example, you may decide to include an extra layer before the data flows to the Y2 output. We'll start with our input layer that takes in the features. We know that there are 8 features, but we don't know how many data items of 8 features there are. So you can see that the shape is denoted by the syntax `(?, 8)`, meaning that the network knows that it will have a number of 8-feature inputs, but it doesn't yet know how many. This is followed by 2 dense layers, each of 128 neurons, connected in sequence. After that is where we can split our network. The Y1 output can be retrieved directly from the second dense layer, which is called `dense_4` in this case, while the other branch will go through another dense layer first for further deep learning before the Y2 output is made available. Thus, a single model can be designed to predict multiple outputs.

<img src='Images/multioutput model.png'>

Let's now explore the code:

In [None]:
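from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# `train` is the training DataFrame holding the 8 feature columns
input_layer = Input(shape=(len(train.columns),))
first_dense = Dense(128, activation='relu')(input_layer)
second_dense = Dense(128, activation='relu')(first_dense)

# Y1 branches off directly after the second dense layer
y1_output = Dense(units=1, name='y1_output')(second_dense)
# Y2 passes through one more dense layer first
third_dense = Dense(64, activation='relu')(second_dense)
y2_output = Dense(units=1, name='y2_output')(third_dense)

func_model = Model(inputs=input_layer, outputs=[y1_output, y2_output])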

Before looking at how this shows up in the architecture, one quick note. In the visual of the architecture, Keras assigns default names with numbers to each layer, such as `input_2`, `dense_3`, and `dense_4`. When I refer to the first dense or second dense layer, I'll be referring to the variable names in the code, and not to the default names generated in the architecture diagram.

First of all is the input. We specify that we want it to be shaped based on the training data, which has 8 input, or X, columns. Then we can see that the input is specified to have `?` items, each of which has 8 features. We'll then see our first dense layer, which has 128 neurons. So in the architecture we see that the input to it has 8 features, but the output, because of the 128 neurons, is now shaped at 128. Similarly, for the next layer, we can see that there are 128 neurons, so the input and output will be the same shape. And here's where it begins to get interesting. The first output, for Y1, is fed by the layer named `second_dense` in the code, giving an output of 1 dimension because this is a regression problem. Y2 takes a longer path, so we don't have its output yet. We define another dense layer, which we'll call `third_dense`, which gets its input from `second_dense`, the same dense layer that feeds the Y1 output. But it only has 64 neurons, so it has 64 in its output shape. And we can see the output for Y2 here, taking the output from the third dense layer with 64 neurons and combining it into a single value.
<img src='Images/multioutput model.png'>

And this was just the model architecture part of the code. There's a lot more, in particular getting the data from UCI, which is an Excel spreadsheet, and preparing it for training, as well as plotting the results. Next, I'll show a screencast of this multi-output model in action.
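As a rough check on the layer shapes described in the walkthrough, the number of trainable parameters in each dense layer can be worked out by hand: each unit has one weight per input plus one bias. The sketch below is plain Python and assumes the layer sizes from the walkthrough (8 features, two 128-neuron layers, a 64-neuron third layer, and two single-unit outputs). `dense_params` is a helper name introduced here for illustration.

```python
def dense_params(n_inputs, units):
    """Trainable parameters in a dense layer: weights plus one bias per unit."""
    return n_inputs * units + units

n_features = 8  # columns X1..X8

layer_params = {
    "first_dense": dense_params(n_features, 128),
    "second_dense": dense_params(128, 128),
    "y1_output": dense_params(128, 1),    # branches off second_dense
    "third_dense": dense_params(128, 64),  # extra layer on the Y2 path
    "y2_output": dense_params(64, 1),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name}: {count}")
print("total:", total_params)
```

Under these assumptions, the counts should match what `func_model.summary()` reports for a model with the same layer sizes, which makes this a quick way to confirm that the branches are wired as intended.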

## Siamese Network
### Multiple Input Model
Previously, you saw an example of a multi-output model, where using the functional API, you could define a model that had multiple values predicted on different output layers. 

Now, let's explore a network that has multiple inputs and a particular type of machine learning architecture called a Siamese network. A Siamese architecture looks like this. You have two inputs, in this case two input images, which are processed by two sub-networks that have the same base neural network architecture. We can measure the Euclidean distance between the output vectors of these networks to predict how similar or how different these two input examples are. For example, say we want to have a model that measures the amount of difference between two data items. In this case, it's images from Fashion-MNIST. We would pass one image into one side of the network and the other image into the other side of the network, and then each sub-network will output a vector that represents its input image. A mathematical operation called Euclidean distance can then be used to calculate the amount of difference between these two output vectors to tell us if they are similar or not. When training the model, we feed it pairs of images with a label that specifies if they're similar or not, and the model will then learn from this.

<img src='Images/Siamese Network.png'>

You'll see this shortly. The architecture comes from a number of different papers, and I've referred to some of them here. Fashion-MNIST comes with lots of data, but it's not in pairs. We have to write some code to preprocess it into pairs, and then label those pairs for similarity or dissimilarity based on their labels. The pair of images on the left, both of which are shirts, will be labeled as one because they're similar with this dictated by the fact that they have the same label, and the pair of images on the right will be marked as dissimilar because their labels are different and zero indicates dissimilarity. If we look back at the architecture of the Siamese network, you can see that these two parts of the architecture are supposed to be the same, with the same structure and the same weights. Let's create them as something that we're going to call the base network. Now that you've seen a complex architecture that has multiple inputs that can implement a Siamese network, the next video will show you how to code this with the functional APIs.

<img src='Images/Siamese Network Out.png'>

- [Learning a Similarity Metric Discriminatively, with Application to Face Verification](http://yann.lecun.com/exdb/publis/pdf/chopra-05.pdf) (Chopra, Hadsell, & LeCun, 2005)

- [Similarity Learning with (or without) Convolutional Neural Network](http://slazebni.cs.illinois.edu/spring17/lec09_similarity.pdf) (Chatterjee & Luo, n.d.)
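The pairing step described above (turning individually labeled items into labeled pairs) can be sketched in plain Python. This is a minimal illustration, not the course's preprocessing code: `make_pairs` and the toy item names are hypothetical, and each item simply contributes one similar pair and one dissimilar pair when such partners exist.

```python
import random

def make_pairs(items, labels, seed=0):
    """Build (item_a, item_b) pairs labeled 1 for same class, 0 for different."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Group item indices by class label.
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    pairs, pair_labels = [], []
    for idx, label in enumerate(labels):
        same = [j for j in by_label[label] if j != idx]
        different = [j for j in range(len(labels)) if labels[j] != label]
        if same:
            pairs.append((items[idx], items[rng.choice(same)]))
            pair_labels.append(1)  # similar pair
        if different:
            pairs.append((items[idx], items[rng.choice(different)]))
            pair_labels.append(0)  # dissimilar pair
    return pairs, pair_labels

# Toy stand-ins for images: two shirts (class 0) and two jackets (class 1).
items = ["shirt_a", "shirt_b", "jacket_a", "jacket_b"]
labels = [0, 0, 1, 1]
pairs, pair_labels = make_pairs(items, labels)

for pair, y in zip(pairs, pair_labels):
    print(pair, y)
```

Emitting one similar and one dissimilar pair per item keeps the two classes of pairs balanced, which is a common choice for this kind of training data.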



### Coding a Multiple Input Siamese Network
Previously, you saw how to define the architecture of a Siamese Network, for determining the similarity or dissimilarity of clothing items. In this video, you'll look at building the code that implements this architecture.

In [None]:
from tensorflow.keras.layers import Input, Flatten, Dense, Dropout
from tensorflow.keras.models import Model

def initialize_base_network():
    input_ = Input(shape=(28, 28))
    x = Flatten()(input_)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.1)(x)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.1)(x)
    x = Dense(128, activation='relu')(x)

    return Model(inputs=input_, outputs=x)

Using MNIST data that's 28 by 28, you can have a simple, deep neural network, like this. It will use the functional API defined within a Python function. Note that it returns a model. You'll use this later. If you plot this architecture you'll see layers like this. 

<img src='Images/Base Model Plotted.png'>

Now that you have the base network, you can use it with two input layers. Start by calling the `initialize_base_network` function to get a model back. Then you can specify `input_a` as an input layer that's 28 by 28, and specify the base network to follow it.

In [None]:
base_network = initialize_base_network()

input_a = Input(shape=(28, 28))
input_b = Input(shape=(28, 28))

vect_output_a = base_network(input_a)
vect_output_b = base_network(input_b)

Then you'll do exactly the same for `input_b`. That will give you an architecture like this, with two inputs to the base model architecture. While there's just one node plotted here, this node, `model_15`, is the entire base dense network that we saw earlier on.

<img src='Images/Input Base Model Plotted.png'>

Next, let's consider the output of the networks. The output vectors of each sub-network are compared to each other using euclidean distance, to determine their level of similarity. 

<img src='Images/Similarity between two inputs.png'>

[The Distance Between Two Vectors (Mathonline)](http://mathonline.wikidot.com/)

The code to calculate the euclidean distance is shown here:

In [None]:
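from tensorflow.keras import backend as K

def euclidean_distance(vects):
    # vects is a list holding the two output vectors to compare
    x, y = vects
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    # Clamp to epsilon so the square root is never taken at exactly zero
    return K.sqrt(K.maximum(sum_square, K.epsilon()))

def euc_dist_output_shape(shapes):
    # One distance value per pair in the batch
    shape1, shape2 = shapes
    return (shape1[0], 1)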
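To see what this computation produces on concrete numbers, here is the same arithmetic for a single pair of vectors in plain Python. It is a sketch rather than the Keras code: `EPSILON` is set to `1e-7`, which is the Keras backend default, and the clamp is presumably there because the gradient of the square root is unbounded at zero.

```python
import math

EPSILON = 1e-7  # assumed to match the Keras backend default

def euclidean_distance_plain(x, y):
    """Euclidean distance between two equal-length vectors, clamped away from zero."""
    sum_square = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(max(sum_square, EPSILON))

# Differences are (-3, -4, 0), so the distance is sqrt(9 + 16) = 5.
d = euclidean_distance_plain([1.0, 2.0, 3.0], [4.0, 6.0, 3.0])

# Identical vectors: the clamp yields a tiny positive value instead of exactly 0.
same = euclidean_distance_plain([1.0, 2.0], [1.0, 2.0])

print(d, same)
```

The Keras version does this row by row across the batch, which is why it sums over `axis=1` and keeps one distance per pair.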

We can use a `Lambda` layer to call the Euclidean distance function. Lambda layers in TensorFlow give you the ability to run custom code, so they're perfectly suited for it. You can also specify that it follows the two vector outputs from earlier by putting them into a list, which you can see here. It's still using the functional API syntax. We can define our model as always by calling the `Model` object and specifying the inputs and outputs.

In [None]:
from tensorflow.keras.layers import Lambda

output = Lambda(euclidean_distance, output_shape=euc_dist_output_shape)(
    [vect_output_a, vect_output_b])

In [None]:
model = Model([input_a, input_b], output)

 We created the inputs earlier, and the output is the Lambda layer that you just created. Now the architecture looks like this, with the Lambda layer following the base model. 

<img src='Images/Siamese Network Out Plotted.PNG'>


In [None]:
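from tensorflow.keras.optimizers import RMSprop

model = Model([input_a, input_b], output)

rms = RMSprop()
# contrastive_loss is the custom loss function written for this scenario
model.compile(loss=contrastive_loss, optimizer=rms)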
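The `contrastive_loss` function passed to `compile` is not defined in this section. As a point of reference, the sketch below shows a common formulation of contrastive loss in plain Python, not the course's implementation. It assumes labels of 1 for similar pairs and 0 for dissimilar pairs, and a margin of 1; `contrastive_loss_sketch` is a name introduced here for illustration.

```python
def contrastive_loss_sketch(y_true, distances, margin=1.0):
    """Mean contrastive loss over a batch of pairs.

    y_true: 1 for similar pairs, 0 for dissimilar pairs.
    distances: predicted distance for each pair.
    margin: assumed value; dissimilar pairs farther apart than this cost nothing.
    """
    losses = []
    for y, d in zip(y_true, distances):
        similar_term = y * d ** 2  # penalize distance between similar items
        dissimilar_term = (1 - y) * max(margin - d, 0.0) ** 2  # penalize closeness
        losses.append(similar_term + dissimilar_term)
    return sum(losses) / len(losses)

labels = [1, 0]  # one similar pair, one dissimilar pair

# Good predictions: similar pair is close, dissimilar pair is beyond the margin.
good = contrastive_loss_sketch(labels, [0.1, 1.2])

# Bad predictions: similar pair is far apart, dissimilar pair is close.
bad = contrastive_loss_sketch(labels, [1.2, 0.1])

print(good, bad)
```

The loss pulls similar pairs together and pushes dissimilar pairs apart only until they clear the margin, so well-separated dissimilar pairs stop contributing.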

We compile the model using a loss called contrastive loss, which is probably something new to you. This is a custom loss function that was written for this scenario. Now we get to training the network. Note how we feed in the training values. 

In [None]:
model.fit([tr_paires[:,0], tr_paires[:,1]], tr_y, # Training Data
          epochs=20,
          batch_size=128, 
          validation_data=([ts_paires[:,0], ts_paires[:,1]], ts_y)) # Testing Data

The data has been arranged into pairs of images with a label denoting their similarity. The syntax ```tr_paires[:,0]``` will take the first item in each pair and feed it into the left side of the network, and ```tr_paires[:,1]``` will take the second item in each pair and feed that into the right side of the network. ```tr_y``` will contain the labels, where it's zero for dissimilar pairs and one for similar ones.

If we determine that, for the current pair, the two values being fed in are similar, this will be a one. You can see that here, where I show two examples of pairs: on the left we have two T-shirts, and they're labeled as having a similarity of one, whereas on the right we have a T-shirt and a jacket, whose similarity is labeled as zero.

<img src='Images/Siamese example.png'>

Once the model is trained it can be fed pairs of images to get outputs of similarity, or dissimilarity. A lower distance score indicates that two items are more similar and a higher distance score indicates the two items are different from each other. For example, these two shorts are similar and they get a low distance score of 0.17, but this jacket is very different from these pants and gets a high distance score of 1.26.
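One way to turn these distance scores into a similar-or-not decision is to compare them against a cutoff. The sketch below is plain Python under stated assumptions: the threshold of 0.5 is chosen for illustration, the first two distances are the ones quoted above, and the last two pairs and all the labels are made up.

```python
def pair_accuracy(y_true, distances, threshold=0.5):
    """Fraction of pairs whose thresholded distance matches the label.

    A pair is predicted similar (1) when its distance is below the threshold.
    """
    predictions = [1 if d < threshold else 0 for d in distances]
    correct = sum(p == y for p, y in zip(predictions, y_true))
    return correct / len(y_true)

# 0.17 (two shorts) and 1.26 (jacket vs. pants) come from the text;
# 0.62 and 0.90 are hypothetical extra pairs.
distances = [0.17, 1.26, 0.62, 0.90]
y_true = [1, 0, 1, 0]

accuracy = pair_accuracy(y_true, distances)
print(accuracy)
```

Here the third pair is labeled similar but sits above the cutoff, so it counts as a miss. Moving the threshold trades off missed similar pairs against false matches.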

<img src='Images/Siamese example out.png'>

1. **Which of these steps are needed for building a model with the Functional API? (Select three from the list below)**

    1. Explicitly define an input layer to the model.

    2. Define the input layer of the model using any Keras layer class (e.g., Flatten(), Dense(), ...)
    3. Define disconnected intermediate layers of the model.
    4. Connect each layer using python functional syntax.
    5. Define the model using the input and output layers.
    6. Define the model using only the output layer(s).

- 2, 4, 5
- 1, 3, 5
- 1, 4, 6
- 1, 4, 5


2. **Is the following code correct for building a model with the Sequential API?**
<img src='Images/Q2.png'>
- False
- True


3. **Only a single input layer can be defined for a functional model.**
- True
- False


4. **What are Branch Models?**
- A model architecture with linear stack of layers.
- A model architecture with non-linear topology, shared layers, and even multiple inputs or outputs.
- A model architecture with a single recurring path.
- A model architecture where you can split the model into different paths, and cannot merge them later.


5. **One of the advantages of the Functional API is the option to build branched models with multiple outputs, where different loss functions can be implemented for each output.**
- True
- False


6. **A siamese network architecture has:**
- 1 input, 2 outputs
- 2 inputs, 2 outputs
- 1 input, 1 output
- 2 inputs, 1 output


7. **What is the output of each twin network inside a Siamese Network architecture?**
- A number
- A softmax probability
- An output vector
- Binary value, 1 or 0


8. **What is the purpose of using a custom contrastive loss function for a siamese model?**
- It is a custom built function that can calculate the loss on similarity comparison between two items.
- As a custom built function, it provides better results and it is faster to run.
- A custom built function is required because it is not possible to use a built-in loss function with the Lambda layer.
- A custom loss function is required for using the RMSprop() optimizer.

---

## Custom Loss Functions
### Creating a Custom Loss Function
One great advantage of using the functional API is the additional flexibility in your model architecture design, where instead of each layer being linearly stacked in turn with other layers, you can have branches, cycles, multiple inputs and outputs, and a whole lot more. 

One model you built was a Siamese network, where you had a pair of input paths that eventually joined together into a single layer. The output of this layer was a number indicating the amount of similarity or difference between a pair of inputs, such as two pictures of clothing. Building the Siamese network required a custom loss function, since none of the built-in functions were suitable for the task, and the code that you worked with last week actually contained that custom loss function.

Now, we're going to learn all about custom loss functions, including taking a closer look at that one, and how you can use it as a template to create your own. 

In [None]:
model.compile(loss='mse', optimizer='sgd')

#### OR ####

from tensorflow.keras.losses import mean_squared_error
model.compile(loss=mean_squared_error, optimizer='sgd')

You've likely seen a lot of loss functions while you've been working in TensorFlow, and the loss function is usually specified as a parameter in ```model.compile```. Now the loss function itself can be declared using either a string with its name, such as ```mse``` here, which stands for mean squared error, or you can use a loss object. For example, here you can import the mean squared error object from `tensorflow.keras.losses` and then specify that object instead of the string that contains the name of the loss function.

The important difference with using the loss object is that you can add parameters to the object call. This means that you have much better flexibility to do things like tuning your hyperparameters. Schematically, with `param` and `value` as placeholders:

```model.compile(loss=mean_squared_error(param=value), optimizer='sgd')```

To create a custom loss function, you'll need to create your own function that accepts two parameters, and these are typically called ```y_true``` and ```y_pred```, as in prediction. These contain your true labels and your current predicted values. The loss will be some kind of function that calculates the difference between the two.

```y_true``` will contain your labels. Now the naming might seem a little bit odd, but ultimately you should see this as your source of truth. This is how the data is supposed to be labeled, and we want the model's predictions to be as close as possible to this value.

```y_pred``` contains the predicted value. The optimizer has been working on tweaking the weights and biases within the network, and these have been used to calculate the current predicted values so that we can compare them against the labels. The best predictions are as close as possible to the labels, meaning that their loss is as small as possible. So our function has to calculate the losses somehow and then return them so that they can be used by the optimizer for the next step of training.

```
def my_loss_function(y_true, y_pred):
    return losses
```
    
It's best to learn this by example, so let's take a look at one: a type of loss function called **Huber loss**.

### Huber Loss
<img src='Images/Huber Loss Equation.png'>
<img src='Images/Huber Loss Equation Plot.png'>
Huber loss (green, $\delta = 1$) and squared error loss (blue) as a function of $y - f(x)$

[Huber Loss (Wikipedia)](https://en.wikipedia.org/wiki/Huber_loss)

There are a couple of variables in here that we need to consider when we're coding. The first is the **threshold**, and it's indicated by the delta character $\delta$. Now this is pretty important, as it appears in a number of different places within the formula. The second is the **error**, which is indicated by the letter $a$. As we're dealing with numeric values, we'll simply calculate $a$ as the difference between the label and the prediction. So the rules of Huber loss become: the loss is half of $a$ squared when the absolute value of $a$ is less than or equal to the threshold; otherwise it is the threshold times the quantity (absolute value of $a$ minus half of the threshold), that is, $\delta(|a| - \frac{1}{2}\delta)$.
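Written out in LaTeX, using the same symbols as the paragraph above ($a$ for the error and $\delta$ for the threshold), the piecewise definition reads as follows. This is the standard form of Huber loss, transcribed here for reference:

$$
L_\delta(a) =
\begin{cases}
\frac{1}{2}a^{2} & \text{if } |a| \le \delta \\
\delta\left(|a| - \frac{1}{2}\delta\right) & \text{otherwise}
\end{cases}
\qquad \text{where } a = y_{\text{true}} - y_{\text{pred}}
$$

At $|a| = \delta$ both branches give $\frac{1}{2}\delta^{2}$, so the two pieces join without a jump.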

Here is the code for the function:

In [None]:
import tensorflow as tf

def my_huber_loss(y_true, y_pred):
    threshold = 1  # delta in the formula
    error = y_true - y_pred  # a in the formula
    is_small_error = tf.abs(error) <= threshold
    small_error_loss = tf.square(error) / 2
    big_error_loss = threshold * (tf.abs(error) - (0.5 * threshold))

    # Pick the quadratic loss for small errors and the linear loss otherwise
    return tf.where(is_small_error, small_error_loss, big_error_loss)

The first line is just setting the threshold to 1. Now remember, this was called the Delta character in the formula. We will calculate the error by subtracting the prediction from the true label and remember this becomes a in the formula.

Next will set a Boolean value is small error, if the absolute value of the error is less than the threshold, and this will be used to decide which formula will calculate are lost value. The loss when the error is below the threshold is half of the error squared, so will calculate that here as the loss when the error is small. The loss when it's above the threshold is the absolute value of a minus half the threshold, so we can calculate that here as the loss when the error is large. 

Using ```tf.where```, you can specify three parameters: a Boolean condition, the value to return if that Boolean is true, and the value to return if it's false. So if ```is_small_error``` is true, we'll return ```small_error_loss```; otherwise we'll return ```big_error_loss```. Because the condition is a tensor, this choice is made element by element.
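
To get a feel for the numbers, here is a minimal plain-Python sketch of the same rule for a single scalar error. It deliberately avoids TensorFlow, and the function name ```huber``` is just an illustrative choice, not part of the course code. It prints the Huber loss next to half the squared error so the two can be compared.

```python
def huber(error, threshold=1.0):
    """Huber loss for a single scalar error (plain Python, no TensorFlow)."""
    if abs(error) <= threshold:
        # Small error: quadratic branch.
        return error ** 2 / 2
    # Large error: linear branch, scaled by the threshold.
    return threshold * (abs(error) - 0.5 * threshold)


for e in (0.5, 1.0, 3.0):
    # error, Huber loss, half squared error
    print(e, huber(e), e ** 2 / 2)
```

For errors inside the threshold the two columns agree; beyond it the Huber loss grows linearly, which makes it less sensitive to outliers than squared error.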

So let's take a look at this in action. I like to consider this the hello world of machine learning: I have a set of x's and y's, and the relationship between them is linear, so I could just use a single neuron with a weight and a bias. I can then train the model with the built-in mean squared error loss, a little bit like this, and it works pretty well.

In [None]:
model = tf.keras.Sequential([tf.keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')

So if you want to replace this with the custom loss function, all you have to do is this: create the loss function as a Python function as before, and then pass that function, in this case ```my_huber_loss```, as the ```loss``` parameter when you compile the model. Note that you pass the function object itself rather than a quoted string, since a string is looked up among the losses Keras already knows about. And that's it, you've just created your first custom loss function.

In [None]:
model = tf.keras.Sequential([tf.keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss=my_huber_loss)

## Custom Loss Hyperparameters and Classes
### Adding Hyperparameters to Custom Loss Functions
Previously, you saw how to create a loss function based on the Huber algorithm. You implemented it using Python and then you saw how to use it in a simple neural network. In that case, the function had a hard-coded threshold value that was implemented within the function. But what if you wanted to parameterize that threshold to be able to customize the function? Let's now explore how that would work.

As the threshold is used a lot within the function, it feels like it could be a very powerful hyperparameter for us to tune. So making it a parameter that we pass into the function would be a great way to go.

In [None]:
def my_huber_loss_with_threshold(threshold):
    def my_huber_loss(y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) <= threshold
        small_error_loss = tf.square(error)/2
        big_error_loss = threshold * (tf.abs(error) - (0.5 * threshold))

        return tf.where(is_small_error, small_error_loss, big_error_loss)
    return my_huber_loss

To do this, you can create a **wrapper function** that contains the original loss function. In this case, I've created ```my_huber_loss_with_threshold```, which accepts a threshold parameter, and you can see that all of the other code is within it.

Because the outer function accepts the threshold parameter, you can pass in that value when you call it. That threshold then gets used within the inner function instead of the hard-coded one, and the inner function then gets returned by the outer function. If you call ```my_huber_loss_with_threshold```, you'll actually get a loss function back that implements that threshold.

In [None]:
model.compile(loss=my_huber_loss_with_threshold(threshold=1))

Now if you want to use the threshold, you call the outer function ```my_huber_loss_with_threshold```, which accepts the threshold parameter and then returns a reference to a customized ```my_huber_loss``` function, where the threshold is equal to the chosen parameter. This is why we introduced the threshold parameter using a wrapper function rather than just modifying the ```my_huber_loss``` function to take in a third parameter for the threshold.

The ```loss``` parameter of ```model.compile``` expects a function that takes in just ```y_true``` and ```y_pred```. By including the threshold through a wrapper, you can still provide a loss function that takes only the expected parameters ```y_true``` and ```y_pred```. Now if you want to do some hyperparameter tuning within the loss function, you can tweak that threshold.
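
The wrapper pattern is ordinary Python closure behavior, so it can be sketched without TensorFlow. In the minimal example below, ```huber_with_threshold``` is a hypothetical scalar stand-in for the course's ```my_huber_loss_with_threshold```. It shows that the returned function still has exactly two parameters, while each returned function remembers its own threshold.

```python
import inspect


def huber_with_threshold(threshold):
    # The inner function "closes over" threshold, so it only needs two arguments.
    def huber(y_true, y_pred):
        error = abs(y_true - y_pred)
        if error <= threshold:
            return error ** 2 / 2
        return threshold * (error - 0.5 * threshold)
    return huber


loss_a = huber_with_threshold(1.0)
loss_b = huber_with_threshold(2.0)

# Parameter names of the returned function: only y_true and y_pred.
print(list(inspect.signature(loss_a).parameters))
# Same inputs, different thresholds, different losses.
print(loss_a(0.0, 3.0), loss_b(0.0, 3.0))
```

Each call to the outer function produces an independent loss function, which is what makes it convenient to try several thresholds in a tuning loop.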

Now you've created a loss function and you've learned how it can be parameterized using a wrapper function. You saw how your custom loss function could be used within model training. If you've developed using TensorFlow a lot, you've probably noticed that loss functions can also often be passed in as object instances of a class, instead of just using a string as the name. To do that, you'll need to implement your loss function as a Python class instead of as a Python function.

### Turning Loss Functions into Classes

Earlier, while looking at Loss functions, you could also see how they were implemented as classes. So instead of passing a string or a function, you could create a class for your loss function and we'll see how to do that. Here's the full code for Huber loss in a class:

In [None]:
from tensorflow.keras.losses import Loss

class MyHuberLoss(Loss):
    def __init__(self, threshold):
        super().__init__()
        
        self.threshold = threshold
        
    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) <= self.threshold
        small_error_loss = tf.square(error)/2
        big_error_loss = self.threshold * (tf.abs(error) - (0.5 * self.threshold))

        return tf.where(is_small_error, small_error_loss, big_error_loss)

First, we'll start by seeing that when you implement a class in Python with the ```class``` keyword, you do it like this. Note that you do inheritance by putting the parent class in parentheses after the class name. So this syntax means that ```MyHuberLoss``` will inherit from ```Loss```, which lets us use it as a loss function.

Within the class you need two methods: an ```__init__``` method that initializes the object when it is created from the class, and a ```call``` method that gets executed when the loss object is invoked to compute the loss. The ```__init__``` method receives the threshold, and the ```call``` method receives the ```y_true``` and ```y_pred``` parameters that we saw previously. As mentioned earlier, the threshold parameter, when passed into the object, is received in ```__init__```, where we store it on the instance as ```self.threshold```. Within ```call```, the threshold is then referred to as ```self.threshold```.

In [None]:
model.compile(loss=MyHuberLoss(threshold=1))

Then when we want to specify the loss function in our ```model.compile```, we can do so by creating an instance of the ```MyHuberLoss``` class, passing in our threshold value as a parameter. And that was pretty cool, right?
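
To make the division of labor between ```__init__``` and ```call``` concrete, here is a simplified plain-Python sketch. ```SketchLoss``` and ```SketchHuber``` are hypothetical stand-ins, not Keras code: the real ```Loss``` base class also handles things like sample weights and configurable reduction. The sketch only assumes the general idea that calling the loss object forwards to ```call``` and then reduces the per-example losses, here with a plain mean.

```python
class SketchLoss:
    """Simplified stand-in for a loss base class (hypothetical, not Keras code)."""

    def __call__(self, y_true, y_pred):
        # Calling the object forwards to call() and averages the per-example losses.
        per_example = self.call(y_true, y_pred)
        return sum(per_example) / len(per_example)


class SketchHuber(SketchLoss):
    def __init__(self, threshold=1.0):
        self.threshold = threshold  # stored once, at construction time

    def call(self, y_true, y_pred):
        losses = []
        for t, p in zip(y_true, y_pred):
            error = abs(t - p)
            if error <= self.threshold:
                losses.append(error ** 2 / 2)
            else:
                losses.append(self.threshold * (error - 0.5 * self.threshold))
        return losses


loss_fn = SketchHuber(threshold=1.0)     # __init__ receives the hyperparameter
print(loss_fn([0.0, 0.0], [0.5, 3.0]))   # __call__ -> call receives y_true, y_pred
```

The hyperparameter travels through the constructor, so the object that reaches the training loop still only needs ```y_true``` and ```y_pred```.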

## Contrastive Loss
Now that we've seen Huber loss in action, let's go back and take a look at the custom loss function you used when you created the Siamese network for image similarity.

Here's the architecture of the Siamese network for image similarity that you looked at. It has two network branches with the same architecture, each producing its own output, and those outputs get compared using Euclidean distance to get an overall output.

<img src='Images/Siamese Network.PNG'>

To calculate the loss in this, we needed a new type of loss function that wasn't in our toolkit. We called it contrastive loss, as we wanted to contrast the images against each other. The idea is that if two images are similar, we want to produce a feature vector for each image where the vectors are very similar. If the images are different, we want their respective feature vectors to also be different. The paper *Dimensionality Reduction by Learning an Invariant Mapping* is the basis for a loss like this.

The formula for contrastive loss is here. It's $Y*D^2 + (1-Y) * max(margin - D, 0)^2$. 

Now there's a lot to break down here, so let's look at each of these elements in turn. Here, $Y$ is the **tensor of details about image similarities**. Its values are one if the images are similar and zero if they're not. $D$ is the **tensor of Euclidean distances between the pairs of images**. $margin$ is a **constant** that we can use to **enforce a minimum distance between them** in order to consider them similar or different.

Let's consider what happens when $Y$ is one. If I replace the $Y$s with one, the equation reduces to $D^2$, so for similar images the loss grows with the distance between them: a pair that should be close but is far apart gets a high loss.

When $Y$ is zero and we substitute this in for $Y$, our value, instead of $D^2$, will be the max of $margin - D$ and $0$, which is then squared. For different images, this loss is large when the distance is small and drops to zero once the distance reaches the margin. You can think of $Y$ and $1 - Y$ in this loss function as weights that allow either the $D^2$ part or the max part of the formula to dominate the overall loss. When $Y$ is close to one, it gives more weight to the $D^2$ term and less weight to the max term, so the $D^2$ term dominates the calculation of the loss. Conversely, when $Y$ is closer to zero, this gives much more weight to the max term and less weight to the $D^2$ term, so the max term dominates the calculation of the loss.
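
A few worked numbers may help here. The plain-Python sketch below evaluates the formula for a single pair; the function name ```contrastive``` and the sample distances are illustrative choices, with $margin = 1$ assumed.

```python
def contrastive(y, d, margin=1.0):
    """Contrastive loss for one pair: y is 1 (similar) or 0 (different), d is the distance."""
    return y * d ** 2 + (1 - y) * max(margin - d, 0) ** 2


# (label, distance) pairs: similar/close, similar/far, different/close, different/far
for y, d in [(1, 0.2), (1, 1.5), (0, 0.2), (0, 1.5)]:
    print(y, d, round(contrastive(y, d), 4))
```

Similar pairs are penalized for being far apart, while different pairs are penalized only while they sit inside the margin.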

When we take into account the parameters that TensorFlow expects for a loss function, let's rewrite it like this. 

$Y_{true} * Y_{pred}^2 + (1 - Y_{true}) * max(margin - Y_{pred},0)^2$

The Y in the original formula becomes the $Y_{true}$ value. The $D$ in the original formula becomes the $Y_{pred}$ value, and now we have the two values you need. Now that we've explored the contrastive loss formula, we actually have the building blocks for coding it. 

- [Dimensionality Reduction by Learning an Invariant Mapping (Hadsell, Chopra & LeCun, 2006)](http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf)

In [None]:
from tensorflow.keras import backend as K

def contrastive_loss(y_true, y_pred):
    margin = 1
    squared_pred = K.square(y_pred)  # D^2, weighted by y_true
    margin_square = K.square(K.maximum(margin - y_pred, 0))  # max(margin - D, 0)^2
    return K.mean(y_true * squared_pred + (1-y_true) * margin_square)

We return the overall mean of the results of the calculations for each element in $y_{true}$ and $y_{pred}$, as shown.

In [None]:
from tensorflow.keras.optimizers import RMSprop

model.compile(loss=contrastive_loss, optimizer=RMSprop())

We simply use it in the ```model.compile```. We specify the name of the function as our loss function, and off we go. 

Earlier we saw that the margin can be used to tune the loss function. Let's now take a look at what it might take to pass the margin as a parameter, so we can try different ones. Creating it is as simple as having an outer function that takes the margin as a parameter, and then has our original function as an inner function that it returns.

In [None]:
def contrastive_loss_with_margin(margin):
    def contrastive_loss(y_true, y_pred):
        squared_pred = K.square(y_pred)
        margin_square = K.square(K.maximum(margin - y_pred, 0))
        return K.mean(y_true * squared_pred + (1-y_true) * margin_square)
    return contrastive_loss

When we want to use this function, we simply use it in the ```model.compile```. We can specify the margin parameter when calling it, by saying margin equals a value.

In [None]:
model.compile(loss=contrastive_loss_with_margin(margin=1), optimizer=RMSprop())

For example, if we want it to be 1, you can do that here. As we saw with the Huber loss, we can also create a class that derives from the Keras ```Loss``` class and that contains an init and a call method. Init can be used to accept a margin parameter, which is stored on the instance, and that value can then be used within the loss function as ```self.margin```.

In [None]:
from tensorflow.keras.losses import Loss

class MyContrastiveLoss(Loss):
    def __init__(self, margin):
        super().__init__()
        
        self.margin = margin
        
    def call(self, y_true, y_pred):
        squared_pred = K.square(y_pred)
        margin_square = K.square(K.maximum(self.margin - y_pred, 0))
        return K.mean(y_true * squared_pred + (1-y_true) * margin_square)

To use it, you then simply pass an instance of the class to your ```model.compile```. The parameter can be passed in its constructor like this.

In [None]:
model.compile(loss=MyContrastiveLoss(margin=1), optimizer=RMSprop())

## Assignment
### Custom Loss
**1.One of the ways of declaring a loss function is to import its object. Is the following code correct for using a loss object?**
<img src='Images/Week2 Q1.png'>
- True
- False


**2.It is possible to add parameters to the object call when using the loss object.**
<img src='Images/Week2 Q2.png'>
- False
- True


**3.You learned that you can do hyperparameter tuning within custom-built loss functions by creating a wrapper function around the loss function with hyperparameters defined as its parameter. What is the purpose of creating a wrapper function around the original loss function?**
<img src='Images/Week2 Q3.png'>
- That’s one way of doing it. We can also do the same by passing y_true, y_pred and threshold as parameters to the loss function itself.
- No particular reason, it just looks neater this way.
- The loss ( model.compile(..., loss = ) ) expects a function with two parameters, y_true and y_pred, so it is not possible to pass a 3rd parameter (threshold) to the loss function itself. This can be achieved by creating a wrapper function around the original loss function.
- The loss ( model.compile(..., loss = ) ) expects a function that is only a wrapper function to the loss function itself.


**4.One other way of implementing a custom loss function is by creating a class with two function definitions, init and call.
Which of the following is correct?**
<img src='Images/Week2 Q4.png'>
- We pass the hyperparameter (threshold) , y_true and y_pred to the call function, and the init function returns the call function.
- We pass y_true and y_pred to the init function, the hyperparameter (threshold) to the call function.
- We pass the hyperparameter (threshold) to the init function, y_true and y_pred to the call function.
- We pass the hyperparameter (threshold) , y_true and y_pred to the init function, and the call function returns the init function.


**5.The formula for the contrastive loss, the function that is used in the siamese network for calculating image similarity, is defined as follows:
Check all that are true:**
<img src='Images/Week2 Q5.png'>
- If the euclidean distance between the pair of images is low then it means the images are similar.
- Margin is a constant that we use to enforce a maximum distance between the two images in order to consider them similar or different from one another.
- Y is the tensor of details about image similarities.
- Ds are 1 if images are similar, 0 if they are not.

---

## Custom Lambda Layers
### Intro to Custom Layer
You've looked at how to extend Keras and TensorFlow with custom code. You saw the functional API, which allows you to break out of the strict linear definition of a neural network. You could have multiple inputs and outputs, you could split and merge layers, you could reuse layers and a whole lot more.

And you then explored loss functions. In many ways these are the heart of any machine learning system. You saw how you could create your own from scratch or how you could subclass existing functionality. 

Now, we're going to look at the layers themselves, so that you can enhance your models by creating your own custom layers, and you're not just limited to the layer types that ship with TensorFlow. Let's get going.

### Introduction to Lambda Layers
There are a number of ways that you can customize layer behavior in TensorFlow. The first and the easiest, if you only need basic functionality, is to not create a custom layer at all, but to use a lambda layer. This is a layer type that can be used to execute arbitrary code. The purpose of the lambda layer is to execute an arbitrary function within a sequential or a functional API model. It's best suited for something quick and simple, or if you want to experiment. Let's take a look at how to use it in code.

In [None]:
tf.keras.layers.Lambda(lambda x: tf.abs(x))

The simplest lambda layer looks something like this. Within the parameters, you specify a lambda function: its argument, in this case ```x```, gets mapped to a result, in this case the absolute value of ```x```. For example, $-1$ gets mapped to $1$ by this lambda layer.

**How might you use this?** 
Well, consider the **fashion mnist** model architecture that we've used a lot and, in this case, take a look at the dense layer. It's activated by a relu, and what the relu does is effectively remove negative outputs from the layer. That is, only positive values flow down to the next layer, so in an aggregation, negative and positive values won't cancel each other out.

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
    ])

Here's the result of training the network for five epochs. It does pretty well, reaching $98.6\%$ accuracy on the training set and $97.5\%$ on the validation set. The training also takes about **4 seconds per epoch**.

<img src='Images/Relu training result.png'>

If we remove the relu, you can see that it has an impact on the training and model performance. As you can see from these results, when the relu is removed, the accuracy drops to $92.2\%$ and $91.5\%$ for the training and validation sets respectively, and the training time stays about the same.

<img src='Images/Relu training result2.png'>

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(128),
    tf.keras.layers.Lambda(lambda x: tf.abs(x)),
    tf.keras.layers.Dense(10, activation='softmax')
    ])

If we change our model architecture to this, with no relu on the dense layer and an absolute-value function in a lambda layer, we can see a slight increase in accuracy to $98.7\%$ and $97.5\%$ respectively, along with a slight improvement in performance, with each epoch taking about **3 seconds** instead of $4$. This is very much a gratuitous demo, but hopefully it lays the foundation for the type of thing you can do with lambda layers.

<img src='Images/Relu training result3.png'>

### Custom Functions from Lambda Layers
Previously you saw how to use Lambda layers to execute arbitrary code within your layer definition. Another example is to have a custom function that the Lambda layer can call in order to encapsulate your code. So for example, if you wanted to implement a modified Relu with a threshold, you could do so. Let's take a look. 

In [None]:
def my_relu(x):
    return K.maximum(0.0, x)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(128),
    tf.keras.layers.Lambda(my_relu),
    tf.keras.layers.Dense(10, activation='softmax')
    ])

For example, here's a modified version of the same model where, instead of mapping $x$ within the Lambda function directly, it's calling out to another function from within the Lambda layer. So ```my_relu``` for $x$ will return whichever is larger, $x$ or $0$. Then, after our dense layer, where we previously had the Lambda layer that did the absolute value, you can now put the Lambda layer that calls ```my_relu``` instead.

In [None]:
def my_relu(x):
    return K.maximum(0.5, x)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(128),
    tf.keras.layers.Lambda(my_relu),
    tf.keras.layers.Dense(10, activation='softmax')
    ])

Notice that in ```my_relu``` we specified it as the maximum of $x$ or $0$, but we could of course change this. For example, here, when $x$ is greater than $0.5$ it will return $x$; otherwise it will return $0.5$. That way you can tweak the relu function and maybe have an impact on the learning behavior in your neural network.
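
To see what the two variants do to actual values, here is a small plain-Python sketch. ```make_relu``` is a hypothetical helper, not part of the course code, that borrows the wrapper-function idea from the loss functions so the floor value becomes a parameter. It works on a plain list rather than on a tensor.

```python
def make_relu(floor=0.0):
    """Return a ReLU-like function that raises values below `floor` up to `floor`."""
    def relu(values):
        return [max(floor, v) for v in values]
    return relu


inputs = [-2.0, 0.0, 0.3, 0.5, 1.7]
print(make_relu(0.0)(inputs))  # standard ReLU behavior
print(make_relu(0.5)(inputs))  # the 0.5 variant from the text
```

With a floor of $0.5$, every input at or below $0.5$ comes out as $0.5$, so the layer never emits zero, which is one way the tweak can change learning behavior.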

So while this isn't a custom layer per se, it can give you similar functionality, with custom behavior in a layer that's accessible to the sequential and functional APIs. Later you'll look into creating purely custom layers, but often a Lambda is enough, particularly if you're doing something very simple.

### Architecture of a Custom Layer
Lambda layers are great for basic and simple functionality and prototyping but if you want to do more advanced stuff, like having layers that are trainable then you're going to hit limitations using them. Fortunately, the Keras layers in Tensorflow are inheritable. You can create your own custom layers that are trainable and can perform complex functionality by inheriting from existing functionality. 

TensorFlow supports lots of layer types and we won't go through them all, but here are some of the commonly used layers in TensorFlow. You've probably seen many of these already, from ```Conv2D``` for convolutional neural networks to ```LSTM``` and ```GRU``` in recurrent networks, and also stuff like ```Activation```, ```Dense```, ```Dropout``` and of course ```Lambda```, which we just saw.

<img src='Images/tf layers.png'>

**But what is a layer?**

Typically it's a class that collects parameters and encapsulates state and computation to achieve the layer's purpose in a neural network. Whether you're using it in the sequential or functional API, every model architecture design item is a layer.

<img src='Images/layer.png'>

When we say **state**, consider this to be a variable, something that makes a particular instance of a layer unique. These variables can be trainable where during ```model.fit```, TensorFlow can tweak their values to test for better performance or they can be non-trainable, where they might be used for some other feature. 

**Computation** is the means of transforming a batch of inputs into a batch of outputs. It's typically called the forward pass in neural networks, where a calculation is made and then passed to the next layer.

For example, for a very simple dense layer in a neural network, it has the parameters W and C often called the kernel and the bias or the weight and the bias. It will return a computation Y equals WX plus C. W and C in this case are the state and the formula $Y = WX + C$ is the computation. You will next take a look at building a layer like this in practice.

<img src='Images/layer2.png'>
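
The split between state and computation described above can be sketched in a few lines of plain Python. ```TinyDense``` is an illustrative stand-in, not a Keras class. It stores the weights and bias as nested lists and computes the batch form of $Y = WX + C$, multiplying each input row by the weights in the same order as the TensorFlow code in the next part of this section.

```python
class TinyDense:
    """Plain-Python stand-in for a dense layer (illustrative only)."""

    def __init__(self, w, c):
        # State: one row of weights per input feature, one bias per unit.
        self.w = w
        self.c = c

    def forward(self, batch):
        # Computation: weighted sum of the inputs plus the bias, per example.
        outputs = []
        for x in batch:
            row = []
            for j in range(len(self.c)):
                total = sum(x[i] * self.w[i][j] for i in range(len(x)))
                row.append(total + self.c[j])
            outputs.append(row)
        return outputs


layer = TinyDense(w=[[2.0]], c=[-1.0])   # one input feature, one unit
print(layer.forward([[0.0], [1.0], [10.0]]))
```

Changing ```w``` and ```c``` changes what the layer computes without touching ```forward```, which is exactly the separation that makes the state trainable.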

### Custom Dense Layer
Previously, you saw how a layer was architected using state for its variables and a computation that transforms inputs to outputs. That might seem a little abstract right now, so let's clarify by getting down into the code.

In [None]:
from tensorflow.keras.layers import Layer

class SimpleDense(Layer):
    def __init__(self, units=32):
        super(SimpleDense, self).__init__()
        self.units = units
        
    def build(self, input_shape): # Create the state of the layer (weights)
        # Initializing weights
        w_init = tf.random_normal_initializer()
        self.w = tf.Variable(name='kernel', 
                             initial_value=w_init(shape=(input_shape[-1], self.units), dtype='float32'),
                             trainable=True)
        # Initializing the bias
        bias_init = tf.zeros_initializer()
        self.bias = tf.Variable(name='bias', 
                             initial_value=bias_init(shape=(self.units,), dtype='float32'),
                             trainable=True)
    
    def call(self, inputs): # Create the computation from inputs to outputs
        return tf.matmul(inputs, self.w) + self.bias

Here's the complete code for this layer type, and I'm going to call this ```SimpleDense```. When creating a layer, you inherit from Keras's ```Layer``` class by specifying it in parentheses after your class name like this. Then your class will need at least these three methods:
1. The first, ```init```, initializes the class. It accepts the parameters and sets up the internal variables.
2. The second, ```build```, runs once, the first time the layer is called on an input, when the input shape is known. You'll use this to create the layer's state (its weights) and do any other housekeeping that's needed for that creation.
3. The third, ```call```, performs the computation. It's called whenever the layer is invoked, including during training, to get the output from this layer.

Let's go back to ```init```. The first thing that it needs to do is pass any initialization back to the base class. Remember that this is inheriting from the ```Layer``` class, so some initialization needs to be performed there too, and that's done using the ```super``` keyword. Then an instance variable called ```units``` is set to the value of the ```units``` parameter that was passed in. It defaults to 32 in this case, so if nothing is specified, this layer will have 32 units in it.

Within ```build```, you'll initialize the states. In this case, we're calling them ```w``` and ```bias```. Remember that when we create the layer, we're not creating a single neuron, but a number of neurons specified by the ```units``` variable. Every neuron will need to be initialized, and TensorFlow supports a number of built-in functions to initialize their values. One of these is the ```random_normal_initializer```, which, as its name suggests, initializes them randomly using a normal distribution.

```self.w``` will hold the states of the weights, and they'll be in a tensor by creating them as a ```tf.Variable```. This will be initialized using ```w_init``` for its values, and it's given the name ```kernel``` so that we can trace it later. Note that it's set up to be ```trainable```, so when you're doing a ```model.fit```, the value of ```w``` can be modified by TensorFlow. The bias is initialized differently, using a ```tf.zeros_initializer``` function, which, as the name suggests, will set it to zero. ```self.bias``` will then be a tensor with one value per unit in the layer, and they'll all be initialized as zeros. As you can see, that'll also be trainable.

```call``` will do the computation. As our ```self.w``` and ```self.bias``` are tensors, we can do a matmul operation to multiply the inputs by ```w``` and then add the bias before returning it. The inputs here are our typical x-value, so $y = wx + b$.
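
The reason the weights are created in ```build``` rather than in ```init``` is that the number of input features is not known until the layer first sees data. The plain-Python sketch below imitates that deferred creation. ```SketchLayer``` and ```SketchDense``` are hypothetical stand-ins that only mimic the general idea, not the actual Keras implementation, and the weights use a fixed value of $0.5$ instead of a random initializer so the output is easy to check.

```python
class SketchLayer:
    """Simplified stand-in for a layer base class (hypothetical, not Keras code)."""

    def __init__(self):
        self.built = False

    def __call__(self, inputs):
        if not self.built:
            # The input shape is only known once real data arrives.
            self.build(input_shape=(len(inputs), len(inputs[0])))
            self.built = True
        return self.call(inputs)


class SketchDense(SketchLayer):
    def __init__(self, units=32):
        super().__init__()
        self.units = units
        self.w = None  # no weights yet: the input size is still unknown

    def build(self, input_shape):
        n_features = input_shape[-1]
        self.w = [[0.5] * self.units for _ in range(n_features)]  # fixed values for clarity
        self.bias = [0.0] * self.units

    def call(self, inputs):
        return [
            [sum(x[i] * self.w[i][j] for i in range(len(x))) + self.bias[j]
             for j in range(self.units)]
            for x in inputs
        ]


layer = SketchDense(units=2)
print(layer.w)                         # build has not run yet
print(layer([[1.0, 2.0, 3.0]]))        # first call triggers build, then call
print(len(layer.w), len(layer.w[0]))   # weight shape follows the input: features x units
```

The same layer definition therefore works for any input width, because the weight shape is derived from the first batch it receives.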

In [None]:
my_dense = SimpleDense(units=1)
x = tf.ones((1,1))
y = my_dense(x)
print(my_dense.variables)

Now let's look at it in action. In this case, I'll create a ```SimpleDense``` layer called ```my_dense``` and I'll initialize it with just one neuron. I'm going to initialize an x as a tensor with ```tf.ones((1, 1))```, which returns a tensor of the shape specified, filled in with ones, so this will give me a one-by-one tensor which contains the value one. I'm going to say ```y = my_dense(x)```, so that the dense layer gets built and run on x. It has a single unit, so it will get the value of x. After this, we can look at the variables inside ```my_dense```, and we can see how they've been initialized by looking at the tensors.

The kernel or w received a random normal distribution and ended up with 0.036. 

<img src='Images/custom layer w.PNG'>

The bias, as we saw before, is initialized with zeros, so it contains zero. 

<img src='Images/custom layer b.PNG'>

Now remember this is just the initial state of the layer, it only has one neuron in it, and that neuron is initialized with the value for the kernel and another for the bias. When we get to training the neural network, these values are going to change. If we don't train, then our answer will be way off. In other words, our loss will be very high. Let's take a look at this in action next.

### Training a Neural network with your Custom Layer
I've always liked to explain this as a neural network makes a guess as to the relationship between the ```xs``` and the ```ys```. It measures how good or how bad that guess is using the loss function and then it uses the optimizer to make another guess, and so on. 

In [None]:
import numpy as np

# Six (x, y) pairs that follow y = 2x - 1
xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float)

model = tf.keras.Sequential([SimpleDense(units=1)])
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(xs, ys, epochs=500, verbose=0)
print(model.predict(np.array([10.0])))

The data that I've used here has a $y = 2x - 1$ relationship and the first guess is the random initialization of the neural network that we just saw. In other words, the first guess is $Y = 0.036x + 0$, and that's not even close. You see this in the early epochs of training, a very simple scenario like this has an enormous loss on the first epoch and the loss gradually reduces as the parameters get closer and closer to our desired ones. We created the data with the relationship $y = 2x - 1$, so $Y = 0.036x + 0$ is clearly way off. 

<img src='Images/custom layer t1.png'>
<img src='Images/custom layer t2.png'>

The next guess is a bit better, the next one is better, and better, and better, and so on. The internal parameters in the neurons are learning to get closer to the correct answer. 
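
The guess, measure, adjust loop can be written out by hand for this one-neuron case. The sketch below is plain Python rather than what ```model.fit``` literally runs. It assumes full-batch gradient descent on mean squared error with a learning rate of $0.01$ (a value chosen for this sketch), and starts from the first guess quoted above, $w = 0.036$ and $b = 0$.

```python
xs = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
ys = [-3.0, -1.0, 1.0, 3.0, 5.0, 7.0]

w, b = 0.036, 0.0        # the "first guess" quoted in the text
learning_rate = 0.01     # assumed value for this sketch
n = len(xs)

for epoch in range(500):
    # Guess: predict with the current parameters and compare with the labels.
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    # Measure: mean squared error of this guess.
    loss = sum(e ** 2 for e in errors) / n
    # Adjust: step each parameter against its gradient.
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2 * sum(errors) / n
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    if epoch in (0, 1, 2, 499):
        print(epoch, round(loss, 4))

print(w, b, w * 10.0 + b)
```

The printed losses shrink from one epoch to the next, and the final parameters land close to the slope of $2$ and intercept of $-1$ that generated the data.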

In the past, you would have used the Keras dense layer here, but now we can replace that with our ```SimpleDense``` one. After training, you can try to predict a value: for 10, the answer is very close to 19. The parameters within the layer have gotten pretty close to two for the kernel and minus one for the bias.

Indeed, after training, you can inspect the variables and you'll see something like this. Where the relationship was $y = 2x - 1$, the learned parameters were $1.9972587$, which is pretty close to $2$, and $-0.991591$, which is pretty close to $-1$.

<img src='Images/custom layer t3.png'>

If you want to try a more complex model, you can use the simple layer there too. For example, here's the model architecture that works nicely with mnist or fashion mnist. But one thing to pay attention to before you replace the dense type from Keras with your simple layer, is the ability to specify an activation function on the layer. The simple layer cannot do that yet, and you'll see how to do that later. 
```
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
    ])
```

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    SimpleDense(128),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
    ])

If you wanted to use ```SimpleDense```, you could declare it like this. This model, without the ```reLu```, won't perform as well as the previous architecture because of the positive impact that using ```reLu``` as an activation function has.

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    SimpleDense(128),
    tf.keras.layers.Lambda(my_relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
    ])

But you could use a ```Lambda``` function with the ```reLu``` you implemented earlier and then your model will perform quite well. That's it for creating your own simple dense type.

### Activating your Custom Layer
Previously, you saw how to create a custom layer for TensorFlow and Keras that provided a layer of simple neurons that contain two trainable variables, a kernel or weight and a bias, as well as the compute method that multiplies the input by the kernel and adds the bias. You saw how these can be the basic building blocks of a deep neural network by being able to add them in layers, and you also saw how to build some simple machine learning models using them. But your custom layer was missing the ability to apply an activation. There was a workaround that we did using a ```Lambda``` layer, but it's simpler and cleaner to specify an activation function on a layer. Let's take a look at how to expand our dense class to be able to do that.

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
    ])

Here's the basic model architecture to classify MNIST or fashion-MNIST, it should look pretty familiar by now. When specifying the dense layer of neurons, I can specify the number as well as an activation function, which you can see here is defined by name and in this case it's ```relu```. When using our simple dense layer, we didn't have the facility to specify ```relu```, but a workaround was to implement our own ```relu``` function and activate it using a ```Lambda``` layer. This worked pretty well, but it's a bit hacky. In order to see how to implement an activation, let's do a quick recap of our simple dense. 

In [None]:
class SimpleDense(Layer):
    def __init__(self, units=32):
        super(SimpleDense, self).__init__()
        self.units = units
        
    def build(self, input_shape): # Create the state of the layer (weights)
        # Initializing weights
        w_init = tf.random_normal_initializer()
        self.w = tf.Variable(name='kernel', 
                             initial_value=w_init(shape=(input_shape[-1], self.units), dtype='float32'),
                             trainable=True)
        # Initializing the bias
        bias_init = tf.zeros_initializer()
        self.bias = tf.Variable(name='bias', 
                             initial_value=bias_init(shape=(self.units,), dtype='float32'),
                             trainable=True)
    
    def call(self, inputs): # Create the computation from inputs to outputs
        return tf.matmul(inputs, self.w) + self.bias

It's a pretty compact piece of code that has ```init```, ```build``` and ```call``` methods. It's the job of the ```init``` method to do the initial setup of the layer, including managing the inheritance from the layer base class. If we're going to use an activation function, we'll have to set that up here. The ```build``` method initializes the internal state of the layer, typically giving the variables their initial values, and it doesn't need to do anything with the activation function. Then the ```call``` method does the calculation on the layer, and in the case of using an activation, we'll apply the activation function to that calculation before returning it. 

In [None]:
class SimpleDense(Layer):
    def __init__(self, units=32, activation=None):
        super(SimpleDense, self).__init__()
        self.units = units
        self.activation = tf.keras.activations.get(activation)
        
    def build(self, input_shape): # Create the state of the layer (weights)
        # Initializing weights
        w_init = tf.random_normal_initializer()
        self.w = tf.Variable(name='kernel', 
                             initial_value=w_init(shape=(input_shape[-1], self.units), dtype='float32'),
                             trainable=True)
        # Initializing the bias
        bias_init = tf.zeros_initializer()
        self.bias = tf.Variable(name='bias', 
                             initial_value=bias_init(shape=(self.units,), dtype='float32'),
                             trainable=True)
    
    def call(self, inputs): # Create the computation from inputs to outputs
        return self.activation(tf.matmul(inputs, self.w) + self.bias)

To update the layer implementation for activations, we don't need to change the ```build```. We only need to edit the ```init``` and ```call``` functions. In the ```init``` function, we have to specify that we'll accept an activation function. The activation function can either be a string containing the name of the function or an instance of an activation object. We can default it to ```None``` so that if we don't receive the parameter, we won't use any activation function at all. Then we can set our ```self.activation``` variable to be the value of ```tf.keras.activations.get```, with this activation name. This will set ```self.activation``` to be an instance of the named activation function. 

For example, if we pass ```relu``` as the activation, Keras will give us a ```relu``` function as ```self.activation```. Remember, you can pass either a string naming the activation function or an object instance of one. If you pass something invalid, your code will fail at this line. 

Then in ```call```, as you might be familiar with, we calculate the return value on the layer to be the $inputs*w + b$. But we now need to activate that. For example, with ```relu```, if the value of that is less than zero, we just return zero. Activating is as simple as calling the activation function with the results of the calculation and then returning that. 
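As a quick check of the lookup described above, the sketch below resolves an activation by name and by ```None``` and applies each to a small tensor. It relies only on ```tf.keras.activations.get```; the variable names are illustrative and the tensor ```z``` simply stands in for the result of $inputs*w + b$.

```python
import tensorflow as tf

# Resolve an activation from its string name.
relu_fn = tf.keras.activations.get('relu')

# Passing None yields the linear (identity) activation, so the layer
# output is returned unchanged when no activation is requested.
linear_fn = tf.keras.activations.get(None)

z = tf.constant([-2.0, 0.0, 3.0])  # stand-in for inputs*w + b

relu_out = relu_fn(z).numpy().tolist()      # negatives are clipped to zero
linear_out = linear_fn(z).numpy().tolist()  # values pass through untouched

print(relu_out)
print(linear_out)
```

Because ```None``` maps to the identity, the ```call``` method can always apply ```self.activation``` without checking whether one was supplied.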

Now let's look back at our model architecture and we can implement our simple dense with a specified activation function, where we can see that we've implemented a simple dense layer with 128 neurons, which is activated by the ```relu``` function.

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28,28)),
    SimpleDense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
    ])

## Assignment

**1.A Lambda layer allows you to execute an arbitrary function only within a Sequential API model.**
- False
- True


**2.Which one of the following is the correct syntax for mapping an increment of 2 to the value of “x” using a Lambda layer? (tf = TensorFlow)**
- ```tf.keras.layers.Lambda(lambda x: tf.math.add(x, 2.0))```
- ```tf.keras.layers(lambda x: tf.math.add(x, 2.0))```
- ```tf.keras.layers.Lambda(x: tf.math.add(x, 2.0))```
- ```tf.keras.Lambda(x: tf.math.add(x, 2.0))```


**3.One drawback of Lambda layers is that you cannot call a custom built function from within them.**
- False
- True


**4.A Layer is defined by having “States” and “Computation”. Consider the following code and check all that are true:**
<img src='Images/Week3 Q4.png'>

- You use ```def build(self, input_shape):``` to create the state of the layers and specify local input states.
- ```def call(self, inputs):``` performs the computation and is called when the Class is instantiated.
- In ```def __init__(self, units=32):``` you use the super keyword to initialize all of the custom layer attributes
- After training, this class will return a $w*X + b$ computation, where X is the input, w is the weight/kernel tensor with trained values, and b is the bias tensor with trained values.


**5.Consider the following code snippet.**
<img src='Images/Week3 Q5.png'>


**What are the function modifications that are needed for passing an activation function to this custom layer implementation?**
1. 


```
def build(self, input_shape): 
   self.activation = tf.keras.activations.get(activation) 

def call(self, inputs): 
   return self.activation(tf.matmul(inputs, self.w) + self.b)
```


2. 


```
def __init__(self, units=32, activation=None):
    self.activation = tf.keras.activations.get(activation) 

def call(self, inputs): 
   return self.activation(tf.matmul(inputs, self.w) + self.b)
```

3. 


```
def __init__(self, units=32):
    self.activation = tf.keras.activations.get(activation)

 def call(self, inputs):
    return self.activation(tf.matmul(inputs, self.w) + self.b)
```

4. 


```
def build(self, units=32, activation=None):
    self.activation = activation 

def call(self, inputs): 
   return self.activation(tf.matmul(inputs, self.w) + self.b)
```   

---

## Complex Architectures with the Functional API
You've come a long way in a very short time. You started by looking at the functional API, and you saw how you don't have to be limited to simple sequences of layers. You then looked into customizing loss functions by creating your own. And then last week, you saw how you can customize layers by subclassing the Layer API to create your own layer types. For the exercise, and a bit of fun, you looked at building a quadratic layer to replace the default dense layer type that's used in deep neural networks. This week you'll wrap it all up by looking at the Model API, where you can explore how models work and look into how you can extend models with functionality that's in neither the sequential API nor the functional API. We'll start by looking at how to define a wide and deep model using the functional API. From there, you'll see how to extend the model in order to encapsulate it into a single class for more flexibility.

Let's re-explore creating models using the functional API and from there, we'll be able to understand the power of extending the model classes to encapsulate our code. So for example, consider this model. 

<img src='Images/Deep Model.png'>

It's a complex model in that it isn't a simple linear sequence of layers. There are multiple paths through the model that get concatenated, and there are multiple inputs, so we have to use the functional API to build something like this. This type of network is a very simple example of a **deep and wide** model: one input path goes through multiple layers of neurons (the deep side), while the other doesn't and is typically a shallow, or wide, layer. 

For the deep side of the model, we need to input data and you can see that here. The input is fed through a couple of dense layers, and that gives us the name deep. On the wide side, we have another input that operates in parallel to the deep one. The results of these two sides are then concatenated and the results of that are fed into the output dense. 

[Google AI Blog - Wide and Deep Learning: Better Together with TensorFlow](https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html)

If we explore how to code this, we'll see code that looks a little like this:

In [None]:
from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.models import Model

input_a = Input(shape=[1], name='Wide_Input')
input_b = Input(shape=[1], name='Deep_Input')
hidden_1 = Dense(30, activation='relu')(input_b)
hidden_2 = Dense(30, activation='relu')(hidden_1)
concat = concatenate([input_a, hidden_2])
output = Dense(1, name='Output')(concat)  # the output layer follows concat
model = Model(inputs=[input_a, input_b], outputs=[output])

We can declare a wide input first, and we'll refer to that as ```input_a```. Next is the deep input, which we'll refer to as ```input_b```. The first hidden layer, called ```hidden_1```, is declared as having 30 units and it's activated by ```relu```. By putting ```input_b``` in parentheses afterwards, we're also declaring that this layer should follow ```input_b```, which is our deep input. So the layer ends up as shown. 

The second layer, called ```hidden_2```, is architected similarly, but is declared to follow ```hidden_1```, so it will appear like this in the graph. We then declare the concatenate layer with a list of the layers whose outputs it will concatenate, which in this case are ```hidden_2``` and the wide input layer, also known as ```input_a```. The output layer is defined as a dense layer which follows the concat, so it ends up getting the results of the concatenation as its input. When we declare the model, we do so with a ```Model``` call, passing it a list of inputs and outputs. As we have two inputs, we list them both along with our single output. 

It's a very flexible way of creating models. So for example, if we want to add another output, we can do so using a single-node dense layer that's specified to follow the second hidden layer. That layer also feeds the concatenate, so we end up with an architecture like the one shown below, and when we declare the model, we have to add the new output in order for it to be recognized. Changes like this can also lead to the Keras model diagram being drawn differently, or rebalanced. So in this case, the wide input has moved to the right and the deep input to the left. The architecture itself hasn't changed, except of course for the addition of the auxiliary output. 

<img src='Images/Deep Model AUX.PNG'>

Now you can begin to see much of the flexibility offered by the functional API in creating complex architectures. You'll see how to encapsulate this code into a class, which will then keep your training code much cleaner.
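The auxiliary-output variant described above isn't shown as code in this section. A minimal sketch of it might look like the following; the layer name ```Aux_Output``` and the dummy inputs at the end are assumptions made for illustration.

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.models import Model

input_a = Input(shape=(1,), name='Wide_Input')
input_b = Input(shape=(1,), name='Deep_Input')
hidden_1 = Dense(30, activation='relu')(input_b)
hidden_2 = Dense(30, activation='relu')(hidden_1)
concat = concatenate([input_a, hidden_2])
output = Dense(1, name='Output')(concat)

# The auxiliary output is a single-unit dense layer that follows hidden_2.
aux_output = Dense(1, name='Aux_Output')(hidden_2)

# Both outputs must be listed for the model to recognize them.
model = Model(inputs=[input_a, input_b], outputs=[output, aux_output])

# Dummy batch of 4 examples per input, just to check the shapes.
main_pred, aux_pred = model([tf.ones((4, 1)), tf.ones((4, 1))])
print(main_pred.shape, aux_pred.shape)
```

Note that ```hidden_2``` now feeds two layers, the concatenate and the auxiliary output, without being redefined.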

Previously you saw how to implement a complex architecture using the functional API. In this video, you'll see how this architecture can be encapsulated into a class for tidier code and easier reuse.

In [None]:
class WideDeepModel(Model):
    def __init__(self, units, activation='relu', **kwargs):
        super().__init__(**kwargs)
        # Layers are only declared here; they are wired together in call
        self.hidden_1 = Dense(units, activation=activation)
        self.hidden_2 = Dense(units, activation=activation)
        self.main_output = Dense(1)
        self.aux_output = Dense(1)
        
    def call(self, inputs):
        input_a, input_b = inputs
        hidden_1 = self.hidden_1(input_b)
        hidden_2 = self.hidden_2(hidden_1)
        concat = concatenate([input_a, hidden_2])
        main_output = self.main_output(concat)
        aux_output = self.aux_output(hidden_2)
        return main_output, aux_output

For cleaner code when encapsulating the entire model, which is particularly useful if you want to orchestrate multiple models in a solution, you can create a class of your own. When this class extends the base Keras model, you can then do everything that you would do with a model, such as training and running inference. When you extend the model class, you need to implement at least the following two methods. The first, ```init```, initializes the class and should also be used to initialize the base class that this one extends. In this case that's the model class, and then it creates the instances of the internal variables or states that this class will use. If it looks similar to what you did for custom layers last week, you'd be right; the pattern is pretty much identical. 

The model had hidden layers and outputs that were dense layers. We can create instance attributes that represent them in the ```init``` function. The other function you'll need is the ```call``` function, which gets executed when the model instance is called on its inputs, for example during training or inference. Here you define your model outputs, which will be returned by the ```call``` function. The outputs are generated based on the inputs through the model architecture, so you effectively encapsulate the entire architecture here. You pass it the inputs, and it passes their data through the network architecture to get the outputs that it can then return.

In [None]:
model = WideDeepModel(units=30)

Now, once you've defined this class, it's easy to create a model using an instance of it like this and this keeps your main code so much cleaner. 

But it also has lots more benefits because the creation of the layers is separated from the usage, you can do lots of interesting stuff in the ```call``` method. For example, you could have loops defining multiple layers, you could have if then statements or other operations. You aren't limited to the static declaration of the model that you'd have with the functional or sequential APIs. It would also allow you to define subnetworks that might be used in larger models, and you'll explore that next.
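To illustrate the point about loops and conditions inside ```call```, here is a small sketch. The class name ```RepeatedDenseModel``` and its ```repeats``` and ```use_skip``` parameters are hypothetical, invented for this example only.

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model


class RepeatedDenseModel(Model):  # hypothetical name, for illustration only
    def __init__(self, units=8, repeats=3, use_skip=True, **kwargs):
        super().__init__(**kwargs)
        self.repeats = repeats
        self.use_skip = use_skip
        self.hidden = Dense(units, activation='relu')  # created once
        self.out = Dense(1)

    def call(self, inputs):
        x = inputs
        # A plain Python loop: the same layer (same weights) is applied repeatedly.
        for _ in range(self.repeats):
            x = self.hidden(x)
        # A plain Python condition chooses the data path.
        if self.use_skip:
            x = x + inputs
        return self.out(x)


model = RepeatedDenseModel(units=8, repeats=3)
# The input width must equal `units` so the layer can be reused and the skip added.
y = model(tf.ones((2, 8)))
print(y.shape)
print(len(model.trainable_weights))
```

However many times the loop runs, the model holds only one hidden kernel and bias plus the output kernel and bias.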

### Using the Model class to simplify architectures
Previously you saw how the Keras model class can be subclassed, allowing you to encapsulate your models and extend how models are executed by your code. You saw a code example for a complex wide and deep model that you could use in a single line of code, where all of the layers were implemented within the custom class instead of in your main code. 

Before we go deeper into some examples of how you might want to use this, let's do a quick recap of the implications of writing your code this way. First of all, you're using the model class and it's the same one that you'd be using if you are creating a model using the sequential or the functional APIs. 

<img src='Images/Model Class.png'>

Because you're subclassing and extending the original model class, you can take advantage of the functionality that's already present in the model class, including the built-in training, evaluation and prediction loops. Things like ```model.fit```, ```evaluate``` and ```predict``` are already there, so you don't need to re-implement them, which is nice. But of course, if you wanted to, you could override and customize this functionality. Additionally, APIs for saving and serialization are available to you, so things like saving the model or a particular set of weights can be done easily. And the APIs that you use to summarize or visualize the model are available, so you can use ```model.summary``` or ```plot_model``` on your custom model too. 

If you're using the sequential or functional APIs, there are still some limitations that you have to consider when you look at complex or exotic model architectures. For example, networks like **directed acyclic graphs**, which are made up of directed layers that don't loop or cycle, are pretty well-suited for the sequential or functional APIs. 

<img src='Images/Limit Seq&Func.png'>

Examples of this are **MobileNet** and **Inception**. In these cases, the data flows from the inputs to the outputs, and sometimes it's in multiple branches as we've seen already, but the direction is always the same; it never loops back during training or inference. By contrast, networks where recursion is used, or dynamic networks where the architecture can change on the fly, can be very difficult to build if you use these APIs. With the sequential API they're impossible; with the functional API they might be possible, but will involve a lot of coding. But with **model subclassing**, it can become a bit easier to achieve these complex scenarios. 

When subclassing models, you do have a number of benefits. 

<img src='Images/benifits sub class.png'>

1. The first is that it extends the natural way you've likely been building models using the sequential or functional APIs; it's not a revolutionary change and it doesn't require you to learn extensive new skills. 
2. The second is that it allows you to continue to use model code that you've previously used with the functional or sequential APIs, so that investment isn't necessarily lost and it can be encapsulated into future work. 
3. When creating your architecture using subclass models, you can design your architecture in a modular way that allows you to swap pieces of your architecture in and out without replacing the entire thing. 
4. This allows you to try experiments quickly. Perhaps you might want to have 3 convolutional layers instead of 4 in a part of your architecture. Instead of rewriting the code for the full architecture and possibly introducing new bugs, you could have a convolutional module with 3 layers in it and change just that module to have 4. 
5. Also interestingly, you could change the flow of data in your network. Where instead of data going from the top layer to the bottom one, you can have things like branches and loops. 

Using and subclassing the model class can give you powerful and flexible model architectures. Next you'll see how to implement these in a ResNet architecture.

### Understanding Residual networks
The example I look at in this video is a simplified version of a residual network, or ResNet for short. This type of network architecture has been shown in research to help you increase the depth of your network while potentially not losing accuracy. 

Ultimately, ResNets work by having an alternative path from the input to the output, and it can be represented like this.

<img src='Images/Loop.png'>

So for example, this block might have two dense layers. Data flows through it, as we're familiar with, in a path I'm going to call the main path. But a ResNet also has a shortcut path where the data doesn't go through the same route, and it might flow through this block completely unaltered. It's typically summed with the main path data so that the output of the block isn't just the modified data from passing through the dense layers, but the original plus the altered data. So for example, if we were to define this in a code block or a function, we could encapsulate the entire flow a little bit. 
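The idea of summing the shortcut with the main path can be sketched without any framework. The helper names below are hypothetical and the weights are made-up values chosen so the result is easy to follow by hand.

```python
def dense_relu(x, weights, bias):
    """One dense layer with ReLU on a single input vector.

    `weights` holds one weight vector per output neuron.
    """
    return [max(0.0, sum(xi * wi for xi, wi in zip(x, w)) + b)
            for w, b in zip(weights, bias)]


def residual_block(x, layers):
    """Main path through `layers`, then add the unaltered input (shortcut)."""
    main = x
    for weights, bias in layers:
        main = dense_relu(main, weights, bias)
    return [xi + mi for xi, mi in zip(x, main)]


x = [1.0, 2.0]
zero_layer = ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
identity_layer = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])

# If the main path contributes nothing, the block passes x through unchanged.
passthrough = residual_block(x, [zero_layer, zero_layer])

# If the main path reproduces x, the block returns x + x.
doubled = residual_block(x, [identity_layer, identity_layer])

print(passthrough, doubled)
```

The first case shows why the shortcut helps deep networks: a block can fall back to passing its input along when its layers have nothing useful to add.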
 
So now consider a network architecture that uses a block like this. 

<img src='Images/Block.png'>

There may be two different types, and I'm going to call them residual Type 1, which is colored in blue, and residual Type 2, which is in orange. Your architecture could define a dense layer, then a Type 1 residual, then three Type 2 residuals followed by another dense layer. What I'm referring to as Type 1 or Type 2 here aren't formal types; they're just slightly different blocks that you've defined. So in the previous slides we had two dense layers with a shortcut around them. Let's call that Type 2, and perhaps there's a different block that might have two convolutional layers with a shortcut around them. We can call that Type 1. So for example, if you have the block we defined earlier as a residual Type 2, you could have three of them in sequence in your architecture here. But one of the golden rules of programming is don't repeat yourself. Why should you have three blocks of code here to pass the data through when instead you could have a loop that runs the data through residual Type 2 three times? Also, it could be the same weights in each block, so instead of each of the three blocks learning independent weights separately, you get one block that is learned and executed three times. This type of design might seem completely alien if you were using the sequential API, and would involve a lot of spaghetti code if you're using the functional API. 
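To make the weight-sharing point concrete, here is a small parameter count. It assumes, purely for illustration, that a Type 2 block holds two dense layers of 64 units acting on 64 features; the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Trainable parameters of one dense layer: a kernel plus a bias."""
    return n_in * n_out + n_out


def block_params(width, layers_per_block):
    """Parameters of a block of equally sized dense layers."""
    return layers_per_block * dense_params(width, width)


WIDTH = 64            # assumed feature width
LAYERS_PER_BLOCK = 2  # assumed dense layers per Type 2 block
REPEATS = 3           # the block is used three times

# Three independent copies of the block each learn their own weights.
separate = REPEATS * block_params(WIDTH, LAYERS_PER_BLOCK)

# One block looped three times learns a single set of weights.
shared = block_params(WIDTH, LAYERS_PER_BLOCK)

print(separate, shared)
```

Under these assumptions the looped design trains a third as many parameters for this part of the network, although the compute per forward pass is the same.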

<img src='Images/Block simple.png'>

So let's take a look at how we could implement this using model subclassing next. So let's start with one of our residual layers. 
<img src='Images/CNNBlock.png'>

In [None]:
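class CNNResidual(Layer):
    def __init__(self, layers, filters, **kwargs):
        super().__init__(**kwargs)
        # padding='same' keeps the spatial size so the shortcut can be added
        self.hidden = [Conv2D(filters, (3,3), activation='relu', padding='same') 
                       for _ in range(layers)]
        
    def call(self, inputs):
        x = inputs
        for layer in self.hidden:
            x = layer(x)
        # Shortcut: inputs must have the same shape as x (filters channels)
        return inputs + x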

I'll call this one ```CNNResidual```, because this custom layer block will be residual, but it will include convolutional layers within it. When you construct an object from this class, you'll pass parameters for the number of layers that you want, as well as the number of convolutional filters within each layer. So this code defines the set of hidden layers within the ```CNNResidual``` class to be ```Conv2D``` layers that use ```(3,3)``` filters and are activated by ```relu```. The loop will create as many of them as specified in the ```layers``` parameter. Then within our ```call``` we can start flowing data through the layers, starting with the inputs. Using the functional syntax, each ```x``` follows the previous ```x```, and remember that there are a number of layers in ```self.hidden```, so we'll loop through each of them. That is the main path through the residual network block. Then at the end, the input is added to the result, so we have the combination of the main path and the shortcut. 
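The addition ```inputs + x``` only works when both tensors have the same shape, which is worth checking for convolutional blocks. The arithmetic sketch below compares the two Keras ```Conv2D``` padding modes, ```'valid'``` (the default) and ```'same'```, for a stack of two 3 by 3 convolutions; the 28-pixel input size and the helper name are assumptions for illustration.

```python
import math


def conv_output_size(size, kernel, stride=1, padding='valid'):
    """Spatial output size of a convolution along one dimension."""
    if padding == 'same':
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1


size_valid = 28  # assumed input height/width
size_same = 28
for _ in range(2):  # two stacked 3x3 convolutions
    size_valid = conv_output_size(size_valid, 3, padding='valid')
    size_same = conv_output_size(size_same, 3, padding='same')

print(size_valid, size_same)
```

With ```'valid'``` padding each 3 by 3 convolution trims two pixels per dimension, so the main path no longer lines up with the shortcut. With ```'same'``` padding the spatial size is preserved, and the sum is well defined as long as the channel counts also match.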

The other residual layer type can be a ```DNNResidual``` and the code for this is very similar.
<img src='Images/DenseBlock.png'>

In [None]:
class DNNResidual(Layer):
    def __init__(self, layers, neurons, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [Dense(neurons, activation='relu') 
                       for _ in range(layers)]
        
    def call(self, inputs):
        x = inputs
        for layer in self.hidden:
            x = layer(x)
        # Shortcut: inputs must have `neurons` features for the sum to work
        return inputs + x

Here we construct the inner layers of this layer using dense layers instead of convolutional ones. The ```init``` function is very similar, but it constructs ```Dense``` layers and takes the number of neurons in each dense layer instead of the number of filters we had in the convolutional one. As the layers are now nicely abstracted, the ```call``` looks identical: you start with the inputs, and then for each of the layers in the hidden section you use the functional syntax to add them. Remember that the ```x = layer(x)``` syntax means that ```x``` will be a new stack of layers consisting of ```layer``` following ```x```. Then we add the inputs to the result in order to combine the shortcut and the main path. 

Now if we look back at our model architecture.
<img src='Images/Block simple.png'>

We can replace the residual Type 1 and residual Type 2 with ```CNNResidual``` and ```DNNResidual```. We want to have a single ```CNNResidual``` as well as a loop around the ```DNNResidual``` three times. So here's the code:

In [None]:
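class MyResidual(Model):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Layer sizes are illustrative; each residual add needs matching shapes
        self.hidden_1 = Dense(30, activation='relu')
        self.block_1 = CNNResidual(2, 32)
        self.block_2 = DNNResidual(2, 64)
        # Named out because Model already defines a read-only output property
        self.out = Dense(1)
        
    def call(self, inputs):
        x = self.hidden_1(inputs)
        x = self.block_1(x)
        for _ in range(1,4):  # three passes through the same block_2
            x = self.block_2(x)
        return self.out(x)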

First is the initialization of the model, where the states are declared. Note that there is no sequence inherent here; that comes later when the model is called. Right now we're just defining the layers of the model, so the first, called ```hidden_1```, is a dense layer with 30 neurons, activated by ```relu```. Earlier we created the ```CNNResidual``` layer type and we can initialize that with two layers, each containing 32 filters. We'll call that ```block_1```. Then we'll define a ```DNNResidual``` with two inner layers, each with 64 neurons. We can call this ```block_2```. Note that we don't have three instances of ```block_2```. We just have one, and we'll pass the data through it 3 times in both training and inference, but that won't happen until we call the layer. So let's just define it once. And then there's our output, which will be just a dense layer with one neuron in it. 

So the fun really begins in the ```call```, and here's where we'll construct the architecture. Remember this uses the functional syntax, which is to define a layer and then put in parentheses the layer it's following. So ```hidden_1``` follows ```inputs```, and that chunk of the architecture will be called ```x```. Then ```self.block_1``` will follow the chunk that we just called ```x```, and that chunk will now be called ```x```. ```block_1```, as you can see in the ```init```, is our CNN residual layer. Now we can go into a loop. We wanted the data to loop through the same DNN residual layer three times, so we can have a loop in our ```call``` block. Remember that we only have one instance of a DNN residual; we're not copying it 3 times. Instead, we're looping through it 3 times for a cyclic network. So following the functional syntax, this gets appended to the network and this chunk is now called ```x```. Our final element will be the output layer that we defined, following ```x```, and ```x``` of course was the chunk of architecture that was defined by the earlier lines. We now have a complete model with cyclic functionality using custom residual layers that are encapsulated into a single, simple, compact class. Next we're going to take a look at a popular network architecture called ResNet-18 and, using what you've learned here, you'll see how that can be built.

- Lecture on [Residual Networks](https://www.coursera.org/lecture/convolutional-neural-networks/resnets-HAhz9) by Andrew Ng (part of [Deep Learning Specialization](https://www.coursera.org/specializations/deep-learning), [Course 4: Convolutional Neural Networks](https://www.coursera.org/learn/convolutional-neural-networks))

### Coding a Residual network with the Model class
A powerful model for computer vision is ResNet-18. This is a network that was designed using residual methods to ease the training of a very deep network. We'll take a look at how a ResNet-18 is architected, and use model subclassing to be able to build it. So here's a look at a ResNet architecture that's based on ResNet-18. It starts with a 7 by 7 convolutional layer, which is batch normalized, followed by a 3 by 3 max pooling. 
<img src='Images/Resnet18.PNG'>

After this are repeated blocks of logic, which are usually called identity ResNet blocks. As you can see, it's a residual network with a path through convolutions and batch normalization, with a shortcut path around them. I've marked those in blue. The name identity comes from the fact that they have a shortcut, which often doesn't change the input X at all and just performs an identity transformation instead of going through all the layers.
<img src='Images/Resnet18Identitty.PNG'>

Additionally, there is a different block type, which is almost identical, but which routes the shortcut through a 1 by 1 convolutional layer. I've marked that in orange.
<img src='Images/Resnet18Conv.PNG'>

So this section, with the ResNet identity block in blue and the enhanced one in orange, is then repeated a number of times. For example, 3 times.

It's a big and complex model, so let's start by creating a mini version of it that has two identity blocks. We'll use that to build a simple classifier. Starting with the identity block, here's the code that constructs it.

In [None]:
class IdentityBlock(tf.keras.Model):
    def __init__(self, filters, kernel_size):
        super(IdentityBlock, self).__init__(name='')
        
        self.conv1 = tf.keras.layers.Conv2D(filters, kernel_size, padding='same')
        self.bn1 = tf.keras.layers.BatchNormalization()
        
        self.conv2 = tf.keras.layers.Conv2D(filters, kernel_size, padding='same')
        self.bn2 = tf.keras.layers.BatchNormalization()
        
        self.act = tf.keras.layers.Activation('relu')
        self.add = tf.keras.layers.Add()
        
    def call(self, input_tensor):
        x = self.conv1(input_tensor)
        x = self.bn1(x)
        x = self.act(x)
        
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.add([x, input_tensor])
        
        x = self.act(x)
        return x

Note that we extend the model class here, so this code could actually act as a standalone model if we wanted it to. In the ```init``` block we'll define each of the layers. Remember, we're just defining and declaring them here and not the full model architecture itself. So the 3 by 3 convolutional layer and batch normalization at the beginning of the block are defined here. The number of filters and the kernel size can be parameterized.

The next 3 by 3 convolution and batch normalization at the end of the block are defined similarly. The layers are activated using ```relu```, so we'll define an activation here and call it ```self.act```.

This is a residual network, so the main path and this shortcut path get added, so we can declare an ```Add``` layer here to handle all of that.

Then within the ```call``` method we'll define a sequence of layers using the functional syntax, and, as you can see, it begins with the input. It's followed by the first convolutional layer, the first batch normalization layer, et cetera. Notice how things like the relu layer called ```self.act``` can be reused.

Now let's look back at our mini ResNet model, and the identity blocks that we just created in code are what's shown in blue here. 

<img src='Images/MiniIdentitty.PNG'>

You've built these blocks already, so we need to define the rest of the model, such as the convolution, the pooling and the batch normalization that are used outside of the identity blocks.
<img src='Images/MiniOther.PNG'>

And here's the code to define the model: 

In [None]:
class ResNet(tf.keras.Model):
    def __init__(self, num_classes):
        super(ResNet, self).__init__(name='')
        
        self.conv = tf.keras.layers.Conv2D(64, 7, padding='same')
        self.bn = tf.keras.layers.BatchNormalization()
        
        self.act = tf.keras.layers.Activation('relu')
        self.max_pool = tf.keras.layers.MaxPool2D((3, 3))
        self.idb1 = IdentityBlock(64, 3)
        self.idb2 = IdentityBlock(64, 3)
        self.global_pool = tf.keras.layers.GlobalAveragePooling2D()
        self.classifier = tf.keras.layers.Dense(num_classes, activation='softmax')
        
    def call(self, input_tensor):
        x = self.conv(input_tensor)
        x = self.bn(x)
        x = self.act(x)
        x = self.max_pool(x)
        
        x = self.idb1(x)
        x = self.idb2(x)
        x = self.global_pool(x)
        
        x = self.classifier(x)
        return x

We'll start with the init function that defines each of the layers within the model. Later, we'll see the call function where the model architecture is built up from these layers. As before this will extend the model class, so we're building a new model type of ResNet, which contains identity block models. The initial 7 by 7 convolutional layer is defined as conv, it's batch normalization is to find a self.bn. The relu activation is self.act. A 3 by 3 max pooling will follow this, so we can define that we have to identity blocks. Remember when we created them, they could take parameters with filters and colonel size. So by specifying 64 and 3, these identity blocks will have convs to the layers with 64, 3 by 3 filters within them.

A final global average pooling layer is then defined. Because this model is going to be a classifier for a number of classes, we can parameterize that. For example, if we were going to train it to classify MNIST, which has 10 classes, then instead of hard-coding a final dense layer with 10 neurons, we could simply pass in the desired number of classes as a parameter and define the classifier layer with that number. It's a really nice way to make a very flexible model. So that's it for the ```init```.

Let's check out the ```call```, where the model architecture is constructed from these layers. It takes the inputs and passes them through the convolutional layer. This, in turn, is followed by the batch normalization layer, whose output gets activated and is then followed by the max pooling layer. The two identity blocks follow that, and their output goes to global average pooling before the whole thing gets passed to the dense layer that we use to classify.

In [None]:
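import tensorflow_datasets as tfds

resnet = ResNet(10)
resnet.compile(loss='sparse_categorical_crossentropy', 
               optimizer='adam', metrics=['accuracy'])

# `preprocess` is assumed to be defined elsewhere; it should map each
# tfds example to an (image, label) pair.
dataset = tfds.load('mnist', split=tfds.Split.TRAIN)
dataset = dataset.map(preprocess).batch(32)
resnet.fit(dataset, epochs=1)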

Using this in code is then very straightforward. Here's a classifier for MNIST that uses our modified mini ResNet. We instantiate it by passing the parameter 10, because MNIST has 10 classes and we want our final dense layer to have that many neurons. We're going to train with the MNIST dataset, which can be loaded from ```tfds```. ```ResNet``` was subclassed from ```Model```, so we can take advantage of ```model.fit``` and call ```resnet.fit``` to train our model. And that's it: that's how to use the ```Model``` API to construct your own custom model type.
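
The training cell calls a ```preprocess``` function that isn't shown in this section. Here is a minimal sketch of what it might look like, assuming each ```tfds``` MNIST example is a dictionary with ```image``` and ```label``` keys. Scaling the pixels to the 0 to 1 range is an assumption for illustration, not necessarily what the course used.

```python
import tensorflow as tf

def preprocess(features):
    # Assumption: tfds yields dicts with a uint8 'image' and an integer 'label'.
    image = tf.cast(features['image'], tf.float32) / 255.0  # scale to [0, 1]
    return image, features['label']

# Quick check on a hand-made example instead of the real dataset.
sample = {
    'image': tf.constant(255, dtype=tf.uint8, shape=(28, 28, 1)),
    'label': tf.constant(7, dtype=tf.int64),
}
image, label = preprocess(sample)
print(image.dtype, image.shape, int(label))
```

Returning an ```(image, label)``` pair matters because ```fit``` expects a dataset of input and target pairs, not of dictionaries.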

## Assignment
### Custom Models

**1.Following is an example of a deep and wide network structure.**
<img src='Images/Week4 Q1.png'>
- True
- False


**2.Consider the following code and check all that are true:**
<img src='Images/Week4 Q2.png'>
- The init function initializes the MyModel Class objects, as well as the attributes that are inherited from the Model Class.
- The code is incomplete in the sense that you can only initialize and construct your model, you cannot perform training or inference.
- The concat should be defined within the init function instead of the call function as it is also a hidden layer.
- The output layers cannot give more than 1 result each.


**3.You have learned that Sequential and Functional APIs have their limitations.**

**How can you build dynamic networks where the architecture changes on the fly, or networks where recursion is used? Check all that are true:**
- Using model subclassing
- Using Functional API
- Using Sequential API


**4.Which one of the following is a false statement regarding model subclassing?**
- You cannot introduce a branch structure in the architecture when doing model subclassing.
- You can make use of Functional and Sequential APIs when writing code for model subclassing.
- Instead of tweaking the entire architecture, you can have different modules and make changes in them as required, as opposed to entirely rewriting the structure.
- You can have modular architectures


**5.Consider the following two images:**
<img src='Images/Week4 Q5_1.png'>
<img src='Images/Week4 Q5_2.png'>

**Check all that are true:**
- You loop Residual Type 2 (Dense layers) because you cannot make a loop of Conv2D layers (Residual Type 1)
- You make a loop of Residual Type 2 blocks because you want to reduce the depth of the network (making it less complex of an architecture)
- Each Residual block has two hidden layers and one add layer in it.
- When you make a loop of Residual Type 2 blocks, each block could have the same weights. 

---

## Built-in Callbacks
### Introduction to Callbacks
Callbacks are a useful piece of functionality in TensorFlow that lets you have control over the training process. There are two main flavors of callback: the **built-in callbacks**, which are prebuilt functions that allow you to do things like saving checkpoints and early stopping, and the **custom callbacks**, where you can override the callback class to do whatever you want.

I'm going to look at the built-in callbacks first, and later you'll learn how to write custom ones. In summary, callbacks are designed to give you some type of functionality while you're training: every epoch, you can have code that executes to perform a task. What that task is, is up to you.

<img src='Images/callbacks.png'>

There's a ```tf.keras.callbacks.Callback``` class that you'll subclass, so the pattern you've been looking at in this course for subclassing existing objects will also work for this. Callbacks are particularly useful in helping you understand the model state during training, saving you valuable time as you're optimizing your model.

In [None]:
class Callback(object):
    def __init__(self):
        self.validation_data = None
        self.model = None
        
    def on_epoch_begin(self, epoch, logs=None):
        """Called at the beginning of an epoch during training."""
        
    def on_epoch_end(self, epoch, logs=None):
        """Called at the end of an epoch during training."""

Here's the anatomy of a callback class. As with any class in Python, you define it to extend an existing class, and you have instance variables initialized in the ```init``` function. For callbacks, you have the ```on_epoch_begin``` function that you can override, which, as its name suggests, gets called at the beginning of every epoch. Similarly, there's the ```on_epoch_end``` function that gets called at the end of each epoch.

In [None]:
# Schematic only: `(train|test|predict)` is shorthand for three separate
# methods, e.g. on_train_begin, on_test_begin and on_predict_begin.
class Callback(object):
    def __init__(self):
        self.validation_data = None
        self.model = None
        
    def on_(train|test|predict)_begin(self, logs=None):
        """Called at the beginning of training, testing or prediction."""
        
    def on_(train|test|predict)_end(self, logs=None):
        """Called at the end of training, testing or prediction."""
        
    def on_(train|test|predict)_batch_begin(self, batch, logs=None):
        """Called right before a batch is processed."""
        
    def on_(train|test|predict)_batch_end(self, batch, logs=None):
        """Called right after a batch is processed."""

You may have seen or used these already, and we covered a little bit of them in the [TensorFlow in Practice Specialization](), if you've studied that. But there are lots of others, including those that can be used when you're training, testing, or running predictions with the model.

So, for example, when you run a prediction, you can have a callback that fires at the beginning or the end by overriding the ```on_predict_begin``` method or the ```on_predict_end``` method, respectively. You can do the same for training and testing.

Also, when running batches, you can have ```on_predict_batch_begin``` or ```on_predict_batch_end``` so that you can execute code batch by batch while predicting, and of course, you could do similar for training and testing.
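
To see the order in which these hooks fire, here is a small sketch that records each call during a tiny training run. The ```HookRecorder``` class name, the toy model, and the all-zeros data are illustrative assumptions, not part of the course material.

```python
import numpy as np
import tensorflow as tf

class HookRecorder(tf.keras.callbacks.Callback):
    """Hypothetical callback that records the name of every hook it sees."""

    def __init__(self):
        super().__init__()
        self.events = []

    def on_train_begin(self, logs=None):
        self.events.append('train_begin')

    def on_epoch_begin(self, epoch, logs=None):
        self.events.append('epoch_{}_begin'.format(epoch))

    def on_train_batch_end(self, batch, logs=None):
        self.events.append('batch_{}_end'.format(batch))

    def on_epoch_end(self, epoch, logs=None):
        self.events.append('epoch_{}_end'.format(epoch))

    def on_train_end(self, logs=None):
        self.events.append('train_end')

# Toy model and data: 8 samples with a batch size of 4 gives 2 batches.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(loss='mse', optimizer='sgd')
x = np.zeros((8, 4), dtype='float32')
y = np.zeros((8, 1), dtype='float32')

recorder = HookRecorder()
model.fit(x, y, batch_size=4, epochs=1, verbose=0, callbacks=[recorder])
print(recorder.events)
```

Note that epoch and batch numbers passed to the hooks start at 0.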

This, of course, leads to the question, where would you use callbacks? 
<img src='Images/callbacks_where.PNG'>

### TensorBoard
Well, the model methods that involve training, evaluation, or prediction use them: you simply specify them using the ```callbacks=``` parameter. So now let's take a look at some of the built-in callbacks.
<img src='Images/callbacks_tensorboard.PNG'>

We'll start with TensorBoard, which, if you aren't familiar with it, provides a suite of visualization tools for TensorFlow. It lets you visualize your experiments and track metrics like loss and accuracy, as well as view the model graph. You can learn more about it [here](https://www.tensorflow.org/tensorboard).

In [None]:
import datetime
import os

log_dir = os.path.join('logs', datetime.datetime.now().strftime('%Y%m%d-%H%M%S'))
tensorboard = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
model.fit(train_batches, epochs=10, callbacks=[tensorboard])

Using the TensorBoard callback is super simple: you define it and then start training. It's defined by creating an instance of the ```TensorBoard``` callback and specifying the desired log directory.
<img src='Images/logs_tensorboard.png'>

TensorFlow then saves the details to that directory, and when TensorBoard is pointed at the logs directory, it does its thing, such as plotting accuracy and loss. It can even be used in Colab by loading it as an extension, as you can see here.
```
# Load the extension
%load_ext tensorboard

# Run TensorBoard
%tensorboard --logdir logs
```
<img src='Images/tensorboard_plots.png'>

- [TensorBoard: TensorFlow's visualization toolkit](https://www.tensorflow.org/tensorboard)

### Model Checkpoints
Next, take a look at model checkpoints, where the model's details can be saved out epoch by epoch for later inspection, or so that we can monitor progress through them.

The ```ModelCheckpoint``` class saves the model details for you, with a lot of parameters that you can use to fine-tune it. So let's look at some examples.

<img src='Images/checkpoint.png'>

In [None]:
ModelCheckpoint(filepath, 
                monitor='val_loss', 
                mode='auto', 
                save_best_only=False, 
                save_weights_only=False, 
                verbose=0, 
                save_freq='epoch', 
                **kwargs)

model.fit(train_batches, 
          epochs=5, 
          validation_data=validation_batches, 
          callbacks=[ModelCheckpoint('model.h5', verbose=1)])

Here's an example of using it in the ```model.fit``` method. Using the callbacks parameter, I specified that I want a model checkpoint with the model file being called ```model.h5```.

Then, during the training process, you can see that the model is getting saved out epoch by epoch. If I don't want the entire model structure and only want the weights, I can do so by setting the ```save_weights_only``` parameter to ```True```.
<img src='Images/checkpoint_callback.png'>

Or, if I only want to save when I reach optimal values, I can do so by setting ```save_best_only``` to ```True```. Then the model will be saved whenever the value I specify in the ```monitor``` parameter improves. As you can see here, in the first epoch the value started as infinite and ended at 0.65278, so the model got saved. In the second epoch it improved, so it got saved again, and so on. If at some point your ```val_loss``` starts to increase, the model checkpoints, of course, would not be saved.
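
The "save only when it improves" rule can be sketched in plain Python. This is a simplified illustration of the idea for a monitored loss, not Keras's actual implementation, and every loss value after the first epoch is made up.

```python
import math

def epochs_saved(val_losses):
    """Return the 1-based epochs at which a checkpoint would be written."""
    best = math.inf  # the monitored value starts out as infinite
    saved = []
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best:  # improvement: remember it and save
            best = val_loss
            saved.append(epoch)
    return saved

# Hypothetical run: epoch 3 is worse than epoch 2, so nothing is saved there.
print(epochs_saved([0.65278, 0.60, 0.62, 0.58]))  # [1, 2, 4]
```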

In these examples, I've been showing the native Keras ```.h5``` format, but of course, you can also use ```SavedModel```, which is the standard TensorFlow format. For reference, here is the signature of ```tf.keras.models.save_model```:
```
tf.keras.models.save_model(
    model,
    filepath,
    overwrite=True,
    include_optimizer=True,
    save_format=None,
    signatures=None,
    options=None,
    save_traces=True
)
```

And as the name of the file is just specified using text, you can actually format values within the name, so you can have separate weights saved out per epoch in separate h5 files. By using the epoch value, or other metrics such as the validation loss value, you can format the file name. For example, if two digits of the epoch are used, the files are ```weights.01```, ```weights.02``` and so on.
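
Here is a small sketch of how such a file name template expands, using ordinary Python string formatting. The template and the second loss value are illustrative assumptions.

```python
# Placeholders are filled in with the epoch number and the logged metrics.
template = 'weights.{epoch:02d}-{val_loss:.2f}.h5'

names = [template.format(epoch=epoch, val_loss=val_loss)
         for epoch, val_loss in [(1, 0.65278), (2, 0.6)]]
print(names)  # ['weights.01-0.65.h5', 'weights.02-0.60.h5']
```

You would pass the template itself as the ```filepath``` argument, for example ```ModelCheckpoint('weights.{epoch:02d}-{val_loss:.2f}.h5')```. A ```val_loss``` placeholder only works if validation data is supplied to ```fit```.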

### EarlyStopping
The next built-in callback you can use is early stopping, which is useful for helping you stop training when it hits a metric that you want, where, for example, 10 epochs might be enough but you're training for 100. It can also be used the other way: if there's not enough of a noticeable improvement, early stopping can end training, saving you a lot of time.

<img src='Images/earlystop.png'>


In [None]:
EarlyStopping(monitor='val_loss', 
              verbose=0, min_delta=0, 
              patience=0, 
              mode='auto', 
              baseline=None, 
              restore_best_weights=False, 
              **kwargs)

In [None]:
model.fit(train_batches, 
          epochs=50, 
          validation_data=validation_batches, 
          callbacks=[EarlyStopping(monitor='val_loss', patience=3)])

So, for example, let's look at a scenario where it can be used to prevent overfitting. Here we want to watch the validation loss and ensure that it continues to go down.

So we set the ```monitor``` property to ```val_loss```, and then we set ```patience``` to 3. The idea here is that once we hit the best value, we'll log that, and we'll wait for this number of epochs (3) to see if the value improves.

So here you can see that at epoch 15 the validation loss was at its smallest, after which it began to increase. By epoch 18, 3 epochs later, it was still worse than at 15, so training stops. If you don't want to lose the weight values from the best epoch, you can set ```restore_best_weights``` to ```True```. In our case, even though we stopped at 18, we'll have the weights restored to where they were at 15. There are other parameters you can play with, but ```mode``` is crucial to ensure that you're following your monitored value correctly. For a loss, which you want to minimize, you would set the mode to ```min```; other metrics might require you to maximize the value, so you would set the mode to ```max```.
<img src='Images/earlystopex.png'>
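
The patience rule can be sketched in plain Python. This is a simplified illustration assuming a monitored loss in ```min``` mode and ```min_delta=0```, not the Keras source, and the loss values are made up to mirror the scenario above.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of losses."""
    best_loss = float('inf')
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_loss:
            best_loss, best_epoch, wait = val_loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical losses: improving until epoch 15, then getting worse.
val_losses = [1.0 - 0.05 * i for i in range(15)] + [0.35, 0.40, 0.45, 0.50]
print(simulate_early_stopping(val_losses, patience=3))  # (18, 15)
```

Training stops at epoch 18, and the best epoch, 15, is the one whose weights ```restore_best_weights=True``` would bring back.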

### CSVLogger
Another super useful callback is ```CSVLogger``` which, as its name suggests, will log your training results out to a CSV file.

In [None]:
model.fit(train_batches, 
          epochs=50, 
          validation_data=validation_batches, 
          callbacks=[CSVLogger('training.csv')])

So, for example, when using it like this, you'll have a file containing the epoch number, accuracy, loss, validation accuracy, and validation loss stored for you. So that's it for this quick look at the different built-in callbacks. Next, you'll see how to create custom callbacks.
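
If you want to check what ```CSVLogger``` writes, here is a small sketch that trains a toy model and reads the log back with Python's ```csv``` module. The model, the random data, and the temporary file location are assumptions for illustration.

```python
import csv
import os
import tempfile

import numpy as np
import tensorflow as tf

# Toy regression model and random data, just to produce a few epochs of logs.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(loss='mse', optimizer='sgd')
x = np.random.rand(16, 4).astype('float32')
y = np.random.rand(16, 1).astype('float32')

csv_path = os.path.join(tempfile.mkdtemp(), 'training.csv')
model.fit(x, y, validation_data=(x, y), epochs=3, verbose=0,
          callbacks=[tf.keras.callbacks.CSVLogger(csv_path)])

# Each row is one epoch; the header names the logged values.
with open(csv_path, newline='') as f:
    rows = list(csv.DictReader(f))
print(list(rows[0].keys()))
```

Because this toy model is compiled without an accuracy metric, only the loss columns appear next to the epoch number.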

### Custom Callbacks
Previously you saw how to use the built-in callbacks of TensorFlow, but you can also create your own custom callbacks. 

There are many benefits to creating your own custom callback. First of all, they're highly customizable, so you can write callbacks that suit the specific needs of your project. Your custom callbacks can still make use of all of the features of the built-in Keras callbacks, if your custom callback extends them. Finally, you can also design your custom callback with functionality that does not yet exist in the standard Keras callbacks.

It's probably easier to demonstrate this using code, so let's go through a couple of scenarios. Before we start, I'm going to build a simple model using TensorFlow Keras. It's a sequential model with one dense layer containing one neuron, and it will be compiled with an RMSprop optimizer, mean squared error loss, and the mean absolute error metric. You can see it here.

In [None]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(units=1, 
                                activation='relu', 
                                input_dim=784))
model.compile(loss='mean_squared_error', 
              optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.1), 
              metrics=['mae'])

In [None]:
import datetime

class MyCustomCallback(tf.keras.callbacks.Callback):
    def on_train_batch_begin(self, batch, logs=None):
        print('Training: Batch {}, Begins at: {}'
              .format(batch, datetime.datetime.now().time()))
        
    def on_train_batch_end(self, batch, logs=None):
        print('Training: Batch {}, Ends at: {}'
              .format(batch, datetime.datetime.now().time()))

Now let's define a custom callback, which we'll name ```MyCustomCallback```. We'll do this by subclassing the ```tf.keras.callbacks.Callback``` class. Our custom callback will print the timestamp each time a training batch starts or ends. To achieve this, we'll override the ```on_train_batch_begin``` and ```on_train_batch_end``` methods to include our business logic, which just displays the timestamps.

In [None]:
my_custom_callback = MyCustomCallback()

model.fit(x_train, y_train, 
          batch_size=64, 
          epochs=1, 
          verbose=0, 
          callbacks=[my_custom_callback])

To use the custom callback that we've just defined, we'll create an instance of this class. Here we save the instance in a variable named ```my_custom_callback``` and, as usual, we train our model using ```model.fit```. The model's ```fit``` method has a ```callbacks``` parameter that takes a list of callbacks, so we pass in a list containing the instance stored in ```my_custom_callback```. This callback will then give us the timestamp for the beginning and the end of each batch.

Now let's look at something a little bit more complex.

In [None]:
class DetectorOverfittingCallback(tf.keras.callbacks.Callback):
    def __init__(self, threshold):
        super(DetectorOverfittingCallback, self).__init__()
        self.threshold = threshold
    
    def on_epoch_end(self, epoch, logs=None):
        ratio = logs['val_loss'] / logs['loss']
        print('Epoch: {}, Val/Train Loss ratio: {:.2f}'
              .format(epoch, ratio))
        
        if ratio > self.threshold:
            print('Stopping Training...')
            self.model.stop_training = True
            
model.fit(..., callbacks=[DetectorOverfittingCallback(threshold=1.3)])

Let's explore a callback where we measure the ratio between our validation loss and our training loss. If the ratio gets too high, we could have an overfitting scenario, because the validation loss may no longer be decreasing while the training loss continues to decrease, making the ratio of validation loss divided by training loss higher.

We should, in this case, stop training to avoid overfitting. We'll do this by defining a class called ```DetectorOverfittingCallback``` that subclasses the Keras ```Callback``` base class. We'll need an instance variable to hold the threshold, so we override the ```init``` function to take the threshold as a parameter and then store it in ```self.threshold```.

We can then implement the ```on_epoch_end``` method, overriding the base class. In this method we'll compute the ratio at the end of every epoch. If that ratio is higher than our threshold value, we stop training. To use this, we just specify ```DetectorOverfittingCallback``` when calling ```model.fit``` and pass in our desired threshold.

Notice that we're creating an instance of the class by invoking the ```DetectorOverfittingCallback``` constructor and placing it directly into the list that gets passed to the ```callbacks``` parameter. You could also have stored an instance of the callback class in a variable and passed in that variable, as you saw in the earlier example with ```my_custom_callback```.
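
One way to check a callback like this without running a full training loop is to attach it to a model and call its hook by hand with a made-up ```logs``` dictionary. The sketch below uses a hypothetical variant named ```RatioStopCallback``` that also skips the check when a loss is missing or zero; the name, the guard, and the numbers are assumptions for illustration.

```python
import tensorflow as tf

class RatioStopCallback(tf.keras.callbacks.Callback):
    """Hypothetical variant of the overfitting detector with a safety guard."""

    def __init__(self, threshold):
        super().__init__()
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        loss, val_loss = logs.get('loss'), logs.get('val_loss')
        if not loss or val_loss is None:
            return  # nothing to compare, or division by zero
        if val_loss / loss > self.threshold:
            self.model.stop_training = True

model = tf.keras.Sequential([tf.keras.Input(shape=(1,)),
                             tf.keras.layers.Dense(1)])
callback = RatioStopCallback(threshold=1.3)
callback.set_model(model)  # fit() normally does this for you

callback.on_epoch_end(0, logs={'loss': 0.50, 'val_loss': 0.55})  # ratio 1.1
stopped_early = bool(model.stop_training)
callback.on_epoch_end(1, logs={'loss': 0.40, 'val_loss': 0.60})  # ratio 1.5
stopped_late = bool(model.stop_training)
print(stopped_early, stopped_late)
```

Only the second call crosses the 1.3 threshold, so that is when ```stop_training``` is switched on.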

Let's look at another example of customizing a callback.
<img src='Images/customCBex.png'>

In this example, we'll train an MNIST classifier and define a custom callback called ```VisCallback```. At the end of every epoch, the custom callback generates a visualization of the classified outputs and saves that image. You can see an example of this saved image on the right, at the end of an epoch. A correctly classified example is marked with a green label below the prediction, whereas there's a red label below the predicted output when the prediction is incorrect. At the end of training, the visualized predictions are converted into an animated GIF.

In [None]:
class VisCallback(tf.keras.callbacks.Callback):
    def __init__(self, inputs, ground_truth, display_freq=10, n_samples=10):
        super(VisCallback, self).__init__()
        self.inputs = inputs
        self.ground_truth = ground_truth
        self.images = []
        self.display_freq = display_freq
        self.n_samples = n_samples
        
    def on_epoch_end(self, epoch, logs=None):
        # Randomly Sampled Data
        indexes = np.random.choice(len(self.inputs), size=self.n_samples)
        X_test, y_test = self.inputs[indexes], self.ground_truth[indexes]
        predictions = np.argmax(self.model.predict(X_test), axis=1)
        
        # Plot the digits
        # n is assumed to be the number of digits that display_digits draws
        display_digits(X_test, predictions, y_test, epoch, n=self.n_samples)
        
        # Save the Figure
        buf = io.BytesIO()
        plt.savefig(buf, format='png')
        buf.seek(0)
        image = Image.open(buf)
        self.images.append(np.array(image))
        
        # Display the digits every display_freq epochs
        if epoch % self.display_freq == 0:
            plt.show()
            
    def on_train_end(self, logs=None):
        imageio.mimsave('animation.gif', self.images, fps=1)

In [None]:
model.fit(..., callbacks=[VisCallback(x_test, y_test)])

First we'll define a custom callback called ```VisCallback```, subclassing ```tf.keras.callbacks.Callback```, and we'll give it a set of instance variables. ```self.inputs``` holds the inputs to run through the model. ```self.ground_truth``` holds the ground truth, or the true labels, for those input images, which allows us to compare each prediction with the actual label. ```self.images``` will hold the visualized comparisons of the predictions against the ground truth. ```self.display_freq``` allows you to choose how frequently you want to display these plots. For instance, you might decide to display a visual every 10 epochs instead of every epoch. ```self.n_samples``` then determines how many samples are plotted each time a visualization is generated.

At the end of every epoch we'll randomly sample data from the list of input images and then classify them. Accordingly, we'll mark the label predictions green or red based on whether they're classified correctly or not. We'll start by using ```np.random.choice```, passing it the number of inputs and the number of samples that we want. This gives us back a list of random indexes, and we can use these to pick images from our inputs completely at random. We then get our x values from the inputs using those randomly generated indexes, and the corresponding y values from the ground truth. To get our predictions, we use ```model.predict``` and pass it the ```X_test``` values. We use ```argmax``` to get the most likely class. Recall that ```argmax``` takes a list, finds the maximum value in that list, and returns the index position of that maximum value.
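
Here is a small NumPy sketch of those two steps on their own. The probability table and the stand-in ```inputs``` array are made up for illustration.

```python
import numpy as np

# Hypothetical softmax outputs for 3 images and 4 classes.
probabilities = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])
# axis=1 takes the argmax across the classes of each row.
predictions = np.argmax(probabilities, axis=1)
print(predictions)  # [1 0 3]

# Stand-in for a dataset of 10 items with 10 features each.
inputs = np.arange(100).reshape(10, 10)
indexes = np.random.choice(len(inputs), size=4)  # 4 random row indexes
batch = inputs[indexes]
print(indexes, batch.shape)
```

By default ```np.random.choice``` samples with replacement, so the same index can appear more than once; pass ```replace=False``` if you want distinct samples.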

Now that we have our predictions, we can pass our data to a display function to draw them with our plotting library. Then we save those plotted images in a list by writing the plot into a buffer, reading it back as an image, and appending it to the list called ```self.images```. If we want to show the plots occasionally, we can do so with the ```display_freq``` check, where, for example, if the frequency is 10, it will render every tenth epoch.

Finally, ```on_train_end``` shows how you can create an animated GIF that's composed of several saved images. The ```imageio``` library for Python contains a ```mimsave``` method that lets you specify an array of images and writes out an animated GIF of them. As you've kept a list of images in ```self.images```, this is now really easy to do.

To get this visualization, you specify this custom class in the callbacks when you call ```model.fit```. You pass in ```x_test``` and ```y_test``` so that the visualizations are created on the test set and not on the training set.

Here's one GIF that was generated by the callback, which you can expect at the end of your training. Remember that it picks a set of images at random each epoch, which is why it changes so much. But see if you can change it to show progress for a fixed set of images, so that at the beginning of training fewer of them are accurately classified, but by the end of training more of them are.

<img src='Images/out_gif.gif'>
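
One possible starting point for that exercise is to pick the indexes once, with a seeded random generator, and reuse them every epoch. The helper name ```pick_fixed_indexes``` and the seed are assumptions; this is a sketch of one approach, not the course's solution.

```python
import numpy as np

def pick_fixed_indexes(n_inputs, n_samples, seed=0):
    """Choose n_samples distinct indexes that are the same on every call."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_inputs, size=n_samples, replace=False)

# The same seed gives the same indexes, so every epoch shows the same digits.
first = pick_fixed_indexes(n_inputs=100, n_samples=10)
second = pick_fixed_indexes(n_inputs=100, n_samples=10)
print(first)
```

In ```VisCallback``` you could call this once in ```__init__```, store the result, and use it in ```on_epoch_end``` in place of the ```np.random.choice``` call.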

## References
**This course drew from the following resources:**


**Week 1:**
- [UCI Machine Learning Repository: Energy efficiency Data Set](https://archive.ics.uci.edu/ml/datasets/Energy+efficiency)
- [Learning a Similarity Metric Discriminatively, with Application to Face Verification (Chopra, Hadsell, LeCun, 2005)](http://yann.lecun.com/exdb/publis/pdf/chopra-05.pdf)
- [Similarity Learning with (or without) Convolutional Neural Network (Chatterjee & Luo, n.d.)](http://slazebni.cs.illinois.edu/spring17/lec09_similarity.pdf)
- [The Distance Between Two Vectors (Mathonline)](http://mathonline.wikidot.com/the-distance-between-two-vectors)

**Week 2:**
- [Huber Loss (Wikipedia)](https://en.wikipedia.org/wiki/Huber_loss)
- [Dimensionality Reduction by Learning an Invariant Mapping (Hadsell, Chopra, LeCun, 2006)](http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf)

**Week 4:**
- [Lecture on Residual Networks by Andrew Ng (part of Deep Learning Specialization, Course 4: Convolutional Neural Networks)](https://www.coursera.org/lecture/convolutional-neural-networks/resnets-HAhz9)

**Week 5:**
- [TensorBoard: TensorFlow's visualization toolkit](https://www.tensorflow.org/tensorboard)