# CAB420 DCNNs, Bonus Example: DNNs, Parameters and FLOPs
Dr Simon Denman (s.denman@qut.edu.au)

### What is a "Bonus" Example?

These are extra examples that cover content outside the scope of CAB420. They exist because of one or more of the following reasons:
* It's closely related to other stuff we're looking at and I wanted to include it, but the course has too much content already, so I punted it here; 
* It's interesting;
* Someone (probably multiple someones if I wrote an example) has asked a question about it before.

You can freely ignore this example if you want. You really don't have to be reading this. You could go outside, go read a book, have a nap, take up a hobby, whatever you want really. The point I want to make here is that **this example really is optional**. Things here won't appear on an exam, or in an assignment (though you could use this in an assignment if you wanted). But if you're interested, this is here, and if you're reading this, so are you. 

Some things to note with bonus examples:
* These may gloss over details that elsewhere get more coverage. I may skip plots I'd normally include, or skim over other details. The expectation is that if you're reading this, you've looked at all the "core" examples and are comfortable with what they're doing. 
* Some bits of code might not be as well explained or explored as you're used to in the other examples. These examples are here for interested students looking to extend their knowledge, and I'm assuming if you're here, you're comfortable figuring code out, debugging stuff, and generally googling about to help work out what something is doing.
* There's no Tl;DR section at the top. If you're here, I'm assuming it's because you're interested and want all the gory details and don't just need the quick summary at the top.
* While my regular examples (the "core" ones) certainly contain their fair share of silly remarks and typos, expect the level of flippancy and the prevalence of typos to increase in a bonus example. 

That said, as always, if you are struggling to follow what I've got in here please shoot me a message. The aim is still for this to be clear enough to follow after all.

## Overview

DCNNs are computationally expensive. They also result in lots of parameters. But from a computational demand standpoint, not all parameters are equal. 

This example has a little explore at how many FLOPs (floating point operations) are in a DCNN, and the cost of different types of layers, impact of input size, etc. To do this, we'll create a few networks, but:
* We're not going to load any data
* We're not going to train any networks

We're just going to create the network, and then run some analysis over it.

### Where does this fit into all the other CAB420 content?

This complements the other DCNN content. It doesn't tie directly into any other single example, but just probes a bit deeper into the computational demands of DCNNs.

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

import datetime
import numpy

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorboard import notebook

import sklearn
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
import matplotlib.pyplot as plt

tf.keras.backend.clear_session()

### Computing FLOPs

Tensorflow has a fairly serious profiler in it, but as best I can tell (at the time of writing anyway) this doesn't have a nice simple way to just pull out the total FLOPs. Luckily, other people have written such code, and I've grabbed the below from the [keras flops package](https://github.com/tokusumi/keras-flops/blob/master/keras_flops/flops_calculation.py).

In [2]:
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2_as_graph
from tensorflow.keras import Sequential, Model

def get_flops(model, batch_size = None):
    """
    Calculate FLOPS for tf.keras.Model or tf.keras.Sequential .
    Ignore operations used in only training mode such as Initialization.
    Use tf.profiler of tensorflow v1 api.
    """
    if not isinstance(model, (Sequential, Model)):
        raise KeyError(
            "model arguments must be tf.keras.Model or tf.keras.Sequential instance"
        )

    if batch_size is None:
        batch_size = 1

    # convert tf.keras model into frozen graph to count FLOPS about operations used at inference
    # FLOPS depends on batch size
    inputs = [
        tf.TensorSpec([batch_size] + inp.shape[1:], inp.dtype) for inp in model.inputs
    ]
    real_model = tf.function(model).get_concrete_function(inputs)
    frozen_func, _ = convert_variables_to_constants_v2_as_graph(real_model)

    # Calculate FLOPS with tf.profiler
    run_meta = tf.compat.v1.RunMetadata()
    opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
    flops = tf.compat.v1.profiler.profile(
        graph=frozen_func.graph, run_meta=run_meta, cmd="scope", options=opts
    )
    
    # TODO: show each FLOPS
    return flops.total_float_ops

## A Simple Network

Let's look at a simple network with just a few dense layers. For now, we'll pretend that we're using Fashion MNIST and so assume an input size of $28 \times 28$, or if we vectorise it (like we are going to pretend here) $784$.

In [3]:
# create an input, we need to specify the shape of the input, in this case it's a vectorised image of length 784
inputs = keras.Input(shape=(784,), name='img')
# first layer, a dense layer with 256 units, and a relu activation. This layer receives the 'inputs' layer as its input
x = layers.Dense(256, activation='relu')(inputs)
# second layer, another dense layer with 64 units, this layer receives the output of the previous layer, 'x', as its input
x = layers.Dense(64, activation='relu')(x)
# output layer, 10 units. This layer receives the output of the previous layer, 'x', as its input
outputs = layers.Dense(10, activation='softmax')(x)

# create the model, the model is a collection of inputs and outputs, in our case there is one of each
model = keras.Model(inputs=inputs, outputs=outputs, name='fashion_mnist_model')
# print a summary of the model
model.summary()

Model: "fashion_mnist_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 img (InputLayer)            [(None, 784)]             0         
                                                                 
 dense (Dense)               (None, 256)               200960    
                                                                 
 dense_1 (Dense)             (None, 64)                16448     
                                                                 
 dense_2 (Dense)             (None, 10)                650       
                                                                 
Total params: 218,058
Trainable params: 218,058
Non-trainable params: 0
_________________________________________________________________


We have $218,058$ parameters, most of this in the first dense layer.
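
As a quick sanity check, the parameter counts in the summary can be reproduced by hand. The sketch below assumes each dense layer holds a weight matrix of size inputs by outputs plus one bias per output; `dense_params` is just a helper name I've made up for this illustration.

```python
def dense_params(n_in, n_out):
    # weight matrix (n_in x n_out) plus one bias per output unit
    return n_in * n_out + n_out

# layer widths of the network above: input, two hidden layers, output
layer_sizes = [784, 256, 64, 10]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # one entry per dense layer
print(total)      # total trainable parameters
```

The three entries line up with the `Param #` column in the summary, and the first layer accounts for over 90% of the total.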

In [4]:
flops = get_flops(model, batch_size=1)

Instructions for updating:
Use `tf.compat.v1.graph_util.tensor_shape_from_node_def_name`

-max_depth                  10000
-min_bytes                  0
-min_peak_bytes             0
-min_residual_bytes         0
-min_output_bytes           0
-min_micros                 0
-min_accelerator_micros     0
-min_cpu_micros             0
-min_params                 0
-min_float_ops              1
-min_occurrence             0
-step                       -1
-order_by                   float_ops
-account_type_regexes       .*
-start_name_regexes         .*
-trim_name_regexes          
-show_name_regexes          .*
-hide_name_regexes          
-account_displayed_op_only  true
-select                     float_ops
-output                     stdout:


Doc:
scope: The nodes in the model graph are organized by their names, which is hierarchical like filesystem.
flops: Number of float operations. Note: Please read the implementation for the math behind it.

Profile:
node name | # float_ops
_TFProf

In running our analysis we get this little report above. This is not super intuitive, but does tell us how expensive every operation in the network is.

Let's start by considering a dense layer. A dense layer does the following:

$\hat{x} = x \times w + b$

where $x$ is the input, $w$ is the learned weight matrix, $b$ is the learned bias, and $\hat{x}$ is the layer output. Remember that $w$ is of size $\text{length}(x) \times \text{length}(\hat{x})$, and $b$ is of size $\text{length}(\hat{x})$. Looking at the first dense layer (just called `dense`), we have two operations associated with it:
```
  fashion_mnist_model/dense/MatMul (401.41k/401.41k flops)
  fashion_mnist_model/dense/BiasAdd (256/256 flops)
```

The first of these, the `MatMul`, is a matrix multiplication, so is $x \times w$. We can see that this is a fairly big operation, which makes sense as $w$ is going to be $784 \times 256$. Subsequent matrix multiplications are much lighter weight, as the weight matrices are much smaller. The second is the `BiasAdd`, and is simply the $+ b$ part of the equation. Unsurprisingly, this element wise addition is pretty light weight.
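
To see where those two numbers come from, here is a rough sketch of the arithmetic. It assumes every multiply and every add counts as one floating point operation, which is consistent with the `MatMul` and `BiasAdd` figures reported above. `dense_flops` is a hypothetical helper, not part of the profiler.

```python
def dense_flops(n_in, n_out, batch_size=1):
    # each output needs n_in multiplies and n_in adds, so 2 * n_in * n_out per sample
    matmul = 2 * batch_size * n_in * n_out
    # one add per output unit for the bias
    bias_add = batch_size * n_out
    return matmul, bias_add

for n_in, n_out in [(784, 256), (256, 64), (64, 10)]:
    matmul, bias_add = dense_flops(n_in, n_out)
    print(f"{n_in:>4} -> {n_out:<4} MatMul: {matmul:>7,}  BiasAdd: {bias_add:>4,}")
```

The first line gives 401,408 and 256, matching the `401.41k` and `256` entries for the first dense layer.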

In [5]:
print(f"{flops / 10 ** 6:.03} Million Floating Point Operations (FLOPs)")

0.436 Million Floating Point Operations (FLOPs)


Overall, we have a bit under half a million FLOPs.

### A DCNN

Let's switch now to a DCNN. 

In [6]:
# our input now has a different shape, 28x28x1, as we have 28x28 single channel images
inputs = keras.Input(shape=(28, 28, 1, ), name='img')
# rather than use a fully connected layer, we'll use 2D convolutional layers, 8 filters, 3x3 size kernels
x = layers.Conv2D(filters=8, kernel_size=(3,3), activation='relu')(inputs)
# 2x2 max pooling, this will downsample the image by a factor of two
x = layers.MaxPool2D(pool_size=(2, 2))(x)
# more convolution, 16 filters, followed by max pool
x = layers.Conv2D(filters=16, kernel_size=(3,3), activation='relu')(x)
x = layers.MaxPool2D(pool_size=(2, 2))(x)
# final convolution, 32 filters
x = layers.Conv2D(filters=32, kernel_size=(3,3), activation='relu')(x)
# a flatten layer. Matlab does a flatten automatically, here we need to explicitly do this. Basically we're telling
# keras to make the current network state into a 1D shape so we can pass it into a fully connected layer
x = layers.Flatten()(x)
# a single fully connected layer, 64 units
x = layers.Dense(64, activation='relu')(x)
# and now our output, same as last time
outputs = layers.Dense(10, activation='softmax')(x)

# build the model, and print the summary
model_cnn = keras.Model(inputs=inputs, outputs=outputs, name='fashion_mnist_cnn_model')
model_cnn.summary()

Model: "fashion_mnist_cnn_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 img (InputLayer)            [(None, 28, 28, 1)]       0         
                                                                 
 conv2d (Conv2D)             (None, 26, 26, 8)         80        
                                                                 
 max_pooling2d (MaxPooling2D  (None, 13, 13, 8)        0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 11, 11, 16)        1168      
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 5, 5, 16)         0         
 2D)                                                             
                                                                 
 conv2d_2 (Conv2D)           (None, 3, 3, 3

Parameter wise, we have gone from ~220,000 to ~25,000, but the FLOPs have not seen the same reduction.

In [7]:
flops = get_flops(model_cnn, batch_size=1)


-max_depth                  10000
-min_bytes                  0
-min_peak_bytes             0
-min_residual_bytes         0
-min_output_bytes           0
-min_micros                 0
-min_accelerator_micros     0
-min_cpu_micros             0
-min_params                 0
-min_float_ops              1
-min_occurrence             0
-step                       -1
-order_by                   float_ops
-account_type_regexes       .*
-start_name_regexes         .*
-trim_name_regexes          
-show_name_regexes          .*
-hide_name_regexes          
-account_displayed_op_only  true
-select                     float_ops
-output                     stdout:


Doc:
scope: The nodes in the model graph are organized by their names, which is hierarchical like filesystem.
flops: Number of float operations. Note: Please read the implementation for the math behind it.

Profile:
node name | # float_ops
_TFProfRoot (--/511.98k flops)
  fashion_mnist_cnn_model/conv2d_1/Conv2D (278.78k/278.78k flops)

This output is a bit harder to unpick, but let's look at the first convolution operation. This has the following entries:
```
  fashion_mnist_cnn_model/conv2d/Conv2D (97.34k/97.34k flops)
  fashion_mnist_cnn_model/conv2d/BiasAdd (5.41k/5.41k flops)
```
We have the main convolution operation, and the bias operation. This first convolution has $80$ parameters, yet it has about $100,000$ FLOPs worth of compute associated with it. Why? Going back to the maths,

$\hat{x} = x * w + b$

which looks a lot like our dense layer. Except, with convolution we're not just multiplying one thing by another. Instead, we are taking our $w$, which is generally (and certainly in our case) smaller than our $x$, and sliding this across $x$, evaluating the sum of the element wise product at each location. This leads to a lot of compute from a small number of parameters.
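
The sliding window idea can be sketched with a deliberately naive, pure Python convolution that counts its own operations. This is only an illustration (it is nothing like how TensorFlow actually implements convolution), it assumes no padding and a stride of 1 as in the network above, and the function name and random inputs are made up for the example.

```python
import random

def naive_conv2d(image, kernels):
    """Slide each k x k kernel over a single channel image, counting multiplies and adds."""
    height, width = len(image), len(image[0])
    k = len(kernels[0])
    out_h, out_w = height - k + 1, width - k + 1  # no padding, stride 1
    flops = 0
    feature_maps = []
    for kernel in kernels:
        fmap = [[0.0] * out_w for _ in range(out_h)]
        for i in range(out_h):
            for j in range(out_w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        acc += image[i + di][j + dj] * kernel[di][dj]
                        flops += 2  # one multiply and one add
                fmap[i][j] = acc
        feature_maps.append(fmap)
    return feature_maps, flops

random.seed(0)
image = [[random.random() for _ in range(28)] for _ in range(28)]
kernels = [[[random.random() for _ in range(3)] for _ in range(3)] for _ in range(8)]

feature_maps, flops = naive_conv2d(image, kernels)
n_weights = sum(len(row) for kernel in kernels for row in kernel)
print(f"{n_weights} weights, {flops:,} operations")
```

Just 72 weights (80 parameters once the 8 biases are included) get reused at each of the $26 \times 26$ output locations, which is where the 97,344 operations come from.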

The fact that our convolutions operate across multiple channels also means that we don't see the same decrease in FLOPs as we go deeper. Consider our second convolution operation, this has the following:
```
  fashion_mnist_cnn_model/conv2d_1/Conv2D (278.78k/278.78k flops)
  fashion_mnist_cnn_model/conv2d_1/BiasAdd (1.94k/1.94k flops)
```
We have roughly three times the FLOPs in our second convolution compared to our first. This is despite the spatial resolution being reduced by the max pooling in the middle. The increase in the channel depth of the representation (each of the 16 filters now spans all 8 input channels) leads to this increase. By the time we get to the next convolution operation, the number of operations is starting to reduce again. 
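
The same trend can be traced through all three convolutions with a bit of arithmetic. This sketch assumes $3 \times 3$ kernels with no padding (so each convolution trims 2 pixels from each dimension), $2 \times 2$ max pooling that halves and rounds down, and 2 operations per multiply-add; `conv_cost` is a hypothetical helper.

```python
def conv_cost(size_in, c_in, c_out, kernel=3):
    size_out = size_in - kernel + 1               # 'valid' convolution
    weights = kernel * kernel * c_in * c_out
    params = weights + c_out                      # plus one bias per filter
    flops = 2 * size_out * size_out * weights     # multiply-adds at every output location
    return size_out, params, flops

size, channels = 28, 1
report = []
for index, filters in enumerate([8, 16, 32]):
    size, params, flops = conv_cost(size, channels, filters)
    report.append((filters, size, params, flops))
    channels = filters
    if index < 2:          # max pooling follows the first two convolutions only
        size //= 2

for filters, size, params, flops in report:
    print(f"{filters:>2} filters: {size}x{size} output, {params:>5,} params, {flops:>7,} FLOPs")
```

The FLOPs go 97,344, then 278,784, then 82,944, while the parameters keep climbing (80, 1,168, 4,640). Compute depends on both the channel depth and how many locations the kernel is applied at, whereas the parameter count only depends on the former.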

In [8]:
print(f"{flops / 10 ** 6:.03} Million Floating Point Operations (FLOPs)")

0.512 Million Floating Point Operations (FLOPs)


All up, we have just over half a million FLOPs. We have a bit over 10% of the parameters, but see a slight increase in computational requirements. As mentioned above, not all parameters are created equal.

### Bigger Inputs

Let's repeat the above, but we'll increase the input sizes. We'll leave everything else the same. Let's assume that we've now got $50 \times 50$ input images. 

#### Simple Dense Network

In [9]:
# create an input, we need to specify the shape of the input, in this case it's a vectorised image of length 2500
inputs = keras.Input(shape=(2500,), name='img')
# first layer, a dense layer with 256 units, and a relu activation. This layer receives the 'inputs' layer as its input
x = layers.Dense(256, activation='relu')(inputs)
# second layer, another dense layer with 64 units, this layer receives the output of the previous layer, 'x', as its input
x = layers.Dense(64, activation='relu')(x)
# output layer, 10 units. This layer receives the output of the previous layer, 'x', as its input
outputs = layers.Dense(10, activation='softmax')(x)

# create the model, the model is a collection of inputs and outputs, in our case there is one of each
model = keras.Model(inputs=inputs, outputs=outputs, name='fashion_mnist_model')
# print a summary of the model
model.summary()

Model: "fashion_mnist_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 img (InputLayer)            [(None, 2500)]            0         
                                                                 
 dense_5 (Dense)             (None, 256)               640256    
                                                                 
 dense_6 (Dense)             (None, 64)                16448     
                                                                 
 dense_7 (Dense)             (None, 10)                650       
                                                                 
Total params: 657,354
Trainable params: 657,354
Non-trainable params: 0
_________________________________________________________________


In [10]:
flops = get_flops(model, batch_size=1)


-max_depth                  10000
-min_bytes                  0
-min_peak_bytes             0
-min_residual_bytes         0
-min_output_bytes           0
-min_micros                 0
-min_accelerator_micros     0
-min_cpu_micros             0
-min_params                 0
-min_float_ops              1
-min_occurrence             0
-step                       -1
-order_by                   float_ops
-account_type_regexes       .*
-start_name_regexes         .*
-trim_name_regexes          
-show_name_regexes          .*
-hide_name_regexes          
-account_displayed_op_only  true
-select                     float_ops
-output                     stdout:


Doc:
scope: The nodes in the model graph are organized by their names, which is hierarchical like filesystem.
flops: Number of float operations. Note: Please read the implementation for the math behind it.

Profile:
node name | # float_ops
_TFProfRoot (--/1.31m flops)
  fashion_mnist_model/dense_5/MatMul (1.28m/1.28m flops)
  fashion_

In [11]:
print(f"{flops / 10 ** 6:.03} Million Floating Point Operations (FLOPs)")

1.31 Million Floating Point Operations (FLOPs)


The number of parameters has tripled, though only the number of parameters in the first dense layer has changed. Looking at the FLOPs, overall we've also gone up by a factor of roughly 3 (from 0.436 to 1.31 million), though again all the increase has happened in the first dense layer, and only in the matrix multiplication component. Remember, this is doing

$\hat{x} = x \times w + b$.

$w$ depends on the size of the input (which got bigger), and the output (which is unchanged). $b$ only depends on the size of the output. So our bias operation is unchanged, and our $w$ operation blows out. After this layer, we're back to the same size representation as we had before, so everything is unchanged.
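
Putting rough numbers on this, the sketch below totals parameters and FLOPs for the two dense networks. It only counts the matrix multiplications and bias additions (activations are ignored), assumes 2 operations per multiply-add, and `dense_stack_cost` is a name invented for the example.

```python
def dense_stack_cost(layer_sizes):
    params = flops = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        params += n_in * n_out + n_out        # weights plus biases
        flops += 2 * n_in * n_out + n_out     # MatMul plus BiasAdd
    return params, flops

small = dense_stack_cost([784, 256, 64, 10])    # 28x28 input, vectorised
large = dense_stack_cost([2500, 256, 64, 10])   # 50x50 input, vectorised

print(f"params: {small[0]:,} -> {large[0]:,} ({large[0] / small[0]:.2f}x)")
print(f"FLOPs:  {small[1]:,} -> {large[1]:,} ({large[1] / small[1]:.2f}x)")
```

Both ratios come out at about 3, a little under the $2500 / 784 \approx 3.19$ growth in the input, as the later layers are untouched.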

#### Simple DCNN 

In [12]:
# pretending that we have 50x50 inputs
inputs = keras.Input(shape=(50, 50, 1, ), name='img')
# rather than use a fully connected layer, we'll use 2D convolutional layers, 8 filters, 3x3 size kernels
x = layers.Conv2D(filters=8, kernel_size=(3,3), activation='relu')(inputs)
# 2x2 max pooling, this will downsample the image by a factor of two
x = layers.MaxPool2D(pool_size=(2, 2))(x)
# more convolution, 16 filters, followed by max pool
x = layers.Conv2D(filters=16, kernel_size=(3,3), activation='relu')(x)
x = layers.MaxPool2D(pool_size=(2, 2))(x)
# final convolution, 32 filters
x = layers.Conv2D(filters=32, kernel_size=(3,3), activation='relu')(x)
# a flatten layer. Matlab does a flatten automatically, here we need to explicitly do this. Basically we're telling
# keras to make the current network state into a 1D shape so we can pass it into a fully connected layer
x = layers.Flatten()(x)
# a single fully connected layer, 64 units
x = layers.Dense(64, activation='relu')(x)
# and now our output, same as last time
outputs = layers.Dense(10, activation='softmax')(x)

# build the model, and print the summary
model_cnn = keras.Model(inputs=inputs, outputs=outputs, name='fashion_mnist_cnn_model')
model_cnn.summary()

Model: "fashion_mnist_cnn_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 img (InputLayer)            [(None, 50, 50, 1)]       0         
                                                                 
 conv2d_3 (Conv2D)           (None, 48, 48, 8)         80        
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 24, 24, 8)        0         
 2D)                                                             
                                                                 
 conv2d_4 (Conv2D)           (None, 22, 22, 16)        1168      
                                                                 
 max_pooling2d_3 (MaxPooling  (None, 11, 11, 16)       0         
 2D)                                                             
                                                                 
 conv2d_5 (Conv2D)           (None, 9, 9, 3

In [13]:
flops = get_flops(model_cnn, batch_size=1)


-max_depth                  10000
-min_bytes                  0
-min_peak_bytes             0
-min_residual_bytes         0
-min_output_bytes           0
-min_micros                 0
-min_accelerator_micros     0
-min_cpu_micros             0
-min_params                 0
-min_float_ops              1
-min_occurrence             0
-step                       -1
-order_by                   float_ops
-account_type_regexes       .*
-start_name_regexes         .*
-trim_name_regexes          
-show_name_regexes          .*
-hide_name_regexes          
-account_displayed_op_only  true
-select                     float_ops
-output                     stdout:


Doc:
scope: The nodes in the model graph are organized by their names, which is hierarchical like filesystem.
flops: Number of float operations. Note: Please read the implementation for the math behind it.

Profile:
node name | # float_ops
_TFProfRoot (--/2.58m flops)
  fashion_mnist_cnn_model/conv2d_4/Conv2D (1.12m/1.12m flops)
  fas

In [14]:
print(f"{flops / 10 ** 6:.03} Million Floating Point Operations (FLOPs)")

2.58 Million Floating Point Operations (FLOPs)


Here, we've got a bit of a blow out. Our parameters increased by a factor of about $7$, and our FLOPs have gone up by a factor of about five. Parameter wise, our only change is in the dense layer after our third convolution, which has gone right up (all but about 7,000 network parameters are in this layer). The FLOPs though have gone through the roof for the convolution operations.

Again, let's consider what's going on. Our convolution operations are applying the convolution filter with a sliding window. A larger input means we apply the convolution at more locations. Compared to what we started with, we've gone from applying our convolution kernel at roughly $784$ locations (one per pixel) to roughly $2500$ locations (strictly, with no padding the first layer's output is $26 \times 26 = 676$ versus $48 \times 48 = 2304$ locations, but the story is the same). We see the FLOPs go up accordingly. As subsequent convolution layers are now also operating over representations that are larger spatially, we see a similar increase. 

When we get to the dense layer, the larger spatial representation means that when we flatten things, we get a larger matrix multiplication operation. Hence our weight matrix for this operation gets big, and we see the parameters increase massively, though this is not where the bulk of the computational cost comes from.
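
To make the flatten effect concrete, this sketch follows the spatial size through the convolution and pooling stack and works out how long the flattened vector is, and so how many parameters the 64 unit dense layer needs. It assumes the same stack as above ($3 \times 3$ kernels, no padding, $2 \times 2$ pooling after all but the last convolution); `flatten_length` is a made up helper.

```python
def flatten_length(input_size, filters=(8, 16, 32), kernel=3, pool=2):
    size = input_size
    for index in range(len(filters)):
        size = size - kernel + 1           # 'valid' convolution trims kernel - 1 pixels
        if index < len(filters) - 1:       # the final convolution is not followed by pooling
            size //= pool
    return size * size * filters[-1]

dense_units = 64
results = {}
for input_size in (28, 50):
    length = flatten_length(input_size)
    results[input_size] = (length, length * dense_units + dense_units)
    print(f"{input_size}x{input_size} input: flatten length {length:,}, "
          f"dense parameters {results[input_size][1]:,}")
```

The flattened vector grows from 288 to 2,592 values, so the dense layer goes from 18,496 to 165,952 parameters.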

#### Adding an Extra Convolution + MaxPool

Let's add another convolution and maxpool and see what happens. This will obviously add another convolution operation, but will also make the input to the dense layer much smaller.

In [15]:
# pretending that we have 50x50 inputs
inputs = keras.Input(shape=(50, 50, 1, ), name='img')
# rather than use a fully connected layer, we'll use 2D convolutional layers, 8 filters, 3x3 size kernels
x = layers.Conv2D(filters=8, kernel_size=(3,3), activation='relu')(inputs)
# 2x2 max pooling, this will downsample the image by a factor of two
x = layers.MaxPool2D(pool_size=(2, 2))(x)
# more convolution, 16 filters, followed by max pool
x = layers.Conv2D(filters=16, kernel_size=(3,3), activation='relu')(x)
x = layers.MaxPool2D(pool_size=(2, 2))(x)
# third convolution, 32 filters
x = layers.Conv2D(filters=32, kernel_size=(3,3), activation='relu')(x)
x = layers.MaxPool2D(pool_size=(2, 2))(x)
# fourth convolution
x = layers.Conv2D(filters=64, kernel_size=(3,3), activation='relu')(x)
# a flatten layer. Matlab does a flatten automatically, here we need to explicitly do this. Basically we're telling
# keras to make the current network state into a 1D shape so we can pass it into a fully connected layer
x = layers.Flatten()(x)
# a single fully connected layer, 64 units
x = layers.Dense(64, activation='relu')(x)
# and now our output, same as last time
outputs = layers.Dense(10, activation='softmax')(x)

# build the model, and print the summary
model_cnn = keras.Model(inputs=inputs, outputs=outputs, name='fashion_mnist_cnn_model')
model_cnn.summary()

Model: "fashion_mnist_cnn_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 img (InputLayer)            [(None, 50, 50, 1)]       0         
                                                                 
 conv2d_6 (Conv2D)           (None, 48, 48, 8)         80        
                                                                 
 max_pooling2d_4 (MaxPooling  (None, 24, 24, 8)        0         
 2D)                                                             
                                                                 
 conv2d_7 (Conv2D)           (None, 22, 22, 16)        1168      
                                                                 
 max_pooling2d_5 (MaxPooling  (None, 11, 11, 16)       0         
 2D)                                                             
                                                                 
 conv2d_8 (Conv2D)           (None, 9, 9, 3

In [16]:
flops = get_flops(model_cnn, batch_size=1)


-max_depth                  10000
-min_bytes                  0
-min_peak_bytes             0
-min_residual_bytes         0
-min_output_bytes           0
-min_micros                 0
-min_accelerator_micros     0
-min_cpu_micros             0
-min_params                 0
-min_float_ops              1
-min_occurrence             0
-step                       -1
-order_by                   float_ops
-account_type_regexes       .*
-start_name_regexes         .*
-trim_name_regexes          
-show_name_regexes          .*
-hide_name_regexes          
-account_displayed_op_only  true
-select                     float_ops
-output                     stdout:


Doc:
scope: The nodes in the model graph are organized by their names, which is hierarchical like filesystem.
flops: Number of float operations. Note: Please read the implementation for the math behind it.

Profile:
node name | # float_ops
_TFProfRoot (--/2.43m flops)
  fashion_mnist_cnn_model/conv2d_7/Conv2D (1.12m/1.12m flops)
  fas

In [17]:
print(f"{flops / 10 ** 6:.03} Million Floating Point Operations (FLOPs)")

2.43 Million Floating Point Operations (FLOPs)


This has a perhaps unexpected impact. We have made our network deeper, yet greatly reduced the number of parameters, and reduced the number of FLOPs (slightly). Even with adding a 64 filter convolution layer here, I appear to have made the network simpler. 
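
A rough tally shows where the saving comes from. The sketch below walks a list of layer descriptions and totals parameters and FLOPs, counting only multiply-adds and bias additions, so its FLOP totals should come out a little under the profiler's (which also counts things like pooling). The layer spec format and `network_cost` are invented for this example, and a single channel input is assumed.

```python
def network_cost(input_size, layers, kernel=3):
    """Return (params, flops) for specs ("conv", filters), ("pool",), ("flatten",), ("dense", units)."""
    size, channels = input_size, 1
    features = None            # length of the vector once we reach the dense part
    params = flops = 0
    for spec in layers:
        kind = spec[0]
        if kind == "conv":
            filters = spec[1]
            size = size - kernel + 1                      # no padding
            weights = kernel * kernel * channels * filters
            params += weights + filters
            flops += 2 * size * size * weights + size * size * filters
            channels = filters
        elif kind == "pool":
            size //= 2                                    # 2x2 max pooling
        elif kind == "flatten":
            features = size * size * channels
        elif kind == "dense":
            units = spec[1]
            params += features * units + units
            flops += 2 * features * units + units
            features = units
    return params, flops

head = [("flatten",), ("dense", 64), ("dense", 10)]
three_conv = [("conv", 8), ("pool",), ("conv", 16), ("pool",), ("conv", 32)] + head
four_conv = [("conv", 8), ("pool",), ("conv", 16), ("pool",), ("conv", 32),
             ("pool",), ("conv", 64)] + head

for name, layers in [("three conv", three_conv), ("four conv", four_conv)]:
    params, flops = network_cost(50, layers)
    print(f"{name}: {params:,} params, {flops / 10 ** 6:.3} million FLOPs")
```

The extra convolution adds 18,496 parameters of its own, but the dense layer that follows drops from 165,952 parameters to 16,448 as its input shrinks from 2,592 values to 256.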

The issue here is how these layers interact with each other. With deep nets, the output of one layer is the input to the next, so by manipulating sizes like this we can impact how efficient later parts of a network are. Does this mean that deepening a network is a path towards simplification? Not really, or at least not in most cases. This one here is a bit of an edge case. One way to have perhaps a larger impact is via global average pooling. Replacing my fourth convolution layer with this, I get the following.

In [18]:
# pretending that we have 50x50 inputs
inputs = keras.Input(shape=(50, 50, 1, ), name='img')
# rather than use a fully connected layer, we'll use 2D convolutional layers, 8 filters, 3x3 size kernels
x = layers.Conv2D(filters=8, kernel_size=(3,3), activation='relu')(inputs)
# 2x2 max pooling, this will downsample the image by a factor of two
x = layers.MaxPool2D(pool_size=(2, 2))(x)
# more convolution, 16 filters, followed by max pool
x = layers.Conv2D(filters=16, kernel_size=(3,3), activation='relu')(x)
x = layers.MaxPool2D(pool_size=(2, 2))(x)
# third convolution, 32 filters
x = layers.Conv2D(filters=32, kernel_size=(3,3), activation='relu')(x)
# global average pooling. This averages each channel over its spatial extent, leaving one value per channel. The
# result is already 1D, so the flatten that follows doesn't change anything and is just kept for consistency
x = layers.GlobalAveragePooling2D()(x)
x = layers.Flatten()(x)
# a single fully connected layer, 64 units
x = layers.Dense(64, activation='relu')(x)
# and now our output, same as last time
outputs = layers.Dense(10, activation='softmax')(x)

# build the model, and print the summary
model_cnn = keras.Model(inputs=inputs, outputs=outputs, name='fashion_mnist_cnn_model')
model_cnn.summary()

Model: "fashion_mnist_cnn_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 img (InputLayer)            [(None, 50, 50, 1)]       0         
                                                                 
 conv2d_10 (Conv2D)          (None, 48, 48, 8)         80        
                                                                 
 max_pooling2d_7 (MaxPooling  (None, 24, 24, 8)        0         
 2D)                                                             
                                                                 
 conv2d_11 (Conv2D)          (None, 22, 22, 16)        1168      
                                                                 
 max_pooling2d_8 (MaxPooling  (None, 11, 11, 16)       0         
 2D)                                                             
                                                                 
 conv2d_12 (Conv2D)          (None, 9, 9, 3

In [19]:
flops = get_flops(model_cnn, batch_size=1)


-max_depth                  10000
-min_bytes                  0
-min_peak_bytes             0
-min_residual_bytes         0
-min_output_bytes           0
-min_micros                 0
-min_accelerator_micros     0
-min_cpu_micros             0
-min_params                 0
-min_float_ops              1
-min_occurrence             0
-step                       -1
-order_by                   float_ops
-account_type_regexes       .*
-start_name_regexes         .*
-trim_name_regexes          
-show_name_regexes          .*
-hide_name_regexes          
-account_displayed_op_only  true
-select                     float_ops
-output                     stdout:


Doc:
scope: The nodes in the model graph are organized by their names, which is hierarchical like filesystem.
flops: Number of float operations. Note: Please read the implementation for the math behind it.

Profile:
node name | # float_ops
_TFProfRoot (--/2.26m flops)
  fashion_mnist_cnn_model/conv2d_11/Conv2D (1.12m/1.12m flops)
  fa

In [20]:
print(f"{flops / 10 ** 6:.03} Million Floating Point Operations (FLOPs)")

2.26 Million Floating Point Operations (FLOPs)


Here, I've greatly reduced my parameters, but not changed the FLOPs much. This change only impacts the computational demands of that dense layer. I've made the input to that layer much more compact and saved a heap of parameters, but not altered the FLOPs much as most of these lie in the convolution layers.
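
As a back of the envelope comparison, the sketch below looks only at the 64 unit dense layer, fed either by a flatten of the $9 \times 9 \times 32$ feature map or by a global average pool over it. It assumes 2 operations per multiply-add and ignores the (small) cost of the averaging itself; `dense_cost` is a hypothetical helper.

```python
def dense_cost(n_in, n_out):
    params = n_in * n_out + n_out
    matmul_flops = 2 * n_in * n_out
    return params, matmul_flops

height, width, channels = 9, 9, 32   # output of the third convolution for a 50x50 input
dense_units = 64

flatten_cost = dense_cost(height * width * channels, dense_units)  # every value kept
gap_cost = dense_cost(channels, dense_units)                        # one value per channel

print(f"flatten: {flatten_cost[0]:,} params, {flatten_cost[1]:,} FLOPs")
print(f"GAP:     {gap_cost[0]:,} params, {gap_cost[1]:,} FLOPs")
```

The dense layer shrinks from 165,952 parameters to 2,112, but the roughly 330,000 FLOPs saved are small next to the 2 million or so sitting in the convolutions.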

## Final Thoughts

Not all parameters are created equal. Convolutional layers will typically incur much greater compute than dense layers for a given number of parameters, as each kernel weight is reused at every location in the input. They do at least lend themselves very well to parallelisation though, so at least we have that in our favour.

There are a lot of other things that you can do to impact the compute demands. Some other simple things that you might want to play with include:
* Using different strides for the convolution filters, so you don't convolve with every pixel in the input
* Using different pooling sizes in max-pooling, such as (4x4), to more rapidly downscale the network's internal representation
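
As a starting point for playing with those two ideas, here is a small sketch of how stride and pooling size change the output size, and with it the convolution FLOPs. It assumes no padding, so the output size is $\lfloor (n - k) / s \rfloor + 1$ for input size $n$, kernel (or pool) size $k$ and stride $s$, and it assumes the pooling stride equals the pool size. The helper names are made up for the example.

```python
def output_size(size, kernel, stride=1):
    # no padding: floor((size - kernel) / stride) + 1
    return (size - kernel) // stride + 1

def conv_flops(size, c_in, c_out, kernel=3, stride=1):
    out = output_size(size, kernel, stride)
    return 2 * out * out * c_out * kernel * kernel * c_in

# first convolution on a 50x50 single channel input, 8 filters
stride_1 = conv_flops(50, 1, 8, stride=1)
stride_2 = conv_flops(50, 1, 8, stride=2)
print(f"first conv, stride 1: {stride_1:,} FLOPs; stride 2: {stride_2:,} FLOPs")

# second convolution (8 -> 16 channels) after pooling the 48x48 map with a 2x2 or 4x4 window
for pool in (2, 4):
    pooled = output_size(48, pool, stride=pool)
    print(f"{pool}x{pool} pool: {pooled}x{pooled} map, "
          f"second conv {conv_flops(pooled, 8, 16):,} FLOPs")
```

A stride of 2 cuts the first convolution's FLOPs to a quarter, and swapping the first pool for a 4x4 one cuts the second convolution's FLOPs by a factor of nearly five. Whether the network still learns anything useful after throwing away that much resolution is a separate question.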