# Theory

This notebook discusses theoretical aspects of the network.

## Separable Convolutions

The bottleneck and downsampling blocks use separable convolutions
to reduce memory and compute. A separable convolution splits a
standard CNN-style 2D convolution into two separate convolutions,
one for spatial mixing and one for channel mixing.
In the literature, the spatial-mixing step is usually called a depthwise
convolution and the channel-mixing step a pointwise convolution.
For the rest of this document, the subscripts $d$ and $p$ denote
depthwise and pointwise CNN-style 2D convolutions respectively.

Let

$$
\begin{align}
    N_i &= \text{number of input channels} \\
    N_o &= \text{number of output channels (filters)} \\
    L_r &= \text{input rows} \\
    L_c &= \text{input columns} \\
    F_r &= \text{filter rows} \\
    F_c &= \text{filter columns}
\end{align}
$$

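For a standard CNN-style 2D convolution (assuming `padding=same` and unit stride, so the output has $L_r \times L_c$ positions, and ignoring biases), the weight count and the number of multiply-accumulate operations (MACs) are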

$$
\begin{align}
    \text{weights} &= N_o N_i F_r F_c \\
    \text{MACs} &= N_o N_i F_r F_c L_r L_c
\end{align}
$$
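
As a quick sanity check on these counts, the following minimal Python sketch evaluates both formulas. The function name `standard_conv_cost` and the example layer sizes are illustrative assumptions, not values taken from the network. Biases are ignored and unit stride is assumed, so the output has as many positions as the input.

```python
def standard_conv_cost(n_o, n_i, f_r, f_c, l_r, l_c):
    """Weights and MACs of a standard 2D convolution.

    Assumes same padding, stride 1 and no bias term.
    """
    # Each of the N_o filters spans all N_i channels with an F_r x F_c window.
    weights = n_o * n_i * f_r * f_c
    # Every weight is used once per output position, and with same padding
    # and stride 1 there are L_r * L_c output positions.
    macs = weights * l_r * l_c
    return weights, macs


# Hypothetical layer: N_i = 32, N_o = 64, 3x3 kernel, 56x56 feature map.
weights, macs = standard_conv_cost(n_o=64, n_i=32, f_r=3, f_c=3, l_r=56, l_c=56)
print(f"weights = {weights:,}, MACs = {macs:,}")
# weights = 18,432, MACs = 57,802,752
```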

By splitting the standard convolution into depthwise (spatial) and pointwise
(channel) components, we instead have two separate convolutions with

$$
\begin{align}
    \text{weights}_d &= N_i F_r F_c
    &
    \text{weights}_p &= N_o N_i
    \\
    \text{MACs}_d &= N_i F_r F_c L_r L_c
    &
    \text{MACs}_p &= N_o N_i L_r L_c
\end{align}
$$

Summing the two convolutions, the total weights and MACs are

$$
\begin{align}
    \text{weights}_{tot} &= N_i (N_o + F_r F_c) \\
    \text{MACs}_{tot} &= N_i L_r L_c (N_o + F_r F_c)
\end{align}
$$
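
The depthwise and pointwise counts can be tallied in a few lines of Python. This is only a sketch under the same assumptions as the formulas (same padding, unit stride, no biases); the function name `separable_conv_cost` and the example layer sizes are hypothetical.

```python
def separable_conv_cost(n_o, n_i, f_r, f_c, l_r, l_c):
    """Total weights and MACs of a depthwise + pointwise convolution pair.

    Assumes same padding, stride 1 and no bias terms.
    """
    positions = l_r * l_c  # output positions per channel

    # Depthwise: one F_r x F_c spatial filter for each of the N_i channels.
    weights_d = n_i * f_r * f_c
    macs_d = weights_d * positions

    # Pointwise: a 1x1 convolution mixing N_i channels into N_o channels.
    weights_p = n_o * n_i
    macs_p = weights_p * positions

    return weights_d + weights_p, macs_d + macs_p


# Hypothetical layer: N_i = 32, N_o = 64, 3x3 kernel, 56x56 feature map.
weights_tot, macs_tot = separable_conv_cost(n_o=64, n_i=32, f_r=3, f_c=3, l_r=56, l_c=56)

# The sums agree with the factored closed forms N_i (N_o + F_r F_c)
# and N_i L_r L_c (N_o + F_r F_c).
assert weights_tot == 32 * (64 + 3 * 3)
assert macs_tot == 32 * 56 * 56 * (64 + 3 * 3)

print(f"weights_tot = {weights_tot:,}, MACs_tot = {macs_tot:,}")
# weights_tot = 2,336, MACs_tot = 7,325,696
```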

Finally, for the separable form to cost no more than the standard
convolution in memory / MACs, we must satisfy the following inequalities
(dividing through by $N_i$ and $N_i L_r L_c$ respectively reduces both
to the same inequality)

$$
\begin{align}
    & \text{Memory} & & \text{MACs}
    \\
    N_o N_i F_r F_c &\geq N_i (N_o + F_r F_c)
    &
    N_o N_i F_r F_c L_r L_c &\geq N_i L_r L_c (N_o + F_r F_c)
    \\
    N_o F_r F_c &\geq N_o + F_r F_c
    &
    N_o F_r F_c &\geq N_o + F_r F_c
\end{align}
$$

If we assume a reasonable kernel size of $F_r = F_c = 3$ we have

$$
\begin{align}
    9 N_o &\geq N_o + 9\\
    8 N_o &\geq 9\\
    N_o &\geq \frac{9}{8}
\end{align}
$$

Since $N_o$ is an integer, the condition holds whenever $N_o \geq 2$.
This confirms that for reasonable parameters we will see a reduction in
memory / compute with separable convolutions while still mixing information
in both spatial and channel dimensions.
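
To get a feel for the size of the saving, dividing the separable cost by the standard cost gives $(N_o + F_r F_c) / (N_o F_r F_c) = 1 / (F_r F_c) + 1 / N_o$ for both weights and MACs. The sketch below evaluates this ratio; the function name `separable_cost_ratio` and the sample values of $N_o$ are illustrative assumptions rather than sizes used by the network.

```python
def separable_cost_ratio(n_o, f_r=3, f_c=3):
    """Separable cost divided by standard cost (identical for weights and MACs).

    A value below 1 means the separable form is cheaper.
    """
    k = f_r * f_c
    return (n_o + k) / (n_o * k)


# Hypothetical N_o values with a 3x3 kernel.
for n_o in (1, 2, 16, 64, 256):
    print(f"N_o = {n_o:3d}: ratio = {separable_cost_ratio(n_o):.3f}")
# N_o = 1 gives a ratio above 1 (no saving); N_o = 64 gives about 0.127.
```

The ratio approaches $1 / (F_r F_c)$ as $N_o$ grows, so with a $3 \times 3$ kernel the saving is bounded by a factor of 9.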