<a href="https://colab.research.google.com/github/Carolyn-Ha/MDST/blob/main/20_4_lec20_channels_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Notebook credit**: Based on the original D2L notebook [here](https://github.com/d2l-ai/d2l-tensorflow-colab/blob/master/chapter_convolutional-neural-networks/channels.ipynb).

# Multiple Input and Multiple Output Channels

While we have described the multiple channels
that comprise each image (e.g., color images have the standard RGB channels
to indicate the amount of red, green and blue) and convolutional layers for multiple channels before,
until now, we simplified all of our numerical examples
by working with just a single input and a single output channel.
This has allowed us to think of our inputs, convolution kernels,
and outputs each as two-dimensional tensors.

When we add channels into the mix,
our inputs and hidden representations
both become three-dimensional tensors.
For example, each RGB input image has shape $3\times h\times w$.
We refer to this axis, with a size of 3, as the *channel* dimension.
In this section, we will take a deeper look
at convolution kernels with multiple input and multiple output channels.



## Multiple Input Channels

When the input data contain multiple channels,
we need to construct a convolution kernel
with the same number of input channels as the input data,
so that it can perform cross-correlation with the input data.
Assuming that the number of channels for the input data is $c_i$,
the number of input channels of the convolution kernel also needs to be $c_i$. If our convolution kernel's window shape is $k_h\times k_w$,
then when $c_i=1$, we can think of our convolution kernel
as just a two-dimensional tensor of shape $k_h\times k_w$.

However, when $c_i>1$, we need a kernel
that contains a tensor of shape $k_h\times k_w$ for *every* input channel. Concatenating these $c_i$ tensors together
yields a convolution kernel of shape $c_i\times k_h\times k_w$.
Since the input and convolution kernel each have $c_i$ channels,
we can perform a cross-correlation operation
on the two-dimensional tensor of the input
and the two-dimensional tensor of the convolution kernel
for each channel, adding the $c_i$ results together
(summing over the channels)
to yield a two-dimensional tensor.
This is the result of a two-dimensional cross-correlation
between a multi-channel input and
a multi-input-channel convolution kernel.
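
The description above can be written as a single formula. This is a sketch that assumes zero-based indices; the index names $c$, $a$, $b$ and the input height and width $h$, $w$ are notation introduced here for illustration. For an input $X$ of shape $c_i\times h\times w$ and a kernel $K$ of shape $c_i\times k_h\times k_w$, the output element at position $(i, j)$ would be

$$Y_{i,j} = \sum_{c=0}^{c_i-1}\ \sum_{a=0}^{k_h-1}\ \sum_{b=0}^{k_w-1} X_{c,\,i+a,\,j+b}\; K_{c,\,a,\,b}, \qquad 0\le i\le h-k_h,\quad 0\le j\le w-k_w.$$

The two inner sums are the ordinary single-channel cross-correlation, and the outer sum adds the $c_i$ per-channel results together.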

In the figure below, we demonstrate an example
of a two-dimensional cross-correlation with two input channels.
The shaded portions are the first output element
as well as the input and kernel tensor elements used for the output computation:
$(1\times1+2\times2+4\times3+5\times4)+(0\times0+1\times1+3\times2+4\times3)=56$.

![Cross-correlation computation with 2 input channels.](https://github.com/d2l-ai/d2l-tensorflow-colab/blob/master/img/conv-multi-in.svg?raw=1)


To make sure we really understand what is going on here,
we can implement cross-correlation operations with multiple input channels ourselves.
Notice that all we are doing is performing one cross-correlation operation
per channel and then adding up the results.


### ***Notes***
- With multiple input channels, the convolution kernel has shape $c_i\times k_h\times k_w$.
  - The extra dimension, beyond height and width, is the number of input channels.

- Each input channel gets its own two-dimensional kernel slice, and all slices share the same $k_h\times k_w$ window shape.
  - For each channel, multiply the window and the kernel slice elementwise, then add everything together across channels.

- An input with 2 channels needs a kernel with 2 channels.
- When the kernel has as many input channels as the data, cross-correlation is computed channel by channel and the per-channel results are summed (the number of input channels must match the number of kernel channels).



**[cf]**
- kernel: a small tensor of weights that slides over the input data during convolution, extracting local patterns
- channel: one feature map along the channel axis. Input channels come with the data (e.g., RGB), while output channels are produced by applying different kernels to the input. Each channel captures a different aspect of the input, and together the output channels form the output tensor of a convolutional layer.

In [None]:
import tensorflow as tf

def corr2d(X, K):
    # 2D cross-correlation of a 2D input tensor X with a 2D kernel K
    h, w = K.shape
    # Output Y starts as zeros; its shape follows from the shapes of X and K
    Y = tf.Variable(tf.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1)))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Elementwise product of the current window of X with K, then sum
            Y[i, j].assign(tf.reduce_sum(
                X[i : i + h, j : j + w] * K
            ))
    return Y


def corr2d_multi_in(X, K):
    # Multi-input-channel 2D cross-correlation:
    # iterate over the 0th (channel) dimension of `X` and `K` together,
    # cross-correlate each (x, k) pair, then sum the per-channel results
    return tf.reduce_sum([corr2d(x, k) for x, k in zip(X, K)], axis=0)

We can construct the input tensor `X` and the kernel tensor `K`
corresponding to the values in the figure above
to validate the output of the cross-correlation operation.


In [None]:
X = tf.constant([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                 [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = tf.constant([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])

corr2d_multi_in(X, K).numpy()

array([[ 56.,  72.],
       [104., 120.]], dtype=float32)
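
As a cross-check, the same computation can be sketched without explicit Python loops. The snippet below uses NumPy rather than TensorFlow, and the names `corr2d_multi_in_np`, `X_np`, and `K_np` are hypothetical helpers introduced only for this illustration; they are not part of the original notebook. It gathers every kernel-sized window of the input and contracts it with the kernel over the channel, row, and column axes.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def corr2d_multi_in_np(X, K):
    """Hypothetical NumPy counterpart of corr2d_multi_in (illustration only)."""
    k_h, k_w = K.shape[1:]
    # windows has shape (c_i, out_h, out_w, k_h, k_w):
    # one k_h x k_w patch per channel and per output position
    windows = sliding_window_view(X, (k_h, k_w), axis=(1, 2))
    # Multiply each patch by its channel's kernel slice and
    # sum over channels (c) and over the window axes (a, b)
    return np.einsum("cijab,cab->ij", windows, K)


# Same values as in the figure, under separate names so that the
# notebook's own X and K are left untouched
X_np = np.array([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                 [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]],
                dtype=np.float32)
K_np = np.array([[[0.0, 1.0], [2.0, 3.0]],
                 [[1.0, 2.0], [3.0, 4.0]]], dtype=np.float32)

Y_np = corr2d_multi_in_np(X_np, K_np)
print(Y_np)
```

The printed array should agree with the output shown above, since both versions sum the same elementwise products.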

## Multiple Output Channels


Regardless of the number of input channels,
so far we always ended up with one output channel.
However, as we discussed before,
it turns out to be essential to have multiple channels at each layer.
In the most popular neural network architectures,
we actually increase the channel dimension
as we go higher up in the neural network,
typically downsampling to trade off spatial resolution
for greater *channel depth*.
Intuitively, you could think of each channel
as responding to some different set of features.
Reality is a bit more complicated than the most naive interpretation of this intuition, since representations are not learned independently but are rather optimized to be jointly useful.
So it may not be that a single channel learns an edge detector but rather that some direction in channel space corresponds to detecting edges.


Denote by $c_i$ and $c_o$ the number
of input and output channels, respectively,
and let $k_h$ and $k_w$ be the height and width of the kernel.
To get an output with multiple channels,
we can create a kernel tensor
of shape $c_i\times k_h\times k_w$
for *every* output channel.
We concatenate them on the output channel dimension,
so that the shape of the convolution kernel
is $c_o\times c_i\times k_h\times k_w$.
In cross-correlation operations,
the result on each output channel is calculated
from the convolution kernel corresponding to that output channel
and takes input from all channels in the input tensor.
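
In formula form, and assuming zero-based indices with the index names $o$, $c$, $a$, $b$ introduced here purely for illustration, the output of a kernel $K$ of shape $c_o\times c_i\times k_h\times k_w$ applied to an input $X$ with $c_i$ channels could be written as

$$Y_{o,\,i,\,j} = \sum_{c=0}^{c_i-1}\ \sum_{a=0}^{k_h-1}\ \sum_{b=0}^{k_w-1} X_{c,\,i+a,\,j+b}\; K_{o,\,c,\,a,\,b}, \qquad o = 0, \ldots, c_o-1.$$

Each value of $o$ selects one $c_i\times k_h\times k_w$ kernel, so the layer holds $c_o\cdot c_i\cdot k_h\cdot k_w$ weights in total (bias excluded).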

We implement a cross-correlation function
to calculate the output of multiple channels as shown below.


In [None]:
def corr2d_multi_in_out(X, K):
    # Iterate over the 0th dimension of `K` (the output channels).
    # Each time, perform a cross-correlation with the input `X`,
    # which can have multiple input channels, then stack all the results.
    return tf.stack([corr2d_multi_in(X, k) for k in K])

We construct a convolution kernel with 3 output channels
by stacking the kernel tensor `K` with `K+1`
(one added to each element of `K`) and `K+2`.


In [None]:
K = tf.stack((K, K+1, K+2))
# K+1 and K+2 add 1 and 2 to every element of K, giving the kernels for the other 2 output channels
K.shape, K

(TensorShape([3, 2, 2, 2]),
 <tf.Tensor: shape=(3, 2, 2, 2), dtype=float32, numpy=
 array([[[[0., 1.],
          [2., 3.]],
 
         [[1., 2.],
          [3., 4.]]],
 
 
        [[[1., 2.],
          [3., 4.]],
 
         [[2., 3.],
          [4., 5.]]],
 
 
        [[[2., 3.],
          [4., 5.]],
 
         [[3., 4.],
          [5., 6.]]]], dtype=float32)>)

Below, we perform cross-correlation operations
on the input tensor `X` with the kernel tensor `K`.
Now the output contains 3 channels.
The first channel matches the earlier result
obtained from the same input tensor `X`
and the multi-input-channel,
single-output-channel kernel.


In [None]:
corr2d_multi_in_out(X, K)
# One multi-input cross-correlation is computed per output channel

<tf.Tensor: shape=(3, 2, 2), dtype=float32, numpy=
array([[[ 56.,  72.],
        [104., 120.]],

       [[ 76., 100.],
        [148., 172.]],

       [[ 96., 128.],
        [192., 224.]]], dtype=float32)>

### ***Notes***
- A single channel does not necessarily learn an edge detector on its own; rather, some direction in channel space may correspond to detecting edges.
  - Several channels are used jointly to act as an edge detector.
  - The single-output-channel computation is repeated once for every output channel's kernel.
  - Full kernel shape: $c_o\times c_i\times k_h\times k_w$.

## $1\times 1$ Convolutional Layer

At first, a $1 \times 1$ convolution, i.e., $k_h = k_w = 1$,
does not seem to make much sense.
After all, a convolution correlates adjacent pixels.
A $1 \times 1$ convolution obviously does not.
Nonetheless, they are popular operations that are sometimes included
in the designs of complex deep networks.
Let us see in some detail what it actually does.

Because the minimum window is used,
the $1\times 1$ convolution loses the ability
of larger convolutional layers
to recognize patterns consisting of interactions
among adjacent elements in the height and width dimensions.
The only computation of the $1\times 1$ convolution occurs
on the channel dimension.

The figure below shows the cross-correlation computation
using the $1\times 1$ convolution kernel
with 3 input channels and 2 output channels.
Note that the inputs and outputs have the same height and width.
Each element in the output is derived
from a linear combination of elements *at the same position*
in the input image.
You could think of the $1\times 1$ convolutional layer
as constituting a fully-connected layer applied at every single pixel location
to transform the $c_i$ corresponding input values into $c_o$ output values.
Because this is still a convolutional layer,
the weights are tied across pixel locations.
Thus the $1\times 1$ convolutional layer requires $c_o\times c_i$ weights
(plus the bias).


![The cross-correlation computation uses the $1\times 1$ convolution kernel with 3 input channels and 2 output channels. The input and output have the same height and width.](https://github.com/d2l-ai/d2l-tensorflow-colab/blob/master/img/conv-1x1.svg?raw=1)

Let us check whether this works in practice:
we implement a $1 \times 1$ convolution
using a fully-connected layer.
The only thing is that we need to make some adjustments
to the data shape before and after the matrix multiplication.


In [None]:
def corr2d_multi_in_out_1x1(X, K):
    # Number of input channels and spatial size of the input
    c_i, h, w = X.shape
    # Number of output channels
    c_o = K.shape[0]
    # Flatten the spatial dimensions: each column holds one pixel's c_i channel values
    X = tf.reshape(X, (c_i, h * w))
    K = tf.reshape(K, (c_o, c_i))
    # Matrix multiplication, as in a fully-connected layer
    Y = tf.matmul(K, X)
    return tf.reshape(Y, (c_o, h, w))

When performing $1\times 1$ convolution,
the above function is equivalent to the previously implemented cross-correlation function `corr2d_multi_in_out`.
Let us check this with some sample data.


In [None]:
X = tf.random.normal((3,3,3))
K = tf.random.normal((2,3,1,1))

In [None]:
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
assert float(tf.reduce_sum(tf.abs(Y1 - Y2))) < 1e-6
# The assertion checks that the two implementations agree: the matrix-multiplication
# version is a specialized way of computing the 1x1 convolution
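
The reading of a $1\times 1$ convolution as a fully-connected layer applied at every pixel can also be checked directly. The following is a minimal NumPy sketch, not the notebook's TensorFlow code; the sizes and the names `X_demo`, `W_demo`, `Y_loop`, and `Y_einsum` are assumptions chosen for illustration. It applies the same $c_o\times c_i$ weight matrix to the channel vector at each pixel and compares the result with a single contraction over the channel axis.

```python
import numpy as np

rng = np.random.default_rng(0)
c_i, c_o, h, w = 3, 2, 4, 5            # illustrative sizes, not from the text
X_demo = rng.normal(size=(c_i, h, w))
W_demo = rng.normal(size=(c_o, c_i))   # a 1x1 kernel with its two unit axes dropped

# Fully-connected layer applied independently at each pixel,
# with the same weights W_demo at every location (tied weights)
Y_loop = np.empty((c_o, h, w))
for i in range(h):
    for j in range(w):
        Y_loop[:, i, j] = W_demo @ X_demo[:, i, j]

# The same computation as one contraction over the channel axis
Y_einsum = np.einsum("oc,chw->ohw", W_demo, X_demo)

same = np.allclose(Y_loop, Y_einsum)
n_weights = W_demo.size                # c_o * c_i weights, bias excluded
print(same, n_weights)
```

The height and width are unchanged, and the weight count does not depend on the image size, which reflects the weight tying across pixel locations.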

### ***Notes***
- A $1\times 1$ kernel looks at a $1\times 1$ patch, so every patch is a single pixel.
  - There is no spatial locality to exploit, which raises the question of why such a layer is useful in the first place.
  - Input and output have the same height and width, because the patch is $1\times 1$.
- It acts as a fully-connected layer applied at every pixel location: it ignores spatial structure and instead combines the values at that pixel across channels.
  - From the channel perspective, the operation is performed jointly across channels to obtain the output, so the interaction happens across channels rather than across neighboring pixels.
  - It is a way to combine multiple channels.
  - With only one channel, a $1\times 1$ patch would not make much sense (it would only rescale each pixel), but with multiple channels there is still useful information at a single pixel location.
    - Values in different channels at the same pixel location may still be related.
## Summary

* Multiple channels can be used to extend the model parameters of the convolutional layer.
* The $1\times 1$ convolutional layer is equivalent to a fully-connected layer applied on a per-pixel basis.
* The $1\times 1$ convolutional layer is typically used to adjust the number of channels between network layers and to control model complexity.
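
To make the point about model complexity concrete, here is a small sketch that counts weights using the $c_o\times c_i\times k_h\times k_w$ kernel shape from this section. The helper `conv_weight_count` and the channel counts (256 and 64) are hypothetical and chosen only for illustration. The sketch compares a single $3\times 3$ layer with a stack that first reduces the channel count with a $1\times 1$ layer, applies the $3\times 3$ layer on fewer channels, and then restores the channel count with another $1\times 1$ layer.

```python
def conv_weight_count(c_in, c_out, k_h, k_w):
    """Weights (bias excluded) in a kernel of shape c_out x c_in x k_h x k_w."""
    return c_out * c_in * k_h * k_w


# Hypothetical channel counts, chosen only for illustration
direct = conv_weight_count(256, 256, 3, 3)

reduced = (conv_weight_count(256, 64, 1, 1)     # 1x1: 256 -> 64 channels
           + conv_weight_count(64, 64, 3, 3)    # 3x3 on the reduced channels
           + conv_weight_count(64, 256, 1, 1))  # 1x1: 64 -> 256 channels

print(direct, reduced)  # 589824 69632
```

Under these assumed sizes the stacked variant uses far fewer weights, although the two designs are not equivalent functions; the comparison only shows how $1\times 1$ layers can change the channel count cheaply.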
