# Convolutional Neural Networks (CNN)

<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/logo.png" width=150>

In this lesson we will learn the basics of CNNs. They are best known for computer vision tasks on images, but here we will apply them to text.



# Overview

<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/cnn.png" width=700>

* **Objective:**  Detect spatial substructure from input data to aid in classification, segmentation, etc.
* **Advantages:** 
  * Small number of weights (shared)
  * Parallelizable
  * Detects spatial substructures (feature extractors)
  * Interpretable via filters
  * Used for images, text, time-series, etc.
* **Disadvantages:**
  * Many hyperparameters (kernel size, strides, etc.)
  * Inputs must all have the same size (image dimensions, text length, etc.)
* **Miscellaneous:** 
  * Lots of deep CNN architectures, constantly updated for SOTA performance
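
To make the "small number of weights (shared)" point concrete, here is a rough sketch that counts the parameters of a 1D conv layer and compares them with a fully connected layer producing an output of the same size. The helper names are hypothetical, and the sizes are assumptions chosen to match the text example used later in this lesson (vocab of 20, 10 words per input, 96 filters of width 3).

```python
def conv1d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """One (in_channels x kernel_size) filter per output channel, plus optional biases."""
    weights = out_channels * in_channels * kernel_size
    return weights + (out_channels if bias else 0)


def dense_param_count(in_features, out_features, bias=True):
    """A fully connected layer has one weight per (input, output) pair."""
    weights = in_features * out_features
    return weights + (out_features if bias else 0)


# Assumed sizes (illustrative only)
vocab_size, sequence_size, num_filters, kernel_size = 20, 10, 96, 3
output_length = sequence_size - kernel_size + 1  # no padding, stride 1

conv_params = conv1d_param_count(vocab_size, num_filters, kernel_size)
dense_params = dense_param_count(vocab_size * sequence_size,
                                 num_filters * output_length)

print("Conv1d parameters: {}".format(conv_params))  # 5856
print("Dense parameters:  {}".format(dense_params))  # 154368
```

The conv layer reuses the same 96 filters at every position in the sequence, which is why it needs far fewer weights than a dense layer producing the same number of outputs.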

# Filters

At the core of CNNs are filters (also called weights or kernels), which convolve (slide) across our input to extract relevant features. The filters are initialized randomly but learn to pick up meaningful features from the input that aid in optimizing for the objective. We're going to teach CNNs in an unorthodox way, focusing entirely on applying them to 2D text data. Each input is composed of words, and we represent each word as a one-hot encoded vector, which gives us our 2D input.

<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/conv.gif" width=400>
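
As a minimal sketch of the idea, the snippet below one-hot encodes a toy sentence and slides a single filter across it in plain Python. The vocabulary, the sentence, and the hand-written filter are all made up for illustration; in a real CNN the filter values are learned, and a library layer would do the sliding for us.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def conv1d_single_filter(x, kernel):
    """Slide one filter across x with stride 1 and no padding.

    x:      channels x width (list of lists)
    kernel: channels x kernel_size (list of lists)
    """
    width = len(x[0])
    kernel_size = len(kernel[0])
    outputs = []
    for start in range(width - kernel_size + 1):
        total = 0.0
        for c in range(len(x)):
            for k in range(kernel_size):
                total += x[c][start + k] * kernel[c][k]
        outputs.append(total)
    return outputs


# Toy vocabulary and sentence (hypothetical)
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
sentence = ["the", "cat", "sat", "down", "the", "cat"]

# One column per word, one row per vocab entry: (vocab_size, sequence_size)
columns = [one_hot(vocab[word], len(vocab)) for word in sentence]
x = [[col[row] for col in columns] for row in range(len(vocab))]

# A hand-written filter of width 2 that responds to "cat" followed by "sat"
kernel = [[0.0, 0.0] for _ in vocab]
kernel[vocab["cat"]][0] = 1.0
kernel[vocab["sat"]][1] = 1.0

feature_map = conv1d_single_filter(x, kernel)
print(feature_map)  # [0.0, 2.0, 0.0, 0.0, 0.0]
```

The filter produces its largest value at position 1, exactly where "cat sat" appears in the sentence.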

In [0]:
# Install PyTorch (this pinned CPU wheel targets Python 3.6 on 64-bit Linux)
!pip3 install http://download.pytorch.org/whl/cpu/torch-0.4.1-cp36-cp36m-linux_x86_64.whl
!pip3 install torchvision

In [0]:
import torch
import torch.nn as nn

In [0]:
# Assume all our inputs have the same number of words
batch_size = 128
sequence_size = 10 # words per input
one_hot_size = 20 # vocab size
# Random values stand in for real one-hot vectors; layout is (batch, channels, sequence)
x = torch.randn(batch_size, one_hot_size, sequence_size)
print("Size: {}".format(x.shape))

Size: torch.Size([128, 20, 10])


In [0]:
# Create filters for a conv layer
out_channels = 96 # number of filters
kernel_size = 3 # each filter spans 3 words (and all 20 input channels)
conv1 = nn.Conv1d(in_channels=one_hot_size, out_channels=out_channels, kernel_size=kernel_size)
print("Size: {}".format(conv1.weight.shape))
print("Number of filters: {}".format(conv1.out_channels))
print("Filter size: {}".format(conv1.kernel_size[0]))
print("Padding: {}".format(conv1.padding[0]))
print("Stride: {}".format(conv1.stride[0]))

Size: torch.Size([96, 20, 3])
Number of filters: 96
Filter size: 3
Padding: 0
Stride: 1


In [0]:
# Convolve using filters
conv_output = conv1(x)
print("Size: {}".format(conv_output.shape))

Size: torch.Size([128, 96, 8])


The 128 is the batch size, and the 96 is the number of output channels, one per filter we applied to the input. But where does the 8 come from? You can visually apply the convolution or use this handy equation:

$\frac{W - F + 2P}{S} + 1 = \frac{10 - 3 + 2(0)}{1} + 1 = 8$

where:
  * W: width of each input
  * F: filter size
  * P: padding
  * S: stride
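
The equation above can be wrapped in a small helper to check output sizes before building a layer. This is a sketch with a hypothetical function name; it uses integer division because PyTorch rounds down when the stride does not divide evenly.

```python
def conv_output_length(width, filter_size, padding=0, stride=1):
    """Output width of a 1D convolution: floor((W - F + 2P) / S) + 1."""
    return (width - filter_size + 2 * padding) // stride + 1


# The setting used in this lesson: W=10, F=3, P=0, S=1
print(conv_output_length(10, 3))             # 8

# Padding of 1 on each side keeps the width unchanged for F=3, S=1
print(conv_output_length(10, 3, padding=1))  # 10

# A stride of 2 roughly halves the width
print(conv_output_length(10, 3, stride=2))   # 4
```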

# Pooling

The result of convolving filters on an input is a feature map. Because neighboring filter positions overlap, the feature map contains a lot of redundant information. Pooling is a way to summarize a high-dimensional feature map into a lower-dimensional one for simpler downstream computation. The pooling operation can take the max value, the average, etc. within each receptive field (window).

<img src="https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/images/pool.jpeg" width=450>

In [0]:
# Max pooling
kernel_size = 2
pool1 = nn.MaxPool1d(kernel_size=kernel_size, stride=2, padding=0)
pool_output = pool1(conv_output)
print("Size: {}".format(pool_output.shape))

Size: torch.Size([128, 96, 4])


$\frac{W-F}{S} + 1 = \frac{8-2}{2} + 1 = 4$
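
To see what pooling does to the values themselves, here is a plain-Python sketch of max and average pooling over a single feature map of width 8. The function names and the input values are made up for illustration; `nn.MaxPool1d` applies the same idea to every channel of every input in the batch.

```python
def max_pool1d(values, kernel_size, stride):
    """Take the max of each window of `kernel_size` values, stepping by `stride`."""
    last_start = len(values) - kernel_size
    return [max(values[i:i + kernel_size])
            for i in range(0, last_start + 1, stride)]


def avg_pool1d(values, kernel_size, stride):
    """Take the mean of each window of `kernel_size` values, stepping by `stride`."""
    last_start = len(values) - kernel_size
    return [sum(values[i:i + kernel_size]) / kernel_size
            for i in range(0, last_start + 1, stride)]


# One hypothetical feature map of width 8
feature_map = [1.0, 3.0, 2.0, 2.0, 5.0, 4.0, 0.0, 6.0]

max_pooled = max_pool1d(feature_map, kernel_size=2, stride=2)
avg_pooled = avg_pool1d(feature_map, kernel_size=2, stride=2)

print(max_pooled)  # [3.0, 2.0, 5.0, 6.0]
print(avg_pooled)  # [2.0, 2.0, 4.5, 3.0]
```

Both results have width 4, matching $\frac{8-2}{2} + 1 = 4$.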

# TODO