# Lecture 7: Convolutional Neural Networks
In this lecture we will discuss the building block behind state-of-the-art architectures for nearly all things computer vision: the convolutional layer. We'll start by going over the basics of what a convolution is, then discuss the building blocks of a convolutional network, take a look at some applications, and finally train a hotdog detector with transfer learning.

![spicynet](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/image/mlp_mnist.png)

We saw that neural networks are fairly decent at tasks like MNIST and CIFAR. However, the inputs in both of these datasets are quite small. CIFAR, for example, has images of size 32x32x3 (3072 total features). In a dense layer, each neuron is connected to all incoming features, so each neuron in the first dense layer of a CIFAR network must have 3072 weights. Now consider a higher resolution image, 200x200x3 for example. This would require 120,000 weights per neuron!
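
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the helper name `dense_weights_per_neuron` and the layer width of 1000 neurons are made up for illustration.

```python
def dense_weights_per_neuron(height, width, channels):
    """Number of weights one dense neuron needs to see every input value."""
    return height * width * channels

cifar = dense_weights_per_neuron(32, 32, 3)
large = dense_weights_per_neuron(200, 200, 3)
print(cifar)  # 3072
print(large)  # 120000

# A hypothetical first dense layer with 1000 neurons on the larger image
hidden_units = 1000  # assumed layer width, for illustration only
print(large * hidden_units)  # 120000000
```

Even a modest dense layer on a 200x200 image already needs over a hundred million weights, before counting any later layers.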

Such a huge number of trainable parameters is problematic due to overfitting. It's clear that dense layers aren't well suited to higher resolution images.

![doggo](https://memeguy.com/photos/images/my-dog-used-to-chase-people-on-a-bike-a-lot-it-got-so-bad-finally-i-had-to-take-his-bike-away-209332.jpg)

Dense layers connect all incoming features to all neurons, implying that every feature has some relationship to every other feature. This is true for many types of problems, but not for images! Above, we see a dog along with a bunch of other stuff. If our goal is to figure out whether a dog is in this picture, the grill behind him doesn't really matter. All that matters are the things that clearly make him a dog, like his cute face.

Rather than have each neuron look at every pixel, it makes more sense to only look at a neighborhood of pixels since images tend to have local information.

![test](https://devblogs.nvidia.com/wp-content/uploads/2015/11/Convolution_schematic.gif)
![convgif](https://ujwlkarn.files.wordpress.com/2016/08/giphy.gif?w=748)

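In a convolutional layer, neurons are replaced by __kernels__. Kernels are small n x n matrices of weights that slide across an incoming image. At each position, the kernel is multiplied elementwise with the patch of pixels underneath it and the results are summed, producing one value of a similarly shaped output.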

![kernel](https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/assets/Conv1.PNG)

![kernels](https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-05-at-11-03-00-pm.png?w=342&h=562)
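
The sliding-window operation can be sketched in plain Python. This is a minimal illustration rather than how MXNet implements it: it assumes a single channel, stride 1, and no padding (so the output is slightly smaller than the input), and the function name `conv2d` and the example values are chosen here for demonstration.

```python
def conv2d(image, kernel):
    """Slide a square kernel over a 2D image (lists of lists), stride 1, no padding."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the k x k patch under it and sum the products.
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

Note that the kernel has only 9 weights no matter how large the image is, which is what makes convolutional layers so much cheaper than dense ones.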

## Other Layers in a ConvNet
Other than the convolutional layer itself, image processing networks almost always contain pooling layers. A pooling layer simply reduces the spatial dimensions (height and width) of incoming data. This allows an increase in the number of feature channels between layers without increasing the total computational workload.

![pooling](http://cs231n.github.io/assets/cnn/maxpool.jpeg)

The intuition behind this is that convolutional kernels are trying to find whether a certain feature is present in an image. We don't care too much about low numbers because they indicate that the feature is not present. It's reasonable to simply throw out all but the most promising regions, which is what max pooling does: it keeps only the largest value in each window.
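
As a rough sketch of this idea, here is 2x2 max pooling with stride 2 in plain Python. The function name `max_pool2d` and the example feature map are assumptions for illustration; in the networks in this lecture the work is done by `gluon.nn.MaxPool2D`.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the largest value in each size x size window of a 2D feature map."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 8], [3, 4]]
```

The 4x4 input becomes a 2x2 output, so the next layer has a quarter as many positions to process.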

Just as in dense networks, convolutional layers need an activation function. The choices exactly mirror the dense case, and ReLU is by far the most common.

# My first convolutional network
Let's take a look at implementing and training a convolutional network in MXNet. We'll start by implementing an architecture called AlexNet. It was not the first convolutional network (LeNet came well over a decade earlier), but its win in the 2012 ImageNet competition is widely credited with kicking off the deep learning boom.

Although AlexNet was designed for higher resolution images, let's just try it out on CIFAR.

In [None]:
import mxnet as mx
from mxnet import nd, autograd
from mxnet import gluon
import numpy as np
import time 

# import matplotlib for plotting
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# I'm using a GPU to speed things up a little; a CPU (mx.cpu()) would be fine though
ctx = mx.gpu()

In [None]:
# add noise to training data
def train_transformer(data, label):
    # make sure all data is the same shape
    data = mx.image.imresize(data, 32, 32)
    # Reorder dimensions from (height, width, channel) to (channel, height, width),
    # the layout Conv2D expects by default; the DataLoader adds the batch dimension later
    data = mx.nd.transpose(data, (2,0,1))
    # convert from int to float
    data = data.astype(np.float32)
    # normalize the data
    data = (data - nd.min(data)) / (nd.max(data) - nd.min(data))
    # add some noise for better performance
    data = data + .01*nd.random.normal(shape=data.shape)
    # force noisy data between 0 and 1
    data = nd.clip(data=data, a_min=0, a_max=1)
    return data, label

# don't add noise to testing data
def test_transformer(data, label):
    data = mx.image.imresize(data, 32, 32)
    data = mx.nd.transpose(data, (2,0,1))
    data = data.astype(np.float32)
    data = (data - nd.min(data)) / (nd.max(data) - nd.min(data))
    return data, label

batch_size = 64
# note: materializing the dataset with [d for d in dataset] applies the transform once up front,
# which makes loading much faster (it also means each image's noise is sampled only once)
train_data = gluon.data.DataLoader([d for d in gluon.data.vision.CIFAR10('./data', train=True, transform=train_transformer)],
                                    batch_size=batch_size, shuffle=True, last_batch='discard')

test_data = gluon.data.DataLoader([d for d in gluon.data.vision.CIFAR10('./data', train=False, transform=test_transformer)],
                                   batch_size=batch_size, shuffle=False, last_batch='discard')

![alexnet](http://cv-tricks.com/wp-content/uploads/2017/03/xalexnet_small-1.png.pagespeed.ic.q5Lnn1-u6h.png)

In [None]:
alex_net = gluon.nn.Sequential()
with alex_net.name_scope():
    #  First convolutional layer
    alex_net.add(gluon.nn.Conv2D(channels=96, kernel_size=11, strides=(4,4), padding=5, activation='relu'))
    alex_net.add(gluon.nn.MaxPool2D(pool_size=3, padding=1, strides=2))
    #  Second convolutional layer
    alex_net.add(gluon.nn.Conv2D(channels=192, kernel_size=5, padding=2, activation='relu'))
    alex_net.add(gluon.nn.MaxPool2D(pool_size=3, padding=1, strides=(2,2)))
    # Third convolutional layer
    alex_net.add(gluon.nn.Conv2D(channels=384, kernel_size=3, padding=1, activation='relu'))
    # Fourth convolutional layer
    alex_net.add(gluon.nn.Conv2D(channels=384, kernel_size=3, padding=1, activation='relu'))
    # Fifth convolutional layer
    alex_net.add(gluon.nn.Conv2D(channels=256, kernel_size=3, padding=1, activation='relu'))
    alex_net.add(gluon.nn.MaxPool2D(pool_size=3, padding=1, strides=2))
    # Flatten and apply fully connected layers
    alex_net.add(gluon.nn.Flatten())
    alex_net.add(gluon.nn.Dense(4096, activation="relu"))
    alex_net.add(gluon.nn.Dense(4096, activation="relu"))
    alex_net.add(gluon.nn.Dense(10))
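
Because this network was designed for larger inputs, it is worth checking how quickly a 32x32 CIFAR image shrinks as it passes through the layers. The sketch below uses the standard output-size formula, floor((n + 2*padding - kernel) / stride) + 1, with the kernel sizes, strides, and padding taken from the cell above. It assumes the default (floor) rounding for pooling; the helper name `out_size` and the layer labels are made up for illustration.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output height/width of a conv or pooling layer for an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1

# (label, kernel_size, stride, padding) for each spatial layer of alex_net above
layers = [
    ("conv1", 11, 4, 5),
    ("pool1", 3, 2, 1),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 1),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 1),
]

size = 32  # CIFAR images are 32 x 32
sizes = []
for name, kernel, stride, padding in layers:
    size = out_size(size, kernel, stride, padding)
    sizes.append(size)
    print(name, size)

print(sizes)  # [8, 4, 4, 2, 2, 2, 2, 1]
```

Under these assumptions the last pooling layer outputs a 1x1 map with 256 channels, so `Flatten` hands only 256 features to the first dense layer.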


In [None]:
# initialize parameters, create a trainer, and loss just like usual
alex_net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
alex_trainer = gluon.Trainer(alex_net.collect_params(), 'sgd', {'learning_rate': .1})
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

In [None]:
# function for evaluating accuracy, note it is identical to dense neural networks
def evaluate_accuracy(data_iterator, net):
    acc = mx.metric.Accuracy()
    for d, l in data_iterator:
        data = d.as_in_context(ctx)
        label = l.as_in_context(ctx)
        output = net(data)
        predictions = nd.argmax(output, axis=1)
        acc.update(preds=predictions, labels=label)
    return acc.get()[1]

In [None]:
# training function also is virtually unchanged!
def train(net, trainer):
    epochs = 7
    smoothing_constant = .01

    for e in range(epochs):
        for i, (d, l) in enumerate(train_data):
            data = d.as_in_context(ctx)
            label = l.as_in_context(ctx)
            with autograd.record():
                output = net(data)
                loss = softmax_cross_entropy(output, label)
            loss.backward()
            trainer.step(data.shape[0])

            curr_loss = nd.mean(loss).asscalar()
            moving_loss = (curr_loss if ((i == 0) and (e == 0))
                           else (1 - smoothing_constant) * moving_loss + smoothing_constant * curr_loss)

            if i > 0 and i % 200 == 0:
                print('Batch %d. Loss: %f' % (i, moving_loss))

        test_accuracy = evaluate_accuracy(test_data, net)
        train_accuracy = evaluate_accuracy(train_data, net)
        print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, moving_loss, train_accuracy, test_accuracy))
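
The `moving_loss` in the training function is an exponential moving average: each new batch loss nudges the running value by a fraction `smoothing_constant`, which smooths out the batch-to-batch noise. Here is a minimal standalone sketch of the same update rule; the function name `smooth` and the example losses are made up, and a large smoothing constant is used so the effect is easy to follow.

```python
def smooth(losses, smoothing_constant=0.01):
    """Exponential moving average, seeded with the first loss value."""
    moving = None
    history = []
    for loss in losses:
        if moving is None:
            moving = loss
        else:
            moving = (1 - smoothing_constant) * moving + smoothing_constant * loss
        history.append(moving)
    return history

smoothed = smooth([2.0, 1.0, 3.0], smoothing_constant=0.5)
print(smoothed)  # [2.0, 1.5, 2.25]
```

With the value of .01 used in the training function, a single unusually good or bad batch barely moves the printed loss.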

In [None]:
train(alex_net, alex_trainer)

AlexNet is a very outdated architecture, and there are many more modern architectures that vastly outperform it. One of the most famous is VGG.

![vgg](http://www.cs.toronto.edu/~frossard/post/vgg16/vgg16.png)

VGG is known for its high accuracy, but also its high computational cost. Let's implement a scaled-down version of the architecture pictured above in MXNet and see how it does!

In [None]:
from mxnet.gluon import nn

def vgg_block(num_convs, channels):
    out = nn.HybridSequential()
    for _ in range(num_convs):
        out.add(nn.Conv2D(channels=channels, kernel_size=3,
                      padding=1, activation='relu'))
    out.add(nn.MaxPool2D(pool_size=2, strides=2))
    return out

def vgg_stack(architecture):
    out = nn.HybridSequential()
    for (num_convs, channels) in architecture:
        out.add(vgg_block(num_convs, channels))
    return out

num_outputs = 10
architecture = ((1,64), (1,128), (1,256), (3,512))
vgg = nn.HybridSequential()
with vgg.name_scope():
    vgg.add(vgg_stack(architecture))
    vgg.add(nn.Flatten())
    vgg.add(nn.Dense(512, activation="relu"))
    vgg.add(nn.Dropout(.5))
    vgg.add(nn.Dense(512, activation="relu"))
    vgg.add(nn.Dropout(.5))
    vgg.add(nn.Dense(num_outputs))
    
vgg.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)    
vgg.hybridize()
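
To get a feel for the cost of this network, the convolutional weights can be counted by hand. The sketch below assumes 3-channel 32x32 CIFAR inputs and one bias per output channel; `conv_params` is a hypothetical helper, not an MXNet function.

```python
def conv_params(in_channels, out_channels, kernel_size=3):
    """Weights plus biases of one convolutional layer."""
    # each output channel has a kernel_size x kernel_size x in_channels filter and one bias
    return (kernel_size * kernel_size * in_channels + 1) * out_channels

architecture = ((1, 64), (1, 128), (1, 256), (3, 512))

in_channels = 3   # RGB input
size = 32         # CIFAR height and width
total = 0
for num_convs, channels in architecture:
    for _ in range(num_convs):
        total += conv_params(in_channels, channels)
        in_channels = channels
    size //= 2    # each block ends with a 2x2 max pool with stride 2

flat = in_channels * size * size  # number of features after Flatten
print(size)   # 2
print(flat)   # 2048
print(total)  # 6270592
```

Most of these parameters sit in the three 512-channel layers of the last block, which is typical of VGG-style networks.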

In [None]:
vgg_trainer = gluon.Trainer(vgg.collect_params(), 'sgd', {'learning_rate': .05})
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

In [None]:
train(vgg, vgg_trainer)

# Transfer Learning: Hotdog or Not Hotdog
Unfortunately, if we want to move beyond CIFAR and MNIST to high resolution (and much more useful) images, we can't train a network from scratch on CPUs; it would simply take too long. Modern networks are typically trained on multiple GPUs, which can be orders of magnitude faster.

Fortunately, there's a dataset called ImageNet that is varied enough for networks trained on it to learn extremely general features!

![imagenet](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/Sample-of-Images-from-the-ImageNet-Dataset-used-in-the-ILSVRC-Challenge.png)

ImageNet is a dataset of over 1 million images, each labeled with one of 1000 classes. It is so large and varied that a network trained on it learns very general features. This means that we can repurpose the __pretrained__ weights of such a network for many other image tasks!

![test](http://jwfromm.com/GIX513/images/transfer.jpg)

To demonstrate this, let's create a network that identifies whether or not an image has a hotdog in it.

In [None]:
import os
import logging
logging.basicConfig(level=logging.INFO)
from mxnet.test_utils import download

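# start by listing the hotdog dataset files and their SHA-1 hashes (they are downloaded in the next cell)

# ctx is now a list of devices, which is the form gluon.utils.split_and_load expects later on
ctx = [mx.gpu()]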
dataset_files = {'train': ('not_hotdog_train-e6ef27b4.rec', '0aad7e1f16f5fb109b719a414a867bbee6ef27b4'),
                 'validation': ('not_hotdog_validation-c0201740.rec', '723ae5f8a433ed2e2bf729baec6b878ac0201740')}

In [None]:
# don't worry too much about this part; it just downloads MXNet's image record files and verifies their SHA-1 hashes

training_dataset, training_data_hash = dataset_files['train']

validation_dataset, validation_data_hash = dataset_files['validation']

def verified(file_path, sha1hash):
    import hashlib
    sha1 = hashlib.sha1()
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(1048576)
            if not data:
                break
            sha1.update(data)
    matched = sha1.hexdigest() == sha1hash
    if not matched:
        # logging.warn is a deprecated alias; logging.warning is the supported name
        logging.warning('Found hash mismatch in file {}, possibly due to incomplete download.'
                        .format(file_path))
    return matched

url_format = 'https://apache-mxnet.s3-accelerate.amazonaws.com/gluon/dataset/{}'
if not os.path.exists(training_dataset) or not verified(training_dataset, training_data_hash):
    logging.info('Downloading training dataset.')
    download(url_format.format(training_dataset),
             overwrite=True)
if not os.path.exists(validation_dataset) or not verified(validation_dataset, validation_data_hash):
    logging.info('Downloading validation dataset.')
    download(url_format.format(validation_dataset),
             overwrite=True)
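
The `verified` helper hashes the file in 1 MB chunks so that a large record file never has to fit in memory at once. The sketch below shows that chunked hashing gives the same digest as hashing the whole content in one go. It is a standalone illustration: the function name `sha1_of_file`, the temporary file, and its contents are made up.

```python
import hashlib
import os
import tempfile

def sha1_of_file(path, chunk_size=1048576):
    """SHA-1 hex digest of a file, read chunk_size bytes at a time."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        # iter() keeps calling f.read until it returns the empty bytes sentinel
        for chunk in iter(lambda: f.read(chunk_size), b''):
            sha1.update(chunk)
    return sha1.hexdigest()

payload = b"hotdog" * 1000  # stand-in for a downloaded file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

try:
    chunked = sha1_of_file(tmp_path, chunk_size=64)
    whole = hashlib.sha1(payload).hexdigest()
finally:
    os.remove(tmp_path)

print(chunked == whole)  # True
```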

In [None]:
# load dataset
train_iter = mx.io.ImageRecordIter(path_imgrec=training_dataset,
                                   min_img_size=256,
                                   data_shape=(3, 224, 224),
                                   rand_crop=True,
                                   shuffle=True,
                                   batch_size=batch_size,
                                   max_random_scale=1.5,
                                   min_random_scale=0.75,
                                   rand_mirror=True)
val_iter = mx.io.ImageRecordIter(path_imgrec=validation_dataset,
                                 min_img_size=256,
                                 data_shape=(3, 224, 224),
                                 batch_size=batch_size)

In [None]:
# take a look at some examples
for i, batch in enumerate(val_iter):
    d = batch.data[0]
    l = batch.label[0]
    data = d[0]
    label = l[0]
    data = mx.nd.transpose(data, (1,2,0))
    plt.imshow(data.astype(np.uint8).asnumpy())
    plt.show()
    if label.asscalar() == 0:
        print("Not a hotdog")
    else:
        print("Hotdog")
    if i == 20:
        break

In [None]:
from mxnet.gluon.model_zoo import vision as models
# let's use a pretrained SqueezeNet, a model known for decent accuracy at a low computational cost
squeezenet = models.squeezenet1_1(pretrained=True, prefix="dog_", ctx=ctx)

See [here](https://arxiv.org/pdf/1602.07360.pdf) for more info about SqueezeNet.

In [None]:
squeezenet

In [None]:
# create a second SqueezeNet, this time with only 2 output classes (hotdog or not hotdog)
dognet = models.squeezenet1_1(classes=2, prefix="dog_")
dognet.collect_params().initialize(ctx=ctx)

In [None]:
dognet

In [None]:
# reuse the pretrained features chunk of squeezenet; dognet's freshly initialized output chunk is left untouched
dognet.features = squeezenet.features

In [None]:
# in the trainer, specify that we only want to update the output chunk of dognet
trainer = gluon.Trainer(dognet.output.collect_params(), 'sgd', {'learning_rate': .01})
loss = gluon.loss.SoftmaxCrossEntropyLoss()

In [None]:
# given guess z and label y, compute the loss
def unbalanced_loss(loss_func, z, y):
    # there are about 5 times more not-hotdog images than hotdog images :(
    positive_class_weight = 5
    regular_loss = loss_func(z, y)
    # conveniently, y is either 1 (hotdog) or 0 (not hotdog), so the scale factor is
    # (1 + 5)/5 = 1.2 for hotdogs and 1/5 = 0.2 for everything else
    scaled_loss = regular_loss * (1 + y*positive_class_weight)/positive_class_weight
    return scaled_loss
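
It helps to work out what the scaling in `unbalanced_loss` does to each class. The sketch below evaluates the same factor, (1 + y * positive_class_weight) / positive_class_weight, using exact fractions. The helper name `loss_scale` and the 5-to-1 example batch are assumptions for illustration.

```python
from fractions import Fraction

positive_class_weight = 5

def loss_scale(y, weight=positive_class_weight):
    """Scale factor applied to one sample's loss; y is 1 (hotdog) or 0 (not hotdog)."""
    return Fraction(1 + y * weight, weight)

negative_scale = loss_scale(0)
positive_scale = loss_scale(1)
print(negative_scale)                   # 1/5
print(positive_scale)                   # 6/5
print(positive_scale / negative_scale)  # 6

# In a hypothetical batch with 5 not-hotdogs and 1 hotdog, total weight per class:
negative_total = 5 * negative_scale
positive_total = 1 * positive_scale
print(negative_total, positive_total)   # 1 6/5
```

So each hotdog counts six times as much as each not-hotdog, and with a 5-to-1 imbalance the two classes end up contributing roughly equal total weight to the loss.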

In [None]:
# return metrics string representation
def metric_str(names, accs):
    return ', '.join(['%s=%f'%(name, acc) for name, acc in zip(names, accs)])
metric = mx.metric.create(['acc', 'f1'])
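
Accuracy alone is misleading on an imbalanced dataset, which is why F1 is tracked as well. The sketch below uses the standard binary F1 definition (the harmonic mean of precision and recall) on made-up labels; it is not MXNet's implementation, and details such as how MXNet averages across batches may differ.

```python
def accuracy(labels, predictions):
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

def f1_score(labels, predictions):
    """Binary F1 with class 1 (hotdog) as the positive class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # no correct positive predictions at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

labels = [0, 0, 0, 0, 0, 1]           # 5 not-hotdogs, 1 hotdog
always_negative = [0, 0, 0, 0, 0, 0]  # a lazy model that never says "hotdog"
one_false_alarm = [0, 0, 0, 0, 1, 1]  # finds the hotdog, with one false alarm

print(accuracy(labels, always_negative), f1_score(labels, always_negative))
print(accuracy(labels, one_false_alarm), f1_score(labels, one_false_alarm))
```

Both models have the same accuracy (5 out of 6), but only the second gets a nonzero F1 score, because the first never finds a single hotdog.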

In [None]:
from mxnet.image import color_normalize

def evaluate(net, data_iter, ctx):
    data_iter.reset()
    for batch in data_iter:
        data = color_normalize(batch.data[0]/255,
                               mean=mx.nd.array([0.485, 0.456, 0.406]).reshape((1,3,1,1)),
                               std=mx.nd.array([0.229, 0.224, 0.225]).reshape((1,3,1,1)))
        data = gluon.utils.split_and_load(data, ctx_list=ctx, batch_axis=0)
        label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis=0)
        outputs = []
        for x in data:
            outputs.append(net(x))
        metric.update(label, outputs)
    out = metric.get()
    metric.reset()
    return out
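
The `color_normalize` call rescales pixels to the range 0 to 1, then subtracts a per-channel mean and divides by a per-channel standard deviation. These particular numbers are the commonly used ImageNet channel statistics. Here is a sketch of the same arithmetic for a single RGB pixel in plain Python; the function name `normalize_pixel` is made up.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel mean, same values as in the cell above
STD = (0.229, 0.224, 0.225)   # per-channel standard deviation

def normalize_pixel(rgb):
    """Normalize one (R, G, B) pixel with values in 0..255."""
    return tuple((value / 255 - m) / s for value, m, s in zip(rgb, MEAN, STD))

white = normalize_pixel((255, 255, 255))
average = normalize_pixel(tuple(m * 255 for m in MEAN))

print(white)    # every channel ends up a bit above 2
print(average)  # every channel is (approximately) 0
```

A pixel equal to the dataset average maps to zero, so the pretrained network sees inputs centered the same way as the images it was trained on.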

In [None]:
# now let's train dognet; this will look very similar to the other training loops we've done
epochs = 10
best_f1 = 0
log_interval = 100
val_names, val_accs = evaluate(dognet, val_iter, ctx)
print('[Initial] validation: %s'%(metric_str(val_names, val_accs)))
for epoch in range(epochs):
    tic = time.time()
    train_iter.reset()
    btic = time.time()
    for i, batch in enumerate(train_iter):
        # the model zoo models expect normalized images
        data = color_normalize(batch.data[0]/255,
                               mean=mx.nd.array([0.485, 0.456, 0.406]).reshape((1,3,1,1)),
                               std=mx.nd.array([0.229, 0.224, 0.225]).reshape((1,3,1,1)))
        data = gluon.utils.split_and_load(data, ctx_list=ctx, batch_axis=0)
        label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis=0)
        outputs = []
        Ls = []
        with autograd.record():
            for x, y in zip(data, label):
                z = dognet(x)
                # rescale the loss based on class to counter the imbalance problem                
                L = unbalanced_loss(loss, z, y)
                # store the loss and do backward after we have done forward
                # on all GPUs for better speed on multiple GPUs.
                Ls.append(L)
                outputs.append(z)
            for L in Ls:
                L.backward()
        trainer.step(batch.data[0].shape[0])
        metric.update(label, outputs)
        if log_interval and not (i+1)%log_interval:
            names, accs = metric.get()
            print('[Epoch %d Batch %d] speed: %f samples/s, training: %s'%(
                           epoch, i, batch_size/(time.time()-btic), metric_str(names, accs)))
        btic = time.time()

    names, accs = metric.get()
    metric.reset()
    print('[Epoch %d] training: %s'%(epoch, metric_str(names, accs)))
    print('[Epoch %d] time cost: %f'%(epoch, time.time()-tic))
    val_names, val_accs = evaluate(dognet, val_iter, ctx)
    print('[Epoch %d] validation: %s'%(epoch, metric_str(val_names, val_accs)))

    if val_accs[1] > best_f1:
        best_f1 = val_accs[1]
        print('Best validation f1 found. Checkpointing...')
        dognet.save_params('dog-%d.params'%(epoch))

In [None]:
from skimage.color import rgba2rgb
import skimage.io as io
def classify_hotdog(net, url):
    I = io.imread(url)
    if I.shape[2] == 4:
        # rgba2rgb returns floats in [0, 1]; rescale so the uint8 cast below keeps the image
        I = rgba2rgb(I) * 255
    image = mx.nd.array(I).astype(np.uint8)
    plt.subplot(1, 2, 1)
    plt.imshow(image.asnumpy())
    image = mx.image.resize_short(image, 256)
    image, _ = mx.image.center_crop(image, (224, 224))
    plt.subplot(1, 2, 2)
    plt.imshow(image.asnumpy())
    image = mx.image.color_normalize(image.astype(np.float32)/255,
                                     mean=mx.nd.array([0.485, 0.456, 0.406]),
                                     std=mx.nd.array([0.229, 0.224, 0.225]))
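    # reorder (height, width, channel) to (channel, height, width), matching the training data
    image = mx.nd.transpose(image.astype('float32'), (2,0,1))
    image = mx.nd.expand_dims(image, axis=0)
    # turn the two raw network outputs into probabilities
    out = mx.nd.softmax(net(image))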
    print('Probabilities are: '+str(out[0].asnumpy()))
    result = np.argmax(out.asnumpy())
    outstring = ['Not hotdog!', 'Hotdog!']
    print(outstring[result])
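
The last few lines of `classify_hotdog` turn the two raw network outputs into probabilities with a softmax and then pick the larger one. A minimal sketch of that step in plain Python follows; the example logits are made up, not real network outputs.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # subtracting the largest score avoids overflow and does not change the result
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 3.0]  # hypothetical raw outputs for [not hotdog, hotdog]
probs = softmax(logits)
result = probs.index(max(probs))  # same role as np.argmax above

outstring = ['Not hotdog!', 'Hotdog!']
print(probs)
print(outstring[result])  # Hotdog!
```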

In [None]:
dognet.collect_params().reset_ctx(mx.cpu())
classify_hotdog(dognet, "http://del.h-cdn.co/assets/17/25/980x490/landscape-1498074256-delish-blt-dogs-01.jpg")

In [None]:
classify_hotdog(dognet, "https://i.ytimg.com/vi/SfLV8hD7zX4/maxresdefault.jpg")