# Deep Learning with TensorFlow/Keras

Now that we have completed a Machine Learning project with Spark ML, this assignment switches to Deep Learning with TensorFlow/Keras through two tasks:
- Task 1: Image classification with a CNN
- Task 2: Image captioning with a combination of a CNN and an RNN

## Task 1: Going Deeper with Convolutions


Before **Inception v1** (**GoogLeNet**), the winner of the **ILSVRC** (ImageNet Large Scale Visual Recognition Challenge) in 2014, most popular CNNs simply stacked convolution layers deeper and deeper, hoping to get better performance.
The Inception network instead uses several architectural tricks, such as parallel branches with different filter sizes and 1x1 convolutions that reduce the channel count, to improve both speed and accuracy.
**Inception v1** improves significantly on **ZFNet** (the 2013 winner) and **AlexNet** (the 2012 winner), and has a lower error rate than VGGNet.
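
To make the speed argument concrete, here is a rough sketch that counts multiply-accumulates (MACs) for the 5x5 branch of the first Inception module, with and without a 1x1 reduction in front of it. The channel numbers (192 inputs, reduction to 16, 32 outputs, 28x28 feature map) are the ones used for `inception3a` in the model code; the helper name `conv_macs` is ours, and biases and batch normalization are ignored.

```python
def conv_macs(height, width, kernel, in_channels, out_channels):
    """Multiply-accumulates of a stride-1, same-padded convolution (bias ignored)."""
    return height * width * kernel * kernel * in_channels * out_channels

# 5x5 branch applied directly to the 192-channel, 28x28 input.
direct = conv_macs(28, 28, 5, 192, 32)

# Same branch with a 1x1 reduction to 16 channels first.
reduced = conv_macs(28, 28, 1, 192, 16) + conv_macs(28, 28, 5, 16, 32)

print(f"direct 5x5:          {direct:,} MACs")
print(f"1x1 reduce then 5x5: {reduced:,} MACs")
print(f"saving factor:       {direct / reduced:.1f}x")
```

Under these assumptions the reduced branch needs roughly a tenth of the arithmetic, which is why the reduction layers make it affordable to run several filter sizes in parallel.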

In this task, we will implement the Inception architecture [in this paper](https://arxiv.org/abs/1409.4842) with TensorFlow/Keras/PyTorch.

The goal of this task is to understand how to write the code that builds the model. As long as you can verify the correctness of the code (e.g., through a Keras model summary), it is not necessary to train the model.
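
One cheap way to verify the spatial sizes before building anything is to apply the usual output-size formula for convolution and pooling layers by hand. The sketch below does this for the downsampling layers of the network, using the kernel, stride, padding and `ceil_mode` values from the model code. The helper `out_size` is our own and assumes dilation 1; the Inception modules are skipped because they keep the spatial size unchanged.

```python
def out_size(size, kernel, stride=1, padding=0, ceil_mode=False):
    """Spatial output size of a conv/pool layer (dilation 1 assumed)."""
    span = size + 2 * padding - kernel
    steps = -(-span // stride) if ceil_mode else span // stride  # ceil or floor
    out = steps + 1
    # With ceil_mode, the last window must still start inside the padded input.
    if ceil_mode and (out - 1) * stride >= size + padding:
        out -= 1
    return out

layers = [
    ("conv1",    dict(kernel=7, stride=2, padding=3)),
    ("maxpool1", dict(kernel=3, stride=2, ceil_mode=True)),
    ("conv2",    dict(kernel=1)),
    ("conv3",    dict(kernel=3, padding=1)),
    ("maxpool2", dict(kernel=3, stride=2, ceil_mode=True)),
    ("maxpool3", dict(kernel=3, stride=2, ceil_mode=True)),
    ("maxpool4", dict(kernel=2, stride=2, ceil_mode=True)),
]

size = 224
sizes = {}
for name, cfg in layers:
    size = out_size(size, **cfg)
    sizes[name] = size
    print(f"{name:9s} -> {size} x {size}")
```

The resulting sequence 112, 56, 56, 56, 28, 14, 7 can be compared against the shape comments in `_forward` and against the model summary.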

In [1]:
# conda env torch_planet
!pip install torch
!pip install torchsummary



In [2]:
import torch
torch.__version__

'1.3.1'

In [3]:
#inspired from https://github.com/pytorch/vision/blob/master/torchvision/models/googlenet.py
from __future__ import division

import warnings
from collections import namedtuple
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional, Tuple
from torch import Tensor
from torchsummary import summary

In [4]:

GoogLeNetOutputs = namedtuple('GoogLeNetOutputs', ['logits', 'aux_logits2', 'aux_logits1'])
GoogLeNetOutputs.__annotations__ = {'logits': Tensor, 'aux_logits2': Optional[Tensor],
                                    'aux_logits1': Optional[Tensor]}

# _GoogLeNetOutputs set here for backwards compat
_GoogLeNetOutputs = GoogLeNetOutputs

class GoogLeNet(nn.Module):

    def __init__(self, num_classes=1000, aux_logits=True, blocks=None):
        super(GoogLeNet, self).__init__()
        if blocks is None:
            blocks = [BasicConv2d, Inception, InceptionAux]
        assert len(blocks) == 3
        conv_block = blocks[0]
        inception_block = blocks[1]
        inception_aux_block = blocks[2]

        self.aux_logits = aux_logits

        self.conv1 = conv_block(3, 64, kernel_size=7, stride=2, padding=3)
        self.maxpool1 = nn.MaxPool2d(3, stride=2, ceil_mode=True)
        self.conv2 = conv_block(64, 64, kernel_size=1)
        self.conv3 = conv_block(64, 192, kernel_size=3, padding=1)
        self.maxpool2 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception3a = inception_block(192, 64, 96, 128, 16, 32, 32)
        self.inception3b = inception_block(256, 128, 128, 192, 32, 96, 64)
        self.maxpool3 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception4a = inception_block(480, 192, 96, 208, 16, 48, 64)
        self.inception4b = inception_block(512, 160, 112, 224, 24, 64, 64)
        self.inception4c = inception_block(512, 128, 128, 256, 24, 64, 64)
        self.inception4d = inception_block(512, 112, 144, 288, 32, 64, 64)
        self.inception4e = inception_block(528, 256, 160, 320, 32, 128, 128)
        self.maxpool4 = nn.MaxPool2d(2, stride=2, ceil_mode=True)

        self.inception5a = inception_block(832, 256, 160, 320, 32, 128, 128)
        self.inception5b = inception_block(832, 384, 192, 384, 48, 128, 128)

        if aux_logits:
            self.aux1 = inception_aux_block(512, num_classes)
            self.aux2 = inception_aux_block(528, num_classes)

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.2)
        self.fc = nn.Linear(1024, num_classes)


    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
                import scipy.stats as stats
                X = stats.truncnorm(-2, 2, scale=0.01)
                values = torch.as_tensor(X.rvs(m.weight.numel()), dtype=m.weight.dtype)
                values = values.view(m.weight.size())
                with torch.no_grad():
                    m.weight.copy_(values)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)


    def _forward(self, x):
        # type: (Tensor) -> Tuple[Tensor, Optional[Tensor], Optional[Tensor]]
        # N x 3 x 224 x 224
        x = self.conv1(x)
        # N x 64 x 112 x 112
        x = self.maxpool1(x)
        # N x 64 x 56 x 56
        x = self.conv2(x)
        # N x 64 x 56 x 56
        x = self.conv3(x)
        # N x 192 x 56 x 56
        x = self.maxpool2(x)

        # N x 192 x 28 x 28
        x = self.inception3a(x)
        # N x 256 x 28 x 28
        x = self.inception3b(x)
        # N x 480 x 28 x 28
        x = self.maxpool3(x)
        # N x 480 x 14 x 14
        x = self.inception4a(x)
        # N x 512 x 14 x 14
        aux_defined = self.training and self.aux_logits
        if aux_defined:
            aux1 = self.aux1(x)
        else:
            aux1 = None

        x = self.inception4b(x)
        # N x 512 x 14 x 14
        x = self.inception4c(x)
        # N x 512 x 14 x 14
        x = self.inception4d(x)
        # N x 528 x 14 x 14
        if aux_defined:
            aux2 = self.aux2(x)
        else:
            aux2 = None

        x = self.inception4e(x)
        # N x 832 x 14 x 14
        x = self.maxpool4(x)
        # N x 832 x 7 x 7
        x = self.inception5a(x)
        # N x 832 x 7 x 7
        x = self.inception5b(x)
        # N x 1024 x 7 x 7

        x = self.avgpool(x)
        # N x 1024 x 1 x 1
        x = torch.flatten(x, 1)
        # N x 1024
        x = self.dropout(x)
        x = self.fc(x)
        # N x 1000 (num_classes)
        return x, aux2, aux1

#     @torch.jit.unused
    def eager_outputs(self, x, aux2, aux1):
        # type: (Tensor, Optional[Tensor], Optional[Tensor]) -> GoogLeNetOutputs
        if self.training and self.aux_logits:
            return _GoogLeNetOutputs(x, aux2, aux1)
        else:
            return x

    def forward(self, x):
        # type: (Tensor) -> GoogLeNetOutputs
        x, aux2, aux1 = self._forward(x)  # same order as the return value of _forward
        aux_defined = self.training and self.aux_logits
        if torch.jit.is_scripting():
            if not aux_defined:
                warnings.warn("Scripted GoogleNet always returns GoogleNetOutputs Tuple")
            return GoogLeNetOutputs(x, aux2, aux1)
        else:
            return self.eager_outputs(x, aux2, aux1)


class Inception(nn.Module):
    __constants__ = ['branch2', 'branch3', 'branch4']

    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj,
                 conv_block=None):
        super(Inception, self).__init__()
        if conv_block is None:
            conv_block = BasicConv2d
        self.branch1 = conv_block(in_channels, ch1x1, kernel_size=1)

        self.branch2 = nn.Sequential(
            conv_block(in_channels, ch3x3red, kernel_size=1),
            conv_block(ch3x3red, ch3x3, kernel_size=3, padding=1)
        )

        self.branch3 = nn.Sequential(
            conv_block(in_channels, ch5x5red, kernel_size=1),
            # 5x5 convolution as in the paper (the torchvision port uses 3x3 here)
            conv_block(ch5x5red, ch5x5, kernel_size=5, padding=2)
        )

        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1, ceil_mode=True),
            conv_block(in_channels, pool_proj, kernel_size=1)
        )

    def _forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)

        outputs = [branch1, branch2, branch3, branch4]
        return outputs

    def forward(self, x):
        outputs = self._forward(x)
        return torch.cat(outputs, 1)


class InceptionAux(nn.Module):

    def __init__(self, in_channels, num_classes, conv_block=None):
        super(InceptionAux, self).__init__()
        if conv_block is None:
            conv_block = BasicConv2d
        self.conv = conv_block(in_channels, 128, kernel_size=1)

        self.fc1 = nn.Linear(2048, 1024)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        # aux1: N x 512 x 14 x 14, aux2: N x 528 x 14 x 14
        x = F.adaptive_avg_pool2d(x, (4, 4))
        # aux1: N x 512 x 4 x 4, aux2: N x 528 x 4 x 4
        x = self.conv(x)
        # N x 128 x 4 x 4
        x = torch.flatten(x, 1)
        # N x 2048
        x = F.relu(self.fc1(x), inplace=True)
        # N x 1024
        x = F.dropout(x, 0.7, training=self.training)
        # N x 1024
        x = self.fc2(x)
        # N x 1000 (num_classes)

        return x


class BasicConv2d(nn.Module):

    def __init__(self, in_channels, out_channels, **kwargs):
        super(BasicConv2d, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels, eps=0.001)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        return F.relu(x, inplace=True)
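

Because each Inception module concatenates its four branches along the channel dimension, its output width is `ch1x1 + ch3x3 + ch5x5 + pool_proj`, and that number must equal the `in_channels` of the next module. The following sketch replays this bookkeeping with the values from the `GoogLeNet` constructor; the table layout and the helper name are ours.

```python
# (name, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj)
INCEPTION_CONFIG = [
    ("3a", 192, 64, 96, 128, 16, 32, 32),
    ("3b", 256, 128, 128, 192, 32, 96, 64),
    ("4a", 480, 192, 96, 208, 16, 48, 64),
    ("4b", 512, 160, 112, 224, 24, 64, 64),
    ("4c", 512, 128, 128, 256, 24, 64, 64),
    ("4d", 512, 112, 144, 288, 32, 64, 64),
    ("4e", 528, 256, 160, 320, 32, 128, 128),
    ("5a", 832, 256, 160, 320, 32, 128, 128),
    ("5b", 832, 384, 192, 384, 48, 128, 128),
]

def inception_out_channels(ch1x1, ch3x3, ch5x5, pool_proj):
    """Channels after concatenating the four branches along dim 1."""
    return ch1x1 + ch3x3 + ch5x5 + pool_proj

outs = {}
expected_in = 192  # channels produced by conv3; max pooling keeps the channel count
for name, in_ch, c1, _c3red, c3, _c5red, c5, pool in INCEPTION_CONFIG:
    assert in_ch == expected_in, f"inception{name}: expected {expected_in}, got {in_ch}"
    outs[name] = inception_out_channels(c1, c3, c5, pool)
    expected_in = outs[name]
    print(f"inception{name}: {in_ch:4d} -> {outs[name]}")
```

The chain ends at 1024 channels, which matches `nn.Linear(1024, num_classes)`, and the outputs of `4a` (512) and `4d` (528) match the inputs of the two auxiliary classifiers.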

In [5]:
model = GoogLeNet()

In [6]:
summary(model, (3, 224, 224))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 112, 112]           9,408
       BatchNorm2d-2         [-1, 64, 112, 112]             128
       BasicConv2d-3         [-1, 64, 112, 112]               0
         MaxPool2d-4           [-1, 64, 56, 56]               0
            Conv2d-5           [-1, 64, 56, 56]           4,096
       BatchNorm2d-6           [-1, 64, 56, 56]             128
       BasicConv2d-7           [-1, 64, 56, 56]               0
            Conv2d-8          [-1, 192, 56, 56]         110,592
       BatchNorm2d-9          [-1, 192, 56, 56]             384
      BasicConv2d-10          [-1, 192, 56, 56]               0
        MaxPool2d-11          [-1, 192, 28, 28]               0
           Conv2d-12           [-1, 64, 28, 28]          12,288
      BatchNorm2d-13           [-1, 64, 28, 28]             128
      BasicConv2d-14           [-1, 64,
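

The parameter counts in the summary can also be checked by hand. A minimal sketch, assuming square kernels and the `bias=False` convolutions of `BasicConv2d`; the helper names are ours:

```python
def conv_params(in_channels, out_channels, kernel, bias=False):
    """Learnable weights of a square 2-D convolution."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)

def batchnorm_params(channels):
    """One learnable scale and one shift per channel."""
    return 2 * channels

counts = {
    "Conv2d-1": conv_params(3, 64, 7),
    "BatchNorm2d-2": batchnorm_params(64),
    "Conv2d-5": conv_params(64, 64, 1),
    "Conv2d-8": conv_params(64, 192, 3),
    "BatchNorm2d-9": batchnorm_params(192),
    "Conv2d-12": conv_params(192, 64, 1),
}
for layer, n in counts.items():
    print(f"{layer:14s} {n:>9,}")
```

These values agree with the `Param #` column for the corresponding rows of the summary.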

## Task 2: Show and Tell: A Neural Image Caption Generator

Automatically describing the content of an image is a fundamental problem in AI that connects *computer vision* and *natural language processing*.
In this task, we will be looking into how we can use CNNs and RNNs to build an Image Caption Generator.

Specifically, you will be implementing and training the model [in this paper](https://arxiv.org/abs/1411.4555) with TensorFlow/Keras on one of the datasets mentioned in the paper.

To lighten the burden on training the network, you can use any pretrained network in [tf.keras.applications](https://www.tensorflow.org/api_docs/python/tf/keras/applications).

In [51]:
import torch
from torch import nn
from torchvision import models
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

### Image Encoder
First we fetch a pre-trained convolutional neural network and tweak it slightly so it's not focused on classification anymore. We'll use this to encode our image and then feed that to the caption generator.

In [8]:
class ImageEncoder(nn.Module):
    """Network to encode an image"""
    def __init__(self, out_dim=1000):
        super(ImageEncoder, self).__init__()
        self.resnet = models.resnet152(pretrained=True)
        # it's pretrained, so let's make it not change
        for p in self.resnet.parameters():
            p.requires_grad = False
            
        # the last layer of resnet is Linear that is used for classification
        # we'll change its size, mark it as learnable, and initialize it randomly, since it's not going to be classification anymore
        self.resnet.fc = nn.Linear(self.resnet.fc.in_features, out_dim) # constructing it marks it as learnable as well
        assert(all(p.requires_grad for p in self.resnet.fc.parameters()))
        self.resnet.fc.weight.data.normal_(0, 0.02) # weights to a small number
        self.resnet.fc.bias.data.fill_(0) # bias to zero
        
        # TODO: maybe it also makes sense to tweak the one layer before that, that is, the pooling layer
    
    def forward(self, x):
        return self.resnet(x)        

### Caption Generator
This is a recurrent neural network whose initial state is computed from the encoded image; it then generates the words of the caption one at a time. The word embeddings are pretrained.

In [81]:
class PartiallyFixedEmbedding(nn.Module): # from https://discuss.pytorch.org/t/updating-part-of-an-embedding-matrix-only-for-out-of-vocab-words/33297/4
    """
    This embedding has an embedding matrix that is split into two parts: fixed and variable.
    The fixed part is not changed in the learning process, making it ideal for pretrained vocabularies with a few extra words that we want to learn.
    """
    def __init__(self, fixed_weights, num_to_learn):
        super().__init__()
        self.num_fixed = fixed_weights.size(0)
        self.num_to_learn = num_to_learn
        weight = torch.empty(self.num_fixed + num_to_learn, fixed_weights.size(1))
        weight[:self.num_fixed] = fixed_weights
        self.trainable_weight = nn.Parameter(torch.empty(num_to_learn, fixed_weights.size(1)))
        nn.init.kaiming_uniform_(self.trainable_weight)
        weight[self.num_fixed:] = self.trainable_weight
        self.register_buffer('weight', weight)
        
    def forward(self, inp):
        self.weight.detach_()
        self.weight[self.num_fixed:] = self.trainable_weight
        return nn.functional.embedding(inp, self.weight, None, None, 2.0, False, False)

    
class Vocabulary(object):
    START_TOKEN = '<start>'
    END_TOKEN = '<end>'
    PAD_TOKEN = '<pad>'
    CONTROL_WORDS = [START_TOKEN, END_TOKEN, PAD_TOKEN]
    
    def __init__(self, non_control_words):
        self.control_words_count = len(Vocabulary.CONTROL_WORDS)
        
        self.vocab = [w for w in non_control_words] + Vocabulary.CONTROL_WORDS
        self.non_control_words = self.vocab[:-self.control_words_count]
        self.word2idx = {w: i for (i, w) in enumerate(self.vocab)}
        
    def get_idx(self, word):
        return self.word2idx[word]
    
    def get_start_token_idx(self):
        return self.word2idx[Vocabulary.START_TOKEN]
    
    def get_end_token_idx(self):
        return self.word2idx[Vocabulary.END_TOKEN]
    
    def get_pad_token_idx(self):
        return self.word2idx[Vocabulary.PAD_TOKEN]
    
    def size(self):
        return len(self.vocab)
    
    
class CaptionGenerator(nn.Module):
    CONTROL_WORDS = ['<start>', '<end>', '<pad>']
    
    def __init__(self, vocab, word2vec, encoded_image_size, hidden_size=512, rnn_layers_num=1):
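        """
        :param vocab: Vocabulary instance holding the caption words plus the control words
        :param word2vec: pre-trained word vectors from the torchnlp library (e.g. GloVe)
        :param encoded_image_size: length of the feature vector produced by the image encoder
        :param hidden_size: size of the LSTM hidden and cell states
        :param rnn_layers_num: number of stacked LSTM layers
        """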
        super(CaptionGenerator, self).__init__()
        self.vocab = vocab
        
        self.embed = PartiallyFixedEmbedding(word2vec[vocab.non_control_words], vocab.control_words_count)
        
        self.initial_hidden_state = nn.Linear(encoded_image_size, hidden_size)
        self.initial_cell_state = nn.Linear(encoded_image_size, hidden_size)
        
        self.recurrent_unit = nn.LSTM(word2vec.dim, hidden_size, rnn_layers_num, batch_first=True)
        
        self.linear = nn.Linear(hidden_size, vocab.size())
        
    def forward(self, features, captions, lengths):
        embeddings = self.embed(captions)
        inputs_packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
        
        # nn.LSTM expects initial states shaped (num_layers, batch, hidden_size);
        # unsqueeze(0) adds the layer axis, which covers rnn_layers_num=1 only
        initial_hidden = self.initial_hidden_state(features).unsqueeze(0)
        initial_cell = self.initial_cell_state(features).unsqueeze(0)
        
        hiddens, _ = self.recurrent_unit(inputs_packed, (initial_hidden, initial_cell))
        hiddens, lengths = pad_packed_sequence(hiddens, batch_first=True)
        # raw logits; nn.CrossEntropyLoss applies log-softmax itself, so no softmax is needed for training
        outputs = self.linear(hiddens)
        return outputs, lengths
    
    def to_sentence(self, forward_out, lengths):
        # NOTE: `lengths` is not used yet, so padded positions are decoded as well
        tops = forward_out.argmax(dim=-1)  # most likely word index at each time step
        sentences = []
        for sentence in tops:
            words = []
            for idx in sentence:
                words.append(self.vocab.vocab[idx])
            sentences.append(' '.join(words))
        return sentences
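

`to_sentence` receives `lengths` but decodes every time step, including the padded ones. A possible refinement is sketched below on plain Python lists so that it runs without PyTorch: it takes the highest-scoring word at each step, skips positions beyond the true caption length, and stops at the end marker. The function name `greedy_decode` and the toy scores are made up for illustration.

```python
def greedy_decode(scores, lengths, vocab, end_token="<end>"):
    """Pick the top-scoring word per step, honouring each caption's true length.

    scores:  nested lists shaped [batch][time][vocab_size]
    lengths: number of valid time steps for each batch item
    """
    sentences = []
    for steps, length in zip(scores, lengths):
        words = []
        for step in steps[:length]:  # ignore padded positions
            best = max(range(len(step)), key=step.__getitem__)  # argmax
            if vocab[best] == end_token:  # stop at the end marker
                break
            words.append(vocab[best])
        sentences.append(" ".join(words))
    return sentences

# Toy data: vocabulary of 4 entries, batch of 2, padded to 3 time steps.
toy_vocab = ["dog", "runs", "<end>", "<pad>"]
toy_scores = [
    [[0.9, 0.0, 0.1, 0.0], [0.1, 0.8, 0.1, 0.0], [0.0, 0.1, 0.9, 0.0]],
    [[0.2, 0.7, 0.1, 0.0], [0.6, 0.3, 0.1, 0.0], [0.5, 0.4, 0.1, 0.0]],
]
result = greedy_decode(toy_scores, lengths=[3, 1], vocab=toy_vocab)
print(result)
```

The same idea carries over to tensors by slicing each row of the argmax result to its length before looking up the words.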

Now let's get some data. We downloaded and unpacked the Flickr8K dataset from http://academictorrents.com/details/9dea07ba660a722ae1008c4c8afdd303b6f6e53b

In [31]:
from torch.utils import data
from os import path
from PIL import Image
from torchvision import transforms

class Flickr8KDataset(data.Dataset):
    def __init__(self, tokens_path, images_path):
        self.img_names_and_captions = []
        self.images_path = images_path
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)), # TODO: maybe add some cropping or sth? this will squash most of the images
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)), # normalization required by resnet
        ])
        with open(tokens_path, 'r') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                parts = line.split()
                img_name = parts[0].split('#')[0] # each image has a number of captions - but we don't care how many
                tokens = parts[1:(-1 if parts[-1] == '.' else len(parts))] # tokens are everything apart from the image name and the final period
                
                self.img_names_and_captions.append((img_name, list(map(lambda t: t.lower(), tokens))))
                
        self.all_tokens = set()
        for _, tokens in self.img_names_and_captions:
            for token in tokens:
                self.all_tokens.add(token)
                
        self.vocab = Vocabulary(sorted(self.all_tokens))  # sorted so that word indices are stable across runs
                
    def __getitem__(self, idx):
        img_name, caption = self.img_names_and_captions[idx]
        img_path = path.join(self.images_path, img_name)
        image = Image.open(img_path).convert('RGB')
        image = self.transform(image)
        caption = torch.tensor([self.vocab.get_start_token_idx()] + 
                               [self.vocab.get_idx(token) for token in caption] + 
                               [self.vocab.get_end_token_idx()])
        return image, caption
    
    def __len__(self):
        return len(self.img_names_and_captions)
    
    
def get_flickr8k_dataloader(tokens_path, images_path, batch_size=32, shuffle=True, num_workers=2):
    flickr = Flickr8KDataset(tokens_path, images_path)
    
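    def make_batch(batch):
        # sort samples by caption length, longest first, as pack_padded_sequence expects
        # (the argument is called `batch` so it does not shadow the imported `data` module)
        batch.sort(key=lambda x: len(x[1]), reverse=True)
        images, captions = zip(*batch)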

        # Merge image tensors (stack)
        images = torch.stack(images, 0)

        # Merge captions
        caption_lengths = [len(caption) for caption in captions]

        # matrix of shape num_captions x caption_max_length, pre-filled with the <pad> index
        padded_captions = torch.full((len(captions), max(caption_lengths)),
                                     flickr.vocab.get_pad_token_idx(), dtype=torch.long)

        # copy each caption into its row; the positions after it keep the <pad> index
        for i, caption in enumerate(captions):
            end = caption_lengths[i]
            padded_captions[i, :end] = caption[:end]
        return images, padded_captions, caption_lengths
    
    return data.DataLoader(dataset=flickr,
                           batch_size=batch_size,
                           shuffle=shuffle,
                           num_workers=num_workers,
                           collate_fn=make_batch)
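

The collate step can be illustrated without tensors. The sketch below sorts index sequences by decreasing length, which is the order `pack_padded_sequence` expects by default, and right-pads them to a common width. The token indices and the helper name `make_padded_batch` are hypothetical.

```python
def make_padded_batch(captions, pad_idx):
    """Sort index sequences longest first and right-pad them to equal length."""
    ordered = sorted(captions, key=len, reverse=True)
    lengths = [len(c) for c in ordered]
    width = lengths[0]  # the longest caption sets the row width
    padded = [c + [pad_idx] * (width - len(c)) for c in ordered]
    return padded, lengths

# Hypothetical indices: 0 = <start>, 1 = <end>, 2 = <pad>, 3 and above = ordinary words.
captions = [[0, 5, 1], [0, 3, 4, 6, 1], [0, 7, 8, 1]]
padded, lengths = make_padded_batch(captions, pad_idx=2)
for row in padded:
    print(row)
print(lengths)
```

Keeping the true lengths next to the padded matrix is what later allows the recurrent network to skip the padding.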

Let's look at the shapes of the data in one batch:

In [61]:
dl = get_flickr8k_dataloader('../../Flickr8k/Flickr8k_text/Flickr8k.token.txt', '../../Flickr8k/Flickr8k_Dataset/Flicker8k_Dataset')
images, captions, lengths = next(iter(dl))
print(f"images:  \t{images.shape}")
print(f"captions:\t{captions.shape}")
print(f"lengths: \t{lengths}")

images:  	torch.Size([32, 3, 224, 224])
captions:	torch.Size([32, 17])
lengths: 	[17, 15, 15, 15, 15, 15, 15, 14, 14, 14, 14, 13, 13, 13, 13, 12, 12, 11, 11, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9, 8, 8, 8]


In [64]:
image_encoding_length = 1000

ie = ImageEncoder(image_encoding_length)
features = ie(images)
print(f"encoded images: {features.shape}")
features

encoded images: torch.Size([32, 1000])


tensor([[-0.6978,  0.6970,  1.1099,  ..., -0.2639, -0.2276,  0.4812],
        [-0.2985,  0.1557,  0.0748,  ..., -0.5385, -0.6335,  1.0926],
        [-0.1242,  0.5914,  0.8516,  ..., -1.0254, -0.4933,  0.7090],
        ...,
        [ 0.4618,  0.2589,  0.9273,  ..., -0.8678, -0.3088,  0.6960],
        [-1.0010,  0.3153, -0.0195,  ..., -0.8871, -1.1349,  1.2005],
        [-0.4832,  0.8630,  0.8718,  ..., -0.1795,  0.0482, -0.3214]],
       grad_fn=<AddmmBackward>)

In [82]:
from torchnlp.word_to_vector import GloVe

vocab = dl.dataset.vocab
cg = CaptionGenerator(vocab, GloVe(), image_encoding_length)

generated, lengths = cg(features, captions, lengths)
print(f"generated: {generated.shape}")
generated

generated: torch.Size([32, 17, 8921])


tensor([[[ 0.0106,  0.0343, -0.1044,  ..., -0.1066,  0.0839,  0.0504],
         [ 0.0219,  0.0362, -0.1248,  ..., -0.1207,  0.0773, -0.0157],
         [ 0.0103,  0.0444, -0.0900,  ..., -0.0469,  0.0734, -0.0253],
         ...,
         [ 0.0033,  0.0377,  0.0108,  ..., -0.0174,  0.0425, -0.0759],
         [-0.0052,  0.0188, -0.0224,  ..., -0.0298,  0.0365, -0.0826],
         [-0.0112,  0.0248, -0.0102,  ..., -0.0457,  0.0313, -0.0658]],

        [[-0.0206,  0.1056, -0.1236,  ..., -0.0462,  0.0688,  0.0421],
         [-0.0129,  0.0684, -0.0445,  ..., -0.0419,  0.0564,  0.0019],
         [ 0.0217,  0.0606, -0.0295,  ..., -0.0258,  0.0775, -0.0440],
         ...,
         [-0.0075,  0.0473,  0.0005,  ..., -0.0598,  0.0523, -0.0395],
         [-0.0097,  0.0249,  0.0298,  ..., -0.0380,  0.0341, -0.0291],
         [-0.0097,  0.0249,  0.0298,  ..., -0.0380,  0.0341, -0.0291]],

        [[ 0.0201,  0.0334, -0.0108,  ..., -0.0726,  0.0386,  0.0229],
         [ 0.0093,  0.0498, -0.0047,  ..., -0

In [83]:
cg.to_sentence(generated, lengths)

['bath sequined appears wrestle scubba scubba giong steps fanning course giong scubba driver-side driver-side reaching mule garter',
 'bath bath tophats social shoeless retreiver clown ejected limo limo limo stride arizona ticket garter checkered checkered',
 'challenging romp gettnig reaching canned mule sewn glancing driver-side billiards cavort appears ticket slices slices checkered checkered',
 'scoring watched jacked driver-side move scubba bow rocky rocky scubba scubba scubba scubba prances prances checkered checkered',
 'aig driver-side driver-side slices mess car driver-side driver-side car withdrawing woods giong social kid-sized prances checkered checkered',
 'aig mosaic car rocky responders earpiece stride rocky slices driver-side innertubes driver-side rocky rocky rocky checkered checkered',
 'pit mosaic dave watched giong driver-side driver-side driver-side mule giong giong scubba ticket withdrawing withdrawing checkered checkered',
 'challenging paraglide gettnig canned d