# Building AlexNet on TensorFlow
This time, I try to build AlexNet in TensorFlow. I have never worked on neural networks for Computer Vision before, so I am really excited to learn how to build one.

## What Is AlexNet?
[AlexNet](https://en.wikipedia.org/wiki/AlexNet) is one of the first deep learning architectures for Computer Vision. It was developed by *Alex* Krizhevsky (hence the name *AlexNet*), Ilya Sutskever, and Geoffrey Hinton in 2012 to tackle the [**ImageNet**](http://www.image-net.org) image classification problem.

I read about AlexNet in [this post](https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html). It states that the AlexNet paper, [*ImageNet Classification with Deep Convolutional Neural Networks*](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf), was one of the pioneers (after [LeNet](http://yann.lecun.com/exdb/lenet/) in 1998) in Computer Vision.

## AlexNet from Scratch
For me, the best way to learn a model is to implement it directly. Therefore, I try to build the architecture from scratch (well, not entirely from scratch). I try to replicate the architecture in this [paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf). The paper doesn't explain every detail of the architecture, so I infer some of the missing ones. By the way, I use TensorFlow 1.4.

In [1]:
import tensorflow as tf
tf.reset_default_graph()
tf.set_random_seed(1)

The architecture of AlexNet is described in the image below (taken from this [paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)):
<img src='alexnet.png'>

In the image above, the authors split the network across 2 GPUs, since it was too large to fit in the memory of a single GPU.

Some points to be noted in developing the network:
- All convolutional and fully connected hidden layers in AlexNet use the ReLU activation function; the output layer uses softmax.
- All kernel weights are initialized from a **Gaussian distribution** with mean **0** and standard deviation **0.01**.
- All biases are initialized with the constant **0** or **1**. According to the paper, initializing with 1 accelerates the early stages of learning by giving the ReLUs positive inputs. However, I'm not sure why the authors didn't use the constant 1 in every layer. My best guess is that they found learning to be too slow only in those particular layers.

### 0. Input Layer
I create a placeholder for a batch of images of size 224x224 with 3 channels (RGB). I also create a placeholder for the labels.

In [2]:
with tf.variable_scope('input_layer'):
    image_batch = tf.placeholder(tf.float32, [None,224,224,3],
                                 'image_batch')
    print 'image_batch:', image_batch
    label_batch = tf.placeholder(tf.int32, [None], 'label_batch')
    print 'label_batch:', label_batch

image_batch: Tensor("input_layer/image_batch:0", shape=(?, 224, 224, 3), dtype=float32)
label_batch: Tensor("input_layer/label_batch:0", shape=(?,), dtype=int32)


### 1. Convolution Layer 1
The output from the previous layer is filtered with 96 convolutional kernels of size 11x11x3, with strides 4x4 and 'SAME' padding. The result passes through [*Local Response Normalization*](https://prateekvjoshi.com/2016/04/05/what-is-local-response-normalization-in-convolutional-neural-networks/) (LRN), with a setup similar to the paper's. The idea of LRN is to normalize each activation by the sum of the squared activations at the same spatial position in adjacent kernel maps, since the output of ReLU is unbounded and can become very large. Finally, the result passes through *Maxpool* with size 2x2 and stride 1x1.
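
To make the normalization concrete, here is a minimal pure-Python sketch of what `tf.nn.lrn` computes for the channel vector at one spatial position, following the formula in the TensorFlow documentation. The helper name `lrn_single_position` and the sample activations are only illustrative; they are not part of the network.

```python
def lrn_single_position(channels, depth_radius=2, bias=2.0,
                        alpha=1e-4, beta=0.75):
    """Local Response Normalization for the channels of one pixel.

    Each activation is divided by
    (bias + alpha * sum of squares in its channel window) ** beta,
    where the window spans 2 * depth_radius + 1 neighbouring channels.
    """
    normalized = []
    for d, value in enumerate(channels):
        lo = max(0, d - depth_radius)
        hi = min(len(channels), d + depth_radius + 1)
        sqr_sum = sum(c * c for c in channels[lo:hi])
        normalized.append(value / (bias + alpha * sqr_sum) ** beta)
    return normalized


# hypothetical ReLU outputs for one pixel across six kernel maps
activations = [0.0, 3.0, 50.0, 120.0, 4.0, 0.5]
normalized = lrn_single_position(activations)
```

Note that the window covers `2 * depth_radius + 1` channels, so a `depth_radius` of 2 corresponds to the paper's n = 5 adjacent kernel maps.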

In [3]:
with tf.variable_scope('conv_layer_1'):
    conv1 = tf.layers.conv2d(inputs=image_batch, filters=96,
            kernel_size=[11,11], strides=(4,4), padding='SAME',
            activation=tf.nn.relu,
            bias_initializer=tf.zeros_initializer(),
            kernel_initializer=tf.random_normal_initializer(0, 0.01))
    conv1 = tf.identity(conv1, name='conv1') # to rename the tensor
    print 'conv1:', conv1
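    # The window spans 2*depth_radius+1 channels, so depth_radius=2
    # matches the paper's n=5 adjacent kernel maps.
    lrn1 = tf.nn.lrn(conv1, depth_radius=2, bias=2, alpha=1e-4,
                     beta=0.75, name='lrn1')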
    print 'lrn1:', lrn1
    maxpool1 = tf.layers.max_pooling2d(inputs=lrn1, pool_size=[2,2],
                                       strides=(1,1))
    maxpool1 = tf.identity(maxpool1, name='maxpool1')
    print 'maxpool1:', maxpool1

conv1: Tensor("conv_layer_1/conv1:0", shape=(?, 56, 56, 96), dtype=float32)
lrn1: Tensor("conv_layer_1/lrn1:0", shape=(?, 56, 56, 96), dtype=float32)
maxpool1: Tensor("conv_layer_1/maxpool1:0", shape=(?, 55, 55, 96), dtype=float32)
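
The shapes printed above follow from TensorFlow's padding rules: with 'SAME' padding the output size is ceil(input / stride), and with 'VALID' padding it is ceil((input - kernel + 1) / stride). Below is a small sketch of that arithmetic for this layer; the helper `output_size` is a name I made up for illustration.

```python
import math


def output_size(size, kernel, stride, padding):
    """Spatial output size of a TensorFlow convolution or pooling layer."""
    if padding == 'SAME':
        # input is zero-padded so that only the stride shrinks the output
        return int(math.ceil(size / float(stride)))
    if padding == 'VALID':
        # no padding: the window must fit entirely inside the input
        return int(math.ceil((size - kernel + 1) / float(stride)))
    raise ValueError('unknown padding: %s' % padding)


conv1_size = output_size(224, kernel=11, stride=4, padding='SAME')
# tf.layers.max_pooling2d defaults to 'valid' padding
maxpool1_size = output_size(conv1_size, kernel=2, stride=1, padding='VALID')
print('conv1: %d, maxpool1: %d' % (conv1_size, maxpool1_size))
```

This prints `conv1: 56, maxpool1: 55`, which agrees with the tensor shapes above. The same function can be used to check the spatial sizes of the later layers.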


### 2. Convolution Layer 2
Similar to the previous layer, the output from the previous layer is filtered with 256 convolutional kernels of size 5x5x96, with strides 2x2 and 'SAME' padding. The biases are initialized with constant 1 to accelerate the early stages of learning (according to the paper). The result passes through LRN and then Maxpool with size 2x2 and stride 1x1.

In [4]:
with tf.variable_scope('conv_layer_2'):
    conv2 = tf.layers.conv2d(inputs=maxpool1, filters=256,
            kernel_size=[5,5], strides=(2,2), padding='SAME',
            activation=tf.nn.relu,
            bias_initializer=tf.ones_initializer(),
            kernel_initializer=tf.random_normal_initializer(0, 0.01))
    conv2 = tf.identity(conv2, name='conv2')
    print 'conv2:', conv2
    # depth_radius=2 gives a window of 5 channels (the paper's n=5)
    lrn2 = tf.nn.lrn(conv2, depth_radius=2, bias=2, alpha=1e-4,
                     beta=0.75, name='lrn2')
    print 'lrn2:', lrn2
    maxpool2 = tf.layers.max_pooling2d(inputs=lrn2, pool_size=[2,2],
                                       strides=(1,1))
    maxpool2 = tf.identity(maxpool2, name='maxpool2')
    print 'maxpool2:', maxpool2

conv2: Tensor("conv_layer_2/conv2:0", shape=(?, 28, 28, 256), dtype=float32)
lrn2: Tensor("conv_layer_2/lrn2:0", shape=(?, 28, 28, 256), dtype=float32)
maxpool2: Tensor("conv_layer_2/maxpool2:0", shape=(?, 27, 27, 256), dtype=float32)


### 3. Convolution Layer 3
Only the convolution is applied in this layer (no LRN or Maxpool). The output from the previous layer is filtered with 384 convolutional kernels of size 3x3x256, with strides 2x2 and 'VALID' padding. The biases are initialized with constant 0.

In [5]:
with tf.variable_scope('conv_layer_3'):
    conv3 = tf.layers.conv2d(inputs=maxpool2, filters=384,
            kernel_size=[3,3], strides=(2,2), padding='VALID',
            activation=tf.nn.relu,
            bias_initializer=tf.zeros_initializer(),
            kernel_initializer=tf.random_normal_initializer(0, 0.01))
    conv3 = tf.identity(conv3, name='conv3')
    print 'conv3:', conv3

conv3: Tensor("conv_layer_3/conv3:0", shape=(?, 13, 13, 384), dtype=float32)


### 4. Convolution Layer 4
Similar to the previous layer, the output from the previous layer is filtered with 384 convolutional kernels with size 3x3x384, strides 1x1, and 'SAME' padding. The biases are initialized with constant 1.

In [6]:
with tf.variable_scope('conv_layer_4'):
    conv4 = tf.layers.conv2d(inputs=conv3, filters=384,
            kernel_size=[3,3], strides=(1,1), padding='SAME',
            activation=tf.nn.relu,
            bias_initializer=tf.ones_initializer(),
            kernel_initializer=tf.random_normal_initializer(0, 0.01))
    conv4 = tf.identity(conv4, name='conv4')
    print 'conv4:', conv4

conv4: Tensor("conv_layer_4/conv4:0", shape=(?, 13, 13, 384), dtype=float32)


### 5. Convolution Layer 5
Similar to the previous layer, the output from the previous layer is filtered with 256 convolutional kernels with size 3x3x384, strides 1x1, and 'SAME' padding. The biases are initialized with constant 1.

In [7]:
with tf.variable_scope('conv_layer_5'):
    conv5 = tf.layers.conv2d(inputs=conv4, filters=256,
            kernel_size=[3,3], strides=(1,1), padding='SAME',
            activation=tf.nn.relu,
            bias_initializer=tf.ones_initializer(),
            kernel_initializer=tf.random_normal_initializer(0, 0.01))
    conv5 = tf.identity(conv5, name='conv5')
    print 'conv5:', conv5

conv5: Tensor("conv_layer_5/conv5:0", shape=(?, 13, 13, 256), dtype=float32)


### 6. Fully Connected Layer 1
Now, it's time for the fully connected layers. The output from the previous convolutional layer has to be flattened beforehand. This layer has 4096 neurons. Moreover, dropout is applied to the output of this layer.

In [8]:
with tf.variable_scope('fc_layer_1'):
    conv5 = tf.reshape(conv5, [-1, 13*13*256])
    dense1 = tf.layers.dense(inputs=conv5, units=4096,
            activation=tf.nn.relu,
            kernel_initializer=tf.random_normal_initializer(0, 0.01),
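            # the paper initializes fully connected hidden-layer biases to 1
            bias_initializer=tf.ones_initializer())
    # training=True keeps dropout active at all times; pass a boolean
    # placeholder instead to switch it off at inference
    dense1 = tf.layers.dropout(inputs=dense1, rate=0.5, training=True)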
    dense1 = tf.identity(dense1, name='dense1')
    print 'dense1:', dense1

dense1: Tensor("fc_layer_1/dense1:0", shape=(?, 4096), dtype=float32)
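
To show what the dropout step does to the activations, here is a minimal pure-Python sketch of inverted dropout, the scheme `tf.layers.dropout` follows: during training each unit is zeroed with probability `rate` and the survivors are scaled by 1 / (1 - rate); at inference the input passes through unchanged. The function and the sample values are illustrative assumptions, not code from the network.

```python
import random


def dropout(values, rate=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations."""
    if not training:
        # inference: the layer is the identity
        return list(values)
    keep_prob = 1.0 - rate
    # drop each unit with probability `rate`; scale survivors by 1/keep_prob
    # so the expected value of every unit is unchanged
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]


random.seed(1)
units = [0.2, 1.5, 0.7, 3.0]  # hypothetical outputs of four neurons
train_out = dropout(units, rate=0.5, training=True)
test_out = dropout(units, rate=0.5, training=False)
```

The paper describes the equivalent non-inverted form: all neurons are used at test time and their outputs are multiplied by 0.5.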


### 7. Fully Connected Layer 2
This layer is similar to the previous layer.

In [9]:
with tf.variable_scope('fc_layer_2'):
    dense2 = tf.layers.dense(inputs=dense1, units=4096,
            activation=tf.nn.relu,
            kernel_initializer=tf.random_normal_initializer(0, 0.01),
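            # bias of 1, as the paper uses for fully connected hidden layers
            bias_initializer=tf.ones_initializer())
    # dropout stays on while training=True is hard-coded
    dense2 = tf.layers.dropout(inputs=dense2, rate=0.5, training=True)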
    dense2 = tf.identity(dense2, name='dense2')
    print 'dense2:', dense2

dense2: Tensor("fc_layer_2/dense2:0", shape=(?, 4096), dtype=float32)


### 8. Softmax Layer
This is the final layer. It is similar to the previous layer, except that it has 1000 units (one per class), no dropout, and no ReLU. The layer outputs raw logits; the softmax itself is applied inside the loss function, which expects unnormalized logits.

In [10]:
with tf.variable_scope('softmax_layer'):
    # no activation here: sparse_softmax_cross_entropy applies the softmax
    softmax = tf.layers.dense(inputs=dense2, units=1000,
            activation=None,
            kernel_initializer=tf.random_normal_initializer(0, 0.01),
            bias_initializer=tf.zeros_initializer())
    softmax = tf.identity(softmax, name='softmax')
    print 'softmax:', softmax

softmax: Tensor("softmax_layer/softmax:0", shape=(?, 1000), dtype=float32)


In summary, all the parameters are shown below:

In [11]:
tf.trainable_variables()

[<tf.Variable 'conv_layer_1/conv2d/kernel:0' shape=(11, 11, 3, 96) dtype=float32_ref>,
 <tf.Variable 'conv_layer_1/conv2d/bias:0' shape=(96,) dtype=float32_ref>,
 <tf.Variable 'conv_layer_2/conv2d/kernel:0' shape=(5, 5, 96, 256) dtype=float32_ref>,
 <tf.Variable 'conv_layer_2/conv2d/bias:0' shape=(256,) dtype=float32_ref>,
 <tf.Variable 'conv_layer_3/conv2d/kernel:0' shape=(3, 3, 256, 384) dtype=float32_ref>,
 <tf.Variable 'conv_layer_3/conv2d/bias:0' shape=(384,) dtype=float32_ref>,
 <tf.Variable 'conv_layer_4/conv2d/kernel:0' shape=(3, 3, 384, 384) dtype=float32_ref>,
 <tf.Variable 'conv_layer_4/conv2d/bias:0' shape=(384,) dtype=float32_ref>,
 <tf.Variable 'conv_layer_5/conv2d/kernel:0' shape=(3, 3, 384, 256) dtype=float32_ref>,
 <tf.Variable 'conv_layer_5/conv2d/bias:0' shape=(256,) dtype=float32_ref>,
 <tf.Variable 'fc_layer_1/dense/kernel:0' shape=(43264, 4096) dtype=float32_ref>,
 <tf.Variable 'fc_layer_1/dense/bias:0' shape=(4096,) dtype=float32_ref>,
 <tf.Variable 'fc_layer_2/d
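
As a rough sanity check on the size of the model, the parameter count can be tallied from the layer shapes. The convolutional and first fully connected shapes below are copied from the listing above; the shapes for the last two layers are my assumption, derived from the `units` arguments in the code cells (4096 and 1000).

```python
# (kernel shape, bias shape) per layer
layer_shapes = [
    ((11, 11, 3, 96), (96,)),       # conv_layer_1
    ((5, 5, 96, 256), (256,)),      # conv_layer_2
    ((3, 3, 256, 384), (384,)),     # conv_layer_3
    ((3, 3, 384, 384), (384,)),     # conv_layer_4
    ((3, 3, 384, 256), (256,)),     # conv_layer_5
    ((43264, 4096), (4096,)),       # fc_layer_1
    ((4096, 4096), (4096,)),        # fc_layer_2 (assumed from units=4096)
    ((4096, 1000), (1000,)),        # softmax_layer (assumed from units=1000)
]


def count(shape):
    """Number of scalars in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


per_layer = [count(kernel) + count(bias) for kernel, bias in layer_shapes]
total_params = sum(per_layer)
print('total parameters: %d' % total_params)
```

The first fully connected layer dominates the count, because it is fed the flattened 13x13x256 feature map. The paper reports about 60 million parameters for the original network, so under these assumptions this variant is considerably larger.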

### 9. Loss Function
Since this is a classification problem, cross entropy is a natural choice of loss function.

In [12]:
with tf.variable_scope('loss'):
    loss = tf.losses.sparse_softmax_cross_entropy(labels=label_batch,
                                                  logits=softmax)
    print 'loss:', loss

loss: Tensor("loss/sparse_softmax_cross_entropy_loss/value:0", shape=(), dtype=float32)
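
For intuition, here is a minimal pure-Python sketch of the quantity this loss computes: the softmax is applied to the raw logits of each image, and the negative log-probability of the true class is averaged over the batch. The function is my own illustration of the math, not TensorFlow's implementation, and the logits are made up.

```python
import math


def sparse_softmax_cross_entropy(labels, logits):
    """Mean cross entropy between integer labels and rows of raw logits."""
    losses = []
    for label, row in zip(labels, logits):
        # log-sum-exp with the max subtracted for numerical stability
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        # -log(softmax(row)[label])
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)


# hypothetical logits for 2 images and 4 classes
logits = [[2.0, 0.5, -1.0, 0.0],
          [0.0, 0.0, 0.0, 0.0]]
labels = [0, 3]
loss_value = sparse_softmax_cross_entropy(labels, logits)
```

Because the softmax happens inside the loss, the tensor passed as `logits` should be unnormalized scores.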


### 10. Training
The paper uses *Gradient Descent with momentum* plus weight decay for training. I couldn't find this exact optimizer in TensorFlow. *MomentumOptimizer* is quite similar, but it has no weight decay. Hence, I modify every gradient to include the weight decay term: *gradient + weight_decay x variable*. Since the optimizer multiplies the modified gradient by the learning rate, this yields the paper's decay term *weight_decay x learning_rate x variable*. The paper uses *learning_rate = 0.01*, *momentum = 0.9*, and *weight_decay = 0.0005*.

In [13]:
with tf.variable_scope('train'):
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01,
                                           momentum=0.9)
    grads_vars = optimizer.compute_gradients(loss)
    # add weight_decay * var; the optimizer applies the learning rate itself
    grads_vars = [(grad + 0.0005*var, var)
                  for grad, var in grads_vars]
    train_op = optimizer.apply_gradients(grads_vars)
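
The paper's update rule is v := 0.9 v - 0.0005 x learning_rate x w - learning_rate x gradient, followed by w := w + v. The sketch below checks, on a single scalar weight, that a MomentumOptimizer-style step (accumulate the gradient, then subtract learning_rate x accumulator) reproduces that rule when weight_decay x w is added to the gradient first. The toy loss and all function names are hypothetical.

```python
LEARNING_RATE = 0.01
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005


def paper_step(w, v, grad):
    """Update rule as written in the paper."""
    v = MOMENTUM * v - WEIGHT_DECAY * LEARNING_RATE * w - LEARNING_RATE * grad
    return w + v, v


def momentum_optimizer_step(w, accum, grad):
    """MomentumOptimizer-style step on a gradient with the decay term added."""
    grad = grad + WEIGHT_DECAY * w
    accum = MOMENTUM * accum + grad
    return w - LEARNING_RATE * accum, accum


def toy_gradient(w):
    # gradient of the hypothetical loss (w - 3)^2
    return 2.0 * (w - 3.0)


w_paper, v = 1.0, 0.0
w_tf, accum = 1.0, 0.0
for _ in range(20):
    w_paper, v = paper_step(w_paper, v, toy_gradient(w_paper))
    w_tf, accum = momentum_optimizer_step(w_tf, accum, toy_gradient(w_tf))
```

Both weights follow the same trajectory up to floating-point rounding, since v equals -learning_rate x accumulator at every step when the learning rate is constant.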

### 11. Prediction
Predicting the output class is very simple. Just use the argmax over the class axis to find the most likely class for each image. Moreover, the accuracy can be calculated with *tf.equal* and *tf.reduce_mean*.

In [14]:
with tf.variable_scope('predict'):
    # argmax over the class axis (axis=1) gives one prediction per image
    prediction = tf.argmax(input=softmax, axis=1, output_type=tf.int32,
                           name='prediction')
    accuracy = tf.reduce_mean(tf.to_float(
                tf.equal(prediction, label_batch)), name='accuracy')
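
The same computation can be sketched in plain Python on a tiny batch: take the index of the largest score in each row, compare it with the label, and average the matches. The scores and labels here are made up for illustration.

```python
def predict(scores):
    """Index of the highest score in each row (argmax over the class axis)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def accuracy(scores, labels):
    """Fraction of rows whose predicted class equals the label."""
    predictions = predict(scores)
    matches = [float(p == y) for p, y in zip(predictions, labels)]
    return sum(matches) / len(matches)


# hypothetical scores for 3 images and 4 classes
scores = [[0.1, 2.0, 0.3, 0.0],
          [1.5, 0.2, 0.1, 0.9],
          [0.0, 0.1, 0.2, 3.0]]
labels = [1, 2, 3]
batch_accuracy = accuracy(scores, labels)
```

Here the predictions are `[1, 0, 3]`, so two of the three images are classified correctly.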

## Pretrained AlexNet
Honestly, I won't train the network myself (since I don't have a building full of servers like Google!). Moreover, I couldn't find a pretrained AlexNet model in TensorFlow or Keras, but I found an interesting GitHub [repository](https://github.com/guerzh/tf_weights).

## Conclusion
I'm really happy that I could implement AlexNet myself. I think it's not that hard (although I haven't trained it). This model has many parameters, so training it will be very costly. A pretrained model is preferred.

Any feedback is welcome! :)