### AI & Compute, Inception-v3

Inception-v3 is identical to Inception-v2 from [Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/pdf/1512.00567.pdf), except that it is the version "in which the fully connected layer of the auxiliary classifier is also batch-normalized, not just the convolutions."

#### About my estimation

- Table 1 in the paper summarizes the network architecture.
- I wrote functions for each type of layer and module and use them to estimate the compute of a forward pass.
- I then use the formula from OpenAI's "AI and Compute" [3] to estimate the total training compute.

#### Open questions
- I currently ignore batch normalization. Does it increase the compute cost meaningfully?
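
A rough way to approach this question is to compare per-activation costs. The sketch below assumes that batch normalization costs a small constant number of operations per output activation (`bn_ops_per_activation` is a hypothetical parameter, not a measured value), while the convolution producing that activation costs `k_h * k_w * c_in` multiply-adds, i.e. twice that many FLOPs. The function name and defaults are illustrative only.

```python
def batchnorm_to_conv_ratio(k_h, k_w, c_in, bn_ops_per_activation=4):
    """Assumed batch-norm cost relative to the conv cost, per output activation."""
    conv_flops_per_activation = 2 * k_h * k_w * c_in
    return bn_ops_per_activation / conv_flops_per_activation


# 3x3 convolution on the 3-channel input image
print(f"{batchnorm_to_conv_ratio(3, 3, 3):.4f}")
# 3x3 convolution on a 288-channel feature map
print(f"{batchnorm_to_conv_ratio(3, 3, 288):.4f}")
# 1x1 convolution on a 288-channel feature map
print(f"{batchnorm_to_conv_ratio(1, 1, 288):.4f}")
```

Under this assumption the ratio shrinks as the kernel size and input depth grow, so the relative cost would be largest in the very first layer and for 1x1 convolutions.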

#### Notes
- I used two Towards Data Science introductions to CNNs [1, 2] to understand the calculations that go into a convolutional layer.

[1] https://towardsdatascience.com/an-introduction-to-convolutional-neural-networks-eb0b60b58fd7

[2] https://towardsdatascience.com/convolution-neural-network-for-image-processing-using-keras-dc3429056306

[3] https://openai.com/blog/ai-and-compute/


In [1]:
def conv_compute(kernel_size, stride, input_size, total_kernels):
    """Estimate the operations (multiplications plus additions) of one conv layer.

    kernel_size is [height, width]; input_size is [height, width, depth].
    """
    # How many times is a kernel applied to the input?
    # Note: kernel_fits already includes a factor of the input depth.
    total_convolutions = kernel_fits(kernel_size, stride, input_size)

    # Operations per kernel application.
    # The third dimension of the kernel is the depth of the input.
    total_multiplications = kernel_size[0] * kernel_size[1] * input_size[2]
    total_additions = total_multiplications  # summing all the products
    comp_per_conv = total_multiplications + total_additions

    return comp_per_conv * total_convolutions * total_kernels
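
For comparison, a common way to count the cost of a convolution is per output position: each output value needs `k_h * k_w * c_in` multiply-adds, and there are `out_h * out_w * c_out` output values. The sketch below implements that standard count. It is not the calculation used in `conv_compute` above, and the function name and argument layout are illustrative assumptions.

```python
def conv_multiply_adds(out_h, out_w, k_h, k_w, c_in, c_out):
    """Multiply-adds of a convolution, given the spatial size of its output."""
    per_output_value = k_h * k_w * c_in
    output_values = out_h * out_w * c_out
    return per_output_value * output_values


# First layer of the network: 3x3 kernels, 3 input channels, 32 kernels,
# producing the 149x149x32 tensor that the next layer takes as input.
first_layer = conv_multiply_adds(149, 149, 3, 3, 3, 32)
print(first_layer)      # 19181664 multiply-adds
print(2 * first_layer)  # 38363328 FLOPs at 2 FLOPs per multiply-add
```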


In [2]:
def pool_compute(kernel_size, stride, input_size):
    # Number of pooling windows, as estimated by kernel_fits.
    total_pools = kernel_fits(kernel_size, stride, input_size)

    # Finding the maximum of n values takes O(n) comparisons.
    total_comparisons = kernel_size[0] * kernel_size[1] * input_size[2]
    return total_comparisons * total_pools
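
As a cross-check, max pooling can also be counted per output value: finding the maximum of a `k_h * k_w` window takes `k_h * k_w - 1` comparisons, and pooling acts on each channel separately. This is a sketch of that alternative count under those assumptions, with an illustrative function name; it is not the calculation used in `pool_compute` above.

```python
def max_pool_comparisons(out_h, out_w, channels, k_h, k_w):
    """Comparisons for max pooling, given the spatial size of its output."""
    comparisons_per_window = k_h * k_w - 1
    return out_h * out_w * channels * comparisons_per_window


# 3x3 pooling that turns the 147x147x64 tensor into a 73x73x64 tensor.
print(max_pool_comparisons(73, 73, 64, 3, 3))  # 2728448
```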

In [3]:
def incept_compute_1(stride=1, input_size=[35,35,288]):
    # Inception module from Figure 5 of the paper.
    # Note: `stride` is unused, and every convolution is assumed to have
    # as many kernels as the input has channels.
    depth = input_size[2]
    strand_one = (
        conv_compute(kernel_size=[1,1], stride=1,
                     input_size=input_size, total_kernels=depth)
        + conv_compute([3,3], 1, input_size, depth)
        + conv_compute([3,3], 1, input_size, depth)
    )
    strand_two = (
        conv_compute([1,1], 1, input_size, depth)
        + conv_compute([3,3], 1, input_size, depth)
    )
    strand_three = (
        pool_compute([1,1], 1, input_size)
        + conv_compute([1,1], 1, input_size, depth)
    )
    strand_four = conv_compute([1,1], 1, input_size, depth)

    return strand_one + strand_two + strand_three + strand_four

In [4]:
def incept_compute_2(stride=1, input_size=[35,35,288]):
    # Inception module from Figure 6 of the paper.
    # Note: `stride` is unused, and every convolution is assumed to have
    # as many kernels as the input has channels.
    depth = input_size[2]
    strand_one = (
        conv_compute(kernel_size=[1,1], stride=1,
                     input_size=input_size, total_kernels=depth)
        + conv_compute([1,7], 1, input_size, depth)
        + conv_compute([7,1], 1, input_size, depth)
        + conv_compute([1,7], 1, input_size, depth)
        + conv_compute([7,1], 1, input_size, depth)
    )
    strand_two = (
        conv_compute([1,1], 1, input_size, depth)
        + conv_compute([1,7], 1, input_size, depth)
        + conv_compute([7,1], 1, input_size, depth)
    )
    strand_three = (
        pool_compute([1,1], 1, input_size)
        + conv_compute([1,1], 1, input_size, depth)
    )
    strand_four = conv_compute([1,1], 1, input_size, depth)

    return strand_one + strand_two + strand_three + strand_four

In [5]:
def incept_compute_3(stride=1, input_size=[35,35,288]):
    # Inception module from Figure 7 of the paper.
    # Note: `stride` is unused, and every convolution is assumed to have
    # as many kernels as the input has channels.
    depth = input_size[2]
    strand_one = (
        conv_compute(kernel_size=[1,1], stride=1,
                     input_size=input_size, total_kernels=depth)
        + conv_compute([3,3], 1, input_size, depth)
        + conv_compute([3,1], 1, input_size, depth)
        + conv_compute([1,3], 1, input_size, depth)
    )
    strand_two = (
        conv_compute([1,1], 1, input_size, depth)
        + conv_compute([1,3], 1, input_size, depth)
        + conv_compute([3,1], 1, input_size, depth)
    )
    strand_three = (
        pool_compute([1,1], 1, input_size)
        + conv_compute([1,1], 1, input_size, depth)
    )
    strand_four = conv_compute([1,1], 1, input_size, depth)

    return strand_one + strand_two + strand_three + strand_four

In [12]:
def linear_compute(input_size, output_size):
    # Fully connected layer, e.g. mapping from 2048 to 1000 units:
    # 1000 * 2048 multiplications plus roughly as many additions to sum the products.
    # The 1000 bias additions are not counted.
    return 2 * input_size[2] * output_size

In [13]:
def softmax_compute(input_size):
    # Exponentiating each input and dividing by the sum,
    # counted here as one operation per input.
    return input_size[2]

In [8]:
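def kernel_fits(kernel_size, stride, input_size):
    # How often will the kernel be applied to the input?
    # Rough approximation: divides each spatial dimension by the kernel size,
    # divides the total by the stride once, and scales by the input depth.
    vertical_fit = input_size[0] / kernel_size[0]
    horizontal_fit = input_size[1] / kernel_size[1]
    layers = input_size[2]
    return layers * vertical_fit * horizontal_fit / stride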

In [9]:
def output_size(kernel_size, stride, input_size):
    vertical_fit = input_size[0] / kernel_size[0] / stride
    horizontal_fit = input_size[1] / kernel_size[1] / stride
    return [vertical_fit, horizontal_fit, input_size[2]]
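
The two helpers above approximate how often a kernel is applied. For reference, the standard formula for the output length along one axis is `(n + 2 * padding - k) // stride + 1`. The sketch below applies it to the tensor sizes that appear in the forward-pass cell; the stride and padding values are inferred from those sizes under this formula, not taken from the code in this notebook, and the function name is illustrative.

```python
def conv_output_dim(n, k, stride=1, padding=0):
    """Standard output length of a convolution or pooling along one axis."""
    return (n + 2 * padding - k) // stride + 1


print(conv_output_dim(299, 3, stride=2))             # 149
print(conv_output_dim(149, 3, stride=1))             # 147
print(conv_output_dim(147, 3, stride=1, padding=1))  # 147
print(conv_output_dim(147, 3, stride=2))             # 73
print(conv_output_dim(73, 3, stride=1))              # 71
print(conv_output_dim(71, 3, stride=2))              # 35
```

Under this formula, the 149 to 147 and 73 to 71 steps correspond to a stride of 1 without padding, and the 147 to 147 step corresponds to a stride of 1 with a padding of 1.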

#### Estimate compute of forward pass
The layers are taken from table 1 in the paper.

In [17]:
# Layers follow Table 1 in the paper.
# Note: the list passed to each incept_compute_* call binds to its first
# parameter (`stride`), so the module falls back to its default input_size.
forward_pass_compute = (
    conv_compute(kernel_size=[3,3], stride=2,
                 input_size=[299,299,3], total_kernels=32)
    + conv_compute([3,3], 2, [149,149,32], 32)
    + conv_compute([3,3], 2, [147,147,32], 64)
    + pool_compute([3,3], 2, [147,147,64])
    + conv_compute([3,3], 2, [73,73,64], 80)
    + conv_compute([3,3], 2, [71,71,80], 192)
    + conv_compute([3,3], 2, [35,35,192], 288)
    + incept_compute_1([35,35,288])
    + incept_compute_2([17,17,768])
    + incept_compute_3([8,8,1280])
    + pool_compute([3,3], 2, [8,8,2048])
    + linear_compute([1,1,2048], 1000)
    + softmax_compute([1,1,1000])
)

#### Estimate total compute
I use the formula from OpenAI's article ["AI and Compute"](https://openai.com/blog/ai-and-compute/):

>(add-multiplies per forward pass) * 
(2 FLOPs/add-multiply) * 
(3 for forward and backward pass) * 
(number of examples in dataset) *
(number of epochs)

Relevant data from the paper:
> #### 8. Training Methodology
> We have trained our networks with stochastic gradient utilizing the TensorFlow [1] distributed machine learning system using 50 replicas running each on a NVidia Kepler GPU with batch size 32 for 100 epochs

Finally, there are 1.2 million images in ILSVRC 2012's training data. [source](https://image-net.org/challenges/LSVRC/2012/)
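
The formula quoted above can be written as a small function. This is a minimal sketch: the function and parameter names are illustrative, and the forward-pass count in the usage example is a hypothetical placeholder, not the estimate from this notebook. The dataset size and number of epochs are the values given above.

```python
def training_flops(add_multiplies_per_forward_pass, num_examples, num_epochs):
    """Total training FLOPs following the formula quoted from "AI and Compute"."""
    flops_per_add_multiply = 2
    forward_and_backward_factor = 3
    return (add_multiplies_per_forward_pass
            * flops_per_add_multiply
            * forward_and_backward_factor
            * num_examples
            * num_epochs)


# Hypothetical network with 1e9 add-multiplies per forward pass,
# trained on 1.2 million images for 100 epochs.
print(f"{training_flops(1e9, 1.2e6, 100):.2e}")  # 7.20e+17
```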

In [15]:
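# forward pass * 2 FLOPs per add-multiply * 3 (forward and backward) * examples * epochs
# Note: conv_compute already counts multiplications and additions separately,
# so the factor of 2 may count them twice.
forward_pass_compute * 2 * 3 * 1.2E6 * 100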

1.1125875164832e+21
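
"AI and Compute" reports training compute in petaflop/s-days, where one petaflop/s-day is 10^15 operations per second sustained for one day. The sketch below converts the result above into that unit; the constant and function names are illustrative.

```python
FLOPS_PER_PETAFLOP_S_DAY = 1e15 * 24 * 60 * 60  # 8.64e19 operations


def to_petaflop_s_days(total_flops):
    """Convert a total operation count into petaflop/s-days."""
    return total_flops / FLOPS_PER_PETAFLOP_S_DAY


print(f"{to_petaflop_s_days(1.1125875164832e+21):.1f} petaflop/s-days")  # 12.9
```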