## MobileNet

In [2]:
from torchvision import models
import torch
from torchvision import transforms
from PIL import Image
from torchvision import datasets
import torch.nn as nn

### Abstract

MobileNets are based on a [streamlined architecture ?]() that uses [depth-wise separable ?]() convolutions to build [lightweight ?]() deep neural networks.\
[Two simple global hyperparameters ?]() trade off between [latency (delay ?)](https://www.techtarget.com/whatis/definition/latency) and accuracy.\
These hyperparameters let the model builder choose the right-sized model for the application.


The main difference between a standard 2D convolution and a [depthwise convolution]() is that each standard filter combines all input channels, whereas a depthwise convolution filters each input channel separately.


### 1. Introduction

AlexNet (2012) won the ImageNet Challenge: ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012.

Problem: recognition tasks need to be carried out in a timely fashion on a computationally limited platform.

Note: i.e., networks need to become more efficient?

MobileNet can be easily matched to the design requirements for [mobile and embedded vision applications]().

### 2. Prior work

Compared with small networks of other papers, MobileNets optimize latency and also yield networks with small size.

MobileNets are built primarily from depthwise separable convolutions initially introduced in [Rigid-motion scattering for image classification](../papers/phd_sifre.pdf) and subsequently used in [Inception models](../papers/BatchNomalization.pdf) to reduce the computation in the first few layers. 


blog: [Understanding Depthwise Separable Convolutions and the efficiency of MobileNets](../blogs/DepthwiseSeparableConvolutions.pdf)

### 3. MobileNet Architecture

We first describe the core layers that MobileNet is built on, which are depthwise separable filters.

#### 3.1. Depthwise Separable Convolution

The MobileNet model is based on depthwise separable convolutions, a form of factorized convolution that factorizes a standard convolution into [a depthwise convolution]() and a 1 × 1 convolution called [a pointwise convolution]().

In [4]:
# data
data_path = '../data-unversioned/p1ch7/'
# cifar10 = datasets.CIFAR10(data_path, train=True, download=False)
# cifar10_val = datasets.CIFAR10(data_path, train=False, download=False)
tensor_cifar10 = datasets.CIFAR10(data_path, train=True, download=False,
                          transform=transforms.ToTensor())
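# apply ToTensor to the validation split too, otherwise it yields PIL images
tensor_cifar10_val = datasets.CIFAR10(data_path, train=False, download=False,
                          transform=transforms.ToTensor())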
label_map = {0: 0, 2: 1}
class_names = ['airplane', 'bird']
cifar2 = [(img, label_map[label])
          for img, label in tensor_cifar10
          if label in [0, 2]]
cifar2_val = [(img, label_map[label])
              for img, label in tensor_cifar10_val
              if label in [0, 2]]

[torch Conv2d](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html#torch.nn.Conv2d)

In [15]:
# depthwise convolution
Cin, k = 3, 2
Cout = k * Cin
depthwise_conv = nn.Conv2d(Cin, Cout, kernel_size=3, groups=Cin, padding='same')
depthwise_conv.weight.shape, depthwise_conv.bias.shape

(torch.Size([6, 1, 3, 3]), torch.Size([6]))

In [14]:
# depthwise convolution
img, _ = cifar2[0]
img = img.unsqueeze(0)
depthwise_output = depthwise_conv(img)
img.shape, depthwise_output.shape

(torch.Size([1, 3, 32, 32]), torch.Size([1, 6, 32, 32]))

In [8]:
# 1 x 1 convolution / pointwise convolution
Cin = 3
Cout = 10
pointwise_conv = nn.Conv2d(Cin, Cout, kernel_size=1)
pointwise_output = pointwise_conv(img)
pointwise_conv.weight.shape, pointwise_output.shape

(torch.Size([10, 3, 1, 1]), torch.Size([1, 10, 32, 32]))
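
The two cells above apply the depthwise and the pointwise convolution separately to the image. As a minimal sketch, they can be chained into one depthwise separable layer and compared with a standard 3 × 3 convolution of the same input and output shape. The random input tensor, the channel multiplier of 1, and the helper `count_params` are assumptions made for illustration, not part of the original notebook.

```python
import torch
import torch.nn as nn

# Stand-in for one CIFAR-10 image batch: [batch, channels, height, width].
x = torch.randn(1, 3, 32, 32)

c_in, c_out = 3, 10

# Depthwise step: one 3x3 filter per input channel (groups == in_channels).
depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in)
# Pointwise step: 1x1 convolution that mixes the channels.
pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)
separable = nn.Sequential(depthwise, pointwise)

# Standard 3x3 convolution with the same input/output shape, for comparison.
standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)


def count_params(module):
    """Total number of learnable scalars (weights and biases)."""
    return sum(p.numel() for p in module.parameters())


separable_out = separable(x)
standard_out = standard(x)
separable_params = count_params(separable)
standard_params = count_params(standard)

print(separable_out.shape, standard_out.shape)
print(separable_params, standard_params)
```

Both layers map `[1, 3, 32, 32]` to `[1, 10, 32, 32]`, but the separable version needs far fewer parameters because the spatial filtering and the channel mixing are learned separately.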

* Standard convolutional layer:
  * input : $M \times D_{F} \times D_{F}$, $M$ is input depth, $D_{F}$ is the spatial width and height of a square input feature map $F$
  * convolution kernel $K$ : $N \times M \times D_{K} \times D_{K}$, $N$ is output depth, $M$ is input depth, $D_{K}$ is the spatial dimension of the square kernel
  * output : $N \times D_{G} \times D_{G}$, $N$ is output depth, $D_{G}$ is the spatial width and height of a square output feature map $G$
  * $G_{n, k, l} = \sum_{m, i, j} K_{n, m, i, j} \cdot F_{m, k+i-1, l+j-1}$
  * computational cost : $N \cdot M \cdot D_{K} \cdot D_{K} \cdot D_{F} \cdot D_{F}$

* Depthwise convolution:
  * input : $M \times D_{F} \times D_{F}$ feature map $F$
  * kernel : $M \times 1 \times D_{K} \times D_{K}$
  * output : $M \times D_{G} \times D_{G}$ feature map $G$
  * no padding, stride 1 example : input[3, 32, 32] -> kernel[3, 1, 3, 3] -> output[3, 30, 30]
  * computational cost : $M \cdot 1 \cdot D_{K} \cdot D_{K} \cdot D_{F} \cdot D_{F}$

* $1 \times 1$ (pointwise) convolution:
  * compute a linear combination of the output of depthwise convolution
  * input from depthwise convolution: $M \times D_{G} \times D_{G}$
  * kernel : $N \times M \times 1 \times 1$
  * output : $N \times D_{G} \times D_{G}$
  * computational cost : $N \cdot M \cdot 1 \cdot 1 \cdot D_{G} \cdot D_{G}$
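
To see what the three cost formulas above imply, here is a small arithmetic sketch in plain Python. It assumes stride 1 with padding so that $D_{G} = D_{F}$, and the layer sizes (512 channels, a 14 × 14 feature map) are illustrative choices. Under that assumption the separable cost divided by the standard cost simplifies to $\frac{1}{N} + \frac{1}{D_{K}^{2}}$.

```python
def standard_conv_cost(m, n, d_k, d_f):
    """Multiply-adds of a standard convolution: N * M * D_K * D_K * D_F * D_F."""
    return n * m * d_k * d_k * d_f * d_f


def depthwise_cost(m, d_k, d_f):
    """Multiply-adds of a depthwise convolution: M * D_K * D_K * D_F * D_F."""
    return m * d_k * d_k * d_f * d_f


def pointwise_cost(m, n, d_f):
    """Multiply-adds of a 1x1 convolution: N * M * D_F * D_F (assumes D_G == D_F)."""
    return n * m * d_f * d_f


# Illustrative layer: 512 -> 512 channels, 3x3 kernel, 14x14 feature map.
m, n, d_k, d_f = 512, 512, 3, 14

standard = standard_conv_cost(m, n, d_k, d_f)
separable = depthwise_cost(m, d_k, d_f) + pointwise_cost(m, n, d_f)

ratio = separable / standard
expected_ratio = 1 / n + 1 / d_k ** 2

print(standard, separable)
print(ratio, expected_ratio)
```

With a 3 × 3 kernel the ratio is dominated by $1 / D_{K}^{2} = 1/9$, so the separable layer needs roughly 8 to 9 times fewer multiply-adds than the standard one.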

#### 3.2. Network Structure and Training

Down sampling is handled with strided convolution in the depthwise convolutions as well as in the first layer.
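
A minimal sketch of this kind of down sampling in PyTorch: a stride-2 first layer followed by a stride-2 depthwise convolution, each halving the spatial size. The channel count of 32, the 224 × 224 input, and `padding=1` are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)

# First layer: a standard 3x3 convolution with stride 2 (halves H and W).
first_layer = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)

# Depthwise convolution with stride 2: one filter per channel, halves H and W again.
depthwise_down = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1, groups=32)

after_first = first_layer(x)
after_depthwise = depthwise_down(after_first)

print(after_first.shape)      # spatial size 224 -> 112
print(after_depthwise.shape)  # spatial size 112 -> 56
```

No pooling layer is involved; the stride alone reduces the resolution.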

Our model structure puts nearly all of the computation into dense 1 × 1 convolutions. This can be implemented with highly optimized ***general matrix multiply (GEMM)*** functions.
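
The link between 1 × 1 convolutions and GEMM can be checked directly: a pointwise convolution is a matrix multiply between the pixels (one row per spatial position) and the $N \times M$ weight matrix. The sketch below compares `nn.Conv2d` with an explicit matrix product; the tensor sizes are arbitrary and the bias is disabled to keep the comparison simple. It illustrates the equivalence only and makes no claim about how PyTorch implements the convolution internally.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, m, n, height, width = 2, 8, 16, 5, 5
x = torch.randn(batch, m, height, width)

pointwise = nn.Conv2d(m, n, kernel_size=1, bias=False)

with torch.no_grad():
    conv_out = pointwise(x)  # [batch, n, height, width]

    # One row per spatial position, one column per input channel: [B*H*W, M].
    rows = x.permute(0, 2, 3, 1).reshape(-1, m)
    # The 1x1 kernel [N, M, 1, 1] is just an N x M matrix.
    weight = pointwise.weight.reshape(n, m)

    # A single matrix multiply, then restore the [B, N, H, W] layout.
    gemm_out = (rows @ weight.t()).reshape(batch, height, width, n).permute(0, 3, 1, 2)

matches = torch.allclose(conv_out, gemm_out, atol=1e-5)
print(matches)
```

Because the input needs no im2col-style reordering beyond a reshape, the whole layer maps onto one dense matrix multiply.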

Additionally, we found that it was important to put very little or no weight decay (l2 regularization) on the depthwise filters since there are so few parameters in them.
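
One way to express this in PyTorch is with optimizer parameter groups, sketched below. The toy two-layer model, the helper `is_depthwise`, the choice of SGD, and the values for `lr`, `momentum` and `weight_decay` are all assumptions for illustration; the notes do not specify them.

```python
import torch
import torch.nn as nn

# Toy model: one depthwise and one pointwise convolution.
model = nn.Sequential(
    nn.Conv2d(8, 8, kernel_size=3, padding=1, groups=8),  # depthwise
    nn.Conv2d(8, 16, kernel_size=1),                      # pointwise
)


def is_depthwise(module):
    """Hypothetical helper: a Conv2d with one group per input channel."""
    return (
        isinstance(module, nn.Conv2d)
        and module.groups > 1
        and module.groups == module.in_channels
    )


decay, no_decay = [], []
for module in model.modules():
    # recurse=False so every parameter is collected exactly once.
    params = list(module.parameters(recurse=False))
    if is_depthwise(module):
        no_decay.extend(params)
    else:
        decay.extend(params)

optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 4e-5},   # illustrative value
        {"params": no_decay, "weight_decay": 0.0},  # depthwise filters: no decay
    ],
    lr=0.1,
    momentum=0.9,
)

print(len(decay), len(no_decay))
```

Here the depthwise layer's weight and bias land in the group without weight decay, while the pointwise layer keeps the regular setting.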

#### 3.3. Width Multiplier: Thinner Models

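For a given layer and width multiplier α, the number of input channels M becomes αM and the number of output channels N becomes αN.

Note: α ∈ (0, 1] thins every layer uniformly (typical values are 1, 0.75, 0.5 and 0.25; α = 1 is the baseline MobileNet), so the cost of a depthwise separable layer becomes $D_{K} \cdot D_{K} \cdot \alpha M \cdot D_{F} \cdot D_{F} + \alpha M \cdot \alpha N \cdot D_{F} \cdot D_{F}$, roughly an $\alpha^{2}$ reduction.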
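
As a rough numerical check of the width multiplier, the sketch below scales both channel counts of one depthwise separable layer by α and recomputes the multiply-adds. The layer sizes (512 channels, 3 × 3 kernel, 14 × 14 feature map) and the truncation with `int` are illustrative assumptions.

```python
def separable_cost(m, n, d_k, d_f, alpha=1.0):
    """Multiply-adds of a depthwise separable layer with width multiplier alpha."""
    m_thin = int(alpha * m)  # input channels become alpha * M
    n_thin = int(alpha * n)  # output channels become alpha * N
    depthwise = d_k * d_k * m_thin * d_f * d_f
    pointwise = m_thin * n_thin * d_f * d_f
    return depthwise + pointwise


m, n, d_k, d_f = 512, 512, 3, 14
base = separable_cost(m, n, d_k, d_f)

ratios = {}
for alpha in (1.0, 0.75, 0.5, 0.25):
    cost = separable_cost(m, n, d_k, d_f, alpha)
    ratios[alpha] = cost / base
    print(alpha, cost, round(ratios[alpha], 4), alpha ** 2)
```

The ratio is close to α² but not exactly equal, because the depthwise term shrinks only linearly in α while the dominant pointwise term shrinks quadratically.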

#### 3.4. Resolution Multiplier: Reduced Representation

Here ρ ∈ (0, 1] is the resolution multiplier, which is typically set implicitly so that the input resolution of the network is 224, 192, 160 or 128. ρ = 1 is the baseline MobileNet and ρ < 1 gives reduced-computation MobileNets. The resolution multiplier has the effect of reducing computational cost by $\rho^{2}$.
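
The $\rho^{2}$ effect can be verified with a few lines of plain Python. The sketch assumes a hypothetical layer (32 input channels, 64 output channels, 3 × 3 kernel) whose feature map is 112 × 112 at the baseline resolution of 224, and that the feature-map size scales proportionally with the input resolution.

```python
def separable_cost(m, n, d_k, d_f):
    """Multiply-adds of one depthwise separable layer (depthwise + pointwise)."""
    return d_k * d_k * m * d_f * d_f + m * n * d_f * d_f


base_resolution = 224
m, n, d_k = 32, 64, 3  # hypothetical layer
base_d_f = 112         # assumed feature-map size at resolution 224
base_cost = separable_cost(m, n, d_k, base_d_f)

ratios = {}
for resolution in (224, 192, 160, 128):
    rho = resolution / base_resolution  # implicit resolution multiplier
    d_f = round(rho * base_d_f)         # feature map shrinks with the input
    ratio = separable_cost(m, n, d_k, d_f) / base_cost
    ratios[resolution] = (rho, ratio)
    print(resolution, round(rho, 4), round(ratio, 4), round(rho ** 2, 4))
```

Both cost terms contain $D_{F} \cdot D_{F}$, so shrinking the feature map by ρ scales the whole layer cost by ρ², independent of the channel counts.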