<a name='0'></a>

# Vision Transformers for Mobile Applications

As we saw in the previous notebooks, Vision Transformers are one of the top trends in computer vision research nowadays. They have matched or outperformed convolutional neural networks (CNNs) on most vision benchmarks. On edge and mobile devices, it's essential to have light-weight models that can run efficiently. There are already efficient, light-weight CNN models that run on mobile devices, but given the power of self-attention to learn global representations from data, researchers are trying to bring transformers to mobile applications as well.

In this notebook, we will review a few landmark papers that use self-attention, or combine it with CNNs, with a focus on devices that have low computation power. I will also add pointers for further reading. Note that this review is not meant to be exhaustive or chronological.

***Outline***:

- [1. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](#1) 

- [2. EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers](#2)

- [3. EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition](#3)

- [4. Other Vision Transformers for Low-Computation Devices](#4)

- [5. Conclusion](#5)

<a name='1'></a>

## 1. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

MobileViT is a light-weight, general-purpose Vision Transformer and one of the earliest designed for mobile devices. Unlike prior Vision Transformers, which are heavy-weight and rely on strong data augmentation techniques, MobileViT achieves competitive accuracy with fewer parameters. MobileViT also outperforms existing light-weight CNN architectures such as [MobileNetV2](https://github.com/Nyandwi/ModernConvNets/blob/main/convnets/09-mobilenetv2.ipynb).

MobileViT combines convolutions and self-attention layers, so the resulting network can learn global features in the data (thanks to self-attention, which attends to all parts of the input) while keeping the parameter count low. The reason for using convolutions in MobileViT is to benefit from their spatial inductive biases, which ultimately reduce the need for large datasets or strong augmentation methods.


The MobileViT architecture is made of [MobileNetV2](https://github.com/Nyandwi/ModernConvNets/blob/main/convnets/09-mobilenetv2.ipynb) blocks, which are used for downsampling the spatial resolution of intermediate features, and MobileViT blocks, which are made of convolution and self-attention layers. Quoting the paper about the architecture: "MobileViT uses convolutions and transformers in a way that the resultant MobileViT block has convolution-like properties while simultaneously allowing for global processing."


![image](https://drive.google.com/uc?export=view&id=1KjEPzvGE3P1cmG-7MrBX-C8Byyoy1IKN)

Surprisingly, MobileViT achieves better performance (whether used for image classification or as a backbone network in object detection and image segmentation) with fewer parameters than existing light-weight CNNs and Vision Transformers (ViTs).

![image](https://drive.google.com/uc?export=view&id=1V9FFAK9l_BAJsd3rYuL7QaKpXl_xwj2_)





MobileViT shows really great performance, but it still uses self-attention, whose time complexity is quadratic in the number of tokens. As a result, MobileViT has high latency compared to CNNs. The improved version, MobileViTv2, introduces a separable self-attention with linear time complexity, making it a more viable architecture for mobile devices. You can learn more about MobileViTv2 [here](https://arxiv.org/abs/2206.02680).
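
To get a feel for why quadratic complexity hurts, here is a minimal sketch that counts tokens and attention scores at a few input resolutions. The patch size of 16 and the helper names are my own assumptions for illustration; they are not MobileViT's actual configuration.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens (assumes exact division)."""
    return (height // patch) * (width // patch)


def pairwise_scores(n_tokens: int) -> int:
    """Entries in the n x n score matrix of standard self-attention."""
    return n_tokens * n_tokens


# Hypothetical square inputs with an assumed patch size of 16.
for side in (128, 256, 512):
    n = num_tokens(side, side, patch=16)
    print(f"{side}x{side}: {n} tokens, {pairwise_scores(n)} attention scores")
```

Doubling the image side quadruples the number of tokens and multiplies the number of attention scores by 16, which is the growth that linear-complexity attention variants try to avoid.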

<a name='2'></a>


## 2. EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers

[EdgeViT](https://arxiv.org/abs/2205.03436) is another Vision Transformer designed for devices with limited computation resources, such as mobile devices. Unlike prior light-weight ViTs such as MobileViT, which focus on reducing parameters while achieving great accuracy, EdgeViT seeks to improve inference efficiency (low latency and energy consumption) rather than merely focusing on parameter counts or FLOPs (floating point operations).

Like Swin Transformer, EdgeViT employs a hierarchical pyramid network structure in which the spatial resolution is reduced while the channel dimension is increased as we go deeper into the network (i.e., stage after stage).
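
A rough way to picture the pyramid is to track the feature-map shape stage by stage. The sketch below assumes four stages with strides (4, 2, 2, 2) and illustrative channel widths; these numbers and the function name are placeholders, not EdgeViT's published configuration.

```python
def pyramid_shapes(height, width, strides, channels):
    """Return the (channels, height, width) of the feature map after each stage."""
    shapes = []
    for stride, ch in zip(strides, channels):
        # Each stage downsamples spatially and widens the channel dimension.
        height, width = height // stride, width // stride
        shapes.append((ch, height, width))
    return shapes


# Hypothetical configuration, chosen only for illustration.
stages = pyramid_shapes(224, 224, strides=(4, 2, 2, 2), channels=(32, 64, 128, 256))
for i, (c, h, w) in enumerate(stages, start=1):
    print(f"stage {i}: {c} channels, {h}x{w}")
```

With these assumed settings, a 224x224 input shrinks to 56x56, 28x28, 14x14, and finally 7x7, while the channel count grows at every stage.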

![image](https://drive.google.com/uc?export=view&id=1bEv0H44_R1lEhHfOlq3SzO_kE108ZaUp)

The main component of EdgeViT is the Local-Global-Local (LGL) block, which is made of local aggregation, global sparse attention, and local propagation. Local aggregation gathers information within local windows using depthwise and pointwise convolutions. Global sparse attention computes self-attention among a sparse set of delegate tokens, one sampled from each fixed-size window, to capture the global representation. Local propagation spreads the learned global information to neighboring tokens using [transposed convolution](https://www.matthewzeiler.com/mattzeiler/deconvolutionalnetworks.pdf).
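
To see why attending over a sparse set of tokens is cheap, here is a small counting sketch. It assumes one delegate token per window, which is how I read the paper, and uses a hypothetical 56x56 feature map with a window size of 4; the function names are mine.

```python
def attention_scores(n_tokens: int) -> int:
    """Number of pairwise scores when every token attends to every token."""
    return n_tokens * n_tokens


def delegate_count(height: int, width: int, window: int) -> int:
    """One delegate token per window x window region (assumes exact division)."""
    return (height // window) * (width // window)


height = width = 56  # hypothetical feature-map size
window = 4           # hypothetical window size

full = attention_scores(height * width)
sparse = attention_scores(delegate_count(height, width, window))
print(f"full attention: {full} scores")
print(f"sparse attention over delegates: {sparse} scores")
print(f"reduction factor: {full // sparse}")
```

Under these assumptions the number of attention scores drops by a factor of `window ** 4`, and the local aggregation and propagation steps are what let the remaining tokens still benefit from the global exchange.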

Compared to light-weight CNNs such as MobileNets and EfficientNets, EdgeViT achieves a great trade-off between accuracy and efficiency (such as latency). You can read more about EdgeViTs [here](https://arxiv.org/abs/2205.03436).

![image](https://drive.google.com/uc?export=view&id=18lzwmlUs5eaTBSPghlyR0nas3fbEHbhE)

<a name='3'></a>


## 3. EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition

EfficientViT introduces a linear attention that replaces standard softmax attention (simply swapping softmax for ReLU in the attention formula). Since linear attention on its own is weak at capturing local features, it is enhanced with depthwise convolutions. The main contribution of EfficientViT seems to be reducing the quadratic cost of softmax attention to linear time and using depthwise convolutions. Depthwise convolutions are common in architectures designed for efficiency in the low-computation regime.
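
The reason ReLU in place of softmax gives linear time is that the matrix products can be reordered. Below is a minimal NumPy sketch of generic ReLU linear attention, not EfficientViT's exact module (it leaves out the depthwise convolutions and multi-head details). The function names, the `eps` term, and the toy sizes are my own assumptions.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def relu_attention_quadratic(q, k, v, eps=1e-6):
    """Reference form: builds the full n x n similarity matrix."""
    scores = relu(q) @ relu(k).T                       # (n, n)
    weights = scores / (scores.sum(axis=1, keepdims=True) + eps)
    return weights @ v                                 # (n, d_v)


def relu_attention_linear(q, k, v, eps=1e-6):
    """Same computation, reordered so no n x n matrix is ever formed."""
    q, k = relu(q), relu(k)
    kv = k.T @ v                                       # (d, d_v), summed over tokens
    k_sum = k.sum(axis=0)                              # (d,)
    numer = q @ kv                                     # (n, d_v)
    denom = q @ k_sum + eps                            # (n,)
    return numer / denom[:, None]


# Toy inputs: 64 tokens with 8 features each (hypothetical sizes).
rng = np.random.default_rng(0)
n, d = 64, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

out_quadratic = relu_attention_quadratic(q, k, v)
out_linear = relu_attention_linear(q, k, v)
print(np.allclose(out_quadratic, out_linear))
```

Both functions should agree up to floating-point error. The linear version costs on the order of `n * d * d` operations instead of `n * n * d`, which matters when the number of tokens `n` is much larger than the feature dimension `d`, as with high-resolution inputs. This reordering is not possible with softmax, because the exponential is applied to each query-key score individually.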

Like all the previous architectures, EfficientViT is hybrid (made of convolutions and self-attention) and hierarchical (resolution decreases and channels increase as we go deeper into the network).


![image](https://drive.google.com/uc?export=view&id=1xXBpVTe6yMcA6K-eUgVeZw_W-c3Jrgiw)


<a name='4'></a>


## 4. Other Vision Transformers for Low-Computation Devices

There are currently [zillions](https://twitter.com/Jeande_d/status/1543632243486117888?s=20&t=rnjIM0zsnq1RMSb0CXy-LA) of (Vision) Transformer models, most of which are beta and will be forgotten in a few months. It's nearly impossible to talk about every paper. So, what I want to do here is list a few other papers, as optional reading, that are designed for low-computation devices:

* [MiniViT: Compressing Vision Transformers with Weight Multiplexing](https://arxiv.org/abs/2204.07154)

* [Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios](https://arxiv.org/abs/2207.05501)

* [EfficientFormer: Vision Transformers at MobileNet Speed](https://arxiv.org/abs/2206.01191)

* [Mobile-Former: Bridging MobileNet and Transformer](https://arxiv.org/abs/2108.05895)

* [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://openaccess.thecvf.com/content/ICCV2021/html/Graham_LeViT_A_Vision_Transformer_in_ConvNets_Clothing_for_Faster_Inference_ICCV_2021_paper.html)

* [MobileOne: An Improved One millisecond Mobile Backbone](https://arxiv.org/abs/2206.04040)




<a name='5'></a>


## 5. Conclusion

This was a short review of Vision Transformers designed for efficient deployment on mobile and edge devices. Vision Transformers are still an active area of research, but I thought it would be good to know a few papers that address low-computation devices.

It seems that most of those architectures are hybrid (they combine convolutions and self-attention) and hierarchical (resolution decreases and channel dimension increases with depth). If there is one thing that makes Vision Transformers a deployment-unfriendly architecture, it is the quadratic time complexity of self-attention. So, most papers bring this down, often to linear time, by computing patch representations in local windows (like Swin Transformer) or by replacing softmax with ReLU, as we saw in EfficientViT. Finally, the kind of convolution most often used in efficient architectures, not only in CNNs but in ViTs as well, is [depthwise convolution](https://paperswithcode.com/method/depthwise-convolution), a special kind of grouped convolution that applies a single filter to each input channel and concatenates the results.
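
As a quick illustration of why depthwise convolution is so popular in efficient architectures, the sketch below compares weight counts for a single layer. The layer sizes and function names are hypothetical, and biases are ignored.

```python
def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a standard convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_params(c_in: int, kernel: int) -> int:
    """Weights in a depthwise convolution: one kernel x kernel filter per input channel."""
    return kernel * kernel * c_in


def depthwise_separable_params(c_in: int, c_out: int, kernel: int) -> int:
    """Depthwise convolution followed by a 1x1 pointwise convolution that mixes channels."""
    return depthwise_params(c_in, kernel) + conv_params(c_in, c_out, 1)


c_in, c_out, kernel = 64, 128, 3  # hypothetical layer sizes
print("standard:", conv_params(c_in, c_out, kernel))
print("depthwise only:", depthwise_params(c_in, kernel))
print("depthwise separable:", depthwise_separable_params(c_in, c_out, kernel))
```

Because a depthwise convolution does not mix channels, it is usually paired with a 1x1 pointwise convolution; even so, the pair needs far fewer weights than a standard convolution of the same shape.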

## [BACK TO TOP](#0)