
Project Proposal: Investigating Possible Inductive Biases in Local Sparse Attention Vision Transformer Architectures Compared to Traditional CNNs

Background

The inductive bias of a machine learning algorithm is the set of underlying assumptions the model makes in order to generalise to inputs it has not seen before. In a typical learning problem, the algorithm is presented with training examples from which it derives a relationship between input and output data points. However, the model should also extend its prediction capabilities to inputs never seen during the training phase and predict a reasonably correct output. Since the outputs for such samples may be arbitrary, a set of assumptions is needed under which the relationship between inputs and outputs can be generalised; this set of assumptions is the inductive bias. For example, in the k-nearest-neighbours (KNN) algorithm, the assumed bias is that data points lying in a small neighbourhood of the feature space belong to the same class.
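As a concrete sketch of this bias (the toy dataset and the value of k below are arbitrary placeholders, not part of the proposal's experiments), a minimal KNN classifier literally implements the neighbourhood assumption: the label of a query point is decided by a majority vote among its nearest training points.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the class of x_query under the KNN inductive bias:
    points close together in feature space share a label."""
    # Euclidean distance from the query to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Labels of the k nearest neighbours
    nearest = y_train[np.argsort(dists)[:k]]
    # Majority vote: the local neighbourhood decides the class
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# Toy usage: two clusters; a query near the first cluster gets its label
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.05, 0.1])))  # -> 0
```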

In deep-learning-based computer vision, Convolutional Neural Networks (CNNs) have long been used for classification, generation and similar tasks, especially pre-trained CNNs such as VGG19 and ResNet50. CNNs carry certain inductive biases of their own. One is a spatial bias, which assumes a particular spatial structure in the data: CNNs use local structure as a bias, assuming that pixels close to each other in an image are more likely to be similar. Local structure in images is therefore more prominent to the algorithm than global structure. Another inductive bias present in CNNs is weight sharing, where each filter is replicated across the entire visual field. The replicated units share the same parameterisation (weight vector and bias) and form a feature map, so all the neurons in a given convolutional layer respond to the same feature within their respective receptive fields. Replicating units in this way makes the resulting activation map equivariant under shifts of input features across the visual field. These inductive biases help CNNs perform well on general image and vision tasks.
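To make the weight-sharing point concrete, here is a minimal PyTorch sketch (the image size, kernel size, and use of circular padding are illustrative assumptions) showing that a single shared convolutional filter yields a shift-equivariant feature map: shifting the input shifts the output by the same amount.

```python
import torch
import torch.nn as nn

# One 3x3 filter whose weights are shared across the whole visual field.
# Circular padding makes shift-equivariance exact under a circular shift;
# with ordinary zero padding it holds away from the image borders.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3,
                 padding=1, padding_mode="circular", bias=False)

x = torch.randn(1, 1, 8, 8)                    # toy 8x8 single-channel image
x_shifted = torch.roll(x, shifts=2, dims=-1)   # same image, shifted 2 px right

# conv(shift(x)) == shift(conv(x)): equivariance from weight sharing
print(torch.allclose(conv(x_shifted),
                     torch.roll(conv(x), shifts=2, dims=-1),
                     atol=1e-6))  # -> True
```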

Transformer-based models such as the Vision Transformer work somewhat differently. Transformers use attention matrices in their architecture to focus on different regions of the input rather than treating it uniformly. However, computing the full attention matrix can be computationally infeasible, which motivates sparse attention transformers. These models exploit the observation that attention matrices tend to be sparse and low-rank, so computations involving the attention matrix can be made much faster with negligible performance loss (the sparse approximation remains close to the full matrix). Local attention models have also been proposed, in which attention is applied over small sliding windows instead of the entire feature sequence. We now want to explore the principles, causes and effects of local sparse attention and how they compare to the inductive biases of CNNs, i.e. weight sharing, the locality principle, etc.
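A minimal NumPy sketch of local (sliding-window) attention follows; the window size and embedding dimensions are illustrative placeholders, not the architecture we will ultimately use. Each query attends only to keys within a fixed window of its own position, so the implicit attention matrix is banded and sparse rather than a full n x n matrix.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def local_attention(Q, K, V, window=2):
    """Sliding-window attention: query i attends only to keys within
    `window` positions of i, giving a sparse, banded attention pattern."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)   # scores over the local window only
        out[i] = softmax(scores) @ V[lo:hi]       # weighted sum of local values
    return out

# Toy usage: 6 tokens with 4-dimensional embeddings
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
print(local_attention(Q, K, V).shape)  # -> (6, 4)
```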

Objectives

  • To benchmark the performance of a CNN-based architecture and a local sparse attention transformer on a standard image dataset (a minimal sketch of such a shared benchmark loop follows this list).
  • To study the heuristics of the resulting performances and attempt to correlate them with the inductive biases we start from.
  • In particular, to explore whether the inductive biases operate in local attention transformers as they do in CNNs, the extent to which the performances match, and where they differ.
  • To study the mathematics of both local attention transformers and CNNs and attempt to interpret or prove more robust methods of correlating the observations, rather than relying on training heuristics alone.
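As a rough illustration of the first objective, the sketch below shows a shared PyTorch training-and-evaluation loop; the dataset (CIFAR-10 here), the hyperparameters, and the `run_benchmark` name are placeholder assumptions rather than final experimental decisions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def run_benchmark(model, epochs=1, lr=1e-3, device="cpu"):
    """Train and evaluate `model` on CIFAR-10 (placeholder dataset).
    The same loop is reused for both architectures so that only the
    model, and hence its inductive biases, varies between runs."""
    tfm = transforms.ToTensor()
    train = DataLoader(datasets.CIFAR10("data", train=True, download=True,
                                        transform=tfm),
                       batch_size=128, shuffle=True)
    test = DataLoader(datasets.CIFAR10("data", train=False, download=True,
                                       transform=tfm),
                      batch_size=256)
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for x, y in train:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    # Evaluate test accuracy
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in test:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(1) == y).sum().item()
            total += y.numel()
    return correct / total
```

The loop would then be invoked once with the CNN (e.g. a ResNet variant) and once with the local sparse attention transformer, so that any accuracy gap can be attributed to the architectures' inductive biases rather than to differences in the training setup.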
