Query regarding the output adapter heads #5

Closed
AntiLibrary5 opened this issue Apr 25, 2022 · 3 comments

@AntiLibrary5

Hi,
Thank you for the interesting work and the extensive experiments. In the paper, your depth results are based on the DPT head, but in the Colab you use the spatial adapter head for inference. I was wondering whether your fine-tuning results with the spatial adapter head were better or worse than with the DPT head. Was the intention behind implementing this spatial head mainly to test a purely transformer-based head (as opposed to DPT's convolution-based, RefineNet-like approach)?

Thank you.

@roman-bachmann
Member

Hi!

The Colab notebook is mainly intended to visualize the pre-training objective and to demonstrate cross-modal interaction, so we show predictions using the spatial adapter head. There are several reasons why we didn’t use DPT heads during pre-training:

  • The DPT head reshapes the full N_H x N_W set of tokens into a dense feature map (at multiple layers). This works during fine-tuning, since we use entire RGB images as input, but during pre-training, where we use only 98 randomly sampled tokens from 3 modalities, this reshaping would not work (see the sketch after this list).
  • By using cross-attention in the decoder, we can cleanly integrate the information extracted from the tokens of all other input modalities, no matter how many there are.
  • All in all, we tried to keep the decoders as conceptually simple and lightweight as possible for efficient pre-training. The original MAE paper had some experiments showing that better reconstruction quality during pre-training does not necessarily result in better transfers, so we chose to go with a shallow and simple decoder, and not add any conv layers.

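To make this concrete, here is a toy PyTorch sketch (not the actual MultiMAE code; all dimensions and module names are illustrative) of why a DPT-style head needs the full token grid, whereas a cross-attention decoder handles any number of visible tokens:

```python
import torch
import torch.nn as nn

B, N_H, N_W, D = 2, 14, 14, 256  # batch, 14x14 patch grid, token dimension

# DPT-style head: requires the complete N_H x N_W grid to reshape tokens
# into a dense feature map before convolutional refinement.
full_tokens = torch.randn(B, N_H * N_W, D)             # all 196 RGB tokens
feature_map = full_tokens.transpose(1, 2).reshape(B, D, N_H, N_W)
conv_refine = nn.Conv2d(D, D, kernel_size=3, padding=1)
dense_out = conv_refine(feature_map)                    # (B, D, 14, 14)

# With only 98 randomly sampled tokens spread over 3 modalities there is no
# complete grid, so the reshape above is not possible for that sparse subset.

# Cross-attention decoder: one query per output patch attends to however many
# encoded tokens exist; no grid assumption is needed.
visible_tokens = torch.randn(B, 98, D)                  # sparse multi-modal tokens
queries = torch.randn(B, N_H * N_W, D)                  # one query per output patch
cross_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
decoded, _ = cross_attn(queries, visible_tokens, visible_tokens)  # (B, 196, D)
```
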
For all our depth fine-tuning runs, we discard the pre-trained spatial adapter head and add a DPT head instead. We do this because, as you might have noticed, the spatial adapter head's predictions show some “patch artifacts” that become more noticeable the further we move out of distribution in terms of the number of input tokens (e.g., when using the full set of 196 RGB input tokens instead of just the 98 used during pre-training). We therefore never fine-tuned the pre-trained spatial adapter head, but given that it already predicts depth quite well, that would be something to try in the future.
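
In code, the swap at fine-tuning time comes down to something like the following toy sketch (the stand-in modules and names are purely illustrative, not our actual fine-tuning code):

```python
import torch
import torch.nn as nn

# Stand-ins for the real components; shapes and module choices are illustrative.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)
spatial_adapter_head = nn.Linear(256, 16 * 16)   # pre-training depth head (discarded)
dpt_head = nn.Sequential(                        # placeholder standing in for a
    nn.Linear(256, 256), nn.GELU(),              # randomly initialized (conv-based) DPT head
    nn.Linear(256, 16 * 16),
)

# Keep the pre-trained encoder, drop the spatial adapter, attach the new head.
finetune_model = nn.Sequential(encoder, dpt_head)

tokens = torch.randn(2, 196, 256)        # full RGB token set used at fine-tuning
depth_patches = finetune_model(tokens)   # (2, 196, 256): one 16x16 depth patch per token
```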

Best, Roman

@DianCh

DianCh commented Jan 14, 2023

Hi @roman-bachmann ! May I ask which experiment in the MAE paper you are referring to that shows "better reconstruction quality does not necessarily result in better transfers"?

@roman-bachmann
Member

Hi @DianCh ,

I was mostly talking about a perceptual notion of reconstruction quality and referring to Table 1 d) in the MAE paper, which shows that predicting pixels with an MSE loss transfers just as well after fine-tuning as using dVAE tokens as targets. The former produces blurry outputs, while predicting tokens may yield visually more pleasing reconstructions. That said, as PeCo shows, using a tokenizer trained with a perceptual loss can perform better downstream.
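
For reference, the pixel-target variant comes down to a per-patch MSE computed only on the masked patches, roughly like this (a toy sketch, not the exact MAE code; MAE additionally offers normalizing each target patch by its own mean and std):

```python
import torch

B, N, P = 2, 196, 16 * 16 * 3              # batch, patches, pixels per 16x16 RGB patch
pred = torch.randn(B, N, P)                # decoder output: predicted pixels per patch
target = torch.randn(B, N, P)              # ground-truth pixels per patch
mask = (torch.rand(B, N) < 0.75).float()   # 1 = masked patch (75% masking ratio)

loss_per_patch = ((pred - target) ** 2).mean(dim=-1)            # MSE within each patch
loss = (loss_per_patch * mask).sum() / mask.sum().clamp(min=1)  # average over masked patches
```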

Best, Roman
