Query regarding the output adapter heads #5
Hi! The Colab notebook is mainly intended to visualize the pre-training objective and to demonstrate cross-modal interaction, so we show predictions using the spatial adapter head. There are several reasons why we didn’t use DPT heads during pre-training:
For all our depth fine-tuning runs, we discard the pre-trained spatial adapter head and add a DPT head instead. We do this because, as you may have noticed, the predictions of the spatial adapter head show some “patch artifacts” that become more noticeable the further out of distribution we go in terms of the number of input tokens (e.g., when using the full set of 196 RGB input tokens instead of only 98 as during pre-training). We therefore never fine-tuned the pre-trained spatial adapter head, but given that it already predicts depth quite well, this would be something to try in the future. Best, Roman
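To make the token geometry concrete, here is a minimal NumPy sketch (function name and shapes are illustrative, not taken from the repository code) of how per-token patch predictions are stitched back into a dense map. Because each 16×16 patch is decoded independently from its token, nothing penalizes seams at patch borders, which is where the “patch artifacts” mentioned above come from:

```python
import numpy as np

def tokens_to_image(pred_tokens, patch_size=16, img_size=224):
    """Reassemble per-token patch predictions into a full map.

    pred_tokens: (num_patches, patch_size * patch_size) array, one row
    per cell of the 14x14 grid (196 tokens for a 224x224 image). Each
    patch is predicted independently, so seams at patch borders are
    not discouraged by a per-patch loss.
    """
    grid = img_size // patch_size  # 14 for the default sizes
    assert pred_tokens.shape == (grid * grid, patch_size * patch_size)
    patches = pred_tokens.reshape(grid, grid, patch_size, patch_size)
    # (grid_y, grid_x, py, px) -> (grid_y, py, grid_x, px) -> (H, W)
    return patches.transpose(0, 2, 1, 3).reshape(img_size, img_size)

# Feeding all 196 tokens gives the head roughly twice as many tokens
# as the 98 visible during pre-training, which is the regime where the
# artifacts reportedly worsen.
depth_pred = tokens_to_image(np.random.rand(196, 256))
print(depth_pred.shape)  # (224, 224)
```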
Hi @roman-bachmann ! May I ask which experiment in the MAE paper you are referring to that shows "better reconstruction quality does not necessarily result in better transfers"?
Hi @DianCh , I was mostly talking about a perceptual notion of reconstruction quality and referring to Table 1 d) in the MAE paper, which shows that predicting pixels with an MSE loss transfers just as well after fine-tuning as using dVAE tokens as targets. The former produces blurry outputs, while predicting tokens can give visually more pleasing reconstructions. That said, as PeCo shows, using a tokenizer trained with a perceptual loss can perform better downstream. Best, Roman
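For reference, the pixel-regression target discussed above can be sketched as follows: a per-patch normalized MSE computed only over masked patches, in the spirit of the MAE paper's pixel variant. This is a simplified NumPy illustration; the function name and array shapes are assumptions, not the paper's code:

```python
import numpy as np

def masked_pixel_mse(pred, target, mask, eps=1e-6):
    """MSE over masked patches against per-patch normalized pixels.

    pred, target: (num_patches, pixels_per_patch) arrays.
    mask: (num_patches,) bool array, True where the patch was masked.
    Each target patch is normalized by its own mean and std, so the
    loss rewards local structure rather than absolute intensity.
    """
    mu = target.mean(axis=1, keepdims=True)
    sigma = target.std(axis=1, keepdims=True)
    norm_target = (target - mu) / (sigma + eps)
    per_patch = ((pred - norm_target) ** 2).mean(axis=1)
    # Only masked (invisible) patches contribute to the objective.
    return per_patch[mask].mean()
```

Optimizing an averaged MSE like this tends to produce blurry reconstructions, which is exactly why pleasing-looking outputs and good transfer performance can come apart.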
Hi,
Thank you for the interesting work and the extensive experiments. The depth results in the paper are based on the DPT head, while the Colab uses the spatial adapter head for inference. I was wondering whether your fine-tuning results with the spatial adapter head were better or worse than with the DPT head? Or was the intention of the spatial head more to test a purely transformer-based head (as opposed to DPT's convolution-based, RefineNet-like approach)?
Thank you.