Some questions about the VQGAN tokenizer #9
Hi! Thanks for your interest. The answers to both of your questions are yes: 1. One key observation in MAGE is that the exact same MAE-like encoder-decoder, operating on image tokens instead of pixel space (as in MAE), gives a huge boost in both generation and representation learning performance. 2. The image tokenizer is pre-trained and kept fixed during MAGE training.
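To make this concrete, here is a minimal sketch of the overall idea under my own assumptions (the `vqgan.encode_to_ids` helper and the `transformer(token_ids, mask)` call are hypothetical, not the repo's actual API): a frozen tokenizer turns each image into a grid of token indices, some positions are masked in token space, and an MAE-like encoder-decoder is trained to predict the original indices.

```python
import torch
import torch.nn.functional as F

def mage_style_step(images, vqgan, transformer, mask_ratio=0.75):
    # 1. Tokenize with the frozen, pre-trained VQGAN:
    #    (B, 3, 256, 256) images -> (B, 256) token indices (16x16 grid, flattened).
    with torch.no_grad():
        token_ids = vqgan.encode_to_ids(images)  # hypothetical helper

    # 2. Choose positions to mask in token space (not pixel space).
    B, L = token_ids.shape
    num_masked = int(mask_ratio * L)
    noise = torch.rand(B, L, device=token_ids.device)
    rank = noise.argsort(dim=1).argsort(dim=1)   # rank of each position's noise value
    mask = rank < num_masked                     # True = masked position

    # 3. The MAE-like encoder-decoder predicts the original indices at masked positions.
    logits = transformer(token_ids, mask)        # hypothetical signature, (B, L, codebook_size)
    loss = F.cross_entropy(logits[mask], token_ids[mask])
    return loss
```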
Thanks for your prompt reply! Wish you all the best! Before reading your paper, I was very curious why BEiT-like papers use the patch projection tokens as input instead of their tokenized semantic tokens. I am not sure whether MAGE is the first paper to use semantic tokens as input for representation learning, but the linear probing results prove its effectiveness! I also really like the analysis and details in Table 6. As you explained in this issue, I am a little confused about why "since the receptive fields of neighboring feature pixels have significant overlap, it is much easier to infer masked feature pixels using nearby unquantized feature pixels." Since you have masked the raw pixels, shouldn't the receptive fields of both the neighboring feature tokens and semantic tokens be very similar?
I think there is some misunderstanding: we do not mask the raw pixels of the 256x256 images; we mask the pixels in the 16x16 feature space. Therefore, if we do not quantize the features and directly mask them, the masked features can easily be inferred by looking at nearby features, since each pixel in the feature space has an overlapping receptive field.
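A small illustration of where the masking lives, with shapes assumed for the sake of example (this is not the repo's code):

```python
import torch

image = torch.randn(1, 3, 256, 256)               # raw pixels: never masked directly
features = torch.randn(1, 256, 16, 16)            # continuous VQGAN features on the 16x16 grid
token_ids = torch.randint(0, 1024, (1, 16, 16))   # quantized codebook indices on the same grid

mask = torch.rand(1, 16, 16) < 0.75               # the mask is defined on the 16x16 grid

# If continuous features were masked directly, each masked location could be
# interpolated from unmasked neighbors whose receptive fields overlap heavily.
masked_features = features.masked_fill(mask.unsqueeze(1), 0.0)

# Quantizing first and replacing masked locations with a special [MASK] id
# removes that shortcut: the model must predict a discrete codebook index.
MASK_ID = 1024  # one id beyond the codebook, used here as the mask token
masked_ids = token_ids.masked_fill(mask, MASK_ID)
```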
Thanks for clearing up my confusion!
Thanks for your suggestion! It could be an interesting future direction to explore.
Hi @LTH14, I noticed that in mage/models_mage.py, after tokenizing the input images into input ids (based on the VQGAN's codebook), the ids are converted to embeddings via self.token_emb.
Hi, I think there is a bit of a misunderstanding: we use the pre-trained VQGAN to extract the token index, not the token vector, from the original image. After that, we use a different embedding (i.e., self.token_emb), trained together with MAGE, to embed that token index. The main reason for this is that the embedding dimension of the MAGE transformer can differ from that of the VQGAN codebook. As a result, directly using the VQGAN codebook as an initialization of self.token_emb can cause an embedding dimension mismatch.
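In other words (a minimal sketch with illustrative names and sizes, not necessarily those used in mage/models_mage.py):

```python
import torch
import torch.nn as nn

codebook_size = 1024   # size of the pre-trained VQGAN codebook (assumed)
mage_dim = 768         # MAGE transformer embedding dimension; may differ from the VQGAN's

# The frozen VQGAN only supplies discrete indices in [0, codebook_size).
token_ids = torch.randint(0, codebook_size, (4, 256))  # stand-in for the VQGAN encoder output

# A separate embedding table, trained jointly with MAGE, maps those indices
# into the transformer's embedding space; it is not the VQGAN codebook itself.
token_emb = nn.Embedding(codebook_size, mage_dim)
x = token_emb(token_ids)  # (4, 256, mage_dim), fed into the MAGE encoder
```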
Thanks for your prompt reply! |
You are mostly correct! Just one minor point: the newly learned MAGE codebook does not necessarily need to align with the VQGAN codebook: the output of MAGE is still a token index, so you can use the VQGAN detokenizer to decode those token indices. "VQGAN codebook copy and a linear projection as the self.token_emb" -- I remember trying that, and the results were quite similar.
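For reference, the "codebook copy plus linear projection" variant could look roughly like this (a hedged sketch; `vqgan.quantize.embedding` is an assumed attribute name, and per the reply above this performed about the same as learning self.token_emb from scratch):

```python
import torch.nn as nn

class CodebookInitTokenEmb(nn.Module):
    """Initialize the token embedding from a copy of the VQGAN codebook,
    then bridge the dimension gap with a learned linear projection."""
    def __init__(self, vqgan_codebook: nn.Embedding, mage_dim: int):
        super().__init__()
        codebook_size, vqgan_dim = vqgan_codebook.weight.shape
        self.emb = nn.Embedding(codebook_size, vqgan_dim)
        self.emb.weight.data.copy_(vqgan_codebook.weight.data)  # codebook copy
        self.proj = nn.Linear(vqgan_dim, mage_dim)               # dimension bridge

    def forward(self, token_ids):
        return self.proj(self.emb(token_ids))

# Usage (names assumed): token_emb = CodebookInitTokenEmb(vqgan.quantize.embedding, mage_dim=768)
```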
The "align" in my comment means that, through the training process, each index in token_emb naturally corresponds to the same index in the VQGAN codebook. That is consistent with your comment!
First, congratulations on MAGE being accepted to CVPR 2023! I learned a lot from your great paper and also from your detailed replies to other issues!
I'm not familiar with the usage of the image tokenizer. Below are some questions from my side:
Thanks again for such a great paper! It pushes the field a big step forward!