
Questions about image generation? #9

Open
mutonix opened this issue Jul 10, 2024 · 12 comments
Labels: question (Further information is requested)

Comments

@mutonix commented Jul 10, 2024

Thanks for sharing this great work!

I have some questions about image generation in Anole. Does Anole use the VQGAN decoder from Chameleon? Since Chameleon has also released the VQGAN weights for image generation (though they state that the image generation function is disabled), what new things does Anole add?

Many thanks!

@EthanC111 (Collaborator)

Thank you for your interest!

Yes, Anole uses the same VQGAN as Chameleon. As you mentioned, the open-sourced version of Chameleon doesn't support vision generation; Anole unlocks image and multimodal generation capabilities from Chameleon.
We will upgrade Anole with new functionality soon! Stay tuned for more updates!
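
In rough pseudocode, the flow is: the language model autoregressively samples discrete image-token ids, and the frozen VQGAN decoder maps those codebook indices back to pixels. A minimal sketch, assuming illustrative handles `lm` (returns next-token logits) and `vqgan_decoder`, and a 32x32 latent grid; none of these names are Anole's actual API:

```python
import torch

# Minimal conceptual sketch of interleaved generation. All names here are
# illustrative assumptions, NOT Anole's actual API: `lm` is a causal LM that
# returns next-token logits, `vqgan_decoder` maps codebook indices to pixels.
def generate_image(lm, vqgan_decoder, prompt_ids, num_image_tokens=1024):
    tokens = prompt_ids                              # (1, seq_len) token ids
    for _ in range(num_image_tokens):
        logits = lm(tokens)[:, -1, :]                # logits for the next position
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_id], dim=-1)
    image_ids = tokens[:, prompt_ids.shape[1]:]      # the sampled image tokens
    codes = image_ids.view(1, 32, 32)                # assumed 32x32 latent grid
    return vqgan_decoder(codes)                      # codebook indices -> pixels
```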

@mutonix (Author) commented Jul 10, 2024

Thank you for the quick reply.
I am curious: if we directly use the Chameleon VQGAN to generate images, will it work? Or does the model have to be fine-tuned, as Anole did, to activate the image generation capability? Did you experiment with directly applying the Chameleon VQGAN without fine-tuning?

@EthanC111 (Collaborator)

The VQGAN part seems to work pretty well. According to our experiments, the reconstructed images look pretty much the same as the original images.
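
The reconstruction experiment can be pictured as a simple round trip; a sketch, assuming generic `vqgan.encode` / `vqgan.decode` handles rather than Chameleon's exact interface:

```python
import torch

# Round-trip sanity check in the spirit of the experiment described above.
# `vqgan.encode` / `vqgan.decode` are assumed handles, not Chameleon's exact
# interface: encode an image to discrete codes, decode back, compare pixels.
def reconstruction_error(vqgan, image):              # image: (1, 3, H, W)
    with torch.no_grad():
        codes = vqgan.encode(image)                  # quantized codebook indices
        recon = vqgan.decode(codes)                  # indices back to pixels
    return torch.mean((image - recon) ** 2).item()   # low MSE => faithful tokenizer
```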

@EthanC111 (Collaborator)

We did not change the VQGAN tokenizer.

@matbee-eth

> Thank you for your interest!
>
> Yes, Anole uses the same VQGAN as Chameleon. As you mentioned, the open-sourced version of Chameleon doesn't support vision generation; Anole unlocks image and multimodal generation capabilities from Chameleon. We will upgrade Anole with new functionality soon! Stay tuned for more updates!

I would love a semi-descriptive (ELI a 40-year-old full-stack eng) writeup on how this is achieved.

@EthanC111 (Collaborator) commented Jul 10, 2024

> I would love a semi-descriptive (ELI a 40-year-old full-stack eng) writeup on how this is achieved.

@matbee-eth Thank you for your interest! This is our paper: https://arxiv.org/abs/2407.06135

@mutonix (Author) commented Jul 11, 2024

Can you further explain this question? Many thanks!

> Does Chameleon have to be fine-tuned, as Anole has done, to activate the intrinsic image generation capability that was disabled? Did you experiment with directly applying the original Chameleon weights to generate images without fine-tuning (since the VQGAN decoder weights are provided by Meta, Chameleon should theoretically be able to generate images without fine-tuning)?

@JoyBoy-Su added the question label on Jul 11, 2024
@b2r66sun commented Jul 11, 2024

I'd also like to know whether you tuned the VQGAN or directly used the weights from Chameleon. Many thanks!

@EthanC111 (Collaborator) commented Jul 12, 2024

Hi @mutonix, Chameleon doesn't support image generation. For more information, please see this issue. Anole is fine-tuned from Chameleon to enable image generation and multimodal generation.
Hi @b2r66sun, we did not tune the VQGAN. We directly use the VQGAN provided by Chameleon.

@mutonix (Author) commented Jul 13, 2024

In the issue you mentioned, the author does not say whether he commented out the following code (or similar code) in the original Chameleon implementation:

```python
# Mask image-token logits so sampling can never select them,
# which disables image generation at decode time.
image_tokens = self.model.vocabulary_mapping.image_tokens
logits[:, :, image_tokens] = torch.finfo(logits.dtype).min
```

Maybe that is the reason why he could not get correct images. Have you tried commenting out the above code to generate images directly? My confusion is whether only fine-tuning can activate the image generation capability, or whether commenting out a few lines of code is enough.
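
For reference, those two lines implement logit masking. Written as a standalone transformers LogitsProcessor for illustration (a sketch, not the exact code path in the Chameleon implementation), the mechanism looks like this:

```python
import torch
from transformers import LogitsProcessor

# Sketch of the masking mechanism quoted above (for illustration; not the
# exact code path in the Chameleon implementation). Setting image-token
# logits to the dtype minimum gives them ~zero probability after softmax,
# so neither greedy decoding nor sampling can ever emit an image token.
class SuppressImageTokens(LogitsProcessor):
    def __init__(self, image_token_ids):
        self.image_token_ids = image_token_ids

    def __call__(self, input_ids, scores):
        scores[:, self.image_token_ids] = torch.finfo(scores.dtype).min
        return scores
```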

@AbrahamSanders commented Jul 15, 2024

@mutonix I can't find anything like that in the original Chameleon implementation, only in the transformers version distributed with Anole (for fine-tuning purposes?): modeling_chameleon.py#L1627

I tried swapping the original Chameleon 7b weights for Anole 7b and running the original Chameleon Miniviewer. It appears to be capable of generating coherent images only when using the Anole weights.
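
In transformers terms, that swap looks roughly like the following sketch; the checkpoint path is a placeholder for locally converted Anole weights, and the token budget of 1026 assumes 1024 image tokens plus begin/end delimiters:

```python
import torch
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

# Sketch of the weight-swap experiment using the transformers port. The
# checkpoint path is a placeholder for locally converted Anole weights;
# whether image tokens can actually be sampled still depends on the logit
# mask discussed earlier in this thread.
model = ChameleonForConditionalGeneration.from_pretrained(
    "path/to/anole-7b-hf", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = ChameleonProcessor.from_pretrained("path/to/anole-7b-hf")
inputs = processor(text="Please draw a red bird.", return_tensors="pt").to(model.device)
# 1026 = 1024 image tokens plus begin/end image delimiters (assumed).
out = model.generate(**inputs, max_new_tokens=1026, do_sample=True)
```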

@Yuheng-Li

The original Chameleon release seems to include the VQGAN's decoder, so how did Chameleon disable the image generation ability?

What does Anole do to activate this ability? For example, does Chameleon mask out the logits corresponding to image tokens in the last layer, and does Anole add them back?
