Hi! First of all, thanks for releasing such a great model and accompanying paper. Could you clarify a few design choices in SDXL?
- Why do you use both the previous CLIP-L and the new OpenCLIP ViT-bigG? Have you tried using only the latter? Wouldn't it be enough on its own?
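  To make sure I understand the setup I'm asking about: my reading of the paper is that the penultimate hidden states of both text encoders are concatenated along the channel axis to form the cross-attention context. A minimal sketch of that (shapes only, random placeholders instead of real encoder outputs):

  ```python
  import numpy as np

  seq_len = 77  # token sequence length used by both encoders

  # placeholder per-token hidden states (penultimate layer outputs)
  clip_l = np.random.randn(seq_len, 768)      # CLIP ViT-L/14: 768-dim
  open_clip = np.random.randn(seq_len, 1280)  # OpenCLIP ViT-bigG/14: 1280-dim

  # channel-wise concatenation -> (77, 2048) context fed to cross-attention
  context = np.concatenate([clip_l, open_clip], axis=-1)
  print(context.shape)  # (77, 2048)
  ```

  If only ViT-bigG were used, the context would drop to 1280 channels, so I'm curious whether the CLIP-L branch was kept mainly for quality or for compatibility reasons.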
- The crop-conditioning, while avoiding the generation of too many cropped images, seems to produce more duplication: the object of interest appears everywhere in the frame instead of as a single instance. See these comparisons. I wonder why you don't use multi-aspect (i.e. rectangular) training during the whole training process, rather than only during fine-tuning.
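  For context, here is my understanding of the crop-conditioning mechanism I'm referring to: the top-left crop coordinates are mapped through sinusoidal (Fourier) embeddings and added to the model's conditioning alongside the timestep embedding. A rough sketch, with the embedding dimension chosen arbitrarily for illustration:

  ```python
  import numpy as np

  def fourier_embed(value, dim=256, max_period=10000.0):
      # standard sinusoidal embedding of a scalar, as used for
      # timestep / micro-conditioning signals
      half = dim // 2
      freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
      args = value * freqs
      return np.concatenate([np.cos(args), np.sin(args)])

  # crop conditioning: embed the (top, left) crop offsets and concatenate
  c_top, c_left = 0, 0  # (0, 0) would signal an uncropped training image
  crop_emb = np.concatenate([fourier_embed(c_top), fourier_embed(c_left)])
  print(crop_emb.shape)  # (512,)
  ```

  At inference, setting (0, 0) steers the model toward uncropped-looking outputs, which is where I suspect the duplication artifacts I describe above come in.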