Why can SwinIR be tested directly (not patch by patch) on images with arbitrary sizes? #9
You're actually right. Since the positional encoding is fixed after training, the attention matrix is fixed for all Transformers, as far as I know.
However, one key difference between them is that Swin Transformer uses the same attention module for all non-overlapping 8x8 image patches (similar to an 8x8 convolution with stride=8). It can easily be used on any image, as long as its size is a multiple of 8 (8x8, 16x16, 24x24, etc.). In practice, given any testing image, we can pad it to be a multiple of 8 and test it with SwinIR. See Lines 56-63 of the test script (commit 5bd10ce) for the padding code.
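For reference, here is a minimal sketch of that padding step, paraphrasing the referenced test-script lines (it assumes `img_lq` is the (B, C, H, W) low-quality input tensor, with `model` and `window_size` already in scope):

```python
import torch

# Pad the test image so that H and W become multiples of window_size.
_, _, h_old, w_old = img_lq.size()
h_pad = (h_old // window_size + 1) * window_size - h_old
w_pad = (w_old // window_size + 1) * window_size - w_old
# Reflect (flip) the image along each axis and truncate to the padded size,
# which mirrors the border content instead of padding with zeros.
img_lq = torch.cat([img_lq, torch.flip(img_lq, [2])], 2)[:, :, :h_old + h_pad, :]
img_lq = torch.cat([img_lq, torch.flip(img_lq, [3])], 3)[:, :, :w_old + w_pad, :]
output = model(img_lq)
```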
One more question: in the padding code you use h_pad = (h_old // window_size + 1) * window_size - h_old, and this padding confuses me. For example, Set5's baby has a low-resolution size of 128x128, and 128/8 = 16 divides exactly, so why pad it? Thank you.
Yes, it is compatible with Set5's baby, but not with other images. This is why we need padding: the formula rounds any image size up to a multiple of window_size so that window partitioning always works, and the padded region is cropped away after inference.
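To make the formula's behavior concrete, a quick check (note that an exactly divisible size such as 128 still receives one full extra window, which is harmless because the output is cropped afterwards):

```python
window_size = 8
for h_old in (128, 126, 133):
    h_pad = (h_old // window_size + 1) * window_size - h_old
    print(h_old, '->', h_old + h_pad)
# 128 -> 136, 126 -> 128, 133 -> 136
```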
Hi, you train with 48x48 patches but can test at any resolution. Is this only possible with the Swin Transformer? Suppose I use a normal Transformer block that performs attention over the whole picture. In this case, can I also feed in any resolution at test time? For example, if I use ViT, train with 48x48 patches, and dynamically pass the image size at forward time, the model would run the test without raising errors. Am I correct in this idea?
Thank you, your answer clarified my thinking a lot, but there is one thing.
Yes, but you can expect a severe performance drop if you do so. Just test your idea on IPT and see the PSNR drop.
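For context, a plain ViT's learned positional embedding has a fixed token count, so running it at a new resolution typically requires resizing that embedding. The snippet below is a hypothetical illustration (the name resize_pos_embed and all shapes are made up for the example): it runs without errors, but the embedding is used off-distribution, which is consistent with the performance drop mentioned above.

```python
import torch
import torch.nn.functional as F

# Hypothetical ViT trained on 48x48 inputs with 8x8 patches:
# its learned positional embedding covers (48/8)**2 = 36 tokens.
pos_embed = torch.zeros(1, 36, 96)  # (batch, tokens, embed_dim)

def resize_pos_embed(pos_embed, new_hw, patch=8, dim=96):
    # Bicubic-resize the learned token grid to match a new input resolution.
    old = int(pos_embed.shape[1] ** 0.5)
    new_h, new_w = new_hw[0] // patch, new_hw[1] // patch
    grid = pos_embed.reshape(1, old, old, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_h, new_w), mode='bicubic', align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)

print(resize_pos_embed(pos_embed, (96, 64)).shape)  # torch.Size([1, 96, 96])
```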
I am confused: why did you change window_size to 8 instead of 7?
7 also works. I chose 8 because 6x8=48 and 8x8=64, which allows fair comparison with existing SR works. Besides, 8x8 doesn't work for JPEG compression artifact reduction. One possible reason is that JPEG uses 8x8 patches in encoding.
Thank you. For JPEG, the input image size is 126; it becomes 133 after padding, and then through patch embedding the patch resolution is 33. In window_partition it cannot be divided exactly and may lose some information, am I correct? I see that Swin Transformer for detection pads several times, so I am confused why the padding is done like this.
No, we use 126x126 patches and a 7x7 window size for training JPEG CAR, so the patch number is (126/7)x(126/7)=18x18. We don't use any padding inside the model. In testing, we pad the testing image to be a multiple of 7. See Line 56 of the test script (commit 5bd10ce).
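The reason the feature map must be an exact multiple of the window size can be seen in the standard Swin-style window partitioning (sketched below; the view() call fails if H or W is not divisible by window_size):

```python
import torch

def window_partition(x, window_size):
    # Split a (B, H, W, C) feature map into non-overlapping windows of
    # shape (num_windows * B, window_size, window_size, C).
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

x = torch.randn(1, 126, 126, 96)     # 126 is a multiple of 7
print(window_partition(x, 7).shape)  # torch.Size([324, 7, 7, 96]); 18x18 = 324 windows
```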
Thanks a lot! I get it!
Thanks for sharing this great work! In the padding code, you use h_pad = (h_old // window_size + 1) * window_size - h_old to process arbitrary images, but the outputs of SwinIR differ in size from the inputs. Is it possible to ensure that the input and output sizes are always the same, even when the input image size is not a multiple of 8?
No. SwinIR operates on small windows (8x8), so you always have to pad the input to a multiple of 8 in testing. After testing, you crop the output to the same size as the GT HR image. This operation has little impact on the final performance.
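Sketched concretely (assuming h_old and w_old are the pre-padding dimensions and scale is the SR upscaling factor), the crop back to the ground-truth size is just a slice:

```python
# Run the model on the padded input, then discard the padded region.
output = model(img_lq)
output = output[..., :h_old * scale, :w_old * scale]
```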
Thanks a lot, I will try it.
Feel free to reopen the issue if you have more questions.
SwinIR solves the resolution-adaptivity problem of Transformers for low-level vision, which is great. However, the adopted window attention can only attain local interactions, which might restrict its model capacity. We gently invite you to check out our MAXIM model, accepted as a CVPR 2022 Oral. It contains both global and local MLPs and can also be tested directly on images with arbitrary sizes. We test on slightly different image restoration tasks: denoising, deblurring, deraining, dehazing, and enhancement. Our code and models have been released at https://github.com/google-research/maxim
To my knowledge, the input to a Transformer must have a fixed resolution, so at test time one often uses an overlapping-patch method to test images with a Transformer. In your code, I want to know which method you use and the idea behind it. It seems that any resolution can be fed into SwinIR? How do you do it?
Looking forward to your reply, thank you!