In the only layers that use self-attention blocks, the last two, you have set the window size equal to the spatial size. Doesn't that mean you are not really computing self-attention at all, and that there is effectively only a single token?
Please correct me if I am wrong, as this seems perplexing.
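
To make the arithmetic behind my question concrete, here is a minimal sketch assuming a Swin-style non-overlapping window partition; `spatial_size`, `window_size`, and the value 8 are placeholders I picked, not taken from your code:

```python
# Minimal sketch (placeholder values, not the repo's actual configuration)
# of what a Swin-style partition does when window_size == spatial_size.

spatial_size = 8                       # assumed H = W of the last two layers' feature map
window_size = spatial_size             # the setting I am asking about

num_windows = (spatial_size // window_size) ** 2  # -> 1: the partition collapses to a single window
tokens_per_window = window_size ** 2              # -> 64: every spatial position falls in that one window

print(num_windows, tokens_per_window)             # 1 64
```

If my reading of the partitioning is right, each of the last two layers ends up with exactly one window covering the whole feature map, which is what prompted the question above.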