
Weight decay and Resnet18 #11

Closed
gergopool opened this issue Apr 28, 2022 · 9 comments
@gergopool

Hi!

In my last issue I forgot to congratulate you on your exceptional paper. I had been looking for a relational method since PAWS, but couldn't really find one that achieves such high performance on ImageNet. Also, this method works with a small batch size and very low computational resources thanks to the frozen target network and single-view backprop. Nice work!

Reading the code, I noticed two minor differences from the paper, though. Could you please double-check these and clarify which settings the published results reflect?

  • (1) Weight decay
    • Paper: I didn't find any mention of weight decay for ImageNet training, but found 5e-4 for the small and medium datasets.
    • Code: link You use 1e-4 weight decay and 0 for biases. Is this the default ImageNet setting?
  • (2) Resnet18 7x7 conv
    • Paper:
    • We adopt the ResNet18 [25] as our backbone network. Because most of our dataset contains low-resolution images, we replace the first 7x7 Conv of stride 2 with 3x3 Conv of stride 1 and remove the first max pooling operation for a small dataset.

    • Code: link I see no sign of these changes; it looks like you kept the original ImageNet ResNet setup. Is that right?

Thank you.

PS: I am trying to reproduce your results from the paper, but I'm currently stuck at around 65% on ImageNet.

@mingkai-zheng
Owner

mingkai-zheng commented Apr 28, 2022

(1) Weight decay: We use 5e-4 for the small and medium datasets, and 1e-4 for ImageNet.
(2) The ResNet18 7x7 -> 3x3 conv replacement is only for the small and medium datasets, which are not implemented in this codebase.

The experimental settings in this repository are only for ImageNet, and we have provided a training script in script/train.sh. We are sorry about the missing hyperparameters in our paper.

For the small and medium dataset experiments, we provide a separate codebase here; please check it out and feel free to ask if you have any further questions.
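
For the small-dataset variant, the stem change quoted from the paper above can be applied to a standard torchvision ResNet18 roughly like this (a sketch for illustration; the separate codebase may differ in detail):

import torch.nn as nn
from torchvision.models import resnet18

model = resnet18()
# replace the first 7x7 stride-2 conv with a 3x3 stride-1 conv ...
model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
# ... and remove the first max pooling, which would otherwise shrink
# low-resolution inputs too aggressively
model.maxpool = nn.Identity()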

@mingkai-zheng
Owner

BTW, would you mind sharing your training settings for the 65% result? Did you simply run this codebase directly?

@gergopool
Author

we provide a separate codebase here

Thank you! I will definitely have a look.

Did you simply run this codebase directly?

No, I am working in a different repository, here. I've just added the 1e-4 weight decay to the code. I've also just discovered that I used log(softmax(x)) instead of log_softmax(x) when comparing the two distributions, which might have led to numerically unstable computations. I will re-run in the next few days and get back to you.
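
For illustration, this is the kind of instability I mean (toy logits, not the actual training code): log(softmax(x)) can underflow to log(0) = -inf for strongly negative logits, while log_softmax computes the same quantity stably via the log-sum-exp trick.

import torch
import torch.nn.functional as F

logits = torch.tensor([[-100.0, 0.0, 100.0]])
unstable = torch.log(F.softmax(logits, dim=-1))  # contains -inf after underflow
stable = F.log_softmax(logits, dim=-1)           # finite values
print(unstable)
print(stable)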

Thanks again!

@gergopool
Author

While checking your code out, I realized you also shuffle the batch when training ReSSL on a single GPU. I've never used this technique before; I hadn't even considered this case. It's great that you uploaded that zip so I could find this out, because it might have a big effect on all moving-average methods. I've also discovered that you use max pooling in ResNet18 and a hidden_dim of 2048 when training on Tiny-ImageNet. I had overlooked all of these.
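
To make sure I understand it, here is roughly how I picture the single-GPU shuffle around the teacher forward pass (my own names, not your code). As far as I can tell, it only changes BatchNorm behaviour if the statistics are computed over sub-groups of the batch (e.g. a split BN that emulates per-GPU statistics), since a plain BN over the full batch is permutation-invariant.

import torch

@torch.no_grad()
def forward_teacher_shuffled(teacher, images):
    # shuffle the batch before the teacher forward, then restore the order,
    # so that (split) BN statistics are not computed on the original grouping
    idx = torch.randperm(images.size(0), device=images.device)
    feats = teacher(images[idx])
    return feats[torch.argsort(idx)]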

Thanks again for sharing the code, it helped a lot! ☺️

@gergopool
Author

I am very close now: 69.04% accuracy. Do you think that's within the noise range? 1% sounds like a lot, but it might be.
I will do a step-by-step code review again; maybe I'll find something.

@mingkai-zheng
Owner

mingkai-zheng commented May 10, 2022

1% should not be acceptable noise. Let me briefly summarize some key points that I think you should check in the training setting.

For pre-training

lr = 0.05
weight_decay = 1e-4
momentum=0.9
teacher temperature = 0.04
student temperature = 0.1
warm up for 5 epochs and use a cosine scheduler
m = 0.999
hidden_dim for projection head = 4096
out_dim for projection head = 512
no BN layer in the projection head!!! (see the sketch below)
memory buffer size = 131072 (the loss sketch further down shows where this and the two temperatures enter)
batch size = 256, 32 per GPU (shuffle BN might give different results if you do not strictly follow this setting)
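
For example, a projection head matching the sizes above could look like this (a sketch only; feat_dim is the backbone's output width, and the exact layers should be taken from the code):

import torch.nn as nn

feat_dim = 2048   # backbone output width (2048 for ResNet50, 512 for ResNet18)
projector = nn.Sequential(
    nn.Linear(feat_dim, 4096),   # hidden_dim
    nn.ReLU(inplace=True),
    nn.Linear(4096, 512),        # out_dim, no BN anywhere in the head
)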

contrastive augmentation:
transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.)),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([GaussianBlur([.1, 2.])], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

weak augmentation:
transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
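
To show where the two temperatures and the memory buffer enter, the relational loss is roughly the following (a simplified sketch with made-up names; please check the code for the exact implementation): both the teacher (weak view) and the student (contrastive view) embeddings are compared against the memory buffer, and the student's relation distribution (temperature 0.1) is pushed towards the sharper teacher distribution (temperature 0.04).

import torch
import torch.nn.functional as F

def relational_loss(z_student, z_teacher, memory, t_s=0.1, t_t=0.04):
    z_student = F.normalize(z_student, dim=1)
    z_teacher = F.normalize(z_teacher, dim=1)
    memory = F.normalize(memory, dim=1)          # (K, dim), K = 131072 above
    logits_s = z_student @ memory.t() / t_s
    logits_t = z_teacher @ memory.t() / t_t
    p_t = F.softmax(logits_t, dim=1).detach()    # no gradient through the teacher
    return -(p_t * F.log_softmax(logits_s, dim=1)).sum(dim=1).mean()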

For linear evaluation

change the backbone to evaluation mode!!!
zero-init the linear classifier!!! (see the sketch after this list)
batch size = 256 
momentum=0.9
learning rate = 0.3
weight_decay = 0
cosine scheduler
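
Concretely, the two highlighted points could look like this (a sketch with an assumed ResNet50 backbone for illustration): keep the frozen backbone in eval mode so BN uses its running statistics, and zero-initialise the linear classifier.

import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()                  # placeholder encoder for illustration
backbone.fc = nn.Identity()            # expose the 2048-d features
backbone.eval()                        # frozen encoder, eval-mode BN
for p in backbone.parameters():
    p.requires_grad = False

classifier = nn.Linear(2048, 1000)     # 1000 ImageNet classes
nn.init.zeros_(classifier.weight)
nn.init.zeros_(classifier.bias)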

training augmentation:
transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

eval augmentation:
transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@gergopool
Author

Thank you!

The only difference on my side is the learning rate. Is that a typo? I found 0.05 in both the paper and your code. Also, I used Nesterov acceleration for the linear evaluation, but that probably won't make much of a difference.

However, I pretrained the network with half-precision floats: the loss calculation happened in float32, but the encoder's forward pass was done in float16. I've never seen this cause a difference in supervised setups, but maybe it results in a ~1% drop in self-supervised training. I have a pretrained SimSiam network but haven't run linear evaluation on it yet; that could serve as a sanity check.
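
For reference, my mixed-precision setup looks roughly like this (a sketch of my own code, not this repository's): the encoder forward runs under autocast in fp16, and the loss is computed in fp32.

import torch

def train_step(encoder, loss_fn, images, optimizer, scaler):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        feats = encoder(images)       # fp16 forward pass
    loss = loss_fn(feats.float())     # cast features back to fp32 for the loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()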

@mingkai-zheng
Owner

Yes, the learning rate should be 0.05; sorry about the typo. I'm not quite sure about the effect of fp16 on this codebase.

@gergopool
Author

The evaluation of my 100-epoch SimSiam network also looks a bit weaker: 67.56% instead of 68.1%. So it's very likely that fp16 is the reason behind the 1% drop. (Interestingly, the SwAV evaluation protocol performs weakly on the SimSiam network, by the way; it really needs a 4096 batch size with LARS.)

I think you can close this issue. Thank you for all the code and the detailed answers!
