
can not reproduce the performance of svt-small model #20

Closed
Yangr116 opened this issue Sep 13, 2021 · 6 comments
@Yangr116

Thanks for your nice work!
I would like to reproduce the performance of the svt-small (alt_gvt_small) model. Below is my command:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model alt_gvt_small --batch-size 256 --data-path ../data/ImageNet --dist-eval --drop-path 0.2

All other parameters are the defaults, but the result only reaches 81.1%, not 81.7%.
Could you give me some suggestions on how to reproduce your nice performance from scratch?

@cxxgtxy
Collaborator

cxxgtxy commented Sep 13, 2021

Thanks for your attention.
We use a global batch size of 1024 to train our models. Have you tried this option?
We observe that the learning-rate scaling strategy for AdamW is not the same as for SGD (which scales linearly with batch size).

@Yangr116
Author

> Thanks for your attention.
> We use a global batch size of 1024 to train our models. Have you tried this option?
> We observe that the learning-rate scaling strategy for AdamW is not the same as for SGD (which scales linearly with batch size).

Thanks for your quick reply!

I haven't tried a global size of 1024. Sorry, I'm not sure what "global size" means. Is it batch_size times the number of GPU devices?

In main.py, I found one line that rescales the learning rate:
linear_scaled_lr = args.lr * args.batch_size * utils.get_world_size() / 512.0
So I didn't modify the learning rate myself; I will adjust it as soon as possible.

To sum up, thanks for your reminder and your nice work again!
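For reference, the line quoted from main.py above is the DeiT-style linear scaling rule: the base LR is multiplied by global_batch / 512. A minimal sketch follows; base_lr = 5e-4 is the usual DeiT default and is an assumption here, not something stated in this thread, though the resulting values match the peak train_lr seen in the logs posted in this issue.

```python
# Sketch of the linear LR scaling rule quoted from main.py above.
# base_lr = 5e-4 is the common DeiT default (an assumption, not confirmed here).
def linear_scaled_lr(base_lr, batch_size_per_gpu, world_size):
    # Effective LR grows linearly with the global batch size, normalized to 512.
    return base_lr * batch_size_per_gpu * world_size / 512.0

# A global batch of 2048 (256 x 8 GPUs) doubles the LR relative to 1024 (128 x 8):
lr_2048 = linear_scaled_lr(5e-4, 256, 8)  # 0.002
lr_1024 = linear_scaled_lr(5e-4, 128, 8)  # 0.001
```

This is why changing only --batch-size already changes the effective learning rate, without touching the linear_scaled_lr line itself.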

@cxxgtxy
Collaborator

cxxgtxy commented Sep 13, 2021

Yes, your setting corresponds to a global batch size of 2048.
Empirically, you don't need to change the linear_scaled_lr code; you can use a global batch size of 1024 and report the result.
If you keep 2048, I suggest changing the drop-path rate to 0.1 for the small model.
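Concretely, this advice suggests one of two launch configurations (a sketch based on the commands already in this thread; data paths are the reporter's own):

```shell
# Option A: global batch 1024 (128 per GPU x 8 GPUs), keep drop-path 0.2
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --model alt_gvt_small --batch-size 128 --data-path ../data/ImageNet \
    --dist-eval --drop-path 0.2

# Option B: keep global batch 2048 (256 x 8), lower drop-path to 0.1
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --model alt_gvt_small --batch-size 256 --data-path ../data/ImageNet \
    --dist-eval --drop-path 0.1
```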

@Yangr116
Author

Yangr116 commented Sep 16, 2021

Hi, I am sorry to bother you again.

Recently, I tried the command below to train alt_gvt_small with a global size of 1024:

python3 -m torch.distributed.launch --nproc_per_node=8 --master_port 29502 --use_env main.py --model alt_gvt_small --batch-size 128 --data-path ../data/ImageNet --dist-eval --drop-path 0.2 --output_dir ./work_dirs/alt_gvt_small

But the accuracy lags behind yours in the first 15 epochs. I suspect it will not catch up with your result after 300 epochs.

My logs:
{"train_lr": 9.999999999999953e-07, "train_loss": 6.9077429542248, "test_loss": 6.853322374098228, "test_acc1": 0.39000001693725583, "test_acc5": 1.7120000645446778, "epoch": 0, "n_parameters": 27185576}
{"train_lr": 9.999999999999953e-07, "train_loss": 6.888266038075149, "test_loss": 6.8111739736614805, "test_acc1": 0.38000001512527465, "test_acc5": 1.9880000707244874, "epoch": 1, "n_parameters": 27185576}
{"train_lr": 0.00020080000000000092, "train_loss": 6.662809805618488, "test_loss": 5.687711316527742, "test_acc1": 5.078000159225464, "test_acc5": 15.214000395431519, "epoch": 2, "n_parameters": 27185576}
{"train_lr": 0.000400599999999987, "train_loss": 6.347078082372817, "test_loss": 4.909713951927243, "test_acc1": 11.704000363845825, "test_acc5": 28.93800088180542, "epoch": 3, "n_parameters": 27185576}
{"train_lr": 0.0006003999999999824, "train_loss": 6.087260493891035, "test_loss": 4.192859256809408, "test_acc1": 20.250000597076415, "test_acc5": 42.184001130218505, "epoch": 4, "n_parameters": 27185576}
{"train_lr": 0.0008002000000000078, "train_loss": 5.829409284033268, "test_loss": 3.618216951688131, "test_acc1": 27.31400076171875, "test_acc5": 52.15800128326416, "epoch": 5, "n_parameters": 27185576}
{"train_lr": 0.0009993216197035084, "train_loss": 5.612617820334568, "test_loss": 3.220957176251845, "test_acc1": 33.55600084197998, "test_acc5": 59.50800168701172, "epoch": 6, "n_parameters": 27185576}
{"train_lr": 0.000999023230572016, "train_loss": 5.371808164792476, "test_loss": 2.8588044756289683, "test_acc1": 39.61400107498169, "test_acc5": 65.56400166351318, "epoch": 7, "n_parameters": 27185576}
{"train_lr": 0.000998670666226098, "train_loss": 5.203231047550075, "test_loss": 2.665417769641587, "test_acc1": 43.75600125, "test_acc5": 69.68200192260743, "epoch": 8, "n_parameters": 27185576}
{"train_lr": 0.0009982639653285214, "train_loss": 5.071858598912458, "test_loss": 2.3947895273114694, "test_acc1": 47.706001280212405, "test_acc5": 73.44600221710205, "epoch": 9, "n_parameters": 27185576}
{"train_lr": 0.0009978031724785232, "train_loss": 4.949590649631479, "test_loss": 2.2964160442352295, "test_acc1": 50.458001456604, "test_acc5": 75.70200245300293, "epoch": 10, "n_parameters": 27185576}
{"train_lr": 0.000997288338207296, "train_loss": 4.831694250650925, "test_loss": 2.1856239961855337, "test_acc1": 52.92200156524658, "test_acc5": 77.9720024118042, "epoch": 11, "n_parameters": 27185576}
{"train_lr": 0.0009967195189721821, "train_loss": 4.757332339656534, "test_loss": 2.07835605514772, "test_acc1": 54.860001388549804, "test_acc5": 79.68800242584228, "epoch": 12, "n_parameters": 27185576}
{"train_lr": 0.0009960967771506664, "train_loss": 4.6595591517994635, "test_loss": 1.9425015896558762, "test_acc1": 56.47200140106201, "test_acc5": 80.86000250915528, "epoch": 13, "n_parameters": 27185576}
{"train_lr": 0.0009954201810333753, "train_loss": 4.598976042321165, "test_loss": 1.9306896821115955, "test_acc1": 57.73400158081055, "test_acc5": 81.95600246551514, "epoch": 14, "n_parameters": 27185576}
{"train_lr": 0.0009946898048166896, "train_loss": 4.534675074233521, "test_loss": 1.8111045836950794, "test_acc1": 59.36600164520264, "test_acc5": 82.97000257263184, "epoch": 15, "n_parameters": 27185576}

Your logs:
{"train_lr": 1.000000000000014e-06, "train_loss": 6.9166167094230655, "test_loss": 6.881752743440516, "test_acc1": 0.18800001103878022, "test_acc5": 0.9300000336265564, "epoch": 0, "n_parameters": 24060776}
{"train_lr": 1.000000000000014e-06, "train_loss": 6.900423232269287, "test_loss": 6.852993618039524, "test_acc1": 0.41600001563549044, "test_acc5": 1.5720000462150574, "epoch": 1, "n_parameters": 24060776}
{"train_lr": 0.00040080000000000486, "train_loss": 6.646979278850555, "test_loss": 5.493567599969752, "test_acc1": 6.424000176010132, "test_acc5": 18.284000532073975, "epoch": 2, "n_parameters": 24060776}
{"train_lr": 0.0008005999999999952, "train_loss": 6.297702983379364, "test_loss": 4.646971811266506, "test_acc1": 14.610000506286621, "test_acc5": 32.9140008934021, "epoch": 3, "n_parameters": 24060776}
{"train_lr": 0.001200399999999992, "train_loss": 6.0142835487365724, "test_loss": 3.968842138262356, "test_acc1": 22.43800064086914, "test_acc5": 45.01200138336181, "epoch": 4, "n_parameters": 24060776}
{"train_lr": 0.001600200000000024, "train_loss": 5.753731050109863, "test_loss": 3.4170068081687477, "test_acc1": 30.60400089279175, "test_acc5": 55.76000146942139, "epoch": 5, "n_parameters": 24060776}
{"train_lr": 0.001998636387080776, "train_loss": 5.494997973155975, "test_loss": 2.9434911687584484, "test_acc1": 38.45400118209839, "test_acc5": 64.25200190032959, "epoch": 6, "n_parameters": 24060776}
{"train_lr": 0.001998036594786119, "train_loss": 5.245779330396652, "test_loss": 2.6588724194204105, "test_acc1": 44.24800133544922, "test_acc5": 69.74200233428955, "epoch": 7, "n_parameters": 24060776}
{"train_lr": 0.001997327904838336, "train_loss": 5.0572126257419585, "test_loss": 2.3770968151443146, "test_acc1": 48.44400132751465, "test_acc5": 73.88200233337402, "epoch": 8, "n_parameters": 24060776}
{"train_lr": 0.0019965103949532784, "train_loss": 4.904551318454742, "test_loss": 2.225582433097503, "test_acc1": 51.62800114135742, "test_acc5": 76.74000275421143, "epoch": 9, "n_parameters": 24060776}
{"train_lr": 0.0019955841547800715, "train_loss": 4.781468183612824, "test_loss": 2.1663901078350403, "test_acc1": 53.92400145446777, "test_acc5": 78.55800275604248, "epoch": 10, "n_parameters": 24060776}
{"train_lr": 0.0019945492858914225, "train_loss": 4.680502858829498, "test_loss": 2.019594290677239, "test_acc1": 56.45400156951904, "test_acc5": 80.33000265350341, "epoch": 11, "n_parameters": 24060776}
{"train_lr": 0.001993405901772395, "train_loss": 4.59881298751831, "test_loss": 1.8617978301994942, "test_acc1": 58.35600150360107, "test_acc5": 82.20000294708252, "epoch": 12, "n_parameters": 24060776}
{"train_lr": 0.001992154127807911, "train_loss": 4.521906917619705, "test_loss": 1.8242545925519045, "test_acc1": 59.48800161804199, "test_acc5": 82.6920026071167, "epoch": 13, "n_parameters": 24060776}
{"train_lr": 0.0019907941012691044, "train_loss": 4.435260578107834, "test_loss": 1.7272365518352564, "test_acc1": 61.238001587524415, "test_acc5": 84.06800270141602, "epoch": 14, "n_parameters": 24060776}
{"train_lr": 0.001989325971298189, "train_loss": 4.379169403886795, "test_loss": 1.6653209179639816, "test_acc1": 61.95000177520752, "test_acc5": 84.72800244293212, "epoch": 15, "n_parameters": 24060776}

Maybe I should set batch_size=256 and gpus=4?
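For what it's worth, the "global size" is just the per-GPU batch size times the number of processes, so both configurations discussed here give 1024 (a trivial sketch):

```python
def global_batch_size(batch_size_per_gpu, world_size):
    # Total samples per optimizer step across all distributed processes.
    return batch_size_per_gpu * world_size

assert global_batch_size(128, 8) == 1024  # the run whose logs are shown above
assert global_batch_size(256, 4) == 1024  # the proposed 4-GPU alternative
assert global_batch_size(256, 8) == 2048  # the original run from this issue
```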

@cxxgtxy
Collaborator

cxxgtxy commented Sep 16, 2021

Hmm.
The n_parameters doesn't match our log: ours is 24060776, while yours is 27185576.
Have you changed the model?

@Yangr116
Author

Oh, sorry, such a careless mistake. I will clone a fresh copy and try it again.
Thanks!

cxxgtxy closed this as completed Oct 13, 2021