
can not reproduce the performance of svt-small model #20

Closed
Yangr116 opened this issue Sep 13, 2021 · 6 comments
@Yangr116

Thanks for your nice work!
I would like to reproduce the performance of the svt-small (alt_gvt_small) model. Below is my command:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model alt_gvt_small --batch-size 256 --data-path ../data/ImageNet --dist-eval --drop-path 0.2

All other parameters are the defaults, but the result only reaches 81.1%, not 81.7%.
Could you give me some suggestions on how to reproduce your nice performance from scratch?

@cxxgtxy
Collaborator

cxxgtxy commented Sep 13, 2021

Thanks for your attention.
We use a global batch size of 1024 to train our models. Have you tried this option?
We observe that the learning-rate scaling strategy for AdamW is not the same as for SGD (which scales linearly with batch size).

@Yangr116
Author

> Thanks for your attention.
> We use a global batch size of 1024 to train our models. Have you tried this option?
> We observe that the learning-rate scaling strategy for AdamW is not the same as for SGD (which scales linearly with batch size).

Thanks for your quick reply!

I haven't tried a global size of 1024. Sorry, I'm not sure what "global size" means. Is it batch_size times the number of GPU devices?

In main.py, I found one line that rescales the learning rate:
linear_scaled_lr = args.lr * args.batch_size * utils.get_world_size() / 512.0
So I didn't modify the learning rate myself; I will adjust it as soon as possible.

To sum up, thanks for your reminder and your nice work again!
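For reference, the line quoted from main.py above is the DeiT-style linear scaling rule: the base LR is multiplied by global_batch / 512. A minimal sketch follows; base_lr = 5e-4 is the usual DeiT default and is an assumption here, not something stated in this thread, though the resulting values match the peak train_lr seen in the logs posted in this issue.

```python
# Sketch of the linear LR scaling rule quoted from main.py above.
# base_lr = 5e-4 is the common DeiT default (an assumption, not confirmed here).
def linear_scaled_lr(base_lr, batch_size_per_gpu, world_size):
    # Effective LR grows linearly with the global batch size, normalized to 512.
    return base_lr * batch_size_per_gpu * world_size / 512.0

# A global batch of 2048 (256 x 8 GPUs) doubles the LR relative to 1024 (128 x 8):
lr_2048 = linear_scaled_lr(5e-4, 256, 8)  # 0.002
lr_1024 = linear_scaled_lr(5e-4, 128, 8)  # 0.001
```

This is why changing only --batch-size already changes the effective learning rate, without touching the linear_scaled_lr line itself.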

@cxxgtxy
Collaborator

cxxgtxy commented Sep 13, 2021

Yes, your setting corresponds to a global batch size of 2048.
Empirically, you don't need to change the linear_scaled_lr code; you can use a global batch size of 1024 and report the result.
If you keep 2048, I suggest changing the drop-path rate to 0.1 for the small model.
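Concretely, this advice suggests one of two launch configurations (a sketch based on the commands already in this thread; data paths are the reporter's own):

```shell
# Option A: global batch 1024 (128 per GPU x 8 GPUs), keep drop-path 0.2
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --model alt_gvt_small --batch-size 128 --data-path ../data/ImageNet \
    --dist-eval --drop-path 0.2

# Option B: keep global batch 2048 (256 x 8), lower drop-path to 0.1
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --model alt_gvt_small --batch-size 256 --data-path ../data/ImageNet \
    --dist-eval --drop-path 0.1
```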

@Yangr116
Author

Yangr116 commented Sep 16, 2021

Hi, I am sorry to bother you again.

Recently, I tried the command below to train alt_gvt_small with a global size of 1024:

python3 -m torch.distributed.launch --nproc_per_node=8 --master_port 29502 --use_env main.py --model alt_gvt_small --batch-size 128 --data-path ../data/ImageNet --dist-eval --drop-path 0.2 --output_dir ./work_dirs/alt_gvt_small

But the accuracy lags behind yours in the first 15 epochs. I suspect it will not catch up with your result after 300 epochs.

My logs:
{"train_lr": 9.999999999999953e-07, "train_loss": 6.9077429542248, "test_loss": 6.853322374098228, "test_acc1": 0.39000001693725583, "test_acc5": 1.7120000645446778, "epoch": 0, "n_parameters": 27185576}
{"train_lr": 9.999999999999953e-07, "train_loss": 6.888266038075149, "test_loss": 6.8111739736614805, "test_acc1": 0.38000001512527465, "test_acc5": 1.9880000707244874, "epoch": 1, "n_parameters": 27185576}
{"train_lr": 0.00020080000000000092, "train_loss": 6.662809805618488, "test_loss": 5.687711316527742, "test_acc1": 5.078000159225464, "test_acc5": 15.214000395431519, "epoch": 2, "n_parameters": 27185576}
{"train_lr": 0.000400599999999987, "train_loss": 6.347078082372817, "test_loss": 4.909713951927243, "test_acc1": 11.704000363845825, "test_acc5": 28.93800088180542, "epoch": 3, "n_parameters": 27185576}
{"train_lr": 0.0006003999999999824, "train_loss": 6.087260493891035, "test_loss": 4.192859256809408, "test_acc1": 20.250000597076415, "test_acc5": 42.184001130218505, "epoch": 4, "n_parameters": 27185576}
{"train_lr": 0.0008002000000000078, "train_loss": 5.829409284033268, "test_loss": 3.618216951688131, "test_acc1": 27.31400076171875, "test_acc5": 52.15800128326416, "epoch": 5, "n_parameters": 27185576}
{"train_lr": 0.0009993216197035084, "train_loss": 5.612617820334568, "test_loss": 3.220957176251845, "test_acc1": 33.55600084197998, "test_acc5": 59.50800168701172, "epoch": 6, "n_parameters": 27185576}
{"train_lr": 0.000999023230572016, "train_loss": 5.371808164792476, "test_loss": 2.8588044756289683, "test_acc1": 39.61400107498169, "test_acc5": 65.56400166351318, "epoch": 7, "n_parameters": 27185576}
{"train_lr": 0.000998670666226098, "train_loss": 5.203231047550075, "test_loss": 2.665417769641587, "test_acc1": 43.75600125, "test_acc5": 69.68200192260743, "epoch": 8, "n_parameters": 27185576}
{"train_lr": 0.0009982639653285214, "train_loss": 5.071858598912458, "test_loss": 2.3947895273114694, "test_acc1": 47.706001280212405, "test_acc5": 73.44600221710205, "epoch": 9, "n_parameters": 27185576}
{"train_lr": 0.0009978031724785232, "train_loss": 4.949590649631479, "test_loss": 2.2964160442352295, "test_acc1": 50.458001456604, "test_acc5": 75.70200245300293, "epoch": 10, "n_parameters": 27185576}
{"train_lr": 0.000997288338207296, "train_loss": 4.831694250650925, "test_loss": 2.1856239961855337, "test_acc1": 52.92200156524658, "test_acc5": 77.9720024118042, "epoch": 11, "n_parameters": 27185576}
{"train_lr": 0.0009967195189721821, "train_loss": 4.757332339656534, "test_loss": 2.07835605514772, "test_acc1": 54.860001388549804, "test_acc5": 79.68800242584228, "epoch": 12, "n_parameters": 27185576}
{"train_lr": 0.0009960967771506664, "train_loss": 4.6595591517994635, "test_loss": 1.9425015896558762, "test_acc1": 56.47200140106201, "test_acc5": 80.86000250915528, "epoch": 13, "n_parameters": 27185576}
{"train_lr": 0.0009954201810333753, "train_loss": 4.598976042321165, "test_loss": 1.9306896821115955, "test_acc1": 57.73400158081055, "test_acc5": 81.95600246551514, "epoch": 14, "n_parameters": 27185576}
{"train_lr": 0.0009946898048166896, "train_loss": 4.534675074233521, "test_loss": 1.8111045836950794, "test_acc1": 59.36600164520264, "test_acc5": 82.97000257263184, "epoch": 15, "n_parameters": 27185576}

Your logs:
{"train_lr": 1.000000000000014e-06, "train_loss": 6.9166167094230655, "test_loss": 6.881752743440516, "test_acc1": 0.18800001103878022, "test_acc5": 0.9300000336265564, "epoch": 0, "n_parameters": 24060776}
{"train_lr": 1.000000000000014e-06, "train_loss": 6.900423232269287, "test_loss": 6.852993618039524, "test_acc1": 0.41600001563549044, "test_acc5": 1.5720000462150574, "epoch": 1, "n_parameters": 24060776}
{"train_lr": 0.00040080000000000486, "train_loss": 6.646979278850555, "test_loss": 5.493567599969752, "test_acc1": 6.424000176010132, "test_acc5": 18.284000532073975, "epoch": 2, "n_parameters": 24060776}
{"train_lr": 0.0008005999999999952, "train_loss": 6.297702983379364, "test_loss": 4.646971811266506, "test_acc1": 14.610000506286621, "test_acc5": 32.9140008934021, "epoch": 3, "n_parameters": 24060776}
{"train_lr": 0.001200399999999992, "train_loss": 6.0142835487365724, "test_loss": 3.968842138262356, "test_acc1": 22.43800064086914, "test_acc5": 45.01200138336181, "epoch": 4, "n_parameters": 24060776}
{"train_lr": 0.001600200000000024, "train_loss": 5.753731050109863, "test_loss": 3.4170068081687477, "test_acc1": 30.60400089279175, "test_acc5": 55.76000146942139, "epoch": 5, "n_parameters": 24060776}
{"train_lr": 0.001998636387080776, "train_loss": 5.494997973155975, "test_loss": 2.9434911687584484, "test_acc1": 38.45400118209839, "test_acc5": 64.25200190032959, "epoch": 6, "n_parameters": 24060776}
{"train_lr": 0.001998036594786119, "train_loss": 5.245779330396652, "test_loss": 2.6588724194204105, "test_acc1": 44.24800133544922, "test_acc5": 69.74200233428955, "epoch": 7, "n_parameters": 24060776}
{"train_lr": 0.001997327904838336, "train_loss": 5.0572126257419585, "test_loss": 2.3770968151443146, "test_acc1": 48.44400132751465, "test_acc5": 73.88200233337402, "epoch": 8, "n_parameters": 24060776}
{"train_lr": 0.0019965103949532784, "train_loss": 4.904551318454742, "test_loss": 2.225582433097503, "test_acc1": 51.62800114135742, "test_acc5": 76.74000275421143, "epoch": 9, "n_parameters": 24060776}
{"train_lr": 0.0019955841547800715, "train_loss": 4.781468183612824, "test_loss": 2.1663901078350403, "test_acc1": 53.92400145446777, "test_acc5": 78.55800275604248, "epoch": 10, "n_parameters": 24060776}
{"train_lr": 0.0019945492858914225, "train_loss": 4.680502858829498, "test_loss": 2.019594290677239, "test_acc1": 56.45400156951904, "test_acc5": 80.33000265350341, "epoch": 11, "n_parameters": 24060776}
{"train_lr": 0.001993405901772395, "train_loss": 4.59881298751831, "test_loss": 1.8617978301994942, "test_acc1": 58.35600150360107, "test_acc5": 82.20000294708252, "epoch": 12, "n_parameters": 24060776}
{"train_lr": 0.001992154127807911, "train_loss": 4.521906917619705, "test_loss": 1.8242545925519045, "test_acc1": 59.48800161804199, "test_acc5": 82.6920026071167, "epoch": 13, "n_parameters": 24060776}
{"train_lr": 0.0019907941012691044, "train_loss": 4.435260578107834, "test_loss": 1.7272365518352564, "test_acc1": 61.238001587524415, "test_acc5": 84.06800270141602, "epoch": 14, "n_parameters": 24060776}
{"train_lr": 0.001989325971298189, "train_loss": 4.379169403886795, "test_loss": 1.6653209179639816, "test_acc1": 61.95000177520752, "test_acc5": 84.72800244293212, "epoch": 15, "n_parameters": 24060776}

Maybe I should set batch_size=256 and gpus=4?
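For what it's worth, the "global size" is just the per-GPU batch size times the number of processes, so both configurations discussed here give 1024 (a trivial sketch):

```python
def global_batch_size(batch_size_per_gpu, world_size):
    # Total samples per optimizer step across all distributed processes.
    return batch_size_per_gpu * world_size

assert global_batch_size(128, 8) == 1024  # the run whose logs are shown above
assert global_batch_size(256, 4) == 1024  # the proposed 4-GPU alternative
assert global_batch_size(256, 8) == 2048  # the original run from this issue
```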

@cxxgtxy
Collaborator

cxxgtxy commented Sep 16, 2021

Hmm.
The n_parameters doesn't match our log: ours is 24060776, while yours is 27185576.
Have you changed the model?

@Yangr116
Author

Oh, sorry, such a careless mistake. I will clone a fresh copy and try it again.
Thanks!

cxxgtxy closed this as completed Oct 13, 2021