FastAI seems very slow compared to "vanilla" Flux #187
When I try to train a simple ResNet on the CIFAR10 dataset, FastAI seems very slow compared to Flux (≈ 9-19 times slower). It could be a garbage collection problem: with Flux I can use a batch size of 512, while with FastAI I can't exceed 128 without getting an out-of-memory error.

FastAI code:

Flux code:

Results on an RTX 2080 Ti:

FastAI:
1841.008685 seconds (3.92 G allocations: 212.561 GiB, 59.59% gc time, 0.00% compilation time)

Flux:
98.444806 seconds (106.49 M allocations: 16.643 GiB, 3.58% gc time, 2.58% compilation time)

Results on a Quadro P5000:

FastAI:
1574.714976 seconds (3.92 G allocations: 212.473 GiB, 11.08% gc time)

Flux:
177.416636 seconds (105.55 M allocations: 16.639 GiB, 2.05% gc time, 1.42% compilation time)

Comments
Thanks for the report! I think the most likely culprit is that the images are being resized, since the method's default image size does not match CIFAR10's native 32×32. Could you let me know if the FastAI code runs faster without the upsizing, i.e. using `method = ImageClassificationSingle(blocks, size=(32, 32))`?
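To get a feel for why the resize matters: preprocessing cost grows with the square of the side length. The default size is not stated in this thread, so the 128 below is only a hypothetical value for illustration:

```julia
# Pixels processed per image when upsizing CIFAR10's native 32×32
# to a hypothetical 128×128 default (quadratic in the side length).
native = 32
upsized = 128          # hypothetical default, for illustration only
ratio = (upsized / native)^2
println("≈ $(Int(ratio))× more pixels per image at $(upsized)×$(upsized)")
```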
Thank you for the suggestion, that was the problem. I should have read the documentation of `ImageClassificationSingle` more carefully. Results on an RTX 2080 Ti: (I just discovered FastAI.jl, I love it!)
More of a documentation error then, not on your part 😉 That still seems pretty slow, but it is probably due to the amount of garbage collection happening, and to the dataset not being preloaded (FastAI.jl assumes out-of-memory datasets by default). You may get another speedup if you preload the batches once, since I think they will fit into memory. To do so, construct the learner and then run:

```julia
learner.data.training = collect(learner.data.training)
learner.data.validation = collect(learner.data.validation)
```

If you try it out, let me know if that improves the allocation issues!
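As a side note, if you want to sanity-check that the collected batches really do fit in RAM, a plain-Julia sketch (no FastAI-specific API assumed beyond the fields above):

```julia
# After collecting, measure the in-memory footprint of the batch vectors.
train_gib = Base.summarysize(learner.data.training) / 2^30
val_gib   = Base.summarysize(learner.data.validation) / 2^30
println("preloaded batches ≈ $(round(train_gib + val_gib, digits=2)) GiB")
```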
I will try that. During these experiments, I also noticed that GPU utilization drops during the validation phases.
GPU utilization tends to be a bit lower on validation splits, since the GPU needs to do less work relative to the data loading. If you try the above suggestion, the data loading overhead should become effectively zero.
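One way to see this directly is to time a bare pass over the data iterator, with no model in the loop; any time spent there is pure loading and preprocessing. A minimal sketch, assuming `learner.data.training` is iterable over `(xs, ys)` batches as in the snippets above:

```julia
# Time one epoch of data loading/preprocessing alone (no GPU work).
@time for (xs, ys) in learner.data.training
    # do nothing: the loop only exercises the data pipeline
end
```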
Collecting the data into RAM clearly reduces the garbage collection time, from 21% to 3%. Unfortunately, the training process is now completely broken: the training loss decreases quickly while the validation loss is unchanged. It looks like overfitting after one epoch, which is very surprising.

With collecting data:

```julia
using FastAI
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")

data, blocks = loaddataset("cifar10", (Image, Label))
model = resnet9(inchannels=3, nclasses=10, dropout=0.0)
method = ImageClassificationSingle(blocks, size=(32, 32))
learner = methodlearner(method, data;
                        lossfn=Flux.crossentropy,
                        callbacks=[ToGPU()],
                        batchsize=16,
                        model=model,
                        optimizer=Descent())
learner.data.training = collect(learner.data.training)
learner.data.validation = collect(learner.data.validation)
@time fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)
```
Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:33
Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 2 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 2 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 3 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 3 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 4 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 4 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 5 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 5 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 1.0 │ 0.66604 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 1.0 │ 2.30902 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 2.0 │ 0.00658 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 2.0 │ 2.31039 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 3.0 │ 0.00178 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 3.0 │ 2.31698 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 4.0 │ 0.00102 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 4.0 │ 2.32021 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 5.0 │ 0.00087 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 5.0 │ 2.32081 │
└─────────────────┴───────┴─────────┘
173.158782 seconds (110.26 M allocations: 13.375 GiB, 2.65% gc time, 0.01% compilation time)

Without collecting data:

```julia
using FastAI
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")

data, blocks = loaddataset("cifar10", (Image, Label))
model = resnet9(inchannels=3, nclasses=10, dropout=0.0)
method = ImageClassificationSingle(blocks, size=(32, 32))
learner = methodlearner(method, data;
                        lossfn=Flux.crossentropy,
                        callbacks=[ToGPU()],
                        batchsize=16,
                        model=model,
                        optimizer=Descent())
@time fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)
```
Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:39
Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 2 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:39
Epoch 2 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 3 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:40
Epoch 3 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 4 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:40
Epoch 4 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 5 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:43
Epoch 5 ValidationPhase(): 100%|████████████████████████| Time: 0:00:03
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 1.0 │ 2.31503 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼────────┤
│ ValidationPhase │ 1.0 │ 1.5084 │
└─────────────────┴───────┴────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 2.0 │ 1.32031 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 2.0 │ 1.28183 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 3.0 │ 0.96906 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 3.0 │ 0.97488 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 4.0 │ 0.62234 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 4.0 │ 0.83441 │
└─────────────────┴───────┴─────────┘
┌───────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├───────────────┼───────┼─────────┤
│ TrainingPhase │ 5.0 │ 0.42819 │
└───────────────┴───────┴─────────┘
┌─────────────────┬───────┬─────────┐
│ Phase │ Epoch │ Loss │
├─────────────────┼───────┼─────────┤
│ ValidationPhase │ 5.0 │ 0.80053 │
└─────────────────┴───────┴─────────┘
223.763756 seconds (3.90 G allocations: 147.653 GiB, 21.48% gc time, 0.07% compilation time)
Hm, try passing `buffered = false` to `methodlearner`. Collecting buffered data iterators hands you the same reused buffer for every batch, which would explain the broken training. If that fixes training, it looks like you'll be at Flux.jl speeds (which you really should be, since that part is basically the same).
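For the curious, here is a self-contained illustration (plain Julia, not the actual FastAI.jl internals) of why collecting a buffer-reusing iterator silently breaks things: every collected element is the same array, holding whatever was written last.

```julia
# A toy iterator that reuses one buffer for every element it yields,
# mimicking a buffered data iterator.
struct BufferedCounter
    n::Int
end

Base.length(it::BufferedCounter) = it.n

function Base.iterate(it::BufferedCounter, state=(zeros(Int, 1), 0))
    buf, i = state
    i >= it.n && return nothing
    buf[1] = i + 1              # overwrite the shared buffer in place
    return buf, (buf, i + 1)
end

batches = collect(BufferedCounter(3))
@show batches                   # [[3], [3], [3]]: all entries alias one buffer
```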
Yeah!!! It works!!! Quadro P5000:
177.265435 seconds (110.25 M allocations: 13.374 GiB, 3.17% gc time)
Great! I'll close this issue then.
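For anyone landing on this issue later, a sketch of the full configuration the thread converged on; the `buffered` keyword and the `ResNet9` package are as discussed above, so treat this as a summary of the thread rather than canonical API documentation:

```julia
using FastAI
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")

data, blocks = loaddataset("cifar10", (Image, Label))
model = resnet9(inchannels=3, nclasses=10, dropout=0.0)

# size=(32, 32) avoids the default upsizing of CIFAR10 images.
method = ImageClassificationSingle(blocks, size=(32, 32))

# buffered=false (as suggested above) makes batches safe to collect.
learner = methodlearner(method, data;
                        lossfn=Flux.crossentropy,
                        callbacks=[ToGPU()],
                        batchsize=16,
                        model=model,
                        optimizer=Descent(),
                        buffered=false)

# Preload the batches once to cut garbage collection during training.
learner.data.training = collect(learner.data.training)
learner.data.validation = collect(learner.data.validation)

fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)
```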