FastAI seems very slow compared to "vanilla" Flux #187

Closed

a-r-n-o-l-d opened this issue Dec 13, 2021 · 9 comments

Comments

@a-r-n-o-l-d
When I try to train a simple ResNet on the CIFAR10 dataset, FastAI.jl seems very slow compared to "vanilla" Flux (≈ 9-19 times slower).
It may be a garbage collection problem: with Flux I can use a batch size of 512, while with FastAI I can't exceed 128 without getting an out-of-memory error.
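(A common workaround when GPU memory pressure like this is suspected, not something tried in this thread, is to release memory manually between epochs with CUDA.jl; a minimal sketch:)

using CUDA

# Free unreachable Julia objects that may still hold GPU buffers,
# then return CUDA.jl's cached device memory to the driver.
GC.gc()
CUDA.reclaim()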

FastAI code:

using FastAI
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")

data, blocks = loaddataset("cifar10", (Image, Label))
method = ImageClassificationSingle(blocks)
model = resnet9(inchannels=3, nclasses=10, dropout=0.0)
learner = methodlearner(method, data; 
    lossfn=Flux.crossentropy,
    callbacks=[ToGPU()],
    batchsize=16,
    model=model,
    optimizer=Descent())

@time fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)

Flux code:

using Flux
using Flux: DataLoader, onehotbatch
using Augmentor
using MLDatasets
using ParameterSchedulers
using ParameterSchedulers: Scheduler
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")

normpip = SplitChannels() |> PermuteDims(3, 2, 1) |> ConvertEltype(Float32)

labels = CIFAR10.classnames() .|> Symbol

function datasets(batchsize)
    train = let
        x = CIFAR10.traintensor() |> CIFAR10.convert2image
        y = map(i -> labels[i + 1], CIFAR10.trainlabels())
        DataLoader((x, y), batchsize = batchsize, shuffle = true, partial = false)
    end

    test = let
        x = CIFAR10.testtensor() |> CIFAR10.convert2image
        y = map(i -> labels[i + 1], CIFAR10.testlabels())
        DataLoader((x, y), batchsize = batchsize)
    end
    
    train, test
end

function minibatch(x, y)
    h, w, n = size(x)
    xb = Array{Float32}(undef, w, h, 3, n)
    augmentbatch!(CPUThreads(), xb, x, normpip)
    yb = onehotbatch(y, labels)
    xb, yb
end

function train!(model, optimiser, nepochs)
    loss_hist = []
    loss(x, y) = Flux.crossentropy(model(x), y)
    ps = params(model)
    for e in 1:nepochs
        # Training phase
        tloss = 0
        trainmode!(model)
    for (x, y) in train
            x, y = minibatch(x, y) |> gpu
            gs = gradient(ps) do
                l = loss(x, y)
                tloss += l
                l
            end
            Flux.Optimise.update!(optimiser, ps, gs)
        end
        tloss /= length(train)
        # Validation phase
        testmode!(model)
        vloss = 0
        for (x, y) in test
            x, y = minibatch(x, y) |> gpu
            vloss += loss(x, y)
        end
        vloss /= length(test)
        push!(loss_hist, (tloss, vloss))
    end
    
    loss_hist
end

train, test = datasets(16)
nepochs = 5
s = Triangle(λ0 = 1f-5, λ1 = 1f-3, period = nepochs * length(train))
opt = Scheduler(s, Descent())
model = resnet9(inchannels = 3, nclasses = 10, dropout = 0.0) |> gpu
@time train!(model, opt, nepochs)

Results on a RTX 2080 Ti:
FastAI:
1841.008685 seconds (3.92 G allocations: 212.561 GiB, 59.59% gc time, 0.00% compilation time)
Flux:
98.444806 seconds (106.49 M allocations: 16.643 GiB, 3.58% gc time, 2.58% compilation time)

Results on a Quadro P5000:
FastAI:
1574.714976 seconds (3.92 G allocations: 212.473 GiB, 11.08% gc time)
Flux:
177.416636 seconds (105.55 M allocations: 16.639 GiB, 2.05% gc time, 1.42% compilation time)

@lorenzoh
Member

lorenzoh commented Dec 14, 2021

Thanks for the report!

I think the most likely culprit is the image resizing: ImageClassificationSingle defaults to resizing every image to (128, 128), while the CIFAR10 source images are only 32×32. That alone is a theoretical 16x (128^2 / 32^2) difference in per-image work. The upsizing doesn't make sense here, but it is the default since many image datasets have differing image sizes.

Could you let me know if the FastAI code runs faster without the upsizing, i.e. using

method = ImageClassificationSingle(blocks, size=(32, 32))
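
As a quick sanity check (a sketch, assuming the data container supports getobs and returns a raw (image, label) pair before any resizing):

using FastAI

data, blocks = loaddataset("cifar10", (Image, Label))
image, label = getobs(data, 1)  # raw sample, before the method's resize step
@show size(image)               # (32, 32) for CIFAR10
@show 128^2 / 32^2              # 16.0, the theoretical per-image work inflation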

@a-r-n-o-l-d
Author

Thank you for the suggestion; that was the problem. I should have read the documentation of ImageClassificationSingle more carefully.

Results on RTX 2080 Ti:
159.958922 seconds (3.91 G allocations: 147.793 GiB, 24.01% gc time)
Results on Quadro P5000:
228.792806 seconds (3.90 G allocations: 147.644 GiB, 23.16% gc time)

(I just discovered FastAI.jl, and I love it.)

@lorenzoh
Member

lorenzoh commented Dec 14, 2021

More of a documentation error, then, not on your part 😉

That still seems pretty slow, but that is probably due to the amount of garbage collection happening and to the fact that the dataset isn't preloaded (FastAI.jl assumes out-of-memory datasets by default). You may get another speedup if you preload the batches once, since I think they will fit into memory. To do so, construct the Learner and then, before training, do

learner.data.training = collect(learner.data.training)
learner.data.validation = collect(learner.data.validation)

If you try it out, let me know if that improves the allocation issues!

@a-r-n-o-l-d
Author

I will try that. During these experiments, Volatile GPU-Util is not always above 90% (most of the time it is, but it can quickly drop to 0%). I suspect I need to optimize the data loading part.

@lorenzoh
Member

GPU utilization tends to be a bit lower on validation splits, since the GPU needs to do less work compared to the data loading part. If you try the above suggestion, the data loading overhead will be effectively zero.

@a-r-n-o-l-d
Author

Collecting the data into RAM clearly reduces garbage collection time, from 21% to 3%. Unfortunately, the training process is completely broken: the training loss decreases quickly while the validation loss is unchanged. It looks like overfitting after one epoch, which is very, very surprising.

With collecting data:

using FastAI
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")

data, blocks = loaddataset("cifar10", (Image, Label))
model = resnet9(inchannels=3, nclasses=10, dropout=0.0)
method = ImageClassificationSingle(blocks, size=(32, 32))
learner = methodlearner(method, data; 
    lossfn=Flux.crossentropy,
    callbacks=[ToGPU()],
    batchsize=16,
    model=model,
    optimizer=Descent())
learner.data.training = collect(learner.data.training)
learner.data.validation = collect(learner.data.validation)

@time fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)

Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:33
Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 2 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 2 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 3 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 3 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 4 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 4 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01
Epoch 5 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:32
Epoch 5 ValidationPhase(): 100%|████████████████████████| Time: 0:00:01

┌─────────────────┬───────┬─────────┐
│           Phase │ Epoch │    Loss │
├─────────────────┼───────┼─────────┤
│   TrainingPhase │   1.0 │ 0.66604 │
│ ValidationPhase │   1.0 │ 2.30902 │
│   TrainingPhase │   2.0 │ 0.00658 │
│ ValidationPhase │   2.0 │ 2.31039 │
│   TrainingPhase │   3.0 │ 0.00178 │
│ ValidationPhase │   3.0 │ 2.31698 │
│   TrainingPhase │   4.0 │ 0.00102 │
│ ValidationPhase │   4.0 │ 2.32021 │
│   TrainingPhase │   5.0 │ 0.00087 │
│ ValidationPhase │   5.0 │ 2.32081 │
└─────────────────┴───────┴─────────┘
173.158782 seconds (110.26 M allocations: 13.375 GiB, 2.65% gc time, 0.01% compilation time)

Without collecting data:

using FastAI
using ResNet9 # Pkg.add(url = "https://github.com/a-r-n-o-l-d/ResNet9.jl", rev="v0.1.1")

data, blocks = loaddataset("cifar10", (Image, Label))
model = resnet9(inchannels = 3, nclasses = 10, dropout = 0.0)
method = ImageClassificationSingle(blocks, size=(32, 32))
learner = methodlearner(method, data; 
    lossfn=Flux.crossentropy,
    callbacks=[ToGPU()],
    batchsize=16,
    model=model,
    optimizer=Descent())

@time fitonecycle!(learner, 5, 1f-3, pct_start=0.5, divfinal=100, div=100)

Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:39
Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 2 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:39
Epoch 2 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 3 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:40
Epoch 3 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 4 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:40
Epoch 4 ValidationPhase(): 100%|████████████████████████| Time: 0:00:04
Epoch 5 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:43
Epoch 5 ValidationPhase(): 100%|████████████████████████| Time: 0:00:03

┌─────────────────┬───────┬─────────┐
│           Phase │ Epoch │    Loss │
├─────────────────┼───────┼─────────┤
│   TrainingPhase │   1.0 │ 2.31503 │
│ ValidationPhase │   1.0 │  1.5084 │
│   TrainingPhase │   2.0 │ 1.32031 │
│ ValidationPhase │   2.0 │ 1.28183 │
│   TrainingPhase │   3.0 │ 0.96906 │
│ ValidationPhase │   3.0 │ 0.97488 │
│   TrainingPhase │   4.0 │ 0.62234 │
│ ValidationPhase │   4.0 │ 0.83441 │
│   TrainingPhase │   5.0 │ 0.42819 │
│ ValidationPhase │   5.0 │ 0.80053 │
└─────────────────┴───────┴─────────┘
223.763756 seconds (3.90 G allocations: 147.653 GiB, 21.48% gc time, 0.07% compilation time)

@lorenzoh
Member

lorenzoh commented Dec 14, 2021

Hm, try passing buffered = false to methodlearner. I think collect reuses the same buffer, since buffered = true by default, so what I wrote above doesn't work for collecting the whole dataset: every element of the collected vector ends up aliasing the same underlying array (the last batch written), which would explain the broken training.

If that fixes training, it looks like you'll be at Flux.jl speeds (which you really should be, since that part is basically the same).
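
A minimal illustration of the pitfall in plain Julia (a toy generator, not FastAI.jl's actual loader): collecting an iterator that reuses one buffer yields a vector whose elements all alias that buffer.

buf = zeros(Int, 2)
reused = (copyto!(buf, [i, i]) for i in 1:3)  # each iteration overwrites `buf` and yields it
batches = collect(reused)
batches == [[3, 3], [3, 3], [3, 3]]           # true: every element holds the last write
all(b === buf for b in batches)               # true: they are all the SAME array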

@a-r-n-o-l-d
Author

Yeah!!! It works!!!

Quadro P5000: 177.265435 seconds (110.25 M allocations: 13.374 GiB, 3.17% gc time)
RTX 2080 Ti: 91.993256 seconds (111.44 M allocations: 13.404 GiB, 4.17% gc time)
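
For reference, a sketch combining the fixes from this thread (reusing blocks, data, and model from the snippets above): native 32×32 size, buffered = false so collect yields independent batches, then preloading into RAM.

method = ImageClassificationSingle(blocks, size=(32, 32))
learner = methodlearner(method, data;
    lossfn=Flux.crossentropy,
    callbacks=[ToGPU()],
    batchsize=16,
    buffered=false,      # independent batch arrays, safe to collect
    model=model,
    optimizer=Descent())
learner.data.training = collect(learner.data.training)
learner.data.validation = collect(learner.data.validation)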

@lorenzoh
Member

Great! I'll close this issue then.
