onehotbatch performance

Following a [Discourse thread](https://discourse.julialang.org/t/flux-1d-convolutions-on-genomic-data/74874), it was suggested that I open a GitHub issue about the performance of `onehotbatch`. Consider the following MWE:

```julia
using Flux
using BenchmarkTools

const bases_dna = ['A', 'C', 'G', 'T']

function ohe_custom(sequence)
    return collect(sequence) .== permutedims(bases_dna)
end

function ohe_flux(sequence)
    return Flux.onehotbatch(collect(sequence), bases_dna)
end

sequence = "CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGA"
```

The two functions return matrices with transposed dimensions: `ohe_custom` gives one row per sequence position, while `ohe_flux` gives one column per position. Applying `permutedims` to either result makes the shapes match, and apart from that they encode the same one-hot matrix. So far, so good. However, benchmarking them gives the following:

```julia
@btime ohe_custom(sequence);
# output: 550.514 ns (5 allocations: 464 bytes)
```
and
```julia
@btime ohe_flux(sequence);
# output: 69.274 μs (374 allocations: 17.30 KiB)
```

As we can see, `ohe_flux` is more than 100 times slower than `ohe_custom` (about 126×), with roughly 75 times as many allocations and about 38 times as much memory allocated.

Another minor detail is the size of the different outputs:
```julia
Base.summarysize(sequence)
# output: 54
```
```julia
Base.summarysize(ohe_custom(sequence))
# output: 96
```
```julia
Base.summarysize(ohe_flux(sequence))
# output: 232
```
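
For reference, the transposed layouts mentioned above can be reconciled without `permutedims` on the result by swapping the roles of the two broadcast arguments. The following is only a minimal sketch using Base Julia; the names `BASES`, `ohe_rows`, and `ohe_cols` are hypothetical and chosen to avoid clashing with the MWE. It does not reproduce Flux's `OneHotMatrix` type, only its base-by-position orientation.

```julia
# Hypothetical alphabet constant, equivalent to `bases_dna` in the MWE.
const BASES = ['A', 'C', 'G', 'T']

# Same broadcast as `ohe_custom`: one row per position, one column per base.
ohe_rows(sequence) = collect(sequence) .== permutedims(BASES)

# Roles swapped: one row per base, one column per position,
# i.e. the orientation that `Flux.onehotbatch` returns.
ohe_cols(sequence) = BASES .== permutedims(collect(sequence))

seq = "ACGTA"
size(ohe_rows(seq))  # (5, 4)
size(ohe_cols(seq))  # (4, 5)

# The two layouts are transposes of each other.
permutedims(ohe_rows(seq)) == ohe_cols(seq)  # true
```

Both variants return a plain `BitMatrix`, so this only addresses the shape difference, not whatever causes the overhead measured for `ohe_flux`.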


onehotbatch performance #1844