feat: update minimum version of Enzyme #950

Merged 15 commits on Sep 23, 2024

Conversation

avik-pal
Member

No description provided.

Contributor

github-actions bot commented Sep 22, 2024

Benchmark Results (ASV)

| | main | 850da8f... | main/850da8f0a7dda6... |
|---|---|---|---|
| basics/overhead | 0.0661 ± 0.035 μs | 0.0657 ± 0.036 μs | 1.01 |
| time_to_load | 1.01 ± 0.013 s | 1.01 ± 0.027 s | 1 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

Contributor

@github-actions github-actions bot left a comment

Lux Benchmarks

Benchmark suite Current: c50cf19 Previous: e23b1a7 Ratio
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 414791.5 ns 414542 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 243542 ns 243375 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 244208.5 ns 244500 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 740042 ns 740166 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 42388 ns 44280.5 ns 0.96
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 1321437.5 ns 1298541.5 ns 1.02
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 1229312.5 ns 1240562 ns 0.99
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 16261417 ns 16503791 ns 0.99
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 2264458 ns 2208500 ns 1.03
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 183483 ns 208333 ns 0.88
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 1308500 ns 1353521 ns 0.97
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 1273041.5 ns 1293417 ns 0.98
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 16436500 ns 16423250 ns 1.00
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 2235541.5 ns 2228104 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1657250 ns 1657875 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1088333 ns 1092375 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1520146 ns 1539021 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2997750 ns 3020458.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 207711.5 ns 210061.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12153437.5 ns 12139208 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 8843791 ns 8813375 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9249312.5 ns 9256875 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18622833.5 ns 18601500 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1489876 ns 1487808 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17298791 ns 17301000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 13994979 ns 13890083 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14489500 ns 14536416 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21846041.5 ns 21849875 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 251178625 ns 250662791.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148605583 ns 148483145.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 116398999.5 ns 116244708 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 446999375 ns 447941542 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5472271 ns 5468492 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1223144709 ns 1220473417 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 928990375 ns 928051958 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 831507354 ns 828338104 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1629384125 ns 1629213792 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35526811.5 ns 31128714 ns 1.14
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1143800125 ns 1068598834 ns 1.07
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 997671521 ns 965131583 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1328838395.5 ns 1298869062.5 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1731514166.5 ns 1731451000 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 1118417 ns 1105521 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 1650333.5 ns 1504458.5 ns 1.10
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 3710604 ns 3588000 ns 1.03
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 784209 ns 785542 ns 1.00
lenet(28, 28, 1, 32)/forward/GPU/CUDA 253692.5 ns 270147 ns 0.94
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2988750 ns 2989521 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 4171208 ns 4100375 ns 1.02
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 9812375 ns 10725834 ns 0.91
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3155124.5 ns 3151334 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1036737 ns 1127024.5 ns 0.92
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 2274041 ns 2273083 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1338187.5 ns 1320687.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1534667 ns 1566750 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 4209229 ns 4217125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 208597 ns 209825 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 19426333 ns 19419958 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 16097375 ns 16062459 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 17405000 ns 17207666.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 25901083 ns 25925499.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1566916 ns 1590537 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 34419979 ns 33976167 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 30832770.5 ns 30847604 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 31123000 ns 31068500 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 36813875 ns 36660479 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 4541792 ns 4532334 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2557458 ns 2536917 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2686000 ns 2708417 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 8396709 ns 8394875.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 422312.5 ns 425554 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 38660187.5 ns 39076750 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 32141917 ns 32039312.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 32270291.5 ns 32300542 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 51991750 ns 51878792 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2613911 ns 2625717.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 89092250 ns 89157937.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 114709916.5 ns 110310354.5 ns 1.04
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 221810459 ns 221196292 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 74349792 ns 74661583.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 268885250 ns 268645541 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 156282916 ns 155966250 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 123597854.5 ns 123152709 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 485334917 ns 485576375 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 6952906 ns 7017925 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1470046750 ns 1469993416.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 1176499541 ns 1172293917 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 1069804500 ns 1071179125 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 2009798771 ns 2008263562.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33458527 ns 34758889.5 ns 0.96
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1720526375 ns 1722939167 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1530878833 ns 1515630729 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1828314292 ns 1805980375 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 2206211958 ns 2204894250 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 2052125 ns 2101917 ns 0.98
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 3035291 ns 2855250 ns 1.06
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 7850416 ns 8250875 ns 0.95
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2498708 ns 2316458.5 ns 1.08
lenet(28, 28, 1, 128)/forward/GPU/CUDA 256842.5 ns 271499 ns 0.95
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 9639125 ns 9314958 ns 1.03
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 12000708.5 ns 12005750.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 25781625.5 ns 24338916.5 ns 1.06
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 11656583 ns 11759333 ns 0.99
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1105843 ns 1189529 ns 0.93
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 379304250 ns 379542666.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 312223333 ns 310121896 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 265389417 ns 270228604.5 ns 0.98
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 453576812.5 ns 452462041.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 5015585 ns 4858112 ns 1.03
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 1158565250 ns 1161116458 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 944323167 ns 936045042 ns 1.01
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 955187000 ns 1039056250 ns 0.92
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 1580959458 ns 1397951750 ns 1.13
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 18148892 ns 17884006 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1061625 ns 1057792 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 1658917 ns 1665375 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 6090125 ns 4671500 ns 1.30
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1295666 ns 1297417 ns 1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA 260891.5 ns 269688.5 ns 0.97
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6500250 ns 6411041 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 13097937.5 ns 13166167 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 19774583 ns 18369000 ns 1.08
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5951416.5 ns 5854395.5 ns 1.02
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1122497 ns 1228485 ns 0.91
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70546958.5 ns 70564395.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43796229 ns 43714083.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39636104 ns 39753208 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132661062.5 ns 132540542 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1868631 ns 1943140 ns 0.96
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 354680146 ns 355335375 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 271168625 ns 270403333 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 254678125.5 ns 253291937.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 534791500 ns 534663375 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 12208027.5 ns 12307495 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 396555208 ns 395656250 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 372269500 ns 373284833 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 701835916.5 ns 655973250 ns 1.07
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 712568833 ns 711770458 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 1189848583 ns 1188878875 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 833830958 ns 830603770.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 640280646 ns 640453979 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 1861736125 ns 1769157145.5 ns 1.05
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12537044 ns 12306601 ns 1.02
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 3611672771 ns 3632733895.5 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 2827788584 ns 2812753583 ns 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 2708940917 ns 2711988875 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 5044323958 ns 5018496208 ns 1.01
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 50084733 ns 50053860 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3419375 ns 3404250 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2077292 ns 2081562.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2533750 ns 2527791.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6017459 ns 6026500 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 341660 ns 313980.5 ns 1.09
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 26005375 ns 26041958 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18967208.5 ns 18880958 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19463604.5 ns 19381417 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 39291708 ns 39366250 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2471291 ns 2467954 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 54418125 ns 54391666.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 86018875.5 ns 79414959 ns 1.08
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 171169959 ns 173499479 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 45519250 ns 45644334 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1783916 ns 1779541 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1104250 ns 1103458.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1563167 ns 1565229.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3032874.5 ns 3034833 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 211049 ns 212435.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12563437.5 ns 12548145.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9217083.5 ns 9176604 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9631250 ns 9628291.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 19005416 ns 19022333 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1526304 ns 1541164.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17698625.5 ns 17655271 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14335500 ns 14328958 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14564083 ns 14577375 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 22165187.5 ns 22195583.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70561500 ns 70632792 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43700667 ns 43626937.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39722083 ns 39727333 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132657749.5 ns 132702083.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1882209.5 ns 1875633.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 359130042 ns 359948021 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 348301792 ns 346896729.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 306065708 ns 305342083 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 725971042 ns 725230792 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13259370 ns 13377006 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 420304708.5 ns 420544646 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 432108625 ns 420636999.5 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 727201875 ns 764717937 ns 0.95
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 717046250 ns 716168625 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 1574708 ns 1511645.5 ns 1.04
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 1159667 ns 1154625 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 1144375 ns 1163583 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2438562.5 ns 2456083 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 548169 ns 583442.5 ns 0.94
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 8832459 ns 8867000 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 13726374.5 ns 13888042 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 33086417 ns 33278833 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 9847166 ns 9863333 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1196777.5 ns 1464876 ns 0.82
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 16611854 ns 16574334 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 23649375 ns 22600145.5 ns 1.05
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 46695875 ns 44879979.5 ns 1.04
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 13153687.5 ns 13139812.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 829896 ns 828583.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 555875 ns 420291.5 ns 1.32
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 1060479 ns 1049375 ns 1.01
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 725375 ns 724875 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 45421 ns 47459.5 ns 0.96
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 1512083 ns 1513208 ns 1.00
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 1055000 ns 954458 ns 1.11
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 1388125 ns 1716520.5 ns 0.81
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 2265146 ns 2271209 ns 1.00
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 202550 ns 238389 ns 0.85
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 1543291 ns 1546208 ns 1.00
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 1091104 ns 1060229.5 ns 1.03
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 1431354.5 ns 1489458.5 ns 0.96
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 2186250 ns 2241416.5 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3401958 ns 3400042 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2072354 ns 2070874.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2518291 ns 2520375 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6005604.5 ns 6012875 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 280608 ns 288345 ns 0.97
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24117958 ns 24060000 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17171228.5 ns 17205708 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17125854.5 ns 17116750 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37574166.5 ns 37647979.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2388385.5 ns 2410484 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 52892562.5 ns 52867250 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 84179291 ns 80422375 ns 1.05
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 169795125 ns 170489625 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 44566187.5 ns 44608687.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 250397500 ns 250519729 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148498375 ns 148647875 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 116583854 ns 116284938 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 448019250 ns 447812229.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5330481 ns 5466666 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1104069125 ns 1101838792 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 855788729 ns 857350166.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 828683271 ns 827927395.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1752666667 ns 1752721167 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33522328 ns 28896656 ns 1.16
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1030228896 ns 1027872958 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 979512375 ns 949678541 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1354268292 ns 1283911125 ns 1.05
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1724871187.5 ns 1723765709 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1231354 ns 1101708 ns 1.12
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 765500 ns 680396 ns 1.13
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 717667 ns 667396 ns 1.08
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 1948020.5 ns 2049895.5 ns 0.95
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 548471.5 ns 572066.5 ns 0.96
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 5879167 ns 5888125 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 9182833 ns 8353229 ns 1.10
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 25543333 ns 25738625.5 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 7104041 ns 7117458 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1204148 ns 1386537.5 ns 0.87
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 9688479 ns 9689104 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 16045750 ns 15038229.5 ns 1.07
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 33941000 ns 32959125 ns 1.03
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 7631979 ns 7631000 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 516896 ns 512854 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 463000 ns 285500 ns 1.62
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 3142521 ns 3290708.5 ns 0.95
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 89875 ns 90000 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 25345 ns 28008 ns 0.90
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 380124.5 ns 381292 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 455208 ns 433083.5 ns 1.05
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 4525125 ns 4497542 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 259083 ns 258500 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 183592 ns 224122.5 ns 0.82
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 413687.5 ns 411708.5 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 485958 ns 463834 ns 1.05
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 4739000 ns 4857917 ns 0.98
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 271458 ns 271354.5 ns 1.00
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 463667 ns 464854.5 ns 1.00
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 397750 ns 219854 ns 1.81
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 763875 ns 760000 ns 1.01
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 54125 ns 53292 ns 1.02
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 25762 ns 28360 ns 0.91
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 341750 ns 340833 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 356520.5 ns 326666 ns 1.09
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 405146 ns 697604 ns 0.58
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 151833.5 ns 151625 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 177212 ns 210056 ns 0.84
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 354416 ns 354167 ns 1.00
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 372771 ns 340541 ns 1.09
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 939209 ns 612458 ns 1.53
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 150979.5 ns 151208 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 601604709 ns 601611250 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 431134020.5 ns 429098979 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 390905750 ns 392612937.5 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 870753333.5 ns 871912417 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7024858 ns 7031843.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1996126813 ns 2003215979.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1639432666.5 ns 1588632104 ns 1.03
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 1595233041 ns 1645858395.5 ns 0.97
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 2813310666 ns 2622754667 ns 1.07
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26517264 ns 26077633.5 ns 1.02
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 536666.5 ns 531041.5 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 393792 ns 392562.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 3110625 ns 3112916 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 868625 ns 869500 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 45816 ns 47171 ns 0.97
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1766250 ns 1751750 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 1774583 ns 1762291.5 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 16288250 ns 16309167 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 2656854.5 ns 2771167 ns 0.96
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 214662.5 ns 251324 ns 0.85
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 1947895.5 ns 1848417 ns 1.05
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 1855875 ns 1852416 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 16221500 ns 16667979 ns 0.97
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 2699375 ns 2787916 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1472479 ns 1351458 ns 1.09
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1067166 ns 1027312 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 953209 ns 931875 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2316333 ns 2324458 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 544815 ns 584909.5 ns 0.93
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 5920125 ns 5897250 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 8507083 ns 8354604.5 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 26696458 ns 26379334 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 7328749.5 ns 7333291 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1150772 ns 1385897 ns 0.83
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 11704062.5 ns 11681167 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 18261208.5 ns 18190500 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 37364333 ns 38237709 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 9551958.5 ns 9556291 ns 1.00
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 2562.5 ns 2584 ns 0.99
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 2667 ns 3604 ns 0.74
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 4458 ns 3542 ns 1.26
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 3312.5 ns 2417 ns 1.37
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 21967 ns 24985 ns 0.88
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 7250 ns 7208 ns 1.01
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 7584 ns 7083 ns 1.07
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 7208 ns 7334 ns 0.98
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 7333 ns 7167 ns 1.02
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 172421.5 ns 216583.5 ns 0.80
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 8375 ns 8250 ns 1.02
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 8458 ns 8458.5 ns 1.00
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 8416 ns 8666 ns 0.97
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 6000 ns 5875 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 10958 ns 9937.5 ns 1.10
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 13250 ns 13041.5 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 11313 ns 10375 ns 1.09
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 8896 ns 7334 ns 1.21
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 21893 ns 25394 ns 0.86
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 20125 ns 19792 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 20042 ns 19833 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 19958 ns 20042 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 20000 ns 19875 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 185905 ns 236401.5 ns 0.79
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 23500 ns 23500 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 24042 ns 23500 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 23833 ns 23875 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 21375 ns 21416 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 28583 ns 28583.5 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 28375 ns 28625 ns 0.99
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 28541 ns 28895.5 ns 0.99
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 46292 ns 46292 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 23169 ns 26179 ns 0.89
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 229687 ns 224187.5 ns 1.02
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 273083 ns 270084 ns 1.01
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 4053833.5 ns 4123000 ns 0.98
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 145000 ns 145500 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 179067.5 ns 212922.5 ns 0.84
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 248875 ns 242145.5 ns 1.03
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 291125 ns 287834 ns 1.01
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 4471646 ns 4006583 ns 1.12
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 145666 ns 145875 ns 1.00
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 1833 ns 2000 ns 0.92
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 1958 ns 1500 ns 1.31
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 2875 ns 2458 ns 1.17
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 2125 ns 2041 ns 1.04
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 20308 ns 23181 ns 0.88
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5250 ns 5375 ns 0.98
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5250 ns 5000 ns 1.05
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 5333 ns 5417 ns 0.98
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 5250 ns 4917 ns 1.07
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 214028.5 ns 277300.5 ns 0.77
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 7375 ns 7458 ns 0.99
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 7417 ns 7375 ns 1.01
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 7667 ns 7792 ns 0.98
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 5208 ns 5125 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 79999542 ns 79972458 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 48006125 ns 47857917 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 43132000 ns 43307917 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 151541792 ns 151540125 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2689995.5 ns 2710162 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 606576500 ns 662506875 ns 0.92
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 409946916 ns 410576167 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 398008417 ns 397618416.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 683586167 ns 683832250 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16955716 ns 14567626 ns 1.16
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 714999792 ns 714189229 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 687166500 ns 665454667 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 1015067167 ns 1013219583 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 997955125 ns 1002665583 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal force-pushed the ap/up_enzyme branch 4 times, most recently from 87e09f9 to bbd3418, on September 22, 2024 17:47
@avik-pal force-pushed the ap/up_enzyme branch 11 times, most recently from b833f6f to 04ef36a, on September 23, 2024 01:36
@avik-pal merged commit 283db4e into main on Sep 23, 2024 (13 of 16 checks passed)
@avik-pal deleted the ap/up_enzyme branch on September 23, 2024 03:09