Optimizations #4

Open
Jerry-Master opened this issue May 6, 2024 · 4 comments

Comments

@Jerry-Master

Using the benchmark from here, I wrote some optimizations for your CUDA kernels that speed up the backward pass while maintaining numerical accuracy within a tolerance of 1e-4. The results are below; compare fusedfourierkan-gpu with myfusedfourierkan-gpu.

                       | forward time  |backward time  |  forward mem  | backward mem  |   num params  |  num trainable params
----------------------------------------------------------------------------------------------------------------------------------
effkan-gpu             |      4.41 ms  |      5.97 ms  |      0.13 GB  |      0.19 GB  |     10010000  |              10010000
fourierkan-gpu         |     17.98 ms  |     14.73 ms  |      1.96 GB  |      2.01 GB  |     10011001  |              10011001
fusedfourierkan-gpu    |     29.08 ms  |   2218.09 ms  |      0.09 GB  |      0.13 GB  |     10011001  |              10011001
myfusedfourierkan-gpu  |     30.46 ms  |     49.09 ms  |      0.09 GB  |      0.13 GB  |     10011001  |              10011001
mlp-gpu                |      0.37 ms  |      1.09 ms  |      0.10 GB  |      0.13 GB  |     10020001  |              10020001
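
For anyone who wants to run a similar accuracy check, here is a minimal sketch of comparing a reference gradient against an optimized one. It assumes the 1e-4 tolerance is an absolute, element-wise bound (the comment above does not say whether it is absolute or relative), and the function names and sample values are hypothetical, not taken from the benchmark.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("gradient buffers must have the same length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def grads_match(reference, candidate, atol=1e-4):
    """True when every element of candidate is within atol of reference.

    Assumption: the tolerance is absolute, not relative.
    """
    return max_abs_diff(reference, candidate) <= atol


# Hypothetical gradient values, made up purely for illustration.
ref_grad = [0.12345, -1.5, 3.0]
opt_grad = [0.12349, -1.50003, 3.00002]

print("max abs diff:", max_abs_diff(ref_grad, opt_grad))
print("within 1e-4:", grads_match(ref_grad, opt_grad))
```

In a real comparison the two sequences would be the flattened gradients produced by the original and the optimized backward kernels on the same inputs.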

If you change the license to MIT or Apache, I will open a pull request. There are more optimizations to make, and I will keep adding them when I have time. So if you want them, just change the license.

@unrealwill
Collaborator

@Jerry-Master Thanks for your interest, and good work optimizing the backward pass :)

I'll probably add some optimizations in the future (the backward pass looks really bad :) but it was getting late and I wanted to push).

For now, you can enjoy having an edge for investigating the properties of fourierKAN more efficiently :)

Sorry, I don't want to change the license to a more open one.

I need to earn some money, and this project is kind of an experiment in monetizing some research algorithms. It's about striking a balance: offering enough that people can investigate the properties, but not so much that it gets blatantly copied and people have no edge to gain from using the commercial version. Kind of like selling more efficient pickaxes to miners during a gold rush.

There are probably plenty of private research labs around the world like mine, sitting on tons of tricks, techniques and algorithms of varying value, looking for ways to monetize them while having some positive impact on the world.

The whole economics of deep learning and algorithmic research is completely messed up:

You've got big actors in favorable positions milking their cows for as long as possible while releasing as slowly as possible and controlling research tools; wannabe big actors running on VC fumes, selling at a loss to gain market share; hardware manufacturers controlling the compute; public universities offering research for free; small actors looking for attention in order to exist; and state actors sponsoring their flocks to various degrees for various purposes.

Interesting times ahead :)

@unrealwill
Collaborator

I've just pushed an optimization for the backward pass. It should be much better (probably on par with what you've done, though I haven't benchmarked it yet).

@Jerry-Master
Author

Thanks for the answer. It is true that the business models are a bit messed up. In any case, I will probably publish my tricks on my own.

@Jerry-Master
Author

Jerry-Master commented May 9, 2024

I have updated my benchmark with your new implementation and with mine. Cross-posting here.

                     | forward time  |backward time  |  forward mem  | backward mem  |   num params  |  num trainable params
----------------------------------------------------------------------------------------------------------------------------------
effkan-cpu           |     33.31 ms  |     43.63 ms  |       nan GB  |       nan GB  |     10010000  |              10010000
effkan-gpu           |      4.15 ms  |      3.69 ms  |      0.13 GB  |      0.19 GB  |     10010000  |              10010000
fourierkan-cpu       |    798.43 ms  |    929.11 ms  |       nan GB  |       nan GB  |     10011001  |              10011001
fourierkan-gpu       |     19.20 ms  |     14.80 ms  |      1.96 GB  |      2.01 GB  |     10011001  |              10011001
fusedfourierkan-cpu  |    914.66 ms  |   1646.11 ms  |       nan GB  |       nan GB  |     10011001  |              10011001
fusedfourierkan-gpu  |     30.14 ms  |     84.01 ms  |      0.09 GB  |      0.13 GB  |     10011001  |              10011001
cufkan-cpu           |   1454.64 ms  |   3807.97 ms  |       nan GB  |       nan GB  |     10011001  |              10011001
cufkan-gpu           |      6.24 ms  |     50.71 ms  |      0.09 GB  |      0.13 GB  |     10011001  |              10011001
chebykan-cpu         |     22.16 ms  |     26.90 ms  |       nan GB  |       nan GB  |     10010000  |              10010000
chebykan-gpu         |      5.89 ms  |      8.03 ms  |      0.14 GB  |      0.13 GB  |     10010000  |              10010000
mlp-cpu              |      6.60 ms  |     10.56 ms  |       nan GB  |       nan GB  |     10020001  |              10020001
mlp-gpu              |      0.45 ms  |      1.06 ms  |      0.10 GB  |      0.13 GB  |     10020001  |              10020001
----------------------------------------------------------------------------------------------------------------------------------
pykan-cpu            |     15.59 ms  |     17.53 ms  |       nan GB  |       nan GB  |         2431  |                  1551
pykan-gpu            |     50.56 ms  |     93.93 ms  |      0.02 GB  |      0.02 GB  |         2431  |                  1551

Mine is cufkan. Your memory accesses in the forward pass were not coalesced. The standard indexing of a monolithic kernel seems to be faster than your grid-stride loop, and separating the bias addition helps too. If you make further optimizations, please let me know and I will update the benchmark.
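
To make the terms above concrete, here is a small CPU-side illustration in Python of the index arithmetic involved. It is not either author's kernel and says nothing about which variant is faster; it only shows which elements each thread would touch under monolithic indexing versus a grid-stride loop, and how many memory segments one 32-thread warp hits for contiguous versus strided addresses. The 4-byte element size, the 128-byte segment size, and all function names are simplifying assumptions for this sketch.

```python
def monolithic_indices(n, block_dim):
    """One thread per element: idx = block * block_dim + thread, guarded by idx < n."""
    grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover all n elements
    work = {}
    for block in range(grid_dim):
        for thread in range(block_dim):
            idx = block * block_dim + thread
            work[(block, thread)] = [idx] if idx < n else []
    return work


def grid_stride_indices(n, block_dim, grid_dim):
    """Fixed-size grid: each thread starts at its global id and steps by the grid size."""
    stride = block_dim * grid_dim
    work = {}
    for block in range(grid_dim):
        for thread in range(block_dim):
            start = block * block_dim + thread
            work[(block, thread)] = list(range(start, n, stride))
    return work


def covered(work):
    """All element indices handled by any thread, sorted."""
    return sorted(i for idxs in work.values() for i in idxs)


def segments_touched(element_indices, element_bytes=4, segment_bytes=128):
    """Distinct memory segments hit by one warp (assumed sizes, see lead-in)."""
    return len({(i * element_bytes) // segment_bytes for i in element_indices})


n = 1000
mono = monolithic_indices(n, block_dim=256)
strided = grid_stride_indices(n, block_dim=256, grid_dim=2)
print("monolithic covers all:", covered(mono) == list(range(n)))
print("grid-stride covers all:", covered(strided) == list(range(n)))

# One warp of 32 threads: neighbours read neighbouring elements (coalesced)
# versus each thread reading the start of its own 32-element row (strided).
contiguous = [t for t in range(32)]
row_strided = [t * 32 for t in range(32)]
print("segments, contiguous:", segments_touched(contiguous))
print("segments, strided:", segments_touched(row_strided))
```

Both schemes visit every element exactly once; they differ in how many threads are launched and how much work each does. Under the assumed sizes, the contiguous warp reads from a single segment while the strided warp touches 32, which is the kind of pattern "not coalesced" refers to.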
