SkybuckFlying/CUDA

Hello dear Delphi programmer(s), Skybuck Flying here, typing the following message to you:

I've been holding you back for long enough. I hoped that maybe some day the cuda driver api and cuda framework would make me rich ! XD

I am not sure if that will ever happen ! LOL.

But what I have seen so far are amazing things made by other Delphi programmers: amazing new frameworks for AI and LLMs.

Some of them written for CPUs, some of them written for Vulkan, but when I inspect the cuda support it's a bit hard/lacking as far as I can tell.

AI has developed very fast, its evolution is going very fast, and its capabilities are already amazing.

It's time to unleash whatever I can do to enhance/embrace/extend this development :)

So today I decided to finally release my cuda work for Delphi completely free of charge onto the internet.

There is another reason why I am releasing it. I am releasing it under the "SkybuckOriginals" organization.

A github organization where I will release my hand-written code of the past as an indicator and testimony of what a human computer programmer was capable of.

This cuda driver api and cuda framework was one of my best productions during my peak years as a computer programmer.

However, those days of writing code by hand are long gone for me. Nowadays I generate everything with AI.

Now onto the driver and framework itself:

Interacting with the cuda driver, and understanding the C/C++ source code that makes it possible, is somewhat complex. However, my work of the past decoded this in a smart way and stuffed it into Delphi units. I probably forgot some of it, but there was a certain workflow to it: extracting certain information and values from the C/C++ headers without changing too much.

This would allow the driver api to be easily updated as the Cuda C/C++ code advances/evolves.

This driver api/framework is basically based on cuda 4, but was also tested with cuda 12. The driver api and framework should still be useful for basic and somewhat advanced interactions with cuda.

However cuda has evolved far beyond its original intention. The latest cuda technology allows gpu tiling, basically 2d matrix operations on the graphics card/gpu, which is ideal for AI computations.

I have no time yet to build these advanced applications of cuda into this Delphi framework, but perhaps somebody else can clone this driver + framework to use as a basis to evolve it further.

The framework has a very nice architecture in the way it wraps return values, error codes, error messages and error descriptions via a TCudaBase object/class. TCudaBase follows a general design pattern which makes sure that API calls are always nicely wrapped, and that their error codes, return values, messages etc. are stored in the object for further consultation in case anything went wrong. This is very valuable for debugging sessions and prevents having to write all kinds of custom code to get those values. Basically it's all nicely wrapped in TCudaBase objects/classes and objects/classes derived from it.
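To give an idea of what such a wrapping pattern can look like, here is a minimal sketch. It is not the real TCudaBase: the names TCudaBaseSketch, TCudaInitSketch, Check, ResultToText and FakeCuInit are made up for illustration, and FakeCuInit is only a stub standing in for the driver call cuInit so the program runs without a GPU. Only the four CUresult values are taken from cuda.h.

```pascal
program CudaBaseSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

const
  // CUresult values as defined in cuda.h.
  CUDA_SUCCESS               = 0;
  CUDA_ERROR_INVALID_VALUE   = 1;
  CUDA_ERROR_OUT_OF_MEMORY   = 2;
  CUDA_ERROR_NOT_INITIALIZED = 3;

type
  // Hypothetical stand-in for the base class; not the real TCudaBase.
  TCudaBaseSketch = class
  private
    FLastCall: string;
    FLastResult: Integer;
    FLastMessage: string;
  protected
    // Stores the outcome of one API call and reports success.
    function Check(const ACall: string; AResult: Integer): Boolean;
  public
    property LastCall: string read FLastCall;
    property LastResult: Integer read FLastResult;
    property LastMessage: string read FLastMessage;
  end;

  // Hypothetical derived class wrapping one driver call.
  TCudaInitSketch = class(TCudaBaseSketch)
  public
    function Initialize(AFlags: Cardinal): Boolean;
  end;

// Stub standing in for cuInit so the sketch runs without a GPU.
function FakeCuInit(AFlags: Cardinal): Integer;
begin
  if AFlags = 0 then
    Result := CUDA_SUCCESS
  else
    Result := CUDA_ERROR_INVALID_VALUE;
end;

function ResultToText(AResult: Integer): string;
begin
  case AResult of
    CUDA_SUCCESS:               Result := 'CUDA_SUCCESS';
    CUDA_ERROR_INVALID_VALUE:   Result := 'CUDA_ERROR_INVALID_VALUE';
    CUDA_ERROR_OUT_OF_MEMORY:   Result := 'CUDA_ERROR_OUT_OF_MEMORY';
    CUDA_ERROR_NOT_INITIALIZED: Result := 'CUDA_ERROR_NOT_INITIALIZED';
  else
    Result := 'unknown CUresult ' + IntToStr(AResult);
  end;
end;

function TCudaBaseSketch.Check(const ACall: string; AResult: Integer): Boolean;
begin
  FLastCall := ACall;
  FLastResult := AResult;
  FLastMessage := ResultToText(AResult);
  Result := AResult = CUDA_SUCCESS;
end;

function TCudaInitSketch.Initialize(AFlags: Cardinal): Boolean;
begin
  Result := Check('cuInit', FakeCuInit(AFlags));
end;

var
  Init: TCudaInitSketch;
begin
  Init := TCudaInitSketch.Create;
  try
    // A non-zero flag makes the stub fail, so the stored details get printed.
    if not Init.Initialize(1) then
      WriteLn(Init.LastCall, ' failed: ', Init.LastMessage,
        ' (', Init.LastResult, ')');
  finally
    Init.Free;
  end;
end.
```

The point of the design is that every derived class funnels its calls through one Check method, so the caller only tests a Boolean and can ask the object afterwards what went wrong.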

This cuda framework also has something special: a calculator/algorithm which calculates the optimal launch parameters, at least for the hardware of the time. This code and these values could be updated to compute optimal launch parameters for newer generation hardware. The basic idea behind it is to express a problem in linear space, basically 0 to N-1, then maybe constrain it further into certain dimensions, and let this calculator compute the optimal dimensions for the hardware. These computations are based on an understanding of the hardware and on certain hard coded values in variables/fields/lookup tables etc. as documented by nvidia: basically what kind of compute capabilities and resources the gpu has on its chip, like resident threads, that kind of thing.
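The idea of mapping a linear problem onto launch dimensions can be sketched roughly as follows. This is not the calculator from the framework: TGpuLimits, TLaunchShape, BestBlockSize and ComputeLaunch are hypothetical names, and the heuristic is deliberately simple (it only maximizes resident threads per multiprocessor and ignores registers and shared memory). The limits in Fermi20 are the compute capability 2.0 values from NVIDIA's programming guide.

```pascal
program LaunchCalcSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Hypothetical lookup-table entry for one hardware generation.
  TGpuLimits = record
    WarpSize: Integer;
    MaxThreadsPerBlock: Integer;
    MaxResidentThreadsPerSM: Integer;
    MaxResidentBlocksPerSM: Integer;
    MaxGridDimX: Integer;
  end;

  TLaunchShape = record
    BlockSize: Integer;
    GridX: Integer;
    GridY: Integer;
  end;

const
  // Compute capability 2.0 limits as documented by NVIDIA.
  Fermi20: TGpuLimits = (
    WarpSize: 32;
    MaxThreadsPerBlock: 1024;
    MaxResidentThreadsPerSM: 1536;
    MaxResidentBlocksPerSM: 8;
    MaxGridDimX: 65535
  );

function CeilDiv(A, B: Int64): Int64;
begin
  Result := (A + B - 1) div B;
end;

// Picks the smallest multiple of the warp size that keeps the most
// threads resident on one multiprocessor.
function BestBlockSize(const L: TGpuLimits): Integer;
var
  Size, Blocks, Resident, BestResident: Integer;
begin
  Result := L.WarpSize;
  BestResident := 0;
  Size := L.WarpSize;
  while Size <= L.MaxThreadsPerBlock do
  begin
    Blocks := L.MaxResidentThreadsPerSM div Size;
    if Blocks > L.MaxResidentBlocksPerSM then
      Blocks := L.MaxResidentBlocksPerSM;
    Resident := Blocks * Size;
    if Resident > BestResident then
    begin
      BestResident := Resident;
      Result := Size;
    end;
    Inc(Size, L.WarpSize);
  end;
end;

// Maps a linear problem 0..N-1 onto a 1D block and a 1D or 2D grid.
function ComputeLaunch(N: Int64; const L: TGpuLimits): TLaunchShape;
var
  TotalBlocks: Int64;
begin
  Result.BlockSize := BestBlockSize(L);
  Result.GridX := 0;
  Result.GridY := 0;
  if N <= 0 then
    Exit;
  TotalBlocks := CeilDiv(N, Result.BlockSize);
  // Spill into the Y dimension only when X alone is not enough.
  Result.GridY := Integer(CeilDiv(TotalBlocks, L.MaxGridDimX));
  Result.GridX := Integer(CeilDiv(TotalBlocks, Result.GridY));
end;

var
  Shape: TLaunchShape;
begin
  Shape := ComputeLaunch(100000000, Fermi20);
  WriteLn('block size: ', Shape.BlockSize);
  WriteLn('grid: ', Shape.GridX, ' x ', Shape.GridY);
end.
```

Because the grid is rounded up, it can contain more threads than N. A kernel launched this way would have to rebuild its linear index from the block and thread indices and skip any index that is not below N.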

I just took a look at the source code again, especially the framework. It has some comments here and there and looks a bit messy, but I will leave it as is, at least under SkybuckOriginals.

These comments could be valuable to understand some of it. Some other comments are highly technical comments from the original C/C++ headers and should be left in, some of them might be mine and could also be very important. Most comments are in English, some of them are in Dutch.

For now I have no time to clean this source code up, so I dump it into github AS IS.

If I ever do continue with this code I will use AI to do so. I would fork this project onto a different account to leave this code as is, basically frozen in time as a testimony of this work.

This also gives me something to fall back on in case the AI makes mistakes, hallucinates or introduces bugs, although AIs have become very good and highly intelligent.

The main reason why I release this is to make cuda hardware tiling support possible in Delphi for LLMs, as far as I know vulkan does not yet have gpu tiling and thus cuda would be necessary to benefit from advanced nvidia graphics cards/gpu hardware features.

I want to make sure that the hard work/resources that others have put into their latest code bases/inventions/AI advancements can take full advantage of hardware resources.

God and we all know we can use every little bit of compute power we can get our hands on to accelerate AI.

Last but not least:

I will fork this project so that any improvement can be made onto the fork.

I would like to ask you to not file any issues on this "skybuck originals" main repo/fork because it's supposed to be a testimony/frozen in time.

This might also some day help AI researchers to distinguish human-written code from AI-written code, or lead to other insights.

I put my blood, sweat and tears into this code at the time, trying to make sure it was as perfect as possible.

I read the entire nvidia cuda framework documentation at the time. I even wrote a PTX syntax highlighter which is still available for the TextPad editor, see their website.

I do like PTX a little bit, it's like the ARM instruction set, it's kinda easy to read/understand. Probably way better/easier to understand than vulkan.

Vulkan is too new for me, sigh :) Vulkan is possibly more advanced, at least for multi-threading, but it's also less advanced when it comes to supporting specific gpu capabilities like gpu tiling, apparently. Perhaps in the future vulkan can do the same.

To prevent confusion about other forms of tiling programming let me be specific about it:

https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/writing-tile-kernels.html

https://docs.nvidia.com/cuda/tile-ir/latest/sections/prog_model.html

https://docs.nvidia.com/cuda/parallel-thread-execution/#tensorcore-5th-generation-instructions

Some further helpful information:

" NVIDIA provides several official libraries related to Tensor Cores and Tile Programming.

  1. Tensor Cores Libraries (Most Relevant)

| Library | Type | Description | DLL / Library Name | Use Case |
| --- | --- | --- | --- | --- |
| cuBLAS | High-level | Best optimized BLAS library, automatically uses Tensor Cores | cublas / cublasLt | GEMM, general matrix operations |
| cuBLASLt | Lightweight & flexible | Lower-level API for maximum control + Tensor Core usage + fusions | cublasLt | High-performance GEMM with tuning |
| CUTLASS | Template library | C++ templates for building custom high-performance Tensor Core kernels | Header-only + compiled kernels | Custom kernels (very popular) |
| cuDNN | Deep Learning | Optimized for neural networks, heavily uses Tensor Cores | cudnn | Training / Inference |

cuBLAS / cuBLASLt is what most people use when they want "Tensor Core acceleration" without writing low-level code. These are included in the standard CUDA Toolkit.

DLL / Shared Library names (typical):

Windows: cublas64_*.dll, cublasLt64_*.dll

Linux: libcublas.so, libcublasLt.so

  2. CUDA Tile Programming

cuTile is not a separate traditional library (like a DLL you link against for calling functions).

It is a new programming model + compiler infrastructure (CUDA Tile IR + cuTile Python / C++).

It is delivered as part of the CUDA Toolkit (since CUDA 13+).

For Python: installed via pip install cuda-tile.

It generates kernels that can automatically use Tensor Cores efficiently.

There is no standalone cutile.dll you link like cuBLAS. Instead, you use the cuTile DSL, which compiles down to GPU code that leverages Tensor Cores. "

Perhaps with the release of this framework a proper pascal compiler can be created which can target PTX.

I am a little bit disappointed this never happened, but it is kind of understandable. I did once offer the producers of Delphi the chance to purchase my framework for 50k dollars, but they never bit, perhaps one of the biggest mistakes this company ever made seeing the current AI hype/world take over ! ;) But that is a bitter pill for them.

I also once contacted another person who developed a pascal compiler for some embedded system, he wanted 1000 dollars to write a pascal compiler for arm and/or cuda or something. This guy must now also be a bit sour ?! Both of these people could have been positioned very early to take full advantage of what was to come. Unfortunately that never happened in time.

However now with the arrival of very powerful AI models/systems/chat engines and agents the Delphi developers are striking back and taking the crown back from the shitty python-non-backwards-compatible-slow-interpreted programming language to something more powerful/mature like Delphi, a good thing, this makes me somewhat happy. I'd hate to read python or c, or c++ or go code for the rest of my life trying to understand how it works.

To prevent this horrible future of having to read other shittier programming languages it's time to unleash this framework, better late than never.

I've had my fair share of trying to read other programming languages, it's horrid/horrible and a huge time waster.

I also believe Delphi will eventually have to become open source to compete with everything else on this planet and to be able to parse your own code and manipulate it.

I do not believe Embarcadero/Idera or whatever/software companies will be around much longer. Their products will be cracked by a 15 year old telling the AI to crack it.

And their products will be easily reverse engineered and duplicated within a few years. I myself already have a delphi grammar in the works for tree-sitter using advanced AI and I may leak it onto the internet later, possibly unleashing a whole new venture of delphi tools, however I also do see some commercial potential for myself so I might not yet release it.

I am also not ready yet to destroy Embarcadero/Delphi itself. The guys who made it are kinda fun, but they are holding on to their own language/products a bit too much and are not learning fast enough about AI. This is understandable, but I do worry they will be beaten/swamped by new alternatives. These will come very soon and some of them have already arrived, too many to keep up with and investigate, but I will surely try.

I do like the idea of keeping the delphi language community together to be able to use each other's code/interoperability. However with the AI becoming faster and faster and faster it might be such that in the future we will just write everything ourselves and don't need anybody else anymore, which would be a little bit sad, but for now doing that might be a bit of a waste of tokens.

The opposite could of course also happen: tear down of AI, banishment of AI, wars, etc. However I do not see that happening any time soon because of the proliferation of AI for local use.

Releasing this framework will help to entrench AI locally, it may also even help data centers. It may allow squeezing every little drop of performance out of local computers.

For now my pleas for 1 terabyte systems/computers were not honored and we will have to "suffer" some more with data center AI instead of local AI.

Hopefully NVIDIA will change course in the future and start producing graphics/AI cards again with massive RAM chips on board and lots of cuda compute units/tensor cores, or whatever else they can come up with to accelerate local AI and keep local computing ! Otherwise too much power would fall into the hands of whoever runs the data centers, not a nice idea to think about if people get disconnected/banned from them, which can even happen by accident, by automated AI systems applying certain (usage) rules.

For now data center AI is much more powerful than local AI because of a couple of reasons:

  1. Much larger context windows, up to 1 million tokens, which is crucial for good understanding and large code base processing.
  2. Much faster.
  3. Relatively cheap.
  4. Relatively safe. (Data center burns down, instead of your home :))

So for now I am walking down a different path: trying to utilize as much data center AI as possible to get that nice speed up/power.

However I do keep an eye out for local AI developments. I do like what I see, keep up the good work.

My main objective in life currently is to:

  1. Get AI working as well as possible.
  2. Make AI as powerful as possible.
  3. Make AI work as efficiently as possible.
  4. Make AI work as much/fast as possible.
  5. Make the AI do all kinds of things, especially debugging is high on my list of "must have features" for the near future.

Currently the "AI vendors" are frustrating the AI developers by not unleashing free AI tier APIs, a big mistake in my opinion, preventing serious adoption of their APIs.

I have considered hacking Firefox so it can talk to web-chat AI, and anti gravity successfully did so in record breaking time, however it is still kinda tedious to get working well. Meanwhile I learned that web-chat uses websockets, which might be a way to make it work more reliably, but then there are the sign-in and captcha shenanigans, the continue button shenanigans, the "I am not a robot" shenanigans etc.

Another big reason why I am "butting out" of local AI for a while:

My chat experience with many AIs has shown that it is very beneficial to discuss coding problems with as many AIs as possible, or at least multiple.

Some AIs will be able to solve this, some others will be able to solve that.

This is why I believe for now, local AI is a dead end until local computers are so powerful that they can run multiple AI models.

However I believe data centers will always have an advantage through their sheer number of compute units.

Very maybe something unforeseen might come, like local quantum computers which somehow are so powerful that data centers are no longer needed and a single quantum computer can easily run a super AI.

Until that time arrives, I leave you to struggle with local AI !

I do believe local AI will have some usage possibilities like local AI debug stepping... I am not yet sure how much context is needed for AI to debug a program. It will need many communication trips between the AI model and the debugger, so fast AI models with a short cycle time/fast reaction/response would be desirable.

^ This is my current path.

The Oxygene product from RemObjects claims to be able to do AI debugging. I have not yet successfully used it: it requires some setup, and its gui is 2010s and does not easily allow the addition of thousands of go files embedded in many sub folders, a shame.

However I will continue my efforts and search for the ultimate debugging experience.

So far AI is very good at debugging code in general by using massive logs, but sometimes a debugger can shine light on things an AI will struggle with.

Such bugs could be race conditions or deadlocks, which are hard to recognize for humans and even for AI.

GO language has some advanced debuggers which could find these.

Currently I am interested in automating go debugging with an AI.

I also believe Delphi could maybe use some of GO's parallelism/concurrency inventions, like channels, but I am not yet sure.

I definitely do not like the import statement from GO, it takes control away from the programmer.

In general I do not like to give away my hard work for free, unless it was generated by an AI.

I have posted test programs in the past on usenet, but never really some big hard work I did, so this is the first time I will release some of my work to the general public.

I think this release is important, or at least it could be important, but that depends on the uptake and if people will be able to understand it, even if people can't understand it the fun part is the AI will be able to understand it.

So I would highly recommend you start with that. Simply "throw" all the files into an AI and ask the AI what it thinks of it.

I already did do this, the AI will say: "it was written for cuda 4.0 or whatever..." and the newer "cuda api" is like star trek a different beast.

But don't let that fool you too much.

As far as I can tell the driver api has not changed that much and is still the way it was, still backwards compatible.

I tried to do my best to understand how to push/pop cuda contexts onto the stack.

I am not a true assembler programmer. I think I got it right but I might be mistaken.

The most difficult part was to push the cuda context in the beginning and to support multi threading, but the framework should be able to do this.
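For readers new to this part of the driver api: the driver keeps a stack of contexts per CPU thread, and cuCtxPushCurrent and cuCtxPopCurrent are the driver calls that manipulate it. The sketch below only imitates that behaviour with stand-in functions so it runs without a GPU. FakeCtxPush, FakeCtxPop and RunWithContext are hypothetical names, and this is not how the framework itself is written. What it tries to show is the pairing of push and pop with try/finally, so that a worker thread always leaves its stack the way it found it.

```pascal
program CtxStackSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Stand-in for the driver's opaque context handle.
  TFakeContext = Pointer;

const
  MaxDepth = 16;

threadvar
  // Every thread gets its own copy, like the driver's per-thread stack.
  CtxStack: array[0..MaxDepth - 1] of TFakeContext;
  CtxDepth: Integer;

// Imitates cuCtxPushCurrent: makes Ctx current for the calling thread.
function FakeCtxPush(Ctx: TFakeContext): Boolean;
begin
  Result := (Ctx <> nil) and (CtxDepth < MaxDepth);
  if Result then
  begin
    CtxStack[CtxDepth] := Ctx;
    Inc(CtxDepth);
  end;
end;

// Imitates cuCtxPopCurrent: detaches the current context and returns it.
function FakeCtxPop(out Ctx: TFakeContext): Boolean;
begin
  Result := CtxDepth > 0;
  if Result then
  begin
    Dec(CtxDepth);
    Ctx := CtxStack[CtxDepth];
  end
  else
    Ctx := nil;
end;

// Runs some work with Ctx current and always pops it again afterwards.
procedure RunWithContext(Ctx: TFakeContext);
var
  Popped: TFakeContext;
begin
  if not FakeCtxPush(Ctx) then
    Exit;
  try
    WriteLn('depth while working: ', CtxDepth);
  finally
    FakeCtxPop(Popped);
  end;
end;

var
  Dummy: Integer;
begin
  RunWithContext(@Dummy);
  WriteLn('depth after work: ', CtxDepth);
end.
```

With the real driver, the thread that creates a context would pop it once after creation so that worker threads can push it when they need it.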

It can also support opengl interoperability. I never bothered with directx, it changes too much and always crashed on my systems, so I consider DirectX garbage from Microsoft.

But the point I am trying to make is this: I do notice that my cuda applications sometimes hang on shutdown. I am not sure why. I am not sure if it's a cuda framework related issue or not.

This could be looked into... by an AI... it's easy to do, unfortunately I do not want to spend the time on it right now.

It still requires a lot of time to wait for the AI to think, produce an answer, copy it, paste it, test it, read it, look at it, scroll through it etc.

So I choose to spend time on that which will bring me closer to productivity gains.

Unfortunately working on this cuda framework is very far away from that goal, but for some others it might bring them closer and me too in the future by benefitting from their cuda integrations.

So with that I will leave... I hope you understand a little bit...

Perhaps in the future I will spend some time on this.

I will fork it and read any pull requests/issues, but I do not guarantee that I will react to them, especially if they are AI generated. I am pro AI, but don't want to become your AI development engine ;)

I will try to answer any questions that you have to the best of my abilities, but I would have to think hard to remember how everything worked again. Fortunately it's still somewhere deep in my mind, and of course the benefit of Delphi is always there: easy to read code.

I dislike the lack of Pascal programming support by NVIDIA, they could have invested some money into helping people design and release pascal compilers/tools for cuda but they never did.

Their cuda kernel backwards compatibility was also somewhat lacking at the time, so I dislike the way NVIDIA handles things, but I do like the performance their graphics cards bring.

So it's a double edged sword.

My general dislike of how they handle backwards compatibility and the lack of pascal/delphi support put me off cuda. Plus cuda is not very general, it is mostly suited for processing tiny little elements and they must be near each other in memory.

So CPUs in general are still king for executing code. Though GPUs have also become very important with the rise of AI, especially for lazily dumping math formulas into matrices and brute forcing their way to greatness ! ;)

Now you have a chance to also become matrix-lazy ! ;) :)

May the matrix-laziness be with you ! Alwayyyyyzzz ? (Or maybe not and inject some optimizations ? ;) :))

Bye for now, Skybuck Flying.

One last note:

I am going to release it exactly as it is on my drive, with no alterations. This includes versioning folders, so I know exactly what versions I released, and it stays consistent with what is on my drive in case for example I develop it further by hand instead of with AI, but this is unlikely in today's world. It depends on what kind of code quality/sanity I want. Though sometimes AI can find bugs that even humans can't find. In general for highly technical stuff/my own code I feel more at ease knowing it was 100% coded by a human/me. So far I have not applied AI to my own hand written code, only to newly generated AI code. I kinda like this division.

I might make an exception for this driverapi/framework, but somehow I feel emotional about it. What if the AI designs future versions wrongly ? It would undo my own design. On the other hand it might also design it better. I guess what I am trying to say is: Perhaps I have too much emotional attachment to this code, because I wrote it by hand. It's simply emotionally hard to apply a "machine" to a piece of hand written art.

For you, an outsider, I am sure you will not have these emotional obstructions/barricades. If the AI takes it in a different direction, you are free to let it.

Perhaps this is for the better, pass it on to somebody else, something else, which is not emotionally attached to this code. The emotion is holding me back as well, too scared to screw something up.

The memory management by the way is underdeveloped, I could not come up with a nice design/framework to wrap all those strange/difficult complex memory transfer functions, I didn't really see a nice pattern in there, maybe there is none, maybe there is something...

I personally feel like the AI is not completely there yet to do what I did back in those days: its understanding might be lacking, its context window might not be big enough, its coding style is not yet up to par and its true/truthful debugging capabilities are non-existent.

So the time is not yet right to apply AI to my own code bases... though this code base, which could help build the AI of the future, might be the exception.

I do not want to mix "code quality" in my projects. It's either hand-written by me, or it's completely computer generated. In the latter case I might help it/bug fix it here and there a little bit.

This is just how I feel about AI in 2026. It's all or nothing.

To prove the usability of this framework I have no choice but to also release my cuda benchmarking software as a show case of what this software can do and how to use it.

Apparently there is a text document somewhere in this framework in which I tried to investigate the hang phenomenon... and AI came up with some suggestions...

I would advise repeating this experiment to get fresh AI insight into this matter.

The cuda benchmark might be an acceptable example of a Delphi application which hangs on shutdown for some reason.

The other test programs in the frameworks might have fallen behind as the code developed further for the driverapi/framework, I am not sure if these are up to date, I release everything as is.

I do delete exe, release, debug, history, recovery folders, dcu and other delphi build products/left over crap/etc (In a special GITHUB CUDA folder, so my original source code folders are untouched).

It's a bit risky to have a duplicate code base, it could confuse me, but I am willing to deal with it, to unleash this software onto the internet/web/github and into your Delphi-programming hands ! ;)

It might even be useful for free pascal computer programmers ! ;)

I will also include the "easy cuda framework". It's not so good, but it might get you up and running fast. I recommend against using this framework: it's not as flexible and powerful as using the cuda objects individually. It might save you some coding time, but in today's world of AI, the AI should have little trouble combining the CUDA objects into powerful code ! ;)

I have decided against releasing easy cuda and the application for now, because I just realized it would burden/pollute the repo. The repo should stay clean and be about this driverapi and framework only, which belong together:

The driver api is the lower layer, the framework wraps the driver api into nicely re-usable, easy to use, easy to program and powerful delphi classes/objects for you and now AI ;).

The lower driver api should not be programmed against directly and should be avoided as much as possible. It might be necessary to include it in uses clauses for certain enums/types/code values, and of course to get the framework itself working, but try to avoid it whenever possible.

Perhaps in the future I will "unleash" the easy cuda framework which is of little value. The benchmark is of high value, it shows how to fall back gracefully among the different graphics card models.

And in the future perhaps also my opengl framework, which wraps opengl in nice opengl classes that cleanly separate opengl into its different versions as it evolved, as should have been done by others but never was. This opengl framework could use a little work to make the switching/specifying of which opengl version to use a bit more friendly.

But that will have to wait until the future. Setting up git repos is a lot of work, and so is maintaining and describing them. Perhaps it's not the correct way to go about it, I have a lot of code on my drive accumulated over the years, but at least it allows people to zoom in and focus on this code.

My personal opinion is that github is becoming a gigantic mess because of the lack of proper categorization and subfolder support. Now we end up with all kinds of stupid repo names and strange nicknames etc.

Git and Github in my opinion are currently unsuited for hosting large code bases of different kinds, even hosting individual repos is a bit of a mess, shame really :(

I tried to do something about it by asking for subfolder support... not going to post the link right now here... because this readme is already long enough.

Writing this readme already took a lot of my (precious) ai development time, but I hope that maybe I will get something back for it in the future.

It would be kinda fun to see my framework integrated into other projects or even be improved upon or derivative(s).

Good lucccccccccccccccccccccccccccckkkk ! Yes AI does depend on some luck ! Lucky neuron values ? ;) :)

About

CUDA driver api and framework written in/for Delphi (development repo)
