Unofficial description of the CUDA assembly (SASS) instruction sets.

DocumentSASS

The instruction sets for NVIDIA GPUs have very sparse official documentation.

Other projects have worked on examining the instructions mainly through reverse-engineering, such as MaxAs, AsFermi, CuAssembler, TuringAs, KeplerAs, Decuda, and the paper Dissecting the NVidia Turing T4 GPU.

Since the instructions and architecture change from generation to generation, this is an uphill battle.
What if a description of the instruction encoding could be found within the tools provided by NVIDIA?
What if the instruction latencies could be found inside these as well?

The answer is, of course, that they can; otherwise the compiler would do a poor job of scheduling instructions. Furthermore, for SASS it turns out that fixed-latency instructions have the number of stall cycles hard-coded into them [src]. It is just a question of finding where this data is hidden.

It turns out that an extensive description of SASS instructions, as well as their latencies, is contained in two specific strings in nvdisasm. Instead of writing micro-benchmarks to find latencies, or reverse-engineering an assembler, one can in theory just consult these files. Instruction scheduling information is given in the latencies file, with the minimum time for fixed-latency ops essentially being the latency. See NOTES.

For some additional, unrelated observations, see OTHER.

How to run

The easy way is to simply run this notebook in Google Colab; no local setup is required.

Requirements to run locally: Linux, Python 3, and the CUDA Toolkit. Run make to generate the raw files describing instructions and latencies. Be sure to change the paths at the beginning of the Makefile if they differ on your system. Tested with CUDA 11.6.
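The variable names below are hypothetical, not taken from the repository's actual Makefile; they just illustrate the kind of toolchain paths that typically need adjusting:

```makefile
# Hypothetical example: point these at your local CUDA install.
CUDA_HOME ?= /usr/local/cuda-11.6
NVCC      := $(CUDA_HOME)/bin/nvcc
NVDISASM  := $(CUDA_HOME)/bin/nvdisasm
```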

How it works

  1. nvcc is used to compile example.cu to .cubin binaries for a list of architectures.
  2. cc is used to compile intercept.c to a .so library that acts as a man-in-the-middle for data passing through memcpy calls.
  3. nvdisasm is run on each binary file with intercept.so preloaded, capturing the copied data.
  4. The result is filtered with strings to keep only text, and then the script funnel.py gathers the relevant portions and writes them to files.
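The interception in step 2 relies on the standard LD_PRELOAD technique: a shared library defines its own memcpy, logs the data passing through, and forwards to the real implementation. A minimal sketch of the idea, assuming a Linux/glibc environment (the actual intercept.c in this repository may filter and format its output differently; this version just dumps every copied buffer to stderr):

```c
// Minimal LD_PRELOAD-style memcpy interceptor (sketch).
// Build: cc -shared -fPIC intercept.c -o intercept.so -ldl
// Use:   LD_PRELOAD=./intercept.so nvdisasm file.cubin 2> dump.txt
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <unistd.h>

void *memcpy(void *dest, const void *src, size_t n) {
    // Look up the real memcpy on first use.
    static void *(*real_memcpy)(void *, const void *, size_t);
    if (!real_memcpy)
        real_memcpy = (void *(*)(void *, const void *, size_t))
            dlsym(RTLD_NEXT, "memcpy");

    // Dump the copied bytes; text fragments can be recovered
    // from the dump afterwards with `strings`.
    write(STDERR_FILENO, src, n);

    return real_memcpy(dest, src, n);
}
```

Using raw write() rather than stdio here avoids the risk of the logging call itself invoking memcpy and recursing.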

An initial approach was simply to run strings on nvdisasm to extract the text embedded in the executable, but it turned out that the relevant strings are generated dynamically (and only for the input architecture), which is why this interception scheme is needed.

TODO

  • It appears the instruction string may be slightly corrupted for compute capability 3.5 currently.
