Unofficial description of the CUDA assembly (SASS) instruction sets.

DocumentSASS

The instruction sets for NVIDIA GPUs have very sparse official documentation.

Other projects have worked on examining the instructions mainly through reverse-engineering, such as MaxAs, AsFermi, CuAssembler, TuringAs, KeplerAs, Decuda, and the paper Dissecting the NVidia Turing T4 GPU.

Since the instructions and architecture change from generation to generation, this is an uphill battle.
What if a description of the instruction encoding could be found within the tools provided by NVIDIA?
What if the instruction latencies could be found inside these as well?

The answer is, of course, that they can; otherwise the compiler would do a poor job of scheduling instructions. Furthermore, for SASS it turns out that fixed-latency instructions have the number of stall cycles hard-coded into them [src]. It is just a question of finding where this data is hidden.

It turns out that an extensive description of SASS instructions, as well as their latencies, is contained in two specific strings in nvdisasm. Instead of writing micro-benchmarks to find latencies, or reverse-engineering an assembler, one can in theory just consult these files. Instruction scheduling information is given in the latencies file, with the minimum time for fixed-latency ops essentially being the latency. See NOTES.

For some additional, unrelated observations, see OTHER.

How to run

The easy way is to simply run this notebook in Google Colab; no local setup is required.

Requirements to run locally: Linux, Python 3, and the CUDA Toolkit. Run make to generate the raw files describing instructions and latencies. Be sure to change the paths at the beginning of the Makefile if they differ on your system. Tested with CUDA 11.6.
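The variable names below are hypothetical, not taken from the repository's actual Makefile; they just illustrate the kind of toolchain paths that typically need adjusting:

```makefile
# Hypothetical example: point these at your local CUDA install.
CUDA_HOME ?= /usr/local/cuda-11.6
NVCC      := $(CUDA_HOME)/bin/nvcc
NVDISASM  := $(CUDA_HOME)/bin/nvdisasm
```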

How it works

  1. nvcc is used to compile example.cu to .cubin binaries for a list of architectures.
  2. cc is used to compile intercept.c to a .so library that acts as a man-in-the-middle for data passing through memcpy calls.
  3. nvdisasm is run on each binary file with intercept.so preloaded, capturing the copied data.
  4. The result is filtered with strings to keep only text, and then the script funnel.py gathers the relevant portions and writes them to files.
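The interception in step 2 relies on the standard LD_PRELOAD technique: a shared library defines its own memcpy, logs the data passing through, and forwards to the real implementation. A minimal sketch of the idea, assuming a Linux/glibc environment (the actual intercept.c in this repository may filter and format its output differently; this version just dumps every copied buffer to stderr):

```c
// Minimal LD_PRELOAD-style memcpy interceptor (sketch).
// Build: cc -shared -fPIC intercept.c -o intercept.so -ldl
// Use:   LD_PRELOAD=./intercept.so nvdisasm file.cubin 2> dump.txt
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <unistd.h>

void *memcpy(void *dest, const void *src, size_t n) {
    // Look up the real memcpy on first use.
    static void *(*real_memcpy)(void *, const void *, size_t);
    if (!real_memcpy)
        real_memcpy = (void *(*)(void *, const void *, size_t))
            dlsym(RTLD_NEXT, "memcpy");

    // Dump the copied bytes; text fragments can be recovered
    // from the dump afterwards with `strings`.
    write(STDERR_FILENO, src, n);

    return real_memcpy(dest, src, n);
}
```

Using raw write() rather than stdio here avoids the risk of the logging call itself invoking memcpy and recursing.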

An initial approach was simply to run strings on nvdisasm to extract the text embedded in the executable, but it turned out that the relevant strings are generated dynamically (and only for the input architecture), which is why this interception scheme is needed.

TODO

  • It appears the instruction string may be slightly corrupted for compute capability 3.5 currently.
