Code for the paper LLMGuard: Safeguarding Real-Time Inference for Large Language Models on Edge Devices.
A device equipped with Intel SGX is required to run the code. We recommend testing on Linux, since we have not tested Gramine on Windows.
The following steps are necessary to build a Gramine environment.
- Linux-SGX Driver. The SGX driver is the fundamental prerequisite and must be installed first. Please refer to the Linux-SGX repository to build it from source. For some CPU and system versions, SGX is already integrated into the system driver.
- Gramine-SGX. Gramine-SGX is a libOS that runs applications in SGX without modification. Please follow the Gramine repository to install Gramine.
- Test. You can test your Gramine installation with this simple Demo.
Please follow the Official Documentation to install the OP-TEE system, and test it with the provided examples.
We suggest installing OP-TEE >= 1.0 to enable acceleration with ARM NEON. On NVIDIA Jetson devices, OP-TEE is already installed in L4T systems.
Our code is tested with Python 3.9 and should be suitable for Python >= 3.8. Below is a brief guide to configuring the essential Python libraries.
For the training experiments, a basic PyTorch environment with GPU support is necessary, along with the Hugging Face and PEFT libraries. No specific versions of these libraries are required.
For the experiments focused on inference latency, you will need both NumPy and CuPy.
Our code works well under CUDA 12.1 and should work under CUDA >= 12.0. cuDNN is also recommended for inference acceleration, although you can run without it.
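As a quick sanity check, the hypothetical snippet below verifies the GPU stack and times one GEMM on NumPy (CPU) versus CuPy (GPU), which is the kind of kernel the latency experiments exercise. The matrix size is illustrative, and this script is not part of the repository:

import time
import numpy as np
import cupy as cp
import torch

# CUDA >= 12.0 with a working driver is assumed.
assert torch.cuda.is_available()

n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

t0 = time.perf_counter()
np.matmul(a, b)
print(f"NumPy GEMM: {time.perf_counter() - t0:.3f}s")

a_gpu, b_gpu = cp.asarray(a), cp.asarray(b)
cp.matmul(a_gpu, b_gpu)              # warm-up run
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
cp.matmul(a_gpu, b_gpu)
cp.cuda.Stream.null.synchronize()    # wait for the asynchronous kernel
print(f"CuPy GEMM: {time.perf_counter() - t0:.3f}s")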
All datasets except the E2E dataset are available on Hugging Face, and our code will download them automatically.
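For reference, a minimal sketch of how such a download happens with the Hugging Face datasets library; the dataset ID below is only illustrative:

from datasets import load_dataset

# Downloads and caches the dataset on first use; later runs reuse the cache.
dataset = load_dataset("tatsu-lab/alpaca")  # illustrative dataset ID
print(dataset["train"][0])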
All the code is under the acc directory, organized as {dataset}/{model}.
We provide a simple shell script to run the training experiments. For example:
cd ./acc/alpaca/llama-7b
./command.sh
Each script will train at least two adapters and composite them. This usually takes a long time, often in excess of 24 hours. You can easily adjust the number of adapters in the code; a compositing sketch with PEFT follows below.
The hyper-parameters should match the settings given here.
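For orientation, here is a minimal, hypothetical sketch of compositing two trained adapters with PEFT. The base model name, adapter paths, and weights are placeholders, and the repository's actual composition logic may differ:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model and attach two trained LoRA adapters.
base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # placeholder
model = PeftModel.from_pretrained(base, "path/to/adapter_a", adapter_name="a")
model.load_adapter("path/to/adapter_b", adapter_name="b")

# Composite the adapters into one; weights and combination type are illustrative.
model.add_weighted_adapter(
    adapters=["a", "b"],
    weights=[0.5, 0.5],
    adapter_name="composite",
    combination_type="linear",
)
model.set_adapter("composite")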
All the code is under the latency directory, organized as inference_{platform}/{model}/{method}.
We provide simple shell scripts to run the latency experiments. For example:
cd ./latency/inference_sgx/llama/ours
./run.sh
cd ./latency/inference_sgx/llama/tee_inference
./run.sh
We also provide a demo that runs single-layer inference of adapters. You need to change the Makefile to your own settings, then run the following commands:
cd ./latency/inference_trustzone/1layer_demo
make clean && make
sudo cp ./ta/{uuid}.ta {path to your optee_armtz}
sudo ./host/inference_demo
The commands above were tested on-device. Notably, since our OP-TEE latency code is large, we suggest cross-compiling it on a workstation using the toolchains provided by Nvidia Jetson Resources.
Run the following commands to compare our optimized GEMM with naive GEMM:
cd ./latency/inference_trustzone/gemm_optimization
./run.sh
Once you have trained the victim model, use Knockoff to conduct the attacks. Note that some libraries may be too recent, so you may need older versions.
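For context, here is a rough, hypothetical sketch of one Knockoff-style extraction step, in which a surrogate is distilled on the victim's soft labels. The function names and loss choice are illustrative and do not reproduce our exact attack code:

import torch
import torch.nn.functional as F

@torch.no_grad()
def query_victim(victim, inputs):
    # The attacker only observes the victim's output distribution.
    return F.softmax(victim(inputs), dim=-1)

def knockoff_step(surrogate, optimizer, victim, inputs):
    targets = query_victim(victim, inputs)
    # Distill the victim's soft labels into the surrogate via KL divergence.
    loss = F.kl_div(F.log_softmax(surrogate(inputs), dim=-1),
                    targets, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()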
This project is released under the MIT license.