This repository includes all code and data needed to reproduce the results of the work titled LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services.
The repository contains inference performance data collected using 10 Large Language Model (LLM) inference services running on a variety of GPUs (stored in preprocess_data/performance_characterization_data_raw).
The notebook provided in the preprocess_data directory performs initial preprocessing of raw performance measurements, and visualizes collected data for a selected LLM.
The code in directory predict_performance reproduces results presented in the manuscript related to performance prediction and GPU recommendation.
The notebook (1) performs further processing of the aggregate data files, (2) trains the performance prediction model of LLM-Pilot, as well as a variety of baselines used in the work, and (3) uses all methods to recommend the most cost-effective GPU for a previously unseen LLM with unknown inference performance, subject to performance constraints.
All code included in this project is shared under the Apache-2.0 license available in the LICENSE file.
However, the files containing the performance measurements are shared under the CDLA-Permissive-2.0 license.
The details of the CDLA-Permissive-2.0 license are available in the LICENSE-DATA.md file.
This applies to all files stored in preprocess_data/performance_characterization_data_raw.
Our experiments were conducted on a machine with one 14-core Intel Core i9-10940X CPU @ 3.30GHz, 125GB memory and two NVIDIA GeForce RTX 3070 GPUs. The machine uses the Ubuntu 22.04.4 (LTS) operating system, CUDA version 12.2, and docker version 24.0.5.
If you want to use a remote machine to run the experiments, ensure local forwarding of port 8889 when connecting to the machine via ssh:
ssh -L 8889:localhost:8889 USERNAME@REMOTE_IP_ADDRESS
Once you have access to the machine, clone and open this repository:
git clone https://github.com/IBM/LLM-performance-prediction.git
cd LLM-performance-prediction
To simplify environment setup, the docker directory provides the files needed to build a docker image and to run a jupyter notebook inside a container based on that image.
Building the image and running the container requires docker and nvidia-container-toolkit to be installed.
If your machine does not have these installed, you can install them using the script docker/install_docker.sh:
chmod u+x docker/install_docker.sh
./docker/install_docker.sh
Then, you can build the docker image using the following command:
sudo docker build -f docker/Dockerfile . -t llm-pilot
This step may take a while (approx. 15 minutes). After the image has been built, you can start the jupyter notebook inside the container with the following command:
sudo docker run -it \
--gpus all \
-v ./preprocess_data:/app/preprocess_data \
-v ./predict_performance:/app/predict_performance \
-p 8889:8889 \
llm-pilot
At the end of this step, jupyter notebook will display the address needed to connect to it via a browser. Copy the provided address into your browser and you are ready to go.
The first step to reproduce the experimental results presented in the work is to run all cells in the notebook preprocess_data/Preprocess_data.ipynb.
The script reads all raw files containing performance measurements and processes them into an aggregate dataset file.
Finally, the script produces example plots visualizing the performance of a selected LLM inference service across various GPUs, reproducing figures presented in the manuscript.
Directory preprocess_data/expected_results contains copies of the expected plots, which can be compared to the output produced by the provided code.
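The aggregation step described above can be sketched as follows. This is a minimal illustration using pandas with invented column names, models, and values; the actual raw files in preprocess_data/performance_characterization_data_raw and the notebook's processing logic may differ.

```python
import pandas as pd

# Hypothetical raw measurements: one row per measurement for a given
# (model, GPU) configuration. Column names and values are illustrative only.
raw = pd.DataFrame({
    "model":       ["llama-7b"] * 4 + ["llama-13b"] * 4,
    "gpu":         ["A100", "A100", "V100", "V100"] * 2,
    "latency_ms":  [120.0, 130.0, 250.0, 260.0, 190.0, 200.0, 410.0, 420.0],
    "throughput":  [80.0, 78.0, 40.0, 41.0, 55.0, 54.0, 25.0, 26.0],
})

# Collapse repeated measurements into one summary row per configuration,
# mirroring the kind of aggregate dataset file the notebook produces.
agg = (
    raw.groupby(["model", "gpu"], as_index=False)
       .agg(mean_latency_ms=("latency_ms", "mean"),
            mean_throughput=("throughput", "mean"))
)
print(agg)
```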
After the data has been preprocessed, execute all cells in the notebook predict_performance/Predict_LLM_performance.ipynb.
First, the script augments the aggregate dataset generated by Preprocess_data.ipynb with features describing in detail the LLM and the GPU profile of each entry, and encodes these features.
The GPU profile is defined as the number and type of GPUs on which the inference service was running (including a possibility of sharding the LLM across multiple GPUs of the same type in a tensor-parallel manner).
Then, the script uses the dataset to train our performance prediction model, as well as the various baselines used in this work. We use a nested cross-validation scheme to tune the hyperparameters of our method and the baselines (whenever they were not clearly stated in the original publications). To allow reproducing all results of the work in a reasonable time frame, we have omitted the hyperparameter tuning process of Morphling, as it took significantly longer than for the other methods and would not fit within the time limit of the conference reproducibility review process.
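The nested cross-validation scheme can be sketched as below: an inner loop tunes hyperparameters while an outer loop estimates generalization error. This is a generic scikit-learn sketch on synthetic data, not the notebook's actual model, hyperparameter grid, or splitting scheme.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in for the performance dataset; the real notebook uses
# the aggregate data produced by Preprocess_data.ipynb.
X, y = make_regression(n_samples=120, n_features=6, noise=0.1, random_state=0)

inner = KFold(n_splits=3, shuffle=True, random_state=0)  # hyperparameter tuning
outer = KFold(n_splits=5, shuffle=True, random_state=0)  # performance estimate

# Inner loop: grid search over an illustrative hyperparameter grid.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=inner,
)

# Outer loop: evaluate the tuned model on held-out folds.
scores = cross_val_score(search, X, y, cv=outer)
print(scores.mean())
```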
Finally, the performance predictions made by all methods (both LLM-Pilot and the baselines) are used to recommend the most cost-effective GPU profile that meets the performance requirements for each LLM included in the dataset. The script then calculates the GPU recommendation evaluation metrics (introduced in the manuscript) achieved by all methods, and presents them visually.
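The recommendation step reduces to a constrained selection: keep the GPU profiles whose predicted performance satisfies the constraint, then pick the cheapest among them. The profiles, predicted throughputs, prices, and the constraint below are all invented for illustration; the notebook derives predictions from the trained models and uses the metrics introduced in the manuscript.

```python
# Hypothetical predicted performance and hourly costs per GPU profile.
profiles = [
    {"profile": "1 x V100", "pred_throughput": 30.0, "cost_per_hour": 1.0},
    {"profile": "1 x A100", "pred_throughput": 80.0, "cost_per_hour": 3.0},
    {"profile": "2 x A100", "pred_throughput": 150.0, "cost_per_hour": 6.0},
]

required_throughput = 60.0  # example performance constraint

# Keep only profiles predicted to meet the constraint, then pick the cheapest.
feasible = [p for p in profiles if p["pred_throughput"] >= required_throughput]
best = min(feasible, key=lambda p: p["cost_per_hour"])
print(best["profile"])  # → 1 x A100
```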
Directory predict_performance/expected_results stores copies of the output files expected to be generated by running the notebook.