LLM-performance-prediction

Predict the performance of LLM inference services

This repository includes all code and data needed to reproduce results of the work titled LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services.

The repository contains inference performance data collected using 10 Large Language Model (LLM) inference services running on a variety of GPUs (stored in preprocess_data/performance_characterization_data_raw). The notebook provided in the preprocess_data directory performs initial preprocessing of raw performance measurements, and visualizes collected data for a selected LLM.

The code in directory predict_performance reproduces results presented in the manuscript related to performance prediction and GPU recommendation. The notebook (1) performs further processing of the aggregate data files, (2) trains the performance prediction model of LLM-Pilot, as well as a variety of baselines used in the work, and (3) uses all methods to recommend the most cost-effective GPU for a previously unseen LLM with unknown inference performance, subject to performance constraints.

License

All code included in this project is shared under the Apache-2.0 license available in the LICENSE file.

However, the files containing the performance measurements are shared under the CDLA-Permissive-2.0 license. The details of the CDLA-Permissive-2.0 license are available in the LICENSE-DATA.md file. This applies to all files stored in preprocess_data/performance_characterization_data_raw.

Results reproduction

Step 1: Setup

Our experiments were conducted on a machine with one 14-core Intel Core i9-10940X CPU @ 3.30GHz, 125 GB of memory, and two NVIDIA GeForce RTX 3070 GPUs. The machine runs the Ubuntu 22.04.4 (LTS) operating system, CUDA version 12.2, and Docker version 24.0.5.

If you want to use a remote machine to run the experiments, ensure local forwarding of port 8889 when connecting to the machine via ssh:

ssh -L 8889:localhost:8889 USERNAME@REMOTE_IP_ADDRESS

Once you have access to the machine, clone and open this repository:

git clone https://github.com/IBM/LLM-performance-prediction.git
cd LLM-performance-prediction

To simplify environment setup, the docker directory provides the files needed to build a Docker image and run a Jupyter notebook inside a container based on that image. Building the image and running the container requires Docker and the nvidia-container-toolkit. If your machine does not have these installed, you can install them using the script docker/install_docker.sh:

chmod u+x docker/install_docker.sh
./docker/install_docker.sh

Then, you can build the docker image using the following command:

sudo docker build -f docker/Dockerfile . -t llm-pilot

This step may take a while (approximately 15 minutes). Once the image has been built, you can start the Jupyter notebook inside the container with the following command:

sudo docker run -it \
  --gpus all \
  -v ./preprocess_data:/app/preprocess_data \
  -v ./predict_performance:/app/predict_performance \
  -p 8889:8889 \
  llm-pilot

At the end of this step, Jupyter will print the URL needed to connect to it from a browser. Paste the provided address into your browser and you are ready to go.

Step 2: Data preprocessing

The first step to reproduce the experimental results presented in the work is to run all cells in the notebook preprocess_data/Preprocess_data.ipynb.

The notebook reads all raw files containing performance measurements and processes them into a single aggregate dataset file.
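
The aggregation step can be sketched as follows. This is a minimal illustration, not the repository's code: the file names, column names, and values below are hypothetical stand-ins for the raw measurement files.

```python
import csv
import io

# Hypothetical raw measurement files, keyed by file name. In the repository
# these live in preprocess_data/performance_characterization_data_raw; the
# columns shown here are illustrative only.
raw_files = {
    "llama-7b_A100.csv": "llm,gpu,latency_ms\nllama-7b,A100,45.1\nllama-7b,A100,47.3\n",
    "llama-7b_V100.csv": "llm,gpu,latency_ms\nllama-7b,V100,91.0\n",
}

def aggregate(files):
    """Concatenate the rows of every raw file into one aggregate dataset,
    tagging each row with the file it came from."""
    rows = []
    for name, text in sorted(files.items()):
        for row in csv.DictReader(io.StringIO(text)):
            row["source_file"] = name
            rows.append(row)
    return rows

dataset = aggregate(raw_files)
```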

Finally, the notebook produces example plots visualizing the performance of a selected LLM inference service across various GPUs, reproducing figures presented in the manuscript. The directory preprocess_data/expected_results contains copies of the expected plots, which can be compared against the output of the provided code.

Step 3: Performance prediction and GPU recommendation

After the data has been preprocessed, execute all cells in the notebook predict_performance/Predict_LLM_performance.ipynb.

First, the script augments the aggregate dataset generated by Preprocess_data.ipynb with features describing in detail the LLM and the GPU profile of every entry, and encodes them. The GPU profile is defined as the number and type of GPUs on which the inference service was running (including the possibility of sharding the LLM across multiple GPUs of the same type in a tensor-parallel manner).
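
One way to encode a GPU profile into numeric features is sketched below. This is an assumption about the general shape of such an encoding, not the repository's actual feature set; the GPU type list is illustrative.

```python
# Hypothetical list of GPU types seen during training; not the one used
# in the repository.
KNOWN_GPU_TYPES = ("A100", "V100", "RTX3070")

def encode_gpu_profile(gpu_type, num_gpus):
    """One-hot encode the GPU type and append the GPU count, so that a
    profile such as '2 x V100' (tensor-parallel sharding across two
    identical GPUs) becomes a fixed-length numeric vector."""
    onehot = [1.0 if gpu_type == t else 0.0 for t in KNOWN_GPU_TYPES]
    return onehot + [float(num_gpus)]
```

For example, `encode_gpu_profile("V100", 2)` yields a four-dimensional vector: three one-hot entries for the type plus the GPU count.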

Then, the script uses the dataset to train our performance prediction model, as well as the various baselines used in this work. We use a nested cross-validation scheme to tune the hyperparameters of our method and of the baselines (whenever they were not clearly stated in the original publications). To allow reproducing all results of the work in a reasonable time frame, we have omitted the hyperparameter tuning process of Morphling, as it took significantly longer than that of the other methods and would not fit within the time limit of the conference reproducibility review process.
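
The nested cross-validation scheme can be illustrated with a toy example: the outer loop holds out one LLM at a time, the inner loop selects a hyperparameter on the remaining LLMs, and the winning setting is evaluated on the held-out LLM. The model here (one-feature ridge regression in closed form) and the data are toy stand-ins, not the repository's predictor.

```python
def fit_ridge(points, lam):
    """Closed-form one-feature ridge fit: w = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)

def mse(w, points):
    """Mean squared error of the linear predictor y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in points) / len(points)

def nested_cv(per_llm_data, lam_grid):
    """Outer loop: leave one LLM out. Inner loop: leave-one-LLM-out over the
    remaining LLMs to pick the best hyperparameter, then evaluate on the
    held-out LLM."""
    results = {}
    for held_out in per_llm_data:
        train_llms = [l for l in per_llm_data if l != held_out]

        def inner_score(lam):
            errs = []
            for val in train_llms:
                pts = [p for l in train_llms if l != val
                       for p in per_llm_data[l]]
                errs.append(mse(fit_ridge(pts, lam), per_llm_data[val]))
            return sum(errs) / len(errs)

        best_lam = min(lam_grid, key=inner_score)
        train_pts = [p for l in train_llms for p in per_llm_data[l]]
        w = fit_ridge(train_pts, best_lam)
        results[held_out] = (best_lam, mse(w, per_llm_data[held_out]))
    return results

# Toy per-LLM datasets of (feature, latency) pairs; values are made up.
per_llm_data = {
    "llm-a": [(1.0, 2.0), (2.0, 4.0)],
    "llm-b": [(1.0, 2.1)],
    "llm-c": [(3.0, 6.2)],
}
results = nested_cv(per_llm_data, lam_grid=[0.0, 1.0, 10.0])
```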

Finally, the performance predictions made by all methods (both LLM-Pilot and the baselines) are used to recommend the most cost-effective GPU profile that meets the performance requirements for every LLM in the dataset. The script then calculates the GPU recommendation evaluation metrics (introduced in the manuscript) achieved by each method and presents them visually.
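
The recommendation step itself reduces to a constrained minimization: among the GPU profiles whose predicted performance satisfies the requirement, pick the cheapest. A minimal sketch, with hypothetical profile names, throughputs, and prices:

```python
def recommend_gpu(predicted_throughput, cost_per_hour, min_throughput):
    """Return the cheapest GPU profile whose predicted throughput meets the
    performance constraint, or None if no profile is feasible."""
    feasible = [p for p, t in predicted_throughput.items()
                if t >= min_throughput]
    if not feasible:
        return None  # no profile satisfies the performance constraint
    return min(feasible, key=lambda p: cost_per_hour[p])

# Illustrative predictions (requests/s) and prices ($/h) for one LLM.
predicted = {"1xA100": 120.0, "1xV100": 60.0, "2xV100": 110.0}
costs = {"1xA100": 3.00, "1xV100": 1.50, "2xV100": 2.80}

choice = recommend_gpu(predicted, costs, min_throughput=100.0)
```

With these numbers, both "1xA100" and "2xV100" meet the constraint, and the cheaper "2xV100" is recommended.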

Directory predict_performance/expected_results stores copies of the output files expected to be generated by running the notebook.
