ONNX-MLIR Serving

This project implements a gRPC server, written in C++, to serve onnx-mlir compiled models. Thanks to its C++ implementation, ONNX Serving has very low latency overhead and high throughput.

ONNX Serving provides dynamic batch aggregation and a worker pool to fully utilize the AI accelerators on the machine.

ONNX-MLIR is compiler technology that transforms a valid Open Neural Network Exchange (ONNX) graph into code implementing the graph with minimum runtime support. It implements the ONNX standard and is based on the underlying LLVM/MLIR compiler technology.

Build

There are two ways to build this project.

Build ONNX-MLIR Serving in a local environment

Prerequisite

1. gRPC installed

Build gRPC from source

Example gRPC installation directory: grpc/cmake/install
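
A minimal sketch of building gRPC from source into that layout, assuming a cmake/build subdirectory and an example release tag (adjust the tag to a version compatible with your toolchain):

# clone gRPC with its submodules (the tag below is only an example)
git clone --recurse-submodules -b v1.55.0 https://github.com/grpc/grpc
cd grpc
mkdir -p cmake/build && cd cmake/build
# install into grpc/cmake/install, matching the example directory above
cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX=../install ../..
make -j && make install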

2. ONNX-MLIR is built

Copy the include files from the onnx-mlir source tree into the onnx-mlir build directory:

ls onnx-mlir-serving/onnx-mlir-build/*
onnx-mlir-serving/onnx-mlir-build/include:
benchmark  CMakeLists.txt  google  onnx  onnx-mlir  OnnxMlirCompiler.h  OnnxMlirRuntime.h  rapidcheck  rapidcheck.h

onnx-mlir-serving/onnx-mlir-build/lib:
libcruntime.a
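
A sketch of that copy step (paths are assumptions; some of the headers listed above may come from the onnx-mlir build tree rather than the source tree, so adjust as needed):

# gather the headers next to the prebuilt libcruntime.a
mkdir -p onnx-mlir-serving/onnx-mlir-build/include
cp -r onnx-mlir/include/* onnx-mlir-serving/onnx-mlir-build/include/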

Build ONNX-MLIR Serving

cmake -DCMAKE_BUILD_TYPE=Release -DGRPC_DIR:STRING={GRPC_SRC_DIR} -DONNX_COMPILER_DIR:STRING={ONNX_MLIR_BUILD_DIR} -DCMAKE_PREFIX_PATH={GRPC_INSTALL_DIR} ../..
make -j

Build ONNX-MLIR Serving in a Docker environment

Build the AI gRPC server and client

docker build -t onnx/aigrpc-server .
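
Once the image is built, a minimal run sketch (the published port and the in-container models path are assumptions; check the Dockerfile for the actual values):

# expose the gRPC port and mount a host models directory (port and path are assumptions)
docker run -it --rm -p 50051:50051 -v $(pwd)/models:/workdir/models onnx/aigrpc-server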

Run ONNX-MLIR Server and Client

Server:

./grpc_server -h
usage: grpc_server [options]
    -w arg     wait time for batch size, default is 0
    -b arg     server side batch size, default is 1
    -n arg     thread number, default is 1

./grpc_server

Add more models

Build Models Directory

cd cmake/build
mkdir models

Example models directory:

models
└── mnist
    ├── config
    ├── model.so
    └── model.onnx
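
A sketch of producing model.so for a new model with the onnx-mlir compiler (assumes the onnx-mlir binary is on PATH; --EmitLib emits a shared library and -o sets the output base name):

# compile the ONNX model to a shared library inside its models subdirectory
mkdir -p models/mnist
cp path/to/model.onnx models/mnist/
onnx-mlir --EmitLib -o models/mnist/model models/mnist/model.onnx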

config

The config file describes the model's inputs and outputs. It can be generated using utils/OnnxReader <model.onnx>. Example config for mnist:

input {
  name: "Input3"
  type {
    tensor_type {
      elem_type: 1
      shape {
        dim {
          dim_value: 1
        }
        dim {
          dim_value: 1
        }
        dim {
          dim_value: 28
        }
        dim {
          dim_value: 28
        }
      }
    }
  }
}
output {
  name: "Plus214_Output_0"
  type {
    tensor_type {
      elem_type: 1
      shape {
        dim {
          dim_value: 1
        }
        dim {
          dim_value: 10
        }
      }
    }
  }
}
max_batch_size: 1

Inference request

See utils/inference.proto and utils/onnx.proto.
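
To call the server from your own client, stubs can be generated from these protos; a C++ sketch (the output directory and plugin location are assumptions):

# generate message classes and gRPC stubs from the service definitions
mkdir -p gen
protoc -I utils --cpp_out=gen --grpc_out=gen \
    --plugin=protoc-gen-grpc=$(which grpc_cpp_plugin) \
    utils/inference.proto utils/onnx.proto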

Use Batching

There are two places to set the batch size:

  1. In the model config file ('max_batch_size')
  2. When starting grpc_server with -b [batch size]

Situation 1: grpc_server is started without -b; the default batch size is 1, which means no batching.
Situation 2: grpc_server -b <batch_size> with batch_size > 1 and model A's config has max_batch_size > 1; queries to model A are batched using the minimum of the two values.
Situation 3: grpc_server -b <batch_size> with batch_size > 1 but model B's config has max_batch_size = 1 (the value generated by default); queries to model B are not batched.
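
For example, a sketch of starting the server with batching enabled (the unit of -w is whatever the server expects; see the usage output above). The effective batch size for each model is then the minimum of -b and that model's max_batch_size:

# up to 8 requests per batch, a short wait for batches to fill, 2 worker threads
./grpc_server -b 8 -w 100 -n 2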

Example clients: see example/cpp and example/python.

Example

See grpc-test.cc

  • TEST_F is the simplest example of serving the mnist model.
