Overview of AIGC Platform

The AIGC Platform meets the demands of AI model training and image generation, supports flexible resource management and scheduling, and adapts its services to the physical resources available. Its main functions include AI model training, AI image generation, elastic management, and load balancing; it can also be integrated with external systems through API interfaces to carry out different computing tasks.

Architecture of the AIGC Platform

The platform is mainly composed of the following core modules:

  • The Scheduling Management System (SM) flexibly manages hardware resources such as GPUs. It automatically adjusts and optimizes resource allocation according to the currently available computing resources to meet the demands of different computing tasks;
  • AITP is mainly responsible for personalized model training and model service deployment. It not only improves model performance but also makes models more stable;
  • AIGC is mainly responsible for generating personalized scene pictures. It generates accurate scene pictures that fit each customer's requirements;
  • The SS Support Service adapts the interaction between external interfaces and the AI platform, including parsing MQ messages, integrating with distributed caches, and querying AIGC process information. It effectively ensures the stable operation of the AI platform;
  • The Monitoring Platform monitors the operation of each module, collects logs in real time, and monitors status in real time to ensure the normal operation of the AI platform and detect anomalies promptly;
  • GW, as the gateway, implements authentication and flow control for external access to ensure the safe operation of the AI platform.

(Architecture diagram of the AIGC Platform)

All requests pass through the Gateway (GW) for processing, whether they are normal read requests, such as state queries during image generation, or compensation requests in exceptional situations, which must be synchronized with the Support Service (SS) via GW. Read requests for the images of different styles generated by AIGC access the Data File Storage (DFS) directly, avoiding unnecessary waiting. Write operations related to model training and AIGC generation are sent as Message Queue (MQ) messages to the AI module for asynchronous processing, in order to ensure service performance and the quality of image generation. Write requests such as cancel commands go directly to the Support Service (SS) via GW to complete the corresponding cancel operations. To ensure that each request is processed as quickly as possible, GW also ranks requests, urgent ones first and ordinary ones later, which can be achieved with separate priority queues.
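As an illustration of the priority-based ranking mentioned above, here is a minimal sketch in Python; the GatewayQueue class and its field names are hypothetical and not part of the platform's code.

```python
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class QueuedRequest:
    sort_key: int                        # negated priority: larger priority served first
    seq: int                             # arrival order breaks ties fairly (FIFO)
    req_no: str = field(compare=False)
    payload: dict = field(compare=False)

class GatewayQueue:
    """Hypothetical GW-side scheduler: urgent requests are dispatched before ordinary ones."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def submit(self, req_no: str, payload: dict, priority: int) -> None:
        heapq.heappush(self._heap, QueuedRequest(-priority, next(self._seq), req_no, payload))

    def next_request(self):
        return heapq.heappop(self._heap) if self._heap else None

# Usage: an urgent cancel command jumps ahead of an ordinary status query.
gw = GatewayQueue()
gw.submit("req-001", {"type": "status-query"}, priority=1)
gw.submit("req-002", {"type": "cancel"}, priority=9)
assert gw.next_request().req_no == "req-002"
```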

AI Training Platform

The AI Training Platform, based on the DreamBooth algorithm, enables users to train personalized text-to-image conversion models. Inputs include three to five images, a theme category name, and a base model, while outputs include a unique identifier and a personalized text-to-image conversion model. Additionally, users can deploy their trained models to existing applications for real-time text-to-image transformation.
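As a concrete illustration, a training request might carry roughly the following fields; this is a hypothetical sketch, and the field names (images, subject_class, base_model, identifier) are assumptions rather than the platform's actual schema.

```python
# Hypothetical DreamBooth training request; all field names and values are illustrative.
training_request = {
    "images": [                          # three to five images of the subject
        "dfs://user-123/dog_01.png",
        "dfs://user-123/dog_02.png",
        "dfs://user-123/dog_03.png",
    ],
    "subject_class": "dog",              # theme category name
    "base_model": "stable-diffusion-v1-5",
    "identifier": "sks",                 # token bound to the personalized subject
}
# The platform returns a unique identifier plus a personalized text-to-image model
# that can be deployed to an existing application for real-time generation.
```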

SM does not handle MQ messages directly; the SS service parses them. SS passes the relevant scheduling information to the SM interface, which creates Pods, binds them to Nodes (GPUs), and starts the computing containers. SS then transmits the compute task information to the Node for processing; the result is encapsulated by SS as an MQ message and sent back to the MQ queue.

MQ messages consist of two parts: message header and message body.

  1. The message header format is as follows:
    • Header: contains the request's metadata, which helps SM schedule different GPU resources to carry out AI computation tasks;
    • reqNo: a globally unique UUID that tracks the AI task request and serves as its unique identifier;
    • priority: priority; SM schedules resources according to priority, and higher-priority tasks obtain resources first;
    • taskType: task type, currently divided into two categories: TP (training) tasks and GC (generation) tasks;
    • group: group identifier, in one-to-one correspondence with a Node; computation tasks in the same group are assigned to the same Node.
  2. The message body format is as follows:
    • method: The method name of the AI algorithm invoked by the AI task;
    • parameters: The parameters of the method.
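To make the format concrete, a hypothetical training (TP) message might look like the following; the field values and the method name dreambooth_train are illustrative assumptions.

```python
import json
import uuid

# Hypothetical TP (training) MQ message following the header/body layout above.
mq_message = {
    "header": {
        "reqNo": str(uuid.uuid4()),    # globally unique request identifier
        "priority": 8,                 # higher-priority tasks obtain resources first
        "taskType": "TP",              # TP = training, GC = generation
        "group": "group-01",           # maps the task group to a single Node
    },
    "body": {
        "method": "dreambooth_train",  # AI algorithm method invoked by the task
        "parameters": {
            "subject_class": "dog",
            "base_model": "stable-diffusion-v1-5",
            "images": ["dfs://user-123/dog_01.png"],
        },
    },
}

print(json.dumps(mq_message, indent=2))
```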

AIGC Platform

The AIGC Platform is based on the Stable Diffusion algorithm and supports text-to-image conversion for users. Inputs include text prompts and image-generation parameters; outputs are a unique identifier and the generated image. Using trained models, it lets users generate images in a variety of styles in real time to meet different needs.

The processing flow is similar to that of the training platform, except that the model used is the newly trained personalized model. The SS service parses the AIGC generation parameters, calls the SM interface to create an appropriate algorithm Pod, binds it to a Node (GPU), passes the algorithm parameters, and generates the image.

The MQ message format is the same as for the training platform: a message header with reqNo, priority, taskType, and group, and a message body with method and parameters.
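For the generation side, a hypothetical GC message might look like this; the method name txt2img and the parameter values are illustrative assumptions.

```python
import uuid

# Hypothetical GC (generation) MQ message; values are illustrative only.
gc_message = {
    "header": {
        "reqNo": str(uuid.uuid4()),
        "priority": 5,
        "taskType": "GC",                       # image-generation task
        "group": "group-02",
    },
    "body": {
        "method": "txt2img",                    # Stable Diffusion text-to-image call
        "parameters": {
            "model": "user-123-personalized",   # the newly trained personalized model
            "prompt": "a photo of sks dog on a beach, watercolor style",
            "steps": 30,
            "cfg_scale": 7.5,
            "width": 512,
            "height": 512,
        },
    },
}
```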

Resource Scheduling System

The Resource Scheduling System consists of SM cluster management, a multi-level scheduling framework, and container resource prediction. Beyond GPU resources, it can also be extended to schedule other physical resources (e.g., CPU, memory, disk, and network) based on a physical resource list. The physical resource list enumerates all physical resources used by the AIGC platform, including the specification and quantity of each resource. It can be used to match tasks to the resources they require and to determine system capacity, as well as to forecast the resources needed for different types of tasks in advance, ensuring optimal usage of all resources.
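The sketch below shows one way such a physical resource list could be represented and matched against a task's requirements; the structure and field names are assumptions, not the platform's actual schema.

```python
from typing import Optional

# Hypothetical physical resource list: specification and quantity per node.
physical_resources = [
    {"node": "node-gpu-01", "gpu": 4, "cpu": 32, "memory_gb": 256, "disk_gb": 2048},
    {"node": "node-gpu-02", "gpu": 2, "cpu": 16, "memory_gb": 128, "disk_gb": 1024},
]

def find_node(required: dict) -> Optional[str]:
    """Return the first node whose resources satisfy the task's requirements."""
    for node in physical_resources:
        if all(node.get(key, 0) >= amount for key, amount in required.items()):
            return node["node"]
    return None

# A training task needing 2 GPUs, 8 CPUs, and 64 GB of memory.
print(find_node({"gpu": 2, "cpu": 8, "memory_gb": 64}))  # -> node-gpu-01
```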

The GPU Resource Scheduling System is implemented by the SM scheduling framework for elastic management of GPU resources, with GPU resource usage monitored through the DevicePlugin.

It mainly implements the following functions:

  1. AI-related functions: schedule AI algorithm Pods to process the computing tasks delivered via MQ, based on how much free GPU capacity each Node has. The MQ message carries parameters such as priority; when a high-priority task arrives, resources are scheduled first to satisfy it. In addition, newly added and removed GPUs are sensed automatically.
  2. Service-related functions: load balancing.

The SM scheduling system transmits MQ messages to the SS service. Upon receiving the message, the SS service calls the SM interface and allocates idle GPU nodes according to the required resource information. It then creates Pods and binds them to the GPU nodes. After the calculation is complete, it notifies the SM scheduling system to delete the Pod and reclaim the resources, thus completing resource scheduling and management.
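Assuming the platform runs on Kubernetes and GPUs are exposed through the device plugin as the nvidia.com/gpu resource, the create/bind/delete cycle described above might look roughly like this sketch; the namespace, Pod names, and container image are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

def create_task_pod(req_no: str, node_name: str, image: str) -> None:
    """Create an algorithm Pod and bind it to the chosen GPU Node (placeholder values)."""
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=f"aigc-task-{req_no}"),
        spec=client.V1PodSpec(
            node_name=node_name,                      # bind the Pod directly to the Node
            restart_policy="Never",
            containers=[client.V1Container(
                name="worker",
                image=image,
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},   # one GPU via the device plugin
                ),
            )],
        ),
    )
    v1.create_namespaced_pod(namespace="aigc", body=pod)

def delete_task_pod(req_no: str) -> None:
    """Delete the Pod and reclaim its resources once the computation completes."""
    v1.delete_namespaced_pod(name=f"aigc-task-{req_no}", namespace="aigc")
```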

Summary

The AIGC platform provides powerful AI training and image-generation services that meet a wide range of needs, with flexible resource management and scheduling capabilities. It adapts its services to the available physical resources, offering AI model training, AI image generation, elastic management, and load balancing, and it supports algorithms such as DreamBooth and Stable Diffusion for real-time text-to-image generation, giving AI developers more capabilities and possibilities.
