Do you have anything to add to this problem revolving around a scheduler calling multiple dockerized clients that handle AI requests?

ML Parallelism Problem (Docker)

- some models need all the hardware
- we want the same hardware to work on multiple jobs
- -> models need to be switched between jobs

Possible Solutions

- Put all models in the same container
- Scheduler sends load and unload signals to the container (rough sketch below)
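A rough sketch of what that load/unload signal could look like inside the container; FastAPI and the load_model helper here are placeholders for illustration, not an existing API:

```python
# Sketch of a container-side control endpoint the scheduler could call.
# FastAPI, load_model() and MODEL_PATHS are illustrative placeholders.
from fastapi import FastAPI, HTTPException

app = FastAPI()
loaded_models = {}  # name -> model object currently held in GPU/RAM

MODEL_PATHS = {"whisper": "/models/whisper", "llama": "/models/llama"}  # example paths

def load_model(path: str):
    """Placeholder: load weights from disk into GPU memory."""
    return object()

@app.post("/load/{name}")
def load(name: str):
    if name not in MODEL_PATHS:
        raise HTTPException(status_code=404, detail="unknown model")
    loaded_models[name] = load_model(MODEL_PATHS[name])
    return {"loaded": list(loaded_models)}

@app.post("/unload/{name}")
def unload(name: str):
    loaded_models.pop(name, None)  # drop the reference so memory can be freed
    return {"loaded": list(loaded_models)}
```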

It sounds like you're trying to run multiple AI models in parallel on the same hardware and are considering Docker to manage the switching. Here are a few suggestions and some alternative approaches:

Model Serving Frameworks: Consider a model serving framework such as TensorFlow Serving, NVIDIA Triton Inference Server, or MLflow. These frameworks handle loading and unloading models, optimize hardware usage, and can serve multiple models concurrently.
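As one concrete example, Triton can be started in explicit model-control mode so that an external scheduler loads and unloads models over HTTP. A minimal sketch, assuming a Triton server at localhost:8000 and placeholder model names:

```python
# Sketch: using Triton's model repository API to swap models on demand.
# Assumes a Triton server started with --model-control-mode=explicit;
# the URL and model names are placeholders.
import requests

TRITON = "http://localhost:8000"

def load(model_name: str) -> None:
    r = requests.post(f"{TRITON}/v2/repository/models/{model_name}/load")
    r.raise_for_status()

def unload(model_name: str) -> None:
    r = requests.post(f"{TRITON}/v2/repository/models/{model_name}/unload")
    r.raise_for_status()

# Scheduler side: free the large model before bringing up the next job's model.
unload("large_llm")
load("image_classifier")
```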

Load Balancer & Auto-scaling: Instead of managing the model switching manually, you can use a load balancer like HAProxy or Nginx to distribute the incoming requests across multiple containers running your AI models. Additionally, you can configure auto-scaling to handle peaks in demand, allowing you to create and destroy containers based on the current workload.
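Until a dedicated load balancer is in place, the scheduler itself can also do simple round-robin dispatch across container endpoints. A minimal sketch, where the worker URLs are assumptions:

```python
# Minimal round-robin dispatch across several model-serving containers.
# The endpoint URLs are placeholders for your own containers.
import itertools
import requests

ENDPOINTS = itertools.cycle([
    "http://worker-1:8080/infer",
    "http://worker-2:8080/infer",
])

def dispatch(payload: dict) -> dict:
    url = next(ENDPOINTS)  # pick the next container in rotation
    resp = requests.post(url, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()
```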

Kubernetes & Docker Swarm: Leverage container orchestration tools like Kubernetes or Docker Swarm to manage your containers. These tools can help you scale your containers, manage resources, and handle failures more efficiently.
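As an illustration, scaling a serving Deployment from Python with the official kubernetes client could look like the sketch below; the deployment name and namespace are placeholders:

```python
# Sketch: scale a model-serving Deployment up or down from the scheduler.
# Deployment name and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

def scale(replicas: int) -> None:
    apps.patch_namespaced_deployment_scale(
        name="model-server",
        namespace="default",
        body={"spec": {"replicas": replicas}},
    )

scale(3)  # e.g. add workers ahead of a known traffic peak
```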

Resource Constraints: Define resource constraints (CPU, GPU, and memory) for your Docker containers. This lets you allocate specific resources to each container and prevents any single model from monopolizing the hardware.
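With the Docker SDK for Python, for example, these limits can be set when the scheduler starts a container; the image name and limit values below are examples:

```python
# Sketch: start a model container with explicit CPU, memory, and GPU limits
# using the Docker SDK for Python. Image name and limit values are examples.
import docker

client = docker.from_env()

container = client.containers.run(
    "my-model-server:latest",
    detach=True,
    nano_cpus=2_000_000_000,  # 2 CPUs (value is in units of 1e-9 CPUs)
    mem_limit="8g",           # hard memory cap
    device_requests=[         # expose exactly one GPU to this container
        docker.types.DeviceRequest(count=1, capabilities=[["gpu"]])
    ],
)
```

The equivalent docker run flags are --cpus, --memory, and --gpus.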

Model Pipelining: If possible, split your AI models into smaller components, and create a pipeline to process requests. This way, you can parallelize the work across different containers, improving throughput and reducing the need to switch between models frequently.
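A pipeline can be as simple as each stage living in its own container and the output of one stage being posted to the next. A minimal sketch, with placeholder stage URLs:

```python
# Sketch: chain independent stage containers into a pipeline.
# Stage URLs are placeholders for your own containers.
import requests

STAGES = [
    "http://preprocess:8080/run",   # e.g. tokenization / feature extraction
    "http://model:8080/run",        # the heavyweight model
    "http://postprocess:8080/run",  # formatting / filtering of results
]

def run_pipeline(payload: dict) -> dict:
    for url in STAGES:
        resp = requests.post(url, json=payload, timeout=120)
        resp.raise_for_status()
        payload = resp.json()  # output of one stage feeds the next
    return payload
```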

Job Queuing: Implement a job queuing system like RabbitMQ, Apache Kafka, or Celery to manage incoming requests. This way, you can buffer requests when the system is busy and process them in parallel or sequentially, depending on your hardware capabilities.
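With Celery backed by RabbitMQ, for instance, the scheduler only enqueues jobs and each worker container pulls the next one when it has capacity. A minimal sketch, where the broker URL and task body are placeholders:

```python
# Sketch: queue inference jobs with Celery so the hardware is never oversubscribed.
# Broker URL and the run_inference body are placeholders.
from celery import Celery

app = Celery("inference", broker="amqp://guest@rabbitmq//")

@app.task
def run_inference(model_name: str, payload: dict) -> dict:
    # A real worker would load (or reuse) the model and run the request here.
    return {"model": model_name, "result": "..."}

# Producer side (the scheduler): enqueue and move on.
run_inference.delay("image_classifier", {"input": [1, 2, 3]})
```

Running each worker with a concurrency of 1 inside its container keeps a single job at a time on that container's GPU.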

To summarize, using a combination of model serving frameworks, container orchestration tools, load balancers, and job queuing systems can help you efficiently manage multiple AI models on the same hardware. It will enable you to optimize resource usage, handle failures, and scale your system more effectively.
Can you provide more details on how to implement resource constraints for Docker containers to allocate specific resources to each model?