In [1]:
!pwd
!ls
!tree

/home/lithira/Project1/19_06_24_conda/triton_repo
build_env.sh  requirements.txt  triton_run.ipynb  violation
.
├── build_env.sh
├── requirements.txt
├── triton_run.ipynb
└── violation
    ├── 1
    │   ├── __pycache__
    │   │   └── model.cpython-310.pyc
    │   ├── model.py
    │   ├── show.mp4
    │   ├── xgb_model.pkl
    │   ├── yolov8x-seg.engine
    │   └── yolov8x-seg.pt
    └── config.pbtxt

3 directories, 10 files
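The listing above places `config.pbtxt` directly under `violation` and `model.py` inside the numbered version directory `1`. A minimal sketch of a check for that layout follows. The helper name `check_model_dir` is hypothetical, and the rules it applies only mirror the tree shown here; they are not a complete statement of what Triton accepts.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of layout problems for one model directory (empty if none).

    Assumption: the model follows the layout in the tree listing, i.e.
    <model>/config.pbtxt plus <model>/<numeric version>/model.py.
    """
    model_dir = Path(model_dir)
    problems = []

    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")

    # Version directories are named with digits only, for example "1".
    versions = sorted(
        d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()
    )
    if not versions:
        problems.append("no numeric version directory")

    for version in versions:
        if not (version / "model.py").is_file():
            problems.append(f"version {version.name}: missing model.py")

    return problems
```

Calling `check_model_dir("violation")` from the directory shown by `!pwd` would return an empty list if the layout matches the tree.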


In [7]:
!docker run -d --shm-size=5G --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v $PWD:/mnt/data/model_repository \
custom_triton \
tritonserver \
--model-repository=/mnt/data/model_repository \
--log-verbose=1

8f0970455524cc463593908a3734be4cec5537c2b44b378bad516cf778312fdd


In [4]:
!docker ps

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES


## Docker Command Explanation

The `docker run` command above starts a container running the NVIDIA Triton Inference Server with the following options:

- `-d`: Runs the container in detached mode, meaning it runs in the background and does not block your terminal.

- `--shm-size=5G`: Sets the size of `/dev/shm` (shared memory) to 5 gigabytes. Shared memory is used for inter-process communication; Triton's Python backend, which serves `model.py` here, uses it to pass tensors between the server and the Python model process.

- `--gpus all`: Grants the container access to all GPUs on the host machine. This is important for running machine learning models that utilize GPU acceleration.

- `-p 8000:8000 -p 8001:8001 -p 8002:8002`: Publishes container ports on the same port numbers of the host, so the services running inside the container are reachable from the host machine.
   - `8000`: Triton's default port for HTTP requests.
   - `8001`: Triton's default port for gRPC requests.
   - `8002`: Triton's default port for the metrics endpoint.

- `-v $PWD:/mnt/data/model_repository`: Mounts the current working directory (`$PWD`, which is the `triton_repo` directory shown by `!pwd`) on the host to `/mnt/data/model_repository` inside the container. This is the path passed to Triton as the model repository below.

- `custom_triton`: Specifies the Docker image to use, here a locally built image named `custom_triton`. The stock equivalent is `nvcr.io/nvidia/tritonserver:24.05-py3`, version `24.05-py3` of NVIDIA's Triton Inference Server from NVIDIA's container registry (`nvcr.io`).

- `tritonserver`: The command run inside the container to start the Triton Inference Server.

- `--model-repository=/mnt/data/model_repository`: Tells Triton where to find the model repository inside the container.

- `--log-verbose=1`: Enables verbose logging. A value of `0` disables it, and higher values produce more detailed log output.

This command sets up and runs a Triton Inference Server in a Docker container with access to shared memory, GPUs, and specific ports, using your local model repository.
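Once the container is up, the published ports can be probed from the host. The sketch below assumes Triton's standard HTTP readiness route (`/v2/health/ready` on port 8000) and Prometheus metrics route (`/metrics` on port 8002). The helper names `triton_endpoints` and `is_ready` are hypothetical, and the standard library's `urllib` is used so the snippet has no extra dependencies.

```python
import urllib.error
import urllib.request


def triton_endpoints(host="localhost"):
    """Return the URLs a client would probe on the published ports."""
    return {
        "http_ready": f"http://{host}:8000/v2/health/ready",
        "metrics": f"http://{host}:8002/metrics",
    }


def is_ready(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with HTTP 200, False on any error.

    `opener` is injectable so the function can be exercised without a server.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx HTTP errors.
        return False
```

For example, `is_ready(triton_endpoints()["http_ready"])` should return `False` while `docker ps` shows no running container, and `True` once the server has finished loading its models.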


In [14]:
!docker run -d --shm-size=5G --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $PWD:/mnt/data/model_repository \
  --network bridge \
  custom_triton \
  tritonserver \
  --model-repository=/mnt/data/model_repository \
  --log-verbose=1

bf89196247b6d15ae878baa4a71df80543fc304b501ebe91de2a7eb9e51439f9


In [15]:
!docker ps

CONTAINER ID   IMAGE           COMMAND                  CREATED         STATUS         PORTS                              NAMES
bf89196247b6   custom_triton   "/opt/nvidia/nvidia_…"   6 seconds ago   Up 2 seconds   0.0.0.0:8000-8002->8000-8002/tcp   trusting_stonebraker


In [18]:
import requests
import json

url = "http://localhost:8000/v2/models/violation/versions/1/infer"

payload = json.dumps({
    "inputs": [
        {
            "name": "INPUT0",
            "shape": [1, 1],  # 1x1 tensor holding a single string element
            "datatype": "BYTES",
            "data": ["some_input_data"]
        }
    ]
})
headers = {"Content-Type": "application/json"}

response = requests.post(url, headers=headers, data=payload)
response.raise_for_status()  # raise on a non-2xx reply instead of parsing an error body

print(response.json())


{'model_name': 'violation', 'model_version': '1', 'outputs': [{'name': 'OUTPUT0', 'datatype': 'INT32', 'shape': [1], 'data': [1]}]}
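The request and response above both follow a fixed JSON shape: a list of named inputs going in and a list of named outputs coming back. A small sketch of helpers for that shape follows. The names `build_infer_payload` and `outputs_by_name` are hypothetical, and the `[1, len(values)]` shape is an assumption taken from the `[1, 1]` request in the cell above.

```python
def build_infer_payload(input_name, values, datatype="BYTES"):
    """Build a request body with one input of shape [1, len(values)].

    Assumption: the model takes a single input tensor laid out like the
    request in the cell above.
    """
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [1, len(values)],
                "datatype": datatype,
                "data": list(values),
            }
        ]
    }


def outputs_by_name(response_body):
    """Map each output tensor's name to its flat data list."""
    return {out["name"]: out["data"] for out in response_body.get("outputs", [])}
```

Applied to the response printed above, `outputs_by_name` returns `{'OUTPUT0': [1]}`, and `build_infer_payload("INPUT0", ["some_input_data"])` reproduces the request body sent in the cell.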
