## Triton Inference Server PyTorch Example

- <b>References/Docs</b>: https://pytorch.org/TensorRT/tutorials/serving_torch_tensorrt_with_triton.html
- <b>Environment/Setup</b>: SageMaker g4dn.xlarge classic notebook instance, conda_pytorch_p310 kernel. The model can also run on a CPU instance if you prefer; this sample uses a GPU throughout.

### Setup

We will be orchestrating inference with the HTTP Triton Client: https://github.com/triton-inference-server/client.

In [None]:
!pip install tritonclient[http]

### Dummy Local TorchScript Model

Credits: I used ChatGPT to generate a simple mock linear regression PyTorch model, just so we have a model artifact to work with. In this case it will be a TorchScript model (model.pt).

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

In [None]:
# Generate some random data for a linear regression problem
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 1 + 2 * X + np.random.randn(100, 1)

# Convert the NumPy arrays to PyTorch tensors
X_tensor = torch.from_numpy(X).float()
y_tensor = torch.from_numpy(y).float()

In [None]:
# Define a linear regression model
class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(1, 1)  # One input feature, one output

    def forward(self, x):
        return self.linear(x)

# Instantiate the model and specify a loss function and optimizer
model = LinearRegression()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
num_epochs = 1000
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X_tensor)
    loss = criterion(outputs, y_tensor)
    
    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
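
As a sanity check on the training loop, the learned parameters can be compared with a closed-form least-squares fit of the same data. This is a minimal sketch, assuming the same seed and data generation as above; `np.linalg.lstsq` is swapped in here in place of SGD, and the names `A`, `slope`, and `intercept` are illustrative only.

```python
import numpy as np

# Recreate the synthetic data exactly as in the cell above
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 1 + 2 * X + np.random.randn(100, 1)

# Design matrix with a column of ones for the bias term
A = np.hstack([X, np.ones_like(X)])

# Solve the least-squares problem A @ [slope, intercept] ~= y
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coef.ravel()
print(f"closed-form slope={slope:.3f}, intercept={intercept:.3f}")
```

In the notebook, `model.linear.weight.item()` and `model.linear.bias.item()` should land close to these values; both are noisy estimates of the true slope 2 and intercept 1 used to generate the data.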

### Save Model + Local Inference

In [None]:
# save model as a torchscript model
torch.jit.save(torch.jit.script(model), 'model.pt')

In [None]:
# Load the saved model
loaded_model = torch.jit.load('model.pt')

In [None]:
# sample inference
test = torch.tensor([[2.5]])
pred = loaded_model(test)
pred

### Triton Setup

We first set up the artifacts in the layout the model server expects. This is the model repository structure for this backend:

- linear_regression_model
    - 1
        - model.pt
        - model.py (optional, not included in this case)
    - config.pbtxt
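
The layout above can be checked from Python before starting the server. This is a minimal sketch: `missing_repo_files` is a hypothetical helper (not part of Triton), and the demo builds the tree in a temporary directory with empty placeholder files rather than touching the real artifacts.

```python
from pathlib import Path
import tempfile


def missing_repo_files(repo_root, model_name, version="1", artifact="model.pt"):
    """Return the expected paths that do not exist (an empty list means the layout looks right)."""
    model_dir = Path(repo_root) / model_name
    expected = [model_dir / "config.pbtxt", model_dir / version / artifact]
    return [p for p in expected if not p.is_file()]


# Demo in a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "linear_regression_model"
    (model_dir / "1").mkdir(parents=True)
    (model_dir / "config.pbtxt").touch()

    # model.pt has not been created yet, so it is reported as missing
    before = missing_repo_files(tmp, "linear_regression_model")

    (model_dir / "1" / "model.pt").touch()
    after = missing_repo_files(tmp, "linear_regression_model")

print([p.name for p in before], after)
```

Once the real files are in place, calling `missing_repo_files(".", "linear_regression_model")` from the notebook directory should return an empty list.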

#### Create Config File For PyTorch Backend

In [None]:
%%writefile config.pbtxt
name: "linear_regression_model"
platform: "pytorch_libtorch"

input {
  name: "input"
  data_type: TYPE_FP32
  dims: [ 1, 1 ]
}

output {
  name: "output"
  data_type: TYPE_FP32
  dims: [ 1, 1 ]
}

In [None]:
%%sh
# create the model and version directories, then move the artifacts into place
mkdir -p linear_regression_model/1
mv config.pbtxt linear_regression_model/
mv model.pt linear_regression_model/1/

Next, run the following Docker command in a terminal to start Triton Inference Server. It uses the 25.03 Triton image, the latest available when this was last updated. Make sure to update the volume mount path in the command to match the directory where this notebook is located (run `pwd` there to find it).

```
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/ec2-user/SageMaker/triton-inference-server-examples/pytorch-backend:/models nvcr.io/nvidia/tritonserver:25.03-py3 tritonserver --model-repository=/models --exit-on-error=false --log-verbose=1
```

Once the server is started we can send requests.
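
Startup can take a little while, so it may help to poll the server before sending the first request. This is a minimal sketch using only the standard library and the `/v2/health/ready` endpoint of the HTTP API; `is_ready` and `wait_until_ready` are hypothetical helper names, and the attempt count and delay are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url="http://localhost:8000", path="/v2/health/ready", timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(f"{base_url}{path}", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe=is_ready, attempts=30, delay=1.0):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between failed tries."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

In the notebook, `wait_until_ready()` can be called before the first request. Passing `probe=lambda: is_ready(path="/v2/models/linear_regression_model/ready")` checks that the model itself is loaded rather than just the server.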

### Triton Inference

There are two ways we can run inference:

1. Using the Python requests library against the Triton server's HTTP endpoint on port 8000
2. Using the Triton client library

#### Python Requests Library

In [None]:
import numpy as np
import requests
import json

# sample data
input_data = np.array([[2.5]], dtype=np.float32)

# Specify the model name and version
model_name = "linear_regression_model" #specified in config.pbtxt
model_version = "1"

# Set the inference URL based on the Triton server's address
url = f"http://localhost:8000/v2/models/{model_name}/versions/{model_version}/infer"

# payload with input params
payload = {
    "inputs": [
        {
            "name": "input",  # what you named input in config.pbtxt
            "datatype": "FP32",  
            "shape": input_data.shape,
            "data": input_data.tolist(),
        }
    ]
}

# sample invoke
response = requests.post(url, data=json.dumps(payload))
response.raise_for_status()

# output result
inference_result = response.json()
output_data = np.array(inference_result["outputs"][0]["data"])
output_data
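
The request and response bodies above follow the same JSON tensor layout: a name, a datatype, a shape, and the data. The sketch below wraps both directions in small helpers so the shape and dtype survive the round trip. `build_infer_payload` and `parse_infer_response` are hypothetical names, the dtype table covers only a few common types, and `fake_response` is a made-up stand-in for a server reply, not real model output.

```python
import json

import numpy as np

# Small subset of the protocol's datatype names, keyed by NumPy dtype name
_DTYPES = {"float32": "FP32", "float64": "FP64", "int32": "INT32", "int64": "INT64"}
_NP_TYPES = {v: k for k, v in _DTYPES.items()}


def build_infer_payload(name, array):
    """Describe one NumPy array as a single-input inference request body."""
    return {
        "inputs": [
            {
                "name": name,
                "datatype": _DTYPES[array.dtype.name],
                "shape": list(array.shape),
                "data": array.flatten().tolist(),  # flattened, row-major
            }
        ]
    }


def parse_infer_response(response_json):
    """Map each output name to a NumPy array with the reported dtype and shape."""
    return {
        out["name"]: np.array(out["data"], dtype=_NP_TYPES[out["datatype"]]).reshape(out["shape"])
        for out in response_json["outputs"]
    }


payload = build_infer_payload("input", np.array([[2.5]], dtype=np.float32))
body = json.dumps(payload)  # string that could be sent with requests.post(url, data=body)

# Made-up response used only to exercise the parser
fake_response = {
    "outputs": [{"name": "output", "datatype": "FP32", "shape": [1, 1], "data": [6.0]}]
}
arrays = parse_infer_response(fake_response)
print(arrays["output"].shape, arrays["output"].dtype)
```

With a running server, `parse_infer_response(response.json())["output"]` would replace the manual `np.array(...)` conversion and keep the `[1, 1]` shape.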

#### Triton Client Library

In [None]:
import numpy as np
import tritonclient.http as httpclient

# setup triton inference client
client = httpclient.InferenceServerClient(url="localhost:8000")

In [None]:
# describe the input tensor; the name, shape, and datatype must match config.pbtxt
inputs = httpclient.InferInput("input", input_data.shape, datatype="FP32")
inputs.set_data_from_numpy(input_data)  # attach the NumPy array as the tensor data
inputs

In [None]:
# output configuration
outputs = httpclient.InferRequestedOutput("output")
outputs

In [None]:
# sample inference
res = client.infer(model_name="linear_regression_model", inputs=[inputs], outputs=[outputs])
inference_output = res.as_numpy('output')  # convert the output tensor to a NumPy array
inference_output

In [None]:
%%time

# send 100 sequential requests; %%time reports the total elapsed time
for i in range(100):
    res = client.infer(model_name="linear_regression_model", inputs=[inputs], outputs=[outputs])
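
`%%time` reports only the total for the loop. For a per-request view, each call can be timed individually and summarized. This is a minimal sketch: `measure_latency` is a hypothetical helper, the percentile is a simple index into the sorted samples, and a stand-in workload is used so the block runs without a server.

```python
import statistics
import time


def measure_latency(call, n=100):
    """Time n invocations of call() and summarize the results in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "mean_ms": statistics.fmean(samples),
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[min(n - 1, int(0.95 * n))],
    }


# Stand-in workload so the sketch runs without a Triton server
stats = measure_latency(lambda: sum(range(1000)), n=50)
print(stats)
```

In the notebook, the lambda would be replaced with `lambda: client.infer(model_name="linear_regression_model", inputs=[inputs], outputs=[outputs])`.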