## Run Inference on all deployed endpoints: Various combinations of payloads, concurrency levels, model configurations
---------------------
*This notebook works best with the conda_python3 kernel on an ml.t3.medium instance*.

#### This step of our solution design runs inference against all deployed model endpoints (with different configurations, concurrency levels and payload sizes). The notebook calls the endpoints concurrently and asynchronously to generate responses and record metrics. Here are some of the key components:

- **Accessing the deployed endpoints** and creating a predictor object for each one, so that it can be called at inference time.

- **Functions to define metrics**: This notebook sets the stage for the metrics that are recorded while these models are invoked, for benchmarking purposes.

- **Running actual inferences**: Once the metrics are defined, we set up a blocking function, `get_inference`, that is responsible for running inference on a single payload. We then run a series of asynchronous functions, which can be viewed in the code below, to create asynchronous inferences on the deployed models. We send requests by creating combinations: payload files of different sizes, which can be viewed in the config.yml file, are paired with different concurrency levels (in this case we first go through all batches of payloads with a concurrency level of 1, then 2, and then 4). You can set the concurrency levels to your desired values.
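
As a rough illustration of the combination step, the sketch below pairs every concurrency level with every payload file using `itertools.product`. The concurrency levels and file names are illustrative values chosen for this example, not read from config.yml.

```python
import itertools

# Illustrative values (assumptions); the notebook reads these per experiment from the config file
concurrency_levels = [1, 2, 4]
payload_files = ["payload_en_1-500.jsonl", "payload_en_500-1000.jsonl"]

# The first argument varies slowest, so all payload files are run at
# concurrency 1, then all of them at 2, then all of them at 4
combinations = list(itertools.product(concurrency_levels, payload_files))
for concurrency, payload_file in combinations:
    print(f"concurrency={concurrency}, payload_file={payload_file}")
```

With three concurrency levels and two payload files this yields six combinations.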

In [27]:
## auto reload all of the changes made in the config/globals.py file 
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Import all of the necessary libraries below to run this notebook

In [28]:
import os
import glob
import time
import json
import copy
import boto3
import asyncio
import logging
import itertools
import sagemaker
import pandas as pd
from globals import *
from pathlib import Path
from datetime import datetime
from transformers import AutoTokenizer
from sagemaker.predictor import Predictor
from utils import load_config, count_tokens
from sagemaker.serializers import JSONSerializer
from typing import Dict, List, Optional, Tuple, Union

#### Pygmentize globals.py to view and use any of the globally initialized variables 

In [29]:
# global constants
!pygmentize globals.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


import os
import yaml
from enum import Enum
from pathlib import Path

CONFIG_FILEPATH_FILE: str = "config_filepath.txt"

CONFIG_FILE: str = Path(CONFIG_FILEPATH_FILE).read_text()
print(f"CONFIG_FILE={CONFIG_FILE}")
with open(CONFIG_FILE, 'r') as file:
    config = yaml.safe_load(file)

DATA_DIR: str = "

In [30]:
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

#### Load the Config.yml file that contains information that is used across this benchmarking environment

In [31]:
config = load_config(CONFIG_FILE)
logger.info(json.dumps(config, indent=2))

[2024-01-23 15:28:07,370] p1345 {635462509.py:2} INFO - {
  "general": {
    "name": "llama2-70b-g5-p4d-trt-v1",
    "model_name": "Llama2-70b"
  },
  "aws": {
    "region": "us-east-1"
  },
  "prompt": {
    "template_file": "prompt_template.txt",
    "all_prompts_file": "all_prompts.csv"
  },
  "datasets": [
    {
      "language": "en",
      "min_length_in_tokens": 1,
      "max_length_in_tokens": 500,
      "payload_file": "payload_{lang}_{min}-{max}.jsonl"
    },
    {
      "language": "en",
      "min_length_in_tokens": 500,
      "max_length_in_tokens": 1000,
      "payload_file": "payload_{lang}_{min}-{max}.jsonl"
    },
    {
      "language": "en",
      "min_length_in_tokens": 1000,
      "max_length_in_tokens": 2000,
      "payload_file": "payload_{lang}_{min}-{max}.jsonl"
    },
    {
      "language": "en",
      "min_length_in_tokens": 2000,
      "max_length_in_tokens": 3000,
      "payload_file": "payload_{lang}_{min}-{max}.jsonl"
    },
    {
      "language": "en",

In [33]:
date_time = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
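
Each dataset entry in the config output above carries a `payload_file` template with `{lang}`, `{min}` and `{max}` placeholders. The sketch below shows one way such a template could be expanded into a concrete file name. Using `str.format` here is an assumption made for illustration; the project's own helper may do this differently.

```python
# Dataset entry copied from the config output above
dataset = {
    "language": "en",
    "min_length_in_tokens": 1,
    "max_length_in_tokens": 500,
    "payload_file": "payload_{lang}_{min}-{max}.jsonl",
}

# Assumption: the placeholders are filled from the other fields of the same entry
payload_file = dataset["payload_file"].format(
    lang=dataset["language"],
    min=dataset["min_length_in_tokens"],
    max=dataset["max_length_in_tokens"],
)
print(payload_file)
```

The resulting name has the same form as the payload file names that appear in the experiment log further down.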

### Access the deployed model endpoints from the endpoints.json file 

In [34]:
# read the list of deployed endpoints
endpoint_info_list = json.loads(Path(ENDPOINT_LIST_FPATH).read_text())
logger.info(f"found information for {len(endpoint_info_list)} endpoints")
logger.info(json.dumps(endpoint_info_list, indent=2))

[2024-01-23 15:29:02,539] p1345 {3756670252.py:3} INFO - found information for 2 endpoints
[2024-01-23 15:29:02,540] p1345 {3756670252.py:4} INFO - [
  {
    "experiment_name": "llama2-70b-g5.48xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0",
    "endpoint": {
      "EndpointName": "llama-2-70b-g5-48xlarge-1706022365",
      "EndpointArn": "arn:aws:sagemaker:us-east-1:015469603702:endpoint/llama-2-70b-g5-48xlarge-1706022365",
      "EndpointConfigName": "llama-2-70b-g5-48xlarge-1706022365",
      "ProductionVariants": [
        {
          "VariantName": "AllTraffic",
          "DeployedImages": [
            {
              "SpecifiedImage": "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04",
              "ResolvedImage": "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference@sha256:2739b630b95d8a95e6b4665e66d8243dd43b99c4fdb865feff13aab9c1da06eb",
              "ResolutionTime":

In [35]:
# List down the endpoint names that have been deployed
endpoint_name_list = [e['endpoint']['EndpointName'] for e in endpoint_info_list]
logger.info(f"there are {len(endpoint_name_list)} deployed endpoint(s), endpoint_name_list->{endpoint_name_list}")

[2024-01-23 15:29:03,937] p1345 {1455142584.py:3} INFO - there are 2 deployed endpoint(s), endpoint_name_list->['llama-2-70b-g5-48xlarge-1706022365', 'llama2-70bdjl-2024-01-23-15-06-05-901-endpoint']


### Creating predictor objects from the deployed endpoints

In [36]:
# create predictor objects

## create a sagemaker predictor for these endpoints
def create_predictor(endpoint_name: str) -> Optional[sagemaker.base_predictor.Predictor]:
    # Create a SageMaker Predictor object
    predictor = Predictor(
        endpoint_name=endpoint_name,
        sagemaker_session=sagemaker.Session(),
        serializer=JSONSerializer()
    )
    return predictor

## Create one predictor object per deployed endpoint and log the resulting list
predictor_list: List = [create_predictor(ep) for ep in endpoint_name_list]
logger.info(predictor_list)

[2024-01-23 15:29:04,534] p1345 {3970465137.py:15} INFO - [<sagemaker.base_predictor.Predictor object at 0x7ff6d94a7310>, <sagemaker.base_predictor.Predictor object at 0x7ff6ea50d090>]


### Creating functions to define and calculate metrics during the time of invocations

In [37]:
def safe_sum(l: List) -> Union[int, float]:
    return sum(filter(None, l))

def safe_div(n: Union[int, float], d: Union[int, float]) -> Optional[Union[int, float]]:
    return n/d if d else None

## Represents the function to calculate all of the metrics at the time of inference
def calculate_metrics(responses, chunk, elapsed_async, experiment_name, concurrency, payload_file) -> Dict:
    
    ## calculate errors based on the completion status of the inference prompt
    errors = [r for r in responses if r['completion'] is None]
    
    ## Calculate the difference as the successes 
    successes = len(chunk) - len(errors)
    
    ## Count all of the prompts token count during inference
    all_prompts_token_count = safe_sum([r['prompt_tokens'] for r in responses])
    prompt_token_throughput = round(all_prompts_token_count / elapsed_async, 2)
    prompt_token_count_mean = safe_div(all_prompts_token_count, successes)
    all_completions_token_count = safe_sum([r['completion_tokens'] for r in responses])
    completion_token_throughput = round(all_completions_token_count / elapsed_async, 2)
    completion_token_count_mean = safe_div(all_completions_token_count, successes)
    transactions_per_second = round(successes / elapsed_async, 2)
    transactions_per_minute = int(transactions_per_second * 60)
    
    ## calculate the latency mean utilizing the safe_sum function defined above
    latency_mean = safe_div(safe_sum([r['latency'] for r in responses]), successes)
    
    ## Function returns all these values at the time of the invocations
    return {
        'experiment_name': experiment_name,
        'concurrency': concurrency,
        'payload_file': payload_file,
        'errors': errors,
        'successes': successes,
        'error_rate': len(errors)/len(chunk),
        'all_prompts_token_count': all_prompts_token_count,
        'prompt_token_count_mean': prompt_token_count_mean,
        'prompt_token_throughput': prompt_token_throughput,
        'all_completions_token_count': all_completions_token_count,
        'completion_token_count_mean': completion_token_count_mean,
        'completion_token_throughput': completion_token_throughput,
        'transactions': len(chunk),
        'transactions_per_second': transactions_per_second,
        'transactions_per_minute': transactions_per_minute,
        'latency_mean': latency_mean
    }
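
To make the arithmetic concrete, here is a small worked example that repeats the same calculations on three made-up responses (two successes and one failure). All numbers are invented for illustration, and the snippet re-implements the formulas inline rather than calling `calculate_metrics`.

```python
# Toy responses (made-up numbers): two successes and one failure
responses = [
    {"completion": "ok", "prompt_tokens": 100, "completion_tokens": 20, "latency": 2.0},
    {"completion": "ok", "prompt_tokens": 300, "completion_tokens": 40, "latency": 4.0},
    {"completion": None, "prompt_tokens": 200, "completion_tokens": None, "latency": None},
]
elapsed = 5.0  # assumed wall-clock seconds for the whole chunk

errors = [r for r in responses if r["completion"] is None]
successes = len(responses) - len(errors)

# None values are dropped before summing, mirroring safe_sum
prompt_tokens = sum(filter(None, (r["prompt_tokens"] for r in responses)))
completion_tokens = sum(filter(None, (r["completion_tokens"] for r in responses)))
latency_total = sum(filter(None, (r["latency"] for r in responses)))

error_rate = len(errors) / len(responses)
prompt_token_throughput = round(prompt_tokens / elapsed, 2)
completion_token_throughput = round(completion_tokens / elapsed, 2)
transactions_per_second = round(successes / elapsed, 2)
latency_mean = latency_total / successes if successes else None

print(error_rate, prompt_token_throughput, completion_token_throughput,
      transactions_per_second, latency_mean)
```

Note that the prompt token total still includes the prompt of the failed request, because its token count is recorded even when the invocation fails, while the means are divided by the number of successes only.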

### Set a blocker function and a series of asynchronous concurrent model prompt invocations

In [38]:
def set_metrics(endpoint_name=None,
                    prompt=None,
                    inference_params=None,
                    completion=None,
                    prompt_tokens=None,
                    completion_tokens=None,
                    latency=None) -> Dict:
    return dict(endpoint_name=endpoint_name,                
                prompt=prompt,
                **inference_params,
                completion=completion,
                prompt_tokens=prompt_tokens,
                completion_tokens=completion_tokens,
                latency=latency)

def get_inference(predictor, payload) -> Dict:
    
    # initialize up front so the except branch can reference prompt_tokens even if token counting fails
    prompt_tokens = None
    latency = 0

    try:
        prompt_tokens = count_tokens(payload['inputs'])
        logger.info(f"get_inference, endpoint={predictor.endpoint_name}, prompt_tokens={prompt_tokens}")

        # get inference
        st = time.perf_counter()        
        response = predictor.predict(payload)        
        latency = time.perf_counter() - st

        if isinstance(response, bytes):
            response = response.decode('utf-8')
        response_json = json.loads(response)
        if isinstance(response_json, list):
            response_json = response_json[0]

        completion = response_json.get("generated_text", "")
        completion_tokens = count_tokens(completion)

        # Set metrics and logging for both cases
        response = set_metrics(predictor.endpoint_name,
                               payload['inputs'],
                               payload['parameters'],
                               completion,
                               prompt_tokens,
                               completion_tokens,
                               latency)
        # logger.info(f"get_inference, done, endpoint={predictor.endpoint_name}, response={json.dumps(response, indent=2)}, latency={latency:.2f}")
        logger.info(f"get_inference, done, endpoint={predictor.endpoint_name}, completion_tokens={completion_tokens}, latency={latency:.2f}")
    except Exception as e:
        print(f"error occurred with {predictor.endpoint_name}, exception={str(e)}")
        response = set_metrics(predictor.endpoint_name,
                               payload['inputs'],
                               payload['parameters'],
                               None,
                               prompt_tokens,
                               None,
                               None)

    return response
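
The response handling inside `get_inference` accepts raw bytes or text, and either a JSON list or a JSON object. The sketch below isolates that normalisation in a standalone function. The name `extract_generated_text` and the sample response bodies are hypothetical; the real response shape depends on the serving container.

```python
import json
from typing import Any, Optional


def extract_generated_text(raw: Any) -> Optional[str]:
    """Hypothetical helper: normalise a raw response into the generated text."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    parsed = json.loads(raw) if isinstance(raw, str) else raw
    # Some containers wrap the result in a list; take the first element if present
    if isinstance(parsed, list):
        parsed = parsed[0] if parsed else {}
    if not isinstance(parsed, dict):
        return None
    return parsed.get("generated_text")


# Made-up response bodies for illustration
print(extract_generated_text(b'[{"generated_text": "hello"}]'))
print(extract_generated_text('{"generated_text": "hi"}'))
```

Unlike the cell above, this sketch returns `None` rather than an empty string when no `generated_text` field is present, which is a design choice of the example and not of the notebook.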

### Setting a series of asynchronous functions to invoke and run inferences concurrently and asynchronously

In [39]:
## Run the blocking get_inference call in a separate thread so that it can be awaited
async def async_get_inference(predictor, payload: Dict) -> Dict:
    return await asyncio.to_thread(get_inference, predictor, payload)

## Gather the tasks so that all invocations for a payload list run concurrently
async def async_get_all_inferences(predictor, payload_list: List) -> List:
    return await asyncio.gather(*[async_get_inference(predictor, payload) for payload in payload_list])
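
The pattern above (a blocking call moved to a worker thread with `asyncio.to_thread`, then fanned out with `asyncio.gather`) can be tried without any endpoint. In this sketch `blocking_call` is a hypothetical stand-in that just sleeps, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time
from typing import List


def blocking_call(delay: float) -> float:
    """Hypothetical stand-in for a blocking endpoint invocation."""
    time.sleep(delay)
    return delay


async def run_concurrently(delays: List[float]) -> List[float]:
    # Each blocking call runs in its own worker thread; gather returns results in input order
    return await asyncio.gather(*[asyncio.to_thread(blocking_call, d) for d in delays])


start = time.perf_counter()
results = asyncio.run(run_concurrently([0.3, 0.3, 0.3]))
elapsed = time.perf_counter() - start
print(results, f"elapsed={elapsed:.2f}s")
```

Because the three calls overlap, the elapsed time is close to the longest single delay rather than the sum of all delays. Inside a notebook cell, where an event loop is already running, `await run_concurrently(...)` would be used instead of `asyncio.run`.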

In [40]:
## Run the asynchronous functions above for one chunk of an experiment at a given concurrency level, then compute metrics
async def run_inferences(predictor: sagemaker.base_predictor.Predictor, chunk: List, experiment: Dict, concurrency: int, payload_file: str) -> Tuple[List, Dict]:
    logger.info(f"Processing chunk with concurrency={concurrency}")
    s = time.perf_counter()
    responses = await async_get_all_inferences(predictor, chunk)
    elapsed_async = time.perf_counter() - s

    # Add more metadata about this experiment
    for r in responses:
        r['experiment_name'] = experiment['name']
        r['concurrency'] = concurrency

    metrics = calculate_metrics(responses, chunk, elapsed_async, experiment['name'], concurrency, payload_file)
    return responses, metrics

In [41]:
## Function to create the predictors from the experiment we are iterating over
def create_predictor_for_experiment(experiment: Dict, config: Dict, endpoint_info_list: List) -> Optional[sagemaker.base_predictor.Predictor]:

    ## 1-based position of this experiment in the config, used only for logging
    e_idx = config['experiments'].index(experiment) + 1

    ## Iterate through the endpoint information to fetch the endpoint name
    ep_info = [e for e in endpoint_info_list if e['experiment_name'] == experiment['name']]
    if not ep_info:
        logger.error(f"endpoint for experiment={experiment['name']} not found, skipping")
        return None
    ep_name = ep_info[0]['endpoint']['EndpointName']
    logger.info(f"experiment={e_idx}, name={experiment['name']}, ep_name={ep_name}")

    # create a predictor from each endpoint in experiments
    return create_predictor(ep_name)
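
The lookup step in this function can be seen in isolation below. `find_endpoint_name` is a hypothetical helper written for this example, and the entries are made up; they only mimic the shape of the records in endpoints.json.

```python
from typing import Dict, List, Optional


def find_endpoint_name(experiment_name: str, endpoint_info_list: List[Dict]) -> Optional[str]:
    """Hypothetical helper: return the first matching endpoint name, or None."""
    return next(
        (e["endpoint"]["EndpointName"]
         for e in endpoint_info_list
         if e["experiment_name"] == experiment_name),
        None,
    )


# Made-up entries with the same shape as the records in endpoints.json
endpoint_info_list = [
    {"experiment_name": "exp-a", "endpoint": {"EndpointName": "endpoint-a"}},
    {"experiment_name": "exp-b", "endpoint": {"EndpointName": "endpoint-b"}},
]

print(find_endpoint_name("exp-b", endpoint_info_list))
print(find_endpoint_name("exp-missing", endpoint_info_list))
```

Returning `None` for an unknown experiment lets the caller skip it, which is how the experiment loop treats a missing predictor.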

In [42]:
## Here, we process combinations of concurrency levels and payload files: for each combination, the payloads
## are split into chunks whose size equals the concurrency level, and each chunk is later sent for inference
## concurrently

def create_payload_dict(jline: str, experiment: Dict) -> Dict:
    payload: Dict = json.loads(jline)
    if experiment.get('remove_truncate', False) is True:
        if payload['parameters'].get('truncate'):
            del payload['parameters']['truncate']
    return payload
    
    
def create_combinations(experiment: Dict) -> List[Tuple]:
    combinations_data = []

    # Repeat for each concurrency level
    combinations = list(itertools.product(experiment['concurrency_levels'], experiment['payload_files']))
    logger.info(f"there are {len(combinations)} combinations of {combinations} to run")

    for concurrency, payload_file in combinations:
        # Read the payload file
        fpath = os.path.join(PROMPTS_DIR, payload_file)
        payload_list = [create_payload_dict(jline, experiment) for jline in Path(fpath).read_text().splitlines()]
        logger.info(f"read {fpath}, contains {len(payload_list)} lines")      

        logger.info(f"creating combinations for concurrency={concurrency}, payload_file={payload_file}, payload_list length={len(payload_list)}")
        
        # check if we have enough elements in the list to run at least the concurrency count number of transactions.
        # for example, if there are only 2 prompts and we want to run 6 in parallel, then take the first element (prompt),
        # replicate it 6-2=4 times and add the copies to the original list
        n = concurrency
        
        if len(payload_list) < n:
            elements_to_add = n - len(payload_list)
            element_to_replicate = payload_list[0]
            # payload_list = payload_list.extend([element_to_replicate]*elements_to_add)
            payload_list.extend([element_to_replicate]*elements_to_add)
            
        # Split the original list into sublists which contain the number of requests we want to send concurrently        
        payload_list_splitted = [payload_list[i * n:(i + 1) * n] for i in range((len(payload_list) + n - 1) // n )]  
        
        for p in payload_list_splitted:
            if len(p) < n:
                elements_to_add = n - len(p)
                element_to_replicate = p[0]
                # p = p.extend([element_to_replicate]*elements_to_add)
                p.extend([element_to_replicate]*elements_to_add)
            

        # Only keep chunks that have exactly `concurrency` elements
        len_before = len(payload_list_splitted)
        payload_list_splitted = [p for p in payload_list_splitted if len(p) == concurrency]
        logger.info(f"after only retaining chunks of length {concurrency}, we have {len(payload_list_splitted)} chunks, previously we had {len_before} chunks")
        combinations_data.append((concurrency, payload_file, payload_list_splitted))
    logger.info(f"there are {len(combinations)} combinations for {experiment}")
    return combinations_data

# process_combinations(experiment, predictor, PROMPTS_DIR)
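
The chunk-and-pad behaviour of `create_combinations` is easier to follow on a tiny list. The sketch below is a simplified, standalone version of that logic; `split_into_chunks` is a hypothetical name and the payloads are placeholder strings.

```python
from typing import Any, List


def split_into_chunks(items: List[Any], n: int) -> List[List[Any]]:
    """Hypothetical helper: split items into chunks of exactly n, padding short ones."""
    if not items or n < 1:
        return []
    chunks = [items[i:i + n] for i in range(0, len(items), n)]
    for chunk in chunks:
        # Pad a short chunk by repeating its first element until it has n items
        chunk.extend([chunk[0]] * (n - len(chunk)))
    return chunks


print(split_into_chunks(["p1", "p2"], 6))
print(split_into_chunks(["p1", "p2", "p3", "p4", "p5"], 2))
```

In the first call the two prompts are padded up to six by repeating `p1`; in the second, the trailing chunk `['p5']` is padded to `['p5', 'p5']`, so every chunk sends exactly `n` concurrent requests.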

In [43]:
# for each experiment
#   - for each endpoint and concurrency in an experiment

def clear_dir(dir_path: str):
    files = glob.glob(os.path.join(dir_path, "*"))
    for f in files:
        os.remove(f)

_ = list(map(clear_dir, [METRICS_PER_INFERENCE_DIR, METRICS_PER_CHUNK_DIR]))

num_experiments: int = len(config['experiments'])
for e_idx, experiment in enumerate(config['experiments']):
    e_idx += 1  # Increment experiment index
    # Create the predictor object for this experiment
    predictor = create_predictor_for_experiment(experiment, config, endpoint_info_list)
    if predictor is None:
        logger.error(f"predictor could not be created for experiment={experiment}, moving to next...")
        continue

    # Process combinations of concurrency levels and payload files
    combination_data = create_combinations(experiment)

    for concurrency, payload_file, split_payload in combination_data:
        for chunk_index, chunk in enumerate(split_payload):
            logger.info(f"e_idx={e_idx}/{num_experiments}, chunk_index={chunk_index+1}/{len(split_payload)}")

            # Process each chunk and calculate metrics
            responses, metrics = await run_inferences(predictor, chunk, experiment, concurrency, payload_file)
            if metrics:
                #per_concurrency_level_response_metrics.append(metrics)
                fpath: str = os.path.join(METRICS_PER_CHUNK_DIR, f"{time.time()}.json")
                Path(fpath).write_text(json.dumps(metrics, indent=2))
            if responses:
                for r in responses:
                    fpath: str = os.path.join(METRICS_PER_INFERENCE_DIR, f"{time.time()}.json")
                    Path(fpath).write_text(json.dumps(r, indent=2))
            
            logger.info(f"completed processing chunk {chunk_index+1}/{len(split_payload)} with concurrency={concurrency}")

    logger.info(f"experiment={e_idx}/{num_experiments}, name={experiment['name']}, done")
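
Once the loop has written one JSON file per chunk, the files can be read back and aggregated. The following is a minimal sketch under stated assumptions: a temporary directory stands in for `METRICS_PER_CHUNK_DIR`, the metrics are made up, and the `uuid4` file names are a choice of this example (the loop above names files by timestamp).

```python
import json
import tempfile
import uuid
from pathlib import Path

# Temporary directory standing in for METRICS_PER_CHUNK_DIR (assumption for this sketch)
metrics_dir = Path(tempfile.mkdtemp())

# Made-up per-chunk metrics using a subset of the fields returned by calculate_metrics
sample_metrics = [
    {"concurrency": 1, "transactions": 1, "successes": 1},
    {"concurrency": 6, "transactions": 6, "successes": 1},
]
for m in sample_metrics:
    # uuid4 names avoid collisions when two files are written in quick succession
    (metrics_dir / f"{uuid.uuid4()}.json").write_text(json.dumps(m, indent=2))

# Read every file back and aggregate across chunks
loaded = [json.loads(p.read_text()) for p in metrics_dir.glob("*.json")]
total = sum(m["transactions"] for m in loaded)
failed = total - sum(m["successes"] for m in loaded)
print(f"chunks={len(loaded)}, transactions={total}, error_rate={failed / total:.2f}")
```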

[2024-01-23 15:29:08,900] p1345 {663335596.py:13} INFO - experiment=1, name=llama2-70b-g5.48xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0, ep_name=llama-2-70b-g5-48xlarge-1706022365
[2024-01-23 15:29:08,924] p1345 {1352403322.py:18} INFO - there are 25 combinations of [(1, 'payload_en_1-500.jsonl'), (1, 'payload_en_500-1000.jsonl'), (1, 'payload_en_1000-2000.jsonl'), (1, 'payload_en_2000-3000.jsonl'), (1, 'payload_en_3000-4000.jsonl'), (2, 'payload_en_1-500.jsonl'), (2, 'payload_en_500-1000.jsonl'), (2, 'payload_en_1000-2000.jsonl'), (2, 'payload_en_2000-3000.jsonl'), (2, 'payload_en_3000-4000.jsonl'), (4, 'payload_en_1-500.jsonl'), (4, 'payload_en_500-1000.jsonl'), (4, 'payload_en_1000-2000.jsonl'), (4, 'payload_en_2000-3000.jsonl'), (4, 'payload_en_3000-4000.jsonl'), (6, 'payload_en_1-500.jsonl'), (6, 'payload_en_500-1000.jsonl'), (6, 'payload_en_1000-2000.jsonl'), (6, 'payload_en_2000-3000.jsonl'), (6, 'payload_en_3000-4000.jsonl'), (8, 'payload_en_1-500.jsonl'), (8, 'payl

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 16:45:56,498] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=4, latency=28.14
[2024-01-23 16:46:28,567] p1345 {3496713318.py:39} INFO - completed processing chunk 2/10 with concurrency=6
[2024-01-23 16:46:28,567] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=3/10
[2024-01-23 16:46:28,568] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=6
[2024-01-23 16:46:28,583] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3508
[2024-01-23 16:46:28,605] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3680
[2024-01-23 16:46:28,603] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3087
[2024-01-23 16:46:28,607] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3249
[2024-01-2

[2024-01-23 16:47:01,544] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=27, latency=32.94
[2024-01-23 16:47:28,837] p1345 {3496713318.py:39} INFO - completed processing chunk 3/10 with concurrency=6
[2024-01-23 16:47:28,837] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=4/10
[2024-01-23 16:47:28,838] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=6


[2024-01-23 16:47:28,906] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3463
[2024-01-23 16:47:28,907] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3050
[2024-01-23 16:47:28,911] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3410
[2024-01-23 16:47:28,912] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3979
[2024-01-23 16:47:28,928] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3737
[2024-01-23 16:47:28,935] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3996
[2024-01-23 16:48:29,160] p1345 {3496713318.py:39} INFO - completed processing chunk 4/10 with concurrency=6
[2024-01-23 16:48:29,160] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_

[2024-01-23 16:48:29,186] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3662
[2024-01-23 16:48:29,187] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3655
[2024-01-23 16:48:29,200] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3171
[2024-01-23 16:48:29,201] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3098
[2024-01-23 16:48:29,202] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3400
[2024-01-23 16:48:29,203] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3704
[2024-01-23 16:49:29,366] p1345 {3496713318.py:39} INFO - completed processing chunk 5/10 with concurrency=6
[2024-01-23 16:49:29,367] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_

[2024-01-23 16:50:01,602] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=11, latency=32.18
[2024-01-23 16:50:29,647] p1345 {3496713318.py:39} INFO - completed processing chunk 6/10 with concurrency=6
[2024-01-23 16:50:29,647] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=7/10
[2024-01-23 16:50:29,648] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=6
[2024-01-23 16:50:29,664] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3848


[2024-01-23 16:50:29,674] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3208
[2024-01-23 16:50:29,678] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3475
[2024-01-23 16:50:29,694] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3681
[2024-01-23 16:50:29,698] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3302
[2024-01-23 16:50:29,702] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3581
[2024-01-23 16:51:29,871] p1345 {3496713318.py:39} INFO - completed processing chunk 7/10 with concurrency=6
[2024-01-23 16:51:29,872] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=8/10
[2024-01-23 16:51:29,873] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=6
[2024-01-23 16:51:29,908] p1345 

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 16:51:56,137] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=5, latency=26.21
[2024-01-23 16:52:17,311] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=9, latency=47.39
[2024-01-23 16:52:28,502] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=8, latency=58.57


error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com

[2024-01-23 16:52:30,546] p1345 {3496713318.py:39} INFO - completed processing chunk 8/10 with concurrency=6
[2024-01-23 16:52:30,547] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=9/10
[2024-01-23 16:52:30,547] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=6
[2024-01-23 16:52:30,582] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3882
[2024-01-23 16:52:30,587] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3567
[2024-01-23 16:52:30,592] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3566
[2024-01-23 16:52:30,592] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3330
[2024-01-23 16:52:30,594] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3497
[2024-01-23 16:52:30,595] p1345 

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.


[2024-01-23 16:53:00,865] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=4, latency=30.28
[2024-01-23 16:53:22,376] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=5, latency=51.78
[2024-01-23 16:53:30,751] p1345 {3496713318.py:39} INFO - completed processing chunk 9/10 with concurrency=6
[2024-01-23 16:53:30,752] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=10/10
[2024-01-23 16:53:30,752] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=6
[2024-01-23 16:53:30,776] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3003
[2024-01-23 16:53:30,780] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3003
[2024-01-23 16:53:30,793] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 16:54:06,141] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=7, latency=35.33
[2024-01-23 16:54:16,456] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=4, latency=45.68
[2024-01-23 16:54:31,032] p1345 {3496713318.py:39} INFO - completed processing chunk 10/10 with concurrency=6
[2024-01-23 16:54:31,032] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=1/1
[2024-01-23 16:54:31,033] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=8
[2024-01-23 16:54:31,038] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=304
[2024-01-23 16:54:31,047] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=304
[2024-01-23 16:54:31,050] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_to

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 16:54:42,808] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=9, latency=11.77
[2024-01-23 16:54:42,812] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=304
[2024-01-23 16:54:49,880] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=18.82
[2024-01-23 16:54:49,882] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=18.82
[2024-01-23 16:54:49,882] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=18.83
[2024-01-23 16:54:49,882] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=18.83
[2024-01-23 16:54:49,883] p1345 {701838357.py:48} INFO - get_infere

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 16:58:23,071] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=10, latency=25.48
[2024-01-23 16:58:29,689] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=43.51
[2024-01-23 16:58:29,778] p1345 {3496713318.py:39} INFO - completed processing chunk 1/4 with concurrency=8
[2024-01-23 16:58:29,779] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=2/4
[2024-01-23 16:58:29,780] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=8
[2024-01-23 16:58:29,808] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=2541
[2024-01-23 16:58:29,812] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=2186
[2024-01-23 16:58:29,812] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 16:59:43,114] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=40.96
[2024-01-23 16:59:44,162] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=15.06
[2024-01-23 16:59:44,251] p1345 {3496713318.py:39} INFO - completed processing chunk 2/4 with concurrency=8
[2024-01-23 16:59:44,251] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=3/4
[2024-01-23 16:59:44,252] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=8
[2024-01-23 16:59:44,269] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=2458
[2024-01-23 16:59:44,278] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=2428
[2024-01-23 16:59:44,284] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, promp

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 17:00:49,790] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=19, latency=44.11
[2024-01-23 17:00:49,884] p1345 {3496713318.py:39} INFO - completed processing chunk 3/4 with concurrency=8
[2024-01-23 17:00:49,885] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=4/4
[2024-01-23 17:00:49,886] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=8
[2024-01-23 17:00:49,902] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=2213
[2024-01-23 17:00:49,912] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=2770
[2024-01-23 17:00:49,913] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=2062
[2024-01-23 17:00:49,914] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=2564
[2024-01-23

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 17:02:01,381] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=40.35
[2024-01-23 17:02:01,478] p1345 {3496713318.py:39} INFO - completed processing chunk 4/4 with concurrency=8
[2024-01-23 17:02:01,479] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=1/8
[2024-01-23 17:02:01,479] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=8
[2024-01-23 17:02:01,498] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3000
[2024-01-23 17:02:01,506] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3789
[2024-01-23 17:02:01,525] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3144
[2024-01-23 17:02:01,521] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3450
[2024-01-2

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 17:03:38,759] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=37.18
[2024-01-23 17:03:38,760] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=41.09
[2024-01-23 17:03:38,848] p1345 {3496713318.py:39} INFO - completed processing chunk 1/8 with concurrency=8
[2024-01-23 17:03:38,849] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=2/8
[2024-01-23 17:03:38,850] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=8
[2024-01-23 17:03:38,882] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3458
[2024-01-23 17:03:38,884] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3891
[2024-01-23 17:03:38,888] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, promp

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 17:05:14,615] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=35.63
[2024-01-23 17:05:14,617] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=101, latency=35.63
[2024-01-23 17:05:14,723] p1345 {3496713318.py:39} INFO - completed processing chunk 2/8 with concurrency=8
[2024-01-23 17:05:14,724] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=3/8
[2024-01-23 17:05:14,724] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=8
[2024-01-23 17:05:14,759] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3680
[2024-01-23 17:05:14,760] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3463
[2024-01-23 17:05:14,771] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, promp

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 17:06:57,034] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=99, latency=42.14
[2024-01-23 17:06:57,034] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=42.15
[2024-01-23 17:06:57,144] p1345 {3496713318.py:39} INFO - completed processing chunk 3/8 with concurrency=8
[2024-01-23 17:06:57,145] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=4/8
[2024-01-23 17:06:57,145] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=8
[2024-01-23 17:06:57,177] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3171
[2024-01-23 17:06:57,181] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3655
[2024-01-23 17:06:57,185] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 17:08:25,761] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=11, latency=42.42
[2024-01-23 17:08:33,206] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=35.92
[2024-01-23 17:08:33,301] p1345 {3496713318.py:39} INFO - completed processing chunk 4/8 with concurrency=8
[2024-01-23 17:08:33,302] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=5/8
[2024-01-23 17:08:33,302] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=8
[2024-01-23 17:08:33,326] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3132
[2024-01-23 17:08:33,332] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3135
[2024-01-23 17:08:33,343] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 17:10:09,600] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=36.11
[2024-01-23 17:10:09,600] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=36.22
[2024-01-23 17:10:09,712] p1345 {3496713318.py:39} INFO - completed processing chunk 5/8 with concurrency=8
[2024-01-23 17:10:09,713] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=6/8
[2024-01-23 17:10:09,714] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=8
[2024-01-23 17:10:09,736] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3581
[2024-01-23 17:10:09,737] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3302
[2024-01-23 17:10:09,750] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, promp

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 17:11:40,833] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=59.75
[2024-01-23 17:11:40,921] p1345 {3496713318.py:39} INFO - completed processing chunk 6/8 with concurrency=8
[2024-01-23 17:11:40,922] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=7/8
[2024-01-23 17:11:40,923] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=8
[2024-01-23 17:11:40,952] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3330
[2024-01-23 17:11:40,962] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3616
[2024-01-23 17:11:40,966] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3882
[2024-01-23 17:11:40,976] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3566
[2024-01-2

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 17:13:09,844] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=5, latency=54.73
[2024-01-23 17:13:10,010] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=7, latency=31.30
[2024-01-23 17:13:10,100] p1345 {3496713318.py:39} INFO - completed processing chunk 7/8 with concurrency=8
[2024-01-23 17:13:10,101] p1345 {3496713318.py:26} INFO - e_idx=1/2, chunk_index=8/8
[2024-01-23 17:13:10,101] p1345 {1750118496.py:3} INFO - Processing chunk with concurrency=8
[2024-01-23 17:13:10,132] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3271
[2024-01-23 17:13:10,140] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_tokens=3271
[2024-01-23 17:13:10,134] p1345 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706022365, prompt_to

error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/llama-2-70b-g5-48xlarge-1706022365 in account 015469603702 for more information.
error occurred with llama-2-70b-g5-48xlarge-1706022365, exception=An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.co

[2024-01-23 17:14:36,831] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=18, latency=31.46
[2024-01-23 17:14:43,284] p1345 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706022365, completion_tokens=102, latency=33.11
[2024-01-23 17:14:43,369] p1345 {3496713318.py:39} INFO - completed processing chunk 8/8 with concurrency=8
[2024-01-23 17:14:43,370] p1345 {3496713318.py:41} INFO - experiment=1/2, name=llama2-70b-g5.48xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0, done
[2024-01-23 17:14:43,371] p1345 {663335596.py:13} INFO - experiment=2, name=llama2-70b-chat-p4d.24xlarge-djl-inference-0.26.0-tensorrtllm0.7.1-cu122, ep_name=llama2-70bdjl-2024-01-23-15-06-05-901-endpoint
[2024-01-23 17:14:43,524] p1345 {1352403322.py:18} INFO - there are 25 combinations of [(1, 'payload_en_1-500.jsonl'), (1, 'payload_en_500-1000.jsonl'), (1, 'payload_en_1000-2000.jsonl'), (1, 'payload_en_2000-30
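
The log above shows some requests timing out at `concurrency=8` while other requests in the same chunk still complete. One way to get that behavior is sketched below. This is not the notebook's actual code: `run_chunk` and `fake_infer` are hypothetical names, and the endpoint call is replaced by a stand-in coroutine so the example runs without AWS access.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple


async def run_chunk(
    payloads: List[Dict],
    infer: Callable[[Dict], Awaitable[Any]],
    concurrency: int,
) -> Tuple[List[Any], List[str]]:
    """Run `infer` over all payloads, with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(payload: Dict) -> Any:
        async with semaphore:
            return await infer(payload)

    # return_exceptions=True keeps one failed request from aborting the others
    results = await asyncio.gather(
        *(guarded(p) for p in payloads), return_exceptions=True
    )
    successes = [r for r in results if not isinstance(r, Exception)]
    errors = [str(r) for r in results if isinstance(r, Exception)]
    return successes, errors


async def fake_infer(payload: Dict) -> Dict:
    # Stand-in for a real endpoint call; fails for one payload to mimic a timeout
    await asyncio.sleep(0.01)
    if payload["id"] == 2:
        raise TimeoutError("invocation timed out")
    return {"id": payload["id"], "completion_tokens": 5}


successes, errors = asyncio.run(
    run_chunk([{"id": i} for i in range(4)], fake_infer, concurrency=2)
)
print(f"successes={len(successes)}, errors={errors}")
```

Inside a notebook cell an event loop is already running, so `await run_chunk(...)` would be used in place of `asyncio.run(...)`.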

In [44]:
# read all per inference files
json_list: List[Dict] = [json.loads(Path(f_name).read_text()) for f_name in glob.glob(os.path.join(METRICS_PER_INFERENCE_DIR, "*.json"))]
df_responses = pd.DataFrame(json_list)
logger.info(f"created dataframe of shape {df_responses.shape} from all responses")
df_responses.head()


[2024-01-23 17:28:14,262] p1345 {3838589563.py:4} INFO - created dataframe of shape (1172, 14) from all responses


Unnamed: 0,endpoint_name,prompt,do_sample,temperature,top_p,top_k,max_new_tokens,completion,prompt_tokens,completion_tokens,latency,experiment_name,concurrency,truncate
0,llama2-70bdjl-2024-01-23-15-06-05-901-endpoint,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n</s>,2675,4.0,1.737736,llama2-70b-chat-p4d.24xlarge-djl-inference-0.2...,8,
1,llama2-70bdjl-2024-01-23-15-06-05-901-endpoint,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n\n\n[INST] <<SYS>>\nYou are an assistant for...,3132,102.0,3.169297,llama2-70b-chat-p4d.24xlarge-djl-inference-0.2...,1,
2,llama-2-70b-g5-48xlarge-1706022365,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n\n```\n<answer>\nPassage 1:\n\n```\n\n```\nP...,3436,102.0,17.781332,llama2-70b-g5.48xlarge-huggingface-pytorch-tgi...,1,3436.0
3,llama2-70bdjl-2024-01-23-15-06-05-901-endpoint,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,3003,102.0,3.138371,llama2-70b-chat-p4d.24xlarge-djl-inference-0.2...,1,
4,llama-2-70b-g5-48xlarge-1706022365,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n```\n\n[/INST]\nAnswer:\n```\n\n[INST] <<SYS...,3614,102.0,18.645712,llama2-70b-g5.48xlarge-huggingface-pytorch-tgi...,1,3614.0


In [45]:
# read all per inference files
json_list: List[Dict] = [json.loads(Path(f_name).read_text()) for f_name in glob.glob(os.path.join(METRICS_PER_INFERENCE_DIR, "*.json"))]
df_responses = pd.DataFrame(json_list)
logger.info(f"created dataframe of shape {df_responses.shape} from all inference responses")
df_responses.head()


[2024-01-23 17:28:15,578] p1345 {1971204328.py:4} INFO - created dataframe of shape (1172, 14) from all inference responses


Unnamed: 0,endpoint_name,prompt,do_sample,temperature,top_p,top_k,max_new_tokens,completion,prompt_tokens,completion_tokens,latency,experiment_name,concurrency,truncate
0,llama2-70bdjl-2024-01-23-15-06-05-901-endpoint,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n</s>,2675,4.0,1.737736,llama2-70b-chat-p4d.24xlarge-djl-inference-0.2...,8,
1,llama2-70bdjl-2024-01-23-15-06-05-901-endpoint,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n\n\n[INST] <<SYS>>\nYou are an assistant for...,3132,102.0,3.169297,llama2-70b-chat-p4d.24xlarge-djl-inference-0.2...,1,
2,llama-2-70b-g5-48xlarge-1706022365,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n\n```\n<answer>\nPassage 1:\n\n```\n\n```\nP...,3436,102.0,17.781332,llama2-70b-g5.48xlarge-huggingface-pytorch-tgi...,1,3436.0
3,llama2-70bdjl-2024-01-23-15-06-05-901-endpoint,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,3003,102.0,3.138371,llama2-70b-chat-p4d.24xlarge-djl-inference-0.2...,1,
4,llama-2-70b-g5-48xlarge-1706022365,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n```\n\n[/INST]\nAnswer:\n```\n\n[INST] <<SYS...,3614,102.0,18.645712,llama2-70b-g5.48xlarge-huggingface-pytorch-tgi...,1,3614.0


In [46]:
# look for responses from endpoints whose name contains "inf2" (none in this run)
df_responses[df_responses.endpoint_name.str.contains("inf2")]

Unnamed: 0,endpoint_name,prompt,do_sample,temperature,top_p,top_k,max_new_tokens,completion,prompt_tokens,completion_tokens,latency,experiment_name,concurrency,truncate


In [47]:
# read all per chunk files
json_list: List[Dict] = [json.loads(Path(f_name).read_text()) for f_name in glob.glob(os.path.join(METRICS_PER_CHUNK_DIR, "*.json"))]
df_metrics = pd.DataFrame(json_list)
logger.info(f"created dataframe of shape {df_metrics.shape} from all chunk responses")
df_metrics.head()

[2024-01-23 17:28:16,393] p1345 {976317414.py:4} INFO - created dataframe of shape (454, 16) from all chunk responses


Unnamed: 0,experiment_name,concurrency,payload_file,errors,successes,error_rate,all_prompts_token_count,prompt_token_count_mean,prompt_token_throughput,all_completions_token_count,completion_token_count_mean,completion_token_throughput,transactions,transactions_per_second,transactions_per_minute,latency_mean
0,llama2-70b-chat-p4d.24xlarge-djl-inference-0.2...,1,payload_en_3000-4000.jsonl,[],1,0.0,3662,3662.0,1110.16,102,102.0,30.92,1,0.3,18,3.28536
1,llama2-70b-chat-p4d.24xlarge-djl-inference-0.2...,1,payload_en_3000-4000.jsonl,[],1,0.0,3475,3475.0,1064.98,102,102.0,31.26,1,0.31,18,3.251735
2,llama2-70b-g5.48xlarge-huggingface-pytorch-tgi...,1,payload_en_1000-2000.jsonl,[],1,0.0,1598,1598.0,145.32,102,102.0,9.28,1,0.09,5,10.988445
3,llama2-70b-g5.48xlarge-huggingface-pytorch-tgi...,2,payload_en_3000-4000.jsonl,[],2,0.0,7389,3694.5,246.81,113,56.5,3.77,2,0.07,4,26.39405
4,llama2-70b-chat-p4d.24xlarge-djl-inference-0.2...,2,payload_en_3000-4000.jsonl,[],2,0.0,6877,3438.5,1829.69,203,101.5,54.01,2,0.53,31,3.731961
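
The sample rows above are consistent with the throughput columns being token counts divided by the wall-clock time of the chunk, but this section does not show the formulas, so that is an inference from the output. The sketch below illustrates how such a chunk-level row could be built from per-request records under that assumption. `summarize_chunk` is a hypothetical helper and the input values are made up for illustration.

```python
from statistics import mean
from typing import Dict, List


def summarize_chunk(responses: List[Dict], errors: List[str], elapsed_s: float) -> Dict:
    """Aggregate per-request records into one chunk-level metrics row.

    Field names mirror the columns shown above; the formulas are assumptions,
    not the notebook's actual implementation.
    """
    successes = len(responses)
    total = successes + len(errors)
    prompt_tokens = sum(r["prompt_tokens"] for r in responses)
    completion_tokens = sum(r["completion_tokens"] for r in responses)
    return {
        "errors": errors,
        "successes": successes,
        "error_rate": round(len(errors) / total, 2) if total else 0.0,
        "all_prompts_token_count": prompt_tokens,
        # assumed: tokens divided by wall-clock seconds for the whole chunk
        "prompt_token_throughput": round(prompt_tokens / elapsed_s, 2),
        "all_completions_token_count": completion_tokens,
        "completion_token_throughput": round(completion_tokens / elapsed_s, 2),
        "transactions": successes,
        "transactions_per_second": round(successes / elapsed_s, 2),
        "latency_mean": mean(r["latency"] for r in responses) if responses else None,
    }


# Illustrative inputs only: two successful requests and one timeout in 5 seconds
row = summarize_chunk(
    responses=[
        {"prompt_tokens": 3000, "completion_tokens": 100, "latency": 4.0},
        {"prompt_tokens": 1000, "completion_tokens": 50, "latency": 2.0},
    ],
    errors=["timeout"],
    elapsed_s=5.0,
)
print(row)
```

Dividing by chunk wall-clock time rather than summed request latency means that concurrent requests raise throughput even when individual latencies grow.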


In [48]:
# flatten the nested endpoint info dicts into one row per endpoint
df_endpoints = pd.json_normalize(endpoint_info_list)
# take the instance type of the first production variant of each endpoint
df_endpoints['instance_type'] = df_endpoints['endpoint_config.ProductionVariants'].map(lambda x: x[0]['InstanceType'])
# collect the container environment variables so they are carried into the results
cols_for_env = [c for c in df_endpoints.columns if 'Environment' in c]
print(cols_for_env)
cols_of_interest = ['experiment_name',
                    'instance_type',
                    'endpoint.EndpointName',
                    'model_config.ModelName',
                    'model_config.PrimaryContainer.Image',
                    'model_config.PrimaryContainer.ModelDataSource.S3DataSource.S3Uri']
cols_of_interest.extend(cols_for_env)

df_endpoints = df_endpoints[cols_of_interest]
# keep only the last component of each dotted column name
cols_of_interest_renamed = [c.split('.')[-1] for c in cols_of_interest]
df_endpoints.columns = cols_of_interest_renamed

# Print the columns of both DataFrames to confirm 'experiment_name' is present in each
print("Columns in df_responses:", df_responses.columns)
print("Columns in df_endpoints:", df_endpoints.columns)

# Left merge: keep every response row and attach the matching endpoint details
df_results = pd.merge(left=df_responses, right=df_endpoints, how='left', on='experiment_name')

# Inspect the result
df_results.head()

['model_config.PrimaryContainer.Environment.ENDPOINT_SERVER_TIMEOUT', 'model_config.PrimaryContainer.Environment.HF_MODEL_ID', 'model_config.PrimaryContainer.Environment.MAX_INPUT_LENGTH', 'model_config.PrimaryContainer.Environment.MAX_TOTAL_TOKENS', 'model_config.PrimaryContainer.Environment.MODEL_CACHE_ROOT', 'model_config.PrimaryContainer.Environment.SAGEMAKER_ENV', 'model_config.PrimaryContainer.Environment.SAGEMAKER_MODEL_SERVER_WORKERS', 'model_config.PrimaryContainer.Environment.SAGEMAKER_PROGRAM', 'model_config.PrimaryContainer.Environment.SM_NUM_GPUS', 'model_config.PrimaryContainer.Environment.HEALTH_CHECK_TIMOUT', 'model_config.PrimaryContainer.Environment.INSTANCE_COUNT', 'model_config.PrimaryContainer.Environment.MODEL_LOADING_TIMEOUT', 'model_config.PrimaryContainer.Environment.NUMBER_OF_GPU']
Columns in df_responses: Index(['endpoint_name', 'prompt', 'do_sample', 'temperature', 'top_p', 'top_k',
       'max_new_tokens', 'completion', 'prompt_tokens', 'completion_tokens',

Unnamed: 0,endpoint_name,prompt,do_sample,temperature,top_p,top_k,max_new_tokens,completion,prompt_tokens,completion_tokens,...,MAX_TOTAL_TOKENS,MODEL_CACHE_ROOT,SAGEMAKER_ENV,SAGEMAKER_MODEL_SERVER_WORKERS,SAGEMAKER_PROGRAM,SM_NUM_GPUS,HEALTH_CHECK_TIMOUT,INSTANCE_COUNT,MODEL_LOADING_TIMEOUT,NUMBER_OF_GPU
0,llama2-70bdjl-2024-01-23-15-06-05-901-endpoint,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n</s>,2675,4.0,...,,,,,,,300.0,1.0,3600.0,8.0
1,llama2-70bdjl-2024-01-23-15-06-05-901-endpoint,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n\n\n[INST] <<SYS>>\nYou are an assistant for...,3132,102.0,...,,,,,,,300.0,1.0,3600.0,8.0
2,llama-2-70b-g5-48xlarge-1706022365,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n\n```\n<answer>\nPassage 1:\n\n```\n\n```\nP...,3436,102.0,...,4096.0,/opt/ml/model,1.0,1.0,inference.py,8.0,,,,
3,llama2-70bdjl-2024-01-23-15-06-05-901-endpoint,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,3003,102.0,...,,,,,,,300.0,1.0,3600.0,8.0
4,llama-2-70b-g5-48xlarge-1706022365,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n```\n\n[/INST]\nAnswer:\n```\n\n[INST] <<SYS...,3614,102.0,...,4096.0,/opt/ml/model,1.0,1.0,inference.py,8.0,,,,


In [49]:
df_results = pd.merge(left=df_responses, right=df_endpoints, how='left', on='experiment_name')
df_results.head()

Unnamed: 0,endpoint_name,prompt,do_sample,temperature,top_p,top_k,max_new_tokens,completion,prompt_tokens,completion_tokens,...,MAX_TOTAL_TOKENS,MODEL_CACHE_ROOT,SAGEMAKER_ENV,SAGEMAKER_MODEL_SERVER_WORKERS,SAGEMAKER_PROGRAM,SM_NUM_GPUS,HEALTH_CHECK_TIMOUT,INSTANCE_COUNT,MODEL_LOADING_TIMEOUT,NUMBER_OF_GPU
0,llama2-70bdjl-2024-01-23-15-06-05-901-endpoint,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n</s>,2675,4.0,...,,,,,,,300.0,1.0,3600.0,8.0
1,llama2-70bdjl-2024-01-23-15-06-05-901-endpoint,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n\n\n[INST] <<SYS>>\nYou are an assistant for...,3132,102.0,...,,,,,,,300.0,1.0,3600.0,8.0
2,llama-2-70b-g5-48xlarge-1706022365,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n\n```\n<answer>\nPassage 1:\n\n```\n\n```\nP...,3436,102.0,...,4096.0,/opt/ml/model,1.0,1.0,inference.py,8.0,,,,
3,llama2-70bdjl-2024-01-23-15-06-05-901-endpoint,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,3003,102.0,...,,,,,,,300.0,1.0,3600.0,8.0
4,llama-2-70b-g5-48xlarge-1706022365,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.7,0.92,120,100,\n```\n\n[/INST]\nAnswer:\n```\n\n[INST] <<SYS...,3614,102.0,...,4096.0,/opt/ml/model,1.0,1.0,inference.py,8.0,,,,
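
A left merge silently leaves NaN in the endpoint columns when an `experiment_name` has no match, and it duplicates rows if the right-hand frame repeats a key. As a minimal sketch of how this could be checked, the example below uses the `validate` and `indicator` parameters of `pd.merge`. The two small DataFrames are illustrative stand-ins for `df_responses` and `df_endpoints`, with made-up values.

```python
import pandas as pd

# Illustrative stand-ins for df_responses and df_endpoints (values are made up)
responses = pd.DataFrame({
    "experiment_name": ["exp-a", "exp-a", "exp-b", "exp-c"],
    "latency": [1.7, 3.1, 17.8, 2.0],
})
endpoints = pd.DataFrame({
    "experiment_name": ["exp-a", "exp-b"],
    "instance_type": ["ml.p4d.24xlarge", "ml.g5.48xlarge"],
})

merged = pd.merge(
    left=responses,
    right=endpoints,
    how="left",
    on="experiment_name",
    validate="many_to_one",  # raises MergeError if endpoints repeats a key
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Experiments that appear in the responses but have no endpoint row
unmatched = (
    merged.loc[merged["_merge"] == "left_only", "experiment_name"].unique().tolist()
)
print(f"rows={len(merged)}, unmatched experiments={unmatched}")
```

The `_merge` column can be dropped with `merged.drop(columns="_merge")` before the results are saved.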


In [50]:
fpath: str = os.path.join(METRICS_DIR, config['results']['per_inference_request_file']).format(datetime=date_time)
df_results.to_csv(fpath, index=False)
logger.info(f"saved results dataframe of shape={df_results.shape} in {fpath}")

[2024-01-23 17:28:17,042] p1345 {3435721265.py:3} INFO - saved results dataframe of shape=(1172, 32) in data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference_request_results.csv


In [51]:
# attach endpoint details to the per chunk metrics before saving
df_metrics = pd.merge(left=df_metrics, right=df_endpoints, how='left', on='experiment_name')
fpath: str = os.path.join(METRICS_DIR, config['results']['all_metrics_file']).format(datetime=date_time)
df_metrics.to_csv(fpath, index=False)
logger.info(f"saved metrics results dataframe of shape={df_metrics.shape} in {fpath}")

[2024-01-23 17:28:17,189] p1345 {3880536130.py:5} INFO - saved metrics results dataframe of shape=(454, 34) in data/metrics/llama2-70b-g5-p4d-trt-v1/all_metrics.csv
