## Run Inference on all deployed endpoints: Various combinations of payloads, concurrency levels, model configurations
---------------------
*This notebook works best with the conda_python3 kernel on an ml.t3.medium machine*.

#### This step of our solution design runs inference against all deployed model endpoints (with different configurations, concurrency levels and payload sizes). This notebook calls the endpoints concurrently and asynchronously to generate responses and record metrics. Here are some of the key components:

- **Accessing the deployed endpoints**, creating a predictor object for these endpoints to call them during inference time.

- **Functions to define metrics**: This notebook sets the stage for the metrics that are recorded while all of these models are invoked, for benchmarking purposes.

- **Running Actual Inferences**: Once the metrics are defined, we set a blocking function, `get_inference`, that is responsible for running inference on a single payload. We then run a series of asynchronous functions, which can be viewed in the code (link above), to run asynchronous inferences on the deployed models. Requests are sent as combinations: payloads of different sizes, which can be viewed in the config.yml file, are paired with different concurrency levels (in this case we first go through all batches of payloads with a concurrency level of 1, then 2, and then 4). You can set these to your desired values.
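
The pairing of concurrency levels with payload files described above can be sketched as follows. This is a minimal illustration, not the notebook's actual cell: the `concurrency_levels` and `payload_files` keys match the ones used later in this notebook, but the values here are made up for the example rather than read from config.yml.

```python
import itertools

# Illustrative experiment entry; the values are assumptions, not read from config.yml
experiment = {
    "concurrency_levels": [1, 2, 4],
    "payload_files": ["payload_en_1-500.jsonl", "payload_en_500-1000.jsonl"],
}

# Cartesian product: every concurrency level is paired with every payload file.
# The first iterable varies slowest, so all files are visited at concurrency 1,
# then all files at concurrency 2, and so on.
combinations = list(itertools.product(experiment["concurrency_levels"],
                                      experiment["payload_files"]))

for concurrency, payload_file in combinations:
    print(f"concurrency={concurrency}, payload_file={payload_file}")
```

With three concurrency levels and two payload files this yields six combinations.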

In [1]:
## auto reload all of the changes made in the config/globals.py file 
%load_ext autoreload
%autoreload 2

### Import all of the necessary libraries below to run this notebook

In [2]:
import os  # used below via os.path; otherwise only available through `from globals import *`
import glob
import time
import json
import io
import copy
import boto3
import asyncio
import logging
import itertools
import sagemaker
import pandas as pd
from globals import *
from datetime import datetime
from transformers import AutoTokenizer
from sagemaker.predictor import Predictor
from utils import load_config, count_tokens, write_to_s3, read_from_s3
from sagemaker.serializers import JSONSerializer
from typing import Dict, List, Optional, Tuple, Union

sagemaker.config INFO - Not applying SDK defaults from location: /opt/homebrew/share/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/madhurpt/Library/Application Support/sagemaker/config.yaml
CONFIG_FILE=configs/config-llama2-70b-g5-p4d-trt.yml


#### Pygmentize globals.py to view and use any of the globally initialized variables 

In [3]:
# global constants
!pygmentize globals.py

import os
import yaml
from enum import Enum
from pathlib import Path
import boto3

CONFIG_FILEPATH_FILE: str = "config_filepath.txt"

# S3 client initialization
s3_client = boto3.client('s3')

CONFIG_FILE: str = Path(CONFIG_FILEPATH_FILE).read_text()
print(f"CONFIG_FILE={CONFIG_FILE}")
with open

In [4]:
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)

#### Load the Config.yml file that contains information that is used across this benchmarking environment

In [5]:
config = load_config(CONFIG_FILE)
logger.info(json.dumps(config, indent=2))

[2024-01-28 12:05:12,820] p75618 {635462509.py:2} INFO - {
  "general": {
    "name": "llama2-70b-g5-p4d-trt-v1",
    "model_name": "Llama2-70b"
  },
  "aws": {
    "region": "us-east-1",
    "sagemaker_execution_role": "arn:aws:iam::015469603702:role/service-role/SageMaker-ExecutionRole-20240111T084686",
    "bucket": "fmbt2039",
    "prefix": "data",
    "source_data_bucket_prefix": "source_data",
    "prompt_template": "prompt_template/prompt_template.txt",
    "custom_tokenizer": "tokenizer",
    "bring_your_script": "byo_script"
  },
  "prompt": {
    "template_file": "prompt_template.txt",
    "all_prompts_file": "all_prompts.csv"
  },
  "datasets": [
    {
      "language": "en",
      "min_length_in_tokens": 1,
      "max_length_in_tokens": 500,
      "payload_file": "payload_{lang}_{min}-{max}.jsonl"
    },
    {
      "language": "en",
      "min_length_in_tokens": 500,
      "max_length_in_tokens": 1000,
      "payload_file": "payload_{lang}_{min}-{max}.jsonl"
    },
    {
 

In [6]:
date_time = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

In [7]:
## getting access to the s3 bucket where endpoints.json for different models resides
s3_client = boto3.client('s3')

### Access the deployed model endpoints from the endpoints.json file 

In [8]:
endpoint_info_str = read_from_s3(BUCKET_NAME, ENDPOINT_S3_PATH)
logger.info(f"endpoint from --> s3://{BUCKET_NAME}/{ENDPOINT_S3_PATH}")

# Process the retrieved content
if endpoint_info_str:
    try:
        endpoint_info_list = json.loads(endpoint_info_str)
        logger.info(f"Found information for {len(endpoint_info_list)} endpoints in the respective S3 path")
        logger.info(json.dumps(endpoint_info_list, indent=2))
    except json.JSONDecodeError as e:
        logger.error(f"Error parsing JSON from S3: {e}")
else:
    logger.error("Error reading from S3 or no data found")

[2024-01-28 12:05:13,130] p75618 {1472297992.py:2} INFO - endpoint from --> s3://fmbt2039/data/models/llama2-70b-g5-p4d-trt-v1/endpoints.json
[2024-01-28 12:05:13,130] p75618 {1472297992.py:8} INFO - Found information for 2 endpoints in the respective S3 path
[2024-01-28 12:05:13,131] p75618 {1472297992.py:9} INFO - [
  {
    "experiment_name": "llama2-70b-g5.48xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0",
    "endpoint": {
      "EndpointName": "llama-2-70b-g5-48xlarge-1706460294",
      "EndpointArn": "arn:aws:sagemaker:us-east-1:015469603702:endpoint/llama-2-70b-g5-48xlarge-1706460294",
      "EndpointConfigName": "llama-2-70b-g5-48xlarge-1706460294",
      "ProductionVariants": [
        {
          "VariantName": "AllTraffic",
          "DeployedImages": [
            {
              "SpecifiedImage": "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.1-tgi1.1.0-gpu-py39-cu118-ubuntu20.04",
              "ResolvedImage": "763104351884.d

In [9]:
# List down the endpoint names that have been deployed
endpoint_name_list = [e['endpoint']['EndpointName'] for e in endpoint_info_list]
logger.info(f"there are {len(endpoint_name_list)} deployed endpoint(s), endpoint_name_list->{endpoint_name_list}")

[2024-01-28 12:05:13,161] p75618 {1455142584.py:3} INFO - there are 2 deployed endpoint(s), endpoint_name_list->['llama-2-70b-g5-48xlarge-1706460294', 'llama2-70bdjl-2024-01-28-16-44-54-820-endpoint']


### Creating predictor objects from the deployed endpoints

In [10]:
# create predictor objects

## create a sagemaker predictor for these endpoints
def create_predictor(endpoint_name: str) -> Optional[sagemaker.base_predictor.Predictor]:
    # Create a SageMaker Predictor object
    predictor = Predictor(
        endpoint_name=endpoint_name,
        sagemaker_session=sagemaker.Session(),
        serializer=JSONSerializer()
    )
    return predictor

## Create and log the list of predictor objects for the deployed endpoints that are ready for inference
predictor_list: List = [create_predictor(ep) for ep in endpoint_name_list]
logger.info(predictor_list)

[2024-01-28 12:05:13,268] p75618 {3970465137.py:15} INFO - [<sagemaker.base_predictor.Predictor object at 0x289b412d0>, <sagemaker.base_predictor.Predictor object at 0x28c3dca10>]


### Creating functions to define and calculate metrics during the time of invocations

In [11]:
def safe_sum(l: List) -> Union[int, float]:
    return sum(filter(None, l))

def safe_div(n: Union[int, float], d: Union[int, float]) -> Optional[Union[int, float]]:
    return n/d if d else None

## Calculates all of the metrics for one chunk of concurrent inference requests
def calculate_metrics(responses, chunk, elapsed_async, experiment_name, concurrency, payload_file) -> Dict:
    
    ## calculate errors based on the completion status of the inference prompt
    errors = [r for r in responses if r['completion'] is None]
    
    ## Calculate the difference as the successes 
    successes = len(chunk) - len(errors)
    
    ## Sum the prompt and completion token counts, then derive throughput and per-request means
    all_prompts_token_count = safe_sum([r['prompt_tokens'] for r in responses])
    prompt_token_throughput = round(all_prompts_token_count / elapsed_async, 2)
    prompt_token_count_mean = safe_div(all_prompts_token_count, successes)
    all_completions_token_count = safe_sum([r['completion_tokens'] for r in responses])
    completion_token_throughput = round(all_completions_token_count / elapsed_async, 2)
    completion_token_count_mean = safe_div(all_completions_token_count, successes)
    transactions_per_second = round(successes / elapsed_async, 2)
    transactions_per_minute = int(transactions_per_second * 60)
    
    ## calculate the latency mean utilizing the safe_sum function defined above
    latency_mean = safe_div(safe_sum([r['latency'] for r in responses]), successes)
    
    ## Function returns all these values at the time of the invocations
    return {
        'experiment_name': experiment_name,
        'concurrency': concurrency,
        'payload_file': payload_file,
        'errors': errors,
        'successes': successes,
        'error_rate': len(errors)/len(chunk),
        'all_prompts_token_count': all_prompts_token_count,
        'prompt_token_count_mean': prompt_token_count_mean,
        'prompt_token_throughput': prompt_token_throughput,
        'all_completions_token_count': all_completions_token_count,
        'completion_token_count_mean': completion_token_count_mean,
        'completion_token_throughput': completion_token_throughput,
        'transactions': len(chunk),
        'transactions_per_second': transactions_per_second,
        'transactions_per_minute': transactions_per_minute,
        'latency_mean': latency_mean
    }
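
To make the arithmetic in `calculate_metrics` concrete, here is a small worked example. It is a sketch under stated assumptions: the three responses and the 4-second elapsed time are made-up values, and `safe_sum` and `safe_div` are redefined so the block runs on its own.

```python
from typing import List, Optional, Union

def safe_sum(l: List) -> Union[int, float]:
    # filter(None, ...) drops None entries, so failed requests do not break the sum
    return sum(filter(None, l))

def safe_div(n: Union[int, float], d: Union[int, float]) -> Optional[float]:
    return n / d if d else None

# Made-up responses for a chunk of 3 requests; the second one failed
responses = [
    {"completion": "ok", "prompt_tokens": 100, "completion_tokens": 50, "latency": 2.0},
    {"completion": None, "prompt_tokens": 120, "completion_tokens": None, "latency": None},
    {"completion": "ok", "prompt_tokens": 80, "completion_tokens": 30, "latency": 4.0},
]
elapsed_async = 4.0  # assumed wall-clock seconds for the whole chunk

errors = [r for r in responses if r["completion"] is None]
successes = len(responses) - len(errors)
error_rate = len(errors) / len(responses)

all_prompts_token_count = safe_sum([r["prompt_tokens"] for r in responses])
prompt_token_count_mean = safe_div(all_prompts_token_count, successes)
all_completions_token_count = safe_sum([r["completion_tokens"] for r in responses])
completion_token_throughput = round(all_completions_token_count / elapsed_async, 2)
transactions_per_second = round(successes / elapsed_async, 2)
latency_mean = safe_div(safe_sum([r["latency"] for r in responses]), successes)

print(successes, error_rate, completion_token_throughput, latency_mean)
```

Note that the failed request's prompt tokens still count toward `all_prompts_token_count`, because the error path records `prompt_tokens`, while the mean divides by successes only. In this example that gives a prompt token mean of 150 rather than 90.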

### Set a blocking function and a series of asynchronous concurrent model prompt invocations

In [12]:
def set_metrics(endpoint_name=None,
                    prompt=None,
                    inference_params=None,
                    completion=None,
                    prompt_tokens=None,
                    completion_tokens=None,
                    latency=None) -> Dict:
    return dict(endpoint_name=endpoint_name,                
                prompt=prompt,
                **inference_params,
                completion=completion,
                prompt_tokens=prompt_tokens,
                completion_tokens=completion_tokens,
                latency=latency)

def get_inference(predictor, payload) -> Dict:
    
    # Initialize up front so the except branch never references an unbound name
    prompt_tokens = None
    latency = 0

    try:
        prompt_tokens = count_tokens(payload['inputs'])
        logger.info(f"get_inference, endpoint={predictor.endpoint_name}, prompt_tokens={prompt_tokens}")

        # get inference
        st = time.perf_counter()        
        response = predictor.predict(payload)        
        latency = time.perf_counter() - st

        if isinstance(response, bytes):
            response = response.decode('utf-8')
        response_json = json.loads(response)
        if isinstance(response_json, list):
            response_json = response_json[0]

        completion = response_json.get("generated_text", "")
        completion_tokens = count_tokens(completion)

        # Set metrics and logging for both cases
        response = set_metrics(predictor.endpoint_name,
                               payload['inputs'],
                               payload['parameters'],
                               completion,
                               prompt_tokens,
                               completion_tokens,
                               latency)
        # logger.info(f"get_inference, done, endpoint={predictor.endpoint_name}, response={json.dumps(response, indent=2)}, latency={latency:.2f}")
        logger.info(f"get_inference, done, endpoint={predictor.endpoint_name}, completion_tokens={completion_tokens}, latency={latency:.2f}")
    except Exception as e:
        logger.error(f"error occurred with {predictor.endpoint_name}, exception={str(e)}")
        response = set_metrics(predictor.endpoint_name,
                               payload['inputs'],
                               payload['parameters'],
                               None,
                               prompt_tokens,
                               None,
                               None)

    return response
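
The response handling inside `get_inference` can be isolated into a small sketch. The helper name `extract_completion` is hypothetical and the sample payloads are made up; the steps mirror the cell above, which accepts either a list-wrapped result or a single JSON object.

```python
import json
from typing import Union

def extract_completion(response: Union[bytes, str]) -> str:
    """Hypothetical helper mirroring the parsing steps in get_inference."""
    # The predictor may hand back raw bytes; decode before parsing
    if isinstance(response, bytes):
        response = response.decode("utf-8")
    response_json = json.loads(response)
    # The parsed response may be a list with one result or a single object
    if isinstance(response_json, list):
        response_json = response_json[0]
    # Fall back to an empty string when the key is missing
    return response_json.get("generated_text", "")

# Made-up sample responses in both shapes
list_style = b'[{"generated_text": "hello"}]'
dict_style = '{"generated_text": "world"}'

print(extract_completion(list_style))
print(extract_completion(dict_style))
```

Because a missing `generated_text` key yields an empty string rather than `None`, such a response would be counted as a success with zero completion tokens by the metrics code.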

### Setting a series of asynchronous functions to invoke and run inferences concurrently and asynchronously

In [13]:
## Runs the blocking get_inference function in a separate thread so that it can be awaited
async def async_get_inference(predictor, payload: Dict) -> Dict:
    return await asyncio.to_thread(get_inference, predictor, payload)

## Gathers all of the tasks and runs the asynchronous invocations concurrently
async def async_get_all_inferences(predictor, payload_list: List) -> List:
    return await asyncio.gather(*[async_get_inference(predictor, payload) for payload in payload_list])
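
The pattern used by these two functions can be tried without an endpoint. In this sketch, `blocking_call` is a hypothetical stand-in for `get_inference` that simply sleeps; the point is that `asyncio.to_thread` (Python 3.9+) moves each blocking call onto a worker thread and `asyncio.gather` waits for all of them, so the total time is close to the slowest call rather than the sum.

```python
import asyncio
import time
from typing import List

def blocking_call(i: int) -> int:
    # Hypothetical stand-in for a blocking endpoint invocation
    time.sleep(0.2)
    return i * 2

async def async_call(i: int) -> int:
    # Run the blocking function in a worker thread so the event loop stays free
    return await asyncio.to_thread(blocking_call, i)

async def call_all(items: List[int]) -> List[int]:
    # gather preserves input order in its result list
    return await asyncio.gather(*[async_call(i) for i in items])

start = time.perf_counter()
# In a notebook, which already has a running event loop, use `await call_all(...)` instead
results = asyncio.run(call_all([1, 2, 3, 4]))
elapsed = time.perf_counter() - start

print(results, f"{elapsed:.2f}s")
```

Four sequential calls would take about 0.8 seconds here; run concurrently they should finish in roughly the time of one call.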

In [14]:
## This function runs the asynchronous functions above for one chunk of a given experiment and concurrency level, timing the whole chunk.
async def run_inferences(predictor: sagemaker.base_predictor.Predictor, chunk: List, experiment: Dict, concurrency: int, payload_file: str) -> Tuple[List, Dict]:
    logger.info(f"Processing chunk with concurrency={concurrency}")
    s = time.perf_counter()
    responses = await async_get_all_inferences(predictor, chunk)
    elapsed_async = time.perf_counter() - s

    # Add more metadata about this experiment
    for r in responses:
        r['experiment_name'] = experiment['name']
        r['concurrency'] = concurrency

    metrics = calculate_metrics(responses, chunk, elapsed_async, experiment['name'], concurrency, payload_file)
    return responses, metrics

In [15]:
## Function to create the predictors from the experiment we are iterating over
def create_predictor_for_experiment(experiment: Dict, config: Dict, endpoint_info_list: List) -> Optional[sagemaker.base_predictor.Predictor]:

    ## 1-based position of this experiment in the config, used only for logging
    e_idx = config['experiments'].index(experiment) + 1

    ## Iterate through the endpoint information to fetch the endpoint name
    ep_info = [e for e in endpoint_info_list if e['experiment_name'] == experiment['name']]
    if not ep_info:
        logger.error(f"endpoint for experiment={experiment['name']} not found, skipping")
        return None
    ep_name = ep_info[0]['endpoint']['EndpointName']
    logger.info(f"experiment={e_idx}, name={experiment['name']}, ep_name={ep_name}")

    # create a predictor from each endpoint in experiments
    return create_predictor(ep_name)

In [16]:
## Here, we build every combination of concurrency level and payload file, then split each
## payload list into chunks whose size equals the concurrency level, so that each chunk can
## be sent to the endpoint as one batch of concurrent requests

def create_payload_dict(jline: str, experiment: Dict) -> Dict:
    payload: Dict = json.loads(jline)
    if experiment.get('remove_truncate', False) is True:
        if payload['parameters'].get('truncate'):
            del payload['parameters']['truncate']
    return payload
    
    
def create_combinations(experiment: Dict) -> List[Tuple]:
    combinations_data = []

    # Repeat for each concurrency level
    combinations = list(itertools.product(experiment['concurrency_levels'], experiment['payload_files']))
    logger.info(f"there are {len(combinations)} combinations of {combinations} to run")

    for concurrency, payload_file in combinations:
        # Construct the full S3 file path
        s3_file_path = os.path.join(PROMPTS_DIR, payload_file)
        logger.info(f"s3 path where the payload files are being read from -> {s3_file_path}")

        # Read the payload file from S3
        try:
            response = s3_client.get_object(Bucket=BUCKET_NAME, Key=s3_file_path)
            payload_file_content = response['Body'].read().decode('utf-8')

            # Create a payload list by processing each line
            payload_list = [create_payload_dict(jline, experiment) for jline in payload_file_content.splitlines()]
            logger.info(f"read from s3://{BUCKET_NAME}/{s3_file_path}, contains {len(payload_list)} lines")

        except Exception as e:
            logger.error(f"Error reading file from S3: {e}")
            continue

        logger.info(f"creating combinations for concurrency={concurrency}, payload_file={payload_file}, payload_list length={len(payload_list)}")
        
        n = concurrency
        
        if not payload_list:
            logger.error(f"payload file {payload_file} is empty, skipping")
            continue

        # Pad with copies of the first payload so there are at least n requests
        if len(payload_list) < n:
            elements_to_add = n - len(payload_list)
            element_to_replicate = payload_list[0]
            payload_list.extend([element_to_replicate]*elements_to_add)
            
        # Split the original list into sublists which contain the number of requests we want to send concurrently        
        payload_list_splitted = [payload_list[i * n:(i + 1) * n] for i in range((len(payload_list) + n - 1) // n )]  
        
        for p in payload_list_splitted:
            if len(p) < n:
                elements_to_add = n - len(p)
                element_to_replicate = p[0]
                # p = p.extend([element_to_replicate]*elements_to_add)
                p.extend([element_to_replicate]*elements_to_add)
            

        # Only keep chunks that have exactly `concurrency` elements
        len_before = len(payload_list_splitted)
        payload_list_splitted = [p for p in payload_list_splitted if len(p) == concurrency]
        logger.info(f"after only retaining chunks of length {concurrency}, we have {len(payload_list_splitted)} chunks, previously we had {len_before} chunks")
        combinations_data.append((concurrency, payload_file, payload_list_splitted))
    logger.info(f"there are {len(combinations)} combinations for {experiment}")
    return combinations_data

# process_combinations(experiment, predictor, PROMPTS_DIR)
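
The chunking and padding performed inside `create_combinations` can be sketched in isolation. The helper name `chunk_with_padding` is hypothetical and the inputs are made up; the sketch splits a payload list into chunks of exactly `n` elements and pads a short final chunk by repeating that chunk's first element, which is the behavior of the cell above.

```python
from typing import List

def chunk_with_padding(payload_list: List, n: int) -> List[List]:
    """Hypothetical helper: split into chunks of exactly n, padding a short one."""
    if not payload_list:
        return []
    # Split into consecutive slices of at most n elements
    chunks = [payload_list[i:i + n] for i in range(0, len(payload_list), n)]
    # Pad any short chunk (only the last can be short) by repeating its first element
    for chunk in chunks:
        if len(chunk) < n:
            chunk.extend([chunk[0]] * (n - len(chunk)))
    return chunks

five_by_two = chunk_with_padding(["a", "b", "c", "d", "e"], 2)
one_by_four = chunk_with_padding(["a"], 4)

print(five_by_two)
print(one_by_four)
```

One consequence of the padding is that some payloads are sent more than once, so a chunk at concurrency 4 built from a single payload measures four identical requests.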

In [17]:
# for each experiment
#   - for each endpoint and concurrency in an experiment

def clear_dir(dir_path: str):
    files = glob.glob(os.path.join(dir_path, "*"))
    for f in files:
        os.remove(f)

_ = list(map(clear_dir, [METRICS_PER_INFERENCE_DIR, METRICS_PER_CHUNK_DIR]))

num_experiments: int = len(config['experiments'])
for e_idx, experiment in enumerate(config['experiments']):
    e_idx += 1  # Increment experiment index
    # Create the predictor object for this experiment
 
    predictor = create_predictor_for_experiment(experiment, config, endpoint_info_list)
    if predictor is None:
        logger.error(f"predictor could not be created for experiment={experiment}, moving to next...")
        continue

    # Process combinations of concurrency levels and payload files
    combination_data = create_combinations(experiment)

    for concurrency, payload_file, split_payload in combination_data:
        for chunk_index, chunk in enumerate(split_payload):
            logger.info(f"e_idx={e_idx}/{num_experiments}, chunk_index={chunk_index+1}/{len(split_payload)}")

            # Process each chunk and calculate metrics
            responses, metrics = await run_inferences(predictor, chunk, experiment, concurrency, payload_file)
            if metrics:
                # Convert metrics to JSON string
                metrics_json = json.dumps(metrics, indent=2)
                # Define S3 file path
                metrics_file_name = f"{time.time()}.json"
                metrics_s3_path = os.path.join(METRICS_PER_CHUNK_DIR, metrics_file_name)
                # Write to S3
                write_to_s3(metrics_json, BUCKET_NAME, "", METRICS_PER_CHUNK_DIR, metrics_file_name)

            if responses:
                for r in responses:
                    # Convert response to JSON string
                    response_json = json.dumps(r, indent=2)
                    # Define S3 file path
                    response_file_name = f"{time.time()}.json"
                    response_s3_path = os.path.join(METRICS_PER_INFERENCE_DIR, response_file_name)
                    # Write to S3
                    write_to_s3(response_json, BUCKET_NAME, "", METRICS_PER_INFERENCE_DIR, response_file_name)

    logger.info(f"experiment={e_idx}/{num_experiments}, name={experiment['name']}, done")

[2024-01-28 12:05:13,506] p75618 {663335596.py:13} INFO - experiment=1, name=llama2-70b-g5.48xlarge-huggingface-pytorch-tgi-inference-2.0.1-tgi1.1.0, ep_name=llama-2-70b-g5-48xlarge-1706460294
[2024-01-28 12:05:13,520] p75618 {2873079850.py:18} INFO - there are 25 combinations of [(1, 'payload_en_1-500.jsonl'), (1, 'payload_en_500-1000.jsonl'), (1, 'payload_en_1000-2000.jsonl'), (1, 'payload_en_2000-3000.jsonl'), (1, 'payload_en_3000-4000.jsonl'), (2, 'payload_en_1-500.jsonl'), (2, 'payload_en_500-1000.jsonl'), (2, 'payload_en_1000-2000.jsonl'), (2, 'payload_en_2000-3000.jsonl'), (2, 'payload_en_3000-4000.jsonl'), (4, 'payload_en_1-500.jsonl'), (4, 'payload_en_500-1000.jsonl'), (4, 'payload_en_1000-2000.jsonl'), (4, 'payload_en_2000-3000.jsonl'), (4, 'payload_en_3000-4000.jsonl'), (6, 'payload_en_1-500.jsonl'), (6, 'payload_en_500-1000.jsonl'), (6, 'payload_en_1000-2000.jsonl'), (6, 'payload_en_2000-3000.jsonl'), (6, 'payload_en_3000-4000.jsonl'), (8, 'payload_en_1-500.jsonl'), (8, 'pa

[2024-01-28 12:05:13,631] p75618 {2873079850.py:32} INFO - read from s3://fmbt2039/data/prompts/payload_en_1-500.jsonl, contains 1 lines
[2024-01-28 12:05:13,632] p75618 {2873079850.py:38} INFO - creating combinations for concurrency=1, payload_file=payload_en_1-500.jsonl, payload_list length=1
[2024-01-28 12:05:13,633] p75618 {2873079850.py:62} INFO - after only retaining chunks of length 1, we have 1 chunks, previously we had 1 chunks
[2024-01-28 12:05:13,633] p75618 {2873079850.py:23} INFO - s3 path where the payload files are being read from -> data/prompts/payload_en_500-1000.jsonl
[2024-01-28 12:05:13,682] p75618 {2873079850.py:32} INFO - read from s3://fmbt2039/data/prompts/payload_en_500-1000.jsonl, contains 1 lines
[2024-01-28 12:05:13,687] p75618 {2873079850.py:38} INFO - creating combinations for concurrency=1, payload_file=payload_en_500-1000.jsonl, payload_list length=1
[2024-01-28 12:05:13,689] p75618 {2873079850.py:62} INFO - after only retaining chunks of length 1, we h

Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461521.6148639.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461521.801458.json


[2024-01-28 12:05:30,413] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=8.46
[2024-01-28 12:05:30,764] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=1/15
[2024-01-28 12:05:30,764] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:05:30,768] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1339


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461530.4140658.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461530.573148.json


[2024-01-28 12:05:41,542] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=10.77


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461541.544867.json


[2024-01-28 12:05:41,964] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=2/15
[2024-01-28 12:05:41,964] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:05:41,973] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1932


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461541.758087.json


[2024-01-28 12:05:54,326] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=12.35
[2024-01-28 12:05:54,651] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=3/15
[2024-01-28 12:05:54,652] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:05:54,656] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1154


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461554.327583.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461554.474415.json


[2024-01-28 12:06:04,104] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=9.45
[2024-01-28 12:06:04,447] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=4/15
[2024-01-28 12:06:04,448] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:06:04,458] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1646


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461564.104956.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461564.278544.json


[2024-01-28 12:06:15,961] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=11.50


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461575.9685252.json


[2024-01-28 12:06:16,368] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=5/15
[2024-01-28 12:06:16,369] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:06:16,386] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1397


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461576.151705.json


[2024-01-28 12:06:26,746] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=10.36
[2024-01-28 12:06:27,173] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=6/15
[2024-01-28 12:06:27,174] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461586.7482932.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461586.977518.json


[2024-01-28 12:06:27,193] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1746
[2024-01-28 12:06:39,107] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=11.91
[2024-01-28 12:06:39,520] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=7/15
[2024-01-28 12:06:39,521] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461599.109429.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461599.323955.json


[2024-01-28 12:06:39,539] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1373
[2024-01-28 12:06:49,623] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=10.07


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461609.632614.json


[2024-01-28 12:06:50,060] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=8/15
[2024-01-28 12:06:50,061] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:06:50,077] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1598


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461609.818618.json


[2024-01-28 12:07:01,341] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=11.26


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461621.343738.json


[2024-01-28 12:07:01,779] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=9/15
[2024-01-28 12:07:01,783] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:07:01,799] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1743


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461621.561663.json


[2024-01-28 12:07:14,237] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=12.44


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461634.239818.json


[2024-01-28 12:07:14,670] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=10/15
[2024-01-28 12:07:14,675] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:07:14,731] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1539


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461634.44089.json


[2024-01-28 12:07:25,479] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=10.75


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461645.480931.json


[2024-01-28 12:07:25,902] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=11/15
[2024-01-28 12:07:25,903] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:07:25,919] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1695


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461645.681529.json


[2024-01-28 12:07:37,632] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=11.71


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461657.633645.json


[2024-01-28 12:07:38,055] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=12/15
[2024-01-28 12:07:38,057] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:07:38,083] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1421


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461657.8373399.json


[2024-01-28 12:07:48,711] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=10.63


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461668.713842.json


[2024-01-28 12:07:49,189] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=13/15
[2024-01-28 12:07:49,190] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:07:49,208] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1918


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461668.970093.json


[2024-01-28 12:08:01,365] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=12.16


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461681.367708.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461681.583321.json


[2024-01-28 12:08:01,784] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=14/15
[2024-01-28 12:08:01,785] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:08:01,805] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1910
[2024-01-28 12:08:13,638] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=11.83


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461693.642186.json


[2024-01-28 12:08:14,036] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=15/15
[2024-01-28 12:08:14,043] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:08:14,065] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=1939


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461693.822013.json


[2024-01-28 12:08:27,359] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=13.29


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461707.362029.json


[2024-01-28 12:08:27,780] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=1/32
[2024-01-28 12:08:27,780] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:08:27,797] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2637


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461707.5642219.json


[2024-01-28 12:08:43,088] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=15.29


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461723.0910258.json


[2024-01-28 12:08:43,534] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=2/32
[2024-01-28 12:08:43,534] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:08:43,543] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3000


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461723.269987.json


[2024-01-28 12:09:00,102] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.56


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461740.1047628.json


[2024-01-28 12:09:00,513] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=3/32
[2024-01-28 12:09:00,513] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:09:00,528] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2148


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461740.299179.json


[2024-01-28 12:09:13,646] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=13.12
[2024-01-28 12:09:14,005] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=4/32
[2024-01-28 12:09:14,006] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461753.649469.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461753.8224158.json


[2024-01-28 12:09:14,024] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2715
[2024-01-28 12:09:29,543] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=15.52


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461769.547829.json


[2024-01-28 12:09:29,943] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=5/32
[2024-01-28 12:09:29,943] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:09:29,959] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2404


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461769.737819.json


[2024-01-28 12:09:43,734] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=13.77


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461783.736483.json


[2024-01-28 12:09:44,179] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=6/32
[2024-01-28 12:09:44,180] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:09:44,193] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2150


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461783.951519.json


[2024-01-28 12:09:57,090] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=12.84


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461797.092932.json


[2024-01-28 12:09:57,545] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=7/32
[2024-01-28 12:09:57,546] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:09:57,554] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2803


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461797.3226082.json


[2024-01-28 12:10:13,274] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=101, latency=15.72
[2024-01-28 12:10:13,714] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=8/32
[2024-01-28 12:10:13,715] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:10:13,734] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2369


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461813.278052.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461813.549855.json


[2024-01-28 12:10:27,401] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=13.67
[2024-01-28 12:10:27,770] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=9/32
[2024-01-28 12:10:27,771] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:10:27,790] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2675


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461827.403563.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461827.598955.json


[2024-01-28 12:10:43,069] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=15.27


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461843.074733.json


[2024-01-28 12:10:43,554] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=10/32
[2024-01-28 12:10:43,555] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:10:43,567] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2541


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461843.289514.json


[2024-01-28 12:10:57,775] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=14.21


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461857.778465.json


[2024-01-28 12:10:58,218] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=11/32
[2024-01-28 12:10:58,218] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:10:58,229] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2186


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461857.968318.json


[2024-01-28 12:11:11,185] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=101, latency=12.95


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461871.188097.json


[2024-01-28 12:11:11,753] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=12/32
[2024-01-28 12:11:11,753] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:11:11,771] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2775


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461871.5273972.json


[2024-01-28 12:11:27,281] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=15.51


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461887.283442.json


[2024-01-28 12:11:27,725] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=13/32
[2024-01-28 12:11:27,726] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:11:27,743] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2686


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461887.483595.json


[2024-01-28 12:11:42,617] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=14.87


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461902.620367.json


[2024-01-28 12:11:43,062] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=14/32
[2024-01-28 12:11:43,063] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:11:43,082] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2500


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461902.815459.json


[2024-01-28 12:11:57,339] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=14.26


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461917.342035.json


[2024-01-28 12:11:57,923] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=15/32
[2024-01-28 12:11:57,924] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:11:57,942] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2443


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461917.536362.json


[2024-01-28 12:12:11,836] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=13.89


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461931.840127.json


[2024-01-28 12:12:12,219] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=16/32
[2024-01-28 12:12:12,220] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:12:12,238] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2321


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461932.00141.json


[2024-01-28 12:12:25,619] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=13.38
[2024-01-28 12:12:25,970] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=17/32
[2024-01-28 12:12:25,970] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:12:25,979] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2428


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461945.6207302.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461945.8007429.json


[2024-01-28 12:12:39,743] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=13.76


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461959.74683.json


[2024-01-28 12:12:40,187] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=18/32
[2024-01-28 12:12:40,189] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:12:40,210] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2458


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461959.939127.json


[2024-01-28 12:12:54,330] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=14.12


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461974.334408.json


[2024-01-28 12:12:54,771] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=19/32
[2024-01-28 12:12:54,772] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:12:54,788] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2101


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461974.5237792.json


[2024-01-28 12:13:07,409] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=12.62


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706461987.412029.json


[2024-01-28 12:13:07,812] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=20/32
[2024-01-28 12:13:07,813] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:13:07,829] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2440


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706461987.5814872.json


[2024-01-28 12:13:22,125] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=14.29
[2024-01-28 12:13:22,531] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=21/32
[2024-01-28 12:13:22,532] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462002.130007.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462002.348941.json


[2024-01-28 12:13:22,553] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2711
[2024-01-28 12:13:37,795] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=15.24
[2024-01-28 12:13:38,226] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=22/32
[2024-01-28 12:13:38,228] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462017.801107.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462018.030903.json


[2024-01-28 12:13:38,263] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2771
[2024-01-28 12:13:53,824] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=15.56


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462033.826321.json


[2024-01-28 12:13:54,278] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=23/32
[2024-01-28 12:13:54,278] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:13:54,287] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2691


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462034.051965.json


[2024-01-28 12:14:09,352] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=15.06


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462049.3540711.json


[2024-01-28 12:14:09,803] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=24/32
[2024-01-28 12:14:09,803] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:14:09,825] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2624


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462049.569994.json


[2024-01-28 12:14:24,799] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=14.97


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462064.8038452.json


[2024-01-28 12:14:25,266] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=25/32
[2024-01-28 12:14:25,267] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:14:25,293] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2062


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462065.020154.json


[2024-01-28 12:14:38,005] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=12.71


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462078.0071738.json


[2024-01-28 12:14:38,423] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=26/32
[2024-01-28 12:14:38,424] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:14:38,445] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2213


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462078.193016.json


[2024-01-28 12:14:51,550] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=13.10


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462091.553428.json


[2024-01-28 12:14:52,017] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=27/32
[2024-01-28 12:14:52,018] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:14:52,043] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2608


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462091.7648919.json


[2024-01-28 12:15:07,067] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=15.02


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462107.074784.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462107.328418.json


[2024-01-28 12:15:07,527] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=28/32
[2024-01-28 12:15:07,529] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:15:07,549] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2770
[2024-01-28 12:15:22,792] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=15.24
[2024-01-28 12:15:23,209] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=29/32
[2024-01-28 12:15:23,211] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462122.793779.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462123.0149179.json


[2024-01-28 12:15:23,242] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2564
[2024-01-28 12:15:37,610] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=14.37


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462137.6164749.json


[2024-01-28 12:15:38,649] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=30/32
[2024-01-28 12:15:38,649] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:15:38,660] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2333


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462138.083358.json


[2024-01-28 12:15:52,145] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=13.48


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462152.147295.json


[2024-01-28 12:15:52,525] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=31/32
[2024-01-28 12:15:52,526] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:15:52,547] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2613


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462152.322217.json


[2024-01-28 12:16:07,366] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=14.82
[2024-01-28 12:16:07,757] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=32/32
[2024-01-28 12:16:07,758] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462167.3698492.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462167.564896.json


[2024-01-28 12:16:07,775] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=2358
[2024-01-28 12:16:21,427] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=13.65
[2024-01-28 12:16:21,806] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=1/57
[2024-01-28 12:16:21,807] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462181.43024.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462181.6110039.json


[2024-01-28 12:16:21,825] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3000
[2024-01-28 12:16:38,144] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.32


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462198.151179.json


[2024-01-28 12:16:38,634] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=2/57
[2024-01-28 12:16:38,636] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:16:38,679] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3896


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462198.3777769.json


[2024-01-28 12:16:58,302] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=19.62


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462218.303953.json


[2024-01-28 12:16:58,706] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=3/57
[2024-01-28 12:16:58,707] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:16:58,741] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3789


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462218.499069.json


[2024-01-28 12:17:18,083] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=19.34


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462238.085845.json


[2024-01-28 12:17:18,513] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=4/57
[2024-01-28 12:17:18,514] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:17:18,537] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3450


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462238.272243.json


[2024-01-28 12:17:36,284] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=17.74


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462256.286.json


[2024-01-28 12:17:36,783] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=5/57
[2024-01-28 12:17:36,783] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:17:36,806] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3482


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462256.5015428.json


[2024-01-28 12:17:54,798] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=17.99


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462274.800815.json


[2024-01-28 12:17:55,258] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=6/57
[2024-01-28 12:17:55,258] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:17:55,276] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3144


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462275.005314.json


[2024-01-28 12:18:12,027] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.75


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462292.0298188.json


[2024-01-28 12:18:12,509] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=7/57
[2024-01-28 12:18:12,509] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:18:12,531] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3639


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462292.277594.json


[2024-01-28 12:18:31,103] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.57


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462311.1054301.json


[2024-01-28 12:18:31,571] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=8/57
[2024-01-28 12:18:31,571] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:18:31,587] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3014


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462311.326041.json


[2024-01-28 12:18:47,944] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.35


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462327.946704.json


[2024-01-28 12:18:48,381] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=9/57
[2024-01-28 12:18:48,382] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:18:48,405] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3891


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462328.16227.json


[2024-01-28 12:19:07,766] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=19.36


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462347.768356.json


[2024-01-28 12:19:08,207] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=10/57
[2024-01-28 12:19:08,207] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:19:08,231] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3575


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462347.961466.json


[2024-01-28 12:19:26,670] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.44


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462366.6720252.json


[2024-01-28 12:19:27,087] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=11/57
[2024-01-28 12:19:27,088] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:19:27,108] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3419


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462366.872112.json


[2024-01-28 12:19:44,812] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=17.70


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462384.813555.json


[2024-01-28 12:19:45,294] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=12/57
[2024-01-28 12:19:45,295] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:19:45,316] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3458


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462385.027553.json


[2024-01-28 12:20:03,341] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.02


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462403.343495.json


[2024-01-28 12:20:03,743] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=13/57
[2024-01-28 12:20:03,743] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:20:03,755] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3508


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462403.524118.json


[2024-01-28 12:20:21,767] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.01
[2024-01-28 12:20:22,129] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=14/57
[2024-01-28 12:20:22,130] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:20:22,151] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3436


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462421.768918.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462421.960552.json


[2024-01-28 12:20:39,879] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=17.73


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462439.880791.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462440.0628722.json


[2024-01-28 12:20:40,263] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=15/57
[2024-01-28 12:20:40,264] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:20:40,288] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3249
[2024-01-28 12:20:57,143] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.85


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462457.1449118.json


[2024-01-28 12:20:57,572] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=16/57
[2024-01-28 12:20:57,573] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:20:57,589] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3087


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462457.344359.json


[2024-01-28 12:21:14,502] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.91


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462474.504142.json


[2024-01-28 12:21:14,980] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=17/57
[2024-01-28 12:21:14,980] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:21:15,002] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3680


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462474.7255309.json


[2024-01-28 12:21:33,814] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.81


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462493.819278.json


[2024-01-28 12:21:34,283] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=18/57
[2024-01-28 12:21:34,283] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:21:34,297] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3950


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462494.0411992.json


[2024-01-28 12:21:53,702] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=19.40


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462513.7047622.json


[2024-01-28 12:21:54,169] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=19/57
[2024-01-28 12:21:54,170] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:21:54,190] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3463


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462513.90523.json


[2024-01-28 12:22:12,485] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.29


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462532.486422.json


[2024-01-28 12:22:12,991] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=20/57
[2024-01-28 12:22:12,993] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:22:13,041] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3996


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462532.682731.json


[2024-01-28 12:22:32,794] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=19.75


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462552.8015842.json


[2024-01-28 12:22:33,263] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=21/57
[2024-01-28 12:22:33,264] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:22:33,290] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3050


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462553.00823.json


[2024-01-28 12:22:49,660] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.37


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462569.664415.json


[2024-01-28 12:22:50,219] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=22/57
[2024-01-28 12:22:50,220] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:22:50,262] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3737


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462569.956404.json


[2024-01-28 12:23:09,093] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.83


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462589.096591.json


[2024-01-28 12:23:09,560] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=23/57
[2024-01-28 12:23:09,561] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:23:09,595] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3410


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462589.302536.json


[2024-01-28 12:23:27,094] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=17.50


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462607.0958998.json


[2024-01-28 12:23:27,589] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=24/57
[2024-01-28 12:23:27,589] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:23:27,615] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3979


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462607.316318.json


[2024-01-28 12:23:47,704] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=20.09


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462627.7070668.json


[2024-01-28 12:23:48,184] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=25/57
[2024-01-28 12:23:48,185] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:23:48,204] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3171


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462627.900641.json


[2024-01-28 12:24:04,753] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.55
[2024-01-28 12:24:05,142] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=26/57


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462644.755378.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462644.944352.json


[2024-01-28 12:24:05,143] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:24:05,169] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3655
[2024-01-28 12:24:24,164] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.99


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462664.167101.json


[2024-01-28 12:24:24,611] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=27/57
[2024-01-28 12:24:24,611] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:24:24,631] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3662


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462664.3570712.json


[2024-01-28 12:24:42,958] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.33


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462682.959716.json


[2024-01-28 12:24:43,402] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=28/57
[2024-01-28 12:24:43,403] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:24:43,418] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3098


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462683.1493142.json


[2024-01-28 12:24:59,893] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.47


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462699.8949828.json


[2024-01-28 12:25:00,355] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=29/57
[2024-01-28 12:25:00,355] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:25:00,368] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3704


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462700.123463.json


[2024-01-28 12:25:19,263] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.89


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462719.2658198.json


[2024-01-28 12:25:19,712] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=30/57
[2024-01-28 12:25:19,712] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:25:19,731] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3400


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462719.462038.json


[2024-01-28 12:25:37,267] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=17.53


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462737.269637.json


[2024-01-28 12:25:37,705] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=31/57
[2024-01-28 12:25:37,705] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:25:37,725] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3098


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462737.455301.json


[2024-01-28 12:25:54,097] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.37


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462754.0993059.json


[2024-01-28 12:25:54,529] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=32/57
[2024-01-28 12:25:54,530] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:25:54,558] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3909


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462754.304333.json


[2024-01-28 12:26:14,107] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=19.55


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462774.109914.json


[2024-01-28 12:26:14,558] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=33/57
[2024-01-28 12:26:14,559] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:26:14,586] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3971


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462774.3107488.json


[2024-01-28 12:26:34,377] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=19.79


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462794.3792992.json


[2024-01-28 12:26:34,783] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=34/57
[2024-01-28 12:26:34,783] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:26:34,809] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3132


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462794.569195.json


[2024-01-28 12:26:51,345] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.53
[2024-01-28 12:26:51,712] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=35/57
[2024-01-28 12:26:51,713] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:26:51,732] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3135


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462811.3466802.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462811.536299.json


[2024-01-28 12:27:08,316] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.58


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462828.3201332.json


[2024-01-28 12:27:08,747] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=36/57
[2024-01-28 12:27:08,747] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:27:08,764] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3030


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462828.511977.json


[2024-01-28 12:27:24,923] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.16


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462844.9260318.json


[2024-01-28 12:27:25,355] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=37/57
[2024-01-28 12:27:25,356] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:27:25,379] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3848


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462845.114121.json


[2024-01-28 12:27:44,573] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=19.19


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462864.575946.json


[2024-01-28 12:27:45,038] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=38/57
[2024-01-28 12:27:45,038] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:27:45,060] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3208


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462864.7654452.json


[2024-01-28 12:28:02,013] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.95


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462882.016337.json


[2024-01-28 12:28:02,685] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=39/57
[2024-01-28 12:28:02,686] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:28:02,712] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3475


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462882.213212.json


[2024-01-28 12:28:21,623] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.91


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462901.625699.json


[2024-01-28 12:28:22,035] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=40/57
[2024-01-28 12:28:22,038] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:28:22,063] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3681


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462901.8119512.json


[2024-01-28 12:28:41,138] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=19.07


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462921.1406112.json


[2024-01-28 12:28:41,630] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=41/57
[2024-01-28 12:28:41,630] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:28:41,657] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3581


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462921.333308.json


[2024-01-28 12:29:00,207] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.55


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462940.209284.json


[2024-01-28 12:29:00,799] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=42/57
[2024-01-28 12:29:00,800] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:29:00,820] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3302


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462940.571202.json


[2024-01-28 12:29:18,252] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=17.43


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462958.254721.json


[2024-01-28 12:29:18,841] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=43/57
[2024-01-28 12:29:18,842] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:29:18,871] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3614


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462958.469755.json


[2024-01-28 12:29:37,273] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.40


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462977.2757728.json


[2024-01-28 12:29:37,743] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=44/57
[2024-01-28 12:29:37,743] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1
[2024-01-28 12:29:37,761] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3537


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462977.504333.json


[2024-01-28 12:29:56,074] p75618 {701838357.py:48} INFO - get_inference, done, endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.31
[2024-01-28 12:29:56,446] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=45/57
[2024-01-28 12:29:56,447] p75618 {1750118496.py:3} INFO - Processing chunk with concurrency=1


Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_chunk/1706462996.075784.json
Data successfully written to s3://fmbt2039/data/metrics/llama2-70b-g5-p4d-trt-v1/per_inference/1706462996.2660182.json


[2024-01-28 12:29:56,465] p75618 {701838357.py:23} INFO - get_inference, endpoint=llama-2-70b-g5-48xlarge-1706460294, prompt_tokens=3165
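
The `get_inference, done` lines in the output above carry the per-request latency and completion token count, so they can be summarized straight from a captured log. The following is only a minimal sketch: the regular expression is inferred from the log lines shown here rather than taken from the notebook's logging code, and `DONE_RE` and `parse_done_lines` are hypothetical names introduced for illustration.

```python
import re
from statistics import mean

# Pattern inferred from the "get_inference, done" lines in the output above
# (an assumption about the format, not taken from the notebook's source).
DONE_RE = re.compile(
    r"get_inference, done, endpoint=(?P<endpoint>[^,\s]+), "
    r"completion_tokens=(?P<completion_tokens>\d+), "
    r"latency=(?P<latency>\d+(?:\.\d+)?)"
)

def parse_done_lines(lines):
    """Return one dict per completed inference found in the given log lines."""
    records = []
    for line in lines:
        match = DONE_RE.search(line)
        if match:
            records.append({
                "endpoint": match.group("endpoint"),
                "completion_tokens": int(match.group("completion_tokens")),
                "latency": float(match.group("latency")),
            })
    return records

# Three lines copied from the output above; only two of them are "done" lines.
sample_log = [
    "[2024-01-28 12:18:31,103] p75618 {701838357.py:48} INFO - get_inference, done, "
    "endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=18.57",
    "[2024-01-28 12:18:31,571] p75618 {1751706209.py:26} INFO - e_idx=1/2, chunk_index=8/57",
    "[2024-01-28 12:18:47,944] p75618 {701838357.py:48} INFO - get_inference, done, "
    "endpoint=llama-2-70b-g5-48xlarge-1706460294, completion_tokens=102, latency=16.35",
]

records = parse_done_lines(sample_log)
mean_latency = mean(r["latency"] for r in records)
print(len(records), round(mean_latency, 2))
```

The per-inference JSON files written to S3 remain the authoritative record; parsing the log is mainly useful as a quick check while a long run is still in progress.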


In [None]:
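# List all object keys under an S3 prefix that end with the given suffix.
# A paginator is used because a single list_objects_v2 call returns at most 1,000 keys.
def list_s3_files(bucket, prefix, suffix='.json'):
    paginator = s3_client.get_paginator('list_objects_v2')
    return [item['Key']
            for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
            for item in page.get('Contents', [])
            if item['Key'].endswith(suffix)]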

# List .json files in the specified S3 directory
s3_files = list_s3_files(BUCKET_NAME, METRICS_PER_INFERENCE_DIR)

# Read and parse each JSON file from S3
json_list = []
for file_key in s3_files:
    response = s3_client.get_object(Bucket=BUCKET_NAME, Key=file_key)
    file_content = response['Body'].read().decode('utf-8')
    json_obj = json.loads(file_content)
    json_list.append(json_obj)

# Create DataFrame
df_responses = pd.DataFrame(json_list)
logger.info(f"created dataframe of shape {df_responses.shape} from all responses")
df_responses.head()


[2024-01-26 23:45:52,931] p32493 {2679022863.py:19} INFO - created dataframe of shape (42, 14) from all responses


Unnamed: 0,endpoint_name,prompt,do_sample,temperature,top_p,top_k,max_new_tokens,truncate,completion,prompt_tokens,completion_tokens,latency,experiment_name,concurrency
0,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia is from the ...,304,98,3.752459,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,1
1,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,980,Both WAGS Atlanta and WAGS are reality televi...,980,100,3.940795,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,1
2,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia and Staunton...,304,39,1.648553,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,2
3,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia is from the ...,304,99,3.759369,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,2
4,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,980,Both WAGS Atlanta and WAGS are reality televi...,980,16,1.085169,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,2


In [None]:
df_responses[df_responses.endpoint_name.str.contains("g5")]

Unnamed: 0,endpoint_name,prompt,do_sample,temperature,top_p,top_k,max_new_tokens,truncate,completion,prompt_tokens,completion_tokens,latency,experiment_name,concurrency
0,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia is from the ...,304,98,3.752459,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,1
1,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,980,Both WAGS Atlanta and WAGS are reality televi...,980,100,3.940795,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,1
2,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia and Staunton...,304,39,1.648553,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,2
3,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia is from the ...,304,99,3.759369,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,2
4,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,980,Both WAGS Atlanta and WAGS are reality televi...,980,16,1.085169,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,2
5,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,980,Both WAGS Atlanta and WAGS are reality televi...,980,100,4.162502,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,2
6,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia and Staunton...,304,98,4.01382,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,4
7,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nPassage 1:\nStauntonia\nStauntonia is...,304,101,4.012472,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,4
8,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia is from the ...,304,98,4.048025,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,4
9,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nPassage 1:\nStauntonia\nStauntonia is...,304,101,4.039957,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,4


In [None]:
# Same helper as above, but it also logs the raw S3 listing response.
# Note: a single list_objects_v2 call returns at most 1,000 keys.
def list_s3_files(bucket, prefix, suffix='.json'):
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    logger.info(f"files received from s3 for per inference request --> {response}")
    return [item['Key'] for item in response.get('Contents', []) if item['Key'].endswith(suffix)]

# List .json files in the specified S3 directory
s3_files = list_s3_files(BUCKET_NAME, METRICS_PER_CHUNK_DIR)

# Read and parse each JSON file from S3
json_list = []
for file_key in s3_files:
    response = s3_client.get_object(Bucket=BUCKET_NAME, Key=file_key)
    file_content = response['Body'].read().decode('utf-8')
    json_obj = json.loads(file_content)
    json_list.append(json_obj)

# Create DataFrame
df_metrics = pd.DataFrame(json_list)
logger.info(f"created dataframe of shape {df_metrics.shape} from all per-chunk metrics")
df_metrics.head()

[2024-01-26 23:45:56,531] p32493 {2358223361.py:3} INFO - files recieved from s3 for per inference request --> {'ResponseMetadata': {'RequestId': '151CGWQ29AH11640', 'HostId': '2Tc4L6zuZ/KwUbEYBUBIMp4B1gtzy6ReNumKl1xQ1JMmNgy11R91Y6giv2UNiF2KqsXs72J4phs=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': '2Tc4L6zuZ/KwUbEYBUBIMp4B1gtzy6ReNumKl1xQ1JMmNgy11R91Y6giv2UNiF2KqsXs72J4phs=', 'x-amz-request-id': '151CGWQ29AH11640', 'date': 'Sat, 27 Jan 2024 04:45:57 GMT', 'x-amz-bucket-region': 'us-east-1', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'IsTruncated': False, 'Contents': [{'Key': 'data/metrics/mistral-7b-tgi-g5-v1/per_chunk/1706330672.592785.json', 'LastModified': datetime.datetime(2024, 1, 27, 4, 44, 33, tzinfo=tzutc()), 'ETag': '"201806841e991778e563be75b30c06ad"', 'Size': 556, 'StorageClass': 'STANDARD'}, {'Key': 'data/metrics/mistral-7b-tgi-g5-v1/per_chunk/1706330677.033725.json', 'LastModified': datetime.date

Unnamed: 0,experiment_name,concurrency,payload_file,errors,successes,error_rate,all_prompts_token_count,prompt_token_count_mean,prompt_token_throughput,all_completions_token_count,completion_token_count_mean,completion_token_throughput,transactions,transactions_per_second,transactions_per_minute,latency_mean
0,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,1,payload_en_1-500.jsonl,[],1,0.0,304,304.0,79.07,98,98.0,25.49,1,0.26,15,3.752459
1,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,1,payload_en_500-1000.jsonl,[],1,0.0,980,980.0,248.06,100,100.0,25.31,1,0.25,15,3.940795
2,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,2,payload_en_1-500.jsonl,[],2,0.0,608,304.0,161.4,138,69.0,36.63,2,0.53,31,2.703961
3,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,2,payload_en_500-1000.jsonl,[],2,0.0,1960,980.0,470.13,116,58.0,27.82,2,0.48,28,2.623836
4,mistral-7b-g5-huggingface-pytorch-tgi-inferenc...,4,payload_en_1-500.jsonl,[],4,0.0,1216,304.0,299.26,398,99.5,97.95,4,0.98,58,4.028569
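
Because one `list_objects_v2` call returns at most 1,000 keys, a prefix holding more per-chunk metric files than that would be read only partially. A minimal sketch of a paginated listing follows. It assumes a boto3 S3 client is passed in, and `list_s3_keys` is a hypothetical helper name that is not part of this notebook.

```python
def list_s3_keys(s3_client, bucket, prefix, suffix=".json"):
    """Collect every key under `prefix` ending in `suffix`, across all result pages."""
    # boto3 clients expose a paginator that follows continuation tokens for us
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no 'Contents' entry when nothing matches the prefix
        for item in page.get("Contents", []):
            if item["Key"].endswith(suffix):
                keys.append(item["Key"])
    return keys
```

With the names used in this notebook, the call would look like `list_s3_keys(s3_client, BUCKET_NAME, METRICS_PER_CHUNK_DIR)`.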


In [None]:
# Flatten the nested endpoint descriptions into one row per endpoint
df_endpoints = pd.json_normalize(endpoint_info_list)
# The instance type is read from the first production variant of each endpoint
df_endpoints['instance_type'] = df_endpoints['endpoint_config.ProductionVariants'].map(lambda x: x[0]['InstanceType'])
# Keep every container environment variable alongside the identifying columns
cols_for_env = [c for c in df_endpoints.columns if 'Environment' in c]
print(cols_for_env)
cols_of_interest = ['experiment_name', 
                    'instance_type',
                    'endpoint.EndpointName',
                    'model_config.ModelName',
                    'model_config.PrimaryContainer.Image',   
                    'model_config.PrimaryContainer.ModelDataSource.S3DataSource.S3Uri']
cols_of_interest.extend(cols_for_env)

df_endpoints = df_endpoints[cols_of_interest]
# Shorten the dotted column names to their last component, e.g. 'endpoint.EndpointName' -> 'EndpointName'
cols_of_interest_renamed = [c.split('.')[-1] for c in cols_of_interest]
df_endpoints.columns = cols_of_interest_renamed

# Print the columns of both DataFrames to confirm that 'experiment_name' is present in each
print("Columns in df_responses:", df_responses.columns)
print("Columns in df_endpoints:", df_endpoints.columns)

# Left merge: keep every response row and add the endpoint configuration columns
df_results = pd.merge(left=df_responses, right=df_endpoints, how='left', on='experiment_name')

# Inspect the result
df_results.head()

['model_config.PrimaryContainer.Environment.ENDPOINT_SERVER_TIMEOUT', 'model_config.PrimaryContainer.Environment.HF_MODEL_ID', 'model_config.PrimaryContainer.Environment.MAX_BATCH_PREFILL_TOKENS', 'model_config.PrimaryContainer.Environment.MAX_INPUT_LENGTH', 'model_config.PrimaryContainer.Environment.MAX_TOTAL_TOKENS', 'model_config.PrimaryContainer.Environment.MODEL_CACHE_ROOT', 'model_config.PrimaryContainer.Environment.SAGEMAKER_ENV', 'model_config.PrimaryContainer.Environment.SAGEMAKER_MODEL_SERVER_WORKERS', 'model_config.PrimaryContainer.Environment.SAGEMAKER_PROGRAM', 'model_config.PrimaryContainer.Environment.SM_NUM_GPUS']
Columns in df_responses: Index(['endpoint_name', 'prompt', 'do_sample', 'temperature', 'top_p', 'top_k',
       'max_new_tokens', 'truncate', 'completion', 'prompt_tokens',
       'completion_tokens', 'latency', 'experiment_name', 'concurrency'],
      dtype='object')
Columns in df_endpoints: Index(['experiment_name', 'instance_type', 'EndpointName', 'ModelNam

Unnamed: 0,endpoint_name,prompt,do_sample,temperature,top_p,top_k,max_new_tokens,truncate,completion,prompt_tokens,...,ENDPOINT_SERVER_TIMEOUT,HF_MODEL_ID,MAX_BATCH_PREFILL_TOKENS,MAX_INPUT_LENGTH,MAX_TOTAL_TOKENS,MODEL_CACHE_ROOT,SAGEMAKER_ENV,SAGEMAKER_MODEL_SERVER_WORKERS,SAGEMAKER_PROGRAM,SM_NUM_GPUS
0,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia is from the ...,304,...,3600,/opt/ml/model,8191,8191,8192,/opt/ml/model,1,1,inference.py,1
1,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,980,Both WAGS Atlanta and WAGS are reality televi...,980,...,3600,/opt/ml/model,8191,8191,8192,/opt/ml/model,1,1,inference.py,1
2,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia and Staunton...,304,...,3600,/opt/ml/model,8191,8191,8192,/opt/ml/model,1,1,inference.py,1
3,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia is from the ...,304,...,3600,/opt/ml/model,8191,8191,8192,/opt/ml/model,1,1,inference.py,1
4,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,980,Both WAGS Atlanta and WAGS are reality televi...,980,...,3600,/opt/ml/model,8191,8191,8192,/opt/ml/model,1,1,inference.py,1
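
The flattening and renaming steps in the cell above can be illustrated on a small, made-up endpoint record. This is a sketch under stated assumptions: the record below only imitates the shape of one entry of `endpoint_info_list`, and its values are illustrative. It also adds a check, not present in the notebook, that two dotted names do not collapse to the same short name after `split('.')`.

```python
import pandas as pd

# Made-up record shaped like one entry of endpoint_info_list (illustrative values only)
endpoint_info = [{
    "experiment_name": "exp-a",
    "endpoint": {"EndpointName": "ep-1"},
    "endpoint_config": {"ProductionVariants": [{"InstanceType": "ml.g5.2xlarge"}]},
    "model_config": {"PrimaryContainer": {"Environment": {"SM_NUM_GPUS": "1"}}},
}]

# Nested dicts become dotted column names; lists are kept as single cell values
df = pd.json_normalize(endpoint_info)
df["instance_type"] = df["endpoint_config.ProductionVariants"].map(
    lambda variants: variants[0]["InstanceType"]
)

cols = [
    "experiment_name",
    "instance_type",
    "endpoint.EndpointName",
    "model_config.PrimaryContainer.Environment.SM_NUM_GPUS",
]
short_names = [c.split(".")[-1] for c in cols]

# Guard against two dotted names collapsing to the same short name
duplicates = sorted({n for n in short_names if short_names.count(n) > 1})
if duplicates:
    raise ValueError(f"ambiguous short column names: {duplicates}")

df_short = df[cols].copy()
df_short.columns = short_names
```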


In [None]:
# Left merge on the shared key: every response row is kept and gains the endpoint configuration columns
df_results = pd.merge(left=df_responses, right=df_endpoints, how='left', on='experiment_name')
df_results.head()

Unnamed: 0,endpoint_name,prompt,do_sample,temperature,top_p,top_k,max_new_tokens,truncate,completion,prompt_tokens,...,ENDPOINT_SERVER_TIMEOUT,HF_MODEL_ID,MAX_BATCH_PREFILL_TOKENS,MAX_INPUT_LENGTH,MAX_TOTAL_TOKENS,MODEL_CACHE_ROOT,SAGEMAKER_ENV,SAGEMAKER_MODEL_SERVER_WORKERS,SAGEMAKER_PROGRAM,SM_NUM_GPUS
0,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia is from the ...,304,...,3600,/opt/ml/model,8191,8191,8192,/opt/ml/model,1,1,inference.py,1
1,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,980,Both WAGS Atlanta and WAGS are reality televi...,980,...,3600,/opt/ml/model,8191,8191,8192,/opt/ml/model,1,1,inference.py,1
2,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia and Staunton...,304,...,3600,/opt/ml/model,8191,8191,8192,/opt/ml/model,1,1,inference.py,1
3,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,304,\n\n```\nThe genus Sinofranchetia is from the ...,304,...,3600,/opt/ml/model,8191,8191,8192,/opt/ml/model,1,1,inference.py,1
4,lmistral7b-g5-2xlarge-1706329863,<s>[INST] <<SYS>>\nYou are an assistant for qu...,True,0.1,0.92,120,100,980,Both WAGS Atlanta and WAGS are reality televi...,980,...,3600,/opt/ml/model,8191,8191,8192,/opt/ml/model,1,1,inference.py,1
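
A left merge silently leaves endpoint columns empty for any response whose `experiment_name` has no match. It also duplicates response rows if an experiment appears more than once on the right side. A minimal sketch of how both cases could be detected with the `validate` and `indicator` options of `pd.merge`, using made-up frames (the experiment names and values here are illustrative, not taken from the run above):

```python
import pandas as pd

# Illustrative stand-ins for df_responses and df_endpoints
responses = pd.DataFrame({
    "experiment_name": ["exp-a", "exp-a", "exp-b"],
    "latency": [3.7, 2.7, 4.0],
})
endpoints = pd.DataFrame({
    "experiment_name": ["exp-a"],
    "instance_type": ["ml.g5.2xlarge"],
})

merged = pd.merge(
    responses,
    endpoints,
    how="left",
    on="experiment_name",
    validate="many_to_one",  # raises MergeError if the right side repeats a key
    indicator=True,          # adds a '_merge' column describing each row's origin
)

# Rows marked 'left_only' had no matching endpoint configuration
unmatched = (
    merged.loc[merged["_merge"] == "left_only", "experiment_name"].unique().tolist()
)
```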


In [None]:
# Convert df_results to CSV and write to S3
csv_buffer = io.StringIO()
df_results.to_csv(csv_buffer, index=False)
csv_data_results = csv_buffer.getvalue()
results_file_name = config['results']['per_inference_request_file'].format(datetime=date_time)
results_s3_path = os.path.join(METRICS_DIR, results_file_name)
logger.info(f"results s3 path for per inference csv --> {results_s3_path}")
write_to_s3(csv_data_results, BUCKET_NAME, "", METRICS_DIR, results_file_name)
logger.info(f"saved results dataframe of shape={df_results.shape} in s3://{BUCKET_NAME}/{results_s3_path}")

[2024-01-26 23:46:16,611] p32493 {2449880785.py:7} INFO - results s3 path for per inference csv --> data/metrics/mistral-7b-tgi-g5-v1/per_inference_request_results.csv


[2024-01-26 23:46:17,031] p32493 {2449880785.py:9} INFO - saved results dataframe of shape=(42, 29) in s3://fmbt/data/metrics/mistral-7b-tgi-g5-v1/per_inference_request_results.csv


Data successfully written to s3://fmbt/data/metrics/mistral-7b-tgi-g5-v1/per_inference_request_results.csv
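
The `write_to_s3` helper used above is defined elsewhere in the project and its body is not shown in this section. As a rough sketch of the same idea, the function below serializes a DataFrame to CSV in memory and uploads it with the S3 `put_object` call. `df_to_s3_csv` is a hypothetical name, and the client is passed in as an argument so the function can be exercised without AWS access.

```python
import io
import posixpath

import pandas as pd


def df_to_s3_csv(s3_client, df, bucket, prefix, file_name):
    """Upload `df` as a CSV object and return its s3:// URI."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    # S3 keys always use forward slashes, so join with posixpath rather than os.path
    key = posixpath.join(prefix, file_name)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=buffer.getvalue().encode("utf-8"),
    )
    return f"s3://{bucket}/{key}"
```

Joining the key with `posixpath` keeps it independent of the operating system the notebook runs on.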


In [None]:
# Add the endpoint configuration columns to the per-chunk metrics
df_metrics = pd.merge(left=df_metrics, right=df_endpoints, how='left', on='experiment_name')

# Convert df_metrics to CSV and write to S3
csv_buffer = io.StringIO()
df_metrics.to_csv(csv_buffer, index=False)
csv_data_metrics = csv_buffer.getvalue()
metrics_file_name = config['results']['all_metrics_file'].format(datetime=date_time)
metrics_s3_path = os.path.join(METRICS_DIR, metrics_file_name)
logger.info(f"results s3 path for metrics csv --> {metrics_s3_path}")
write_to_s3(csv_data_metrics, BUCKET_NAME, "", METRICS_DIR, metrics_file_name)
logger.info(f"saved metrics results dataframe of shape={df_metrics.shape} in s3://{BUCKET_NAME}/{metrics_s3_path}")

[2024-01-26 23:46:31,228] p32493 {3262244809.py:10} INFO - results s3 path for metrics csv --> data/metrics/mistral-7b-tgi-g5-v1/all_metrics.csv


[2024-01-26 23:46:31,470] p32493 {3262244809.py:12} INFO - saved metrics results dataframe of shape=(10, 31) in s3://fmbt/data/metrics/mistral-7b-tgi-g5-v1/all_metrics.csv


Data successfully written to s3://fmbt/data/metrics/mistral-7b-tgi-g5-v1/all_metrics.csv
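
The code that produces the per-chunk metric columns (for example `prompt_token_throughput` and `transactions_per_second`) is not part of this section. The sketch below shows one plausible way such columns could be derived. It assumes that each throughput is a total divided by the wall-clock duration of the chunk, which is consistent with the sample rows of `df_metrics` shown earlier but is not confirmed here. `summarize_chunk` and its parameters are hypothetical names.

```python
def summarize_chunk(prompt_tokens, completion_tokens, duration_seconds):
    """Aggregate per-request token counts for one chunk of concurrent requests.

    Assumption: throughput = total / wall-clock duration of the chunk.
    """
    n = len(prompt_tokens)
    if n == 0 or duration_seconds <= 0:
        raise ValueError("need at least one request and a positive duration")

    total_prompt = sum(prompt_tokens)
    total_completion = sum(completion_tokens)
    tps = n / duration_seconds

    return {
        "transactions": n,
        "all_prompts_token_count": total_prompt,
        "prompt_token_count_mean": total_prompt / n,
        "prompt_token_throughput": round(total_prompt / duration_seconds, 2),
        "all_completions_token_count": total_completion,
        "completion_token_count_mean": total_completion / n,
        "completion_token_throughput": round(total_completion / duration_seconds, 2),
        "transactions_per_second": round(tps, 2),
        "transactions_per_minute": int(tps * 60),
    }


# Illustrative numbers: two concurrent requests finishing within 4 seconds
summary = summarize_chunk([304, 304], [60, 40], duration_seconds=4.0)
```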
