# Human Evaluation

## What Will We Do? 
* Create an eval prompt template (notebook 4a) 
* Create two different RAG prompt templates and compare results (notebook 4a) 
* Run LLM-As-A-Judge against an eval dataset and grade RAG system response (notebook 4a) 
* **Run Human-Eval to ensure LLM-As-A-Judge is aligned with human preferences (notebook 4b)** 
* **Run Human-Eval and compare GroundTruth with RAG system response (notebook 4b)**
* Run through evaluation with different LLMs (notebook 4c) 

So let's start.

### Install Dependencies

In [360]:
# !conda create -y --name llm-system-eval python=3.11.7
# !conda init && conda activate llm-system-eval
# !conda install -n llm-system-eval ipykernel --update-deps --force-reinstall -y
# OR
# !pyenv virtualenv 3.11.7 llm-system-eval
# !pyenv activate llm-system-eval

# %pip install -r requirements.txt

### Set Environment Variables

In [3]:
# set variables
from dotenv import load_dotenv, find_dotenv
import os
import sagemaker
import numpy as np
import pandas as pd
import json
import boto3


# loading environment variables that are stored in local file
local_env_filename = 'llm-system-eval.env'
load_dotenv(find_dotenv(local_env_filename),override=True)
os.environ['REGION'] = os.getenv('REGION')
REGION = os.environ['REGION']
os.environ['WORKTEAM_ARN'] = os.getenv('WORKTEAM_ARN')
WORKTEAM_ARN = os.environ['WORKTEAM_ARN']
os.environ['S3_BUCKET_NAME'] = os.getenv('S3_BUCKET_NAME')
S3_BUCKET_NAME = os.environ['S3_BUCKET_NAME']
os.environ['SAGEMAKER_ROLE_ARN'] = os.getenv('SAGEMAKER_ROLE_ARN')
SAGEMAKER_ROLE_ARN = os.environ['SAGEMAKER_ROLE_ARN'] # OR sagemaker.get_execution_role()



# Also initialize the bedrock client so we can call some embedding models!
session = boto3.Session(profile_name='default')
bedrock_runtime_client = boto3.client('bedrock-runtime')

sagemaker_session = sagemaker.Session()
sagemaker_client = boto3.client('sagemaker')
role_name = SAGEMAKER_ROLE_ARN.split("/")[-1]

bucket = sagemaker_session.default_bucket()
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')

# Specify ARNs for the resources needed to run a labeling job.
ac_arn_map = {
    "us-west-2": "081040173940",
    "us-east-1": "432418664414",
    "us-east-2": "266458841044",
    "eu-west-1": "568282634449"
}
# PreHumanTaskLambdaArn: pass-through pre-annotation Lambda
prehuman_arn = "arn:aws:lambda:{}:{}:function:PRE-PassThrough".format(
    REGION, ac_arn_map[REGION]
)

# AnnotationConsolidationConfig: pass-through annotation consolidation Lambda
acs_arn = "arn:aws:lambda:{}:{}:function:ACS-PassThrough".format(REGION, ac_arn_map[REGION])

print(prehuman_arn)
print(acs_arn)

arn:aws:lambda:us-east-1:432418664414:function:PRE-PassThrough
arn:aws:lambda:us-east-1:432418664414:function:ACS-PassThrough
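
The cell above reads four settings from the env file. If one of them is missing, `os.getenv` returns `None` and the failure is harder to trace back to its cause. A minimal sketch of a fail-fast check follows; the helper name `require_env` and the `REQUIRED_VARS` tuple are hypothetical and not part of the notebook.

```python
import os

# Names taken from the cell above; the helper itself is an illustrative assumption.
REQUIRED_VARS = ("REGION", "WORKTEAM_ARN", "S3_BUCKET_NAME", "SAGEMAKER_ROLE_ARN")


def require_env(names, environ=None):
    """Return {name: value} for every name, or raise listing all missing ones."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Example with an explicit mapping (placeholder value, not a real resource)
settings = require_env(("REGION",), environ={"REGION": "us-east-1"})
print(settings["REGION"])
```

In the notebook this could be called as `require_env(REQUIRED_VARS)` right after `load_dotenv(...)`, so that all missing names are reported at once.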


# Human Evaluation

## Human Eval1

In [4]:
# Human Eval1: compare the LLM-as-a-Judge evaluation results with human evaluation results to ensure they align.

# Step 1: Load jsonl file, e.g. 4_answer_validation_opensearch_graded1.jsonl

human_eval_df = pd.read_json('../data/eval-datasets/4_answer_validation_opensearch_graded1.jsonl', lines=True)

eval_criteria = '''Evaluation Criteria (Additive Score, 0-3):
                    1. Context: Award 1 point if the answer uses only information provided in the context, without introducing external or fabricated details.
                    2. Completeness: Add 1 point if the answer addresses all key elements of the question based on the available context, without omissions.
                    3. Conciseness: Add a final point if the answer uses the fewest words possible to address the question and avoids redundancy.
                    '''
# Step 2: Convert to the input manifest format expected by Ground Truth
formatted_data = ""
for index, row in human_eval_df.iterrows():
    entry = {
        "source": row["query_text"],
        "response": row["generated_answer"],
        "eval_criteria": eval_criteria,
        "reasoning": row["reasoning"],
        "score": row["score"],
        "retrieved_chunks": str(row["retrieved_chunks"]),
    }
    formatted_data += json.dumps(entry) + "\n"

# Step 3: Save the output as a new JSON Lines file (one JSON object per line)
local_path = '../data/eval-datasets/4_review_judge.json'
with open(local_path, 'w') as f:
    f.write(formatted_data)

print(f"Conversion complete. The file {local_path} is ready.")

# Step 4: Upload the manifest to S3
bucket_name = bucket
prefix = 'genai'
# Upload the file to S3
manifest_s3_uri = sagemaker_session.upload_data(path=local_path, bucket=bucket, key_prefix=prefix)

print(f"File uploaded to {manifest_s3_uri}")

Conversion complete. The file ../data/eval-datasets/4_review_judge.json is ready.
File uploaded to s3://sagemaker-us-east-1-026459568683/genai/4_review_judge.json
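
A Ground Truth input manifest is a JSON Lines file in which every line is a complete JSON object carrying a `source` or `source-ref` key. The sketch below checks a manifest string against those two rules before upload; the function name `find_manifest_problems` is hypothetical, and the check is deliberately narrower than whatever validation the service itself performs.

```python
import json


def find_manifest_problems(manifest_text):
    """Return (line_number, message) pairs for lines that break the two rules."""
    problems = []
    for number, line in enumerate(manifest_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines such as a trailing newline
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(entry, dict):
            problems.append((number, "line is not a JSON object"))
        elif "source" not in entry and "source-ref" not in entry:
            problems.append((number, "missing 'source' or 'source-ref' key"))
    return problems


# Illustrative input: line 1 is valid, line 2 lacks 'source', line 3 is not JSON
sample = '{"source": "q1", "response": "a1"}\n{"response": "a2"}\nnot json\n'
print(find_manifest_problems(sample))
```

Running it on `formatted_data` before `upload_data` would surface a malformed row locally instead of after the labeling job has been created.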


In [5]:
# Step 5: Create template for judge eval review
from IPython.display import display, HTML

def make_template(save_fname="4_instructions_judge_review.template"):
    template = """
      <html>
      <head>
        <meta charset="UTF-8" />
        <script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
        <script src="https://assets.crowd.aws/gen-ai-components.js"></script>
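        <script>
          // Attach the handler after the DOM is parsed; the button does not exist yet while <head> loads.
          document.addEventListener('DOMContentLoaded', function() {
            document.getElementById('submitButton').onclick = function() {
              document.querySelector('crowd-form').submit();
            };
          });
        </script>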
        <style>
          body {
            padding-left: 20px;
            margin-bottom: 20px;
          }
          .highlight {
            background-color: yellow;
          }
          .container {
            margin-bottom: 20px;
          }
          .outer-container {
            display: flex;
            justify-content: space-between;
            width: 100%;
            max-width: 1400px;
          }
          .left-container, .right-container {
            padding: 20px;
            min-width: 75%;
          }
          .right-container {
            border-left: 2px solid #d5dbdb;
          }
        </style>
      </head>
    <body>
    <crowd-form>
      <!-- Prevent crowd-form from creating its own button -->
      <crowd-button form-action="submit" style="display: none;"></crowd-button> 
      <div class="container">
        <h1>Instructions</h1>
        Review the prompt, RAG System response, RAG Context, and evaluation criteria. Then complete the tasks in the right panel using your best judgment.
      </div>
      <br>
      <div class="outer-container">
        <div class="left-container">
          <h2>Evaluation Criteria</h2>
          <div>{{ task.input.eval_criteria }}</div>
          <br>
          <h2>Original Prompt</h2>
          <div>{{ task.input.source }}</div>
          <br>
          <h2>RAG System Response</h2>
          <div>{{ task.input.response }}</div>
          <br>
          <h2>RAG Context</h2>
          <div>{{ task.input.retrieved_chunks }}</div>
          <br> 
          <h2>LLM-as-a-Judge Score</h2>
          <div>{{ task.input.score }}</div>
          <br>
          <h2>LLM-as-a-Judge Reasoning</h2>
          <div>{{ task.input.reasoning }}</div>
          
                
        </div>
        <div class="right-container">
          <h2>Evaluation</h2>
          <div>
            <label for="context">Context:</label>
            <div>
              <input type="radio" id="context1" name="context" value="1" required>
              <label for="context1">1 Point</label><br>
              <input type="radio" id="context2" name="context" value="0">
              <label for="context2">0 Points</label><br>
            </div>
          </div>
          <br>
          <div>
            <label for="completeness">Completeness:</label>
            <div>
              <input type="radio" id="completeness1" name="completeness" value="1" required>
              <label for="completeness1">1 Point</label><br>
              <input type="radio" id="completeness2" name="completeness" value="0">
              <label for="completeness2">0 Points</label><br>
            </div>
          </div>
          <br>
          <div>
            <label for="conciseness">Conciseness:</label>
            <div>
              <input type="radio" id="conciseness1" name="conciseness" value="1" required>
              <label for="conciseness1">1 Point</label><br>
              <input type="radio" id="conciseness2" name="conciseness" value="0">
              <label for="conciseness2">0 Points</label><br>
            </div>
          </div>
          <br>
          <div>
            <label for="issues">Select any issues observed:</label>
            <div>
              <input type="checkbox" id="grammar" name="issues" value="grammar">
              <label for="grammar">Grammar issues</label><br>
              <input type="checkbox" id="factual" name="issues" value="factual">
              <label for="factual">Factual inaccuracies</label><br>
              <input type="checkbox" id="missing" name="issues" value="missing">
              <label for="missing">Missing information</label><br>
              <input type="checkbox" id="other" name="issues" value="other">
              <label for="other">Other</label>
            </div>
          </div>
          <br>
          <div>
           <crowd-button id="submitButton">Submit</crowd-button>
          </div>
        </div>
      </div>
    </crowd-form>
    </body>
    </html>
    """
    with open(save_fname, "w") as f:
        f.write(template)

# Create the template file locally
local_path ='../data/eval-datasets/4_instructions_judge_review.template'
make_template(save_fname=local_path)

# Upload the file to S3
template_s3_uri = sagemaker_session.upload_data(path=local_path, bucket=bucket, key_prefix=prefix)

print(f"Template file uploaded to {template_s3_uri}")

Template file uploaded to s3://sagemaker-us-east-1-026459568683/genai/4_instructions_judge_review.template
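
The template reads its values through `{{ task.input.<name> }}` placeholders, so every name it references needs a matching key in each manifest entry. A small sketch of that cross-check follows; the function name `missing_template_fields` and the sample strings are illustrative assumptions, and the regular expression only covers simple dotted names like the ones used in this notebook.

```python
import re

# Matches the identifier that follows "task.input." inside a Liquid placeholder.
TASK_INPUT_PATTERN = re.compile(r"\{\{\s*task\.input\.([A-Za-z_][A-Za-z0-9_]*)")


def missing_template_fields(template_text, manifest_entry):
    """Return task.input names used in the template but absent from the entry."""
    referenced = set(TASK_INPUT_PATTERN.findall(template_text))
    return sorted(referenced - set(manifest_entry))


# Illustrative snippet and entry; the entry deliberately lacks "score"
template_snippet = "<div>{{ task.input.source }}</div><div>{{ task.input.score }}</div>"
entry = {"source": "example question", "response": "example answer"}
print(missing_template_fields(template_snippet, entry))  # → ['score']
```

An empty list means the entry supplies everything the template asks for.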


In [6]:
# Step 6: Create a Ground Truth labeling job so that humans can review the LLM-as-a-Judge evaluation results
import datetime

now = datetime.datetime.now()
timestamp_str = now.strftime("%Y%m%d-%H%M%S")
labeling_job_name = "rag-review-llm-judge-" + timestamp_str

sagemaker_client.create_labeling_job(
    LabelingJobName=labeling_job_name,
    LabelAttributeName='label',
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': manifest_s3_uri #Enter S3 URI of Input Data Json
            }
        }
    },
    OutputConfig={
        'S3OutputPath': f's3://{bucket}/{prefix}/output/' #Enter S3 URI of Output folder
    },
    RoleArn=SAGEMAKER_ROLE_ARN, #Enter IAM Sagemaker Execution Role here,
    HumanTaskConfig={
        'WorkteamArn': WORKTEAM_ARN, #Enter Workteam ARN
        'UiConfig': {
            'UiTemplateS3Uri': template_s3_uri #Enter S3 URI of UI template
        },
        'PreHumanTaskLambdaArn': prehuman_arn,
        'TaskKeywords': [
            'QnA',
        ],
        'TaskTitle': labeling_job_name,
        'TaskDescription': "Review the LLM-as-a-Judge results using the dimensions on the side",
        'NumberOfHumanWorkersPerDataObject': 1,
        'TaskTimeLimitInSeconds': 60*30,
        'TaskAvailabilityLifetimeInSeconds': 60*60*24*10,
        'MaxConcurrentTaskCount': 100,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': acs_arn
        }
    })

{'LabelingJobArn': 'arn:aws:sagemaker:us-east-1:026459568683:labeling-job/rag-review-llm-judge-20240827-233822',
 'ResponseMetadata': {'RequestId': '700b4ee3-ad60-4d9c-aee4-786aaa0cf8f1',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '700b4ee3-ad60-4d9c-aee4-786aaa0cf8f1',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '111',
   'date': 'Wed, 28 Aug 2024 03:38:22 GMT'},
  'RetryAttempts': 0}}

In [7]:
# Step 7: Do Human evaluation
workforce = sagemaker_client.describe_workforce(WorkforceName="default")
worker_portal_url = 'https://' + workforce["Workforce"]["SubDomain"]


# Display the URL and instructions
display(HTML(f"""
<body>
<h4>Review LLM-as-a-Judge results</h4>
<p>Please complete the human evaluation tasks available in the labeling portal.</p>
<p><a href="{worker_portal_url}">{worker_portal_url}</a></p>
<p><b>Ensure all tasks are completed before proceeding to the next steps in this notebook.</b></p>
</body>
"""))

In [14]:
# Step 8: monitor the job's progress
sagemaker_client.describe_labeling_job(LabelingJobName=labeling_job_name)
# stop labeling job
# sagemaker_client.stop_labeling_job(LabelingJobName=labeling_job_name)

{'LabelingJobStatus': 'Completed',
 'LabelCounters': {'TotalLabeled': 24,
  'HumanLabeled': 24,
  'MachineLabeled': 0,
  'FailedNonRetryableError': 0,
  'Unlabeled': 0},
 'CreationTime': datetime.datetime(2024, 8, 27, 23, 38, 23, 55000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2024, 8, 28, 11, 2, 40, 498000, tzinfo=tzlocal()),
 'JobReferenceCode': '026459568683QNSoxT1GelBEmC4k',
 'LabelingJobName': 'rag-review-llm-judge-20240827-233822',
 'LabelingJobArn': 'arn:aws:sagemaker:us-east-1:026459568683:labeling-job/rag-review-llm-judge-20240827-233822',
 'LabelAttributeName': 'label',
 'InputConfig': {'DataSource': {'S3DataSource': {'ManifestS3Uri': 's3://sagemaker-us-east-1-026459568683/genai/4_review_judge.json'}}},
 'OutputConfig': {'S3OutputPath': 's3://sagemaker-us-east-1-026459568683/genai/output/',
  'KmsKeyId': ''},
 'RoleArn': 'arn:aws:iam::026459568683:role/demo-SagemakerNotebookIAMRole-026459568683',
 'StoppingConditions': {'MaxPercentageOfInputDatasetLabeled': 1
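
Step 8 is re-run by hand until `LabelingJobStatus` reads `Completed`. The same check can be sketched as a polling loop. The function `wait_for_labeling_job` and its parameters are hypothetical; it accepts any callable with the keyword signature of `describe_labeling_job`, which lets the example run here against a stand-in instead of a real job.

```python
import time

# Statuses after which a labeling job no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_labeling_job(describe, job_name, poll_seconds=60, max_polls=120,
                          sleep=time.sleep):
    """Poll until the job reaches a terminal state and return the last response."""
    for _ in range(max_polls):
        response = describe(LabelingJobName=job_name)
        if response["LabelingJobStatus"] in TERMINAL_STATES:
            return response
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


# Stand-in for sagemaker_client.describe_labeling_job (illustrative only)
_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(LabelingJobName):
    return {"LabelingJobName": LabelingJobName, "LabelingJobStatus": next(_statuses)}


result = wait_for_labeling_job(fake_describe, "demo-job", poll_seconds=0,
                               sleep=lambda seconds: None)
print(result["LabelingJobStatus"])
```

In the notebook the call would be `wait_for_labeling_job(sagemaker_client.describe_labeling_job, labeling_job_name)`. Because human review can take hours, a long `poll_seconds` is probably the sensible default.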

In [17]:
# Step 9: Compare human evaluation results with LLM-as-a-Judge results
def load_json_from_s3(bucket, key):
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    data = json.loads(obj['Body'].read().decode('utf-8'))
    return data

source_bucket_name = bucket
key = f'{prefix}/output/{labeling_job_name}/manifests/output/output.manifest'

# Download the manifest file
response = s3_client.get_object(Bucket=source_bucket_name, Key=key)
content = response['Body'].read().decode('utf-8')

# Split the content into lines
lines = content.strip().split('\n')

# Parse each line as JSON and create a list of dictionaries
data = [json.loads(line) for line in lines]
manifest_df = pd.DataFrame(data)

for index, row in manifest_df.iterrows():
    label_metadata = row['label-metadata']
    # Each manifest row points to the S3 object that holds the worker's response
    key = label_metadata['worker-response-ref'].replace(f's3://{bucket}/', '')
    data = load_json_from_s3(bucket, key)
    # Initialize every score before the loop so a row without answers still gets zeros
    completeness = 0
    conciseness = 0
    context = 0
    human_eval_score = 0
    for answer in data["answers"]:
        content = answer['answerContent']
        # Each radio group is read as {"<value>": bool}; "1" is the "1 Point" option
        completeness = int(content['completeness']['1'])
        conciseness = int(content['conciseness']['1'])
        context = int(content['context']['1'])
        # Additive score (0-3), matching the evaluation criteria given to the judge
        human_eval_score = completeness + conciseness + context

    manifest_df.at[index, 'completeness'] = completeness
    manifest_df.at[index, 'conciseness'] = conciseness
    manifest_df.at[index, 'context'] = context
    manifest_df.at[index, 'human_eval_score'] = human_eval_score

manifest_df = manifest_df.drop(columns=['label','eval_criteria','retrieved_chunks','label-metadata'])

# Convert the dataframe to an HTML table (escape=False renders cell content as raw HTML)
table_html = manifest_df.to_html(index=False, classes='table table-striped', escape=False)
display(HTML(table_html))

source,response,reasoning,score,completeness,conciseness,context,human_eval_score
I'm using version 2.1 of open search and trying to use zstd compression. Why isn't it working?,There is not enough information available to answer this question. The context does not mention why zstd compression is not working for your specific case with OpenSearch version 2.1. The context only provides information about the zstd compression codec being introduced in OpenSearch 2.9 and later versions.,Context:\n1. The context does not provide any information about why zstd compression is not working for OpenSearch version 2.1. It only mentions that zstd compression was introduced in OpenSearch 2.9 and later versions.\n\nCompleteness:\n1. The answer correctly states that there is not enough information in the context to explain why zstd compression is not working for OpenSearch 2.1.\n2. The answer addresses the key elements of the question by acknowledging that the context does not cover version 2.1 and only provides information about compression codecs available in 2.9 and later versions.\n\nConciseness:\n1. The answer is concise and does not contain any redundant information. It directly states that the context lacks the necessary information to answer the question.,3,0.0,1.0,1.0,2.0
"I'm trying to set up this new aggregate view thing for saved objects in OpenSearch Dashboards, but I'm worried about messing up our existing multi-tenancy setup. The docs mention something about tenant indexes and a kibana_server role. How do I make sure I don't break anything when I turn this feature on? And what's the deal with not being able to turn it off once it's enabled","The docs mention that the aggregate view feature for saved objects in OpenSearch Dashboards is an experimental feature that must be enabled using a feature flag in the opensearch_dashboards.yml file. Once enabled, it cannot be disabled without potentially causing issues like data loss or breaking tenant-to-tenant functionality. The docs also state that this feature should only be used in a new test cluster, not in an existing production cluster. As for the kibana_server role, it is required for OpenSearch Dashboards to perform necessary operations and has permissions defined for various indexes and actions. There is not enough information provided about how to ensure the existing multi-tenancy setup is not broken when enabling the aggregate view feature.","Context:\n1. The context mentions that the aggregate view feature for saved objects in OpenSearch Dashboards is an experimental feature that must be enabled using a feature flag. It also states that once enabled, it cannot be disabled without potentially causing issues like data loss or breaking tenant-to-tenant functionality. The context does not provide any information on how to ensure the existing multi-tenancy setup is not broken when enabling this feature.\n\nCompleteness:\n1. The answer addresses the key elements of the question by mentioning that the aggregate view feature is experimental, must be enabled using a feature flag, and cannot be disabled once enabled without potential issues. 
However, it does not provide any information on how to ensure the existing multi-tenancy setup is not broken, as mentioned in the question.\n\nConciseness:\n1. The answer is concise and does not contain any redundant information.",2,1.0,1.0,1.0,3.0
What software license does Opensearch use?,There is not enough information available to answer this question,"Context:\n1. The context does not provide any information about the software license used by OpenSearch. Therefore, I cannot award a point for this criterion.\n\nCompleteness:\n2. The generated answer correctly states that there is not enough information available to answer the question based on the provided context. Therefore, I will award 1 point for completeness.\n\nConciseness:\n3. The generated answer is concise and uses the fewest words possible to address the question without redundancy. I will award 1 point for conciseness.",2,0.0,1.0,1.0,2.0
Does GPU accelerated nodes support Pytorch?,"Yes, GPU accelerated nodes support PyTorch 1.12.1 according to the report.","Context:\n1. The context clearly states that ""GPU-accelerated ML nodes require [PyTorch](https://pytorch.org/docs/stable/index.html) 1.12.1 work with ML models."" Therefore, the answer uses information provided in the context and does not introduce external or fabricated details. I will award 1 point for this criterion.\n\nCompleteness:\n2. The question asks if GPU accelerated nodes support PyTorch, and the answer confirms that they support PyTorch 1.12.1 according to the report. This addresses all key elements of the question based on the available context, so I will award 1 point for completeness.\n\nConciseness:\n3. The answer is concise and uses the fewest words possible to address the question without redundancy. I will award 1 point for conciseness.",3,1.0,1.0,1.0,3.0
Does opensearch support hugging face models? If so which ones?,"Yes, OpenSearch supports Hugging Face models for text embedding and cross-encoding tasks. The supported Hugging Face models are listed under the ""Sentence transformers"" and ""Cross-encoder models"" sections in the provided context.","Context:\n1. The context provides a detailed list of Hugging Face models supported by OpenSearch for text embedding, sparse encoding, and cross-encoding tasks. It covers the model names, versions, vector dimensions, auto-truncation support, and artifact links for downloading the models.\n\nCompleteness:\n2. The generated answer directly addresses the original question by confirming that OpenSearch supports Hugging Face models and specifying which ones are supported under the ""Sentence transformers"" and ""Cross-encoder models"" sections. The answer covers all key elements of the question based on the provided context.\n\nConciseness:\n3. The answer is concise and uses the fewest words possible to address the question without redundancy.\n\nScoring:\nContext: 1 point - The answer uses only information provided in the context without introducing external details.\nCompleteness: 1 point - The answer addresses all key elements of the question based on the available context.\nConciseness: 1 point - The answer is concise and avoids redundancy.\n\nTotal score: 3/3",3,0.0,1.0,1.0,2.0
"I have a custom model, can I run it in Opensearch?","Yes, you can run your custom model in OpenSearch. The documentation provides instructions on how to register and deploy a custom model in OpenSearch. You need to provide the model file URL, model configuration details like model type and embedding dimension, and other metadata to register the model. Once registered, you can deploy the model and use it for inference tasks like text embedding or neural search.","1. Context: The context provides detailed information about registering and deploying models in OpenSearch, including custom models, pretrained models, and models hosted on third-party platforms. It covers the required parameters and provides examples for each scenario.\n\n2. Completeness: The generated answer fully addresses the original question of whether custom models can be run in OpenSearch. It confirms that custom models can be registered and deployed in OpenSearch, and provides the necessary steps and parameters to do so.\n\n3. Conciseness: The generated answer is concise and directly addresses the question without redundancy.",3,1.0,1.0,1.0,3.0
"I have a model and some ML nodes, how do I boost it's performance?","To boost the performance of your ML model, you can take advantage of GPU acceleration on your ML node. The report mentions that OpenSearch supports GPU instances from NVIDIA with CUDA 11.6 and AWS Inferentia. It provides instructions for preparing NVIDIA and AWS Inferentia ML nodes to enable GPU acceleration.","Context:\n- The context provides information on how to boost the performance of an ML model by using GPU acceleration on ML nodes in an OpenSearch cluster.\n- It mentions that OpenSearch supports GPU instances from NVIDIA with CUDA 11.6 and AWS Inferentia.\n- It provides instructions for preparing NVIDIA and AWS Inferentia ML nodes to enable GPU acceleration.\n\nQuestion:\n""I have a model and some ML nodes, how do I boost it's performance?""\n\nGenerated answer:\n""To boost the performance of your ML model, you can take advantage of GPU acceleration on your ML node. The report mentions that OpenSearch supports GPU instances from NVIDIA with CUDA 11.6 and AWS Inferentia. It provides instructions for preparing NVIDIA and AWS Inferentia ML nodes to enable GPU acceleration.""\n\nEvaluation based on criteria:\n\n1. Context: The answer uses only information provided in the context, without introducing external or fabricated details. Award 1 point.\n\n2. Completeness: The answer addresses all key elements of the question based on the available context, without omissions. It explains how to boost performance by using GPU acceleration and mentions the supported GPU instances. Award 1 point.\n\n3. Conciseness: The answer is concise and uses the fewest words possible to address the question without redundancy. Award 1 point.\n\nTotal score: 3/3",3,1.0,1.0,1.0,3.0
Can you show me an example of how to use lat/long coordinates?,"The report provides several examples of how to use latitude and longitude coordinates to index a geopoint field in OpenSearch:\n\n- As an object with ""lat"" and ""lon"" properties: { ""lat"": 40.71, ""lon"": 74.00 }\n- As a string with latitude and longitude separated by a comma: ""40.71,74.00""\n- As a geohash string: ""txhxegj0uyp3""\n- As an array with longitude first, then latitude: [74.00, 40.71]\n- As a Well-Known Text string: ""POINT (74.00 40.71)""\n- As a GeoJSON object: { ""type"": ""Point"", ""coordinates"": [74.00, 40.71] }","Context:\n1. The answer uses only information provided in the context, without introducing any external or fabricated details. It directly quotes examples from the context, so I will award 1 point for this criterion.\n\nCompleteness:\n2. The answer addresses all key elements of the question by providing multiple examples of how to use latitude and longitude coordinates to index a geopoint field. It covers the various formats mentioned in the context, so I will award 1 point for completeness.\n\nConciseness:\n3. The answer includes all the relevant examples from the context without any unnecessary redundancy. It uses a concise format to present the examples, so I will award 1 point for conciseness as well.",3,1.0,1.0,1.0,3.0
How do I use vector search?,"There is not enough information available to answer this question. The provided context describes what vector search is and how to create vector indexes, but does not provide instructions on how to use vector search.","Context:\n- The context provides an overview of vector search in OpenSearch, including how to create vector indexes, ingest vector data, and perform vector searches using different methods (approximate k-NN, script score k-NN, and Painless extensions).\n- It covers the different engines (nmslib, Faiss, Lucene) and algorithms (HNSW, IVF) supported for approximate k-NN search, as well as how to choose the right method based on requirements like query latency, query quality, memory limits, and indexing latency.\n- The context also explains how to generate vector embeddings within OpenSearch using the Neural Search plugin.\n\nCompleteness:\n- The context does not provide specific instructions on how to use vector search, but rather gives an overview of the different vector search capabilities in OpenSearch.\n- It does not cover the actual steps or API requests needed to perform a vector search query.\n\nConciseness:\n- The context provides a comprehensive overview of vector search in OpenSearch without unnecessary details or redundancy.\n\nBased on the evaluation criteria, I would award the following scores:\n\nContext: 1 point - The answer uses only information provided in the context.\nCompleteness: 0 points - The answer does not address how to use vector search, which is the key element of the question.\nConciseness: 1 point - The answer is concise and does not contain redundant information.\n\nTotal score: 2",2,0.0,1.0,1.0,2.0
How do I understand the memory requirements for using hnsw?,"The memory requirements for using HNSW with vector quantization depend on the quantization method used:\n\nFor Lucene scalar quantization, the memory required for the HNSW graph is estimated as 1.1 * (dimension + 8 * M) bytes/vector, where M is the maximum number of bidirectional links created for each element during graph construction.\n\nFor Faiss 16-bit scalar quantization, the memory required for HNSW is estimated as 1.1 * (2 * dimension + 8 * M) bytes/vector.\n\nFor Faiss product quantization, the memory required for HNSW is estimated as 1.1 * (((pq_code_size / 8) * pq_m + 24 + 8 * hnsw_m) * num_vectors + num_segments * (2^pq_code_size * 4 * d)) bytes, where pq_code_size is the number of bits used to encode each subvector, pq_m is the number of subvectors, hnsw_m is the maximum number of bidirectional links, num_vectors is the number of vectors, num_segments is the number of segments, and d is the vector dimension.","Context:\n1. The answer uses information directly from the provided context, without introducing any external or fabricated details. It provides the memory estimation formulas for HNSW and IVF with different quantization methods like Lucene scalar quantization, Faiss 16-bit scalar quantization, and Faiss product quantization. So I will award 1 point for this criterion.\n\nCompleteness:\n2. The answer addresses all key elements of the question by providing detailed memory estimation formulas and examples for different quantization methods used with HNSW and IVF. It covers the memory requirements for using HNSW with vector quantization comprehensively based on the given context. So I will award 1 point for this criterion.\n\nConciseness:\n3. The answer is quite detailed and comprehensive in covering the different quantization methods and their memory estimation formulas. 
While it could potentially be more concise, the level of detail provided is necessary to fully address the question based on the given context. Therefore, I will award 1 point for this criterion.",3,1.0,1.0,1.0,3.0
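
The table pairs each LLM-as-a-Judge `score` with a `human_eval_score`, so alignment can be summarized with two simple numbers: the share of rows where both scores match and the mean absolute difference. The sketch below uses plain Python lists; the function name `agreement_summary` is hypothetical, and the sample scores are copied from the ten rows displayed above.

```python
def agreement_summary(judge_scores, human_scores):
    """Return exact-agreement rate and mean absolute difference of two score lists."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / len(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
    }


# 'score' and 'human_eval_score' columns of the ten rows shown above
judge = [3, 2, 2, 3, 3, 3, 3, 3, 2, 3]
human = [2, 3, 2, 3, 2, 3, 3, 3, 2, 3]
print(agreement_summary(judge, human))
```

For these rows the judge and the human reviewer agree on 7 of 10 items and never differ by more than one point. On the full dataframe the inputs would be `manifest_df["score"].tolist()` and `manifest_df["human_eval_score"].tolist()`.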


## Human Eval2

In [32]:
# Human Eval2: compare ground truth answers against generated answers to ensure the RAG system is in line with human preferences

# Step 1: Load jsonl file, e.g. 4_answer_validation_opensearch_graded1.jsonl

human_eval_df = pd.read_json('../data/eval-datasets/4_answer_validation_opensearch_graded1.jsonl', lines=True)

# Step 2: Convert to the desired format
import re
ANSWER_PATTERN = r'<question_answer>(.*?)</question_answer>'

# Strip out the portion of the response with regex.
def extract_with_regex(response, regex):
    matches = re.search(regex, response, re.DOTALL)
    # Extract the matched content, if any
    return matches.group(1).strip() if matches else None

formatted_data = ""
for index, row in human_eval_df.iterrows():
    entry = {
        "source": row["query_text"],
        "responses": [extract_with_regex(row["groundtruth_answer"], ANSWER_PATTERN), row["generated_answer"]]
        
    }
    formatted_data += json.dumps(entry) + "\n"

# Step 3: Save the output as a new JSON Lines file (one JSON object per line)
with open('../data/eval-datasets/4_gt_input_manifest_textranking.json', 'w') as f:
    f.write(formatted_data)

print("Conversion complete. The file '4_gt_input_manifest_textranking.json' is ready.")

# Upload the manifest to S3
bucket = sagemaker_session.default_bucket()
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')

bucket_name = bucket
prefix = 'genai'
local_path = '../data/eval-datasets/4_gt_input_manifest_textranking.json'

# Upload the file to S3
manifest_s3_uri = sagemaker_session.upload_data(path=local_path, bucket=bucket, key_prefix=prefix)

print(f"File uploaded to {manifest_s3_uri}")

Conversion complete. The file '4_gt_input_manifest_textranking.json' is ready.
File uploaded to s3://sagemaker-us-east-1-026459568683/genai/4_gt_input_manifest_textranking.json
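
`extract_with_regex` returns `None` when a ground-truth answer lacks the `<question_answer>` tags, and `json.dumps` writes that as `null` into the `responses` list, which may leave the ranking task with an empty entry. One possible guard, sketched here under the assumption that falling back to the full text is acceptable, is a variant that never returns `None`; the name `extract_answer` is hypothetical.

```python
import re

ANSWER_PATTERN = r"<question_answer>(.*?)</question_answer>"


def extract_answer(text, pattern=ANSWER_PATTERN):
    """Return the tagged answer, or the stripped full text when the tags are absent."""
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()


# Illustrative inputs: one with the tags, one without
print(extract_answer("<question_answer>\nAn example answer.\n</question_answer>"))
print(extract_answer("An untagged answer. "))
```

Whether the fallback text or a skipped row is the better choice depends on how the ground-truth answers were generated.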


In [33]:
# Create Review Template
from IPython.display import display, HTML

def make_template(save_fname="instructions_tr.template"):
    template = """
    <html>
      <head>
        <meta charset="UTF-8" />
        <link rel="stylesheet" href="https://assets.crowd.aws/css/gen-ai-components.css" />
        <link rel="icon" href="data:image/svg+xml,<svg xmlns=%22http://www.w3.org/2000/svg%22 viewBox=%220 0 100 100%22><text y=%22.9em%22 font-size=%2290%22>&#129351;</text></svg>" />
        <title>Ranking Tool</title>
        <script src="https://assets.crowd.aws/gen-ai-components.js"></script>
      </head>

      <body>
        <div>
          <crowd-text-ranking
            crowd-form-element-id="crowd-form-submit"
            instructions='Rank the following responses from a RAG system according to the evaluation criteria.'
            ordinal-ranking-dimensions='[{"name":"Accuracy","allowTie":true},{"name":"Clarity","allowTie":true}]'
            text='{{ task.input.source }}'
            responses='{{ task.input.responses | to_json }}' />
        </div>
        <crowd-form id="crowd-form-submit" style="display: none"></crowd-form>
        <script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
      </body>
    </html>
    """
    with open(save_fname, "w") as f:
        f.write(template)

# Create the template file locally
local_path ='../data/eval-datasets/4_instructions_tr.template'
make_template(save_fname=local_path)

# Upload the file to S3
template_s3_uri = sagemaker_session.upload_data(path=local_path, bucket=bucket, key_prefix=prefix)

print(f"Template file uploaded to {template_s3_uri}")

Template file uploaded to s3://sagemaker-us-east-1-026459568683/genai/4_instructions_tr.template


In [34]:
import datetime

now = datetime.datetime.now()
timestamp_str = now.strftime("%Y%m%d-%H%M%S")
labeling_job_name = "rag-system-eval-" + timestamp_str

sagemaker_client.create_labeling_job(
    LabelingJobName=labeling_job_name,
    LabelAttributeName='label',
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': manifest_s3_uri #Enter S3 URI of Input Data Json
            }
        }
    },
    OutputConfig={
        'S3OutputPath': f's3://{bucket}/{prefix}/output/' #Enter S3 URI of Output folder
    },
    RoleArn=SAGEMAKER_ROLE_ARN, #Enter IAM Sagemaker Execution Role here,
    HumanTaskConfig={
        'WorkteamArn': WORKTEAM_ARN, #Enter Workteam ARN
        'UiConfig': {
            'UiTemplateS3Uri': template_s3_uri #Enter S3 URI of UI template
        },
        'PreHumanTaskLambdaArn': prehuman_arn,
        'TaskKeywords': [
            'QnA',
        ],
        'TaskTitle': labeling_job_name,
        'TaskDescription': "Rank the responses provided by the Models based on the Dimensions",
        'NumberOfHumanWorkersPerDataObject': 1,
        'TaskTimeLimitInSeconds': 60*30,
        'TaskAvailabilityLifetimeInSeconds': 60*60*24*10,
        'MaxConcurrentTaskCount': 100,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': acs_arn
        }
    })

{'LabelingJobArn': 'arn:aws:sagemaker:us-east-1:026459568683:labeling-job/rag-system-eval-20240828-140613',
 'ResponseMetadata': {'RequestId': 'b029fd7b-60b4-456f-90f7-82a1c608187e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b029fd7b-60b4-456f-90f7-82a1c608187e',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '106',
   'date': 'Wed, 28 Aug 2024 18:06:13 GMT'},
  'RetryAttempts': 0}}

In [35]:
workforce = sagemaker_client.describe_workforce(WorkforceName="default")
worker_portal_url = 'https://' + workforce["Workforce"]["SubDomain"]


# Display the URL and instructions
display(HTML(f"""
<body>
<h4>Gather human preference data</h4>
<p>Please complete the human evaluation tasks available in the labeling portal.</p>
<p><a href="{worker_portal_url}">{worker_portal_url}</a></p>
<p><b>Ensure all tasks are completed before proceeding to the next steps in this notebook.</b></p>
</body>
"""))

In [40]:
# monitor the job's progress
sagemaker_client.describe_labeling_job(LabelingJobName=labeling_job_name)
# sagemaker_client.stop_labeling_job(LabelingJobName=labeling_job_name)

{'LabelingJobStatus': 'Completed',
 'LabelCounters': {'TotalLabeled': 24,
  'HumanLabeled': 24,
  'MachineLabeled': 0,
  'FailedNonRetryableError': 0,
  'Unlabeled': 0},
 'CreationTime': datetime.datetime(2024, 8, 28, 14, 6, 14, 76000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2024, 8, 28, 14, 38, 53, 84000, tzinfo=tzlocal()),
 'JobReferenceCode': '026459568683UKqXoQllz4To3x2V',
 'LabelingJobName': 'rag-system-eval-20240828-140613',
 'LabelingJobArn': 'arn:aws:sagemaker:us-east-1:026459568683:labeling-job/rag-system-eval-20240828-140613',
 'LabelAttributeName': 'label',
 'InputConfig': {'DataSource': {'S3DataSource': {'ManifestS3Uri': 's3://sagemaker-us-east-1-026459568683/genai/4_gt_input_manifest_textranking.json'}}},
 'OutputConfig': {'S3OutputPath': 's3://sagemaker-us-east-1-026459568683/genai/output/',
  'KmsKeyId': ''},
 'RoleArn': 'arn:aws:iam::026459568683:role/demo-SagemakerNotebookIAMRole-026459568683',
 'StoppingConditions': {'MaxPercentageOfInputDatasetLabel
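
Rather than re-running the monitoring cell by hand, the status check can be wrapped in a small polling loop. This is a sketch, not part of the original notebook: `wait_for_labeling_job` is a hypothetical helper, it accepts any client that exposes `describe_labeling_job`, and it treats `Completed`, `Failed`, and `Stopped` as the terminal values of `LabelingJobStatus`.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_labeling_job(client, job_name, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll describe_labeling_job until the job reaches a terminal status."""
    for _ in range(max_polls):
        description = client.describe_labeling_job(LabelingJobName=job_name)
        status = description["LabelingJobStatus"]
        counters = description.get("LabelCounters", {})
        print(
            f"{job_name}: {status} "
            f"(labeled={counters.get('TotalLabeled', 0)}, unlabeled={counters.get('Unlabeled', 0)})"
        )
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")

# In this notebook the call would look like:
# wait_for_labeling_job(sagemaker_client, labeling_job_name)
```

The `sleep` parameter is injectable only so the loop can be exercised without waiting; with a private workforce, the job only completes once the workers have finished every task in the portal.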

In [46]:
import boto3
import json
import pandas as pd
from IPython.display import display, HTML

s3_client = boto3.client('s3')

def load_json_from_s3(bucket, key):
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    data = json.loads(obj['Body'].read().decode('utf-8'))
    return data

source_bucket_name = bucket
key = f'{prefix}/output/{labeling_job_name}/manifests/output/output.manifest'

# Download the manifest file
response = s3_client.get_object(Bucket=source_bucket_name, Key=key)
content = response['Body'].read().decode('utf-8')

# Split the content into lines
lines = content.strip().split('\n')

# Parse each line as JSON and create a list of dictionaries
data = [json.loads(line) for line in lines]
manifest_df = pd.DataFrame(data)

def highlight_response(original, ranked, rank, is_changed, has_same_rank):
    if is_changed:
        return f'<span style="background-color: yellow;">{ranked}</span>'
    elif has_same_rank:
        return f'<span style="background-color: #CCFFCC;">{ranked}</span>'
    return ranked

for index, row in manifest_df.iterrows():
    responses = row['responses']
    # print(f'responses: {responses}')
    label_metadata = row['label-metadata']
    key = label_metadata['worker-response-ref'].replace(f's3://{bucket}/','')
    data = load_json_from_s3(bucket, key)
    for answer in data["answers"]:
        for dimension in answer["answerContent"]["ordinalRankingDimensions"]:
            rankings = dimension["responseRankings"]
            
            # Store the original rankings in a new field
            dimension["responseIndex"] = rankings
            # print(f'rankings: {rankings}')
            
            # Create a list to store the response texts based on rankings
            ranked_responses = [responses[i - 1] for i, _ in sorted(enumerate(rankings, start=1), key=lambda x: x[1])]
            
            # Identify responses with the same ranking
            same_rank_indices = [i for i, rank in enumerate(rankings) if rankings.count(rank) > 1]
            
            # Compare ranked_responses with original responses and highlight differences
            highlighted_responses = []
            for i, (orig, ranked) in enumerate(zip(responses, ranked_responses)):
                is_changed = orig != ranked
                has_same_rank = i in same_rank_indices
                highlighted_responses.append(highlight_response(orig, ranked, rankings[i], is_changed, has_same_rank))
            
            dimension["responseRankings"] = highlighted_responses
            if dimension['name'] == 'Accuracy':
                manifest_df.at[index, 'Accuracy'] = ''.join(highlighted_responses)
            if dimension['name'] == 'Clarity':
                manifest_df.at[index, 'Clarity'] = ''.join(highlighted_responses)

manifest_df = manifest_df.drop(columns=['label','label-metadata'])

# Convert the dataframe to an HTML table with escaped HTML content
table_html = manifest_df.to_html(index=False, classes='table table-striped', escape=False)
display(HTML(table_html))

source,responses,Accuracy,Clarity
I'm using version 2.1 of open search and trying to use zstd compression. Why isn't it working?,"[There is not enough information available to answer this question. The context does not mention why zstd compression is not working for OpenSearch version 2.1. The context provides information about compression codecs available in OpenSearch 2.9 and later versions, but does not cover version 2.1., There is not enough information available to answer this question. The context does not mention why zstd compression is not working for your specific case with OpenSearch version 2.1. The context only provides information about the zstd compression codec being introduced in OpenSearch 2.9 and later versions.]","There is not enough information available to answer this question. The context does not mention why zstd compression is not working for OpenSearch version 2.1. The context provides information about compression codecs available in OpenSearch 2.9 and later versions, but does not cover version 2.1.There is not enough information available to answer this question. The context does not mention why zstd compression is not working for your specific case with OpenSearch version 2.1. The context only provides information about the zstd compression codec being introduced in OpenSearch 2.9 and later versions.","There is not enough information available to answer this question. The context does not mention why zstd compression is not working for OpenSearch version 2.1. The context provides information about compression codecs available in OpenSearch 2.9 and later versions, but does not cover version 2.1.There is not enough information available to answer this question. The context does not mention why zstd compression is not working for your specific case with OpenSearch version 2.1. The context only provides information about the zstd compression codec being introduced in OpenSearch 2.9 and later versions."
"I'm trying to set up this new aggregate view thing for saved objects in OpenSearch Dashboards, but I'm worried about messing up our existing multi-tenancy setup. The docs mention something about tenant indexes and a kibana_server role. How do I make sure I don't break anything when I turn this feature on? And what's the deal with not being able to turn it off once it's enabled","[The docs mention that the aggregate view feature for saved objects in OpenSearch Dashboards is an experimental feature that must be enabled using a feature flag in the opensearch_dashboards.yml file. Once enabled, it cannot be disabled without potentially causing issues like data loss or breaking tenant-to-tenant functionality. The docs also state that this feature should only be used in a new test cluster, not in an existing production cluster. There is no information provided about how to ensure the existing multi-tenancy setup is not broken when enabling this feature., The docs mention that the aggregate view feature for saved objects in OpenSearch Dashboards is an experimental feature that must be enabled using a feature flag in the opensearch_dashboards.yml file. Once enabled, it cannot be disabled without potentially causing issues like data loss or breaking tenant-to-tenant functionality. The docs also state that this feature should only be used in a new test cluster, not in an existing production cluster. As for the kibana_server role, it is required for OpenSearch Dashboards to perform necessary operations and has permissions defined for various indexes and actions. There is not enough information provided about how to ensure the existing multi-tenancy setup is not broken when enabling the aggregate view feature.]","The docs mention that the aggregate view feature for saved objects in OpenSearch Dashboards is an experimental feature that must be enabled using a feature flag in the opensearch_dashboards.yml file. 
Once enabled, it cannot be disabled without potentially causing issues like data loss or breaking tenant-to-tenant functionality. The docs also state that this feature should only be used in a new test cluster, not in an existing production cluster. There is no information provided about how to ensure the existing multi-tenancy setup is not broken when enabling this feature.The docs mention that the aggregate view feature for saved objects in OpenSearch Dashboards is an experimental feature that must be enabled using a feature flag in the opensearch_dashboards.yml file. Once enabled, it cannot be disabled without potentially causing issues like data loss or breaking tenant-to-tenant functionality. The docs also state that this feature should only be used in a new test cluster, not in an existing production cluster. As for the kibana_server role, it is required for OpenSearch Dashboards to perform necessary operations and has permissions defined for various indexes and actions. There is not enough information provided about how to ensure the existing multi-tenancy setup is not broken when enabling the aggregate view feature.","The docs mention that the aggregate view feature for saved objects in OpenSearch Dashboards is an experimental feature that must be enabled using a feature flag in the opensearch_dashboards.yml file. Once enabled, it cannot be disabled without potentially causing issues like data loss or breaking tenant-to-tenant functionality. The docs also state that this feature should only be used in a new test cluster, not in an existing production cluster. There is no information provided about how to ensure the existing multi-tenancy setup is not broken when enabling this feature.The docs mention that the aggregate view feature for saved objects in OpenSearch Dashboards is an experimental feature that must be enabled using a feature flag in the opensearch_dashboards.yml file. 
Once enabled, it cannot be disabled without potentially causing issues like data loss or breaking tenant-to-tenant functionality. The docs also state that this feature should only be used in a new test cluster, not in an existing production cluster. As for the kibana_server role, it is required for OpenSearch Dashboards to perform necessary operations and has permissions defined for various indexes and actions. There is not enough information provided about how to ensure the existing multi-tenancy setup is not broken when enabling the aggregate view feature."
What software license does Opensearch use?,"[There is not enough information available to answer this question, There is not enough information available to answer this question]",There is not enough information available to answer this questionThere is not enough information available to answer this question,There is not enough information available to answer this questionThere is not enough information available to answer this question
Does GPU accelerated nodes support Pytorch?,"[Yes, GPU accelerated nodes support PyTorch 1.12.1 according to the report., Yes, GPU accelerated nodes support PyTorch 1.12.1 according to the report.]","Yes, GPU accelerated nodes support PyTorch 1.12.1 according to the report.Yes, GPU accelerated nodes support PyTorch 1.12.1 according to the report.","Yes, GPU accelerated nodes support PyTorch 1.12.1 according to the report.Yes, GPU accelerated nodes support PyTorch 1.12.1 according to the report."
Does opensearch support hugging face models? If so which ones?,"[Yes, OpenSearch supports Hugging Face models for text embedding and cross-encoding tasks. The supported Hugging Face models are listed under the ""Sentence transformers"" and ""Cross-encoder models"" sections in the provided context., Yes, OpenSearch supports Hugging Face models for text embedding and cross-encoding tasks. The supported Hugging Face models are listed under the ""Sentence transformers"" and ""Cross-encoder models"" sections in the provided context.]","Yes, OpenSearch supports Hugging Face models for text embedding and cross-encoding tasks. The supported Hugging Face models are listed under the ""Sentence transformers"" and ""Cross-encoder models"" sections in the provided context.Yes, OpenSearch supports Hugging Face models for text embedding and cross-encoding tasks. The supported Hugging Face models are listed under the ""Sentence transformers"" and ""Cross-encoder models"" sections in the provided context.","Yes, OpenSearch supports Hugging Face models for text embedding and cross-encoding tasks. The supported Hugging Face models are listed under the ""Sentence transformers"" and ""Cross-encoder models"" sections in the provided context.Yes, OpenSearch supports Hugging Face models for text embedding and cross-encoding tasks. The supported Hugging Face models are listed under the ""Sentence transformers"" and ""Cross-encoder models"" sections in the provided context."
"I have a custom model, can I run it in Opensearch?","[Yes, you can run your custom model in OpenSearch. The documentation provides instructions on how to register and deploy a custom model in OpenSearch. You need to provide the model file URL, model configuration details like model type and embedding dimension, and other metadata to register the model. Once registered, you can deploy the model and use it for inference tasks like text embedding or neural search., Yes, you can run your custom model in OpenSearch. The documentation provides instructions on how to register and deploy a custom model in OpenSearch. You need to provide the model file URL, model configuration details like model type and embedding dimension, and other metadata to register the model. Once registered, you can deploy the model and use it for inference tasks like text embedding or neural search.]","Yes, you can run your custom model in OpenSearch. The documentation provides instructions on how to register and deploy a custom model in OpenSearch. You need to provide the model file URL, model configuration details like model type and embedding dimension, and other metadata to register the model. Once registered, you can deploy the model and use it for inference tasks like text embedding or neural search.Yes, you can run your custom model in OpenSearch. The documentation provides instructions on how to register and deploy a custom model in OpenSearch. You need to provide the model file URL, model configuration details like model type and embedding dimension, and other metadata to register the model. Once registered, you can deploy the model and use it for inference tasks like text embedding or neural search.","Yes, you can run your custom model in OpenSearch. The documentation provides instructions on how to register and deploy a custom model in OpenSearch. 
You need to provide the model file URL, model configuration details like model type and embedding dimension, and other metadata to register the model. Once registered, you can deploy the model and use it for inference tasks like text embedding or neural search.Yes, you can run your custom model in OpenSearch. The documentation provides instructions on how to register and deploy a custom model in OpenSearch. You need to provide the model file URL, model configuration details like model type and embedding dimension, and other metadata to register the model. Once registered, you can deploy the model and use it for inference tasks like text embedding or neural search."
"I have a model and some ML nodes, how do I boost it's performance?","[To boost the performance of your ML model, you can take advantage of GPU acceleration on your ML node. The report mentions that OpenSearch supports GPU instances from NVIDIA with CUDA 11.6 and AWS Inferentia. It provides instructions for preparing NVIDIA and AWS Inferentia ML nodes to enable GPU acceleration., To boost the performance of your ML model, you can take advantage of GPU acceleration on your ML node. The report mentions that OpenSearch supports GPU instances from NVIDIA with CUDA 11.6 and AWS Inferentia. It provides instructions for preparing NVIDIA and AWS Inferentia ML nodes to enable GPU acceleration.]","To boost the performance of your ML model, you can take advantage of GPU acceleration on your ML node. The report mentions that OpenSearch supports GPU instances from NVIDIA with CUDA 11.6 and AWS Inferentia. It provides instructions for preparing NVIDIA and AWS Inferentia ML nodes to enable GPU acceleration.To boost the performance of your ML model, you can take advantage of GPU acceleration on your ML node. The report mentions that OpenSearch supports GPU instances from NVIDIA with CUDA 11.6 and AWS Inferentia. It provides instructions for preparing NVIDIA and AWS Inferentia ML nodes to enable GPU acceleration.","To boost the performance of your ML model, you can take advantage of GPU acceleration on your ML node. The report mentions that OpenSearch supports GPU instances from NVIDIA with CUDA 11.6 and AWS Inferentia. It provides instructions for preparing NVIDIA and AWS Inferentia ML nodes to enable GPU acceleration.To boost the performance of your ML model, you can take advantage of GPU acceleration on your ML node. The report mentions that OpenSearch supports GPU instances from NVIDIA with CUDA 11.6 and AWS Inferentia. It provides instructions for preparing NVIDIA and AWS Inferentia ML nodes to enable GPU acceleration."
Can you show me an example of how to use lat/long coordinates?,"[The report provides several examples of how to use latitude and longitude coordinates to index a geopoint field:\n\n- An object with ""lat"" and ""lon"" properties: { ""lat"": 40.71, ""lon"": 74.00 }\n- A string with latitude and longitude separated by a comma: ""40.71,74.00""\n- A geohash string: ""txhxegj0uyp3""\n- An array with longitude first, then latitude: [74.00, 40.71]\n- A Well-Known Text string: ""POINT (74.00 40.71)""\n- A GeoJSON object with ""type"": ""Point"" and ""coordinates"" array: { ""type"": ""Point"", ""coordinates"": [74.00, 40.71] }, The report provides several examples of how to use latitude and longitude coordinates to index a geopoint field in OpenSearch:\n\n- As an object with ""lat"" and ""lon"" properties: { ""lat"": 40.71, ""lon"": 74.00 }\n- As a string with latitude and longitude separated by a comma: ""40.71,74.00""\n- As a geohash string: ""txhxegj0uyp3""\n- As an array with longitude first, then latitude: [74.00, 40.71]\n- As a Well-Known Text string: ""POINT (74.00 40.71)""\n- As a GeoJSON object: { ""type"": ""Point"", ""coordinates"": [74.00, 40.71] }]","The report provides several examples of how to use latitude and longitude coordinates to index a geopoint field:\n\n- An object with ""lat"" and ""lon"" properties: { ""lat"": 40.71, ""lon"": 74.00 }\n- A string with latitude and longitude separated by a comma: ""40.71,74.00""\n- A geohash string: ""txhxegj0uyp3""\n- An array with longitude first, then latitude: [74.00, 40.71]\n- A Well-Known Text string: ""POINT (74.00 40.71)""\n- A GeoJSON object with ""type"": ""Point"" and ""coordinates"" array: { ""type"": ""Point"", ""coordinates"": [74.00, 40.71] }The report provides several examples of how to use latitude and longitude coordinates to index a geopoint field in OpenSearch:\n\n- As an object with ""lat"" and ""lon"" properties: { ""lat"": 40.71, ""lon"": 74.00 }\n- As a string with latitude and longitude 
separated by a comma: ""40.71,74.00""\n- As a geohash string: ""txhxegj0uyp3""\n- As an array with longitude first, then latitude: [74.00, 40.71]\n- As a Well-Known Text string: ""POINT (74.00 40.71)""\n- As a GeoJSON object: { ""type"": ""Point"", ""coordinates"": [74.00, 40.71] }","The report provides several examples of how to use latitude and longitude coordinates to index a geopoint field:\n\n- An object with ""lat"" and ""lon"" properties: { ""lat"": 40.71, ""lon"": 74.00 }\n- A string with latitude and longitude separated by a comma: ""40.71,74.00""\n- A geohash string: ""txhxegj0uyp3""\n- An array with longitude first, then latitude: [74.00, 40.71]\n- A Well-Known Text string: ""POINT (74.00 40.71)""\n- A GeoJSON object with ""type"": ""Point"" and ""coordinates"" array: { ""type"": ""Point"", ""coordinates"": [74.00, 40.71] }The report provides several examples of how to use latitude and longitude coordinates to index a geopoint field in OpenSearch:\n\n- As an object with ""lat"" and ""lon"" properties: { ""lat"": 40.71, ""lon"": 74.00 }\n- As a string with latitude and longitude separated by a comma: ""40.71,74.00""\n- As a geohash string: ""txhxegj0uyp3""\n- As an array with longitude first, then latitude: [74.00, 40.71]\n- As a Well-Known Text string: ""POINT (74.00 40.71)""\n- As a GeoJSON object: { ""type"": ""Point"", ""coordinates"": [74.00, 40.71] }"
How do I use vector search?,"[There is not enough information available to answer this question. The provided context describes what vector search is and how to create vector indexes, but does not provide instructions on how to use vector search., There is not enough information available to answer this question. The provided context describes what vector search is and how to create vector indexes, but does not provide instructions on how to use vector search.]","There is not enough information available to answer this question. The provided context describes what vector search is and how to create vector indexes, but does not provide instructions on how to use vector search.There is not enough information available to answer this question. The provided context describes what vector search is and how to create vector indexes, but does not provide instructions on how to use vector search.","There is not enough information available to answer this question. The provided context describes what vector search is and how to create vector indexes, but does not provide instructions on how to use vector search.There is not enough information available to answer this question. The provided context describes what vector search is and how to create vector indexes, but does not provide instructions on how to use vector search."
How do I understand the memory requirements for using hnsw?,"[The memory requirements for using HNSW with vector quantization depend on the quantization method used:\n\nFor Lucene scalar quantization, the memory required for the HNSW graph is estimated as 1.1 * (dimension + 8 * M) bytes/vector, where M is the maximum number of bidirectional links created for each element during graph construction.\n\nFor Faiss 16-bit scalar quantization, the memory required for HNSW is estimated as 1.1 * (2 * dimension + 8 * M) bytes/vector.\n\nFor Faiss product quantization, the memory required for HNSW is estimated as 1.1 * (((pq_code_size / 8) * pq_m + 24 + 8 * hnsw_m) * num_vectors + num_segments * (2^pq_code_size * 4 * d)) bytes, where pq_code_size is the number of bits used to encode each subvector, pq_m is the number of subvectors, hnsw_m is the maximum number of bidirectional links, num_vectors is the number of vectors, num_segments is the number of segments, and d is the vector dimension., The memory requirements for using HNSW with vector quantization depend on the quantization method used:\n\nFor Lucene scalar quantization, the memory required for the HNSW graph is estimated as 1.1 * (dimension + 8 * M) bytes/vector, where M is the maximum number of bidirectional links created for each element during graph construction.\n\nFor Faiss 16-bit scalar quantization, the memory required for HNSW is estimated as 1.1 * (2 * dimension + 8 * M) bytes/vector.\n\nFor Faiss product quantization, the memory required for HNSW is estimated as 1.1 * (((pq_code_size / 8) * pq_m + 24 + 8 * hnsw_m) * num_vectors + num_segments * (2^pq_code_size * 4 * d)) bytes, where pq_code_size is the number of bits used to encode each subvector, pq_m is the number of subvectors, hnsw_m is the maximum number of bidirectional links, num_vectors is the number of vectors, num_segments is the number of segments, and d is the vector dimension.]","The memory requirements for using HNSW with vector quantization 
depend on the quantization method used:\n\nFor Lucene scalar quantization, the memory required for the HNSW graph is estimated as 1.1 * (dimension + 8 * M) bytes/vector, where M is the maximum number of bidirectional links created for each element during graph construction.\n\nFor Faiss 16-bit scalar quantization, the memory required for HNSW is estimated as 1.1 * (2 * dimension + 8 * M) bytes/vector.\n\nFor Faiss product quantization, the memory required for HNSW is estimated as 1.1 * (((pq_code_size / 8) * pq_m + 24 + 8 * hnsw_m) * num_vectors + num_segments * (2^pq_code_size * 4 * d)) bytes, where pq_code_size is the number of bits used to encode each subvector, pq_m is the number of subvectors, hnsw_m is the maximum number of bidirectional links, num_vectors is the number of vectors, num_segments is the number of segments, and d is the vector dimension.The memory requirements for using HNSW with vector quantization depend on the quantization method used:\n\nFor Lucene scalar quantization, the memory required for the HNSW graph is estimated as 1.1 * (dimension + 8 * M) bytes/vector, where M is the maximum number of bidirectional links created for each element during graph construction.\n\nFor Faiss 16-bit scalar quantization, the memory required for HNSW is estimated as 1.1 * (2 * dimension + 8 * M) bytes/vector.\n\nFor Faiss product quantization, the memory required for HNSW is estimated as 1.1 * (((pq_code_size / 8) * pq_m + 24 + 8 * hnsw_m) * num_vectors + num_segments * (2^pq_code_size * 4 * d)) bytes, where pq_code_size is the number of bits used to encode each subvector, pq_m is the number of subvectors, hnsw_m is the maximum number of bidirectional links, num_vectors is the number of vectors, num_segments is the number of segments, and d is the vector dimension.","The memory requirements for using HNSW with vector quantization depend on the quantization method used:\n\nFor Lucene scalar quantization, the memory required for the HNSW graph is estimated as 
1.1 * (dimension + 8 * M) bytes/vector, where M is the maximum number of bidirectional links created for each element during graph construction.\n\nFor Faiss 16-bit scalar quantization, the memory required for HNSW is estimated as 1.1 * (2 * dimension + 8 * M) bytes/vector.\n\nFor Faiss product quantization, the memory required for HNSW is estimated as 1.1 * (((pq_code_size / 8) * pq_m + 24 + 8 * hnsw_m) * num_vectors + num_segments * (2^pq_code_size * 4 * d)) bytes, where pq_code_size is the number of bits used to encode each subvector, pq_m is the number of subvectors, hnsw_m is the maximum number of bidirectional links, num_vectors is the number of vectors, num_segments is the number of segments, and d is the vector dimension.The memory requirements for using HNSW with vector quantization depend on the quantization method used:\n\nFor Lucene scalar quantization, the memory required for the HNSW graph is estimated as 1.1 * (dimension + 8 * M) bytes/vector, where M is the maximum number of bidirectional links created for each element during graph construction.\n\nFor Faiss 16-bit scalar quantization, the memory required for HNSW is estimated as 1.1 * (2 * dimension + 8 * M) bytes/vector.\n\nFor Faiss product quantization, the memory required for HNSW is estimated as 1.1 * (((pq_code_size / 8) * pq_m + 24 + 8 * hnsw_m) * num_vectors + num_segments * (2^pq_code_size * 4 * d)) bytes, where pq_code_size is the number of bits used to encode each subvector, pq_m is the number of subvectors, hnsw_m is the maximum number of bidirectional links, num_vectors is the number of vectors, num_segments is the number of segments, and d is the vector dimension."
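
The table above is useful for reading individual rows, but a per-dimension tally makes it easier to compare the ground truth with the RAG system response across the whole dataset. The sketch below assumes that `responseRankings` holds one rank per response in input order (ground truth first, generated answer second, as in the manifest) and that rank 1 is the best rank; `tally_preferences` is a hypothetical helper and the example rankings are illustrative, not results from the labeling job.

```python
from collections import Counter

def tally_preferences(worker_answers, dimension_names=("Accuracy", "Clarity")):
    """Count, per dimension, which of the two responses the worker ranked higher.

    worker_answers: iterable of {dimension name: [rank_of_ground_truth, rank_of_generated]}
    """
    tallies = {name: Counter() for name in dimension_names}
    for rankings_by_dimension in worker_answers:
        for name in dimension_names:
            gt_rank, generated_rank = rankings_by_dimension[name]
            if gt_rank == generated_rank:
                tallies[name]["tie"] += 1
            elif generated_rank < gt_rank:  # assumption: rank 1 is the best rank
                tallies[name]["generated_preferred"] += 1
            else:
                tallies[name]["ground_truth_preferred"] += 1
    return tallies

# Illustrative rankings only, not results from the labeling job above.
example = [
    {"Accuracy": [1, 1], "Clarity": [1, 2]},
    {"Accuracy": [2, 1], "Clarity": [1, 1]},
    {"Accuracy": [1, 2], "Clarity": [1, 1]},
]
for name, counts in tally_preferences(example).items():
    print(name, dict(counts))
```

To feed it from the worker responses, collect `{d["name"]: d["responseRankings"]}` for each answer before the display cell overwrites `responseRankings` with the highlighted text (the raw ranks are also kept in `responseIndex`).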


# Takeaways
* The first human evaluation checks that the LLM-as-a-Judge, which is the more scalable grader, is aligned with human preferences.
* The second human evaluation checks how the generated responses compare with the ground truth according to human preference.
* An important consideration is how you want to improve the system. Being able to quickly find poorly performing prompts and inputs is key to improving it. In a feedback loop, it’s common to take poorly performing examples, correct them, and use them as dynamic few-shot examples in future prompts.
* If you’re going down the model training path, it’s important to collect data points in pairs. RLHF-style training, whether that’s PPO or DPO, relies on pairwise data when going through data cycles. If you output diverse pairs of answers, it saves data annotators time, because they only rate their preference between pairs instead of hand-writing the answer they prefer.
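
As a rough illustration of the pairwise point, the rankings collected in this notebook can be converted into preference pairs. This is a sketch under stated assumptions: `to_preference_pairs` is a hypothetical helper, the `prompt` / `chosen` / `rejected` field names are a common convention rather than something this notebook defines, rank 1 is assumed to be the best rank, and tied responses are skipped because they carry no preference.

```python
from itertools import combinations

def to_preference_pairs(prompt, responses, rankings):
    """Turn one ranked task into chosen/rejected records, skipping ties."""
    pairs = []
    for (i, rank_i), (j, rank_j) in combinations(enumerate(rankings), 2):
        if rank_i == rank_j:
            continue  # a tie expresses no preference between the two responses
        chosen, rejected = (i, j) if rank_i < rank_j else (j, i)
        pairs.append({
            "prompt": prompt,
            "chosen": responses[chosen],
            "rejected": responses[rejected],
        })
    return pairs

# Placeholder task: the second response is ranked first, the other two are tied.
pairs = to_preference_pairs(
    prompt="example question",
    responses=["answer A", "answer B", "answer C"],
    rankings=[2, 1, 2],
)
for pair in pairs:
    print(pair)
```

With only two responses per task, as in the manifest above, each task yields at most one pair, and every tie yields none, which is one reason diverse answer pairs make better use of annotator time.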
