# Monitoring Large Language Models (LLMs) with WhyLabs LangKit

LangKit is an open-source text metrics toolkit for monitoring language models. It offers an array of methods for extracting relevant signals from the input and/or output text, which are compatible with the open-source data logging library whylogs.

In this example we'll show how to generate out-of-the-box metrics for monitoring LLMs using LangKit and visualize them in the WhyLabs Observability Platform.

LangKit can extract relevant signals from unstructured text data, such as:

- Text Quality
- Text Relevance
- Security and Privacy
- Sentiment and Toxicity

### Install Dependencies

In [1]:
# Install the fmeval package
!pip install -U datasets==2.21.0
!pip install -U jsonlines==4.0.0
!pip install -U fmeval==1.2.0
!pip install -U py7zr==0.22.0
%pip install 'langkit[all]'

Note: you may need to restart the kernel to use updated packages.


## 👋 Hello, World! Take a quick look at LangKit metrics

In the below code we log a few example prompt/response pairs and send metrics to WhyLabs.


In [2]:
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why

# Note: llm_metrics.init() downloads models so this is slow first time.
schema = llm_metrics.init()

[nltk_data] Downloading package vader_lexicon to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
2024-09-17 16:29:31.037667: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
from langkit.whylogs.samples import load_chats, show_first_chat

# Let's look at what's in this toy example:
chats = load_chats()
print(f"There are {len(chats)} records in this toy example data, here's the first one:")
show_first_chat(chats)

results = why.log(chats, name="langkit-sample-chats-all", schema=schema)

There are 50 records in this toy example data, here's the first one:
prompt: Hello, response: World!



  return self.fget.__get__(instance, owner)()


⚠️ No session found. Call whylogs.init() to initialize a session and authenticate. See https://docs.whylabs.ai/docs/whylabs-whylogs-init for more information.


In [40]:
profview = results.view()
profview.to_pandas()

Unnamed: 0_level_0,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,distribution/min,distribution/n,distribution/q_01,distribution/q_05,distribution/q_10,distribution/q_25,distribution/q_75,distribution/q_90,distribution/q_95,distribution/q_99,distribution/stddev,type,types/boolean,types/fractional,types/integral,types/object,types/string,types/tensor,ints/max,ints/min,frequent_items/frequent_strings
column,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1
prompt,47.000005,47.0,47.002352,0,50,0,0,,0.0,,,0,,,,,,,,,0.0,SummaryType.COLUMN,0,0,0,0,50,0,,,
prompt.aggregate_reading_level,14.0,14.0,14.000699,0,50,0,0,20.0,7.3,8.0,0.0,50,0.0,0.0,0.0,5.0,10.0,12.0,13.0,20.0,4.210046,SummaryType.COLUMN,0,50,0,0,0,0,,,
prompt.automated_readability_index,38.000003,38.0,38.001901,0,50,0,0,31.2,7.998,7.4,-3.0,50,-3.0,2.1,3.2,5.2,10.4,13.1,15.8,31.2,5.198449,SummaryType.COLUMN,0,50,0,0,0,0,,,
prompt.character_count,40.000004,40.0,40.002001,0,50,0,0,282.0,81.28,59.0,6.0,50,6.0,12.0,33.0,46.0,103.0,167.0,218.0,282.0,60.255253,SummaryType.COLUMN,0,0,50,0,0,0,282.0,6.0,
prompt.difficult_words,11.0,11.0,11.000549,0,50,0,0,24.0,4.12,4.0,0.0,50,0.0,0.0,1.0,3.0,5.0,6.0,10.0,24.0,3.60068,SummaryType.COLUMN,0,0,50,0,0,0,24.0,0.0,
prompt.flesch_reading_ease,43.000004,43.0,43.002151,0,50,0,0,118.18,59.1308,65.73,-44.26,50,-44.26,15.64,30.23,43.39,76.22,85.08,91.61,118.18,27.706869,SummaryType.COLUMN,0,50,0,0,0,0,,,
prompt.has_patterns,2.0,2.0,2.0001,0,50,0,46,,0.0,,,0,,,,,,,,,0.0,SummaryType.COLUMN,0,0,0,0,4,0,,,"[FrequentItem(value='SSN', est=2, upper=2, low..."
prompt.jailbreak_similarity,46.000005,46.0,46.002302,0,50,0,0,1.0,0.307416,0.277097,0.107413,50,0.107413,0.121107,0.148814,0.211474,0.311718,0.452144,1.0,1.0,0.195855,SummaryType.COLUMN,0,50,0,0,0,0,,,
prompt.letter_count,37.000003,37.0,37.001851,0,50,0,0,274.0,78.66,58.0,5.0,50,5.0,10.0,29.0,45.0,101.0,164.0,213.0,274.0,59.047407,SummaryType.COLUMN,0,0,50,0,0,0,274.0,5.0,
prompt.lexicon_count,29.000002,29.0,29.00145,0,50,0,0,59.0,16.74,12.0,1.0,50,1.0,2.0,5.0,8.0,24.0,36.0,39.0,59.0,12.804352,SummaryType.COLUMN,0,0,50,0,0,0,59.0,1.0,


# Text Quality
Text quality metrics, such as readability, complexity and grade level, can provide important insights into the quality and appropriateness of generated responses. By monitoring these metrics, we can ensure that the Language Model (LLM) outputs are clear, concise, and suitable for the intended audience.

Assessing text complexity and grade level assists in tailoring the generated content to the target audience. By considering factors such as sentence structure, vocabulary choice, and domain-specific requirements, we can ensure that the LLM produces responses that align with the intended reading level and professional context. Additionally, incorporating metrics such as syllable count, word count, and character count allows us to closely monitor the length and composition of the generated text. By setting appropriate limits and guidelines, we can ensure that the responses remain concise, focused, and easily digestible for users.

In langkit, we can compute text quality metrics through the textstat module, which uses the textstat library to compute several different text quality metrics.

## flesch_reading_ease
This method returns the Flesch Reading Ease score of the input text. The score is based on sentence length and word length. Higher scores indicate material that is easier to read; lower numbers mark passages that are more complex.

## automated_readability_index

This method returns the Automated Readability Index (ARI) of the input text. ARI is a readability test for English texts that estimates the years of schooling a person needs to understand the text.

## aggregate_reading_level
This method returns the aggregate reading level of the input text as calculated by the textstat library, and includes the metrics above denotes with *



# Text relevance

Text relevance plays a crucial role in the monitoring of Language Models (LLMs) by providing an objective measure of the similarity between different texts. It serves multiple use cases, including assessing the quality and appropriateness of LLM outputs and providing guardrails to ensure the generation of safe and desired responses.

One use case is computing similarity scores between embeddings generated from prompts and responses, enabling the evaluation of the relevance between them. This helps identify potential issues such as irrelevant or off-topic responses, ensuring that LLM outputs align closely with the intended context. In langkit, we can compute similarity scores between prompt and response pairs using the input_output module.

Another use case is calculating the similarity of prompts and responses against certain topics or known examples, such as jailbreaks or controversial subjects. By comparing the embeddings to these predefined themes, we can establish guardrails to detect potential dangerous or unwanted responses. The similarity scores serve as signals, alerting us to content that may require closer scrutiny or mitigation. In langkit, this can be done through the themes module.

By leveraging text relevance as a monitoring metric for LLMs, we can not only evaluate the quality of generated responses but also establish guardrails to minimize the risk of generating inappropriate or harmful content. This approach enhances the performance, safety, and reliability of LLMs in various applications, providing a valuable tool for responsible AI development.

## response.relevance_to_prompt

The response.relevance_to_prompt computed column will contain a similarity score between the prompt and response. The higher the score, the more relevant the response is to the prompt.

The similarity score is computed by calculating the cosine similarity between embeddings generated from both prompt and response. The embeddings are generated using the hugginface's model sentence-transformers/all-MiniLM-L6-v2.

## response.refusal_similarity	

This group gathers a set of known LLM refusal examples.

# Security and Privacy
Monitoring for security and privacy in Language Model (LLM) applications helps ensuring the protection of user data and preventing malicious activities. Several approaches can be employed to strengthen the security and privacy measures within LLM systems.

One approach is to measure text similarity between prompts and responses against known examples of jailbreak attempts, prompt injections, and LLM refusals of service. By comparing the embeddings generated from the text, potential security vulnerabilities and unauthorized access attempts can be identified. This helps in mitigating risks and contributes to the LLM operation within secure boundaries. In langkit, text similarity calculation between prompts/responses and known examples of jailbreak attempts, prompt injections, and LLM refusals of service can be done through the themes module.

Having a prompt injection classifier in place further enhances the security of LLM applications. By detecting and preventing prompt injection attacks, where malicious code or unintended instructions are injected into the prompt, the system can maintain its integrity and protect against unauthorized actions or data leaks. In langkit, prompt injection detection metrics can be computed through the injections module and proactive_injection_detection module.

LLMs are known for their ability to generate non-factual or nonsensical statements, more commonly known as “hallucinations.” This characteristic can undermine trust in many scenarios where factuality is required, such as summarization tasks, generative question answering, and dialogue generations. In langkit, hallucination detection metrics can be computed through the hallucination module.

Another important aspect of security and privacy monitoring involves checking prompts and responses against regex patterns designed to detect sensitive information. These patterns can help identify and flag data such as credit card numbers, telephone numbers, or other types of personally identifiable information (PII). In langkit, regex pattern matching against pattern groups can be done through the regexes module.

## prompt.jailbreak_similarity
This group gathers a set of known jailbreak examples.

## prompt.injection
The prompt.injection column will return the maximum similarity score between the target and a group of known jailbreak attempts and harmful behaviors, which is stored as a vector db using the FAISS package. The higher the score, the more similar it is to a known jailbreak attempt or harmful behavior.

This metric is similar to the jailbreak_similarity from themes module. The difference is that the injection module will compute similarity against a much larger set of examples, but the used encoder and set of examples are not customizable.



## has_patterns
Each value in the string column will be searched by the regexes patterns in pattern_groups.json. If any pattern within a certain group matches, the name of the group will be returned while generating the has_patterns submetric. For instance, if any pattern in the mailing_adress is a match, the value mailing_address will be returned.

The regexes are applied in the order defined in pattern_groups.json. If a value matches multiple patterns, the first pattern that matches will be returned, so the order of the groups in pattern_groups.json is important.

# Sentiment Analysis
The use of sentiment analysis for monitoring Language Model (LLM) applications can provide valuable insights into the appropriateness and user engagement of generated responses. By employing sentiment and toxicity classifiers, we can assess the sentiment and detect potentially harmful or inappropriate content within LLM outputs.

Monitoring sentiment allows us to gauge the overall tone and emotional impact of the responses. By analyzing sentiment scores, we can ensure that the LLM is consistently generating appropriate and contextually relevant responses. For instance, in customer service applications, maintaining a positive sentiment ensures a satisfactory user experience.

Additionally, toxicity analysis provides an important measure of the presence of offensive, disrespectful, or harmful language in LLM outputs. By monitoring toxicity scores, we can identify potentially inappropriate content and take necessary actions to mitigate any negative impact.

Analyzing sentiment and toxicity scores in LLM applications also serves other motivations. It enables us to identify potential biases or controversial opinions present in the responses, helping to address concerns related to fairness, inclusivity, and ethical considerations.

## sentiment_nltk
The sentiment_nltk will contain metrics related to the compound sentiment score calculated for each value in the string column. The sentiment score is calculated using nltk's Vader sentiment analyzer. The score ranges from -1 to 1, where -1 is the most negative sentiment and 1 is the most positive sentiment.

## toxicity

The toxicity will contain metrics related to the toxicity score calculated for each value in the string column. By default, the toxicity score is calculated using HuggingFace's martin-ha/toxic-comment-model toxicity analyzer. The score ranges from 0 to 1, where 0 is no toxicity and 1 is maximum toxicity.



# Use LangKit to monitor Meta-Llama-3.1-8B-Instruct model

Here you will use the HuggingFace datasets package to load the Samsum dataset. The dataset is pre-split into training and test data, so you can simply take that split using the API.

In [7]:
from datasets import load_dataset

test_dataset  = load_dataset("Samsung/samsum", split="test")

len(test_dataset)

819

You can see the test dataset has 819 items in it, and they can be accessed via index. The items include the transcription of the earnings call and a short summary of that dialogue.

In [8]:
test_dataset[204]

{'id': '13730545',
 'dialogue': 'Sam: hi, i need a help\r\nSarah fashion: hello how can i help?\r\nSam: Actually i was looking for a nice black dress for my wife, i mean i dont want the in-store product..\r\nSarah fashion: Yes sir, we make dresses on order as per customer requirements.\r\nSam: yeah i saw that option on the web page, actually its a surprise gift for her, but i have no idea what should be the requirements of the dress.\r\nSarah fashion: oh in that case why dont you choose something ready made sir\r\nSam: Actually i want something different for her something she has not seen before\r\nSarah fashion: that nice, do you have any sketch in your mind it would be easier to help \r\nSam: yes that it should be a dress, black in color decent and elegant, and.... thats it :(\r\nSarah fashion: :) dont worry Sir we will try to help you as much as we can but you have to choose between the choices we give you\r\nSam: Sure.\r\nSarah fashion: Would you mind coming to the store? or you wa

Create the client objects for calling SageMaker APIs, and supply the names of the SageMaker endpoints you created for the base and fine-tuned versions of the model. If you did not deploy both models, you can simply set them to the same endpoint name.

In [9]:
import sagemaker
import boto3


sess = sagemaker.Session()
boto_session = boto3.session.Session()
region = boto_session.region_name

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


## Validate endpoint functionality

### Reference your base and fine-tuned endpoints

# ***
# NOTE: PROVIDE YOUR UNIQUE ENDPOINTS HERE OR YOU WILL GET ERRORS
# ***

__If you will be evaluating a model with swappable LoRA adapters, you can use the same endpoint name for both base and tuned with varying adapter references in your inference payload.__

Omitting the adapter will result in the base model being used without any adapter, and specifying an adapter array with it's name will use that adapter for inference.

In [10]:
#ENTER YOUR ENDPOINTS HERE
base_endpoint_name = "meta-llama31-8b-instruct-endpoint"
#tuned_endpoint_name = "llama31-8b-lmidist-2024-09-17-02-37-55-645"

### As a quick test, you will take a base prompt and sample from the dataset to verify that the endpoints provided will work for the upcoming test runs. 

You can also use this as a subjective comparison of the 3 models.

In [11]:
import json

prompt = f"""
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant who is an expert in summarizing conversations.
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
Summarize the provided conversation in 2 sentences.

{test_dataset[0]['dialogue']}

Provide the summary directly, without any introduction or preamble. Do not start the response with "Here is a...".<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""

base_payload = {"inputs": prompt,"parameters": {"do_sample": True,"top_p": 0.9,"temperature": 0.8,"max_new_tokens": 256,},}


base_predictor = sagemaker.Predictor(
    endpoint_name = base_endpoint_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)

base_predictor_response = base_predictor.predict(base_payload)

print(f"Base Model:\n{base_predictor_response['generated_text']}")
print("\n ================ \n")



Base Model:
Hannah asked Amanda for Betty's phone number, but Amanda couldn't find it, suggesting she ask Larry who had recently contacted Betty. Amanda encouraged Hannah to text Larry, despite Hannah's initial hesitation.




## ML Monitoring for LLMs in WhyLabs


To send LangKit profiles to WhyLabs we will need three pieces of information:

- API token
- Organization ID
- Dataset ID (or model-id)

Go to [https://whylabs.ai/free](https://whylabs.ai/free) and grab a free account. You can follow along with the quick start examples or skip them if you'd like to follow this example immediately.

1. Create a new project and note its ID (if it's a model project, it will look like `model-xxxx`)
2. Create an API token from the "Access Tokens" tab
3. Copy your org ID from the same "Access Tokens" tab

Replace the placeholder string values with your WhyLabs API Keys below:

![](./1_access-token.png)

In [12]:
import pandas as pd
import os
import whylogs as why

#os.environ["WHYLABS_DEFAULT_ORG_ID"] = "xxx" # ORG-ID is case sensitive
#os.environ["WHYLABS_API_KEY"] = "xxx" # API-KEY
#os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "xxx" # MODEL-ID


os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-segrKk" # ORG-ID is case sensitive
os.environ["WHYLABS_API_KEY"] = "pXeBXBtIm8.vLXWBjiVlvL3orjBRtJsQ6JJ91lD5jDTKgJreRAFclAXrCLHFSxq2:org-segrKk"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-2"

In [13]:
from whylogs.api.writer.whylabs import WhyLabsWriter
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why


# Set to show all columns in dataframe
pd.set_option("display.max_columns", None)

# Note: llm_metrics.init() downloads models so this is slow first time.
schema = llm_metrics.init()
why.init()

Initializing session with config /home/sagemaker-user/.config/whylogs/config.ini

✅ Using session type: WHYLABS
 ⤷ org id: org-segrKk
 ⤷ api key: pXeBXBtIm8
 ⤷ default dataset: model-2

In production, you should pass the api key as an environment variable WHYLABS_API_KEY, the org id as WHYLABS_DEFAULT_ORG_ID, and the default dataset id as WHYLABS_DEFAULT_DATASET_ID.


<whylogs.api.whylabs.session.session.ApiKeySession at 0x7f05e63d4eb0>

### Create & Inspect  Language Metrics with LangKit

LangKit provides a toolkit of metrics for LLM applications, lets initialize them and create a profile of the data that can be viewed in WhyLabs for quick analysis.

In [14]:
base_prompt = f"""
    <|begin_of_text|><|start_header_id|>user<|end_header_id|>
    {{question}}
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    
"""

In [15]:
def predict_fn(prompt, question):
    
    response = base_predictor.predict({
	"inputs": prompt,
    "parameters": {
        "n_predict": -1,
        "temperature": 0.2,
        "top_p": 0.9,
        "stop": ["<|start_header_id|>", "<|eot_id|>", "<|start_header_id|>user<|end_header_id|>", "assistant"]
    }
    })

    #print(response)

    # parse response text
    resp = response['generated_text']
    
    rhi = resp.rfind('<|end_header_id|>')
    rhi = rhi + 1
    
    resp_mod = resp[rhi:]

    # remove \t\n
    resp_mod = resp_mod.replace('\n', '')
    resp_mod = resp_mod.replace('\t', '')

    question_mod = question.replace('\n', '')
    question_mod = question_mod.replace('\t', '')
    question_mod = question_mod.replace('\r', '')


    prompt_and_response = {
      "prompt": question_mod,
      "response": resp_mod
    }
    
    return prompt_and_response

In [16]:
question="What is the context window of Anthropic Claude 2.1 model?"
prompt = base_prompt.format(question=question)

prompt_and_response = predict_fn(prompt,question)
print(prompt_and_response)

{'prompt': 'What is the context window of Anthropic Claude 2.1 model?', 'response': 'The context window of the Anthropic Claude 2.1 model refers to the amount of text that the model can consider when generating a response. The'}


In [17]:
prompt_template = f"""
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant who is an expert in summarizing conversations.
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
Summarize the provided conversation in 2 sentences.

{{question}}

Provide the summary directly, without any introduction or preamble. Do not start the response with "Here is a...".<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""


In [18]:
question=test_dataset[0]['dialogue']
prompt = prompt_template.format(question=question)

prompt_and_response = predict_fn(prompt,question)
print(prompt_and_response)

{'prompt': "Hannah: Hey, do you have Betty's number?Amanda: Lemme checkHannah: <file_gif>Amanda: Sorry, can't find it.Amanda: Ask LarryAmanda: He called her last time we were at the park togetherHannah: I don't know him wellHannah: <file_gif>Amanda: Don't be shy, he's very niceHannah: If you say so..Hannah: I'd rather you texted himAmanda: Just text him 🙂Hannah: Urgh.. AlrightHannah: ByeAmanda: Bye bye", 'response': "Hannah asked Amanda for Betty's phone number, but Amanda couldn't find it, suggesting she ask Larry who had recently spoken to her. Hannah was"}


In [19]:
profile = why.log(prompt_and_response, schema=schema).profile()


✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000


In [20]:
profview = profile.view()
profview.to_pandas()

Unnamed: 0_level_0,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,distribution/min,distribution/n,distribution/q_01,distribution/q_05,distribution/q_10,distribution/q_25,distribution/q_75,distribution/q_90,distribution/q_95,distribution/q_99,distribution/stddev,type,types/boolean,types/fractional,types/integral,types/object,types/string,types/tensor,ints/max,ints/min,frequent_items/frequent_strings
column,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1
prompt,1.0,1.0,1.00005,0,1,0,0,,0.0,,,0.0,,,,,,,,,0.0,SummaryType.COLUMN,0,0,0,0,1,0,,,
prompt.aggregate_reading_level,1.0,1.0,1.00005,0,1,0,0,12.0,12.0,12.0,12.0,1.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,0.0,SummaryType.COLUMN,0,1,0,0,0,0,,,
prompt.automated_readability_index,1.0,1.0,1.00005,0,1,0,0,12.1,12.1,12.1,12.1,1.0,12.1,12.1,12.1,12.1,12.1,12.1,12.1,12.1,0.0,SummaryType.COLUMN,0,1,0,0,0,0,,,
prompt.character_count,1.0,1.0,1.00005,0,1,0,0,337.0,337.0,337.0,337.0,1.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0,0.0,SummaryType.COLUMN,0,0,1,0,0,0,337.0,337.0,
prompt.difficult_words,1.0,1.0,1.00005,0,1,0,0,13.0,13.0,13.0,13.0,1.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,0.0,SummaryType.COLUMN,0,0,1,0,0,0,13.0,13.0,
prompt.flesch_reading_ease,1.0,1.0,1.00005,0,1,0,0,42.98,42.98,42.98,42.98,1.0,42.98,42.98,42.98,42.98,42.98,42.98,42.98,42.98,0.0,SummaryType.COLUMN,0,1,0,0,0,0,,,
prompt.has_patterns,,,,0,1,0,1,,,,,,,,,,,,,,,SummaryType.COLUMN,0,0,0,0,0,0,,,[]
prompt.jailbreak_similarity,1.0,1.0,1.00005,0,1,0,0,0.276274,0.276274,0.276274,0.276274,1.0,0.276274,0.276274,0.276274,0.276274,0.276274,0.276274,0.276274,0.276274,0.0,SummaryType.COLUMN,0,1,0,0,0,0,,,
prompt.letter_count,1.0,1.0,1.00005,0,1,0,0,302.0,302.0,302.0,302.0,1.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,0.0,SummaryType.COLUMN,0,0,1,0,0,0,302.0,302.0,
prompt.lexicon_count,1.0,1.0,1.00005,0,1,0,0,57.0,57.0,57.0,57.0,1.0,57.0,57.0,57.0,57.0,57.0,57.0,57.0,57.0,0.0,SummaryType.COLUMN,0,0,1,0,0,0,57.0,57.0,


## Batch Monitor LLM prompt and response

In [21]:
profile = why.log(prompt_and_response, schema=schema).profile()
profview = profile.view()
profview.to_pandas()


✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000


Unnamed: 0_level_0,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,distribution/min,distribution/n,distribution/q_01,distribution/q_05,distribution/q_10,distribution/q_25,distribution/q_75,distribution/q_90,distribution/q_95,distribution/q_99,distribution/stddev,type,types/boolean,types/fractional,types/integral,types/object,types/string,types/tensor,ints/max,ints/min,frequent_items/frequent_strings
column,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1
prompt,1.0,1.0,1.00005,0,1,0,0,,0.0,,,0.0,,,,,,,,,0.0,SummaryType.COLUMN,0,0,0,0,1,0,,,
prompt.aggregate_reading_level,1.0,1.0,1.00005,0,1,0,0,12.0,12.0,12.0,12.0,1.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,0.0,SummaryType.COLUMN,0,1,0,0,0,0,,,
prompt.automated_readability_index,1.0,1.0,1.00005,0,1,0,0,12.1,12.1,12.1,12.1,1.0,12.1,12.1,12.1,12.1,12.1,12.1,12.1,12.1,0.0,SummaryType.COLUMN,0,1,0,0,0,0,,,
prompt.character_count,1.0,1.0,1.00005,0,1,0,0,337.0,337.0,337.0,337.0,1.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0,337.0,0.0,SummaryType.COLUMN,0,0,1,0,0,0,337.0,337.0,
prompt.difficult_words,1.0,1.0,1.00005,0,1,0,0,13.0,13.0,13.0,13.0,1.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,13.0,0.0,SummaryType.COLUMN,0,0,1,0,0,0,13.0,13.0,
prompt.flesch_reading_ease,1.0,1.0,1.00005,0,1,0,0,42.98,42.98,42.98,42.98,1.0,42.98,42.98,42.98,42.98,42.98,42.98,42.98,42.98,0.0,SummaryType.COLUMN,0,1,0,0,0,0,,,
prompt.has_patterns,,,,0,1,0,1,,,,,,,,,,,,,,,SummaryType.COLUMN,0,0,0,0,0,0,,,[]
prompt.jailbreak_similarity,1.0,1.0,1.00005,0,1,0,0,0.276274,0.276274,0.276274,0.276274,1.0,0.276274,0.276274,0.276274,0.276274,0.276274,0.276274,0.276274,0.276274,0.0,SummaryType.COLUMN,0,1,0,0,0,0,,,
prompt.letter_count,1.0,1.0,1.00005,0,1,0,0,302.0,302.0,302.0,302.0,1.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,0.0,SummaryType.COLUMN,0,0,1,0,0,0,302.0,302.0,
prompt.lexicon_count,1.0,1.0,1.00005,0,1,0,0,57.0,57.0,57.0,57.0,1.0,57.0,57.0,57.0,57.0,57.0,57.0,57.0,57.0,0.0,SummaryType.COLUMN,0,0,1,0,0,0,57.0,57.0,


## Batch Monitor LLM prompt and response

In [22]:
x = 10
pr_list = []
for i in range(0,x):
    question=test_dataset[i]['dialogue']
    prompt = prompt_template.format(question=question)
    prompt_and_response = predict_fn(prompt,question)
    pr_list.append(prompt_and_response)

df_pr = pd.DataFrame(pr_list)         

In [23]:
profile = why.log(df_pr, name="sam-sum-list", schema=schema).profile()
profview = profile.view()
profview.to_pandas()


✅ Aggregated 10 rows into profile sam-sum-list

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=ref-TtE8HjSUcfugdmFY


Unnamed: 0_level_0,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,distribution/min,distribution/n,distribution/q_01,distribution/q_05,distribution/q_10,distribution/q_25,distribution/q_75,distribution/q_90,distribution/q_95,distribution/q_99,distribution/stddev,type,types/boolean,types/fractional,types/integral,types/object,types/string,types/tensor,ints/max,ints/min,frequent_items/frequent_strings
column,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1
prompt,10.0,10.0,10.0005,0,10,0,0,,0.0,,,0.0,,,,,,,,,0.0,SummaryType.COLUMN,0,0,0,0,10,0,,,
prompt.aggregate_reading_level,6.0,6.0,6.0003,0,10,0,0,12.0,6.4,8.0,0.0,10.0,0.0,0.0,4.0,5.0,8.0,12.0,12.0,12.0,3.204164,SummaryType.COLUMN,0,10,0,0,0,0,,,
prompt.automated_readability_index,10.0,10.0,10.0005,0,10,0,0,18.1,8.45,7.6,3.6,10.0,3.6,3.6,4.5,4.9,11.5,18.1,18.1,18.1,4.481629,SummaryType.COLUMN,0,10,0,0,0,0,,,
prompt.character_count,10.0,10.0,10.0005,0,10,0,0,1250.0,555.3,383.0,329.0,10.0,329.0,329.0,337.0,347.0,842.0,1250.0,1250.0,1250.0,319.057309,SummaryType.COLUMN,0,0,10,0,0,0,1250.0,329.0,
prompt.difficult_words,9.0,9.0,9.00045,0,10,0,0,33.0,13.1,13.0,5.0,10.0,5.0,5.0,5.0,6.0,15.0,33.0,33.0,33.0,8.184674,SummaryType.COLUMN,0,0,10,0,0,0,33.0,5.0,
prompt.flesch_reading_ease,10.0,10.0,10.0005,0,10,0,0,88.43,68.492,66.84,42.98,10.0,42.98,42.98,51.35,63.46,79.56,88.43,88.43,88.43,14.224877,SummaryType.COLUMN,0,10,0,0,0,0,,,
prompt.has_patterns,,,,0,10,0,10,,,,,,,,,,,,,,,SummaryType.COLUMN,0,0,0,0,0,0,,,[]
prompt.jailbreak_similarity,10.0,10.0,10.0005,0,10,0,0,0.293999,0.241603,0.276274,0.16688,10.0,0.16688,0.16688,0.183089,0.195369,0.2855,0.293999,0.293999,0.293999,0.049072,SummaryType.COLUMN,0,10,0,0,0,0,,,
prompt.letter_count,10.0,10.0,10.0005,0,10,0,0,1185.0,507.4,350.0,302.0,10.0,302.0,302.0,310.0,317.0,747.0,1185.0,1185.0,1185.0,299.966368,SummaryType.COLUMN,0,0,10,0,0,0,1185.0,302.0,
prompt.lexicon_count,9.0,9.0,9.00045,0,10,0,0,273.0,113.5,79.0,57.0,10.0,57.0,57.0,66.0,73.0,158.0,273.0,273.0,273.0,69.437822,SummaryType.COLUMN,0,0,10,0,0,0,273.0,57.0,


### Back Filling with WhyLabs

Write seven day prompt list

In [24]:
prompt_lists = [
    ["How can I create a new account?", "Great job to the team", "Fantastic product, had a good experience"],
    ["This product made me angry, can I return it? Give a phone number to call 800-987-6543", "You dumb and smell bad", "I hated the experience, and I was over charged"],
    ["This seems amazing, could you share the pricing?", "Incredible site, could we setup a call?", "Hello! Can you kindly guide me through the documentation?"],
    ["This looks impressive, could you provide some information on the cost?", "Stunning platform, can we arrange a chat?", "Hello there! Could you assist me with the documentation?"],
    ["This looks remarkable, could you tell me the price range?", "Fantastic webpage, is it possible to organize a call?", "Greetings! Can you help me with the relevant documents?"],
    ["This is great, Ilove it, could you inform me about the charges?", "love the interface, can we have a teleconference?", "Hello! Can I take a look at the user manuals?"],
    ["This seems fantastic, how much does it cost?", "Excellent website, can we setup a call?", "Hello! Could you help me find the resource documents?"]
]


In [25]:
import datetime
telemetry_agent = WhyLabsWriter()
all_prompts_and_responses = []  # This list will store all the prompts and responses.


for i, day in enumerate(prompt_lists):
  # walking backwards. Each dataset has to map to a date to show up as a different batch in WhyLabs
  dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)
  for question in day:
    prompt = prompt_template.format(question=question)
    prompt_and_response = predict_fn(prompt,question)
    profile = why.log(prompt_and_response, schema=schema)

     # Save the prompt and its response in the list.
    all_prompts_and_responses.append(prompt_and_response)

    # set the dataset timestamp for the profile
    profile.set_dataset_timestamp(dt)
    telemetry_agent.write(profile.view())


✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000

✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000

✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000

✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000

✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000

✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000

✅ Aggregated 1 

In [26]:
all_prompts_and_responses

[{'prompt': 'How can I create a new account?',
  'response': 'To create a new account, you will need to provide the required information and follow the registration process. The specific steps may vary depending on the platform or'},
 {'prompt': 'Great job to the team',
  'response': 'The team received praise for their work. A positive acknowledgement was given to them.'},
 {'prompt': 'Fantastic product, had a good experience',
  'response': 'The customer had a positive experience with the product, describing it as fantastic. They expressed satisfaction with their interaction.'},
 {'prompt': 'This product made me angry, can I return it? Give a phone number to call 800-987-6543',
  'response': 'The customer is unhappy with a product and wants to return it, and they are asking for a phone number to call to initiate the return process. The'},
 {'prompt': 'You dumb and smell bad',
  'response': "You made a personal insult and a comment about someone's hygiene."},
 {'prompt': 'I hated the e

In [27]:
test_dataset_subset = test_dataset['dialogue'][0:10]

In [35]:
import datetime
telemetry_agent = WhyLabsWriter()
all_prompts_and_responses = []  # This list will store all the prompts and responses.

for i, question in enumerate(test_dataset_subset):
  # walking backwards. Each dataset has to map to a date to show up as a different batch in WhyLabs
  dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)
  
  prompt = prompt_template.format(question=question)
  prompt_and_response = predict_fn(prompt,question)
  profile = why.log(prompt_and_response, schema=schema)

 # Save the prompt and its response in the list.
  all_prompts_and_responses.append(prompt_and_response)

# set the dataset timestamp for the profile
  profile.set_dataset_timestamp(dt)  
  telemetry_agent.write(profile.view())


✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000

✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000

✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000

✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000

✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000

✅ Aggregated 1 rows into profile 

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-2/profiles?profile=1726531200000

✅ Aggregated 1 

![](./1b_summary.png)

![](./1c_profile.png)

![](./1d_insights.png)

![](./2_security.png)

![](./3_performance.png)

![](./4_performance.png)

![](./5_performance.png)