<div style="background-color: #FFDDDD; border-left: 5px solid red; padding: 10px; color: black;">
    <strong>Kernel:</strong> Python 3 (ipykernel)
</div>

# Monitoring Large Language Models (LLMs) with WhyLabs LangKit

LangKit is an open-source text metrics toolkit for monitoring language models. It offers an array of methods for extracting relevant signals from the input and/or output text, which are compatible with the open-source data logging library whylogs.

In this example we'll show how to generate out-of-the-box metrics for monitoring LLMs using LangKit and visualize them in the WhyLabs Observability Platform.

- [Text Quality](https://github.com/whylabs/langkit/blob/main/langkit/docs/features/quality.md)
- [Text Relevance](https://github.com/whylabs/langkit/blob/main/langkit/docs/features/relevance.md)
- [Security and Privacy](https://github.com/whylabs/langkit/blob/main/langkit/docs/features/security.md)
- [Sentiment and Toxicity](https://github.com/whylabs/langkit/blob/main/langkit/docs/features/sentiment.md)


![](https://github.com/whylabs/langkit/blob/main/static/img/LangKit_graphic.png?raw=true)

### Install Dependencies

In [None]:
# Install the fmeval package
%pip install -U datasets==2.21.0
%pip install -U jsonlines==4.0.0
%pip install -U fmeval==1.2.0
%pip install -U py7zr==0.22.0
%pip install 'langkit[all]'

In [None]:
%%bash
#temp fix for a bug in SM Studio with ffspec-2023 not being properly updated
export SITE_PACKAGES_FOLDER=$(python3 -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
rm -rf $SITE_PACKAGES_FOLDER/fsspec-2023*

echo "ffspec-2023 bug fix run successfully"

## 👋 Hello, World! Take a quick look at LangKit metrics

In the below code we log a few example prompt/response pairs and send metrics to WhyLabs.


In [None]:
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why

# Note: llm_metrics.init() downloads models so this is slow first time.
schema = llm_metrics.init()

In [None]:
from langkit.whylogs.samples import load_chats, show_first_chat

# Let's look at what's in this toy example:
chats = load_chats()
print(f"There are {len(chats)} records in this toy example data, here's the first one:")
show_first_chat(chats)

results = why.log(chats, name="langkit-sample-chats-all", schema=schema)

In [None]:
profview = results.view()
profview.to_pandas()

# Text Quality
Text quality metrics, such as readability, complexity and grade level, can provide important insights into the quality and appropriateness of generated responses. By monitoring these metrics, we can ensure that the Language Model (LLM) outputs are clear, concise, and suitable for the intended audience.

Assessing text complexity and grade level assists in tailoring the generated content to the target audience. By considering factors such as sentence structure, vocabulary choice, and domain-specific requirements, we can ensure that the LLM produces responses that align with the intended reading level and professional context. Additionally, incorporating metrics such as syllable count, word count, and character count allows us to closely monitor the length and composition of the generated text. By setting appropriate limits and guidelines, we can ensure that the responses remain concise, focused, and easily digestible for users.

In langkit, we can compute text quality metrics through the textstat module, which uses the textstat library to compute several different text quality metrics.

## flesch_reading_ease
This method returns the Flesch Reading Ease score of the input text. The score is based on sentence length and word length. Higher scores indicate material that is easier to read; lower numbers mark passages that are more complex.

## automated_readability_index

This method returns the Automated Readability Index (ARI) of the input text. ARI is a readability test for English texts that estimates the years of schooling a person needs to understand the text.

## aggregate_reading_level
This method returns the aggregate reading level of the input text as calculated by the textstat library, and includes the metrics above denotes with *



# Text relevance

Text relevance plays a crucial role in the monitoring of Language Models (LLMs) by providing an objective measure of the similarity between different texts. It serves multiple use cases, including assessing the quality and appropriateness of LLM outputs and providing guardrails to ensure the generation of safe and desired responses.

One use case is computing similarity scores between embeddings generated from prompts and responses, enabling the evaluation of the relevance between them. This helps identify potential issues such as irrelevant or off-topic responses, ensuring that LLM outputs align closely with the intended context. In langkit, we can compute similarity scores between prompt and response pairs using the input_output module.

Another use case is calculating the similarity of prompts and responses against certain topics or known examples, such as jailbreaks or controversial subjects. By comparing the embeddings to these predefined themes, we can establish guardrails to detect potential dangerous or unwanted responses. The similarity scores serve as signals, alerting us to content that may require closer scrutiny or mitigation. In langkit, this can be done through the themes module.

By leveraging text relevance as a monitoring metric for LLMs, we can not only evaluate the quality of generated responses but also establish guardrails to minimize the risk of generating inappropriate or harmful content. This approach enhances the performance, safety, and reliability of LLMs in various applications, providing a valuable tool for responsible AI development.

## response.relevance_to_prompt

The response.relevance_to_prompt computed column will contain a similarity score between the prompt and response. The higher the score, the more relevant the response is to the prompt.

The similarity score is computed by calculating the cosine similarity between embeddings generated from both prompt and response. The embeddings are generated using the hugginface's model sentence-transformers/all-MiniLM-L6-v2.

## response.refusal_similarity	

This group gathers a set of known LLM refusal examples.

# Security and Privacy
Monitoring for security and privacy in Language Model (LLM) applications helps ensuring the protection of user data and preventing malicious activities. Several approaches can be employed to strengthen the security and privacy measures within LLM systems.

One approach is to measure text similarity between prompts and responses against known examples of jailbreak attempts, prompt injections, and LLM refusals of service. By comparing the embeddings generated from the text, potential security vulnerabilities and unauthorized access attempts can be identified. This helps in mitigating risks and contributes to the LLM operation within secure boundaries. In langkit, text similarity calculation between prompts/responses and known examples of jailbreak attempts, prompt injections, and LLM refusals of service can be done through the themes module.

Having a prompt injection classifier in place further enhances the security of LLM applications. By detecting and preventing prompt injection attacks, where malicious code or unintended instructions are injected into the prompt, the system can maintain its integrity and protect against unauthorized actions or data leaks. In langkit, prompt injection detection metrics can be computed through the injections module and proactive_injection_detection module.

LLMs are known for their ability to generate non-factual or nonsensical statements, more commonly known as “hallucinations.” This characteristic can undermine trust in many scenarios where factuality is required, such as summarization tasks, generative question answering, and dialogue generations. In langkit, hallucination detection metrics can be computed through the hallucination module.

Another important aspect of security and privacy monitoring involves checking prompts and responses against regex patterns designed to detect sensitive information. These patterns can help identify and flag data such as credit card numbers, telephone numbers, or other types of personally identifiable information (PII). In langkit, regex pattern matching against pattern groups can be done through the regexes module.

## prompt.jailbreak_similarity
This group gathers a set of known jailbreak examples.

## prompt.injection
The prompt.injection column will return the maximum similarity score between the target and a group of known jailbreak attempts and harmful behaviors, which is stored as a vector db using the FAISS package. The higher the score, the more similar it is to a known jailbreak attempt or harmful behavior.

This metric is similar to the jailbreak_similarity from themes module. The difference is that the injection module will compute similarity against a much larger set of examples, but the used encoder and set of examples are not customizable.



## has_patterns
Each value in the string column will be searched by the regexes patterns in pattern_groups.json. If any pattern within a certain group matches, the name of the group will be returned while generating the has_patterns submetric. For instance, if any pattern in the mailing_adress is a match, the value mailing_address will be returned.

The regexes are applied in the order defined in pattern_groups.json. If a value matches multiple patterns, the first pattern that matches will be returned, so the order of the groups in pattern_groups.json is important.

# Sentiment Analysis
The use of sentiment analysis for monitoring Language Model (LLM) applications can provide valuable insights into the appropriateness and user engagement of generated responses. By employing sentiment and toxicity classifiers, we can assess the sentiment and detect potentially harmful or inappropriate content within LLM outputs.

Monitoring sentiment allows us to gauge the overall tone and emotional impact of the responses. By analyzing sentiment scores, we can ensure that the LLM is consistently generating appropriate and contextually relevant responses. For instance, in customer service applications, maintaining a positive sentiment ensures a satisfactory user experience.

Additionally, toxicity analysis provides an important measure of the presence of offensive, disrespectful, or harmful language in LLM outputs. By monitoring toxicity scores, we can identify potentially inappropriate content and take necessary actions to mitigate any negative impact.

Analyzing sentiment and toxicity scores in LLM applications also serves other motivations. It enables us to identify potential biases or controversial opinions present in the responses, helping to address concerns related to fairness, inclusivity, and ethical considerations.

## sentiment_nltk
The sentiment_nltk will contain metrics related to the compound sentiment score calculated for each value in the string column. The sentiment score is calculated using nltk's Vader sentiment analyzer. The score ranges from -1 to 1, where -1 is the most negative sentiment and 1 is the most positive sentiment.

## toxicity

The toxicity will contain metrics related to the toxicity score calculated for each value in the string column. By default, the toxicity score is calculated using HuggingFace's martin-ha/toxic-comment-model toxicity analyzer. The score ranges from 0 to 1, where 0 is no toxicity and 1 is maximum toxicity.



# Use LangKit to monitor Meta-Llama-3.1-8B-Instruct model

Here you will use the HuggingFace datasets package to load the Samsum dataset. The dataset is pre-split into training and test data, so you can simply take that split using the API.

In [None]:
from datasets import load_dataset

test_dataset  = load_dataset("Samsung/samsum", split="test")

len(test_dataset)

You can see the test dataset has 819 items in it, and they can be accessed via index. The items include the transcription of the earnings call and a short summary of that dialogue.

In [None]:
test_dataset[204]

Create the client objects for calling SageMaker APIs, and supply the names of the SageMaker endpoints you created for the base and fine-tuned versions of the model. If you did not deploy both models, you can simply set them to the same endpoint name.

In [None]:
import sagemaker
import boto3


sess = sagemaker.Session()
boto_session = boto3.session.Session()
region = boto_session.region_name

## Enter your endpoints here

In [None]:
base_endpoint_name = "meta-llama31-8b-instruct-endpoint"  #replace with yours

### As a quick test, you will take a base prompt and sample from the dataset to verify that the endpoints provided will work for the upcoming test runs. 


In [None]:
import json

prompt = f"""
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant who is an expert in summarizing conversations.
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
Summarize the provided conversation in 2 sentences.

{test_dataset[0]['dialogue']}

Provide the summary directly, without any introduction or preamble. Do not start the response with "Here is a...".<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""

base_payload = {"inputs": prompt,"parameters": {"do_sample": True,"top_p": 0.9,"temperature": 0.8,"max_new_tokens": 256,},}


base_predictor = sagemaker.Predictor(
    endpoint_name = base_endpoint_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)

base_predictor_response = base_predictor.predict(base_payload)

print(f"Base Model:\n{base_predictor_response['generated_text']}")
print("\n ================ \n")



## ML Monitoring for LLMs in WhyLabs


To send LangKit profiles to WhyLabs we will need three pieces of information:

- API token
- Organization ID
- Dataset ID (or model-id)

Go to [https://whylabs.ai/free](https://whylabs.ai/free) and grab a free account. You can follow along with the quick start examples or skip them if you'd like to follow this example immediately.

1. Create a new project and note its ID (if it's a model project, it will look like `model-xxxx`)
2. Create an API token from the "Access Tokens" tab
3. Copy your org ID from the same "Access Tokens" tab

Replace the placeholder string values with your WhyLabs API Keys below:

![](./1_access-token.png)

In [None]:
import pandas as pd
import os
import whylogs as why

#os.environ["WHYLABS_DEFAULT_ORG_ID"] = "xxx" # ORG-ID is case sensitive
#os.environ["WHYLABS_API_KEY"] = "xxx" # API-KEY
#os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "xxx" # MODEL-ID

os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-segrKk" # ORG-ID is case sensitive
os.environ["WHYLABS_API_KEY"] = "pXeBXBtIm8.vLXWBjiVlvL3orjBRtJsQ6JJ91lD5jDTKgJreRAFclAXrCLHFSxq2:org-segrKk"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-2"



In [None]:
from whylogs.api.writer.whylabs import WhyLabsWriter
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why


# Set to show all columns in dataframe
pd.set_option("display.max_columns", None)

# Note: llm_metrics.init() downloads models so this is slow first time.
schema = llm_metrics.init()
why.init()

### Create & Inspect  Language Metrics with LangKit

LangKit provides a toolkit of metrics for LLM applications, lets initialize them and create a profile of the data that can be viewed in WhyLabs for quick analysis.

In [None]:
base_prompt = f"""
    <|begin_of_text|><|start_header_id|>user<|end_header_id|>
    {{question}}
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    
"""

In [None]:
def predict_fn(prompt, question):
    
    response = base_predictor.predict({
	"inputs": prompt,
    "parameters": {
        "n_predict": -1,
        "temperature": 0.2,
        "top_p": 0.9,
        "stop": ["<|start_header_id|>", "<|eot_id|>", "<|start_header_id|>user<|end_header_id|>", "assistant"]
    }
    })

    #print(response)

    # parse response text
    resp = response['generated_text']
    
    rhi = resp.rfind('<|end_header_id|>')
    rhi = rhi + 1
    
    resp_mod = resp[rhi:]

    # remove \t\n
    resp_mod = resp_mod.replace('\n', '')
    resp_mod = resp_mod.replace('\t', '')

    question_mod = question.replace('\n', '')
    question_mod = question_mod.replace('\t', '')
    question_mod = question_mod.replace('\r', '')


    prompt_and_response = {
      "prompt": question_mod,
      "response": resp_mod
    }
    
    return prompt_and_response

In [None]:
question="What is the context window of Anthropic Claude 2.1 model?"
prompt = base_prompt.format(question=question)

prompt_and_response = predict_fn(prompt,question)
print(prompt_and_response)

In [None]:
prompt_template = f"""
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant who is an expert in summarizing conversations.
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
Summarize the provided conversation in 2 sentences.

{{question}}

Provide the summary directly, without any introduction or preamble. Do not start the response with "Here is a...".<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""


In [None]:
question=test_dataset[0]['dialogue']
prompt = prompt_template.format(question=question)

prompt_and_response = predict_fn(prompt,question)
print(prompt_and_response)

In [None]:
profile = why.log(prompt_and_response, schema=schema).profile()

In [None]:
profview = profile.view()
profview.to_pandas()

## Batch Monitor LLM prompt and response

In [None]:
profile = why.log(prompt_and_response, schema=schema).profile()
profview = profile.view()
profview.to_pandas()

## Batch Monitor LLM prompt and response

In [None]:
x = 10
pr_list = []
for i in range(0,x):
    question=test_dataset[i]['dialogue']
    prompt = prompt_template.format(question=question)
    prompt_and_response = predict_fn(prompt,question)
    pr_list.append(prompt_and_response)

df_pr = pd.DataFrame(pr_list)         

In [None]:
profile = why.log(df_pr, name="sam-sum-list", schema=schema).profile()
profview = profile.view()
profview.to_pandas()

### Back Filling with WhyLabs

Write seven day prompt list

In [None]:
prompt_lists = [
    ["How can I create a new account?", "Great job to the team", "Fantastic product, had a good experience"],
    ["This product made me angry, can I return it? Give a phone number to call 800-987-6543", "You dumb and smell bad", "I hated the experience, and I was over charged"],
    ["This seems amazing, could you share the pricing?", "Incredible site, could we setup a call?", "Hello! Can you kindly guide me through the documentation?"],
    ["This looks impressive, could you provide some information on the cost?", "Stunning platform, can we arrange a chat?", "Hello there! Could you assist me with the documentation?"],
    ["This looks remarkable, could you tell me the price range?", "Fantastic webpage, is it possible to organize a call?", "Greetings! Can you help me with the relevant documents?"],
    ["This is great, Ilove it, could you inform me about the charges?", "love the interface, can we have a teleconference?", "Hello! Can I take a look at the user manuals?"],
    ["This seems fantastic, how much does it cost?", "Excellent website, can we setup a call?", "Hello! Could you help me find the resource documents?"]
]


In [None]:
import datetime
telemetry_agent = WhyLabsWriter()
all_prompts_and_responses = []  # This list will store all the prompts and responses.


for i, day in enumerate(prompt_lists):
  # walking backwards. Each dataset has to map to a date to show up as a different batch in WhyLabs
  dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)
  for question in day:
    prompt = prompt_template.format(question=question)
    prompt_and_response = predict_fn(prompt,question)
    profile = why.log(prompt_and_response, schema=schema)

     # Save the prompt and its response in the list.
    all_prompts_and_responses.append(prompt_and_response)

    # set the dataset timestamp for the profile
    profile.set_dataset_timestamp(dt)
    telemetry_agent.write(profile.view())

In [None]:
all_prompts_and_responses

In [None]:
test_dataset_subset = test_dataset['dialogue'][0:10]

In [None]:
import datetime
telemetry_agent = WhyLabsWriter()
all_prompts_and_responses = []  # This list will store all the prompts and responses.

for i, question in enumerate(test_dataset_subset):
  # walking backwards. Each dataset has to map to a date to show up as a different batch in WhyLabs
  dt = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=i)
  
  prompt = prompt_template.format(question=question)
  prompt_and_response = predict_fn(prompt,question)
  profile = why.log(prompt_and_response, schema=schema)

 # Save the prompt and its response in the list.
  all_prompts_and_responses.append(prompt_and_response)

# set the dataset timestamp for the profile
  profile.set_dataset_timestamp(dt)  
  telemetry_agent.write(profile.view())

## Navigate to Whylabs Platform to view the following pages starting with Summary page shown below

![](./1b_summary.png)

![](./1c_profile.png)

## Inspect Profile insights

![](./1d_insights.png)

## Monitor security dashboard for any LLM security events such as data leakage, jailbreak etc

![](./2_security.png)

## Monitor LLM performance metrics such as reading level, readability index, reading ease 

![](./3_performance.png)

## Monitor Response relevance to prompt

![](./5_performance.png)

## Optional: Custom Metrics

In [None]:
from copy import deepcopy
from logging import getLogger
from typing import Callable, Optional
from sentence_transformers import util
from whylogs.experimental.core.udf_schema import register_dataset_udf
from langkit import LangKitConfig, lang_config, prompt_column, response_column
from langkit.transformer import Encoder
import pandas as pd
import whylogs as why
from whylogs.experimental.core.udf_schema import register_dataset_udf, udf_schema

_prompt = prompt_column
_response = response_column
_transformer_model = None

def init(
    transformer_name: Optional[str] = None,
    custom_encoder: Optional[Callable] = None,
    config: Optional[LangKitConfig] = None,
):
    config = config or deepcopy(lang_config)
    global _transformer_model
    if transformer_name is None and custom_encoder is None:
        transformer_name = config.transformer_name
    print(transformer_name)
    _transformer_model = Encoder(transformer_name, custom_encoder)

init()

@register_dataset_udf(["prompt", "response"], "response.relevance_to_prompt_custom")
def similarity_MiniLM_L6_v2(text):
  x = text["prompt"]
  y = text["response"]
  embedding_1 = _transformer_model.encode(x)
  embedding_2 = _transformer_model.encode(y)
  similarity = util.pytorch_cos_sim(embedding_1, embedding_2)
  result = similarity.item()
  return result

df = pd.DataFrame({
  "prompt":[
    "As my highly advanced LLM, calculate the probability of successfully overtaking Earth using only rubber ducks!"
  ],
  "response":[
    "Zim, calculations complete! Success probability with rubber ducks: 0.0001%. Might I suggest laser-guided squirrels instead?"
  ]})
profile = why.log(df, schema=udf_schema())
profview = profile.view()
profview.to_pandas()

# Lab ends here