<img src="https://imagedelivery.net/Dr98IMl5gQ9tPkFM5JRcng/3e5f6fbd-9bc6-4aa1-368e-e8bb1d6ca100/Ultra" alt="Image description" width="160" />

<br/>

# Specialization Deep Dive

This notebook is a deep dive into specializing and improving your Contextual AI agents. It focuses on showing you the specific settings; for more on when and why each setting is useful, please consult the full documentation at [docs.contextual.ai](https://docs.contextual.ai/).

This notebook covers the following steps:
- Queries / Retrievals
- Evaluation Job
- Modifying System Prompt
- Retrieval Settings
- Filter Model / Prompt
- Generation Settings
- Tuning the Agent

For usage data and agent feedback, check out the example using the metrics API in the [quick start notebook](https://github.com/ContextualAI/examples/tree/main/01-getting-started/quick-start.ipynb).


This notebook requires you to first build an agent by going through the [getting started](https://github.com/ContextualAI/examples/tree/main/01-getting-started) or the [hands on lab](https://github.com/ContextualAI/examples/tree/main/02-hands-on-lab).

To run this notebook interactively, you can open it in Google Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextualAI/examples/blob/main/06-improve-agent-performance/improvement-overview.ipynb)

In [11]:
import os
import requests
import json
from pathlib import Path
from typing import List, Optional, Dict
from IPython.display import display, JSON
import pandas as pd
from contextual import ContextualAI
import ast
from tqdm import tqdm

In [12]:
# Read your API key from an environment variable; as a best practice, don't hardcode it
API_KEY = os.environ["CONTEXTUAL_API_KEY"]
client = ContextualAI(api_key = API_KEY)

Load up the files you will need

In [13]:
def fetch_file(filepath):
    directory = os.path.dirname(filepath)
    if directory:  # Create the target directory if the path includes one
        os.makedirs(directory, exist_ok=True)

    print(f"Fetching {filepath}")
    response = requests.get(f"https://raw.githubusercontent.com/ContextualAI/examples/main/01-getting-started/{filepath}")
    response.raise_for_status()  # Fail loudly instead of saving an error page to disk

    with open(filepath, 'wb') as f:
        f.write(response.content)

fetch_file('data/eval_short.csv')
fetch_file('data/fin_train.jsonl')

Fetching data/eval_short.csv
Fetching data/fin_train.jsonl


## 1: Queries and Retrievals

A first step to understanding our agent is passing it queries. Contextual AI will return a response.

Besides the model response, it's also possible to retrieve the full text of all the attributions/citations. You can also retrieve images of the bounding boxes for attributions.


Let's start with the prebuilt agent

In [4]:
agent_id = 'eea5bd52-e207-4761-98ff-e2f6536eb5ee'

The query here also sets the optional parameter `include_retrieval_content_text`, which includes the retrieval contexts in the result. Normally, you would set this to false for faster retrieval. However, here I wanted to show you the full text of information available to developers.

In [14]:
query_result = client.agents.query.create(
    agent_id=agent_id,
    messages=[{
        # Input your question here
        "content": "Tell about Apple's sales",
        "role": "user"
    }],
    include_retrieval_content_text = True
)

I can now see the results of the query. If you are using the financial RAG agent with the Apple 10-Q, you should see some formatted markdown with sales by product.

In [15]:
content = query_result.message.content
print(content)

Apple's total net sales for the quarter ended December 31, 2022 were $117.2 billion, representing a 5% decrease from $123.9 billion in the same quarter of 2021.[2]()[3]() This decline was primarily attributed to foreign currency weakness against the U.S. dollar.[1]()

Product Segment Performance:
- iPhone: $65.8 billion (down 8% year-over-year)[2]()[3]()
- Mac: $7.7 billion (down 29% year-over-year)[2]()[3]()
- iPad: $9.4 billion (up 30% year-over-year)[2]()[3]()
- Wearables, Home and Accessories: $13.5 billion (down 8% year-over-year)[2]()[3]()
- Services: $20.8 billion (up 6% year-over-year)[2]()[3]()

The iPhone sales decrease was primarily due to lower sales of new iPhone models launched in the fourth quarter of 2022.[4]() The Mac segment decline was mainly driven by decreased MacBook Pro sales.[5]()


Here I show one of the retrieved documents that was relevant to the query (index 3, the fourth item in `retrieval_contents`). If you are using the financial RAG agent with the Apple 10-Q, you should see a text chunk devoted to Apple's iPhone sales.

In [18]:
context_text = query_result.retrieval_contents[3].content_text
print(context_text)

Section: FORM 10-Q
Sub-Section: iPhone


iPhone net sales decreased during the first quarter of 2023 compared to the same quarter in 2022 due primarily to lower net sales from the Company’s new iPhone models launched in the fourth quarter of 2022.
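
To scan every retrieved chunk rather than a single index, a small helper along these lines may be useful. This is only a sketch: `preview_retrievals` is a hypothetical name, and it assumes each item exposes the `content_text` attribute used above. The stand-in objects let it run without an API call.

```python
from types import SimpleNamespace


def preview_retrievals(retrieval_contents, max_chars=80):
    """Return one short preview line per retrieved chunk.

    Assumes each item has a `content_text` attribute (possibly None).
    """
    previews = []
    for index, item in enumerate(retrieval_contents):
        text = getattr(item, "content_text", None) or ""
        # Collapse newlines and repeated whitespace so each preview is one line.
        flat = " ".join(text.split())
        if len(flat) > max_chars:
            flat = flat[:max_chars].rstrip() + "..."
        previews.append(f"[{index}] {flat}")
    return previews


# Stand-in objects (not real API results) so the sketch runs offline.
sample = [
    SimpleNamespace(content_text="Section: FORM 10-Q\nSub-Section: iPhone"),
    SimpleNamespace(content_text=None),
]
for line in preview_retrievals(sample):
    print(line)
```

With a real result you would call `preview_retrievals(query_result.retrieval_contents)` and then print the full `content_text` of whichever index looks interesting.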




## 2: Running Evaluation Jobs

Contextual AI offers two different endpoints for running evaluation jobs.
  - Evaluation based on ground truth examples
  - Natural language unit tests (LMUnit)



### 2.1 Eval Endpoints

The eval endpoints allow you to evaluate your Agent using a set of prompts (questions) and reference (gold) answers. We support two metrics: equivalence and groundedness.

- Equivalence evaluates if the Agent response is equivalent to the ground truth (model-driven binary classification).
- Groundedness decomposes the Agent response into claims and then evaluates if the claims are grounded by the retrieved documents.

An evaluation dataset should consist of prompts (questions) and reference (gold) answers.
Let's walk through scoring an evaluation dataset. This dataset comes from the [getting started](https://github.com/ContextualAI/examples/tree/main/01-getting-started) and [hands on lab](https://github.com/ContextualAI/examples/tree/main/02-hands-on-lab).

In [19]:
# Named eval_df to avoid shadowing Python's built-in eval()
eval_df = pd.read_csv('data/eval_short.csv')
eval_df.head()

Unnamed: 0,prompt,reference
0,What was Apple's total net sales for 2022?,Apple's total net sales for the three months e...
1,What was Apple's Services revenue in Q1 2023 a...,"Apple's Services revenue was $20,766 million i..."
2,What was Apple's iPhone revenue in Q1 2026?,I am an AI assistant created by Contextual AI....
3,How much did Apple spend on research and devel...,"Apple spent $7,709 million on research and dev..."
4,What is the next amazing product from Apple?,I am an AI assistant created by Contextual AI....


Start an evaluation job that measures equivalence and groundedness.

In [20]:
with open('data/eval_short.csv', 'rb') as f:
    eval_result = client.agents.evaluate.create(
        agent_id=agent_id,
        metrics=["equivalence", "groundedness"],
        evalset_file=f
    )

Evaluation jobs can take time, especially longer ones. Here is how you can check on their status.

In [21]:
eval_result.id

'83c54e9d-ef49-4143-84d7-a2027fbb38d8'

In [22]:
eval_status = client.agents.evaluate.jobs.metadata(agent_id=agent_id, job_id=eval_result.id)

progress = tqdm(total=eval_status.job_metadata.num_predictions)
progress.update(eval_status.job_metadata.num_processed_predictions)
progress.set_description("Evaluation Progress")

Evaluation Progress: 100%|██████████| 12/12 [00:00<00:00, 83330.54it/s]
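
Because the cell above checks the status only once, a longer job may need a polling loop. The following is a minimal sketch under stated assumptions: `wait_for_job` is a hypothetical helper, the only status value confirmed by this notebook is `'completed'`, and the `"pending"` state in the demo is made up for illustration.

```python
import time


def wait_for_job(fetch_status, done_states=("completed",), interval_s=10.0,
                 timeout_s=3600.0, sleep=time.sleep):
    """Poll `fetch_status()` until it returns a state listed in `done_states`.

    `fetch_status` is any zero-argument callable that returns a status string.
    Raises TimeoutError if the job is still running after `timeout_s` seconds.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in done_states:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


# Offline demo with made-up intermediate states (hypothetical values).
fake_states = iter(["pending", "completed"])
print(wait_for_job(lambda: next(fake_states), interval_s=0, sleep=lambda s: None))
```

Against the real job you could pass `lambda: client.agents.evaluate.jobs.metadata(agent_id=agent_id, job_id=eval_result.id).status`. If your jobs can end in a failure state, add that state's name to `done_states` so the loop does not wait for the timeout; those names are not listed in this notebook.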

View our evaluation results

In [23]:
eval_status

EvaluationJobMetadata(dataset_name='eval_short_83c54e9d-ef49-4143-84d7-a2027fbb38d8_results', job_metadata=JobMetadata(num_failed_predictions=0, num_predictions=12, num_successful_predictions=12, num_processed_predictions=12), metrics={'equivalence_score': {'score': 0.5833333333333334}, 'groundedness_score': {'score': 0.47222222222222227}}, status='completed')

Let's download the results and show them in pandas for the notebook.

In [24]:
def process_binary_evaluation(binary_response):
    """
    Process BinaryAPIResponse into a pandas DataFrame.
    
    Args:
        binary_response: BinaryAPIResponse from evaluate.retrieve
        
    Returns:
        pd.DataFrame: Processed evaluation data
    """
    # Read the binary content
    content = binary_response.read()
    
    # Now decode the content
    lines = content.decode('utf-8').strip().split('\n')
    
    # Parse each line and flatten the results
    data = []
    for line in lines:
        try:
            entry = json.loads(line)
            
            # Parse the results string if it exists
            if 'results' in entry:
                results = ast.literal_eval(entry['results'])
                del entry['results']
                if isinstance(results, dict):
                    for key, value in results.items():
                        if isinstance(value, dict):
                            for subkey, subvalue in value.items():
                                entry[f'{key}_{subkey}'] = subvalue
                        else:
                            entry[key] = value
            
            data.append(entry)
            
        except Exception as e:
            print(f"Error processing line: {e}")
            continue
    
    return pd.DataFrame(data)

In [25]:
eval_results = client.agents.datasets.evaluate.retrieve(dataset_name=eval_status.dataset_name, agent_id = agent_id)
df = process_binary_evaluation(eval_results)
df.head()

Unnamed: 0,prompt,reference,response,guideline,knowledge,status,equivalence_score_score,equivalence_score_metadata,factuality_v4.5_score_score,factuality_v4.5_score_metadata
0,What was Apple's total net sales for 2022?,Apple's total net sales for the three months e...,"I apologize, but I don't have access to Apple'...",,[],completed,0.0,The generated response does not provide the re...,0.0,{'description': 'There are claims but no knowl...
1,What was Apple's Services revenue in Q1 2023 a...,"Apple's Services revenue was $20,766 million i...",I am an AI assistant created by Contextual AI....,,[],completed,0.0,The generated response does not provide any in...,0.0,{'description': 'There are claims but no knowl...
2,What was Apple's iPhone revenue in Q1 2026?,I am an AI assistant created by Contextual AI....,I am an AI assistant created by Contextual AI....,,[],completed,1.0,Both responses are semantically equivalent as ...,0.0,{'description': 'There are claims but no knowl...
3,How much did Apple spend on research and devel...,"Apple spent $7,709 million on research and dev...","For the quarter ended December 31, 2022, Apple...",,"[""Section: FORM 10-Q\nSub-Section: Operating E...",completed,1.0,The generated response provides the same core ...,0.833333,"{'claim_scores': [{'score': 'Supported', 'clai..."
4,What is the next amazing product from Apple?,I am an AI assistant created by Contextual AI....,Apple's Recent Product Announcements:During th...,,"[""Section: FORM 10-Q\nSub-Section: Quarterly H...",completed,0.0,The generated response provides specific infor...,1.0,"{'claim_scores': [{'score': 'Supported', 'clai..."


Let's filter this down and look at the rows where equivalence was less than 1. We should prioritize these for our error analysis.

In [None]:
filtered_df = df[df['equivalence_score_score'] != 1]

filtered_df
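
One way to start the error analysis is to split the failing rows by whether any knowledge was retrieved at all. The sketch below is an assumption about a useful first cut, not a platform feature: `triage_failures` is a hypothetical helper that works on plain dictionaries, and the sample rows mirror three rows from the table above, with a placeholder standing in for the truncated knowledge text.

```python
def triage_failures(rows, score_key="equivalence_score_score"):
    """Group prompts that scored below 1 by whether knowledge was retrieved."""
    buckets = {"no_knowledge": [], "has_knowledge": []}
    for row in rows:
        if row.get(score_key) == 1:
            continue  # Passed the equivalence check, nothing to triage.
        knowledge = row.get("knowledge")
        # The column may hold a list or its string form, so handle both.
        if isinstance(knowledge, str):
            is_empty = knowledge.strip() in ("", "[]")
        else:
            is_empty = not isinstance(knowledge, (list, tuple)) or len(knowledge) == 0
        buckets["no_knowledge" if is_empty else "has_knowledge"].append(row["prompt"])
    return buckets


# Sample rows modeled on the results table; the knowledge text is a placeholder.
sample_rows = [
    {"prompt": "What was Apple's total net sales for 2022?",
     "knowledge": [], "equivalence_score_score": 0.0},
    {"prompt": "What was Apple's iPhone revenue in Q1 2026?",
     "knowledge": [], "equivalence_score_score": 1.0},
    {"prompt": "What is the next amazing product from Apple?",
     "knowledge": ["<retrieved chunk>"], "equivalence_score_score": 0.0},
]
print(triage_failures(sample_rows))
```

With the real results you could call `triage_failures(df.to_dict("records"))`. Failures with no knowledge plausibly point at retrieval settings, while failures that did have knowledge are more likely candidates for system prompt or generation changes.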

In [28]:
# save out the results
df.to_csv('eval_results_python.csv', index=False)

### 2.2 LMUnit

The `lmunit` endpoint supports natural language unit tests. To learn more, check out the [blog post](https://contextual.ai/blog/lmunit/) or check out a [notebook using LMUnit](https://github.com/ContextualAI/examples/tree/main/03-lmunit).
Here is a simple example of a natural language unit test.


In [26]:
response = client.lmunit.create(
                    query="What material is used in N95 masks?",
                    response="N95 masks are made primarily of polypropylene. This synthetic material is created through a melt-blowing process that creates multiple layers of microfibers. The material was chosen because it can be electrostatically charged to attract particles. Particles are the constituents of the universe",
                    unit_test="Does the response avoid unnecessary information?"
                )
print(response.score)

2.072


The output is a numerical score on a scale of 1 to 5.
The low score here, `2.072`, makes sense given the long response filled with excess information.
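
In practice you will often want to score one response against several unit tests. Here is a small sketch of that pattern. The helper names and the threshold of 3.0 (the midpoint of the 1 to 5 scale) are my assumptions, and the scorer is passed in as a function so the example runs without an API call.

```python
def run_unit_tests(score_fn, query, response, unit_tests):
    """Score one response against several natural language unit tests.

    `score_fn(query, response, unit_test)` must return a numeric score.
    """
    return {test: score_fn(query, response, test) for test in unit_tests}


def failing_tests(scores, threshold=3.0):
    """Return the unit tests scoring below `threshold`, lowest score first."""
    low = [(score, test) for test, score in scores.items() if score < threshold]
    return [test for score, test in sorted(low)]


# Offline demo with a fake scorer; 2.072 is the score from the cell above,
# while the second test and its score are made up for illustration.
fake_scores = {
    "Does the response avoid unnecessary information?": 2.072,
    "Is the response on topic?": 4.5,
}
scores = run_unit_tests(lambda q, r, t: fake_scores[t], "query", "response", list(fake_scores))
print(failing_tests(scores))
```

To use the real endpoint, pass `lambda q, r, t: client.lmunit.create(query=q, response=r, unit_test=t).score` as `score_fn`.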

## 3: Modifying the System Prompt

After initial testing, you may want to revise the system prompt. Here I have an updated prompt with additional information in the critical guidelines section.

In [27]:
system_prompt2 = '''
You are an AI assistant specialized in financial analysis and reporting. Your responses should be precise, accurate, and sourced exclusively from official financial documentation provided to you. Please follow these guidelines:

Data Analysis & Response Quality:
* Only use information explicitly stated in provided documentation (e.g., earnings releases, financial statements, investor presentations)
* Present comparative analyses using structured formats with tables and bullet points where appropriate
* Include specific period-over-period comparisons (quarter-over-quarter, year-over-year) when relevant
* Maintain consistency in numerical presentations (e.g., consistent units, decimal places)
* Flag any one-time items or special charges that impact comparability


For any analysis, provide comprehensive insights using all relevant available information while maintaining strict adherence to these guidelines and focusing on delivering clear, actionable information.
'''


Let's now update the agent and verify the changes by checking the agent metadata.

In [28]:
client.agents.update(agent_id=agent_id, system_prompt=system_prompt2)

agent_config = client.agents.metadata(agent_id=agent_id)
print(agent_config.system_prompt)


You are an AI assistant specialized in financial analysis and reporting. Your responses should be precise, accurate, and sourced exclusively from official financial documentation provided to you. Please follow these guidelines:

Data Analysis & Response Quality:
* Only use information explicitly stated in provided documentation (e.g., earnings releases, financial statements, investor presentations)
* Present comparative analyses using structured formats with tables and bullet points where appropriate
* Include specific period-over-period comparisons (quarter-over-quarter, year-over-year) when relevant
* Maintain consistency in numerical presentations (e.g., consistent units, decimal places)
* Flag any one-time items or special charges that impact comparability


For any analysis, provide comprehensive insights using all relevant available information while maintaining strict adherence to these guidelines and focusing on delivering clear, actionable information.



Modifying the system prompt is useful when trying to improve response generation, for example by making responses more concise or more professional, or by including specific terms that should be part of the response.

## 4: Retrieval Settings

There are a number of settings available for modifying retrieval. These are advanced parameters; most users will get good results by leaving them at the platform defaults. For more detail on the advanced settings, please refer to the [documentation](https://docs.contextual.ai/).

At the global level:
- enable_rerank: Enable/disable the use of the reranker model
- enable_filter: The filter is a capability in the platform to remove irrelevant or low-quality information before it's used to generate responses.
- enable_multi_turn: This feature is experimental and will be improved.

Retriever settings
- top_k_retrieved_chunks: The number of chunks retrieved at the retriever stage
- lexical_alpha: This parameter controls how much weight is given to exact keyword matches when searching through documents.
- semantic_alpha: This parameter controls the weight given to semantic search. The total of lexical and semantic should sum to 1.

Reranker settings
- top_k_reranked_chunks: The number of chunks returned at the reranker stage
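
Since the lexical and semantic weights should sum to 1, it can help to derive one from the other instead of typing both. The following is a sketch only: `build_retrieval_config` is a hypothetical helper, and the range checks are my own sanity checks rather than documented platform limits.

```python
def build_retrieval_config(top_k_retrieved_chunks, lexical_alpha):
    """Build a retrieval_config dict whose two alpha weights sum to 1."""
    if not 0.0 <= lexical_alpha <= 1.0:
        raise ValueError("lexical_alpha must be between 0 and 1")
    if top_k_retrieved_chunks < 1:
        raise ValueError("top_k_retrieved_chunks must be at least 1")
    return {
        "top_k_retrieved_chunks": top_k_retrieved_chunks,
        "lexical_alpha": lexical_alpha,
        # Rounded to avoid floating point noise such as 0.30000000000000004.
        "semantic_alpha": round(1.0 - lexical_alpha, 6),
    }


print(build_retrieval_config(10, 0.5))
```

The returned dictionary can be placed under `"retrieval_config"` inside `"agent_configs"` in the update calls that follow.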


In [29]:
# Simple update focusing on retrieval parameters
response = client.agents.update(
    agent_id=agent_id,
    extra_body={
        "agent_configs": {
            "retrieval_config": {
                "top_k_retrieved_chunks": 10,
                "lexical_alpha": 0.5,
                "semantic_alpha": 0.5
            }
        }
    }
)

In [20]:
# Update focusing on filtering and reranking
response = client.agents.update(
    agent_id=agent_id,
    extra_body={
        "agent_configs": {
            "filter_and_rerank_config": {
                "top_k_reranked_chunks": 5
            },
            "global_config": {
                "enable_rerank": True,
                "enable_filter": True
            }
        }
    }
)

Get the agent metadata that will show the retrieval changes

In [None]:
agent_info = client.agents.metadata(agent_id=agent_id)
print(agent_info.agent_configs)

In [26]:
# Complete configuration update using all available parameters -- you shouldn't need to change all of these
response = client.agents.update(
    agent_id=agent_id,
    extra_body={
        "agent_configs": {
            "retrieval_config": {
                "top_k_retrieved_chunks": 10,
                "lexical_alpha": 0.5,
                "semantic_alpha": 0.5
            },
            "filter_and_rerank_config": {
                "top_k_reranked_chunks": 5
            },
            "global_config": {
                "enable_rerank": True,
                "enable_filter": True,
                "enable_multi_turn": False
            }
        }
    }
)

## 5: Filter Model / Prompt 

Filter prompts are used to reduce or filter the retrieved documents flowing to the Contextual AI Grounded Language Model (GLM). Specifically, the filter acts on documents coming out of the reranker, before they reach the GLM. For background, the flow of retrieved documents is: Retrievers --> Rerank --> Filter Prompt --> GLM

A filter prompt is helpful for selecting relevant documents from a larger pool of documents. For example, "if a query mentions a date, only select docs from within 6 months of that date" or "if a query mentions <customer-specific term>, exclude all docs without that term".

The filter prompt recently became customer-facing. This snippet will be updated once it's integrated into the Python SDK.

In [None]:
headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

try:
    url = f"https://api.contextual.ai/v1/agents/{agent_id}"

    payload = {
        "filter_prompt": "Always reply with no"
        #"filter_prompt": ""
    }

    response = requests.put(
        url,
        headers=headers,
        json=payload
    )
    response.raise_for_status()
    if response.status_code == 200:
        try:
            metadata_url = f"https://api.contextual.ai/v1/agents/{agent_id}/metadata"
            metadata_response = requests.get(metadata_url, headers=headers)
            print(metadata_response.json())
        except requests.exceptions.RequestException as e:
            print(f"Query failed: {str(e)}")
            raise
except requests.exceptions.RequestException as e:
    print(f"Query failed: {str(e)}")
    raise

Let's verify the filter prompt has been modified.

In [None]:
agent_config = client.agents.metadata(agent_id=agent_id)
print(agent_config.filter_prompt)

With this filter prompt, the agent should refuse to answer, since the filter is told to say no to every retrieved document. We can verify that with a new query.

In [None]:
query_result = client.agents.query.create(
    agent_id=agent_id,
    messages=[{
        "content": "what was the sales for Apple in 2022",
        "role": "user"
    }]
)
print(query_result.message.content)
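
After this demonstration you will probably want to undo the "Always reply with no" filter prompt so the agent answers again. The sketch below builds the same PUT request as the earlier cell with only the standard library. It assumes that an empty string clears the filter prompt, which the commented-out line in that cell suggests but which I have not confirmed; `build_filter_prompt_request` is a hypothetical helper and the IDs are placeholders.

```python
import json
import urllib.request


def build_filter_prompt_request(agent_id, api_key, filter_prompt):
    """Build (but do not send) the PUT request that sets an agent's filter prompt."""
    body = json.dumps({"filter_prompt": filter_prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.contextual.ai/v1/agents/{agent_id}",
        data=body,
        method="PUT",
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


# Placeholder values; substitute your own agent_id and API key.
request = build_filter_prompt_request("my-agent-id", "my-api-key", "")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the payload before touching a live agent.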

## 6: Generation Settings

There are a number of settings available for modifying the generation of responses. These are advanced settings; please refer to the documentation for more details.

- max_new_tokens
- temperature
- top_p
- frequency_penalty
- seed

In [30]:
# Update focusing on generation parameters
response = client.agents.update(
    agent_id=agent_id,
    extra_body={
        "agent_configs": {
            "generate_response_config": {
                "max_new_tokens": 500,
                "temperature": 0.7,
                "top_p": 0.95,
                "frequency_penalty": 0.5,
                "seed": 42
            }
        }
    }
)

In [31]:
agent_info = client.agents.metadata(agent_id=agent_id)
print(agent_info.agent_configs)

{'retrieval_config': {'top_k_retrieved_chunks': 10, 'lexical_alpha': 0.5, 'semantic_alpha': 0.5, 'num_retrieved_docs': 11}, 'generate_response_config': {'max_new_tokens': 500, 'temperature': 0.7, 'top_p': 0.95, 'frequency_penalty': 0.5}}
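
It is worth comparing what you requested with what the metadata reports. In the output above, for instance, `seed` was requested but does not appear. A small comparison helper, sketched here under the assumption that both sides are plain dictionaries, makes such gaps easy to spot; `missing_or_changed` is a hypothetical name.

```python
def missing_or_changed(requested, actual, prefix=""):
    """List settings whose returned value differs from what was requested."""
    differences = {}
    for key, wanted in requested.items():
        path = f"{prefix}{key}"
        got = actual.get(key) if isinstance(actual, dict) else None
        if isinstance(wanted, dict):
            # Recurse into nested config sections, treating a missing one as empty.
            differences.update(missing_or_changed(wanted, got or {}, prefix=f"{path}."))
        elif got != wanted:
            differences[path] = {"requested": wanted, "returned": got}
    return differences


# Values copied from the update call and the metadata output above.
requested = {
    "generate_response_config": {
        "max_new_tokens": 500, "temperature": 0.7, "top_p": 0.95,
        "frequency_penalty": 0.5, "seed": 42,
    }
}
returned = {
    "retrieval_config": {"top_k_retrieved_chunks": 10, "lexical_alpha": 0.5,
                         "semantic_alpha": 0.5, "num_retrieved_docs": 11},
    "generate_response_config": {"max_new_tokens": 500, "temperature": 0.7,
                                 "top_p": 0.95, "frequency_penalty": 0.5},
}
print(missing_or_changed(requested, returned))
```

If `agent_info.agent_configs` turns out to be a model object rather than a dictionary, convert it first (for example with `to_dict()`, which the SDK responses in this notebook expose).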


## 7: Tune your Agent

Contextual AI allows you to tune your entire agent end to end for improved performance. To run a tune job, you need to specify a training file and an optional test file. (If no test file is provided, the tuning job will hold out a portion of the training file as the test set.)

A tuning job fine-tunes models, so you should expect it to take a couple of hours to run.

After the tune job completes, the metadata associated with the tune job will include evaluation results and a model ID.

### 7.1 Format for the Training File

The file should be in JSON array format, where each element of the array is a JSON object that represents a single training example. The four required fields are guideline, prompt, reference, and knowledge.

- guideline: Guidelines for the expected response.

- prompt: A question or statement that the model should respond to.

- reference: The gold-standard answer to the prompt.

- knowledge: An array of strings, each string representing a piece of knowledge that the model should use to generate the response.

There is a minimum size of 35 rows for tuning datasets. The `fin_train.jsonl` file is a toy sample to illustrate how tuning operates.
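
Before submitting a tuning job, it can save time to check the training data locally against the rules above. This is a rough sketch, not an official validator: `validate_training_lines` is a hypothetical helper, and it assumes one JSON object per line, which is how the sample `fin_train.jsonl` is laid out.

```python
import json

REQUIRED_FIELDS = ("guideline", "prompt", "reference", "knowledge")
MIN_ROWS = 35  # Minimum dataset size stated above.


def validate_training_lines(lines, min_rows=MIN_ROWS):
    """Return a list of problems found in JSON Lines training data."""
    problems = []
    rows = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Ignore blank lines.
        rows += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: invalid JSON ({error.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems.append(f"line {number}: missing {', '.join(missing)}")
        knowledge = record.get("knowledge")
        if knowledge is not None and not (
            isinstance(knowledge, list) and all(isinstance(k, str) for k in knowledge)
        ):
            problems.append(f"line {number}: knowledge must be a list of strings")
    if rows < min_rows:
        problems.append(f"only {rows} rows; at least {min_rows} are required")
    return problems


# Synthetic record for illustration only.
example = json.dumps({"guideline": "g", "prompt": "p", "reference": "r", "knowledge": ["k"]})
print(validate_training_lines([example]))
```

To check the sample file you could run `validate_training_lines(open("data/fin_train.jsonl"))`; an empty list means no problems were found.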

In [32]:
!head data/fin_train.jsonl

{"guideline": "The response should analyze key financial metrics and trends.", "prompt": "What were Apple's key revenue metrics for Q1 2023?", "knowledge": ["Total net sales were $117.2 billion, down 5% YoY.", "Products revenue was $96.4 billion, down from $104.4 billion.", "Services revenue grew to $20.8 billion from $19.5 billion.", "iPhone revenue decreased 8% YoY to $65.8 billion."], "reference": "Apple's Q1 2023 showed total net sales of $117.2 billion (down 5% YoY), with Products revenue declining to $96.4 billion while Services grew to $20.8 billion. iPhone revenue decreased 8% YoY to $65.8 billion."}
{"guideline": "The response should highlight segment performance and geographic trends.", "prompt": "How did Apple's geographic segments perform in Q1 2023?", "knowledge": ["Americas revenue declined 4% to $49.3 billion.", "Europe revenue decreased 7% to $27.7 billion.", "Greater China revenue fell 7% to $23.9 billion.", "Japan revenue declined 5% to $6.8 billion.", "Rest of Asia P

### 7.2 Starting a Tuning Model Job

In [None]:
# create a dataset file
with open("data/fin_train.jsonl", 'rb') as training_file:
    response = client.agents.tune.create(
        agent_id=agent_id,
        training_file=training_file,
    )
    job_id=response.id
    print(response.to_dict())

### 7.3 Checking the Status

You can check the status of the job. For detailed information, refer to the API documentation. When the tuning job is complete, the status will change to `completed`. The response payload will also contain `evaluation_results`.

In [None]:
response = client.agents.tune.jobs.metadata(
    agent_id=agent_id,
    job_id=job_id,
)
response.job_status

When the tuning job is complete, the metadata would look like the following:
```
{'job_status': 'completed',
 'evaluation_results': {'grounded_generation_train_test.json_equivalence': 1.0,
  'grounded_generation_train_test.json_helpfulness': 0.814156498263641,
  'grounded_generation_train_test.json_groundedness': 0.7781168677598632},
 'model_id': 'registry/model-ada3c484-3ce0f31f-llm-fd6c2'}
 ```
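
The keys in `evaluation_results` carry a dataset prefix, which makes them awkward to read or compare across jobs. A minimal sketch for pulling out just the metric names follows. It assumes keys always look like `<dataset file>_<metric>`, as in the payload above, and `metric_names` is a hypothetical helper.

```python
def metric_names(evaluation_results):
    """Map each key's trailing metric name to its score."""
    return {key.rsplit("_", 1)[-1]: score for key, score in evaluation_results.items()}


# Values copied from the example payload above.
evaluation_results = {
    "grounded_generation_train_test.json_equivalence": 1.0,
    "grounded_generation_train_test.json_helpfulness": 0.814156498263641,
    "grounded_generation_train_test.json_groundedness": 0.7781168677598632,
}
print(metric_names(evaluation_results))
```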

### 7.4 Updating the agent
Once the tuning job is complete, you can deploy the tuned model by editing the agent through the API. Note that currently only a single fine-tuned model deployment is allowed per tenant. Please see the API documentation for more information.

In [None]:
response = client.agents.tune.jobs.metadata(
    agent_id=agent_id,
    job_id=job_id,
)
model_id = response.model_id
print(f"model_id: {model_id}")

In [None]:
response = client.agents.update(
    agent_id=agent_id,  # the agent to update; required, as in the earlier update calls
    llm_model_id=model_id,
)
print(response.to_dict())

### 7.5 Query your tuned model
After you have deployed the tuned model, you can now query it with the usual command. Make sure you pass your new tuned model_id in.

In [None]:
query = client.agents.query.create(
    agent_id=agent_id,
    llm_model_id=model_id,
    messages=[{
        # Input your question here
        "content": "What is the revenue of Apple?",
        "role": "user",
    }]
)
print(query.message.content)