<img src="https://imagedelivery.net/Dr98IMl5gQ9tPkFM5JRcng/3e5f6fbd-9bc6-4aa1-368e-e8bb1d6ca100/Ultra" alt="Image description" width="160" />

<br/>

# Specialization Deep Dive

This notebook is a deep dive into specializing and improving your Contextual AI agents. It focuses on showing you the specific settings; for guidance on when and why each setting is useful, please consult the full documentation at [docs.contextual.ai](https://docs.contextual.ai/).

This notebook covers the following steps:
- Queries / Retrievals
- Evaluation Job
- Modifying System Prompt
- Retrieval Settings
- Generation Settings
- Tuning the Agent

This notebook requires that you first build an agent by working through the [getting started](https://github.com/ContextualAI/examples/tree/main/01-getting-started) notebook or the [hands on lab](https://github.com/ContextualAI/examples/tree/main/02-hands-on-lab).

To run this notebook interactively, you can open it in Google Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextualAI/examples/blob/main/improve_model_performance/improvement_overview.ipynb)

In [5]:
import os
# Set up your API key here - I keep it in my environment variables
API_KEY = os.environ["CONTEXTUAL_API_KEY"]
#API_KEY = 'key'

In [6]:
import requests
import json
from pathlib import Path
from typing import List, Optional, Dict
from IPython.display import display, JSON
import pandas as pd
from contextual import ContextualAI
import ast
from tqdm import tqdm
client = ContextualAI(api_key = API_KEY)

Load up the files you will need

In [None]:
def fetch_file(filepath):
    directory = os.path.dirname(filepath)
    if directory:  # Skip when the file goes in the current directory
        os.makedirs(directory, exist_ok=True)  # No-op if it already exists

    print(f"Fetching {filepath}")
    response = requests.get(f"https://raw.githubusercontent.com/ContextualAI/examples/main/01-getting-started/{filepath}")
    response.raise_for_status()  # Fail loudly instead of saving an error page

    with open(filepath, 'wb') as f:
        f.write(response.content)

fetch_file('data/eval_short.csv')
fetch_file('data/fin_train.jsonl')

## 1: Queries and Retrievals

A first step to understanding our agent is passing it queries. Contextual AI will return a response.

Besides the model response, it's also possible to retrieve the full text of all the attributions/citations. You can also retrieve images of the bounding boxes for attributions.


Let's start with the prebuilt agent

In [8]:
agent_id = 'add_your_agent_id_here'

The query here also sets the optional `include_retrieval_content_text` parameter, which returns the text of the retrieved contexts. Normally, you would set this to false for faster retrieval. Here it is enabled to show you the full text of the information available to developers.

In [9]:
query_result = client.agents.query.create(
    agent_id=agent_id,
    messages=[{
        # Input your question here
        "content": "Tell about bonds",
        "role": "user"
    }],
    include_retrieval_content_text = True
)

Here I show the first document retrieved that was relevant to the query

In [None]:
context_text = query_result.retrieval_contents[0].content_text
print(context_text)
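
To skim every retrieved chunk rather than just the first, a small helper like the one below can print a short preview of each. This is a minimal sketch: `preview_retrievals` is a hypothetical name, and it relies only on the `content_text` attribute used above. The stand-in objects let it run without an API call.

```python
from types import SimpleNamespace


def preview_retrievals(retrieval_contents, max_chars=200):
    """Return a short, numbered preview string for each retrieved chunk."""
    previews = []
    for rank, item in enumerate(retrieval_contents, start=1):
        # content_text may be missing or None if retrieval text was not requested
        text = getattr(item, "content_text", None) or ""
        # Collapse whitespace so each preview fits on one line
        snippet = " ".join(text.split())[:max_chars]
        previews.append(f"[{rank}] {snippet}")
    return previews


# Stand-in objects (not real API responses) so the sketch runs offline
fake_contents = [
    SimpleNamespace(content_text="Bonds are fixed-income\ninstruments."),
    SimpleNamespace(content_text=None),
]
for line in preview_retrievals(fake_contents, max_chars=40):
    print(line)
```

With a real query you would call `preview_retrievals(query_result.retrieval_contents)`.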

## 2: Running Evaluation Jobs

Contextual AI offers two different endpoints for running evaluation jobs:
  - Evaluation assessing ground truth
  - Natural language unit tests (LMUnit)



### 2.1 Eval Endpoints

The eval endpoints allow you to evaluate your Agent using a set of prompts (questions) and reference (gold) answers. We support two metrics: equivalence and groundedness.

- Equivalence evaluates if the Agent response is equivalent to the ground truth (model-driven binary classification).
- Groundedness decomposes the Agent response into claims and then evaluates if the claims are grounded by the retrieved documents.

An evaluation dataset should consist of prompts (questions) and reference (gold) answers.
Let's walk through scoring an evaluation dataset. This dataset comes from the [getting started](https://github.com/ContextualAI/examples/tree/main/01-getting-started) and [hands on lab](https://github.com/ContextualAI/examples/tree/main/02-hands-on-lab).
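
Before uploading an evaluation set, it can help to confirm that every row has both a question and a gold answer. The sketch below is one way to do that with the standard library. The column names `prompt` and `reference` are an assumption based on the description above, so adjust `REQUIRED_COLUMNS` to match your file; `check_evalset` is a hypothetical helper, not part of the SDK.

```python
import csv
import io

# Assumed header names; change these if your CSV uses different ones
REQUIRED_COLUMNS = ("prompt", "reference")


def check_evalset(csv_text, required=REQUIRED_COLUMNS):
    """Return a list of human-readable problems found in an evalset CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in required if name not in header]
    if problems:
        return problems
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for name in required:
            if not (row.get(name) or "").strip():
                problems.append(f"row {row_number}: empty {name}")
    return problems


# Tiny made-up sample; the second data row has no reference answer
sample = "prompt,reference\nWhat is a bond?,A debt security.\nWhat is a stock?,\n"
print(check_evalset(sample))
```

To check the file used in this notebook, pass in `Path('data/eval_short.csv').read_text()`.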

In [None]:
# Named eval_df to avoid shadowing Python's built-in eval()
eval_df = pd.read_csv('data/eval_short.csv')
eval_df.head()

Start an evaluation job that measures equivalence and groundedness

In [12]:
with open('data/eval_short.csv', 'rb') as f:
    eval_result = client.agents.evaluate.create(
        agent_id=agent_id,
        metrics=["equivalence", "groundedness"],
        evalset_file=f
    )

Evaluation jobs can take a while, especially on larger datasets. Here is how you can check on their status.

In [None]:
eval_result.id

In [None]:
eval_status = client.agents.evaluate.jobs.metadata(agent_id=agent_id, job_id=eval_result.id)

progress = tqdm(total=eval_status.job_metadata.num_predictions)
progress.update(eval_status.job_metadata.num_processed_predictions)
progress.set_description("Evaluation Progress")
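
The cell above checks the status once. To wait for a job to finish, you can wrap the check in a polling loop. The sketch below is generic: `wait_for_job` is a hypothetical helper that takes any function returning a status string. The state names in `done_states` are assumptions (only `completed` appears in this notebook, for tuning jobs), so adapt them to what your job actually reports.

```python
import time


def wait_for_job(fetch_status, done_states=("completed", "failed"),
                 interval_seconds=10.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Call fetch_status() until it returns a state listed in done_states."""
    waited = 0.0
    while True:
        state = fetch_status()
        if state in done_states:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still in state {state!r} after {waited:.0f}s")
        sleep(interval_seconds)
        waited += interval_seconds


# Simulated job so the sketch runs offline without waiting
states = iter(["pending", "processing", "completed"])
final_state = wait_for_job(lambda: next(states), interval_seconds=0.0,
                           sleep=lambda seconds: None)
print(final_state)
```

For the tuning job in section 6, `fetch_status` could be `lambda: client.agents.tune.jobs.metadata(agent_id=agent_id, job_id=job_id).job_status`.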

View our evaluation results

In [None]:
eval_status

Let's make it easy to show the results in a notebook

In [31]:
def process_binary_evaluation(binary_response):
    """
    Process BinaryAPIResponse into a pandas DataFrame.

    Args:
        binary_response: BinaryAPIResponse from evaluate.retrieve

    Returns:
        pd.DataFrame: Processed evaluation data
    """
    # Read the binary content
    content = binary_response.read()

    # Now decode the content
    lines = content.decode('utf-8').strip().split('\n')

    # Parse each line and flatten the results
    data = []
    for line in lines:
        try:
            entry = json.loads(line)

            # Parse the results string if it exists
            if 'results' in entry:
                results = ast.literal_eval(entry['results'])
                del entry['results']
                if isinstance(results, dict):
                    for key, value in results.items():
                        if isinstance(value, dict):
                            for subkey, subvalue in value.items():
                                entry[f'{key}_{subkey}'] = subvalue
                        else:
                            entry[key] = value

            data.append(entry)

        except Exception as e:
            print(f"Error processing line: {e}")
            continue

    return pd.DataFrame(data)

Let's show this in pandas and make it friendlier

In [None]:
eval_results = client.agents.datasets.evaluate.retrieve(dataset_name=eval_status.dataset_name, agent_id = agent_id)
df = process_binary_evaluation(eval_results)
df.head()

Here I am going to save the results of the evaluation to a csv file.

In [None]:
df.to_csv('eval_results.csv', index=False)

### 2.2 LMUnit

The `lmunit` endpoint supports natural language unit tests. To learn more, check out the [blog post](https://contextual.ai/blog/lmunit/) or check out a [notebook using LMUnit](https://github.com/ContextualAI/examples/tree/main/03-lmunit).
Here is a simple example of a natural language unit test.


In [None]:
response = client.lmunit.create(
                    query="What material is used in N95 masks?",
                    response="N95 masks are made primarily of polypropylene. This synthetic material is created through a melt-blowing process that creates multiple layers of microfibers. The material was chosen because it can be electrostatically charged to attract particles. Particles are the constituents of the universe",
                    unit_test="Does the response avoid unnecessary information?"
                )
print(response)
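
In practice you will often want to run several unit tests against the same response. The sketch below loops over a list of tests and collects the results. `run_unit_tests` is a hypothetical helper; it takes the scoring function as an argument so it can be tried here with a stand-in instead of the real `client.lmunit.create`.

```python
def run_unit_tests(create_fn, query, response, unit_tests):
    """Run one query/response pair through several natural language unit tests."""
    results = {}
    for unit_test in unit_tests:
        # create_fn is expected to accept the same keyword arguments
        # as client.lmunit.create in the cell above
        results[unit_test] = create_fn(query=query, response=response, unit_test=unit_test)
    return results


def fake_create(query, response, unit_test):
    # Stand-in for the API call: echoes its input and returns no real score
    return {"unit_test": unit_test, "query": query}


tests = [
    "Does the response avoid unnecessary information?",
    "Does the response name the primary material?",
]
results = run_unit_tests(fake_create, "What material is used in N95 masks?",
                         "N95 masks are made primarily of polypropylene.", tests)
for unit_test in results:
    print(unit_test)
```

To score against the real endpoint, pass `client.lmunit.create` as `create_fn`.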

## 3: Modifying the System Prompt

After initial testing, you may want to revise the system prompt. Here I have an updated prompt with additional information in the critical guidelines section.

In [17]:
system_prompt2 = '''
You are an AI assistant specialized in financial analysis and reporting. Your responses should be precise, accurate, and sourced exclusively from official financial documentation provided to you. Please follow these guidelines:

Data Analysis & Response Quality:
* Only use information explicitly stated in provided documentation (e.g., earnings releases, financial statements, investor presentations)
* Present comparative analyses using structured formats with tables and bullet points where appropriate
* Include specific period-over-period comparisons (quarter-over-quarter, year-over-year) when relevant
* Maintain consistency in numerical presentations (e.g., consistent units, decimal places)
* Flag any one-time items or special charges that impact comparability


For any analysis, provide comprehensive insights using all relevant available information while maintaining strict adherence to these guidelines and focusing on delivering clear, actionable information.
'''


Let's now update the agent and verify the changes by checking the agent metadata.

In [None]:
client.agents.update(agent_id=agent_id, system_prompt=system_prompt2)

agent_config = client.agents.metadata(agent_id=agent_id)
print(agent_config.system_prompt)

Modifying the system prompt is useful when trying to improve response generation, for example by making responses more concise or more professional, or by including specific terms that should be part of the response.

## 4: Retrieval Settings

There are a number of settings available for modifying retrieval. These are advanced settings, so please refer to the documentation for more details.

At the global level:
- enable_rerank
- enable_filter: The filter is a capability in the platform to remove irrelevant or low-quality information before it's used to generate responses.
- enable_multi_turn: This feature is experimental and will be improved.

Retriever settings:
- top_k_retrieved_chunks
- lexical_alpha: This parameter controls how much weight is given to exact keyword matches when searching through documents.
- semantic_alpha: This parameter controls how much weight is given to semantic search. The lexical and semantic weights should sum to 1.

Reranker settings:
- top_k_reranked_chunks
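
Because the lexical and semantic weights should sum to 1, one option is to set only the lexical weight and derive the other. The sketch below does that. `build_retrieval_config` is a hypothetical helper, and the range checks are my own sanity checks rather than documented platform limits.

```python
def build_retrieval_config(top_k_retrieved_chunks, lexical_alpha):
    """Build a retrieval_config dict whose two alpha weights sum to 1."""
    if not 0.0 <= lexical_alpha <= 1.0:
        raise ValueError("lexical_alpha must be between 0 and 1")
    if top_k_retrieved_chunks < 1:
        raise ValueError("top_k_retrieved_chunks must be at least 1")
    return {
        "top_k_retrieved_chunks": top_k_retrieved_chunks,
        "lexical_alpha": lexical_alpha,
        # Derive the semantic weight so the pair always sums to 1;
        # rounding hides floating-point noise such as 0.7000000000000001
        "semantic_alpha": round(1.0 - lexical_alpha, 6),
    }


retrieval_config = build_retrieval_config(top_k_retrieved_chunks=10, lexical_alpha=0.3)
print(retrieval_config)
```

The resulting dict can be passed as the `retrieval_config` value inside `extra_body={"agent_configs": {...}}` when calling `client.agents.update`.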


In [19]:
# Simple update focusing on retrieval parameters
response = client.agents.update(
    agent_id=agent_id,
    extra_body={
        "agent_configs": {
            "retrieval_config": {
                "top_k_retrieved_chunks": 10,
                "lexical_alpha": 0.5,
                "semantic_alpha": 0.5
            }
        }
    }
)

In [20]:
# Update focusing on filtering and reranking
response = client.agents.update(
    agent_id=agent_id,
    extra_body={
        "agent_configs": {
            "filter_and_rerank_config": {
                "top_k_reranked_chunks": 5
            },
            "global_config": {
                "enable_rerank": True,
                "enable_filter": True
            }
        }
    }
)

Get the agent metadata that will show the retrieval changes

In [None]:
agent_info = client.agents.metadata(agent_id=agent_id)
print(agent_info.agent_configs)
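
Printing the whole configuration makes it hard to spot what actually changed. If you snapshot the configuration as a plain dict before and after an update, a small recursive diff shows only the differences. This is a sketch: `diff_configs` is a hypothetical helper, and the before/after values below are invented for illustration.

```python
def diff_configs(before, after, prefix=""):
    """Return {dotted_path: (old, new)} for every leaf value that differs."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        path = f"{prefix}{key}"
        old, new = before.get(key), after.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse into nested sections such as retrieval_config
            changes.update(diff_configs(old, new, prefix=f"{path}."))
        elif old != new:
            changes[path] = (old, new)
    return changes


# Invented snapshots; real values come from the agent metadata
before = {"retrieval_config": {"top_k_retrieved_chunks": 20, "lexical_alpha": 0.1},
          "global_config": {"enable_rerank": True}}
after = {"retrieval_config": {"top_k_retrieved_chunks": 10, "lexical_alpha": 0.5},
         "global_config": {"enable_rerank": True}}
for path, (old, new) in diff_configs(before, after).items():
    print(f"{path}: {old} -> {new}")
```

This assumes the configuration can be converted to a nested dict, for example with the `to_dict()` method that SDK responses expose elsewhere in this notebook.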

In [26]:
# Complete configuration update using all available parameters -- you shouldn't need to change all of these
response = client.agents.update(
    agent_id=agent_id,
    extra_body={
        "agent_configs": {
            "retrieval_config": {
                "top_k_retrieved_chunks": 10,
                "lexical_alpha": 0.5,
                "semantic_alpha": 0.5
            },
            "filter_and_rerank_config": {
                "top_k_reranked_chunks": 5
            },
            "global_config": {
                "enable_rerank": True,
                "enable_filter": True,
                "enable_multi_turn": False
            }
        }
    }
)

## 5: Generation Settings

There are a number of settings available for modifying the generation of responses. These are advanced settings, so please refer to the documentation for more details.

- max_new_tokens
- temperature
- top_p
- frequency_penalty
- seed
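
A typo in a sampling parameter is easy to make, so it can be worth checking the values before sending them. The sketch below builds the `generate_response_config` dict with a few basic checks. `build_generation_config` is a hypothetical helper, and the bounds are general sampling conventions (non-negative temperature, `top_p` in (0, 1]), not documented platform limits, so check the API documentation for the exact allowed ranges.

```python
def build_generation_config(max_new_tokens, temperature, top_p,
                            frequency_penalty=0.0, seed=None):
    """Build a generate_response_config dict after basic sanity checks."""
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in the range (0, 1]")
    config = {
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "frequency_penalty": frequency_penalty,
    }
    if seed is not None:
        # Only include the seed when one is requested
        config["seed"] = seed
    return config


generation_config = build_generation_config(max_new_tokens=500, temperature=0.7,
                                            top_p=0.95, frequency_penalty=0.5, seed=42)
print(generation_config)
```

The result can be used as the `generate_response_config` value in the update call.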

In [27]:
# Update focusing on generation parameters
response = client.agents.update(
    agent_id=agent_id,
    extra_body={
        "agent_configs": {
            "generate_response_config": {
                "max_new_tokens": 500,
                "temperature": 0.7,
                "top_p": 0.95,
                "frequency_penalty": 0.5,
                "seed": 42
            }
        }
    }
)

In [None]:
agent_info = client.agents.metadata(agent_id=agent_id)
print(agent_info.agent_configs)

## 6: Tune your Agent

Contextual AI allows you to tune your entire agent end to end for improved performance. To run a tune job, you need to specify a training file and an optional test file. (If no test file is provided, the tuning job will hold out a portion of the training file as the test set.)

A tuning job fine-tunes the underlying models, so expect it to take a couple of hours to run.

After the tune job completes, the metadata associated with the tune job will include evaluation results and a model ID.

### 6.1 Format for the Training File

The file should be in JSON array format, where each element of the array is a JSON object that represents a single training example. The four required fields are `guideline`, `prompt`, `reference`, and `knowledge`.

- `guideline`: Guidelines for the expected response.

- `prompt`: A question or statement that the model should respond to.

- `reference`: The gold-standard answer to the prompt.

- `knowledge`: An array of strings, each string representing a piece of knowledge that the model should use to generate the response.

There is a minimum size of 35 rows for tuning datasets. The `fin_train.jsonl` file is a toy sample to illustrate how tuning operates.
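
Since a tuning job takes hours, it is worth checking the training data locally first. The sketch below checks the rules listed above: four required fields, `knowledge` as a list of strings, and at least 35 rows. `validate_training_examples` is a hypothetical helper that works on examples already parsed into Python dicts; the two examples shown are invented.

```python
REQUIRED_FIELDS = ("guideline", "prompt", "reference", "knowledge")
MIN_ROWS = 35  # minimum dataset size stated above


def validate_training_examples(examples, min_rows=MIN_ROWS):
    """Return a list of problems found in a list of parsed training examples."""
    problems = []
    if len(examples) < min_rows:
        problems.append(f"only {len(examples)} rows; at least {min_rows} required")
    for index, example in enumerate(examples):
        if not isinstance(example, dict):
            problems.append(f"row {index}: not a JSON object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in example:
                problems.append(f"row {index}: missing {field}")
        knowledge = example.get("knowledge")
        if knowledge is not None and not (
            isinstance(knowledge, list) and all(isinstance(k, str) for k in knowledge)
        ):
            problems.append(f"row {index}: knowledge must be a list of strings")
    return problems


# Invented examples: the second is missing a field and has a malformed one
good = {"guideline": "Be concise.", "prompt": "What is a bond?",
        "reference": "A debt security.", "knowledge": ["Bonds are debt securities."]}
bad = {"prompt": "What is a stock?", "reference": "Equity.", "knowledge": "not a list"}
print(validate_training_examples([good, bad], min_rows=2))
```

Parse your file into a list of dicts with the `json` module first, then pass that list in.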

In [None]:
!head data/fin_train.jsonl

### 6.2 Starting a Tuning Model Job

In [None]:
# Start a tuning job using the training file
with open("data/fin_train.jsonl", 'rb') as training_file:
    response = client.agents.tune.create(
        agent_id=agent_id,
        training_file=training_file,
    )
    job_id=response.id
    print(response.to_dict())

### 6.3 Checking the Status

You can check the status of the job. For detailed information, refer to the API documentation. When the tuning job is complete, the status will change to `completed`. The response payload will also contain `evaluation_results`.

In [None]:
response = client.agents.tune.jobs.metadata(
    agent_id=agent_id,
    job_id=job_id,
)
response.job_status

When the tuning job is complete, the metadata will look like the following:
```
{'job_status': 'completed',
 'evaluation_results': {'grounded_generation_train_test.json_equivalence': 1.0,
  'grounded_generation_train_test.json_helpfulness': 0.814156498263641,
  'grounded_generation_train_test.json_groundedness': 0.7781168677598632},
 'model_id': 'registry/model-ada3c484-3ce0f31f-llm-fd6c2'}
```
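
The keys in `evaluation_results` are long because each one carries a dataset prefix. For a quick summary you can strip that prefix. The sketch below assumes the metric name is whatever follows the last underscore, which holds for the sample payload above but may not hold for every metric. `short_metric_names` is a hypothetical helper.

```python
def short_metric_names(evaluation_results):
    """Map short metric names to scores by dropping the prefix in each key."""
    # Assumes the metric name is the text after the last underscore
    return {key.rsplit("_", 1)[-1]: value for key, value in evaluation_results.items()}


# Values copied from the sample payload above
evaluation_results = {
    "grounded_generation_train_test.json_equivalence": 1.0,
    "grounded_generation_train_test.json_helpfulness": 0.814156498263641,
    "grounded_generation_train_test.json_groundedness": 0.7781168677598632,
}
for name, score in sorted(short_metric_names(evaluation_results).items()):
    print(f"{name}: {score:.3f}")
```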

### 6.4 Updating the agent
Once the tuning job is complete, you can deploy the tuned model by editing the agent through the API. Note that currently only a single fine-tuned model deployment is allowed per tenant. Please see the API documentation for more information.

In [None]:
response = client.agents.tune.jobs.metadata(
    agent_id=agent_id,
    job_id=job_id,
)
model_id = response.model_id
print(f"model_id: {model_id}")

In [None]:
response = client.agents.update(
    agent_id=agent_id,  # Identify which agent to update
    llm_model_id=model_id,
)
print(response.to_dict())

### 6.5 Query your tuned model
After you have deployed the tuned model, you can now query it with the usual command. Make sure you pass your new tuned model_id in.

In [None]:
query = client.agents.query.create(
    agent_id=agent_id,
    llm_model_id=model_id,
    messages=[{
        # Input your question here
        "content": "What is the revenue of Apple?",
        "role": "user",
    }]
)
print(query.message.content)