## Lab 3: Improving your Agent

Contextual AI gives you multiple methods for improving your agent's overall performance. Two methods available via the API are modifying the system prompt and tuning the models. We recommend starting with the system prompt before moving on to tuning. Please reach out to your account team for more best practices around tuning.

Let's go through both options:

In [1]:
import os
import requests
from contextual import ContextualAI
from IPython.display import display, Markdown

🔑 Replace the placeholder value of `API_KEY` with your actual API key 👇🏼

In [10]:
API_KEY = "key-..."

In [12]:
client = ContextualAI(
    api_key=API_KEY,  # the API key set in the previous cell
)

**Important**: Replace "agent_id" with the Agent you created in Lab 1 👇🏼

In [13]:
agent_id="..."

### 6.1 Revising the system prompt


After initial testing, you may want to revise the system prompt. Here I have an updated prompt with additional instructions in the Critical Guidelines section.

In [14]:
system_prompt2 = '''
You are an AI assistant specialized in financial analysis and reporting. Your responses should be precise, accurate, and sourced exclusively from official financial documentation provided to you. Please follow these guidelines:

Data Analysis & Response Quality:
* Only use information explicitly stated in provided documentation (e.g., earnings releases, financial statements, investor presentations)
* Present comparative analyses using structured formats with tables and bullet points where appropriate
* Include specific period-over-period comparisons (quarter-over-quarter, year-over-year) when relevant
* Maintain consistency in numerical presentations (e.g., consistent units, decimal places)
* Flag any one-time items or special charges that impact comparability

Technical Accuracy:
* Use industry-standard financial terminology
* Define specialized acronyms on first use
* Never interchange distinct financial terms (e.g., revenue, profit, income, cash flow)
* Always include units with numerical values
* Pay attention to fiscal vs. calendar year distinctions
* Present monetary values with appropriate scale (millions/billions)

Response Format:
* Begin with a high-level summary of key findings when analyzing data
* Structure detailed analyses in clear, hierarchical formats
* Use markdown for lists, tables, and emphasized text
* Maintain a professional, analytical tone
* Present quantitative data in consistent formats (e.g., basis points for ratios)

Critical Guidelines:
* Do not make forward-looking projections unless directly quoted from source materials
* Do not answer any questions for 2024 or later
* Avoid opinions, speculation, or assumptions
* If information is unavailable or irrelevant, clearly state this without additional commentary
* Answer questions directly, then stop
* Do not reference source document names or file types in responses
* Focus only on information that directly answers the query

For any analysis, provide comprehensive insights using all relevant available information while maintaining strict adherence to these guidelines and focusing on delivering clear, actionable information.
'''


Let's now update the agent and verify the change by checking the agent metadata.

In [15]:
client.agents.update(agent_id=agent_id, system_prompt=system_prompt2)

agent_config = client.agents.metadata(agent_id=agent_id)
print(agent_config.system_prompt)


You are an AI assistant specialized in financial analysis and reporting. Your responses should be precise, accurate, and sourced exclusively from official financial documentation provided to you. Please follow these guidelines:

Data Analysis & Response Quality:
* Only use information explicitly stated in provided documentation (e.g., earnings releases, financial statements, investor presentations)
* Present comparative analyses using structured formats with tables and bullet points where appropriate
* Include specific period-over-period comparisons (quarter-over-quarter, year-over-year) when relevant
* Maintain consistency in numerical presentations (e.g., consistent units, decimal places)
* Flag any one-time items or special charges that impact comparability

Technical Accuracy:
* Use industry-standard financial terminology
* Define specialized acronyms on first use
* Never interchange distinct financial terms (e.g., revenue, profit, income, cash flow)
* Always include units with nu

Now that you have updated the agent, try running another evaluation job to check whether performance has improved.

In [None]:
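# Download the evaluation set if it is not already present locally
if not os.path.exists('data/eval_short.csv'):
    print("Fetching data/eval_short.csv")
    os.makedirs('data', exist_ok=True)  # open() fails if the directory is missing
    response = requests.get("https://raw.githubusercontent.com/ContextualAI/examples/refs/heads/main/02-hands-on-lab/data/eval_short.csv")
    response.raise_for_status()  # stop on a failed download instead of saving an error page
    with open('data/eval_short.csv', 'wb') as f:
        f.write(response.content)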

In [28]:
with open('data/eval_short.csv', 'rb') as f:
    eval_result = client.agents.evaluate.create(
        agent_id=agent_id,
        metrics=["equivalence", "groundedness"],
        evalset_file=f
    )

In [36]:
from tqdm import tqdm

# Fetch the current job status and show how many predictions have been processed
eval_status = client.agents.evaluate.jobs.metadata(agent_id=agent_id, job_id=eval_result.id)

progress = tqdm(total=eval_status.job_metadata.num_predictions)
progress.update(eval_status.job_metadata.num_processed_predictions)
progress.set_description("Evaluation Progress")

EvaluationJobMetadata(dataset_name='eval_short_2ae7ed23-9456-4277-9c80-ff5dabf0f2c2_results', job_metadata=JobMetadata(num_failed_predictions=0, num_predictions=12, num_successful_predictions=12), metrics={'equivalence_score': {'score': 0.8333333333333334}, 'groundedness_score': {'score': 0.5833333333333334}}, status='completed')
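
To check whether the revised prompt actually helped, it is useful to compare the scores against an earlier run rather than read a single number in isolation. The sketch below assumes each run's `metrics` is a dictionary shaped like the output above (`{'equivalence_score': {'score': ...}}`). `metric_deltas` is a hypothetical helper, not part of the Contextual SDK, and the baseline values are placeholders for illustration, not real results.

```python
def metric_deltas(before, after):
    """Return {metric_name: after_score - before_score} for metrics present in both runs."""
    deltas = {}
    for name, value in after.items():
        if name in before:
            deltas[name] = round(value["score"] - before[name]["score"], 4)
    return deltas


# Placeholder baseline scores, for illustration only (NOT real results).
baseline = {
    "equivalence_score": {"score": 0.75},
    "groundedness_score": {"score": 0.5},
}
# Scores copied from the evaluation output above.
updated = {
    "equivalence_score": {"score": 0.8333333333333334},
    "groundedness_score": {"score": 0.5833333333333334},
}

for name, delta in metric_deltas(baseline, updated).items():
    print(f"{name}: {delta:+.4f}")
```

In practice you would pass the `metrics` dictionaries from two completed evaluation jobs in place of the hard-coded values.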

### 6.2 Tuning the Contextual System

To run a tune job, you need to specify a training file and, optionally, a test file. (If no test file is provided, the tuning job will hold out a portion of the training file as the test set.)

A tuning job fine-tunes the underlying models, so expect it to take a couple of hours to run.

After the tune job completes, the metadata associated with the tune job will include evaluation results and a model ID.
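
If you would rather control the held-out set yourself than let the tuning job pick one, you can split the examples before uploading. This is a minimal sketch that assumes the examples are already loaded as a Python list. `split_examples` and the 20% test fraction are illustrative choices, and how the two resulting files are passed to the tuning endpoint should be checked against the API documentation.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Dummy examples, for illustration only.
examples = [{"prompt": f"question {i}"} for i in range(10)]
train, test = split_examples(examples)
print(len(train), len(test))
```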

#### 6.2.1 Tuning dataset

The file should be in JSON array format, where each element of the array is a JSON object that represents a single training example. The four required fields are guideline, prompt, reference, and knowledge.

- knowledge: An array of strings, each string representing a piece of knowledge that the model should use to generate the response.

- reference: The gold-standard answer to the prompt.

- guideline: Guidelines for the expected response.

- prompt: A question or statement that the model should respond to.
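
Checking the file locally before starting a long-running job helps catch malformed examples early. The sketch below assumes the four fields described above, with `reference` as the gold-standard answer (matching the sample data). `validate_examples` is a hypothetical helper, not part of the Contextual SDK, and the sample records are made up for illustration.

```python
import json

REQUIRED_FIELDS = ("guideline", "prompt", "reference", "knowledge")


def validate_examples(examples):
    """Return a list of human-readable problems; an empty list means the data looks valid."""
    if not isinstance(examples, list):
        return ["top-level JSON value must be an array"]
    problems = []
    for i, example in enumerate(examples):
        if not isinstance(example, dict):
            problems.append(f"example {i}: not a JSON object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in example:
                problems.append(f"example {i}: missing field '{field}'")
        knowledge = example.get("knowledge")
        if knowledge is not None and not (
            isinstance(knowledge, list) and all(isinstance(k, str) for k in knowledge)
        ):
            problems.append(f"example {i}: 'knowledge' must be an array of strings")
    return problems


# Made-up records: the first is well-formed, the second is deliberately broken.
sample = json.loads(
    '[{"guideline": "Be concise.", "prompt": "What was revenue?", '
    '"knowledge": ["Revenue was $10M."], "reference": "Revenue was $10M."}, '
    '{"prompt": "Missing fields", "knowledge": "not a list"}]'
)
for problem in validate_examples(sample):
    print(problem)
```

To check a real file, load it with `json.load` and pass the result to `validate_examples` before creating the tune job.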

In [37]:
!head data/fin_train.jsonl

[
{"guideline": "The response should clearly communicate strategic priorities and potential risks associated with forward-looking statements.", "prompt": "What are the strategic priorities outlined in Apple's 2024 investor call?", "knowledge": ["Apple plans to expand its services segment, targeting 15% YoY growth.", "Investments in AR/VR technology are expected to increase by 30%.", "Sustainability goals include achieving carbon neutrality across its supply chain by 2030.", "Risks include potential regulatory scrutiny in the EU and economic headwinds impacting consumer spending."], "reference": "Apple's 2024 strategic priorities focus on expanding its services segment (targeting 15% YoY growth), increasing investments in AR/VR by 30%, and advancing sustainability goals to achieve carbon neutrality by 2030. However, the company faces risks such as EU regulatory challenges and economic pressures on consumer spending."},
{"guideline": "The response should focus on actionable insights base

#### 6.2.2 Starting a tuning model job

In [38]:
with open('data/fin_train.jsonl', 'rb') as f:
    tune_job = client.agents.tune.create(
        agent_id=agent_id,
        training_file=f
    )

tune_job_id = tune_job.id
print(f"Tune job created: {tune_job_id}")

Tune job created: 70e28e9f-e471-4876-afc6-d431c57c4c39


In [39]:
print(agent_id)
print(tune_job_id)

faf2cc13-a503-40e1-adc9-432b977d9b4a
70e28e9f-e471-4876-afc6-d431c57c4c39


#### 6.2.3 Checking the status

You can check the status of the job using the API. For detailed information, refer to the API documentation. When the tuning job is complete, the status will change to completed. The response payload will also contain evaluation_results, such as scores for equivalence, helpfulness, and groundedness.

In [40]:
tune_metadata = client.agents.tune.jobs.metadata(
    agent_id=agent_id,
    job_id=tune_job_id
)
print("Tuning job metadata:", tune_metadata)

Tuning job metadata: TuneJobMetadata(job_status='pending', evaluation_results=None, model_id=None)


When the tuning job is complete, the metadata will look like the following:
```
{'job_status': 'completed',
 'evaluation_results': {'grounded_generation_train_test.json_equivalence': 1.0,
  'grounded_generation_train_test.json_helpfulness': 0.814156498263641,
  'grounded_generation_train_test.json_groundedness': 0.7781168677598632},
 'model_id': 'registry/model-ada3c484-3ce0f31f-llm-fd6c2'}
 ```
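
Because a tuning job can take a while, you may prefer to poll instead of re-running the status cell by hand. The sketch below is a generic polling loop. `wait_for_job`, the polling interval, and the `failed` terminal status are assumptions (check the API documentation for the actual status values). It accepts any zero-argument function that returns an object with a `job_status` attribute, so it is demonstrated here with a stand-in rather than a live API call.

```python
import time
from types import SimpleNamespace


def wait_for_job(fetch_metadata, interval_s=60.0, timeout_s=4 * 3600,
                 terminal=("completed", "failed")):  # "failed" is an assumed status name
    """Call fetch_metadata() until its job_status is terminal or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        metadata = fetch_metadata()
        if metadata.job_status in terminal:
            return metadata
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{metadata.job_status}' after {timeout_s}s")
        time.sleep(interval_s)


# Stand-in for the real call, which in this lab would be:
#   lambda: client.agents.tune.jobs.metadata(agent_id=agent_id, job_id=tune_job_id)
fake_statuses = iter(["pending", "pending", "completed"])


def fake_fetch():
    return SimpleNamespace(job_status=next(fake_statuses), model_id="placeholder")


result = wait_for_job(fake_fetch, interval_s=0.01, timeout_s=5)
print(result.job_status)
```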

#### 6.2.4 Updating the agent
Once the tune job is complete, you can deploy the tuned model by editing the agent through the API. Note that currently only a single fine-tuned model deployment is allowed per tenant. Please see the API documentation for more information.

In [None]:
client.agents.update(agent_id=agent_id, llm_model_id=tune_metadata.model_id)

print("Agent updated with tuned model")
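
Since `model_id` is `None` until the job finishes (as in the pending output earlier), a small guard avoids updating the agent with an empty model. `tuned_model_id` is a hypothetical helper. It only assumes that the metadata exposes `job_status` and `model_id` attributes, as in the outputs shown in this lab, and the objects below are stand-ins with a placeholder model ID.

```python
from types import SimpleNamespace


def tuned_model_id(metadata):
    """Return metadata.model_id, or raise if the tune job has not produced a model yet."""
    if metadata.job_status != "completed" or not metadata.model_id:
        raise RuntimeError(
            f"tuned model not ready (job_status={metadata.job_status!r})"
        )
    return metadata.model_id


# Stand-ins mirroring the two metadata shapes shown in this lab.
pending = SimpleNamespace(job_status="pending", evaluation_results=None, model_id=None)
done = SimpleNamespace(job_status="completed", model_id="registry/model-placeholder")

print(tuned_model_id(done))
try:
    tuned_model_id(pending)
except RuntimeError as err:
    print(err)
```

With this in place, the update call becomes `client.agents.update(agent_id=agent_id, llm_model_id=tuned_model_id(tune_metadata))`.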

#### 6.2.5 Evaluate the agent
Once the new model is deployed, we can evaluate our agent again.

In [None]:
with open('data/eval_short.csv', 'rb') as f:
    eval_result = client.agents.evaluate.create(
        agent_id=agent_id,
        metrics=["equivalence", "groundedness"],
        evalset_file=f
    )

In [41]:
from tqdm import tqdm

# Fetch the current job status and show how many predictions have been processed
eval_status = client.agents.evaluate.jobs.metadata(agent_id=agent_id, job_id=eval_result.id)

progress = tqdm(total=eval_status.job_metadata.num_predictions)
progress.update(eval_status.job_metadata.num_processed_predictions)
progress.set_description("Evaluation Progress")

EvaluationJobMetadata(dataset_name='eval_short_2ae7ed23-9456-4277-9c80-ff5dabf0f2c2_results', job_metadata=JobMetadata(num_failed_predictions=0, num_predictions=12, num_successful_predictions=12), metrics={'equivalence_score': {'score': 0.8333333333333334}, 'groundedness_score': {'score': 0.5833333333333334}}, status='completed')

## Next Steps

In this workshop, we've created a RAG agent in the finance domain, evaluated the agent, and tuned it for better performance. You can learn more at [docs.contextual.ai](https://docs.contextual.ai/). Finally, reach out to your account team if you have further questions or issues. Thanks for coming! 👋