This project contains a simple workflow for evaluating AI agents. The goal is to systematically assess and refine an AI agent's performance by testing, analyzing outputs, adjusting parameters, and retesting. This example implements a text summarization agent using:
- OpenAI API for text summarization
- Transformers library for semantic similarity analysis
The repository contains the following files:
- `agent.py`: Initial implementation of the text summarization agent
- `test_workflow.py`: First round of testing with sample inputs
- `metrics.py`: Evaluation metrics, including semantic similarity and readability
- `readability.py`: Calculates readability scores
- `semantic_similarity.py`: Computes semantic similarity between the original text and its summaries
- `edited_parameters.py`: Adjusted agent settings after analyzing initial results
- `edited_eval.py`: Retesting after modifying agent behavior
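As a rough illustration of the OpenAI-based summarization step in `agent.py`, the call might look like the minimal sketch below; the model name, prompt, and default parameter values are assumptions for illustration, not the repository's actual settings.

```python
# Hypothetical sketch of the summarization agent; the model name, prompt,
# and parameter values are illustrative assumptions, not the repo's code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(text: str, temperature: float = 0.7, max_tokens: int = 100) -> str:
    """Ask the model for a short summary of the given text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=temperature,
        max_tokens=max_tokens,
        messages=[
            {"role": "system", "content": "Summarize the user's text concisely."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()
```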
Input Text:
Climate change is a major contemporary challenge, characterized by rising global temperatures that cause extreme weather, melting ice, and ecosystem disruptions. Human activities like deforestation and industrial pollution exacerbate these effects. Scientists stress the urgency of reducing greenhouse gas emissions to mitigate environmental impacts.
Initial Summary Output:
Climate change leads to extreme weather, melting ice, and ecosystem disruptions. Human activities worsen the problem. Scientists urge reducing emissions.
Initial evaluation metrics:
- Semantic Similarity Score: 0.90
- Flesch Reading Ease: -2.68 (very difficult to read)
- SMOG Index: 16.30 (difficult to read)
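Scores like these can be computed with off-the-shelf libraries. The sketch below assumes the `sentence-transformers` and `textstat` packages and the `all-MiniLM-L6-v2` embedding model; the actual `semantic_similarity.py` and `readability.py` scripts may use different models or libraries.

```python
# Illustrative metric helpers; the library and model choices are assumptions.
from sentence_transformers import SentenceTransformer, util
import textstat

_model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(original: str, summary: str) -> float:
    """Cosine similarity between embeddings of the original text and its summary."""
    embeddings = _model.encode([original, summary], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

def readability(summary: str) -> dict:
    """Readability scores: higher Flesch means easier, lower SMOG means easier."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(summary),
        "smog_index": textstat.smog_index(summary),
    }
```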
Analysis of the initial results:
- The summary retained the key points but was still complex
- The readability scores indicated high difficulty
- The agent needed adjustments to produce more accessible summaries
Adjustments made:
- Adjusted temperature and max_tokens to simplify the language
- Applied post-processing for clarity
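A change of this kind could look like the sketch below, which reuses the `client` from the earlier snippet; the specific temperature and max_tokens values, the plain-language instruction, and the post-processing step are guesses for illustration rather than the exact contents of `edited_parameters.py`.

```python
# Hypothetical adjusted call; the lower temperature, larger token budget,
# plain-language instruction, and post-processing step are assumptions.
def summarize_simplified(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.3,   # lower temperature for more predictable wording
        max_tokens=150,    # room for several short sentences instead of dense ones
        messages=[
            {
                "role": "system",
                "content": "Summarize the user's text in short, plain-language "
                           "sentences that a general reader can easily follow.",
            },
            {"role": "user", "content": text},
        ],
    )
    summary = response.choices[0].message.content.strip()
    # Simple post-processing pass: collapse stray whitespace for cleaner output.
    return " ".join(summary.split())
```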
New Summary Output:
Climate change is a big problem today. It causes higher temperatures, extreme weather, and melting ice. This affects nature and wildlife. Human actions like cutting down trees and pollution make it worse. Scientists say we must act now to cut down on greenhouse gases.
New evaluation metrics:
- Semantic Similarity Score: 0.88
- Flesch Reading Ease: 71.00 (easy to read)
- SMOG Index: 7.60 (much simpler language)
Analysis of the new results:
- The summary remained accurate while becoming more readable
- Better balance between information retention and accessibility
- The agent was successfully refined through iterative testing
This project demonstrates how to evaluate an agentic AI workflow using a structured testing process:
- Generate initial outputs → Assess AI performance
- Measure metrics → Semantic similarity, readability, etc.
- Identify areas for improvement → Adjust prompts, parameters, or processing
- Retest and compare → Observe performance changes
This approach is useful for any AI-driven agent, from summarization to decision-making systems, ensuring continuous improvement and alignment with intended objectives.
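Tying the earlier sketches together, a minimal end-to-end pass over one input could look like this; the `evaluate` helper and the before/after comparison are hypothetical, not code from the repository.

```python
# Hypothetical end-to-end pass combining the earlier sketches.
def evaluate(summarize_fn, original: str) -> dict:
    """Summarize the text and score the result on all metrics."""
    summary = summarize_fn(original)
    scores = {"semantic_similarity": semantic_similarity(original, summary)}
    scores.update(readability(summary))
    return scores

if __name__ == "__main__":
    text = "Climate change is a major contemporary challenge..."  # sample input
    baseline = evaluate(summarize, text)             # initial parameters
    refined = evaluate(summarize_simplified, text)   # adjusted parameters
    for metric, before in baseline.items():
        print(f"{metric}: {before:.2f} -> {refined[metric]:.2f}")
```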
To run the workflow, clone the repository and run the scripts in order:
- `git clone https://github.com/ashleysally00/agent_eval_testing_workflow.git`
- `cd agent_eval_testing_workflow`
- `python test_workflow.py` to generate the initial summaries
- `python metrics.py` to score them for semantic similarity and readability
- `python edited_parameters.py` to rerun the agent with adjusted settings
- `python edited_eval.py` to re-evaluate the new outputs
This workflow offers a clear method for testing and refining AI agents. By using evaluation metrics and making iterative improvements, we can enhance their performance and create more user-friendly AI outputs.