(Example inspiration: TODO/plan executors such as Cursor — plan → execute → log)
Build a small AI agent that helps users tackle complex goals by breaking them into actionable steps and executing them.
Demo video: https://drive.google.com/file/d/1GfZFA9w_loGDRyLv-4hhla3V6mYl2jlR/view?usp=sharing
- Python 3.12+
- uv package manager
git clone <repo-url>
cd testtask_aiagent
uv sync
Copy the example env file and add your API keys:
cp .env.example .env
Required keys:
- OPENAI_API_KEY - OpenAI API key
- TAVILY_API_KEY - Tavily API key for web search
uv run python -m agent
uv run pytest # run tests
uv run ruff check src/ # lint
- What type of goals or domain to focus on - General Research Assistant
- How the AI interaction works (chat, CLI, minimal UI, etc.) - Rich chat in CLI
- What level of automation vs. user confirmation you provide:
  - User states the task
  - The plan is built
  - User rejects the plan and adds clarifications
  - The new plan is built and accepted
  - The research is done (by calling assistants internally)
  - The final report is provided to the user
  - The user adds some new demands/questions/instructions
  - The new plan is built based on the previous context and new demands
  - ...
- Persistence - the main user conversation can be resumed from JSON
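The confirmation flow above can be sketched as a small loop that rebuilds the plan until the user accepts it. This is a minimal illustration, not the project's code: `Conversation`, `negotiate_plan`, `fake_planner`, and the convention that empty input means "accept" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Conversation:
    """Running context shared across planning rounds (hypothetical structure)."""
    goal: str
    clarifications: list[str] = field(default_factory=list)


def negotiate_plan(
    conversation: Conversation,
    build_plan: Callable[[Conversation], list[str]],
    ask_user: Callable[[list[str]], str],
    max_rounds: int = 5,
) -> list[str]:
    """Rebuild the plan until the user accepts it.

    `ask_user` returns an empty string to accept, or free-text feedback to reject.
    """
    for _ in range(max_rounds):
        plan = build_plan(conversation)
        feedback = ask_user(plan).strip()
        if not feedback:
            return plan
        # Rejected: keep the feedback so the next plan takes it into account.
        conversation.clarifications.append(feedback)
    raise RuntimeError("No plan accepted within the allowed number of rounds")


# Stand-ins for the LLM planner and the CLI prompt (assumptions for the demo).
def fake_planner(conv: Conversation) -> list[str]:
    steps = [f"Research: {conv.goal}"]
    steps += [f"Address: {c}" for c in conv.clarifications]
    return steps


answers = iter(["focus on passenger cars", ""])
conv = Conversation(goal="EU diesel ban")
accepted = negotiate_plan(conv, fake_planner, lambda plan: next(answers))
print(accepted)
```

Passing the planner and the prompt function as arguments keeps the loop testable without an LLM or a terminal.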
I spent about 2 days in total:
- 2-3h planning
- 1h finalizing the blueprint with Claude
- 5h running Claude to implement the basic structures
- 5h cleaning up and making it work
- 2h adding the token counter and context handling
- 3h running the final demos (adjusted one prompt) and writing the report
- Clear prompt structure and instructions - prompts/templates
- Thoughtful context selection (what to keep vs. drop) - assistants' contexts are separate; only reports are shared throughout the system
- Basic handling of longer conversations or state / Avoiding prompt bloat - see task_executor, plan_executor, and __main__; for handling of a very short context window, see the demo transcript demo_transcripts/1_eu_diesel_ban_5k_context_window_85e9aed3
- High-level goal → structured TODO list - planner
- Simple execution loop (select task → execute → update status) - plan_executor
- Integration of at least one real tool (web search, document reading, API call, vector search, etc.) - web search & extraction (Tavily API) - tools
- Transparent logging of what the agent is doing - full debug logging to file, rich-formatted transcripts of the main conversation and task flows (see demo_transcripts), basic MLflow tracing
- Clear explanation of how you would test or evaluate the system - that's a task probably bigger than the programming:
  - Basic Python unit tests, some are already in tests
  - Gold standard dataset with a) basic rule-based checks, b) LLM-based evaluations
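As an illustration of the "structured TODO list" and "select task → execute → update status" items, here is a minimal sketch. The names `Task`, `Status`, `run_plan`, and `fake_execute` are hypothetical and are not taken from the planner or plan_executor modules; the failure handling is one possible choice, not necessarily what the project does.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(str, Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Task:
    title: str
    status: Status = Status.PENDING
    report: str = ""


def run_plan(tasks: list[Task], execute: Callable[[Task], str]) -> list[Task]:
    """Select each pending task in order, execute it, and record the outcome."""
    for task in tasks:
        if task.status is not Status.PENDING:
            continue  # skip tasks already handled, e.g. in a resumed run
        try:
            task.report = execute(task)
            task.status = Status.DONE
        except Exception as exc:  # keep going so one failure does not stop the plan
            task.report = f"error: {exc}"
            task.status = Status.FAILED
        print(f"[{task.status.value}] {task.title}")
    return tasks


# Stand-in executor (assumption): fails for one task to show the status update.
def fake_execute(task: Task) -> str:
    if "unreachable" in task.title:
        raise ValueError("source not available")
    return f"notes on {task.title}"


plan = [Task("collect sources"), Task("read unreachable archive"), Task("summarize")]
run_plan(plan, fake_execute)
```

Storing the status and report on the task itself means the whole plan can be serialized and inspected after the run.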
- The interaction with the user is described above, and it's an important part of the design, too
- Within the assistant:
  - tasks are executed sequentially as planned
  - reports of previous tasks are shared, so each assistant sees the main goal, the work done so far, and its own task
  - all reports are then passed to the main researcher to produce the final answer
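The report-sharing scheme could look roughly like the sketch below: each assistant gets a fresh prompt built from the goal, the earlier reports, and its own task, and only its report is carried forward. `build_task_prompt`, `run_tasks`, and the prompt wording are assumptions for illustration; the real templates live in the project's prompts.

```python
from typing import Callable


def build_task_prompt(goal: str, reports: list[tuple[str, str]], task: str) -> str:
    """Assemble what one assistant sees: the goal, earlier reports, its own task."""
    parts = [f"Main goal: {goal}"]
    if reports:
        parts.append("Work done so far:")
        parts += [f"- {title}: {report}" for title, report in reports]
    parts.append(f"Your task: {task}")
    return "\n".join(parts)


def run_tasks(
    goal: str,
    tasks: list[str],
    run_assistant: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Run tasks in order; only each assistant's report is kept, not its full context."""
    reports: list[tuple[str, str]] = []
    for task in tasks:
        prompt = build_task_prompt(goal, reports, task)
        reports.append((task, run_assistant(prompt)))
    return reports


# Stand-in for an LLM-backed assistant (assumption): records the prompts it receives.
seen_prompts: list[str] = []


def fake_assistant(prompt: str) -> str:
    seen_prompts.append(prompt)
    return f"report #{len(seen_prompts)}"


reports = run_tasks("compare options", ["list options", "rank options"], fake_assistant)
# The same helper can build the main researcher's input from all reports.
final_prompt = build_task_prompt("compare options", reports, "write the final answer")
print(final_prompt)
```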
- Within a task:
  - tools are called until the model stops requesting tool calls and gives the final text answer
  - if the tool call limit or the context window limit is reached, the model is asked to provide the final report
  - the context window limit is enforced with an offset that leaves space for the model to write the final report
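A rough sketch of such a tool loop is shown below. Everything in it is an assumption made for illustration: the `Reply` type, the 4-characters-per-token estimate, the default limits, and the wording of the "final report" message. The project's task_executor may count tokens and force the final answer differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reply:
    text: str = ""
    tool_query: Optional[str] = None  # set when the model requests a tool call


def estimate_tokens(messages: list[str]) -> int:
    # Crude assumption: about 4 characters per token.
    return sum(len(m) for m in messages) // 4


def run_task(
    task: str,
    model: Callable[[list[str]], Reply],
    tool: Callable[[str], str],
    max_tool_calls: int = 5,
    context_limit: int = 5000,
    report_reserve: int = 1000,
) -> str:
    """Call tools until the model answers, or force a report when a limit is hit."""
    messages = [task]
    calls = 0
    while True:
        # Stop early enough that the final report still fits in the window.
        over_budget = estimate_tokens(messages) > context_limit - report_reserve
        if calls >= max_tool_calls or over_budget:
            messages.append("Limit reached. Write the final report now.")
            # With a real API, tools would also be disabled for this last call.
            return model(messages).text
        reply = model(messages)
        if reply.tool_query is None:
            return reply.text  # the model answered without asking for a tool
        messages.append(tool(reply.tool_query))
        calls += 1


# Stand-in model (assumption): keeps asking for the tool until told to stop.
def greedy_model(messages: list[str]) -> Reply:
    if messages[-1].startswith("Limit reached"):
        return Reply(text=f"report based on {len(messages) - 2} tool results")
    return Reply(tool_query="diesel ban timeline")


by_call_limit = run_task(
    "Research the topic", greedy_model, lambda q: f"results for {q}", max_tool_calls=3
)
by_context_limit = run_task(
    "Research the topic", greedy_model, lambda q: "x" * 8000, max_tool_calls=10
)
print(by_call_limit, "|", by_context_limit)
```

The reserve is subtracted before each model call, so the loop stops while there is still room for the report rather than after the window has overflowed.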
- In general, see demo_transcripts; they show the process clearly
- First of all, a minimal gold standard dataset should be gathered, so that improvements can be measured more or less reliably
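A gold standard dataset with rule-based checks might start as small as the sketch below. The `GoldCase` fields, the individual checks, and the placeholder report text are assumptions for illustration; counting "http" occurrences is only a crude proxy for cited sources, and LLM-based evaluation is not covered here.

```python
from dataclasses import dataclass, field


@dataclass
class GoldCase:
    """One entry of a hypothetical gold standard dataset."""
    goal: str
    must_mention: list[str] = field(default_factory=list)
    min_sources: int = 1


def rule_checks(report: str, case: GoldCase) -> dict[str, bool]:
    """Cheap deterministic checks that need no LLM call."""
    text = report.lower()
    return {
        "non_empty": bool(report.strip()),
        "mentions_all_terms": all(term.lower() in text for term in case.must_mention),
        "cites_enough_sources": text.count("http") >= case.min_sources,
    }


def pass_rate(results: list[dict[str, bool]]) -> float:
    """Share of cases where every rule passed."""
    passed = sum(all(r.values()) for r in results)
    return passed / len(results) if results else 0.0


# Placeholder data (not real agent output).
case = GoldCase(goal="summarize topic alpha", must_mention=["alpha"], min_sources=1)
good = rule_checks("Topic Alpha in brief. Source: https://example.com/a", case)
bad = rule_checks("No relevant findings.", case)
print(good, bad, pass_rate([good, bad]))
```

Tracking the pass rate across prompt or model changes gives a first, rough regression signal before more expensive evaluations are added.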
- The main flaw of the current design is the fixed plan. That may be good for a coding assistant, but it is inappropriate for researching new information. I would make the creation of assistants dynamic, depending on what has been found so far. Of course, that would require context handling in the main researcher itself
- Web page extraction should be wrapped into a separate summarization/extraction LLM call, so that more web pages can be extracted within a single assistant run without hitting the context limit
- A better model than gpt-5-mini should be used as the main researcher