This project consists of creating two agents that can generate Python code, which is then executed in a sandbox.
It implements the ReAct pattern, iterating over observations until the final_answer function is called.
The sandbox is built to satisfy the security requirements of Section 4.2:
- Isolation: Code runs in a dedicated subprocess instance via the runner.
- Resource Limits:
  * Memory: Restricted to 512MB using resource.setrlimit.
  * Time: Hard 30s timeout per execution block managed by the controller.
- Persistence: A globals dictionary is maintained in the runner subprocess, allowing variables and functions to persist between execution blocks until a timeout or crash occurs.
- Import Allowlist: Only authorized modules (e.g., math, re, itertools) can be imported.
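The persistence and import allowlist points above can be sketched roughly as follows. This is a minimal in-process illustration, not the project's actual runner: the names `ALLOWED_MODULES`, `guarded_import`, `make_session`, and `run_block` are hypothetical, and the subprocess isolation, memory cap, and timeout are left out.

```python
import builtins

# Assumed allowlist, mirroring the example modules named above.
ALLOWED_MODULES = {"math", "re", "itertools"}


def guarded_import(name, globals=None, locals=None, fromlist=(), level=0):
    """Replacement for __import__ that rejects modules outside the allowlist."""
    root = name.split(".")[0]
    if root not in ALLOWED_MODULES:
        raise ImportError(f"import of '{name}' is not allowed in the sandbox")
    return builtins.__import__(name, globals, locals, fromlist, level)


def make_session():
    """Create the globals dictionary that persists between code blocks."""
    sandbox_builtins = dict(vars(builtins))
    sandbox_builtins["__import__"] = guarded_import
    return {"__builtins__": sandbox_builtins}


def run_block(session, code):
    """Execute one code block in the session; return (ok, error_message)."""
    try:
        exec(code, session)
        return True, ""
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"


session = make_session()
print(run_block(session, "import math\nroot = math.sqrt(16)"))  # state is stored
print(run_block(session, "total = root + 1"))                   # earlier name still visible
print(run_block(session, "import os"))                          # rejected by the allowlist
```

Swapping `__import__` only filters import statements. On its own it is not a security boundary, which is why the sandbox also relies on the subprocess isolation and resource limits listed above.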
The agent follows a Thought -> Act -> Observe cycle:
- Thought: The LLM reasons about the task and determines the next step.
- Act: The LLM calls an MCP tool or writes a Python code block.
- Observe: The sandbox executes the code and returns the stdout or error traceback.
- Iterate: The agent uses the observation to refine its solution until final_answer is called.
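The cycle above can be sketched as a small loop. This is only an illustration under assumptions: `react_loop` and `scripted_llm` are hypothetical names, the LLM is replaced by a scripted function, and a plain `exec` stands in for the real sandbox.

```python
import contextlib
import io
import traceback


def react_loop(llm_step, max_steps=10):
    """Run Thought -> Act -> Observe until the generated code calls final_answer()."""
    result = {}

    def final_answer(value):
        result["answer"] = value

    namespace = {"final_answer": final_answer}  # persists across steps
    history = []
    for _ in range(max_steps):
        thought, code = llm_step(history)            # Thought + Act
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, namespace)                # stand-in for the sandbox
            observation = buffer.getvalue()
        except Exception:
            observation = buffer.getvalue() + traceback.format_exc()
        history.append((thought, code, observation)) # Observe
        if "answer" in result:                       # stop once final_answer ran
            return result["answer"], history
    return None, history


def scripted_llm(history):
    """Hypothetical stand-in for the model: reacts to the last observation."""
    if not history:
        return "Try a division first.", "print(1 / 0)"
    if "ZeroDivisionError" in history[-1][2]:
        return "That failed; compute the value directly.", "value = 6 * 7\nprint(value)"
    return "The value looks right.", "final_answer(value)"


answer, history = react_loop(scripted_llm)
print(answer, len(history))
```

The scripted model fails on its first attempt, reads the traceback in the observation, corrects itself, and only then calls `final_answer`, which is the behaviour the Iterate step describes.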
- Model Context Protocol Documentation
- SWE-bench Paper
- AI Usage:
An open-source AI tool (PersonalGuru) was used to generate a learning path divided into 6 modules:
- The ReAct Reasoning Pattern for LLM Agents
- Building Secure Python Execution Sandboxes
- Model Context Protocol (MCP) Architecture
- Multi-Provider LLM Abstraction and Token Management
- SWE-bench and MBPP Benchmarking for Agents
- Advanced Prompt Engineering for Tool-Use Agents
The software used AI to generate chapters divided into sub-topics, each with quizzes and exercises covering the different concepts.
I also used AI to discuss different approaches to the sandbox design, especially to understand whether the MCP tools should be executed inside or outside the runner. I decided that the controller would be in charge of executing the MCP tools and passing the result to the runner, which keeps MCP isolated from the runner.
I used AI to design the dashboard with React, and I reviewed all of the code. Since I have a lot of experience with web apps, it was easy to prompt for what I wanted and I knew exactly what should be used. In the end it was a very good choice and saved me about a day of work.
uv run sandbox --mcp-stdio "python sandbox/tools/mcp_tools_mbpp.py --stdio" --verbose
uv run sandbox/tools/mcp_tools_mbpp.py # server http mode (default params)
uv run sandbox/tools/mcp_tools_mbpp.py --stdio # stdio mode
uv run sandbox/tools/mcp_tools_mbpp.py --server <url> # server http mode custom url
uv run sandbox/tools/mcp_tools_mbpp.py --server "localhost:9999" # customized server example
uv run sandbox --mcp-server http://localhost:8000/mcp --verbose # run sandbox with server mcp
uv run sandbox --verbose # run sandbox normally
uv run python -m agents --task-file ../42Org/AgentSmith/tasks/mbpp_task3.json --output solution.json --api-keys ${OPENAI_KEY} --verbose --mcp-stdio "python sandbox/tools/mcp_tools_mbpp.py --stdio" --provider openai --model "gpt-5-mini-2025-08-07" # run a task
(base) ➜ MoulinetteAgentSmith cd ../MoulinetteAgentSmith
(base) ➜ MoulinetteAgentSmith uv sync
(base) ➜ MoulinetteAgentSmith uv run moulinette_eval dump mbpp --output task1.json
Task 284 dumped to: task1.json
Task saved to: task1.json
(base) ➜ MoulinetteAgentSmith uv run moulinette_eval dump mbpp --output task2.json
Task 477 dumped to: task2.json
Task saved to: task2.json
(base) ➜ MoulinetteAgentSmith uv run moulinette_eval dump mbpp --output task3.json
Task 12 dumped to: task3.json
Task saved to: task3.json
(base) ➜ MoulinetteAgentSmith uv run moulinette_eval dump mbpp --task-id 42 --output task.json
Traceback (most recent call last):
  File "/home/pulgamecanica/MoulinetteAgentSmith/.venv/bin/moulinette_eval", line 10, in <module>
    sys.exit(main())
  File "/home/pulgamecanica/MoulinetteAgentSmith/moulinette_eval/__main__.py", line 241, in main
    cmd_dump_task(args)
  File "/home/pulgamecanica/MoulinetteAgentSmith/moulinette_eval/__main__.py", line 53, in cmd_dump_task
    evaluator.dump_task(int(args.task_id), args.output)
  File "/home/pulgamecanica/MoulinetteAgentSmith/moulinette_eval/mbpp_eval.py", line 108, in dump_task
    task_info = self.moulinette.get_task(task_id)
  File "/home/pulgamecanica/MoulinetteAgentSmith/moulinette_mbpp/InteractMBPP.py", line 130, in get_task
    task = _get_task_by_id(task_id, with_tests=True)
  File "/home/pulgamecanica/MoulinetteAgentSmith/moulinette_mbpp/InteractMBPP.py", line 71, in _get_task_by_id
    raise ValueError(f"Task ID {task_id} not found")
ValueError: Task ID 42 not found
(base) ➜ MoulinetteAgentSmith uv run moulinette_eval dump mbpp --task-id 142 --output task.json
Task 142 dumped to: task.json
Task saved to: task.json
(base) ➜ MoulinetteAgentSmith uv run moulinette_eval validate mbpp task.json solution.json
TODO:
- Database recording of model runs for better visibility: perhaps a decorator for the agent, or an interface that all models must use, to record for example the current model, the request and response of each step, and a general overview. This would need one more model, TaskInstance, representing the recording of an executed task (all steps), so that the database can later be queried to see how many models have run a given task ID and to produce benchmarks. The same task could also be run with the same provider across different models, or repeatedly with the same model.
- Better CLI for the agent: the same interface or decorator could drive an interactive CLI. This may require running the model in a thread while the main thread runs a loop that updates the ASCII screen as the task progresses.
- Website, Grafana, or dashboard to visualize the database; even a simple database visualizer to run queries and generate graphs.
- Dockerize the project (make up -> spawns an MCP server -> spawns a container running indefinitely where we can exec the project, correctly installed).
- SWE-Bench
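As a starting point for the database recording item in the TODO list, a decorator could look roughly like the sketch below. Everything here is an assumption for illustration: the `record_steps` name, the `steps` table layout, and the use of SQLite are not part of the project, and the decorated function is a fake model step.

```python
import functools
import json
import sqlite3


def record_steps(conn, task_id, model):
    """Decorator factory: log every request/response of a step function."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        "task_id INTEGER, model TEXT, step INTEGER, request TEXT, response TEXT)"
    )
    step_counter = 0

    def decorator(step_fn):
        @functools.wraps(step_fn)
        def wrapper(request):
            nonlocal step_counter
            response = step_fn(request)
            step_counter += 1
            conn.execute(
                "INSERT INTO steps VALUES (?, ?, ?, ?, ?)",
                (task_id, model, step_counter,
                 json.dumps(request), json.dumps(response)),
            )
            conn.commit()
            return response
        return wrapper
    return decorator


conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only


@record_steps(conn, task_id=142, model="demo-model")
def fake_model_step(request):
    """Hypothetical stand-in for one LLM call."""
    return {"reply": request["prompt"].upper()}


fake_model_step({"prompt": "first step"})
fake_model_step({"prompt": "second step"})
rows = conn.execute(
    "SELECT step, response FROM steps WHERE task_id = 142 ORDER BY step"
).fetchall()
print(rows)
```

Keeping the recording in a decorator means the agent code itself does not change, and the same table can later be queried per task ID to compare models.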