Agent Smith

This project consists of creating two agents that generate Python code, which is executed in a sandbox. It implements the ReAct pattern: the agent iterates over observations until the final_answer function is called.

Sandbox Design

The sandbox is built to satisfy the security requirements of Section 4.2:

Isolation: Code runs in a dedicated subprocess instance via the runner.
Resource Limits:
  * Memory: Restricted to 512MB using resource.setrlimit.
  * Time: Hard 30s timeout per execution block, managed by the controller.
Persistence: A globals dictionary is maintained in the runner subprocess, allowing variables and functions to persist between execution blocks until a timeout or crash occurs.
Import Allowlist: Only authorized modules (e.g., math, re, itertools) can be imported.
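
The runner-side points above (memory cap, persistent globals, import allowlist) can be sketched roughly as follows. This is a minimal illustration and not the project's actual runner: the names apply_memory_limit, guarded_import, make_namespace and run_block are hypothetical, the allowlist contains only the three modules given as examples above, and using RLIMIT_AS for the memory cap is an assumption. Replacing `__import__` only restricts import statements; on its own it is not a complete security boundary.

```python
import builtins
import io
import traceback
from contextlib import redirect_stdout

try:
    import resource  # POSIX only; not available on Windows
except ImportError:
    resource = None

ALLOWED_MODULES = {"math", "re", "itertools"}  # assumed allowlist (examples from the README)
MEMORY_LIMIT_BYTES = 512 * 1024 * 1024         # 512MB


def apply_memory_limit(limit=MEMORY_LIMIT_BYTES):
    """Cap the address space of the current process.

    Meant to be called once inside the runner subprocess, never in the controller.
    """
    if resource is not None:
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))


def guarded_import(name, globals=None, locals=None, fromlist=(), level=0):
    """Reject any import whose top-level package is not on the allowlist."""
    if name.split(".")[0] not in ALLOWED_MODULES:
        raise ImportError(f"import of '{name}' is not allowed")
    return builtins.__import__(name, globals, locals, fromlist, level)


def make_namespace():
    """Build the globals dictionary that persists across execution blocks."""
    safe_builtins = dict(vars(builtins))
    safe_builtins["__import__"] = guarded_import
    return {"__builtins__": safe_builtins}


def run_block(code, namespace):
    """Execute one code block and return its stdout, plus the traceback on error."""
    buffer = io.StringIO()
    try:
        with redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        return buffer.getvalue() + traceback.format_exc()
    return buffer.getvalue()
```

Calling run_block twice with the same namespace shows the persistence: a variable defined in the first block is still visible in the second, while `import os` in any block comes back as an ImportError traceback in the returned text.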

Agent Loop

The agent follows a Thought -> Act -> Observe cycle:

Thought: The LLM reasons about the task and determines the next step.
Act: The LLM calls an MCP tool or writes a Python code block.
Observe: The sandbox executes the code and returns the stdout or error traceback.
Iterate: The agent uses the observation to refine its solution until final_answer is called.
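
The cycle above can be sketched as a small loop. This is an illustration under stated assumptions, not the project's agent: run_agent, execute and scripted_llm are hypothetical names, the LLM is replaced by a scripted function so the example runs offline, the sandbox is replaced by a plain exec with captured stdout, and MCP tool calls are left out.

```python
import io
import traceback
from contextlib import redirect_stdout


def execute(code, namespace):
    """Stand-in for the sandbox: run the code, return stdout or the error traceback."""
    buffer = io.StringIO()
    try:
        with redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        return buffer.getvalue() + traceback.format_exc()
    return buffer.getvalue()


def run_agent(task, llm, max_steps=5):
    """Run Thought -> Act -> Observe until final_answer is called or steps run out."""
    result = {}

    def final_answer(value):
        # Calling this from generated code ends the loop.
        result["answer"] = value

    namespace = {"final_answer": final_answer}
    history = [("task", task)]
    for _ in range(max_steps):
        thought, code = llm(history)            # Thought + Act
        observation = execute(code, namespace)  # Observe
        history.append(("thought", thought))
        history.append(("code", code))
        history.append(("observation", observation))
        if "answer" in result:                  # Iterate until final_answer
            break
    return result.get("answer"), history


def scripted_llm(history):
    """Fake model: the first attempt has a bug, the second fixes it after the error."""
    observations = [text for kind, text in history if kind == "observation"]
    if not observations:
        return "Sum the list directly.", "final_answer(sum(numbers))"
    return ("numbers was undefined; define it first.",
            "numbers = [1, 2, 3]\nfinal_answer(sum(numbers))")
```

With the scripted model, the first step produces a NameError traceback as the observation, and the second step uses it to correct the code and call final_answer.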

Resources

Model Context Protocol Documentation
SWE-bench Paper
AI Usage: An open-source AI tool (PersonalGuru) was used to generate a learning path divided into 6 modules:
- The ReAct Reasoning Pattern for LLM Agents
- Building Secure Python Execution Sandboxes
- Model Context Protocol (MCP) Architecture
- Multi-Provider LLM Abstraction and Token Management
- SWE-bench and MBPP Benchmarking for Agents
- Advanced Prompt Engineering for Tool-Use Agents

The software used AI to generate chapters divided into sub-topics, each with quizzes and exercises covering the different concepts.

I also used AI to discuss different approaches to the sandbox design, especially whether the MCP tools should be executed inside or outside the runner. I decided that the controller would be in charge of executing the MCP tools and passing the result to the runner, which keeps MCP isolated from the runner.

I used AI to design the dashboard with React, and I reviewed all the code. Since I have a lot of experience with web apps, it was easy to prompt for what I wanted and I knew exactly what should be used. In the end it was a very good choice and saved me about a day of work.

uv run sandbox --mcp-stdio "python sandbox/tools/mcp_tools_mbpp.py --stdio" --verbose
uv run sandbox/tools/mcp_tools_mbpp.py   # server http mode (default params)
uv run sandbox/tools/mcp_tools_mbpp.py --stdio # stdio mode
uv run sandbox/tools/mcp_tools_mbpp.py --server <url>  # server http mode custom url
uv run sandbox/tools/mcp_tools_mbpp.py --server "localhost:9999" # Customized server example
uv run sandbox --mcp-server http://localhost:8000/mcp --verbose # run sandbox with server mcp
uv run sandbox --verbose # run sandbox normally
uv run python -m agents --task-file ../42Org/AgentSmith/tasks/mbpp_task3.json --output solution.json --api-keys ${OPENAI_KEY} --verbose --mcp-stdio "python sandbox/tools/mcp_tools_mbpp.py --stdio" --provider openai --model "gpt-5-mini-2025-08-07" # run a task

(base) ➜  MoulinetteAgentSmith cd ../MoulinetteAgentSmith
(base) ➜  MoulinetteAgentSmith uv sync                   
(base) ➜  MoulinetteAgentSmith uv run moulinette_eval dump mbpp --output task1.json
Task 284 dumped to: task1.json
Task saved to: task1.json
(base) ➜  MoulinetteAgentSmith uv run moulinette_eval dump mbpp --output task2.json
Task 477 dumped to: task2.json
Task saved to: task2.json
(base) ➜  MoulinetteAgentSmith uv run moulinette_eval dump mbpp --output task3.json
Task 12 dumped to: task3.json
Task saved to: task3.json
(base) ➜  MoulinetteAgentSmith uv run moulinette_eval dump mbpp --task-id 42 --output task.json

Traceback (most recent call last):
  File "/home/pulgamecanica/MoulinetteAgentSmith/.venv/bin/moulinette_eval", line 10, in <module>
    sys.exit(main())
  File "/home/pulgamecanica/MoulinetteAgentSmith/moulinette_eval/__main__.py", line 241, in main
    cmd_dump_task(args)
  File "/home/pulgamecanica/MoulinetteAgentSmith/moulinette_eval/__main__.py", line 53, in cmd_dump_task
    evaluator.dump_task(int(args.task_id), args.output)
  File "/home/pulgamecanica/MoulinetteAgentSmith/moulinette_eval/mbpp_eval.py", line 108, in dump_task
    task_info = self.moulinette.get_task(task_id)
  File "/home/pulgamecanica/MoulinetteAgentSmith/moulinette_mbpp/InteractMBPP.py", line 130, in get_task
    task = _get_task_by_id(task_id, with_tests=True)
  File "/home/pulgamecanica/MoulinetteAgentSmith/moulinette_mbpp/InteractMBPP.py", line 71, in _get_task_by_id
    raise ValueError(f"Task ID {task_id} not found")
ValueError: Task ID 42 not found
(base) ➜  MoulinetteAgentSmith uv run moulinette_eval dump mbpp --task-id 142 --output task.json

Task 142 dumped to: task.json
Task saved to: task.json
(base) ➜  MoulinetteAgentSmith uv run moulinette_eval validate mbpp task.json solution.json 
(base) ➜  MoulinetteAgentSmith

TODO:

Database recording of models for better visibility. Perhaps implement a decorator for the agent, or an interface that all models should use, to record for example the current model, the request and response of each step, and a general overview. We would need to create one more model, TaskInstance, to represent the recording of an executed task (all steps), so that later I can query the database to see, for a given task ID, how many models have run, and do benchmarks. Perhaps we could also run the same task with the same provider for different models, or even the same one, etc.

Better CLI for the agent. Perhaps the same interface or decorator could drive an interactive CLI. We might need a thread where the model runs, while the main thread runs a loop that updates the ASCII CLI screen as the task progresses.

Website, Grafana, or dashboard to visualize the database, even just a database visualizer to make queries and generate graphs.

Dockerize the project (make up -> spawns an MCP server -> spawns a container that runs indefinitely, where we can exec the project with it installed correctly).

SWE-Bench

Refs

https://openrouter.ai/activity

https://platform.claude.com/dashboard

https://platform.openai.com/usage

https://aistudio.google.com/usage

uv run sandbox --mcp-server "https://gitmcp.io/modelcontextprotocol/servers/tree/main/src/filesystem" --verbose
