AI agent with self-improvement, compressed memory, and an OpenClaw-style skill system. Runs locally with Ollama. Vibecoded with Claude Opus 4.6 and tested with codegemma:latest [[DO NOT EXPOSE THIS APPLICATION TO THE INTERNET]]
aria-agent/
├── main.py # Entry point
├── config.json # Configuration
├── core/
│ ├── agent.py # Main orchestrator (Thread 1 + 2)
│ ├── config.py # Configuration loader
│ ├── prompts.py # All prompts & strings (PL/EN)
│ ├── memory.py # Compressed memory (ST → LT) + RL neural network + pending tasks
│ ├── skills_manager.py # Skill manager (SKILL.md format) + error logs
│ ├── computer.py # System tools (shell, files, Python)
│ ├── llm.py # Ollama/OpenAI LLM client (no timeouts, chunked responses)
│ ├── reflection.py # Background reflection thread + system exploration + task processing
│ └── logger.py # Central event logging
├── web/
│ ├── server.py # WebUI HTTP server + API
│ └── static/index.html # React SPA dashboard
├── skills/ # Skills (OpenClaw/AgentSkills format)
└── memory/ # Persistent memory (auto-generated)
├── memory.json
├── memory_network.json # RL neural network weights
├── pending_tasks.json # User tasks waiting for completion
├── reflections.jsonl
├── self_model.json
└── logs/
└── aria.log # Full event log (rotating, 10MB)
- Interactive REPL (CLI) or WebUI with real-time dashboard
- Understands slash commands (`/exec`, `/python`, `/read`, `/write`, etc.)
- JSON-structured Chain of Thought: Analyze → Memory → Skill Selection → Execute → Plan → Answer
- Categorizes and weighs importance of every interaction (RL neural network scoring)
- Detects when it can't answer and adds task to pending queue
- Long responses auto-chunked across multiple messages (no truncation)
- Runs continuously in the background, independently from conversations
- Cycles through phases: introspection, pattern analysis, skill planning, skill building, skill testing, self-improvement, knowledge synthesis, exploration, pending tasks, system exploration
- Autonomously creates, improves, tests, and fixes skills (prefers improving over duplicating)
- System exploration: safely discovers system environment via read-only shell commands (double safety check: hardcoded blocklist + LLM verification)
- Proactive messaging: sends messages to user on any topic at any time (skill creation, task completion, observations, greetings)
- Pending task processing: retries user tasks that failed, notifies user when completed
- Skill error log clearing: clears error history after successful fix cycles
- Updates the agent's self-model
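The phase cycle above can be sketched as a daemon thread that steps through the phases independently of the chat loop. This is an illustrative skeleton, not ARIA's actual `reflection.py`; the function name and `interval` parameter are assumptions:

```python
import itertools
import threading

PHASES = ["introspection", "pattern_analysis", "skill_planning",
          "skill_building", "skill_testing", "self_improvement",
          "knowledge_synthesis", "exploration", "pending_tasks",
          "system_exploration"]

def start_reflection_thread(run_phase, interval=3.0, stop=None):
    """Run Thread 2: cycle through the phases forever, independently
    of user conversations, pausing `interval` seconds between phases."""
    stop = stop or threading.Event()

    def loop():
        for phase in itertools.cycle(PHASES):
            if stop.is_set():
                return
            run_phase(phase)
            if stop.wait(interval):  # returns True once stop is set
                return

    # daemon=True so the background thread dies with the main process
    threading.Thread(target=loop, daemon=True).start()
    return stop
```

Calling `stop.set()` on the returned event shuts the loop down cleanly.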
ARIA's LLM client has zero timeouts for Ollama — it relies entirely on Ollama's structured done flag to know when a response is complete. This eliminates mid-word truncation issues on slow hardware.
- Streaming: reads chunks until `{"done": true}` — no socket/connection timeout
- Non-streaming: internally uses streaming to avoid blocking
- Truncation detection: checks `done_reason == "length"` and auto-continues
- `chat_long()`: automatically splits responses truncated by the token limit into multiple parts (up to 3), sending "Continue" to the model between parts
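The done-flag-driven read loop can be sketched as follows, assuming Ollama's NDJSON streaming format (`response`, `done`, `done_reason` fields); `assemble_stream` is an illustrative name, not ARIA's actual API:

```python
import json

def assemble_stream(lines):
    """Collect an Ollama NDJSON stream into (text, done_reason).

    The loop ends only when a chunk carries "done": true — no socket
    timeout is involved, so slow hardware cannot cause truncation."""
    parts = []
    done_reason = None
    for raw in lines:
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            done_reason = chunk.get("done_reason")
            break
    return "".join(parts), done_reason

# done_reason == "length" signals a token-limit truncation that a
# chat_long()-style wrapper would answer with a "Continue" request.
text, reason = assemble_stream([
    '{"response": "Hello, ", "done": false}',
    '{"response": "world!", "done": true, "done_reason": "stop"}',
])
```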
Thread 2's skill building phase now prefers improving existing skills over creating new duplicates:
- LLM receives full code of all existing skills as context
- LLM must choose `action: "improve"` or `action: "create"` with a justification
- Duplicate detection: Jaccard similarity check on name+description — rejects >50% overlap with existing skills
- Improvement workflow: backup old code → write new → test → rollback on failure
- Creation workflow: only when truly new functionality is needed
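The duplicate check can be sketched as a Jaccard similarity over word sets of name+description; function names here are illustrative, not ARIA's actual API:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def is_duplicate(name, description, existing, threshold=0.5):
    """Reject a proposed skill whose name+description overlaps more
    than `threshold` (50%) with any existing skill's name+description."""
    candidate = f"{name} {description}"
    return any(jaccard(candidate, f"{n} {d}") > threshold
               for n, d in existing)
```

A candidate that fails this check is routed to the "improve" path instead of creating a near-copy.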
Thread 2 periodically explores the host system to understand its environment:
- LLM plans 3-5 read-only shell commands based on what it already knows
- Double safety check before each command:
  - Hardcoded blocklist: `rm`, `sudo`, `dd`, `mkfs`, `shutdown`, `kill`, `chmod -R`, `> /`, etc.
  - LLM safety evaluation: assesses whether the command modifies files, installs packages, or could harm the system
- Safe commands execute with 15s timeout
- Results stored in memory under the `system_discovery` category
- Knowledge accumulates across cycles — LLM sees previous discoveries to avoid redundant exploration
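The first of the two safety gates might look like this; the blocklist entries come from the list above, while the matching logic is a sketch (the second gate, the LLM safety evaluation, is not shown):

```python
BLOCKLIST = ("rm", "sudo", "dd", "mkfs", "shutdown", "kill",
             "chmod -R", "> /")

def blocklist_ok(command: str) -> bool:
    """First safety gate (hardcoded blocklist). A command that passes
    still goes through the second, LLM-based safety evaluation."""
    lowered = command.lower()
    tokens = lowered.split()
    for bad in BLOCKLIST:
        bad = bad.lower()
        if " " in bad or "/" in bad:
            # multi-word / redirection patterns match as substrings
            if bad in lowered:
                return False
        elif bad in tokens:
            # single words must match a whole token, so "format"
            # is not rejected just because it contains "rm"
            return False
    return True
```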
When the agent detects it can't fulfill a user request (e.g., missing a required skill), it:
- Adds the task to `pending_tasks.json` with the reason and missing capability
- Notifies the user that the task has been queued
- Thread 2 periodically checks pending tasks during the `pending_tasks` phase
- When the capability is available (skill created/fixed), Thread 2 executes the task
- User receives a proactive message with the result
- Tasks are automatically cleaned up after 48 hours
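The queue-and-expire lifecycle can be sketched as below; the entry fields and helper names are assumptions modelled on the description above, not ARIA's exact `pending_tasks.json` schema:

```python
import time

TTL_SECONDS = 48 * 3600  # pending tasks expire after 48 hours

def add_pending(tasks, request, reason, capability, now=None):
    """Queue a request the agent could not fulfil, recording why
    and which capability is missing."""
    now = time.time() if now is None else now
    tasks.append({
        "id": f"task-{int(now)}-{len(tasks)}",
        "request": request,
        "reason": reason,
        "missing_capability": capability,
        "created": now,
    })

def prune_expired(tasks, now=None):
    """Automatic cleanup: drop tasks older than 48 hours."""
    now = time.time() if now is None else now
    return [t for t in tasks if now - t["created"] < TTL_SECONDS]

tasks = []
add_pending(tasks, "plot my disk usage", "missing skill",
            "chart-maker", now=0.0)
```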
Each skill maintains an error log (in .stats.json):
- Errors are recorded automatically when scripts fail
- Thread 2 reads error logs to prioritize fixes
- Error logs are cleared after successful fix cycles
- View errors via `/skill <name>` or the Skills panel in the WebUI
All internal reasoning steps use strict JSON format to prevent CoT artifacts from leaking into user-facing responses:
Step 1: Analyze → {"intent": "...", "skills": [...], "can_answer": true/false, ...}
Step 2: Memory recall → relevant context
Step 3: Skill selection → {"selections": [{"name": "...", "args": [...], "reason": "..."}]}
Step 4: Execute skills → raw output
Step 5: Plan → {"key_info": "...", "format": "...", "approach": "..."}
Step 6: Interpret → {"summary": "...", "useful": true/false}
Step 7: Final answer → natural language response (no CoT artifacts)
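Keeping CoT artifacts out of the final answer hinges on parsing each intermediate step strictly as JSON. A minimal sketch of such a parser (name and error handling are illustrative):

```python
import json

def parse_step(raw: str, required: set) -> dict:
    """Parse one CoT step strictly as JSON: extract the first JSON
    object even if the model wrapped it in extra prose, and verify
    the required keys so malformed reasoning never reaches the user."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in step output")
    data = json.loads(raw[start:end + 1])
    missing = required - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

# Step 1 (Analyze) output, including stray prose the model added:
step = parse_step(
    'Sure! {"intent": "system info", "can_answer": false}',
    {"intent", "can_answer"},
)
```

A step that fails validation can be retried instead of being passed downstream.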
Memory importance is scored by a lightweight reinforcement learning neural network that runs entirely on CPU:
Architecture: 28 features → 16 hidden (ReLU) → 1 output (sigmoid)
Training: REINFORCE-style policy gradient, single-step updates
Persistence: weights saved to memory/memory_network.json
Features extracted (28 total):
- Text metrics: length, word count, lexical diversity, question/exclamation density
- Content signals: has code, has URL, digit ratio, multiline, structured content
- Category one-hot encoding (8 categories)
- Metadata: interaction number, skill mentions, error mentions, source type
RL Rewards:
- Entry recalled by user query: +1.0
- User explicitly says "remember/zapamiętaj": +2.0
- Entry with high access count during compression: +0.5
- Entry never accessed with low importance: -0.5
Scoring blend: final_importance = 0.4 × heuristic + 0.6 × neural_network
The network improves over time — frequently useful memories get higher scores, low-value content gets filtered out.
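The 28 → 16 → 1 forward pass and the 0.4/0.6 blend can be sketched in pure Python (CPU-only, as described). The random weights here are for illustration only; the real weights live in `memory/memory_network.json`:

```python
import math
import random

def forward(features, w1, b1, w2, b2):
    """28 features -> 16 hidden (ReLU) -> 1 output (sigmoid)."""
    hidden = [max(0.0, sum(f * w for f, w in zip(features, row)) + b)
              for row, b in zip(w1, b1)]
    z = sum(h * w for h, w in zip(hidden, w2)) + b2
    return 1.0 / (1.0 + math.exp(-z))

def blended_importance(heuristic, features, params):
    """final_importance = 0.4 * heuristic + 0.6 * neural_network"""
    return 0.4 * heuristic + 0.6 * forward(features, *params)

# Illustrative random weights; ARIA loads trained ones from disk.
random.seed(0)
w1 = [[random.uniform(-0.1, 0.1) for _ in range(28)] for _ in range(16)]
b1 = [0.0] * 16
w2 = [random.uniform(-0.1, 0.1) for _ in range(16)]
b2 = 0.0
score = blended_importance(0.5, [0.3] * 28, (w1, b1, w2, b2))
```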
New interaction → RL importance scoring → Short-term memory
↓ (buffer full)
LLM-generated summaries
↓
Long-term compressed blocks
↓
Episodic memory (key moments)
- Short-term: Full details, max 20 entries, scored by RL network
- Long-term: LLM-generated summaries with max character limit (falls back to rule-based if LLM unavailable)
- Episodic: Breakthrough moments (new skills, discoveries)
- Pending tasks: User requests awaiting completion
- RL training: happens during compression (reward for accessed entries, penalty for never-used)
- Semantic search: `/recall <query>` (also triggers an RL reward for recalled entries)
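The short-term → long-term handoff above can be sketched as a capped buffer whose overflow triggers compression. This sketch uses the rule-based fallback (keep the highest-importance snippets); the real system asks the LLM for summaries first:

```python
class ShortTermMemory:
    """Short-term buffer: full entries, capped at 20. When the cap is
    exceeded, the oldest half is compressed into a long-term block."""

    def __init__(self, limit=20):
        self.limit = limit
        self.entries = []
        self.long_term = []

    def add(self, text, importance):
        self.entries.append({"text": text, "importance": importance})
        if len(self.entries) > self.limit:
            self._compress()

    def _compress(self):
        # Rule-based fallback summary: keep the top-importance snippets.
        batch = self.entries[:self.limit // 2]
        self.entries = self.entries[self.limit // 2:]
        batch.sort(key=lambda e: e["importance"], reverse=True)
        summary = " | ".join(e["text"][:40] for e in batch[:3])
        self.long_term.append(summary)

stm = ShortTermMemory(limit=20)
for i in range(21):
    stm.add(f"entry {i}", importance=i / 21)
```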
All events are logged to memory/logs/aria.log (rotating, 10MB max, 3 backups):
2025-01-15 14:23:01 | INFO | USER [5] | what does my system look like?
2025-01-15 14:23:01 | DEBUG | COT/analyze | {"intent": "system info", ...}
2025-01-15 14:23:02 | INFO | SKILL system-monitor/main.py exit=0 | CPU: 45%, RAM: 3.2GB...
2025-01-15 14:23:03 | INFO | AGENT [5] skills=['system-monitor'] | Your system looks...
2025-01-15 14:23:30 | INFO | T2/system_exploration | Running: df -h (disk space)
2025-01-15 14:23:45 | INFO | T2->USER | Improved skill: system-monitor...
2025-01-15 14:24:00 | INFO | TASK/completed task-1705312981-0 | Free space: 45GB...
Log categories: USER, AGENT, COT/*, T2/*, T2->USER, SKILL, SKILL_ERR, TASK/*, CMD, ERROR, WARN
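A rotating logger with these limits (10 MB per file, 3 backups) can be set up with Python's standard library. This sketch assumes the timestamp/level layout shown above; the category prefix is emitted as part of the message, which may differ from the real `logger.py`:

```python
import logging
import os
from logging.handlers import RotatingFileHandler

def make_logger(path="memory/logs/aria.log"):
    """Central event logger: rotating file, 10 MB max, 3 backups."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    logger = logging.getLogger("aria")
    logger.setLevel(logging.DEBUG)
    handler = RotatingFileHandler(path, maxBytes=10 * 1024 * 1024,
                                  backupCount=3, encoding="utf-8")
    handler.setFormatter(logging.Formatter(
        "%(asctime)s | %(levelname)s | %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S"))
    logger.addHandler(handler)
    return logger

log = make_logger()
log.info("USER [1] | hello")
log.handlers[0].flush()
```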
Each skill is a folder with SKILL.md:
---
name: my-skill
description: When and why to use this skill
---
# Instructions in Markdown
...

Skills can be created three ways:
- Manually: `/create-skill {json}` with name, description, instructions, script_code
- Automatically: Thread 2 creates or improves skills based on user interaction patterns
- Programmatically: `skills_manager.create_skill(name, desc, instructions, scripts)`
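A naive parse of the SKILL.md frontmatter shown above might look like this; a real implementation would likely use a YAML parser, and `parse_skill_md` is an illustrative name:

```python
def parse_skill_md(text: str) -> dict:
    """Split a SKILL.md file into its frontmatter fields
    (name, description between the '---' fences) and the
    Markdown instructions that follow."""
    _, front, body = text.split("---", 2)
    meta = {}
    for line in front.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    meta["instructions"] = body.strip()
    return meta

skill = parse_skill_md(
    "---\n"
    "name: my-skill\n"
    "description: When and why to use this skill\n"
    "---\n"
    "# Instructions in Markdown"
)
```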
ARIA supports Polish (pl) and English (en). All prompts, LLM instructions, UI strings, and command descriptions switch based on language setting.
Set language via:
- Environment variable: `ARIA_LANG=en` (takes priority)
- config.json: `"agent": { "language": "en" }`
- Docker Compose: `ARIA_LANG=en` in the environment section
# Prerequisites: Ollama running on host with a model pulled
ollama pull llama3.2
# Start ARIA (Polish, default)
docker compose up -d
# Start ARIA in English
ARIA_LANG=en docker compose up -d
# Start with a different model
ARIA_MODEL=qwen3:8b docker compose up -d
# Open WebUI
open http://localhost:8080
# View agent logs
docker compose logs -f aria
# Or check the detailed log file: memory/logs/aria.log

# Install Ollama and pull a model
ollama pull llama3.2
# CLI mode
python main.py
# CLI with live Thread 2 output
python main.py --reflection
# WebUI on port 8080
python main.py --web
# WebUI on custom port
python main.py --web --port 3000

| Variable | Default | Description |
|---|---|---|
| `ARIA_LANG` | `pl` | Language: pl or en |
| `ARIA_MODEL` | `codegemma:latest` | Ollama model name |
| `OLLAMA_HOST` | `http://host.docker.internal:11434` | Ollama server URL |
{
"llm": {
"base_url": "http://localhost:11434",
"model": "llama3.2",
"temperature": 0.7,
"max_tokens": 4096,
"context_length": "auto",
"api_type": "ollama",
"timeout": 300
},
"reflection_llm": {
"base_url": "http://localhost:11434",
"model": "llama3.2",
"temperature": 0.9,
"max_tokens": 4096,
"context_length": "auto",
"api_type": "ollama",
"timeout": 180
},
"agent": {
"language": "en",
"reflection_interval": 3,
"stream_responses": true,
"short_term_limit": 20
}
}

Note: timeout values in the config are retained for non-Ollama endpoints (OpenAI-compatible). Ollama communication uses no timeouts — it relies on the structured done flag.
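Loading this config with the environment-variable precedence described earlier can be sketched as follows. Mapping `ARIA_MODEL` to `llm.model` and `OLLAMA_HOST` to `llm.base_url` is an assumption based on the variable table; the demo config file is written only for illustration:

```python
import json
import os

def load_config(path="config.json"):
    """Load config.json, then apply environment overrides:
    ARIA_LANG takes priority over agent.language."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    if os.environ.get("ARIA_LANG"):
        cfg["agent"]["language"] = os.environ["ARIA_LANG"]
    if os.environ.get("ARIA_MODEL"):
        cfg["llm"]["model"] = os.environ["ARIA_MODEL"]
    if os.environ.get("OLLAMA_HOST"):
        cfg["llm"]["base_url"] = os.environ["OLLAMA_HOST"]
    return cfg

# Demo: a minimal config written to a scratch file.
with open("demo_config.json", "w", encoding="utf-8") as f:
    json.dump({"llm": {"model": "llama3.2",
                       "base_url": "http://localhost:11434"},
               "agent": {"language": "pl"}}, f)
os.environ["ARIA_LANG"] = "en"
cfg = load_config("demo_config.json")
```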
| Command | Description |
|---|---|
| `/help` | List all commands |
| `/status` | Agent status + Thread 2 info |
| `/memory` | Memory overview |
| `/recall <query>` | Search memory |
| `/skills` | List skills |
| `/skill <n>` | Skill details |
| `/run <skill>` | Run a skill script |
| `/create-skill <json>` | Create a new skill |
| `/tasks` | View pending tasks queue |
| `/thread2 [n]` | Show last n Thread 2 cycles |
| `/reflect` | Force a reflection cycle |
| `/exec <cmd>` | Execute shell command |
| `/python <code>` | Execute Python code |
| `/ls [path]` | List directory |
| `/read <path>` | Read file |
| `/write <path> <content>` | Write to file |
| `/sysinfo` | System information |
| `/compress` | Force memory compression |
| `/selfmodel` | Agent self-model |
| `/models` | List Ollama models |
| `/model <n>` | Switch active model |
| `/pull [model]` | Pull model from Ollama |
| `/ollama` | Ollama connection status |
- Real-time chat with JSON-structured Chain of Thought (expandable)
- Thread 2 live thought stream panel
- Proactive messages from Thread 2 (skill creation/improvement, task completion, system discoveries)
- Status dashboard with memory/skills/interaction/task stats
- Command palette with autocomplete (type `/`)
- Skill browser with run buttons and error counts
- Memory inspector
- Reflection timeline
- SSE-based real-time updates
Architecture inspired by OpenClaw:
- `SKILL.md`-based skill system (AgentSkills format)
- Local command and script execution
- Persistent memory across sessions
- Modular, extensible architecture