General Computer Use Agent

A multi-agent system that autonomously operates an Ubuntu Linux desktop to perform engineering design tasks. The system uses Google Gemini's Computer Use API (vision model) to see the screen, reason about what to do, and control the mouse/keyboard to drive applications like FreeCAD and Google Chrome.

Built for the Gemini API Developer Competition 2025.

Table of Contents

Overview
Key Insight: Model Sophistication Matters
System Architecture
Demo
Deployment
- One-Command GCP Deploy
- Local / Existing VM Setup
Quick Start
Agents
Multi-Agent Workflow
The Agentic Loop
Macro Execution Engine
Planner Intelligence
Skill Learning Pipeline
Coordinate System & Executors
Project Structure
Configuration Reference
Known Limitations & Lessons Learned
Future Work
Troubleshooting

Overview

This project demonstrates a general-purpose computer use agent that can:

Design 3D parts in FreeCAD -- The CAD agent sees the FreeCAD GUI through screenshots, clicks menus, draws sketches, applies constraints, and performs Part Design operations (Pad, Pocket, Thickness, Fillet, etc.). It can also run Python macros directly in FreeCAD's console for precision geometry.
Research information online -- The Research agent opens a browser via Playwright, searches DuckDuckGo, reads pages, and extracts structured data with confidence scores and source URLs.
Generate professional reports -- The Documentation agent converts raw research data into formatted Word (.docx) and PDF documents with tables, sections, and citations.
Chain agents together -- The Planner can route a request like "Make a bracket for an M6 bolt" through Research (find bolt specs) -> Documentation (save report) -> CAD (build the 3D model with real dimensions).

Key Technologies

Component	Technology
Vision Model (Computer Use)	Google Gemini 3 Flash Preview (`gemini-3-flash-preview`)
Planning Model (Text-only)	Google Gemini 3.1 Pro Preview (`gemini-3.1-pro-preview`)
Desktop Control	xdotool (X11 input automation)
Macro Engine	Python -> FreeCAD console (via xclip paste)
Screenshots	scrot + PIL (resize to 1440x900)
Browser Control	Playwright (Chromium)
CAD Application	FreeCAD 1.0
Document Generation	fpdf2 (PDF), python-docx (Word)
VM Environment	Ubuntu Linux, XFCE desktop, X11 display server

Key Insight: Model Sophistication Matters

Through extensive testing, we found that complex engineering workflows are possible with Computer Use -- but output quality depends heavily on the model's reasoning capability, not just its ability to see and click.

What We Learned

Capability	Gemini 3 Flash (current)	Gemini 3.1 Pro (expected once it supports Computer Use)
Open FreeCAD, navigate menus	Works reliably	Works reliably
Create simple shapes (cube, cylinder)	Works (~80% success)	Works
Draw sketches with constraints	Inconsistent	Reliable
Generate correct FreeCAD Python macros	Wrong API names, wrong face refs	Correct API usage
Multi-step designs (L-bracket + holes)	Fails silently, burns turns	Can reason through steps
Error recovery (undo, retry)	Repeats same mistake	Recognizes and adapts
Spatial reasoning (which face is "top"?)	Guesses Face6, Face12	Understands geometry

The Core Problem

The vision model (gemini-3-flash-preview) can see the screen perfectly and click accurately. But CAD design requires reasoning about 3D geometry from 2D screenshots -- understanding which face is the "top face," how a pocket changes the shape, and what the correct FreeCAD Python API property name is. This is a reasoning task, not a vision task.

A more sophisticated model like gemini-3.1-pro-preview (when it supports Computer Use) would unlock:

Correct macro generation -- right property names, right face selection logic
Multi-feature designs in one shot -- L-brackets with holes, gears with teeth
Self-correction -- recognizing when a macro failed and fixing the code
Spatial planning -- knowing that after a Pocket, the face indices change

Design Philosophy

Our architecture anticipates this: the Planner (already running on gemini-3.1-pro) generates detailed step-by-step workflows with correct FreeCAD API examples. The CAD agent (running on Flash) follows them. When a stronger model becomes available for Computer Use, the agent can handle more complex reasoning directly, and the Planner can send simpler instructions.

Less instruction = better performance with current vision models. The CAD agent uses a minimal system instruction (~130 lines) for desktop navigation. All task-specific intelligence comes from the Planner's action plans.

System Architecture

                          +----------------+
                          |    main.py     |
                          |   (CLI/REPL)   |
                          +-------+--------+
                                  | user request
                                  v
                          +----------------+
                          |    Planner     |  <- Gemini 3.1 Pro (text-only)
                          | (Router +      |     classifies task, extracts
                          |  Plan Builder) |     params, generates workflow
                          +-------+--------+
                     +------------+------------+
                     v            v            v
              +--------+  +----------+  +----------------+
              |  CAD   |  | Research |  | Research -> CAD |
              | Agent  |  |  Agent   |  |   (chained)    |
              +---+----+  +----+-----+  +----------------+
                  |             |
                  v             v
           +-----------+ +------------+
           |  Desktop  | |  Browser   |
           | Executor  | | Executor   |
           | (xdotool) | |(Playwright)|
           +-----------+ +------------+
                  |             |
          +-------+-------+    |
          v               v    v
   +-----------+ +------+ +--------+
   |  FreeCAD  | |Macro | | Chrome |
   |   (GUI)   | |Engine| |(DuckDG)|
   +-----------+ +------+ +--------+

    +-------------------------------------------+
    |           Shared Agentic Loop             |
    |  screenshot -> Gemini -> function calls   |
    |  -> executor -> screenshot -> repeat...   |
    +-------------------------------------------+

Data Flow: Research -> CAD Pipeline

User: "Make a bracket for an M6 bolt"
  |
  +- Planner._plan() -> Gemini classifies as "research+cad"
  |
  +- Phase 1: ResearchAgent
  |   +- Opens DuckDuckGo -> searches "M6 bolt dimensions"
  |   +- Reads multiple websites, extracts data points
  |   +- Returns: {data_points: [{fact: "clearance hole", value: "6.6", unit: "mm"}, ...]}
  |   +- DocumentationAgent auto-generates Word + PDF report
  |
  +- Phase 2: Planner._extract_dimensions()
  |   +- Gemini extracts CAD params: {hole_diameter: "6.6mm", wall_thickness: "3mm"}
  |
  +- Phase 3: Planner._generate_cad_goal()
  |   +- Gemini 3.1 Pro generates step-by-step FreeCAD workflow
  |   +- Includes correct API examples, face-finding patterns
  |
  +- Phase 4: CADAgent
      +- Gets enriched description + workflow + FreeCAD tips
      +- Runs macros and/or drives GUI to build the model

Demo

Demo Commands

# Full pipeline: research M6 bolt specs, then build bracket in FreeCAD
python3 main.py "Make a bracket for an M6 bolt"

# CAD only (skip research -- fast for testing/demos)
python3 main.py --cad "Make a bracket for an M6 bolt"

# CAD only with explicit dimensions
python3 main.py --cad --dims hole_diameter=6.6mm wall_thickness=3mm "L-bracket with bolt holes"

# Research only: find specs, generate report
python3 main.py "Research the standard dimensions of an M8 hex bolt"

# Simple CAD (auto-routes to CAD, no research needed)
python3 main.py "Create a 30mm cube in FreeCAD"

What to Expect

Research tasks (~3-5 minutes):

Planner routes to Research agent
Browser opens DuckDuckGo, searches multiple queries
Agent visits 2-4 websites, extracts structured data points
Documentation agent generates Word + PDF reports in outputs/research_results/

CAD tasks (~5-15 minutes depending on complexity):

Planner generates a step-by-step FreeCAD workflow
CAD agent opens FreeCAD (or uses existing window)
Agent runs Python macros for precise geometry AND/OR uses GUI clicking
Stage budgets track progress (setup -> base_shape -> features -> cleanup)
Agent calls task_complete() when finished

Chained tasks (Research -> CAD, ~8-20 minutes):

Research phase finds real-world specifications
Planner extracts concrete dimensions from research data
CAD phase builds the model using researched specs

Deployment

One-Command GCP Deploy

Provision a fully configured Google Cloud VM with everything installed:

# Prerequisites: gcloud CLI installed + authenticated + project set
export GEMINI_API_KEY="your-key"

# Create VM with Ubuntu + XFCE + Xvfb + VNC + FreeCAD + agent
./scripts/deploy.sh --gcp

# Custom options
./scripts/deploy.sh --gcp --name my-agent --zone us-east1-b --machine-type e2-standard-8

What it creates:

e2-standard-4 Ubuntu 22.04 VM with 50GB SSD
XFCE desktop on a virtual display (Xvfb at 1280x800)
VNC server on port 5900 (watch the agent work remotely)
FreeCAD, Chromium, Python venv, all dependencies
Your API key injected into the environment
agent alias for quick start

After ~5-10 minutes of setup:

# SSH in
gcloud compute ssh engineering-agent-v2 --zone=us-central1-a

# Start
agent
python main.py --cad "Make a bracket for an M6 bolt"

# Watch via VNC (separate terminal)
gcloud compute ssh engineering-agent-v2 --zone=us-central1-a -- -L 5900:localhost:5900
# Then connect VNC client to localhost:5900

Local / Existing VM Setup

VM Requirements

Requirement	Specification
OS	Ubuntu 22.04+ LTS
Desktop	XFCE (lightweight, uses X11)
Display Server	X11 (NOT Wayland -- xdotool requires X11)
Screen Resolution	1280x800 (16:10 aspect ratio)
RAM	4 GB minimum, 8 GB recommended
Disk	20 GB minimum
Network	Internet access for Gemini API + web research

Automated Setup

git clone https://github.com/Louisomeg/General_Computer_Use_Agent-.git
cd General_Computer_Use_Agent-
git checkout design

# Full automated setup (system deps + python + verify)
chmod +x scripts/deploy.sh
./scripts/deploy.sh

Manual Setup

# 1. System dependencies
sudo apt install -y python3 python3-pip python3-venv scrot xdotool xclip ffmpeg git
sudo apt install -y freecad chromium-browser

# 2. Python environment
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python -m playwright install chromium

# 3. API key
export GEMINI_API_KEY="your-key"

# 4. Verify
./scripts/deploy.sh --verify

Why These Display Settings?

1280x800 resolution: Google recommends 1440x900 for Computer Use. 1280x800 is the closest available VM resolution with the same 16:10 aspect ratio. Screenshots are resized to 1440x900 before sending -- no distortion.
XFCE desktop: Lightweight, X11-native, predictable UI layout. The system instruction teaches XFCE-specific navigation.
X11 display server: Required by xdotool for mouse/keyboard input. Wayland is not supported.
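
The "no distortion" point follows from both resolutions sharing one aspect ratio. A quick arithmetic check makes this concrete (illustrative only; the helper names are not from the repository):

```python
from fractions import Fraction

SCREEN = (1280, 800)   # VM resolution
MODEL = (1440, 900)    # size of the screenshot sent to Gemini


def aspect(size):
    """Exact width:height ratio as a reduced fraction."""
    width, height = size
    return Fraction(width, height)


def scale_factors(src, dst):
    """Per-axis scale needed to resize src to dst."""
    return Fraction(dst[0], src[0]), Fraction(dst[1], src[1])


sx, sy = scale_factors(SCREEN, MODEL)
print(aspect(SCREEN) == aspect(MODEL), float(sx), float(sy))
```

Both axes scale by the same factor (9/8), so the resize is uniform and shapes on screen keep their proportions.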

Quick Start

Interactive Mode

python main.py

Agentic Planner -- type a request or 'quit' to exit
  Prefix with --cad to skip research

>>> Create a 50mm tall cylinder with radius 15mm
>>> --cad Make a bracket for an M6 bolt
>>> Research M6 bolt dimensions
>>> quit

Direct CLI Mode

# Full pipeline (planner decides if research is needed)
python main.py "Make a bracket for an M6 bolt"

# CAD only -- skip research for fast iteration
python main.py --cad "Make a bracket for an M6 bolt"

# CAD only with explicit dimensions
python main.py --cad --dims hole_diameter=6.6mm wall_thickness=3mm "L-bracket"

# Research only
python main.py "What are the standard dimensions of an M8 hex bolt?"
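
How `main.py` actually parses `--dims` is not shown here. As a rough sketch, assuming the values arrive as `key=value` tokens followed by the free-text request, the separation could look like this (`parse_dims` and `split_dims_and_request` are hypothetical names):

```python
def parse_dims(tokens):
    """Turn ['hole_diameter=6.6mm', 'wall_thickness=3mm'] into a dict."""
    dims = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected key=value, got {token!r}")
        dims[key.strip()] = value.strip()
    return dims


def split_dims_and_request(args):
    """Separate key=value tokens from the request text that follows them.

    Assumption: a dimension token contains '=' and no spaces; everything
    else is treated as part of the request.
    """
    dim_tokens = [a for a in args if "=" in a and " " not in a]
    request = " ".join(a for a in args if a not in dim_tokens)
    return parse_dims(dim_tokens), request


dims, request = split_dims_and_request(
    ["hole_diameter=6.6mm", "wall_thickness=3mm", "L-bracket"]
)
print(dims, request)
```

Values are kept as strings with their units, matching the way dimensions are written on the command line.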

Agents

CAD Agent

File: agents/cad_agent.py

The CAD agent drives FreeCAD through a combination of GUI interaction (clicking menus, drawing geometry) and Python macro execution (precise programmatic geometry). It sees the FreeCAD window through screenshots, reasons about what to do, and acts.

Capabilities:

Create 3D parts from descriptions and dimensions
Run Python macros in FreeCAD for precise geometry
Draw 2D sketches with geometry and constraints via GUI
Apply Part Design operations (Pad, Pocket, Hole, Thickness, Fillet, Chamfer)
Navigate FreeCAD menus and dialogs
Handle error recovery (undo mistakes, close unexpected dialogs)

Stage Budgets -- prevents the agent from burning all turns on one step:

Stage	Budget	Description
setup	10 turns	Open FreeCAD, create body, enter first sketch
base_shape	25 turns	Draw base profile, constrain, Pad
features	50 turns	Holes, pockets, fillets, chamfers
cleanup	10 turns	Fit view, verify, save
reserve	25 turns	Recovery budget for undo/retry
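
A minimal sketch of how such per-stage budgets could be tracked. The budget numbers come from the table above; the `StageTracker` class and its warning text are assumptions for illustration, not the code in `core/agentic_loop.py`:

```python
STAGE_BUDGETS = {
    "setup": 10,
    "base_shape": 25,
    "features": 50,
    "cleanup": 10,
    "reserve": 25,
}


class StageTracker:
    """Counts turns per stage and reports when a stage exceeds its budget."""

    def __init__(self, budgets):
        self.budgets = dict(budgets)
        self.used = {name: 0 for name in budgets}

    def record_turn(self, stage):
        """Record one turn; return a warning string once the stage is over budget."""
        self.used[stage] += 1
        over = self.used[stage] - self.budgets[stage]
        if over > 0:
            return (
                f"Stage '{stage}' is {over} turn(s) over its budget of "
                f"{self.budgets[stage]}; move to the next stage."
            )
        return None


tracker = StageTracker(STAGE_BUDGETS)
warnings = [tracker.record_turn("setup") for _ in range(11)]
print(sum(STAGE_BUDGETS.values()), warnings[-1])
```

The five budgets add up to 120 turns, which matches the default `max_turns` mentioned under Troubleshooting.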

Key Design Decisions:

Menu-driven interaction: Always use menu bar (large text targets) instead of toolbar icons (~24px). This was one of our most impactful design decisions -- dramatically improves click accuracy.
Macro-first for geometry: execute_freecad_macro(code) gives exact dimensions. GUI clicking is fallback.
One macro per feature: Never put the entire design in one macro. If line 5 raises an exception, lines 6-20 are skipped.
Dynamic face finding: Never hardcode Face6 or Face12. Find faces by position: max(shape.Faces, key=lambda f: f.CenterOfMass.z).
Verification gate: Agent must examine the screenshot before task_complete() is accepted.

Research Agent

File: agents/research_agent.py

The Research agent browses the web using Playwright (Chromium). It uses DuckDuckGo exclusively, which proved the most reliable option for automated browsing (no CAPTCHAs, no cookie consent walls).

Capabilities:

Web search via DuckDuckGo
Multi-page browsing and data extraction
Structured data output with confidence scoring and source URLs
Automatic report generation via Documentation agent

Output Format:

{
  "query": "M6 bolt dimensions",
  "findings": {
    "summary": "M6 bolts have a nominal diameter of 6mm...",
    "data_points": [
      {"fact": "Clearance hole (medium)", "value": "6.6", "unit": "mm", "source": "https://..."},
      {"fact": "Head width across flats", "value": "10", "unit": "mm", "source": "https://..."}
    ],
    "confidence": "high",
    "sources": ["https://...", "https://..."]
  }
}

Documentation Agent

File: agents/documentation_agent.py

Converts raw research JSON into professionally formatted documents. Automatically called by the Research agent after data collection.

Output Formats:

Word Document (.docx): Full report with headers, tables, styled text, citations
PDF Document (.pdf): Matching PDF with tables and formatted sections

Output Location: outputs/research_results/

Multi-Agent Workflow

The Planner (core/agentic_planner.py) orchestrates multi-agent workflows using Gemini 3.1 Pro (text-only) to classify requests and generate plans.

Routing Logic

Request Type	Route	Example
Exact dimensions provided	`cad` only	"Create a 50mm cylinder with 15mm radius"
Simple everyday object	`cad` only	"Make a box for storing pens"
Information lookup	`research` only	"What are M8 bolt specifications?"
Design needing real-world specs	`research+cad`	"Make a bracket for an M6 bolt"
Desktop operation	`cad` only	"Open FreeCAD and create a new body"
`--cad` flag used	`cad` only (skip research)	Any request
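
The classification itself is done by Gemini, so it cannot be reproduced in a few lines. What can be sketched is the step after it: mapping a route label to the agents that run. The route labels come from the table; `AGENT_SEQUENCES` and `agents_for` are hypothetical names:

```python
AGENT_SEQUENCES = {
    "cad": ["cad"],
    "research": ["research"],
    "research+cad": ["research", "cad"],
}


def agents_for(route, force_cad=False):
    """Return the ordered list of agents to run for a route label.

    force_cad models the --cad flag, which skips research for any request.
    """
    if force_cad:
        return ["cad"]
    try:
        return list(AGENT_SEQUENCES[route])
    except KeyError:
        raise ValueError(f"unknown route: {route!r}") from None


print(agents_for("research+cad"))
print(agents_for("research+cad", force_cad=True))
```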

Quality Gates

The pipeline includes quality gates between phases:

Research quality gate: Checks if research produced data points with sufficient confidence. Fails fast if research returned nothing.
Dimension extraction: Gemini picks ONE concrete value per dimension (not ranges).
Verification gate: CAD agent must visually verify the result before declaring done.
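
As an illustration of the research quality gate, here is a minimal check against the Research agent's output format. The accepted confidence labels are an assumption (only "high" appears in the example output), and `research_quality_gate` is a hypothetical name:

```python
def research_quality_gate(result, accepted_confidence=("high", "medium")):
    """Return (passed, reason) for a research result dict.

    Fails fast when there are no data points or the confidence label is
    outside the accepted set.
    """
    findings = result.get("findings", {})
    points = findings.get("data_points", [])
    if not points:
        return False, "research returned no data points"
    confidence = findings.get("confidence", "low")
    if confidence not in accepted_confidence:
        return False, f"confidence too low: {confidence}"
    return True, "ok"


sample = {
    "query": "M6 bolt dimensions",
    "findings": {
        "data_points": [
            {"fact": "Clearance hole (medium)", "value": "6.6", "unit": "mm"}
        ],
        "confidence": "high",
    },
}
print(research_quality_gate(sample))
print(research_quality_gate({"findings": {"data_points": []}}))
```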

The Agentic Loop

File: core/agentic_loop.py

The core engine shared by all agents. Implements a multi-turn vision cycle:

+----------------------------------------------+
|              Agentic Loop Cycle               |
|                                               |
|  1. Capture screenshot (scrot -> PNG bytes)   |
|  2. Send to Gemini (screenshot + history)     |
|  3. Gemini returns function calls             |
|  4. Execute function calls via Executor       |
|  5. Append results to conversation history    |
|  6. Check stage budgets + turn warnings       |
|  7. Repeat until task_complete or max_turns   |
+----------------------------------------------+
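
The cycle above can be sketched as a plain loop with the screenshot, model, and executor passed in as callables. This is a simplified stand-in, not the code in `core/agentic_loop.py`: `FunctionCall`, `run_loop`, and the status strings "completed" and "max_turns" are assumptions, and stage budgets and the verification gate are left out.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionCall:
    name: str
    args: dict = field(default_factory=dict)


def run_loop(capture, ask_model, execute, max_turns=120):
    """Screenshot -> model -> execute, until task_complete or max_turns."""
    history = []
    for turn in range(1, max_turns + 1):
        screenshot = capture()
        calls = ask_model(screenshot, history)
        if not calls:
            # Text-only or empty response: nudge the model and try again.
            history.append(("nudge", "You must call a function"))
            continue
        for call in calls:
            if call.name == "task_complete":
                return {
                    "status": "completed",
                    "turns": turn,
                    "summary": call.args.get("summary", ""),
                }
            history.append((call.name, execute(call)))
    return {"status": "max_turns", "turns": max_turns}


# Usage with a scripted fake model: click, empty reply, then finish.
script = iter([
    [FunctionCall("click_at", {"x": 500, "y": 500})],
    [],
    [FunctionCall("task_complete", {"summary": "done"})],
])
executed = []
outcome = run_loop(
    capture=lambda: b"fake-png",
    ask_model=lambda shot, hist: next(script),
    execute=lambda call: executed.append(call.name) or "ok",
)
print(outcome)
```

Injecting the three callables keeps the loop independent of any one executor, which is how a single loop can serve both the desktop and browser agents.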

Error Recovery

Scenario	Recovery
400 INVALID_ARGUMENT	Reset history to initial prompt + fresh screenshot
Empty model response	Update screenshot in-place, retry (up to 3x)
Text-only response (no actions)	Inject nudge: "You must call a function"
Stage over budget	Inject warning to move to next stage
Max turns approaching	Inject countdown warnings (5, 4, 3, 2, 1)
Max consecutive API errors (5)	Return "api_error" status
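
The countdown row can be illustrated with a small helper. The exact turn at which each warning fires and the wording are assumptions; `countdown_warning` is a hypothetical name:

```python
def countdown_warning(turn, max_turns, window=5):
    """Return a warning when 1..window turns remain after this one, else None."""
    remaining = max_turns - turn
    if 1 <= remaining <= window:
        return f"WARNING: {remaining} turn(s) left. Finish up and call task_complete()."
    return None


for turn in (114, 115, 119, 120):
    print(turn, countdown_warning(turn, max_turns=120))
```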

Macro Execution Engine

File: core/freecad_functions.py

The macro executor bridges the gap between the agent's reasoning and FreeCAD's Python API. Instead of relying solely on GUI clicking (imprecise for exact dimensions), the agent can run Python code directly in FreeCAD's embedded console.

How It Works

Agent calls execute_freecad_macro(code) with FreeCAD Python code
Code is wrapped in try/except that writes errors to /tmp/agent_macro_log.txt
Macro is saved to /tmp/agent_macro.py
FreeCAD window is found dynamically via xdotool search --name FreeCAD
Window geometry is used to calculate Python console input position
The run command is pasted via xclip (fast) or typed via xdotool (fallback)
Log file is read back to detect success or Python errors
Errors are returned to the agent so it can self-correct
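
The window-geometry step can be sketched as follows. `xdotool getwindowgeometry --shell <id>` prints `KEY=value` lines (WINDOW, X, Y, WIDTH, HEIGHT, SCREEN), which are easy to parse. Where the console input sits inside the window is an assumption here (a fixed margin above the bottom edge); the real calculation in `core/freecad_functions.py` may differ, and both function names are hypothetical.

```python
def parse_window_geometry(shell_output):
    """Parse `xdotool getwindowgeometry --shell` output into a dict of ints."""
    fields = {}
    for line in shell_output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and value.lstrip("-").isdigit():
            fields[key] = int(value)
    return fields


def console_input_point(geom, bottom_margin=40):
    """Assumed layout: console input is centred, bottom_margin px above the bottom edge."""
    x = geom["X"] + geom["WIDTH"] // 2
    y = geom["Y"] + geom["HEIGHT"] - bottom_margin
    return x, y


sample_output = "WINDOW=4194311\nX=0\nY=28\nWIDTH=1280\nHEIGHT=772\nSCREEN=0\n"
geometry = parse_window_geometry(sample_output)
print(geometry, console_input_point(geometry))
```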

Error Capture

All user code is wrapped in error capture:

import traceback as _tb
_log = open('/tmp/agent_macro_log.txt', 'w')
try:
    # user's FreeCAD code here
    _log.write('OK\n')
except Exception as _e:
    _log.write(f'ERROR: {_e}\n')
    _log.write(_tb.format_exc())
finally:
    _log.close()

Possible return values:

{"success": true} -- macro ran without errors
{"error": "FreeCAD macro error: ..."} -- Python exception with traceback
{"success": false, "warning": "Macro produced no output..."} -- console wasn't focused
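
A sketch of how the wrapper and the three return values could fit together. The template mirrors the error-capture code above; `wrap_macro` and `read_macro_result` are hypothetical names, and the real engine may also clear a stale log before each run.

```python
import textwrap


def wrap_macro(user_code, log_path="/tmp/agent_macro_log.txt"):
    """Indent user code into the try block of the error-capture template."""
    body = textwrap.indent(user_code.strip("\n"), "    ")
    return (
        "import traceback as _tb\n"
        f"_log = open({log_path!r}, 'w')\n"
        "try:\n"
        f"{body}\n"
        "    _log.write('OK\\n')\n"
        "except Exception as _e:\n"
        "    _log.write(f'ERROR: {_e}\\n')\n"
        "    _log.write(_tb.format_exc())\n"
        "finally:\n"
        "    _log.close()\n"
    )


def read_macro_result(log_path="/tmp/agent_macro_log.txt"):
    """Map the log file contents to one of the three result shapes."""
    try:
        with open(log_path) as fh:
            text = fh.read()
    except FileNotFoundError:
        text = ""
    if text.startswith("OK"):
        return {"success": True}
    if text.startswith("ERROR"):
        return {"error": f"FreeCAD macro error: {text.strip()}"}
    # Nothing was written: the macro never ran (console not focused).
    return {"success": False, "warning": "Macro produced no output"}


print(wrap_macro("x = 1 + 1"))
```

Because the wrapper is ordinary Python, it can be exercised outside FreeCAD with `exec()` and a temporary log path before trusting it inside the console.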

Correct FreeCAD API Patterns

The macro tool description includes correct API examples so the model generates valid code:

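import FreeCAD
import Part

# Assumed context (not part of the original snippet): an open document with a
# PartDesign body named 'Body', and (cx, cy) as the hole centre in sketch coordinates.
doc = FreeCAD.ActiveDocument
body = doc.getObject('Body')
cx, cy = 10.0, 10.0

# Find faces by position (NEVER hardcode Face6, Face12)
top_face = max(body.Shape.Faces, key=lambda f: f.CenterOfMass.z)
face_idx = body.Shape.Faces.index(top_face) + 1  # FreeCAD face names are 1-based

# Clearance hole (circle + Pocket ThroughAll -- more reliable than PartDesign::Hole)
hole_sk = body.newObject('Sketcher::SketchObject', 'HoleSketch')
hole_sk.AttachmentSupport = [(body.Tip, f'Face{face_idx}')]
hole_sk.MapMode = 'FlatFace'
# Radius 3.3 mm gives the 6.6 mm clearance hole for an M6 bolt
hole_sk.addGeometry(Part.Circle(FreeCAD.Vector(cx,cy,0), FreeCAD.Vector(0,0,1), 3.3))
doc.recompute()
hole_cut = body.newObject('PartDesign::Pocket', 'HoleCut')
hole_cut.Profile = hole_sk
hole_cut.Type = 1  # ThroughAll
doc.recompute()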
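
To see why face names should be computed rather than hardcoded, the lookup can be tried outside FreeCAD with stand-in objects. `Vec` and `FakeFace` below are test doubles that only mimic the `CenterOfMass.z` attribute; `top_face_name` is a hypothetical helper that uses `enumerate` so it does not depend on comparing face objects for equality.

```python
from collections import namedtuple

Vec = namedtuple("Vec", "x y z")
FakeFace = namedtuple("FakeFace", "CenterOfMass")


def top_face_name(faces):
    """Return the 1-based FreeCAD-style name of the face with the highest centre."""
    idx, _ = max(enumerate(faces, start=1), key=lambda pair: pair[1].CenterOfMass.z)
    return f"Face{idx}"


# Same top face (z = 10), but the face list is ordered differently
# after a feature is added, so its name changes.
faces_before = [FakeFace(Vec(5, 5, 0)), FakeFace(Vec(5, 5, 10)), FakeFace(Vec(0, 5, 5))]
faces_after = [
    FakeFace(Vec(0, 5, 5)),
    FakeFace(Vec(5, 5, 0)),
    FakeFace(Vec(5, 2, 4)),
    FakeFace(Vec(5, 5, 10)),
]
print(top_face_name(faces_before), top_face_name(faces_after))
```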

Planner Intelligence

File: core/agentic_planner.py

The Planner does more than routing -- it generates detailed FreeCAD workflows using Gemini 3.1 Pro. This is critical because the vision model works best with clear, step-by-step instructions.

LLM-Generated Workflows

Instead of hardcoded shape templates, the Planner uses Gemini 3.1 Pro to generate workflows dynamically:

Input:  "Make a bracket for an M6 bolt"
        + Research data: {hole_diameter: 6.6mm, wall_thickness: 3mm}

Output: Step 1: Create sketch on XY plane, 30x20mm rectangle, Pad 30mm
        Step 2: Create sketch on top face, 27x20mm rectangle offset 3mm, Pocket 27mm
        Step 3: Find top face of horizontal leg, circle r=3.3mm, Pocket ThroughAll
        Step 4: Find outer face of vertical leg, circle r=3.3mm, Pocket ThroughAll
        Step 5: Fillet inner corner 1mm radius

Parameter Normalization

Research data comes with inconsistent key names. The Planner normalizes them:

PARAM_ALIASES = {
    "length": "depth",
    "total_width": "width",
    "leg_thickness": "wall_thickness",
    "bolt_hole_diameter": "hole_diameter",
    "clearance_hole": "hole_diameter",
    ...
}
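
A minimal sketch of applying such an alias table. It uses only the five aliases shown above (the elided entries are left out), and the key-cleaning and first-value-wins rules are assumptions; `normalize_params` is a hypothetical name.

```python
PARAM_ALIASES = {
    "length": "depth",
    "total_width": "width",
    "leg_thickness": "wall_thickness",
    "bolt_hole_diameter": "hole_diameter",
    "clearance_hole": "hole_diameter",
}


def normalize_params(raw):
    """Map inconsistent research keys onto canonical CAD parameter names."""
    normalized = {}
    for key, value in raw.items():
        cleaned = key.strip().lower().replace(" ", "_")
        canonical = PARAM_ALIASES.get(cleaned, cleaned)
        # If two keys collapse to the same name, keep the first one seen.
        normalized.setdefault(canonical, value)
    return normalized


params = normalize_params(
    {"Clearance Hole": "6.6mm", "leg_thickness": "3mm", "width": "20mm"}
)
print(params)
```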

Available Operations Vocabulary

The Planner tells Gemini 3.1 Pro exactly what FreeCAD operations exist, so it generates valid workflows:

Pad, Pocket, Hole, Fillet, Chamfer, Thickness, Mirrored,
Sketcher: Rectangle, Circle, Line, Arc, Point,
Constraints: Coincident, Horizontal, Vertical, DistanceX, DistanceY

Skill Learning Pipeline

Directory: pipeline/

Converts YouTube FreeCAD tutorial videos into structured YAML skill files. Currently disabled in the CAD agent (minimal-instruction approach works better with Flash), but the infrastructure is ready for stronger models.

Pipeline Stages

YouTube URL -> Download (yt-dlp) -> Transcribe (VTT/Whisper) -> Keyframes (OpenCV MOG2)
           -> Label (Gemini Vision) -> Filter (quality scoring) -> Build (YAML + PNG)

# Full pipeline
python -m pipeline.run_pipeline --url "https://www.youtube.com/watch?v=VIDEO_ID"

# Specific stages only
python -m pipeline.run_pipeline --url "VIDEO_ID" --stages keyframes,label,filter,build

Pipeline Configuration

Parameter	Default	Description
`threshold`	15000	MOG2 keyframe detection sensitivity
`min_score`	3	Minimum quality score (0-5) for filtering
`api_delay`	1.0s	Delay between Gemini API calls
Video format	h264, <=720p	AV1 codec not supported on most VMs

Coordinate System & Executors

Coordinate System

Gemini Computer Use outputs coordinates on a normalized 0-1000 grid:

screen_x = int(normalized_x / 1000 * SCREEN_WIDTH)   # 500/1000 * 1280 = 640
screen_y = int(normalized_y / 1000 * SCREEN_HEIGHT)   # 500/1000 * 800  = 400
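
Wrapped as a function, the same conversion might look like the sketch below. The range check and the clamp to the last on-screen pixel are additions for illustration (a normalized value of 1000 would otherwise map to 1280, one pixel past the right edge); `denormalize` is a hypothetical name.

```python
SCREEN_WIDTH, SCREEN_HEIGHT = 1280, 800


def denormalize(nx, ny, width=SCREEN_WIDTH, height=SCREEN_HEIGHT):
    """Convert 0-1000 grid coordinates to screen pixels."""
    if not (0 <= nx <= 1000 and 0 <= ny <= 1000):
        raise ValueError(f"normalized coordinates out of range: ({nx}, {ny})")
    x = min(int(nx / 1000 * width), width - 1)
    y = min(int(ny / 1000 * height), height - 1)
    return x, y


print(denormalize(500, 500))
print(denormalize(1000, 1000))
```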

Desktop Executor Functions

Function	Description
`click_at(x, y)`	Left-click at normalized coordinates
`type_text_at(x, y, text)`	Click field, optionally clear, type text
`key_combination(keys)`	Key combo (e.g., "ctrl+z", "g+r" for rectangle)
`execute_freecad_macro(code)`	Run Python in FreeCAD console
`right_click_at(x, y)`	Right-click for context menus
`double_click_at(x, y)`	Double-click to open/select
`scroll_at(x, y, direction)`	Scroll at position
`drag_and_drop(...)`	Drag between two points
`task_complete(summary)`	Signal task completion

Project Structure

General_Computer_Use_Agent-/
|
+-- main.py                      # Entry point (CLI + interactive, --cad flag)
+-- requirements.txt             # Python dependencies
+-- scripts/
|   +-- deploy.sh                # Automated deployment (local + GCP VM)
|
+-- agents/                      # Agent implementations
|   +-- registry.py              # @register decorator + get_agent() factory
|   +-- cad_agent.py             # FreeCAD CAD agent (GUI + macros)
|   +-- research_agent.py        # Web research agent (DuckDuckGo + Playwright)
|   +-- documentation_agent.py   # Word + PDF report generator
|
+-- core/                        # Shared infrastructure
|   +-- agentic_loop.py          # Multi-turn vision loop (stage budgets, verification)
|   +-- agentic_planner.py       # Task router + LLM workflow generation
|   +-- executor.py              # Abstract Executor base class
|   +-- desktop_executor.py      # xdotool desktop executor
|   +-- browser_executor.py      # Playwright browser executor
|   +-- screenshot.py            # scrot capture + PIL resize
|   +-- settings.py              # Global config + system instruction
|   +-- models.py                # Task, TaskStatus data models
|   +-- custom_tools.py          # FreeCAD macro FunctionDeclaration
|   +-- freecad_functions.py     # Macro engine + xdotool wrappers
|   +-- skill_retrieval.py       # Demo skill matching (disabled)
|
+-- pipeline/                    # YouTube -> Skill learning pipeline
|   +-- run_pipeline.py          # Pipeline orchestrator
|   +-- crawl.py                 # yt-dlp download
|   +-- transcribe.py            # Subtitle extraction
|   +-- extract_keyframes.py     # OpenCV keyframe detection
|   +-- label_actions.py         # Gemini action labeling
|   +-- filter_quality.py        # Quality scoring
|   +-- build_skills.py          # YAML skill assembly
|
+-- skills/                      # FreeCAD knowledge base (reference)
+-- outputs/                     # Generated outputs
    +-- research_results/        # Research JSON + Word + PDF reports
    +-- cad_exports/             # Exported CAD files

Configuration Reference

`core/settings.py`

Setting	Default	Description
`SCREEN_WIDTH`	1280	VM screen width in pixels
`SCREEN_HEIGHT`	800	VM screen height in pixels
`MODEL_SCREEN_WIDTH`	1440	Screenshot width sent to Gemini
`MODEL_SCREEN_HEIGHT`	900	Screenshot height sent to Gemini
`ACTION_DELAY`	0.5s	Pause after each action
`TYPING_DELAY`	30ms	Delay between keystrokes (xdotool)
`CLICK_DELAY`	0.3s	Pause after mouse clicks
`DEFAULT_MODEL`	`gemini-3-flash-preview`	Computer Use model
`PLANNING_MODEL`	`gemini-3.1-pro-preview`	Text-only planning model

Environment Variables

Variable	Required	Description
`GEMINI_API_KEY`	Yes	Google Gemini API key
`DISPLAY`	Yes (auto)	X11 display (usually `:0`, set automatically)

Known Limitations & Lessons Learned

What Works Well

Research agent is reliable -- DuckDuckGo avoids CAPTCHAs, structured data extraction works consistently
Documentation agent produces professional reports every time
Simple CAD shapes (cubes, cylinders, basic pads) succeed ~80% of the time
Menu-driven interaction dramatically outperforms toolbar clicking
Macro execution gives precise dimensions when the console is focused correctly
Stage budgets prevent the agent from wasting all turns on one step
Two-model architecture works well -- use the smartest model (3.1 Pro) for planning/reasoning, and the Computer Use model (Flash) for execution

What Struggles

FreeCAD Python API knowledge: The vision model generates macros with wrong property names, wrong face references, and incorrect constraint syntax. The Planner mitigates this by providing correct API examples, but the model still hallucinates.
Face selection after topology changes: After a Pocket or Pad, face indices change. The model guesses Face6 or Face12 instead of using the dynamic face-finding pattern we provide.
Silent macro failures: Even with error capture, some failures are subtle -- a sketch might be created but with wrong constraints, or a Pad might succeed but with wrong dimensions.
Long sessions (80+ turns): The vision model's accuracy degrades as the context window fills with screenshots.
Rate limits: Heavy Computer Use sessions (20-80+ turns) can exhaust free tier quotas. Typically resets within 30-60 minutes.

Architecture Insights

Macro + GUI hybrid: Best results come from using macros for precise geometry and GUI clicking for navigation/menus. Neither approach alone is sufficient.
One feature per macro: Critical lesson. A single macro creating sketch + pad + pocket + hole stops at the first error, and nothing after that point is built. Run one feature per call, with screenshot verification between each.
Thickness over Pocket for hollowing: For hollow shapes, the Thickness tool (~24 turns) massively outperforms Pocket workflows (65+ turns, low success rate).

Future Work

When Gemini 3.1 Pro Gets Computer Use

The architecture is ready. When a reasoning-class model supports Computer Use:

Complex parts in one session: L-brackets with holes, gears, multi-body assemblies
Self-correcting macros: Model reads error output, fixes API calls, retries
Spatial reasoning: Understands which face is "top" after a pocket changes geometry
150+ turn sessions: Maintains precision and coherence across long designs
Reduced Planner dependency: Model can figure out the workflow itself

Expanding to Other Tools

KiCad for PCB design
Blender for 3D modeling and rendering
LibreOffice for document editing
General desktop automation beyond engineering

Other Improvements

Web interface for remote agent control and monitoring
Better parallelism -- running multiple agents simultaneously
More robust error recovery with learning from failure patterns
Leveraging the Skill Learning Pipeline with stronger models
Multi-part assemblies using FreeCAD's Assembly workbench

Troubleshooting

Common Issues

"ERROR: Set GEMINI_API_KEY first!"

export GEMINI_API_KEY="your-key"

"FreeCAD window not found" (macro execution)

FreeCAD must be running and visible
Check: xdotool search --name FreeCAD should return a window ID

"Macro produced no output"

The Python console input wasn't focused. The agent should retry.
Verify Python console is visible: View -> Panels -> Python console in FreeCAD
Install xclip for more reliable paste: sudo apt install xclip

FreeCAD "Document Recovery" dialog

rm -rf ~/.local/share/FreeCAD/recovery/*
rm -rf ~/.FreeCAD/recovery/*

Research agent fails

Rare with DuckDuckGo. The agent automatically retries the search on DuckDuckGo; it does not switch to another search engine.

400 INVALID_ARGUMENT from Gemini API

Handled automatically by history reset in the agentic loop

Rate limit errors (429)

Free tier quotas reset within 30-60 minutes
Use --cad flag to skip research and reduce API calls

CAD agent hits max_turns without finishing

Complex designs may need more turns
Increase max_turns in agents/cad_agent.py (default: 120)
Use --cad with explicit --dims for faster iteration

Authors

Louis -- desktop executor, CAD agent, macro engine
Emmanuel -- Research agent, browser executor, documentation agent, parallel research
George -- Core framework, agentic loop, planner

License

This project is part of the Gemini API Developer Competition 2025. See repository for license details.
