# See-Think-Act Agent Demo for Windows 11

This notebook demonstrates an autonomous AI agent that sees, thinks, and acts to perform tasks on Windows 11, using Ollama with Qwen3-VL.

## Features
- **See**: Captures screenshots of the desktop with the `mss` library
- **Think**: Analyzes screenshots with Qwen3-VL model via Ollama
- **Act**: Executes mouse, keyboard, and system actions autonomously
- **Loop**: Continues until task completion or max iterations reached

## Prerequisites
1. Ollama installed and running
2. Qwen3-VL model pulled: `ollama pull qwen3-vl:235b-cloud`
3. Required Python packages installed (see requirements.txt)

## Setup and Installation

In [None]:
# Install required packages
!pip install mss
!pip install pyautogui
!pip install ollama
!pip install pillow
!pip install qwen-agent

## Import Libraries and Initialize Agent

In [None]:
import sys
import logging
from pathlib import Path

# Add the current working directory to the path so our modules can be imported
sys.path.append(str(Path.cwd()))

from see_think_act_agent import SeeThinkActAgent

# Configure logging for notebook
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

print("✓ Imports successful")

## Test Ollama Connection

In [None]:
from utils.ollama_client import OllamaVisionClient

# Test connection to Ollama
client = OllamaVisionClient(model="qwen3-vl:235b-cloud")

print("Testing Ollama connection...")
if client.test_connection():
    print("\n✓ Ollama is running and model is available!")
    print("✓ Ready to start the agent")
else:
    print("\n✗ Could not connect to Ollama or model not found")
    print("\nPlease ensure:")
    print("1. Ollama is installed and running")
    print("2. Model is pulled: ollama pull qwen3-vl:235b-cloud")

## Initialize the See-Think-Act Agent

In [None]:
# Initialize the agent
agent = SeeThinkActAgent(
    model="qwen3-vl:235b-cloud",
    max_iterations=30,  # Maximum number of actions before stopping
    save_screenshots=True,  # Save screenshots for debugging
    screenshot_dir="agent_screenshots",
    log_level="INFO"
)

print("✓ Agent initialized successfully!")
print("  Model: qwen3-vl:235b-cloud")
print("  Max iterations: 30")
print("  Screenshots will be saved to: agent_screenshots/")
print(f"  Screen size: {agent.action_executor.screen_width}x{agent.action_executor.screen_height}")

## Example 1: Simple Task - Open Notepad and Type Text

This example demonstrates a simple task where the agent will:
1. Find and click on the Start menu
2. Search for Notepad
3. Open Notepad
4. Type some text

**Note:** Make sure your desktop is visible and not covered by other windows.

In [None]:
import json

# Define a simple task
task1 = "Open Notepad and type 'Hello from the AI Agent! This is a test of autonomous computer control.'"

print(f"Task: {task1}")
print("\nStarting agent...")
print("The agent will now take control and complete the task autonomously.\n")

# Run the agent
result = agent.run(task1)

# Display results
print("\n" + "=" * 80)
print("RUN FINISHED (check the result below for completion status)")
print("=" * 80)
print(json.dumps(result, indent=2))

## Example 2: Web Browsing Task

This example demonstrates a more complex task involving web browsing.

In [None]:
# Define a web browsing task
task2 = "Open Microsoft Edge browser and search for 'Ollama AI'"

print(f"Task: {task2}")
print("\nStarting agent...\n")

# Run the agent
result = agent.run(task2)

# Display results
print("\n" + "=" * 80)
print("RUN FINISHED (check the result below for completion status)")
print("=" * 80)
print(json.dumps(result, indent=2))

## Example 3: Custom Task

Run your own custom task! The agent will autonomously figure out how to complete it.

In [None]:
# Define your custom task here
custom_task = "Open Calculator and calculate 123 + 456"

print(f"Task: {custom_task}")
print("\nStarting agent...\n")

# Run the agent
result = agent.run(custom_task)

# Display results
print("\n" + "=" * 80)
print("RUN FINISHED (check the result below for completion status)")
print("=" * 80)
print(json.dumps(result, indent=2))

## View Saved Screenshots

The agent saves screenshots at each iteration. You can view them to see what the agent saw.

In [None]:
from PIL import Image
import os
from IPython.display import display

# List all screenshots
screenshot_dir = "agent_screenshots"
if os.path.exists(screenshot_dir):
    screenshots = sorted([f for f in os.listdir(screenshot_dir) if f.endswith('.png')])
    
    print(f"Found {len(screenshots)} screenshots\n")
    
    # Display the first few screenshots
    for i, screenshot_file in enumerate(screenshots[:5]):  # Show first 5
        print(f"\n--- Screenshot {i+1}: {screenshot_file} ---")
        img = Image.open(os.path.join(screenshot_dir, screenshot_file))
        # Resize for display
        img.thumbnail((800, 600))
        display(img)
else:
    print(f"No screenshots found in {screenshot_dir}/")

## How It Works

The See-Think-Act Agent operates in a continuous loop:

1. **SEE** 👁️
   - Captures a screenshot of the current desktop state
   - Uses the `mss` library for efficient screen capture on Windows

2. **THINK** 🧠
   - Sends the screenshot to Qwen3-VL model via Ollama
   - The model analyzes the image and decides the next action
   - Uses function calling to structure the response

3. **ACT** 🎯
   - Executes the action decided by the model
   - Uses `pyautogui` for mouse/keyboard control
   - Actions include: click, type, scroll, key press, etc.

4. **REPEAT** 🔄
   - Captures a new screenshot to see the result
   - Continues until the task is marked as complete
   - Maximum iterations prevent infinite loops
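The loop above can be sketched in a few lines. This is a minimal illustration, not the actual `SeeThinkActAgent` implementation: `run_loop`, its `capture`/`decide`/`execute` callables, the `"done"` action name, and the shape of the returned dictionary are all hypothetical names chosen for the example. The stubs stand in for the screenshot, model, and input layers so the control flow can be run without touching the desktop.

```python
from typing import Any, Callable, Dict


def run_loop(
    task: str,
    capture: Callable[[], Any],
    decide: Callable[[str, Any], Dict[str, Any]],
    execute: Callable[[Dict[str, Any]], None],
    max_iterations: int = 30,
) -> Dict[str, Any]:
    """Repeat see -> think -> act until the model signals completion or the cap is hit."""
    for iteration in range(1, max_iterations + 1):
        screenshot = capture()             # SEE: grab the current screen state
        action = decide(task, screenshot)  # THINK: ask the model for the next action
        if action.get("name") == "done":   # hypothetical completion signal
            return {"completed": True, "iterations": iteration}
        execute(action)                    # ACT: perform the chosen action
    # The cap was reached without a completion signal
    return {"completed": False, "iterations": max_iterations}


# Stubbed components: a scripted "model" that clicks, types, then finishes
scripted = iter([
    {"name": "click", "x": 10, "y": 20},
    {"name": "type", "text": "hi"},
    {"name": "done"},
])
executed = []
result = run_loop(
    "demo task",
    capture=lambda: b"",
    decide=lambda task, shot: next(scripted),
    execute=executed.append,
)
print(result)
print(executed)
```

Returning a `completed` flag rather than assuming success lets the caller tell a finished task apart from a run that simply hit the iteration cap.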

## Key Features

- **Autonomous Operation**: The agent decides all actions independently
- **Visual Grounding**: Uses computer vision to understand the UI
- **Function Calling**: Structured actions via tool use
- **Error Recovery**: Can adapt if actions don't work as expected
- **Screenshot History**: Saves all screenshots for debugging
- **Safe Operation**: Includes failsafe (move mouse to corner to stop)
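To illustrate how function calling and error recovery can fit together, here is a hedged sketch of an action dispatcher. The `{"name": ..., "arguments": ...}` call shape, `make_dispatcher`, and the handler names are assumptions for this example and are not taken from `action_executor.py`. The handlers only return strings here; in a real agent they would call `pyautogui`.

```python
from typing import Any, Callable, Dict

ActionHandler = Callable[..., str]


def make_dispatcher(handlers: Dict[str, ActionHandler]) -> Callable[[Dict[str, Any]], str]:
    """Build a function that routes a structured tool call to its handler."""

    def dispatch(call: Dict[str, Any]) -> str:
        name = call.get("name")
        args = call.get("arguments", {})
        if name not in handlers:
            # Report the problem as text so it can be fed back to the model
            return f"error: unknown action {name!r}"
        try:
            return handlers[name](**args)
        except TypeError as exc:
            # Typically missing or unexpected arguments in the model's call
            return f"error: bad arguments for {name!r}: {exc}"

    return dispatch


# Placeholder handlers that describe the action instead of performing it
def click(x: int, y: int) -> str:
    return f"clicked ({x}, {y})"


def type_text(text: str) -> str:
    return f"typed {len(text)} characters"


dispatch = make_dispatcher({"click": click, "type": type_text})

print(dispatch({"name": "click", "arguments": {"x": 100, "y": 200}}))
print(dispatch({"name": "drag", "arguments": {}}))
print(dispatch({"name": "type", "arguments": {"txt": "oops"}}))
```

Returning an error string instead of raising keeps the loop alive: the message can be included in the next prompt so the model has a chance to correct its call.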

## Tips for Best Results

1. **Clear Desktop**: Start with a clear, unobstructed desktop
2. **Specific Tasks**: Give clear, specific instructions
3. **Reasonable Scope**: Start with simple tasks and build up
4. **Monitor Progress**: Watch the agent work (it's educational!)
5. **Interrupt if Needed**: Press Ctrl+C or move mouse to top-left corner

## Troubleshooting

- **Model not found**: Run `ollama pull qwen3-vl:235b-cloud`
- **Ollama not running**: Start Ollama service
- **Actions too fast**: Increase wait times in action_executor.py
- **Mouse control issues**: Adjust screen size in agent initialization
- **Permission errors**: Run with appropriate permissions for screen capture
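The first two items can be told apart with a quick check against Ollama's HTTP API. The sketch below assumes the default local address `http://localhost:11434` and uses the `/api/tags` endpoint, which lists locally available models. The helper names `list_ollama_models` and `diagnose` are made up for this example, and whether a cloud-hosted tag appears in that list exactly as written is an assumption worth verifying on your machine.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional


def list_ollama_models(
    base_url: str = "http://localhost:11434", timeout: float = 3.0
) -> Optional[List[str]]:
    """Return the names of available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


def diagnose(models: Optional[List[str]], wanted: str) -> str:
    """Map the model listing to one of the troubleshooting cases."""
    if models is None:
        return "Ollama not running: start the Ollama service"
    if wanted not in models:
        return f"Model not found: run `ollama pull {wanted}`"
    return "ok"


if __name__ == "__main__":
    print(diagnose(list_ollama_models(), "qwen3-vl:235b-cloud"))
```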
