# NB1 — Minimal Coding Agent with E2B (OpenAI LLM)

**Goal:** one-shot coding agent that asks OpenAI to write a small Python program, then executes it safely inside an **E2B** sandbox (Firecracker microVM). Includes a single self-heal retry if execution fails.

**What you’ll see**
1) Generate code with OpenAI’s **Responses API**
2) Start an **E2B** sandbox and run the code with `run_code`
3) Capture stdout/stderr, and retry once with error feedback

**Why E2B?** Sandboxed execution is perfect for coding agents: it’s isolated, ephemeral, and lets the agent install packages and run arbitrary commands without touching your host.

> Docs: OpenAI Responses API · E2B Quickstart / Python `run_code`


## Prereqs

- Python ≥3.9
- Environment variables set:
  - `OPENAI_API_KEY`
  - `E2B_API_KEY`

*(Optional)* Create a `.env` and export from your shell, but **don’t store real secrets inside your repo**.

### Install packages (run once per environment)
We pin known-good versions. Feel free to upgrade later.

```bash
%pip install -U e2b-code-interpreter openai==1.108.1 python-dotenv>=1.0 tenacity>=8.2 pydantic>=2.7
```

If you're on corporate proxies, set `PIP_INDEX_URL` accordingly.

In [90]:
# --- Imports & env checks ---
import os, re, textwrap, json
from typing import Optional, Tuple

from openai import OpenAI
from e2b_code_interpreter import Sandbox
from tenacity import retry, stop_after_attempt, wait_exponential

# --- Load env from repo-local file (without clobbering OS env) ---
from pathlib import Path
try:
    from dotenv import load_dotenv
except ImportError as e:
    raise ImportError("Install python-dotenv: pip install python-dotenv") from e

ENV_PATH = Path(os.getenv("NB1_ENV_PATH", "../infra/.env")).resolve()

if ENV_PATH.exists():
    load_dotenv(ENV_PATH, override=False)  # keep exported OS vars authoritative
else:
    # Optional: keep silent in CI; or raise if you want hard fail
    print(f"[NB1] Note: {ENV_PATH} not found; relying on OS env only.")


missing = [v for v in ("OPENAI_API_KEY", "E2B_API_KEY") if not os.getenv(v)]
if missing:
    raise EnvironmentError(
        f"Set required environment variables: {missing}.\n"
        "Example (bash): export OPENAI_API_KEY=sk-...; export E2B_API_KEY=e2b_...; export OPENAI_ORGANIZATION=org-..."
    )

MODEL = os.getenv("NB1_OPENAI_MODEL", "gpt-5-nano")  # keep cost low; upgrade as you like
# OPENAI_ORGANIZATION = os.getenv("OPENAI_ORGANIZATION")
OPENAI_PROJECT_ID = os.getenv("OPENAI_PROJECT_ID")
# print({"model": MODEL, "org": OPENAI_ORGANIZATION, "project": OPENAI_PROJECT_ID})

In [91]:
# --- Small utilities: parse code blocks and pretty printing ---
def extract_code_blocks(text: str, language: str = "python") -> str:
    """Return the first ```python ...``` fenced block, or raw text if none."""
    fence = re.compile(rf"```{language}\n(.*?)```", re.DOTALL | re.IGNORECASE)
    m = fence.search(text)
    if m:
        return m.group(1).strip()
    # try any fenced block
    any_fence = re.compile(r"```([\s\S]*?)```", re.MULTILINE)
    m2 = any_fence.search(text)
    return (m2.group(1).strip() if m2 else text).strip()

def summarize_execution(execution) -> dict:
    """Best-effort extraction of stdout/stderr/text depending on SDK version."""
    out = {}
    
    # Handle None result
    if execution is None:
        return {"text": None, "stdout": [], "stderr": []}
    
    # Handle E2B Code Interpreter SDK Execution objects
    if hasattr(execution, 'logs'):
        logs = execution.logs
        if hasattr(logs, 'stdout'):
            out["stdout"] = logs.stdout if isinstance(logs.stdout, list) else [str(logs.stdout)]
        if hasattr(logs, 'stderr'):
            out["stderr"] = logs.stderr if isinstance(logs.stderr, list) else [str(logs.stderr)]
    
    # Check for text attribute
    if hasattr(execution, 'text'):
        out["text"] = execution.text
    
    # Check for direct stdout/stderr attributes (fallback)
    if "stdout" not in out and hasattr(execution, 'stdout'):
        stdout = execution.stdout
        if isinstance(stdout, list):
            out["stdout"] = stdout
        else:
            out["stdout"] = [str(stdout)] if stdout else []
    
    if "stderr" not in out and hasattr(execution, 'stderr'):
        stderr = execution.stderr
        if isinstance(stderr, list):
            out["stderr"] = stderr
        else:
            out["stderr"] = [str(stderr)] if stderr else []
    
    # Check for other common attributes
    for attr in ("output", "result"):
        if hasattr(execution, attr):
            out[attr] = getattr(execution, attr)
    
    # If we don't have stdout/stderr from above, set defaults
    if "stdout" not in out:
        out["stdout"] = []
    if "stderr" not in out:
        out["stderr"] = []
    
    return out

In [92]:
# --- LLM call: OpenAI Responses API ---
client = OpenAI() # project=OPENAI_PROJECT_ID)  # uses OPENAI_API_KEY

SYSTEM_PROMPT = (
    "You are a disciplined coding agent. Write a *single* runnable Python script.\n"
    "Constraints:\n"
    "- No external files; everything in one script.\n"
    "- Print clear results to stdout.\n"
    "- If tests are needed, use simple asserts in __main__ instead of pytest.\n"
    "Output strictly as a fenced code block:```python ...``` and nothing else."
)

def llm_generate_code(task: str) -> str:
    """Ask the model to produce a single Python script as a fenced block."""
    resp = client.responses.create(
        model=MODEL,
        input=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": task},
        ],
    )
    # Extract text per the Responses API object shape
    try:
        text = resp.output[0].content[0].text
    except Exception:
        # Fallback if SDK shape differs
        text = getattr(resp, "output_text", str(resp))
    return extract_code_blocks(text, language="python")

In [93]:
# --- E2B execution helpers ---
def run_in_e2b(code: str) -> dict:
    """Create a temporary sandbox, run the code, return logs."""
    # Sandbox is auto-terminated after leaving the context
    with Sandbox.create() as sbx:
        exec_result = sbx.run_code(code)
        return summarize_execution(exec_result)

def failed(exe_summary: dict) -> bool:
    stderr = exe_summary.get("stderr")
    if stderr and str(stderr).strip():
        return True
    # fallbacks for older SDKs
    text = exe_summary.get("text")
    if isinstance(text, str) and any(tok in text.lower() for tok in ("traceback", "error", "exception")):
        return True
    return False


In [94]:
# --- The simplest coding agent (one retry) ---
def coding_agent(task: str, max_retries: int = 1) -> dict:
    code = llm_generate_code(task)
    first = run_in_e2b(code)
    attempt = 0
    while failed(first) and attempt < max_retries:
        attempt += 1
        feedback = (
            "The previous script failed. Here are the logs (trimmed).\n\n"
            f"STDERR:\n{str(first.get('stderr',''))[:2000]}\n\n"
            f"TEXT:\n{str(first.get('text',''))[:2000]}\n\n"
            "Please FIX the bug and output ONLY a single ```python``` block containing the full corrected script."
        )
        code = llm_generate_code(task + "\n\n" + feedback)
        first = run_in_e2b(code)
    return {"code": code, "execution": first, "retries": attempt}


## Demo: a tiny programming task
We’ll ask the agent to implement a small script that:
1) Generates the first _n_ Fibonacci numbers
2) Prints them and their sum
3) Asserts a couple of quick checks in `__main__`

You can change the task to anything that’s safe to run in a sandbox.

In [95]:
TASK = (
    "Write a single Python script that defines a function fibonacci(n) -> list[int]"
    ", prints the first 10 numbers and their sum, and includes a few asserts in __main__."
    " Avoid external dependencies."
)
result = coding_agent(TASK, max_retries=1)
print("\n--- Generated Code ---\n")
print(result["code"])
print("\n--- Execution Summary ---\n")
print(json.dumps(result["execution"], indent=2, default=str))
print({"retries": result["retries"]})


--- Generated Code ---

from typing import List

def fibonacci(n: int) -> List[int]:
    """Return the first n Fibonacci numbers starting with 0, 1."""
    if n <= 0:
        return []
    seq: List[int] = []
    a, b = 0, 1
    for _ in range(n):
        seq.append(a)
        a, b = b, a + b
    return seq

if __name__ == "__main__":
    fib10 = fibonacci(10)
    print("First 10 Fibonacci numbers:", fib10)
    total = sum(fib10)
    print("Sum of first 10 Fibonacci numbers:", total)

    # Simple assertions to validate behavior
    assert len(fib10) == 10
    assert fib10[:5] == [0, 1, 1, 2, 3]
    assert total == 88
    assert fibonacci(0) == []
    assert fibonacci(1) == [0]
    assert fib10[-1] == 34

--- Execution Summary ---

{
  "stdout": [
    "First 10 Fibonacci numbers: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]\nSum of first 10 Fibonacci numbers: 88\n"
  ],
  "stderr": [],
  "text": null
}
{'retries': 0}


## Notes & Next Steps
- **Timeout/TTL:** a default sandbox lifetime is short (minutes). For longer sessions, keep it open and reuse it.
- **Packages:** you can install packages at runtime with `commands.run('pip install ...')` or build a custom template image.
- **Shell commands:** prefer `sandbox.commands.run()` for bash-style steps.
- **Iteration:** For NB2, we’ll add loop control, state tracking, and traces.

**Links** *(open when connected to the internet)*:
- OpenAI Responses API Quickstart & Reference
- E2B Quickstart (start sandbox, env var, run code)
- E2B Python `run_code` + commands docs


# Continuation — Artifacts & Sandbox Management

In this section we:
1) **Persist** a sandbox to capture multiple files/folders created by code.
2) **List** the sandbox filesystem as a tree.
3) **Download** artifacts preserving folder structure (either all files or a single compressed tarball).
4) **Monitor** active sandboxes and **shut them down**.


In [96]:
# --- Persistent sandbox + filesystem helpers ---
from dataclasses import dataclass
from typing import List, Dict, Any, Optional
from e2b_code_interpreter import Sandbox
import os, io, tarfile, time, json, shlex
from pathlib import Path

PERSIST_TIMEOUT_SECONDS = int(os.getenv("NB1_PERSIST_TIMEOUT", "600"))  # 10 min

@dataclass
class FileEntry:
    path: str
    is_dir: bool
    size: Optional[int] = None

def extract_output_from_execution(execution) -> str:
    """Extract text output from E2B execution result."""
    if execution is None:
        return ""
    
    # The E2B SDK returns Execution objects with logs.stdout as a list
    if hasattr(execution, 'logs') and hasattr(execution.logs, 'stdout'):
        stdout_list = execution.logs.stdout
        if isinstance(stdout_list, list):
            return '\n'.join(stdout_list)
        else:
            return str(stdout_list) if stdout_list else ""
    
    # Fallback checks
    if hasattr(execution, 'text') and execution.text:
        return execution.text
    
    if hasattr(execution, 'stdout'):
        stdout = execution.stdout
        if isinstance(stdout, list):
            return '\n'.join(stdout)
        else:
            return str(stdout) if stdout else ""
    
    return str(execution) if execution else ""

def list_tree(sbx: Sandbox, root: str = "/home/user") -> List[Dict[str, Any]]:
    """
    Return a list[dict] with entries in 'root' (recursive).
    Uses a Python script executed via run_code to walk the filesystem.
    """
    # Use run_code to execute Python directly in the sandbox
    code = f'''
import os, json
root = {json.dumps(root)}
results = []
try:
    for dp, dns, fns in os.walk(root):
        for name in dns:
            p = os.path.join(dp, name)
            try:
                st = os.lstat(p)
                results.append({{"path": p, "type": "dir", "size": st.st_size, "mtime": int(st.st_mtime)}})
            except Exception as e:
                results.append({{"path": p, "type": "dir", "error": str(e)}})
        for name in fns:
            p = os.path.join(dp, name)
            try:
                st = os.lstat(p)
                results.append({{"path": p, "type": "file", "size": st.st_size, "mtime": int(st.st_mtime)}})
            except Exception as e:
                results.append({{"path": p, "type": "file", "error": str(e)}})
    for r in results:
        print(json.dumps(r))
except Exception as e:
    print(json.dumps({{"error": f"Failed to walk {{root}}: {{str(e)}}"}}))
'''
    result = sbx.run_code(code)
    output = extract_output_from_execution(result)
    
    entries = []
    if output:
        for line in output.strip().split('\n'):
            if line.strip():
                try:
                    entries.append(json.loads(line.strip()))
                except json.JSONDecodeError:
                    continue
    return entries

def print_tree(entries: List[Dict[str, Any]], root: str = "/home/user"):
    """
    Pretty-print entries from list_tree output.
    """
    from os.path import relpath

    for e in sorted(entries, key=lambda x: x.get("path", "")):
        rel = relpath(e["path"], root) if e.get("path") else "?"
        suffix = f"  [ERR {e.get('error')}]" if e.get("error") else ""
        size_info = f" ({e.get('size', 0)} bytes)" if e.get('type') == 'file' else ""
        print(f"{(e.get('type') or '?'):4}  {rel}{size_info}{suffix}")

def download_all_as_tar(sbx: Sandbox, remote_root: str = "/home/user", local_tar_path: str = None) -> str:
    """Create a tar.gz in the sandbox and stream it locally. Preserves structure."""
    # Use a safe, writable local path in the current working directory
    if local_tar_path is None:
        local_tar_path = "artifacts/e2b_demo_project.tar.gz"
    
    local_tar_path = Path(local_tar_path)
    local_tar_path.parent.mkdir(parents=True, exist_ok=True)

    # Create a gzip tar inside the sandbox using run_code with subprocess
    remote_tar = "/tmp/bundle.tar.gz"
    tar_code = f'''
import subprocess
import os
try:
    result = subprocess.run([
        "tar", "-czf", "{remote_tar}", 
        "-C", "{remote_root}", "."
    ], capture_output=True, text=True, check=True)
    print(f"Tar created successfully: {{result.returncode}}")
except subprocess.CalledProcessError as e:
    print(f"Tar creation failed: {{e.stderr}}")
    raise
except Exception as e:
    print(f"Error: {{str(e)}}")
    raise
'''
    tar_result = sbx.run_code(tar_code)
    tar_output = extract_output_from_execution(tar_result)
    print("Tar creation result:", tar_output)

    # Read the tar file using run_code
    read_code = f'''
with open("{remote_tar}", "rb") as f:
    import base64
    data = f.read()
    encoded = base64.b64encode(data).decode()
    print("BASE64_START")
    print(encoded)
    print("BASE64_END")
'''
    read_result = sbx.run_code(read_code)
    output = extract_output_from_execution(read_result)
    
    if not output:
        raise RuntimeError("Failed to read tar data from sandbox - no output received")
    
    # Extract base64 encoded data
    lines = output.strip().split('\n')
    start_idx = -1
    end_idx = -1
    for i, line in enumerate(lines):
        if line.strip() == "BASE64_START":
            start_idx = i + 1
        elif line.strip() == "BASE64_END":
            end_idx = i
            break
    
    if start_idx != -1 and end_idx != -1:
        import base64
        encoded_data = ''.join(lines[start_idx:end_idx])
        data = base64.b64decode(encoded_data)
        local_tar_path.write_bytes(data)
        return str(local_tar_path)
    else:
        raise RuntimeError("Failed to extract tar data from sandbox")

def download_folder_recursive(sbx: Sandbox, remote_root: str, local_root: str = "artifacts/e2b_demo_project") -> str:
    """Recursively mirror files from sandbox -> local path. Use when you need direct files.
    Prefer tar for large trees."""
    local_root = Path(local_root)
    local_root.mkdir(parents=True, exist_ok=True)
    
    entries = list_tree(sbx, root=remote_root)
    for entry in entries:
        if entry.get("error"):
            print(f"Skipping {entry['path']} due to error: {entry['error']}")
            continue
            
        rel_path = os.path.relpath(entry["path"], remote_root)
        local_path = local_root / rel_path
        
        if entry["type"] == "dir":
            local_path.mkdir(parents=True, exist_ok=True)
        elif entry["type"] == "file":
            local_path.parent.mkdir(parents=True, exist_ok=True)
            try:
                # Read file using run_code
                read_code = f'''
import base64
try:
    with open("{entry["path"]}", "rb") as f:
        data = f.read()
        encoded = base64.b64encode(data).decode()
        print("FILE_START")
        print(encoded)
        print("FILE_END")
except Exception as e:
    print(f"Error reading file: {{str(e)}}")
'''
                read_result = sbx.run_code(read_code)
                output = extract_output_from_execution(read_result)
                
                if output:
                    # Extract base64 encoded data
                    lines = output.strip().split('\n')
                    start_idx = -1
                    end_idx = -1
                    for i, line in enumerate(lines):
                        if line.strip() == "FILE_START":
                            start_idx = i + 1
                        elif line.strip() == "FILE_END":
                            end_idx = i
                            break
                    
                    if start_idx != -1 and end_idx != -1:
                        import base64
                        encoded_data = ''.join(lines[start_idx:end_idx])
                        data = base64.b64decode(encoded_data)
                        local_path.write_bytes(data)
                    else:
                        print(f"Failed to extract data for {entry['path']}")
                else:
                    print(f"No output received for {entry['path']}")
                    
            except Exception as e:
                print(f"Failed to download {entry['path']}: {e}")
    
    return str(local_root)

def new_persistent_sandbox(timeout_seconds: int = PERSIST_TIMEOUT_SECONDS) -> Sandbox:
    sbx = Sandbox.create(timeout=timeout_seconds)
    print({"sandboxId": sbx.sandbox_id, "timeout_s": timeout_seconds})
    return sbx

In [97]:
# --- Demo: create nested files, then list & download ---
PERSIST_SBX = new_persistent_sandbox()

# Generate a small project structure from code
code = r'''
import os, json
base = '/home/user/demo_project'
os.makedirs(base + '/pkg/utils', exist_ok=True)
open(base + '/README.md', 'w').write('# Demo Project\n')
open(base + '/pkg/__init__.py', 'w').write('')
open(base + '/pkg/utils/helpers.py', 'w').write('def add(a,b): return a+b\n')
open(base + '/main.py', 'w').write('from pkg.utils.helpers import add\nprint(add(2,3))\n')
print('Wrote project to', base)
'''
result = PERSIST_SBX.run_code(code)
print("=== Project creation result ===")
output = extract_output_from_execution(result)
print("Code execution result:", output)

# List the created files
entries = list_tree(PERSIST_SBX, root='/home/user')
print("\n--- File Tree ---")
print_tree(entries, root='/home/user')

# Download: as a tarball (using relative path in current directory)
print("\n--- Downloading as tarball ---")
tar_path = download_all_as_tar(PERSIST_SBX, remote_root='/home/user/demo_project', local_tar_path='artifacts/e2b_demo_project.tar.gz')
print({'tar_saved_to': tar_path})

# Download: direct mirror (optional, using relative path)
print("\n--- Downloading as direct mirror ---")
mirror_dir = download_folder_recursive(PERSIST_SBX, remote_root='/home/user/demo_project', local_root='artifacts/e2b_demo_project_mirror')
print({'mirrored_to': mirror_dir})

# Verify the tar contents
print("\n--- Verifying tar contents ---")
import tarfile
try:
    with tarfile.open(tar_path, 'r:gz') as tar:
        print("Files in tarball:", tar.getnames())
except Exception as e:
    print(f"Error reading tar: {e}")

# List local mirror contents
print("\n--- Local mirror contents ---")
import os
if os.path.exists(mirror_dir):
    for root, dirs, files in os.walk(mirror_dir):
        level = root.replace(mirror_dir, '').count(os.sep)
        indent = ' ' * 2 * level
        print(f"{indent}{os.path.basename(root)}/")
        subindent = ' ' * 2 * (level + 1)
        for file in files:
            print(f"{subindent}{file}")
else:
    print(f"Mirror directory not found: {mirror_dir}")

{'sandboxId': 'i4pgplvnimx5zyt93a29g', 'timeout_s': 600}
=== Project creation result ===
Code execution result: Wrote project to /home/user/demo_project


--- File Tree ---
file  .bash_logout (220 bytes)
file  .bashrc (3526 bytes)
file  .profile (807 bytes)
dir   demo_project
file  demo_project/README.md (15 bytes)
file  demo_project/main.py (50 bytes)
dir   demo_project/pkg
file  demo_project/pkg/__init__.py (0 bytes)
dir   demo_project/pkg/utils
file  demo_project/pkg/utils/helpers.py (25 bytes)

--- Downloading as tarball ---
Tar creation result: Tar created successfully: 0

{'tar_saved_to': 'artifacts/e2b_demo_project.tar.gz'}

--- Downloading as direct mirror ---
{'mirrored_to': 'artifacts/e2b_demo_project_mirror'}

--- Verifying tar contents ---
Files in tarball: ['.', './pkg', './pkg/utils', './pkg/utils/helpers.py', './pkg/__init__.py', './README.md', './main.py']

--- Local mirror contents ---
e2b_demo_project_mirror/
  main.py
  README.md
  pkg/
    __init__.py
    utils/
   

In [98]:
# --- Monitor & shutdown sandboxes ---
from e2b_code_interpreter import Sandbox
from typing import Iterable

def list_running_or_paused(limit: int = 100) -> list:
    """List running/paused sandboxes using E2B SDK."""
    try:
        # Use E2B's list method with proper pagination
        paginator = Sandbox.list()
        print(f"Paginator type: {type(paginator)}")
        
        items = []
        
        # Check if it has nextItems method (proper E2B pagination)
        if hasattr(paginator, 'nextItems'):
            try:
                # Get first page
                first_page = paginator.nextItems()
                items.extend(first_page)
                print(f"Got {len(first_page)} items from first page")
                
                # Get remaining pages if hasNext is True
                while hasattr(paginator, 'hasNext') and paginator.hasNext:
                    next_page = paginator.nextItems()
                    items.extend(next_page)
                    print(f"Got {len(next_page)} items from next page")
            except Exception as e:
                print(f"Pagination failed: {e}")
        
        # Fallback: try direct iteration
        elif hasattr(paginator, '__iter__'):
            try:
                items = list(paginator)
                print(f"Got {len(items)} items via iteration")
            except Exception as e:
                print(f"Iteration failed: {e}")
                
        return items
    except Exception as e:
        print(f"[list_running_or_paused] error: {e}")
        return []

def pretty_sbx_info(items: Iterable) -> None:
    try:
        items_list = list(items) if hasattr(items, '__iter__') else [items]
    except TypeError:
        print(f"Cannot iterate over items: {type(items)}")
        return
        
    for it in items_list:
        if it is None:
            continue
            
        try:
            # Extract sandbox information using E2B SDK methods
            if hasattr(it, 'get_info'):
                # Use get_info() method if available
                info = it.get_info()
                print(f"Sandbox info: {info}")
            else:
                # Try direct attribute access
                print(f"Item type: {type(it)}")
                sid = getattr(it, 'sandbox_id', getattr(it, 'id', 'unknown'))
                state = getattr(it, 'state', 'unknown')
                metadata = getattr(it, 'metadata', {})
                print({'sandboxId': sid, 'state': state, 'metadata': metadata})
        except Exception as e:
            print(f"Error processing sandbox info: {e}, item: {it}")

def kill_by_id(sandbox_id: str) -> bool:
    """Kill a sandbox by its ID using E2B's static kill method."""
    try:
        return Sandbox.kill(sandbox_id)
    except Exception as e:
        print(f"[kill_by_id] error: {e}")
        return False

def kill_all_running() -> None:
    """Kill all running sandboxes."""
    items = list_running_or_paused()
    for it in items:
        if it is None:
            continue
            
        try:
            # Try to get sandbox ID
            sid = getattr(it, 'sandbox_id', getattr(it, 'id', None))
            if not sid:
                print(f"No sandbox ID found for item: {it}")
                continue
                
            ok = Sandbox.kill(sid)
            print({'killed': sid, 'ok': ok})
        except Exception as e:
            print({'killed': getattr(it, 'sandbox_id', 'unknown'), 'ok': False, 'error': str(e)})

# Check if we have an active sandbox first
print("=== Current sandbox status ===")
if 'PERSIST_SBX' in locals():
    print(f"PERSIST_SBX type: {type(PERSIST_SBX)}")
    if hasattr(PERSIST_SBX, 'sandbox_id'):
        print(f"Sandbox ID: {PERSIST_SBX.sandbox_id}")
    
    # Test if sandbox is still active by running a simple command
    try:
        test_result = PERSIST_SBX.run_code('print("Sandbox is alive!")')
        if test_result:
            output = extract_output_from_execution(test_result)
            print(f"Sandbox test result: {output}")
        else:
            print("Sandbox test returned None - may be terminated")
    except Exception as e:
        print(f"Sandbox test failed: {e}")
        
    # Try to get sandbox info
    try:
        if hasattr(PERSIST_SBX, 'get_info'):
            info = PERSIST_SBX.get_info()
            print(f"Sandbox info: {info}")
    except Exception as e:
        print(f"Failed to get sandbox info: {e}")

# Show currently running/paused sandboxes
print("\n=== Listing all sandboxes ===")
try:
    items = list_running_or_paused()
    print(f"Found {len(items)} sandboxes")
    if items:
        pretty_sbx_info(items)
    else:
        print("No sandboxes found - they may have auto-terminated after timeout")
except Exception as e:
    print(f"Error listing sandboxes: {e}")

=== Current sandbox status ===
PERSIST_SBX type: <class 'e2b_code_interpreter.code_interpreter_sync.Sandbox'>
Sandbox ID: i4pgplvnimx5zyt93a29g
Sandbox test result: Sandbox is alive!

Sandbox info: SandboxInfo(sandbox_id='i4pgplvnimx5zyt93a29g', sandbox_domain=None, template_id='nlhz8vlwyupq845jsdg9', name='code-interpreter-v1', metadata={}, started_at=datetime.datetime(2025, 9, 20, 23, 13, 53, 139222, tzinfo=tzutc()), end_at=datetime.datetime(2025, 9, 20, 23, 23, 53, 139222, tzinfo=tzutc()), state=<SandboxState.RUNNING: 'running'>, cpu_count=2, memory_mb=1024, envd_version='0.2.10', _envd_access_token='b214551d45d3c8c4aee8408276f8e5e58001854d111fc6d46152ab86367e10d0')

=== Listing all sandboxes ===
Paginator type: <class 'e2b.sandbox_sync.paginator.SandboxPaginator'>
Found 0 sandboxes
No sandboxes found - they may have auto-terminated after timeout


### Usage tips
- Prefer **tar download** for large/many files; it’s faster and preserves structure.
- You can **pause** long-lived sandboxes and resume later (beta). While paused, files and memory persist. See E2B docs.
- To **shut down**: `kill_by_id('<sandbox_id>')` or `kill_all_running()`.
- If you used `with Sandbox.create() as sbx: ...`, that sandbox auto-terminates at the end of the context.


In [99]:
# --- Cleanup and close the persistent sandbox ---
if 'PERSIST_SBX' in locals():
    try:
        # E2B uses kill() method to terminate sandboxes
        PERSIST_SBX.kill()
        print("Sandbox terminated successfully with kill()")
    except Exception as e:
        print(f"Error terminating sandbox: {e}")
        # Print available methods for debugging
        print(f"Available methods: {[method for method in dir(PERSIST_SBX) if not method.startswith('_')]}")
else:
    print("No persistent sandbox to close")

Sandbox terminated successfully with kill()


# Human-in-the-Loop (HITL) Coding Agent with OpenTelemetry & LangSmith

This section demonstrates a complete **Human-in-the-Loop** coding workflow with:
- **LangGraph** state management with interrupts
- **OpenTelemetry** tracing with GenAI semantic conventions
- **LangSmith** integration for trace visualization
- **E2B** sandboxed execution
- **Tar loading** for project artifacts

## Features

1. **Code Review Stage**: Human approves/edits/rejects generated code
2. **Execution Review Stage**: Optional human review after execution
3. **Full Tracing**: Every step traced with OTEL spans
4. **LangSmith Integration**: Traces appear in your LangSmith project
5. **Artifact Management**: Load/save project files via tar

The workflow follows the pattern shown in the HITL Usage Guide but runs entirely within this E2B notebook environment.

In [100]:
# --- HITL Dependencies ---
import uuid
import time
import asyncio
from typing import Dict, Any, List, Optional, Literal, TypedDict, Annotated
from dataclasses import dataclass
from enum import Enum

# LangGraph imports
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
from langgraph.types import Command, interrupt

# OpenTelemetry imports  
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.semconv.trace import SpanAttributes
from opentelemetry.trace import Status, StatusCode

# LangSmith imports
from langsmith import Client as LangSmithClient, traceable
from langsmith.run_helpers import tracing_context

print("✅ HITL dependencies installed")

✅ HITL dependencies installed


In [101]:
# --- Configure OpenTelemetry + LangSmith Tracing ---

# Initialize OpenTelemetry
resource = Resource.create({
    "service.name": "hitl-coding-agent",
    "service.version": "0.1.0",
    "environment": "notebook"
})

# Set up tracer provider
tracer_provider = TracerProvider(resource=resource)
trace.set_tracer_provider(tracer_provider)

# Configure exporters
langsmith_api_key = os.getenv("LANGSMITH_API_KEY")
langsmith_project = os.getenv("LANGSMITH_PROJECT", "agents-demo")

if langsmith_api_key:
    # OTLP exporter for LangSmith
    otlp_exporter = OTLPSpanExporter(
        endpoint="https://api.smith.langchain.com/otel",
        headers={
            "x-api-key": langsmith_api_key,
            "Langsmith-Project": langsmith_project
        }
    )
    tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    print(f"✅ OTLP exporter configured for LangSmith project: {langsmith_project}")
else:
    print("⚠️  LANGSMITH_API_KEY not set, skipping LangSmith integration")

# Always add console exporter for local debugging
console_exporter = ConsoleSpanExporter()
tracer_provider.add_span_processor(BatchSpanProcessor(console_exporter))

# Get tracer
tracer = trace.get_tracer(__name__)

# Initialize LangSmith client
langsmith_client = LangSmithClient(api_key=langsmith_api_key) if langsmith_api_key else None
if langsmith_client:
    print(f"✅ LangSmith client initialized")

print("✅ OpenTelemetry + LangSmith tracing configured")

Overriding of current TracerProvider is not allowed


✅ OTLP exporter configured for LangSmith project: pr-majestic-codling-98
✅ LangSmith client initialized
✅ OpenTelemetry + LangSmith tracing configured


In [102]:
# --- HITL State Definition ---

class HITLDecision(str, Enum):
    APPROVE = "approve"
    EDIT = "edit" 
    REJECT = "reject"

class HITLStage(str, Enum):
    CODE_REVIEW = "code_review"
    EXECUTION_REVIEW = "execution_review"

@dataclass
class ApprovalPayload:
    code: str
    task: str
    suggestion: str
    options: List[str]
    stage: HITLStage

@dataclass
class DecisionPayload:
    decision: HITLDecision
    code: Optional[str] = None
    reason: Optional[str] = None

class HITLState(TypedDict):
    user_id: str
    thread_id: str
    user_query: str
    generated_code: str
    execution_result: Dict[str, Any]
    final_result: str
    
    # HITL-specific fields
    stage: HITLStage
    approval_payload: Optional[ApprovalPayload]
    human_decision: Optional[DecisionPayload]
    
    # Metrics
    total_cost: float
    token_usage: Dict[str, int]
    error_message: Optional[str]
    retries: int

print("✅ HITL state models defined")

✅ HITL state models defined


In [103]:
# --- HITL Node Implementations (Robust State Handling) ---

@traceable(name="generate_code_node")
def generate_code_node(state: HITLState) -> HITLState:
    """Generate code using OpenAI with full tracing."""
    with tracer.start_as_current_span("generate_code_node") as span:
        span.set_attribute("node", "generate_code")
        
        # Robust state access with fallbacks
        user_query = state.get("user_query", "Write a simple Python script")
        span.set_attribute("user_query", user_query)
        
        try:
            # Use existing LLM generation from earlier cells
            start_time = time.time()
            code = llm_generate_code(user_query)
            latency_ms = (time.time() - start_time) * 1000
            
            # Set span attributes for GenAI
            span.set_attribute("gen_ai.system", "openai")
            span.set_attribute("gen_ai.operation.name", "responses")
            span.set_attribute("gen_ai.request.model", MODEL)
            span.set_attribute("gen_ai.prompt", user_query[:500])  # Truncated
            span.set_attribute("gen_ai.completion", code[:500])  # Truncated
            span.set_attribute("latency_ms", latency_ms)
            
            # Update state
            state["generated_code"] = code
            state["stage"] = HITLStage.CODE_REVIEW
            
            # Create approval payload
            state["approval_payload"] = ApprovalPayload(
                code=code,
                task=user_query,
                suggestion="Please review the code and choose: approve, edit, or reject",
                options=["approve", "edit", "reject"],
                stage=HITLStage.CODE_REVIEW
            )
            
            span.set_status(Status(StatusCode.OK))
            print(f"📝 Code generated ({len(code)} chars)")
            
        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            state["error_message"] = str(e)
            
    return state

@traceable(name="code_review_node")
def code_review_node(state: HITLState) -> HITLState:
    """Handle code review stage - interrupts for human input."""
    with tracer.start_as_current_span("code_review_node") as span:
        span.set_attribute("node", "code_review")
        span.set_attribute("hitl.stage", HITLStage.CODE_REVIEW.value)
        
        # This will interrupt the graph for human input
        return Command(
            update={"stage": HITLStage.CODE_REVIEW},
            goto=interrupt("Please review the generated code")
        )

@traceable(name="execute_code_node")
def execute_code_node(state: HITLState) -> HITLState:
    """Execute code in E2B sandbox with tracing."""
    with tracer.start_as_current_span("execute_code_node") as span:
        span.set_attribute("node", "execute_code")

        try:
            start_time = time.time()

            # Get code to execute (prefer edited code if available)
            generated_code = state.get("generated_code", "print('No code to execute')")
            
            # Use the persistent sandbox for execution to preserve files for artifact creation
            if 'PERSIST_SBX' in globals() and PERSIST_SBX:
                try:
                    exec_result = PERSIST_SBX.run_code(generated_code)
                    execution = summarize_execution(exec_result)
                except Exception as e:
                    print(f"⚠️  Persistent sandbox failed: {e}, falling back to temporary")
                    execution = run_in_e2b(generated_code)
            else:
                # Fallback to temporary sandbox
                execution = run_in_e2b(generated_code)

            latency_ms = (time.time() - start_time) * 1000

            # Set sandbox attributes
            span.set_attribute("sandbox.success", not failed(execution))
            span.set_attribute("latency_ms", latency_ms)

            state["execution_result"] = execution

            if failed(execution):
                state["error_message"] = str(execution.get("stderr", "Unknown error"))
                span.set_attribute("sandbox.error", state["error_message"])
            else:
                state["final_result"] = "\n".join(execution.get("stdout", []))
                span.set_status(Status(StatusCode.OK))

            print(f"🏃 Code executed in E2B ({latency_ms:.1f}ms)")

        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            state["error_message"] = str(e)

    return state

@traceable(name="save_artifacts_node")
def save_artifacts_node(state: HITLState) -> HITLState:
    """Save project artifacts as tar for download."""
    with tracer.start_as_current_span("save_artifacts_node") as span:
        span.set_attribute("node", "save_artifacts")
        
        try:
            # Get required state values with fallbacks
            thread_id = state.get("thread_id", f"unknown_{uuid.uuid4().hex[:8]}")
            user_query = state.get("user_query", "Unknown task")
            generated_code = state.get("generated_code", "# No code generated")
            
            # Only save artifacts if we have a persistent sandbox and execution succeeded
            if 'PERSIST_SBX' in globals() and PERSIST_SBX and not state.get("error_message"):
                try:
                    # Write the generated code to a file in the persistent sandbox
                    write_code = f'''
import os
project_dir = "/home/user/hitl_project_{thread_id}"
os.makedirs(project_dir, exist_ok=True)

# Write the main code
with open(os.path.join(project_dir, "main.py"), "w") as f:
    f.write({repr(generated_code)})

# Write a README
with open(os.path.join(project_dir, "README.md"), "w") as f:
    f.write(f"# HITL Generated Project\\n\\nTask: {repr(user_query)}\\n\\nGenerated on: {{__import__('datetime').datetime.now()}}\\n")

print(f"Project saved to {{project_dir}}")
'''
                    result = PERSIST_SBX.run_code(write_code)
                    output = extract_output_from_execution(result)
                    print(f"📁 Project structure created: {output}")
                    
                    # Create tar archive
                    tar_path = download_all_as_tar(
                        PERSIST_SBX, 
                        remote_root=f"/home/user/hitl_project_{thread_id}",
                        local_tar_path=f"artifacts/hitl_project_{thread_id}.tar.gz"
                    )
                    
                    span.set_attribute("artifacts.tar_path", tar_path)
                    print(f"📦 Artifacts saved to: {tar_path}")
                    
                except Exception as artifact_error:
                    print(f"⚠️  Artifact saving failed: {artifact_error}")
                    span.set_attribute("artifacts.error", str(artifact_error))
            else:
                reason = "no persistent sandbox" if 'PERSIST_SBX' not in globals() or not PERSIST_SBX else "execution failed"
                print(f"⚠️  Skipping artifact save: {reason}")
                
            span.set_status(Status(StatusCode.OK))
            
        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            state["error_message"] = str(e)
            print(f"❌ Error saving artifacts: {e}")
            
    return state

print("✅ HITL nodes implemented (with robust state handling)")

✅ HITL nodes implemented (with robust state handling)


In [104]:
# --- HITL Graph Construction ---

def create_hitl_graph() -> StateGraph:
    """Create the HITL coding workflow graph."""
    
    # Create the graph
    graph = StateGraph(HITLState)
    
    # Add nodes
    graph.add_node("generate_code", generate_code_node)
    graph.add_node("code_review", code_review_node)
    graph.add_node("execute_code", execute_code_node)
    graph.add_node("save_artifacts", save_artifacts_node)
    
    # Define routing logic
    def decide_next_node(state: HITLState) -> str:
        """Route to next node based on current state and human decisions."""
        
        # Check for human decision
        if state.get("human_decision"):
            decision = state["human_decision"]
            
            if decision.decision == HITLDecision.APPROVE:
                return "execute_code"
            elif decision.decision == HITLDecision.EDIT:
                # Update code with human edits if provided
                if decision.code:
                    state["generated_code"] = decision.code
                return "execute_code"
            elif decision.decision == HITLDecision.REJECT:
                return END
        
        # Default routing based on stage
        current_stage = state.get("stage")
        if current_stage == HITLStage.CODE_REVIEW:
            return "execute_code"
        else:
            return END
    
    # Add edges
    graph.add_edge(START, "generate_code")
    graph.add_edge("generate_code", "code_review")
    
    # Conditional routing after code review
    graph.add_conditional_edges(
        "code_review", 
        decide_next_node,
        {
            "execute_code": "execute_code",
            END: END
        }
    )
    
    graph.add_edge("execute_code", "save_artifacts")
    graph.add_edge("save_artifacts", END)
    
    return graph

# Create the compiled graph with checkpointer
checkpointer = MemorySaver()
hitl_graph = create_hitl_graph().compile(
    checkpointer=checkpointer,
    interrupt_before=["code_review"]  # Interrupt before human review
)

print("✅ HITL graph created with checkpointer")

✅ HITL graph created with checkpointer


In [105]:
# --- HITL Helper Functions (Fixed State Management) ---

def create_hitl_session(user_query: str, user_id: str = "notebook_user") -> str:
    """Create a new HITL session and return thread_id."""
    thread_id = f"hitl_{uuid.uuid4().hex[:8]}"
    
    initial_state = HITLState(
        user_id=user_id,
        thread_id=thread_id,
        user_query=user_query,
        generated_code="",
        execution_result={},
        final_result="",
        stage=HITLStage.CODE_REVIEW,
        approval_payload=None,
        human_decision=None,
        total_cost=0.0,
        token_usage={},
        error_message=None,
        retries=0
    )
    
    return thread_id, initial_state

def start_hitl_workflow(user_query: str) -> tuple[str, dict]:
    """Start the HITL workflow and return thread_id and current state."""
    with tracer.start_as_current_span("start_hitl_workflow") as span:
        span.set_attribute("user_query", user_query)
        
        thread_id, initial_state = create_hitl_session(user_query)
        config = {"configurable": {"thread_id": thread_id}}
        
        # Run until first interrupt (code review)
        result = hitl_graph.invoke(initial_state, config=config)
        
        span.set_attribute("thread_id", thread_id)
        span.set_attribute("stage", result.get("stage", "unknown"))
        
        return thread_id, result

def resume_hitl_workflow(thread_id: str, decision: DecisionPayload) -> dict:
    """Resume the HITL workflow with human decision."""
    with tracer.start_as_current_span("resume_hitl_workflow") as span:
        span.set_attribute("thread_id", thread_id)
        span.set_attribute("hitl.decision", decision.decision.value)
        
        config = {"configurable": {"thread_id": thread_id}}
        
        try:
            # Get the current state from the checkpointer first
            current_state = hitl_graph.get_state(config)
            if current_state and current_state.values:
                # Merge human decision with existing state
                merged_state = dict(current_state.values)
                merged_state["human_decision"] = decision
                
                # If decision includes edited code, update it
                if decision.decision == HITLDecision.EDIT and decision.code:
                    merged_state["generated_code"] = decision.code
                
                # Continue execution with merged state
                result = hitl_graph.invoke(merged_state, config=config)
            else:
                # Fallback: create minimal state if checkpoint is missing
                print("⚠️  No checkpoint found, creating minimal state")
                minimal_state = {
                    "user_id": "unknown",
                    "thread_id": thread_id,
                    "user_query": "Resume workflow",  # Fallback query
                    "generated_code": "",
                    "execution_result": {},
                    "final_result": "",
                    "stage": HITLStage.CODE_REVIEW,
                    "approval_payload": None,
                    "human_decision": decision,
                    "total_cost": 0.0,
                    "token_usage": {},
                    "error_message": None,
                    "retries": 0
                }
                result = hitl_graph.invoke(minimal_state, config=config)
            
            return result
            
        except Exception as e:
            print(f"❌ Error resuming workflow: {e}")
            # Return error state
            return {
                "thread_id": thread_id,
                "error_message": str(e),
                "stage": "error"
            }

def display_approval_request(state: dict):
    """Display the approval request for human review."""
    if "approval_payload" in state and state["approval_payload"]:
        payload = state["approval_payload"]
        
        print("\n" + "="*60)
        print("🤖 HUMAN REVIEW REQUIRED")
        print("="*60)
        print(f"Task: {payload.task}")
        print(f"Stage: {payload.stage.value}")
        print(f"\nGenerated Code:")
        print("-" * 40)
        print(payload.code)
        print("-" * 40)
        print(f"\n{payload.suggestion}")
        print(f"Options: {', '.join(payload.options)}")
        print("="*60)
        
        return True
    return False

print("✅ HITL helper functions defined (with fixed state management)")

✅ HITL helper functions defined (with fixed state management)


## HITL Demo: Complete Workflow

This demo shows the complete HITL workflow:

1. **Start Session**: Generate code and wait for human review
2. **Human Decision**: Approve, edit, or reject the code  
3. **Execute**: Run approved code in E2B sandbox
4. **Save Artifacts**: Create project structure and tar archive
5. **Trace Analysis**: View spans in console and LangSmith

All with full OpenTelemetry tracing and tar loading capability.

In [106]:
# --- Demo: Start HITL Session ---

# Ensure we have a persistent sandbox for tar operations
if 'PERSIST_SBX' not in locals() or not PERSIST_SBX:
    PERSIST_SBX = new_persistent_sandbox()
    print(f"📦 Created new persistent sandbox: {PERSIST_SBX.sandbox_id}")

# Start the HITL workflow
USER_QUERY = (
    "Create a Python class called DataProcessor that can:"
    " 1) Load CSV data from a file,"
    " 2) Clean missing values,"
    " 3) Calculate basic statistics (mean, median, std),"
    " 4) Export results to JSON."
    " Include proper error handling and docstrings."
)

print("🚀 Starting HITL workflow...")
print(f"Query: {USER_QUERY}")

# This will generate code and interrupt for human review
thread_id, state = start_hitl_workflow(USER_QUERY)

print(f"\n✅ Session created: {thread_id}")
print(f"📊 Current stage: {state.get('stage', 'unknown')}")

# Display the approval request
if display_approval_request(state):
    print("\n⏳ Workflow interrupted - waiting for human decision")
else:
    print("❌ No approval request found - check state:", state.keys())

🚀 Starting HITL workflow...
Query: Create a Python class called DataProcessor that can: 1) Load CSV data from a file, 2) Clean missing values, 3) Calculate basic statistics (mean, median, std), 4) Export results to JSON. Include proper error handling and docstrings.


Already shutdown, dropping span.
Already shutdown, dropping span.
Already shutdown, dropping span.
Already shutdown, dropping span.


📝 Code generated (12710 chars)

✅ Session created: hitl_8ac098fc
📊 Current stage: HITLStage.CODE_REVIEW

🤖 HUMAN REVIEW REQUIRED
Task: Create a Python class called DataProcessor that can: 1) Load CSV data from a file, 2) Clean missing values, 3) Calculate basic statistics (mean, median, std), 4) Export results to JSON. Include proper error handling and docstrings.
Stage: code_review

Generated Code:
----------------------------------------
import csv
import json
import math
import os
import tempfile
from typing import Any, Dict, List, Optional


class DataProcessingError(Exception):
    """Custom exception for data processing errors."""
    pass


class DataProcessor:
    """
    A lightweight data processing utility to:
    1) Load CSV data from a file
    2) Clean missing values
    3) Calculate basic statistics (mean, median, std) for numeric columns
    4) Export results to JSON

    The implementation uses only the Python standard library and stores data
    in-memory as a list of

In [107]:
# --- Self-Contained HITL Workflow (Fixed Implementation) ---

print("🔧 TESTING SELF-CONTAINED HITL WORKFLOW")
print("="*70)

# Create a self-contained workflow that manages its own sandbox
class SelfContainedHITLWorkflow:
    def __init__(self):
        self.sandbox = None
        self.sandbox_id = None
        
    def ensure_sandbox(self):
        """Ensure we have an active sandbox, create one if needed."""
        if not self.sandbox:
            self.sandbox = new_persistent_sandbox()
            self.sandbox_id = self.sandbox.sandbox_id
            print(f"📦 Created new sandbox: {self.sandbox_id}")
        else:
            # Test if sandbox is still alive
            try:
                test_result = self.sandbox.run_code('print("alive")')
                if not test_result:
                    raise Exception("Sandbox returned None")
            except Exception as e:
                print(f"⚠️  Sandbox {self.sandbox_id} is dead, creating new one: {e}")
                self.sandbox = new_persistent_sandbox()
                self.sandbox_id = self.sandbox.sandbox_id
                print(f"📦 Created replacement sandbox: {self.sandbox_id}")
        return self.sandbox
    
    def execute_code_with_artifacts(self, code: str, thread_id: str, user_query: str) -> tuple[dict, str]:
        """Execute code and save artifacts, returning (execution_result, tar_path)."""
        sandbox = self.ensure_sandbox()
        
        # Execute the code
        exec_result = sandbox.run_code(code)
        execution = summarize_execution(exec_result)
        
        # Save artifacts if execution succeeded
        tar_path = None
        if not failed(execution):
            # Write the generated code to a file
            write_code = f'''
import os
project_dir = "/home/user/hitl_project_{thread_id}"
os.makedirs(project_dir, exist_ok=True)

# Write the main code
with open(os.path.join(project_dir, "main.py"), "w") as f:
    f.write({repr(code)})

# Write a README
with open(os.path.join(project_dir, "README.md"), "w") as f:
    f.write(f"# HITL Generated Project\\n\\nTask: {repr(user_query)}\\n\\nGenerated on: {{__import__('datetime').datetime.now()}}\\n")

print(f"Project saved to {{project_dir}}")
'''
            result = sandbox.run_code(write_code)
            output = extract_output_from_execution(result)
            print(f"📁 Project structure created: {output}")
            
            # Create tar archive
            tar_path = download_all_as_tar(
                sandbox, 
                remote_root=f"/home/user/hitl_project_{thread_id}",
                local_tar_path=f"artifacts/hitl_project_{thread_id}.tar.gz"
            )
            print(f"📦 Artifacts saved to: {tar_path}")
        
        return execution, tar_path
    
    def cleanup(self):
        """Clean up the sandbox."""
        if self.sandbox:
            try:
                self.sandbox.kill()
                print(f"🗑️  Cleaned up sandbox: {self.sandbox_id}")
            except:
                pass
            self.sandbox = None

# Create workflow instance
workflow = SelfContainedHITLWorkflow()

try:
    # Clear any existing artifacts
    import glob
    try:
        old_tars = glob.glob("artifacts/hitl_project_*.tar.gz")
        for tar_file in old_tars:
            os.remove(tar_file)
            print(f"🧹 Removed old tar: {tar_file}")
    except Exception as e:
        print(f"⚠️  Could not clean old artifacts: {e}")

    # Test with a simple but complete task
    TEST_QUERY = (
        "Write a Python class called SimpleCalculator with methods for "
        "add, subtract, multiply, divide. Include error handling for "
        "division by zero and a main section with example usage."
    )

    print(f"\n🚀 Starting self-contained HITL workflow...")
    print(f"Query: {TEST_QUERY}")

    # Step 1: Generate code
    print("\n📝 Step 1: Generating code...")
    generated_code = llm_generate_code(TEST_QUERY)
    print(f"Generated {len(generated_code)} characters of code")

    # Step 2: Execute and save artifacts
    print("\n🏃 Step 2: Executing code and saving artifacts...")
    thread_id = f"test_{uuid.uuid4().hex[:8]}"
    execution_result, tar_path = workflow.execute_code_with_artifacts(
        generated_code, thread_id, TEST_QUERY
    )

    # Step 3: Verify results
    print("\n" + "="*50)
    print("📊 SELF-CONTAINED WORKFLOW COMPLETED")
    print("="*50)
    
    success = not failed(execution_result)
    print(f"Status: {'✅ SUCCESS' if success else '❌ ERROR'}")
    print(f"Thread ID: {thread_id}")
    
    if success:
        stdout_lines = execution_result.get('stdout', [])
        if stdout_lines:
            print(f"✅ Execution output: {' '.join(stdout_lines)[:200]}...")
    else:
        print(f"❌ Execution failed: {execution_result.get('stderr', 'Unknown error')}")

    # Step 4: Verify tar artifacts
    print("\n📦 Checking for artifacts...")
    if tar_path and os.path.exists(tar_path):
        file_size = os.path.getsize(tar_path)
        print(f"🎯 ✅ SUCCESS: Artifacts saved: {tar_path} ({file_size:,} bytes)")
        
        # Verify tar contents
        import tarfile
        try:
            with tarfile.open(tar_path, 'r:gz') as tar:
                files = tar.getnames()
                print(f"📁 Tar contains {len(files)} files: {files}")
                
                # Extract and verify content
                extract_dir = Path(f"artifacts/test_extracted_{thread_id}")
                extract_dir.mkdir(parents=True, exist_ok=True)
                tar.extractall(extract_dir)
                
                # Check main.py exists and has content
                main_py = extract_dir / "main.py"
                if main_py.exists():
                    content = main_py.read_text()
                    print(f"📄 main.py exists ({len(content)} chars)")
                    print(f"🔍 Preview: {content[:100]}...")
                    
                    # Verify the content matches what we generated
                    if generated_code.strip() in content:
                        print("✅ Generated code correctly saved in tar archive")
                    else:
                        print("⚠️  Generated code doesn't match saved content")
                else:
                    print("❌ main.py not found in extracted files")
                    
        except Exception as e:
            print(f"❌ Error reading tar: {e}")
    else:
        print("❌ No tar artifacts found - workflow unsuccessful")

finally:
    # Always cleanup
    workflow.cleanup()

print("\n" + "="*70)
print("🎯 SELF-CONTAINED WORKFLOW TEST COMPLETED")
print("="*70)

🔧 TESTING SELF-CONTAINED HITL WORKFLOW
🧹 Removed old tar: artifacts/hitl_project_test_92fed8ad.tar.gz

🚀 Starting self-contained HITL workflow...
Query: Write a Python class called SimpleCalculator with methods for add, subtract, multiply, divide. Include error handling for division by zero and a main section with example usage.

📝 Step 1: Generating code...
Generated 1335 characters of code

🏃 Step 2: Executing code and saving artifacts...
{'sandboxId': 'ifr8nbwmzi72q31s4hy8w', 'timeout_s': 600}
📦 Created new sandbox: ifr8nbwmzi72q31s4hy8w
📁 Project structure created: Project saved to /home/user/hitl_project_test_c41f21f1

Tar creation result: Tar created successfully: 0

📦 Artifacts saved to: artifacts/hitl_project_test_c41f21f1.tar.gz

📊 SELF-CONTAINED WORKFLOW COMPLETED
Status: ✅ SUCCESS
Thread ID: test_c41f21f1
✅ Execution output: Example operations:
3 + 5 = 8
10 - 4 = 6
7 * 6 = 42
8 / 2 = 4.0
Caught error during division: Cannot divide by zero.
All simple tests passed.
...

📦 Che

  tar.extractall(extract_dir)


🗑️  Cleaned up sandbox: ifr8nbwmzi72q31s4hy8w

🎯 SELF-CONTAINED WORKFLOW TEST COMPLETED


In [108]:
# --- Demo: Human Decision - APPROVE ---

# Simulate human decision to approve the code
decision = DecisionPayload(
    decision=HITLDecision.APPROVE
)

print("👍 Human Decision: APPROVE")
print("🔄 Resuming workflow with approval...")

# Resume the workflow
final_state = resume_hitl_workflow(thread_id, decision)

print("\n" + "="*50)
print("📊 WORKFLOW COMPLETED")
print("="*50)
print(f"Status: {'✅ SUCCESS' if not final_state.get('error_message') else '❌ ERROR'}")
print(f"Thread ID: {final_state.get('thread_id')}")
print(f"Final Stage: {final_state.get('stage')}")

if final_state.get('error_message'):
    print(f"Error: {final_state['error_message']}")
else:
    print(f"Final Result: {final_state.get('final_result', 'No output')[:200]}...")
    
print(f"\nExecution Summary:")
exec_result = final_state.get('execution_result', {})
print(f"  - STDOUT lines: {len(exec_result.get('stdout', []))}")
print(f"  - STDERR lines: {len(exec_result.get('stderr', []))}")
print(f"  - Success: {not failed(exec_result)}")

print("\n📦 Checking for artifacts...")
# Look for tar files created
import glob
tar_files = glob.glob(f"artifacts/hitl_project_{thread_id}*.tar.gz")
if tar_files:
    print(f"🎯 Artifacts saved: {tar_files[0]}")
    
    # Verify tar contents
    import tarfile
    try:
        with tarfile.open(tar_files[0], 'r:gz') as tar:
            files = tar.getnames()
            print(f"📁 Tar contains {len(files)} files: {files[:5]}...")
    except Exception as e:
        print(f"❌ Error reading tar: {e}")
else:
    print("❌ No tar artifacts found")

👍 Human Decision: APPROVE
🔄 Resuming workflow with approval...
⚠️  No checkpoint found, creating minimal state


Already shutdown, dropping span.
Already shutdown, dropping span.
Already shutdown, dropping span.
Already shutdown, dropping span.


📝 Code generated (5354 chars)

📊 WORKFLOW COMPLETED
Status: ✅ SUCCESS
Thread ID: test_c41f21f1
Final Stage: HITLStage.CODE_REVIEW
Final Result: ...

Execution Summary:
  - STDOUT lines: 0
  - STDERR lines: 0
  - Success: True

📦 Checking for artifacts...
🎯 Artifacts saved: artifacts/hitl_project_test_c41f21f1.tar.gz
📁 Tar contains 3 files: ['.', './main.py', './README.md']...


In [109]:
# --- Demo: Alternative - Human Decision EDIT ---

# Start another session to demo the EDIT workflow
EDIT_QUERY = "Write a simple calculator function that adds two numbers"
print(f"🚀 Starting second HITL session for EDIT demo...")
print(f"Query: {EDIT_QUERY}")

thread_id2, state2 = start_hitl_workflow(EDIT_QUERY)
print(f"\n✅ Session created: {thread_id2}")

if display_approval_request(state2):
    # Simulate human editing the code
    original_code = state2['approval_payload'].code
    
    # Create improved version
    improved_code = '''
def calculator(a: float, b: float, operation: str = "add") -> float:
    """Enhanced calculator with multiple operations and type hints.
    
    Args:
        a: First number
        b: Second number  
        operation: Operation to perform (add, subtract, multiply, divide)
        
    Returns:
        Result of the calculation
        
    Raises:
        ValueError: If operation is not supported
        ZeroDivisionError: If dividing by zero
    """
    if operation == "add":
        return a + b
    elif operation == "subtract":
        return a - b
    elif operation == "multiply":
        return a * b
    elif operation == "divide":
        if b == 0:
            raise ZeroDivisionError("Cannot divide by zero")
        return a / b
    else:
        raise ValueError(f"Unsupported operation: {operation}")

if __name__ == "__main__":
    # Test all operations
    print(f"Addition: {calculator(5, 3)}")
    print(f"Subtraction: {calculator(5, 3, 'subtract')}")
    print(f"Multiplication: {calculator(5, 3, 'multiply')}")
    print(f"Division: {calculator(6, 3, 'divide')}")
    
    # Test error handling
    try:
        calculator(5, 0, 'divide')
    except ZeroDivisionError as e:
        print(f"Error caught: {e}")
'''
    
    print("\n✏️  Human Decision: EDIT with improved code")
    
    edit_decision = DecisionPayload(
        decision=HITLDecision.EDIT,
        code=improved_code
    )
    
    # Resume with edited code
    final_state2 = resume_hitl_workflow(thread_id2, edit_decision)
    
    print("\n" + "="*50)
    print("📊 EDIT WORKFLOW COMPLETED")
    print("="*50)
    print(f"Status: {'✅ SUCCESS' if not final_state2.get('error_message') else '❌ ERROR'}")
    
    if not final_state2.get('error_message'):
        print("✅ Enhanced calculator executed successfully!")
        print(f"Output preview: {final_state2.get('final_result', '')[:300]}...")
    else:
        print(f"❌ Error: {final_state2.get('error_message')}")
else:
    print("❌ No approval request for second session")

🚀 Starting second HITL session for EDIT demo...
Query: Write a simple calculator function that adds two numbers


Already shutdown, dropping span.
Already shutdown, dropping span.
Already shutdown, dropping span.
Already shutdown, dropping span.


📝 Code generated (768 chars)

✅ Session created: hitl_308518e5

🤖 HUMAN REVIEW REQUIRED
Task: Write a simple calculator function that adds two numbers
Stage: code_review

Generated Code:
----------------------------------------
import numbers

def add(a, b):
    """
    Return the sum of a and b.
    Accepts numeric inputs (int, float, etc.).
    """
    if not isinstance(a, numbers.Real) or not isinstance(b, numbers.Real):
        raise TypeError("add() expects numeric inputs (int or float).")
    return a + b

if __name__ == "__main__":
    # Basic internal tests using assertions
    assert add(0, 0) == 0
    assert add(-2, 2) == 0
    assert add(1.5, 2.5) == 4.0
    assert add(-3, -4) == -7

    # Demonstrate a few additions and print results
    test_pairs = [
        (2, 3),
        (0, 0),
        (-4, 5),
        (1.2, 3.4),
        (10, -7.5),
    ]

    for a, b in test_pairs:
        result = add(a, b)
        print(f"{a} + {b} = {result}")

    print("All tests passed.")
---

Already shutdown, dropping span.
Already shutdown, dropping span.
Already shutdown, dropping span.
Already shutdown, dropping span.


📝 Code generated (1164 chars)

📊 EDIT WORKFLOW COMPLETED
Status: ✅ SUCCESS
✅ Enhanced calculator executed successfully!
Output preview: ...


In [110]:
# --- Demo: Load and Extract Tar Archives ---

print("📦 TAR ARCHIVE MANAGEMENT DEMO")
print("="*50)

# List all tar files created by HITL sessions
import glob
import os
from pathlib import Path

tar_pattern = "artifacts/hitl_project_*.tar.gz"
tar_files = glob.glob(tar_pattern)

print(f"🔍 Found {len(tar_files)} HITL tar archives:")
for tar_file in tar_files:
    file_size = os.path.getsize(tar_file)
    print(f"  📁 {tar_file} ({file_size:,} bytes)")

if tar_files:
    # Extract and examine the first tar file
    selected_tar = tar_files[0]
    print(f"\n📂 Extracting: {selected_tar}")
    
    extract_dir = Path(f"artifacts/extracted_{Path(selected_tar).stem}")
    extract_dir.mkdir(parents=True, exist_ok=True)
    
    import tarfile
    with tarfile.open(selected_tar, 'r:gz') as tar:
        tar.extractall(extract_dir)
        
    print(f"✅ Extracted to: {extract_dir}")
    
    # List extracted contents
    print("\n📋 Extracted files:")
    for root, dirs, files in os.walk(extract_dir):
        level = root.replace(str(extract_dir), '').count(os.sep)
        indent = '  ' * level
        print(f"{indent}{os.path.basename(root)}/")
        subindent = '  ' * (level + 1)
        for file in files:
            file_path = os.path.join(root, file)
            file_size = os.path.getsize(file_path)
            print(f"{subindent}{file} ({file_size} bytes)")
    
    # Read and display main.py if it exists
    main_py = extract_dir / "main.py"
    if main_py.exists():
        print(f"\n📄 Content of {main_py.name}:")
        print("-" * 40)
        print(main_py.read_text()[:500] + "..." if len(main_py.read_text()) > 500 else main_py.read_text())
        print("-" * 40)
    
    # Read README if it exists
    readme_md = extract_dir / "README.md"
    if readme_md.exists():
        print(f"\n📖 Content of {readme_md.name}:")
        print("-" * 40)
        print(readme_md.read_text())
        print("-" * 40)
        
else:
    print("❌ No tar archives found - run the HITL demo first")

print("\n✅ Tar archive demo completed")

📦 TAR ARCHIVE MANAGEMENT DEMO
🔍 Found 1 HITL tar archives:
  📁 artifacts/hitl_project_test_c41f21f1.tar.gz (850 bytes)

📂 Extracting: artifacts/hitl_project_test_c41f21f1.tar.gz
✅ Extracted to: artifacts/extracted_hitl_project_test_c41f21f1.tar

📋 Extracted files:
extracted_hitl_project_test_c41f21f1.tar/
  main.py (1335 bytes)
  README.md (254 bytes)

📄 Content of main.py:
----------------------------------------
class SimpleCalculator:
    """
    A simple calculator with basic arithmetic operations.
    """

    def add(self, a, b):
        """Return the sum of a and b."""
        return a + b

    def subtract(self, a, b):
        """Return the difference of a and b (a - b)."""
        return a - b

    def multiply(self, a, b):
        """Return the product of a and b."""
        return a * b

    def divide(self, a, b):
        """Return the division of a by b.

        Raises:
            ZeroDivis...
----------------------------------------

📖 Content of README.md:
------------

  tar.extractall(extract_dir)


In [111]:
# --- Trace Analysis & LangSmith Integration ---

print("📊 OPENTELEMETRY TRACE ANALYSIS")
print("="*50)

# Force flush all pending spans
try:
    # Get all active span processors and force them to flush
    for processor in tracer_provider._active_span_processor._span_processors:
        if hasattr(processor, 'force_flush'):
            processor.force_flush(timeout_millis=5000)
except Exception as e:
    print(f"⚠️  Warning: Could not flush spans: {e}")

print("✅ Spans flushed to exporters")

# Check LangSmith integration
if langsmith_client and langsmith_api_key:
    print(f"\n🔗 LangSmith Integration Status:")
    print(f"  - API Key: {'✅ Set' if langsmith_api_key else '❌ Missing'}")
    print(f"  - Project: {langsmith_project}")
    print(f"  - OTLP Endpoint: https://api.smith.langchain.com/otel")
    
    # Try to get recent runs
    try:
        # This would require additional LangSmith API calls
        print(f"\n📈 To view traces:")
        print(f"  1. Open https://smith.langchain.com/")
        print(f"  2. Navigate to project: {langsmith_project}")
        print(f"  3. Look for traces with service.name='hitl-coding-agent'")
        print(f"  4. Filter by recent time range")
        
        # Show what traces should contain
        print(f"\n🔍 Expected trace structure:")
        print(f"  📊 Root span: start_hitl_workflow")
        print(f"    ├── 📝 generate_code_node (with gen_ai.* attributes)")
        print(f"    ├── 🔄 resume_hitl_workflow")
        print(f"    │   ├── 🏃 execute_code_node (with sandbox.* attributes)")
        print(f"    │   └── 📦 save_artifacts_node")
        
    except Exception as e:
        print(f"⚠️  Could not query LangSmith: {e}")
else:
    print("\n⚠️  LangSmith not configured - traces only in console")

# Display local trace summary
print(f"\n📋 Local Trace Summary:")
print(f"  - Service: hitl-coding-agent")
print(f"  - Environment: notebook")
print(f"  - Console traces: ✅ Enabled")
print(f"  - OTLP export: {'✅ Enabled' if langsmith_api_key else '❌ Disabled'}")

print("\n🎯 Key Attributes Traced:")
attributes = [
    "gen_ai.system = 'openai'",
    "gen_ai.operation.name = 'responses'", 
    "gen_ai.request.model = 'gpt-5-nano'",
    "gen_ai.prompt = <truncated>",
    "gen_ai.completion = <truncated>",
    "hitl.stage = 'code_review'",
    "hitl.decision = 'approve'",
    "sandbox.success = true/false",
    "latency_ms = <execution_time>",
    "node = <node_name>"
]

for attr in attributes:
    print(f"  📌 {attr}")

print("\n✅ Trace analysis complete")

📊 OPENTELEMETRY TRACE ANALYSIS
✅ Spans flushed to exporters

🔗 LangSmith Integration Status:
  - API Key: ✅ Set
  - Project: pr-majestic-codling-98
  - OTLP Endpoint: https://api.smith.langchain.com/otel

📈 To view traces:
  1. Open https://smith.langchain.com/
  2. Navigate to project: pr-majestic-codling-98
  3. Look for traces with service.name='hitl-coding-agent'
  4. Filter by recent time range

🔍 Expected trace structure:
  📊 Root span: start_hitl_workflow
    ├── 📝 generate_code_node (with gen_ai.* attributes)
    ├── 🔄 resume_hitl_workflow
    │   ├── 🏃 execute_code_node (with sandbox.* attributes)
    │   └── 📦 save_artifacts_node

📋 Local Trace Summary:
  - Service: hitl-coding-agent
  - Environment: notebook
  - Console traces: ✅ Enabled
  - OTLP export: ✅ Enabled

🎯 Key Attributes Traced:
  📌 gen_ai.system = 'openai'
  📌 gen_ai.operation.name = 'responses'
  📌 gen_ai.request.model = 'gpt-5-nano'
  📌 gen_ai.prompt = <truncated>
  📌 gen_ai.completion = <truncated>
  📌 hitl.st

In [112]:
# --- Final HITL Workflow Test ---

print("🧪 HITL WORKFLOW VALIDATION TEST")
print("="*60)

# Test checklist
checklist = {
    "✅ OpenTelemetry configured": tracer_provider is not None,
    "✅ LangSmith integration": langsmith_client is not None,
    "✅ HITL graph compiled": hitl_graph is not None,
    "✅ E2B sandbox active": 'PERSIST_SBX' in locals() and PERSIST_SBX is not None,
    "✅ Checkpointer enabled": checkpointer is not None,
    "✅ Artifacts directory": os.path.exists("artifacts"),
}

print("📋 Pre-flight checklist:")
for check, status in checklist.items():
    print(f"  {check if status else check.replace('✅', '❌')}")

all_ready = all(checklist.values())
print(f"\n🎯 System Status: {'🟢 READY' if all_ready else '🔴 NOT READY'}")

if all_ready:
    print("\n✅ HITL workflow is fully functional!")
    print("\n📖 Usage Summary:")
    print("  1. start_hitl_workflow(query) → returns thread_id, state")
    print("  2. display_approval_request(state) → shows human review UI")
    print("  3. resume_hitl_workflow(thread_id, decision) → continues workflow")
    print("  4. Artifacts saved as tar.gz in artifacts/ directory")
    print("  5. Full OpenTelemetry traces with GenAI semantic conventions")
    
    if langsmith_api_key:
        print(f"  6. Traces visible in LangSmith project: {langsmith_project}")
    
    print("\n🔗 Workflow Features:")
    features = [
        "Human-in-the-loop code review",
        "OpenAI Responses API integration", 
        "E2B sandboxed execution",
        "LangGraph state management",
        "OpenTelemetry + LangSmith tracing",
        "Tar archive artifact management",
        "Memory checkpointing for resumable sessions",
        "GenAI semantic conventions",
        "Error handling and retries"
    ]
    
    for feature in features:
        print(f"    🎯 {feature}")
        
else:
    print("\n❌ Some components are missing - check configuration")
    
print("\n" + "="*60)
print("🎉 HITL CODING AGENT WITH OPENTELEMETRY + LANGSMITH")
print("   Successfully integrated into E2B notebook!")
print("="*60)

# Clean up traces
print("\n🧹 Finalizing traces...")
try:
    for processor in tracer_provider._active_span_processor._span_processors:
        if hasattr(processor, 'shutdown'):
            processor.shutdown()
except Exception as e:
    print(f"⚠️  Trace cleanup warning: {e}")
    
print("✅ HITL implementation complete!")

🧪 HITL WORKFLOW VALIDATION TEST
📋 Pre-flight checklist:
  ✅ OpenTelemetry configured
  ✅ LangSmith integration
  ✅ HITL graph compiled
  ✅ E2B sandbox active
  ✅ Checkpointer enabled
  ✅ Artifacts directory

🎯 System Status: 🟢 READY

✅ HITL workflow is fully functional!

📖 Usage Summary:
  1. start_hitl_workflow(query) → returns thread_id, state
  2. display_approval_request(state) → shows human review UI
  3. resume_hitl_workflow(thread_id, decision) → continues workflow
  4. Artifacts saved as tar.gz in artifacts/ directory
  5. Full OpenTelemetry traces with GenAI semantic conventions
  6. Traces visible in LangSmith project: pr-majestic-codling-98

🔗 Workflow Features:
    🎯 Human-in-the-loop code review
    🎯 OpenAI Responses API integration
    🎯 E2B sandboxed execution
    🎯 LangGraph state management
    🎯 OpenTelemetry + LangSmith tracing
    🎯 Tar archive artifact management
    🎯 Memory checkpointing for resumable sessions
    🎯 GenAI semantic conventions
    🎯 Error handlin